linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
@ 2018-06-03  5:22 Dan Williams
  2018-06-03  5:22 ` [PATCH v2 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
                   ` (11 more replies)
  0 siblings, 12 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-edac, Tony Luck, Borislav Petkov, Jérôme Glisse,
	Jan Kara, H. Peter Anvin, x86, Thomas Gleixner,
	Christoph Hellwig, Ross Zwisler, Matthew Wilcox, Ingo Molnar,
	Michal Hocko, Naoya Horiguchi, Souptick Joarder, linux-mm,
	linux-fsdevel, jack

Changes since v1 [1]:
* Rework the locking to avoid lock_page(); instead use a combination of
  rcu_read_lock(), xa_lock_irq(&mapping->pages), and igrab() to validate
  that dax pages are still associated with the given mapping, and to
  prevent the address_space from being freed while memory_failure() is
  busy. (Jan)

* Fix use of MF_COUNT_INCREASED in madvise_inject_error() to account for
  the case where the injected error is a dax mapping and the pinned
  reference needs to be dropped. (Naoya)

* Clarify with a comment that VM_FAULT_NOPAGE may not always indicate a
  mapping of the storage capacity; it could also indicate the zero page.
  (Jan)

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-May/015932.html

---

As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.

In order to support reliable reverse mapping of user space addresses:

1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.

2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.

3/ Given that pmem errors can be repaired, change the speculative-access
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.

---

Dan Williams (11):
      device-dax: Convert to vmf_insert_mixed and vm_fault_t
      device-dax: Cleanup vm_fault de-reference chains
      device-dax: Enable page_mapping()
      device-dax: Set page->index
      filesystem-dax: Set page->index
      mm, madvise_inject_error: Let memory_failure() optionally take a page reference
      x86, memory_failure: Introduce {set,clear}_mce_nospec()
      mm, memory_failure: Pass page size to kill_proc()
      mm, memory_failure: Fix page->mapping assumptions relative to the page lock
      mm, memory_failure: Teach memory_failure() about dev_pagemap pages
      libnvdimm, pmem: Restore page attributes when clearing errors


 arch/x86/include/asm/set_memory.h         |   29 ++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 --
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 -----
 drivers/dax/device.c                      |   97 ++++++++-----
 drivers/nvdimm/pmem.c                     |   26 ++++
 drivers/nvdimm/pmem.h                     |   13 ++
 fs/dax.c                                  |   16 ++
 include/linux/huge_mm.h                   |    5 -
 include/linux/mm.h                        |    1 
 include/linux/set_memory.h                |   14 ++
 mm/huge_memory.c                          |    4 -
 mm/madvise.c                              |   18 ++
 mm/memory-failure.c                       |  209 ++++++++++++++++++++++++++---
 13 files changed, 366 insertions(+), 119 deletions(-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
@ 2018-06-03  5:22 ` Dan Williams
  2018-06-03  5:22 ` [PATCH v2 02/11] device-dax: Cleanup vm_fault de-reference chains Dan Williams
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Souptick Joarder, Matthew Wilcox, Ross Zwisler, hch, linux-mm,
	linux-fsdevel, jack

Use the new return type vm_fault_t for the fault and huge_fault
handlers. For now, this is just documenting that the function returns a
VM_FAULT value rather than an errno. Once all instances are converted,
vm_fault_t will become a distinct type. See commit 1c8f422059ae ("mm:
change return type to vm_fault_t").

Previously vm_insert_mixed() returned an error code which the driver
mapped into a VM_FAULT_* value. The new vmf_insert_mixed() removes this
translation by returning a VM_FAULT_* value directly.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c    |   26 +++++++++++---------------
 include/linux/huge_mm.h |    5 +++--
 mm/huge_memory.c        |    4 ++--
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index aff2c1594220..d44d98c54d0f 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -244,11 +244,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
-	int rc = VM_FAULT_SIGBUS;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
@@ -274,17 +274,11 @@ static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
-
-	if (rc == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (rc < 0 && rc != -EBUSY)
-		return VM_FAULT_SIGBUS;
-
-	return VM_FAULT_NOPAGE;
+	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
 
-static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -334,7 +328,8 @@ static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -384,13 +379,14 @@ static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	return VM_FAULT_FALLBACK;
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-static int dev_dax_huge_fault(struct vm_fault *vmf,
+static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
@@ -420,7 +416,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
 	return rc;
 }
 
-static int dev_dax_fault(struct vm_fault *vmf)
+static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
 {
 	return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
 }
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a126259bc4..d3bbf6bea9e9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
 #define _LINUX_HUGE_MM_H
 
 #include <linux/sched/coredump.h>
+#include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
@@ -46,9 +47,9 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a3a1815f8e11..6af976472a5d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -755,7 +755,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
@@ -815,7 +815,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;


* [PATCH v2 02/11] device-dax: Cleanup vm_fault de-reference chains
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-06-03  5:22 ` [PATCH v2 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
@ 2018-06-03  5:22 ` Dan Williams
  2018-06-03  5:22 ` [PATCH v2 03/11] device-dax: Enable page_mapping() Dan Williams
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:22 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

Define a local 'vma' variable rather than repeatedly dereferencing the
passed-in 'struct vm_fault *' instance.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index d44d98c54d0f..686de08e120b 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -247,13 +247,14 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -274,13 +275,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	return vmf_insert_mixed(vma, vmf->address, pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
@@ -288,7 +290,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -310,11 +312,10 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
-	if (pmd_addr < vmf->vma->vm_start ||
-			(pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
+	if (pmd_addr < vma->vm_start || (pmd_addr + PMD_SIZE) > vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	pgoff = linear_page_index(vmf->vma, pmd_addr);
+	pgoff = linear_page_index(vma, pmd_addr);
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, PMD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -323,7 +324,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
@@ -332,6 +333,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
@@ -340,7 +342,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	unsigned int fault_size = PUD_SIZE;
 
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -362,11 +364,10 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
-	if (pud_addr < vmf->vma->vm_start ||
-			(pud_addr + PUD_SIZE) > vmf->vma->vm_end)
+	if (pud_addr < vma->vm_start || (pud_addr + PUD_SIZE) > vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	pgoff = linear_page_index(vmf->vma, pud_addr);
+	pgoff = linear_page_index(vma, pud_addr);
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, PUD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -375,7 +376,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -390,12 +391,13 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
-	struct file *filp = vmf->vma->vm_file;
+	struct vm_area_struct *vma = vmf->vma;
+	struct file *filp = vma->vm_file;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
 			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
-			vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
+			vma->vm_start, vma->vm_end, pe_size);
 
 	id = dax_read_lock();
 	switch (pe_size) {


* [PATCH v2 03/11] device-dax: Enable page_mapping()
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-06-03  5:22 ` [PATCH v2 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
  2018-06-03  5:22 ` [PATCH v2 02/11] device-dax: Cleanup vm_fault de-reference chains Dan Williams
@ 2018-06-03  5:22 ` Dan Williams
  2018-06-03  5:23 ` [PATCH v2 04/11] device-dax: Set page->index Dan Williams
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:22 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for device-dax
mappings, set the ->mapping association of pages backing device-dax
mappings. The rmap implementation requires page_mapping() to return the
address_space hosting the vmas that map the page.

The ->mapping pointer is never cleared. There is no possibility for the
page to become associated with another address_space while the device is
enabled. When the device is disabled, the 'struct page' array for the
device is destroyed and later reinitialized to zero.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   53 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 686de08e120b..f7e926f4ec12 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -245,13 +245,12 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
-	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
 	if (check_vma(dev_dax, vma, __func__))
@@ -273,13 +272,13 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vma, vmf->address, pfn);
+	return vmf_insert_mixed(vma, vmf->address, *pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
@@ -287,7 +286,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
 	if (check_vma(dev_dax, vma, __func__))
@@ -322,15 +320,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
@@ -338,7 +336,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
 
@@ -374,9 +371,9 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -390,9 +387,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
-	int rc, id;
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *filp = vma->vm_file;
+	unsigned long fault_size;
+	int rc, id;
+	pfn_t pfn;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
@@ -402,17 +401,39 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	id = dax_read_lock();
 	switch (pe_size) {
 	case PE_SIZE_PTE:
-		rc = __dev_dax_pte_fault(dev_dax, vmf);
+		fault_size = PAGE_SIZE;
+		rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PMD:
-		rc = __dev_dax_pmd_fault(dev_dax, vmf);
+		fault_size = PMD_SIZE;
+		rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PUD:
-		rc = __dev_dax_pud_fault(dev_dax, vmf);
+		fault_size = PUD_SIZE;
+		rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
 		break;
 	default:
 		rc = VM_FAULT_SIGBUS;
 	}
+
+	if (rc == VM_FAULT_NOPAGE) {
+		unsigned long i;
+
+		/*
+		 * In the device-dax case the only possibility for a
+		 * VM_FAULT_NOPAGE result is when device-dax capacity is
+		 * mapped. No need to consider the zero page, or racing
+		 * conflicting mappings.
+		 */
+		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+			struct page *page;
+
+			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+			if (page->mapping)
+				continue;
+			page->mapping = filp->f_mapping;
+		}
+	}
 	dax_read_unlock(id);
 
 	return rc;


* [PATCH v2 04/11] device-dax: Set page->index
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (2 preceding siblings ...)
  2018-06-03  5:22 ` [PATCH v2 03/11] device-dax: Enable page_mapping() Dan Williams
@ 2018-06-03  5:23 ` Dan Williams
  2018-06-03  5:23 ` [PATCH v2 05/11] filesystem-dax: " Dan Williams
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:23 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for device-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree.

The ->index value is never cleared. There is no possibility for the
page to become associated with another pgoff while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed and ->index is reinitialized to zero.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index f7e926f4ec12..499299e36ee2 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -418,6 +418,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 
 	if (rc == VM_FAULT_NOPAGE) {
 		unsigned long i;
+		pgoff_t pgoff;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -425,6 +426,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
+		pgoff = linear_page_index(vma, vmf->address
+				& ~(fault_size - 1));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
@@ -432,6 +435,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 			if (page->mapping)
 				continue;
 			page->mapping = filp->f_mapping;
+			page->index = pgoff + i;
 		}
 	}
 	dax_read_unlock(id);


* [PATCH v2 05/11] filesystem-dax: Set page->index
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (3 preceding siblings ...)
  2018-06-03  5:23 ` [PATCH v2 04/11] device-dax: Set page->index Dan Williams
@ 2018-06-03  5:23 ` Dan Williams
  2018-06-03  5:23 ` [PATCH v2 06/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:23 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index aaec72ded1b6..cccf6cad1a7a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,18 +319,27 @@ static unsigned long dax_radix_end_pfn(void *entry)
 	for (pfn = dax_radix_pfn(entry); \
 			pfn < dax_radix_end_pfn(entry); pfn++)
 
-static void dax_associate_entry(void *entry, struct address_space *mapping)
+/*
+ * TODO: for reflink+dax we need a way to associate a single page with
+ * multiple address_space instances at different linear_page_index()
+ * offsets.
+ */
+static void dax_associate_entry(void *entry, struct address_space *mapping,
+		struct vm_area_struct *vma, unsigned long address)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn, index;
+	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
+	index = linear_page_index(vma, address & ~(size - 1));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(page->mapping);
 		page->mapping = mapping;
+		page->index = index + i++;
 	}
 }
 
@@ -348,6 +357,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
+		page->index = 0;
 	}
 }
 
@@ -604,7 +614,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	new_entry = dax_radix_locked_entry(pfn, flags);
 	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping);
+		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {


* [PATCH v2 06/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (4 preceding siblings ...)
  2018-06-03  5:23 ` [PATCH v2 05/11] filesystem-dax: " Dan Williams
@ 2018-06-03  5:23 ` Dan Williams
  2018-06-03  5:23 ` [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:23 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Naoya Horiguchi, hch, linux-mm, linux-fsdevel, jack

The madvise_inject_error() routine uses get_user_pages_fast() to look up
the pfn and other information for the injected error, but it does not
release that pin. The assumption is that failed pages should be taken out of
circulation.

However, for dax mappings it is not possible to take pages out of
circulation since they are 1:1 physically mapped as filesystem blocks,
or device-dax capacity. They also typically represent persistent memory
which has an error clearing capability.

In preparation for adding a special handler for dax mappings, shift the
responsibility of taking the page reference to memory_failure(). I.e.
drop the page reference and do not specify MF_COUNT_INCREASED to
memory_failure().

Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/madvise.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..b731933dddae 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
 
 
 	for (; start < end; start += PAGE_SIZE << order) {
+		unsigned long pfn;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		pfn = page_to_pfn(page);
 
 		/*
 		 * When soft offlining hugepages, after migrating the page
@@ -651,17 +653,27 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+					pfn, start);
 
 			ret = soft_offline_page(page, MF_COUNT_INCREASED);
 			if (ret)
 				return ret;
 			continue;
 		}
+
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+				pfn, start);
+
+		ret = memory_failure(pfn, 0);
+
+		/*
+		 * Drop the page reference taken by get_user_pages_fast(). In
+		 * the absence of MF_COUNT_INCREASED the memory_failure()
+		 * routine is responsible for pinning the page to prevent it
+		 * from being released back to the page allocator.
+		 */
+		put_page(page);
 
-		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
 		if (ret)
 			return ret;
 	}


* [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (5 preceding siblings ...)
  2018-06-03  5:23 ` [PATCH v2 06/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
@ 2018-06-03  5:23 ` Dan Williams
  2018-06-04 17:08   ` Luck, Tony
  2018-06-03  5:23 ` [PATCH v2 08/11] mm, memory_failure: Pass page size to kill_proc() Dan Williams
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:23 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck,
	Borislav Petkov, linux-edac, x86, hch, linux-mm, linux-fsdevel,
	jack

Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax-mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.

Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_kpfn() to set_mce_nospec(), remapping the page
as UC to hide it from speculative accesses.

Given that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/set_memory.h         |   29 ++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 -----------
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 ++---------------------------
 include/linux/set_memory.h                |   14 +++++++++++
 4 files changed, 46 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index bd090367236c..debc1fee1457 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -88,4 +88,33 @@ extern int kernel_set_to_readonly;
 void set_kernel_text_rw(void);
 void set_kernel_text_ro(void);
 
+#ifdef CONFIG_X86_64
+/*
+ * Mark the linear address as UC to disable speculative pre-fetches into
+ * potentially poisoned memory.
+ */
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	int rc;
+
+	rc = set_memory_uc((unsigned long) __va(PFN_PHYS(pfn)), 1);
+	if (rc)
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+	return rc;
+}
+#define set_mce_nospec set_mce_nospec
+
+/* Restore full speculative operation to the pfn. */
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return set_memory_wb((unsigned long) __va(PFN_PHYS(pfn)), 1);
+}
+#define clear_mce_nospec clear_mce_nospec
+#else
+/*
+ * Few people would run a 32-bit kernel on a machine that supports
+ * recoverable errors because they have too much memory to boot 32-bit.
+ */
+#endif
+
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 374d1aa66952..ceb67cd5918f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -113,21 +113,6 @@ static inline void mce_register_injector_chain(struct notifier_block *nb)	{ }
 static inline void mce_unregister_injector_chain(struct notifier_block *nb)	{ }
 #endif
 
-#ifndef CONFIG_X86_64
-/*
- * On 32-bit systems it would be difficult to safely unmap a poison page
- * from the kernel 1:1 map because there are no non-canonical addresses that
- * we can use to refer to the address without risking a speculative access.
- * However, this isn't much of an issue because:
- * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
- *    are only mapped into the kernel as needed
- * 2) Few people would run a 32-bit kernel on a machine that supports
- *    recoverable errors because they have too much memory to boot 32-bit.
- */
-static inline void mce_unmap_kpfn(unsigned long pfn) {}
-#define mce_unmap_kpfn mce_unmap_kpfn
-#endif
-
 struct mca_config {
 	bool dont_log_ce;
 	bool cmci_disabled;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 42cf2880d0ed..a0fbf0a8b7e6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -42,6 +42,7 @@
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/jump_label.h>
+#include <linux/set_memory.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -50,7 +51,6 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
-#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
@@ -108,10 +108,6 @@ static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn);
-#endif
-
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -602,7 +598,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 	if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
 		pfn = mce->addr >> PAGE_SHIFT;
 		if (!memory_failure(pfn, 0))
-			mce_unmap_kpfn(pfn);
+			set_mce_nospec(pfn);
 	}
 
 	return NOTIFY_OK;
@@ -1070,38 +1066,10 @@ static int do_memory_failure(struct mce *m)
 	if (ret)
 		pr_err("Memory error not recovered");
 	else
-		mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
+		set_mce_nospec(m->addr >> PAGE_SHIFT);
 	return ret;
 }
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn)
-{
-	unsigned long decoy_addr;
-
-	/*
-	 * Unmap this page from the kernel 1:1 mappings to make sure
-	 * we don't log more errors because of speculative access to
-	 * the page.
-	 * We would like to just call:
-	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
-	 * but doing that would radically increase the odds of a
-	 * speculative access to the poison page because we'd have
-	 * the virtual address of the kernel 1:1 mapping sitting
-	 * around in registers.
-	 * Instead we get tricky.  We create a non-canonical address
-	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_np() not checking whether we passed
-	 * a legal address.
-	 */
-
-	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
-	if (set_memory_np(decoy_addr, 1))
-		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-}
-#endif
-
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index da5178216da5..2a986d282a97 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -17,6 +17,20 @@ static inline int set_memory_x(unsigned long addr,  int numpages) { return 0; }
 static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
+#ifndef set_mce_nospec
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
+#ifndef clear_mce_nospec
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
 #ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
 static inline int set_memory_encrypted(unsigned long addr, int numpages)
 {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 08/11] mm, memory_failure: Pass page size to kill_proc()
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (6 preceding siblings ...)
  2018-06-03  5:23 ` [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
@ 2018-06-03  5:23 ` Dan Williams
  2018-06-03  5:23 ` [PATCH v2 09/11] mm, memory_failure: Fix page->mapping assumptions relative to the page lock Dan Williams
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:23 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Naoya Horiguchi, hch, linux-mm, linux-fsdevel, jack

Given that ZONE_DEVICE / dev_pagemap pages are never assembled into
compound pages, the size determination logic in kill_proc() needs
updating for the dev_pagemap case. In preparation for dev_pagemap
support, rework memory_failure() and kill_proc() to pass / consume the
page size explicitly.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory-failure.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9d142b9b86dc..42a193ee14d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -179,18 +179,16 @@ EXPORT_SYMBOL_GPL(hwpoison_filter);
  * ``action required'' if error happened in current execution context
  */
 static int kill_proc(struct task_struct *t, unsigned long addr,
-			unsigned long pfn, struct page *page, int flags)
+			unsigned long pfn, unsigned size_shift, int flags)
 {
-	short addr_lsb;
 	int ret;
 
 	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
 		pfn, t->comm, t->pid);
-	addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
 
 	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
 		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
-				       addr_lsb, current);
+				       size_shift, current);
 	} else {
 		/*
 		 * Don't use force here, it's convenient if the signal
@@ -199,7 +197,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
 		 * to SIG_IGN, but hopefully no one will do that?
 		 */
 		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
-				      addr_lsb, t);  /* synchronous? */
+				      size_shift, t);  /* synchronous? */
 	}
 	if (ret < 0)
 		pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
@@ -318,7 +316,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
  * wrong earlier.
  */
 static void kill_procs(struct list_head *to_kill, int forcekill,
-			  bool fail, struct page *page, unsigned long pfn,
+			  bool fail, unsigned size_shift, unsigned long pfn,
 			  int flags)
 {
 	struct to_kill *tk, *next;
@@ -343,7 +341,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 * process anyways.
 			 */
 			else if (kill_proc(tk->tsk, tk->addr,
-					      pfn, page, flags) < 0)
+					      pfn, size_shift, flags) < 0)
 				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
 				       pfn, tk->tsk->comm, tk->tsk->pid);
 		}
@@ -928,6 +926,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	bool unmap_success;
+	unsigned size_shift;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
@@ -1012,7 +1011,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * any accesses to the poisoned memory.
 	 */
 	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
-	kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
+	size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
+	kill_procs(&tokill, forcekill, !unmap_success, size_shift, pfn, flags);
 
 	return unmap_success;
 }


* [PATCH v2 09/11] mm, memory_failure: Fix page->mapping assumptions relative to the page lock
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (7 preceding siblings ...)
  2018-06-03  5:23 ` [PATCH v2 08/11] mm, memory_failure: Pass page size to kill_proc() Dan Williams
@ 2018-06-03  5:23 ` Dan Williams
  2018-06-03  5:23 ` [PATCH v2 10/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:23 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

The current memory_failure() implementation assumes that lock_page() is
sufficient for stabilizing page->mapping and that ->mapping->host will
not be freed. The dax implementation, on the other hand, relies on
xa_lock_irq() for stabilizing the page->mapping relationship and it is
not possible to hold the lock over current routines in the
memory_failure() path that run under lock_page().

Teach the various memory_failure() helpers to pin the address_space and
revalidate page->mapping under xa_lock_irq(mapping->i_pages).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory-failure.c |   56 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 42a193ee14d3..b6efb78ba49b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -179,12 +179,20 @@ EXPORT_SYMBOL_GPL(hwpoison_filter);
  * ``action required'' if error happened in current execution context
  */
 static int kill_proc(struct task_struct *t, unsigned long addr,
-			unsigned long pfn, unsigned size_shift, int flags)
+		struct address_space *mapping, struct page *page,
+		unsigned size_shift, int flags)
 {
-	int ret;
+	int ret = 0;
+
+	/* revalidate the page before killing the process */
+	xa_lock_irq(&mapping->i_pages);
+	if (page->mapping != mapping) {
+		xa_unlock_irq(&mapping->i_pages);
+		return 0;
+	}
 
 	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
-		pfn, t->comm, t->pid);
+			page_to_pfn(page), t->comm, t->pid);
 
 	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
 		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
@@ -199,6 +207,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
 		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
 				      size_shift, t);  /* synchronous? */
 	}
+	xa_unlock_irq(&mapping->i_pages);
 	if (ret < 0)
 		pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
 			t->comm, t->pid, ret);
@@ -316,8 +325,8 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
  * wrong earlier.
  */
 static void kill_procs(struct list_head *to_kill, int forcekill,
-			  bool fail, unsigned size_shift, unsigned long pfn,
-			  int flags)
+		bool fail, unsigned size_shift, struct address_space *mapping,
+		struct page *page, int flags)
 {
 	struct to_kill *tk, *next;
 
@@ -330,7 +339,8 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 */
 			if (fail || tk->addr_valid == 0) {
 				pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
-				       pfn, tk->tsk->comm, tk->tsk->pid);
+						page_to_pfn(page), tk->tsk->comm,
+						tk->tsk->pid);
 				force_sig(SIGKILL, tk->tsk);
 			}
 
@@ -341,9 +351,10 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 * process anyways.
 			 */
 			else if (kill_proc(tk->tsk, tk->addr,
-					      pfn, size_shift, flags) < 0)
+					      mapping, page, size_shift, flags) < 0)
 				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
-				       pfn, tk->tsk->comm, tk->tsk->pid);
+						page_to_pfn(page), tk->tsk->comm,
+						tk->tsk->pid);
 		}
 		put_task_struct(tk->tsk);
 		kfree(tk);
@@ -429,21 +440,27 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 /*
  * Collect processes when the error hit a file mapped page.
  */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
-			      struct to_kill **tkc, int force_early)
+static void collect_procs_file(struct address_space *mapping, struct page *page,
+		struct list_head *to_kill, struct to_kill **tkc,
+		int force_early)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
-	struct address_space *mapping = page->mapping;
 
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page_to_pgoff(page);
+		pgoff_t pgoff;
 		struct task_struct *t = task_early_kill(tsk, force_early);
 
 		if (!t)
 			continue;
+		xa_lock_irq(&mapping->i_pages);
+		if (page->mapping != mapping) {
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		}
+		pgoff = page_to_pgoff(page);
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
 				      pgoff) {
 			/*
@@ -456,6 +473,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 			if (vma->vm_mm == t->mm)
 				add_to_kill(t, page, vma, to_kill, tkc);
 		}
+		xa_unlock_irq(&mapping->i_pages);
 	}
 	read_unlock(&tasklist_lock);
 	i_mmap_unlock_read(mapping);
@@ -467,12 +485,12 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
  * First preallocate one tokill structure outside the spin locks,
  * so that we can kill at least one process reasonably reliable.
  */
-static void collect_procs(struct page *page, struct list_head *tokill,
-				int force_early)
+static void collect_procs(struct address_space *mapping, struct page *page,
+		struct list_head *tokill, int force_early)
 {
 	struct to_kill *tk;
 
-	if (!page->mapping)
+	if (!mapping)
 		return;
 
 	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
@@ -481,7 +499,7 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 	if (PageAnon(page))
 		collect_procs_anon(page, tokill, &tk, force_early);
 	else
-		collect_procs_file(page, tokill, &tk, force_early);
+		collect_procs_file(mapping, page, tokill, &tk, force_early);
 	kfree(tk);
 }
 
@@ -986,7 +1004,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * there's nothing that can be done.
 	 */
 	if (kill)
-		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
+		collect_procs(mapping, hpage, &tokill,
+				flags & MF_ACTION_REQUIRED);
 
 	unmap_success = try_to_unmap(hpage, ttu);
 	if (!unmap_success)
@@ -1012,7 +1031,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 */
 	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
 	size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
-	kill_procs(&tokill, forcekill, !unmap_success, size_shift, pfn, flags);
+	kill_procs(&tokill, forcekill, !unmap_success, size_shift, mapping,
+			hpage, flags);
 
 	return unmap_success;
 }


* [PATCH v2 10/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (8 preceding siblings ...)
  2018-06-03  5:23 ` [PATCH v2 09/11] mm, memory_failure: Fix page->mapping assumptions relative to the page lock Dan Williams
@ 2018-06-03  5:23 ` Dan Williams
  2018-06-03  5:23 ` [PATCH v2 11/11] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
  2018-06-04 12:40 ` [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Michal Hocko
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:23 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Jérôme Glisse,
	Matthew Wilcox, Naoya Horiguchi, Ross Zwisler, linux-mm,
	linux-fsdevel, jack

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered

In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical-address-to-file-offset associations. Also,
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internally remap/repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.

Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h  |    1 
 mm/memory-failure.c |  145 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 146 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06a4be6..566c972e03e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2669,6 +2669,7 @@ enum mf_action_page_type {
 	MF_MSG_TRUNCATED_LRU,
 	MF_MSG_BUDDY,
 	MF_MSG_BUDDY_2ND,
+	MF_MSG_DAX,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b6efb78ba49b..de0bc897d6e7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -55,6 +55,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm_inline.h>
+#include <linux/memremap.h>
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include "internal.h"
@@ -531,6 +532,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_TRUNCATED_LRU]		= "already truncated LRU page",
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
+	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1132,6 +1134,144 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	return res;
 }
 
+static unsigned long dax_mapping_size(struct address_space *mapping,
+		struct page *page)
+{
+	pgoff_t pgoff = page_to_pgoff(page);
+	struct vm_area_struct *vma;
+	unsigned long size = 0;
+
+	i_mmap_lock_read(mapping);
+	xa_lock_irq(&mapping->i_pages);
+	/* validate that @page is still linked to @mapping */
+	if (page->mapping != mapping) {
+		xa_unlock_irq(&mapping->i_pages);
+		i_mmap_unlock_read(mapping);
+		return 0;
+	}
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long address = vma_address(page, vma);
+		pgd_t *pgd;
+		p4d_t *p4d;
+		pud_t *pud;
+		pmd_t *pmd;
+		pte_t *pte;
+
+		pgd = pgd_offset(vma->vm_mm, address);
+		if (!pgd_present(*pgd))
+			continue;
+		p4d = p4d_offset(pgd, address);
+		if (!p4d_present(*p4d))
+			continue;
+		pud = pud_offset(p4d, address);
+		if (!pud_present(*pud))
+			continue;
+		if (pud_devmap(*pud)) {
+			size = PUD_SIZE;
+			break;
+		}
+		pmd = pmd_offset(pud, address);
+		if (!pmd_present(*pmd))
+			continue;
+		if (pmd_devmap(*pmd)) {
+			size = PMD_SIZE;
+			break;
+		}
+		pte = pte_offset_map(pmd, address);
+		if (!pte_present(*pte))
+			continue;
+		if (pte_devmap(*pte)) {
+			size = PAGE_SIZE;
+			break;
+		}
+	}
+	xa_unlock_irq(&mapping->i_pages);
+	i_mmap_unlock_read(mapping);
+
+	return size;
+}
+
+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	const bool unmap_success = true;
+	struct address_space *mapping;
+	unsigned long size;
+	LIST_HEAD(tokill);
+	int rc = -EBUSY;
+	loff_t start;
+
+	/*
+	 * Prevent the inode from being freed while we are interrogating
+	 * the address_space, typically this would be handled by
+	 * lock_page(), but dax pages do not use the page lock.
+	 */
+	rcu_read_lock();
+	mapping = page->mapping;
+	if (!mapping) {
+		rcu_read_unlock();
+		goto out;
+	}
+	if (!igrab(mapping->host)) {
+		mapping = NULL;
+		rcu_read_unlock();
+		goto out;
+	}
+	rcu_read_unlock();
+
+	if (hwpoison_filter(page)) {
+		rc = 0;
+		goto out;
+	}
+
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		goto out;
+	default:
+		break;
+	}
+
+	/*
+	 * If the page is not mapped in userspace then report it as
+	 * unhandled.
+	 */
+	size = dax_mapping_size(mapping, page);
+	if (!size) {
+		pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
+		goto out;
+	}
+
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(mapping, page, &tokill, flags & MF_ACTION_REQUIRED);
+
+	start = (page->index << PAGE_SHIFT) & ~(size - 1);
+	unmap_mapping_range(page->mapping, start, start + size, 0);
+
+	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),
+			mapping, page, flags);
+	rc = 0;
+out:
+	if (mapping)
+		iput(mapping->host);
+	put_dev_pagemap(pgmap);
+	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
+	return rc;
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1154,6 +1294,7 @@ int memory_failure(unsigned long pfn, int flags)
 	struct page *p;
 	struct page *hpage;
 	struct page *orig_head;
+	struct dev_pagemap *pgmap;
 	int res;
 	unsigned long page_flags;
 
@@ -1166,6 +1307,10 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	pgmap = get_dev_pagemap(pfn, NULL);
+	if (pgmap)
+		return memory_failure_dev_pagemap(pfn, flags, pgmap);
+
 	p = pfn_to_page(pfn);
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);


* [PATCH v2 11/11] libnvdimm, pmem: Restore page attributes when clearing errors
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (9 preceding siblings ...)
  2018-06-03  5:23 ` [PATCH v2 10/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-06-03  5:23 ` Dan Williams
  2018-06-04 12:40 ` [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Michal Hocko
  11 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-03  5:23 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
has also been marked UC in PAT (page attribute table) via
set_mce_nospec() to prevent speculative retrievals of poison.

The 'HWPoison' flag is only cleared when overwriting an entire page.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   26 ++++++++++++++++++++++++++
 drivers/nvdimm/pmem.h |   13 +++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..04ee1fdee219 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -20,6 +20,7 @@
 #include <linux/hdreg.h>
 #include <linux/init.h>
 #include <linux/platform_device.h>
+#include <linux/set_memory.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/badblocks.h>
@@ -51,6 +52,30 @@ static struct nd_region *to_region(struct pmem_device *pmem)
 	return to_nd_region(to_dev(pmem)->parent);
 }
 
+static void hwpoison_clear(struct pmem_device *pmem,
+		phys_addr_t phys, unsigned int len)
+{
+	unsigned long pfn_start, pfn_end, pfn;
+
+	/* only pmem in the linear map supports HWPoison */
+	if (is_vmalloc_addr(pmem->virt_addr))
+		return;
+
+	pfn_start = PHYS_PFN(phys);
+	pfn_end = pfn_start + PHYS_PFN(len);
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * Note, no need to hold a get_dev_pagemap() reference
+		 * here since we're in the driver I/O path and
+		 * outstanding I/O requests pin the dev_pagemap.
+		 */
+		if (test_and_clear_pmem_poison(page))
+			clear_mce_nospec(pfn);
+	}
+}
+
 static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 		phys_addr_t offset, unsigned int len)
 {
@@ -65,6 +90,7 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	if (cleared < len)
 		rc = BLK_STS_IOERR;
 	if (cleared > 0 && cleared / 512) {
+		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
 		cleared /= 512;
 		dev_dbg(dev, "%#llx clear %ld sector%s\n",
 				(unsigned long long) sector, cleared,
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index a64ebc78b5df..59cfe13ea8a8 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __NVDIMM_PMEM_H__
 #define __NVDIMM_PMEM_H__
+#include <linux/page-flags.h>
 #include <linux/badblocks.h>
 #include <linux/types.h>
 #include <linux/pfn_t.h>
@@ -27,4 +28,16 @@ struct pmem_device {
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn);
+
+#ifdef CONFIG_MEMORY_FAILURE
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return TestClearPageHWPoison(page);
+}
+#else
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return false;
+}
+#endif
 #endif /* __NVDIMM_PMEM_H__ */


* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (10 preceding siblings ...)
  2018-06-03  5:23 ` [PATCH v2 11/11] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
@ 2018-06-04 12:40 ` Michal Hocko
  2018-06-04 14:31   ` Dan Williams
  11 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2018-06-04 12:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, x86,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, linux-mm,
	linux-fsdevel

On Sat 02-06-18 22:22:43, Dan Williams wrote:
> Changes since v1 [1]:
> * Rework the locking to not use lock_page() instead use a combination of
>   rcu_read_lock(), xa_lock_irq(&mapping->pages), and igrab() to validate
>   that dax pages are still associated with the given mapping, and to
>   prevent the address_space from being freed while memory_failure() is
>   busy. (Jan)
> 
> * Fix use of MF_COUNT_INCREASED in madvise_inject_error() to account for
>   the case where the injected error is a dax mapping and the pinned
>   reference needs to be dropped. (Naoya)
> 
> * Clarify with a comment that VM_FAULT_NOPAGE may not always indicate a
>   mapping of the storage capacity, it could also indicate the zero page.
>   (Jan)
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-May/015932.html
> 
> ---
> 
> As it stands, memory_failure() gets thoroughly confused by dev_pagemap
> backed mappings. The recovery code has specific enabling for several
> possible page states and needs new enabling to handle poison in dax
> mappings.
> 
> In order to support reliable reverse mapping of user space addresses:
> 
> 1/ Add new locking in the memory_failure() rmap path to prevent races
> that would typically be handled by the page lock.
> 
> 2/ Since dev_pagemap pages are hidden from the page allocator and the
> "compound page" accounting machinery, add a mechanism to determine the
> size of the mapping that encompasses a given poisoned pfn.
> 
> 3/ Given pmem errors can be repaired, change the speculatively accessed
> poison protection, mce_unmap_kpfn(), to be reversible and otherwise
> allow ongoing access from the kernel.

This doesn't really describe the problem you are trying to solve and why
do you believe that HWPoison is the best way to handle it. As things
stand HWPoison is rather ad-hoc and I am not sure adding more to it is
really great without some deep reconsidering how the whole thing is done
right now IMHO. Are you actually trying to solve some real world problem
or you merely want to make soft offlining work properly?

> ---
> 
> Dan Williams (11):
>       device-dax: Convert to vmf_insert_mixed and vm_fault_t
>       device-dax: Cleanup vm_fault de-reference chains
>       device-dax: Enable page_mapping()
>       device-dax: Set page->index
>       filesystem-dax: Set page->index
>       mm, madvise_inject_error: Let memory_failure() optionally take a page reference
>       x86, memory_failure: Introduce {set,clear}_mce_nospec()
>       mm, memory_failure: Pass page size to kill_proc()
>       mm, memory_failure: Fix page->mapping assumptions relative to the page lock
>       mm, memory_failure: Teach memory_failure() about dev_pagemap pages
>       libnvdimm, pmem: Restore page attributes when clearing errors
> 
> 
>  arch/x86/include/asm/set_memory.h         |   29 ++++
>  arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 --
>  arch/x86/kernel/cpu/mcheck/mce.c          |   38 -----
>  drivers/dax/device.c                      |   97 ++++++++-----
>  drivers/nvdimm/pmem.c                     |   26 ++++
>  drivers/nvdimm/pmem.h                     |   13 ++
>  fs/dax.c                                  |   16 ++
>  include/linux/huge_mm.h                   |    5 -
>  include/linux/mm.h                        |    1 
>  include/linux/set_memory.h                |   14 ++
>  mm/huge_memory.c                          |    4 -
>  mm/madvise.c                              |   18 ++
>  mm/memory-failure.c                       |  209 ++++++++++++++++++++++++++---
>  13 files changed, 366 insertions(+), 119 deletions(-)

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-04 12:40 ` [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Michal Hocko
@ 2018-06-04 14:31   ` Dan Williams
  2018-06-05 14:11     ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-06-04 14:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Mon, Jun 4, 2018 at 5:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Sat 02-06-18 22:22:43, Dan Williams wrote:
>> Changes since v1 [1]:
>> * Rework the locking to not use lock_page() instead use a combination of
>>   rcu_read_lock(), xa_lock_irq(&mapping->pages), and igrab() to validate
>>   that dax pages are still associated with the given mapping, and to
>>   prevent the address_space from being freed while memory_failure() is
>>   busy. (Jan)
>>
>> * Fix use of MF_COUNT_INCREASED in madvise_inject_error() to account for
>>   the case where the injected error is a dax mapping and the pinned
>>   reference needs to be dropped. (Naoya)
>>
>> * Clarify with a comment that VM_FAULT_NOPAGE may not always indicate a
>>   mapping of the storage capacity, it could also indicate the zero page.
>>   (Jan)
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-May/015932.html
>>
>> ---
>>
>> As it stands, memory_failure() gets thoroughly confused by dev_pagemap
>> backed mappings. The recovery code has specific enabling for several
>> possible page states and needs new enabling to handle poison in dax
>> mappings.
>>
>> In order to support reliable reverse mapping of user space addresses:
>>
>> 1/ Add new locking in the memory_failure() rmap path to prevent races
>> that would typically be handled by the page lock.
>>
>> 2/ Since dev_pagemap pages are hidden from the page allocator and the
>> "compound page" accounting machinery, add a mechanism to determine the
>> size of the mapping that encompasses a given poisoned pfn.
>>
>> 3/ Given pmem errors can be repaired, change the speculatively accessed
>> poison protection, mce_unmap_kpfn(), to be reversible and otherwise
>> allow ongoing access from the kernel.
>
> This doesn't really describe the problem you are trying to solve and why
> do you believe that HWPoison is the best way to handle it. As things
> stand HWPoison is rather ad-hoc and I am not sure adding more to it is
> really great without some deep reconsidering how the whole thing is done
> right now IMHO. Are you actually trying to solve some real world problem
> or you merely want to make soft offlining work properly?

I'm trying to solve this real world problem when real poison is
consumed through a dax mapping:

        mce: Uncorrected hardware memory error in user-access at af34214200
        {1}[Hardware Error]: It has been corrected by h/w and requires
no further action
        mce: [Hardware Error]: Machine check events logged
        {1}[Hardware Error]: event severity: corrected
        Memory failure: 0xaf34214: reserved kernel page still
referenced by 1 users
        [..]
        Memory failure: 0xaf34214: recovery action for reserved kernel
page: Failed
        mce: Memory error not recovered

...i.e. currently all poison consumed through dax mappings is
needlessly system fatal.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-03  5:23 ` [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
@ 2018-06-04 17:08   ` Luck, Tony
  2018-06-04 17:39     ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Luck, Tony @ 2018-06-04 17:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, linux-edac, x86, hch, linux-mm, linux-fsdevel,
	jack

On Sat, Jun 02, 2018 at 10:23:20PM -0700, Dan Williams wrote:
> +static inline int set_mce_nospec(unsigned long pfn)
> +{
> +	int rc;
> +
> +	rc = set_memory_uc((unsigned long) __va(PFN_PHYS(pfn)), 1);

You should really do the decoy_addr thing here that I had in mce_unmap_kpfn().
Putting the virtual address of the page you mustn't accidentally prefetch
from into a register is a pretty good way to make sure that the processor
does do a prefetch.

-Tony

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-04 17:08   ` Luck, Tony
@ 2018-06-04 17:39     ` Dan Williams
  2018-06-04 18:08       ` Luck, Tony
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-06-04 17:39 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-nvdimm, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, linux-edac, X86 ML, Christoph Hellwig, Linux MM,
	linux-fsdevel, Jan Kara

On Mon, Jun 4, 2018 at 10:08 AM, Luck, Tony <tony.luck@intel.com> wrote:
> On Sat, Jun 02, 2018 at 10:23:20PM -0700, Dan Williams wrote:
>> +static inline int set_mce_nospec(unsigned long pfn)
>> +{
>> +     int rc;
>> +
>> +     rc = set_memory_uc((unsigned long) __va(PFN_PHYS(pfn)), 1);
>
> You should really do the decoy_addr thing here that I had in mce_unmap_kpfn().
> Putting the virtual address of the page you mustn't accidentally prefetch
> from into a register is a pretty good way to make sure that the processor
> does do a prefetch.

Maybe I'm misreading, but doesn't that make the page completely
inaccessible? We still want to read pmem through the driver and the
linear mapping with memcpy_mcsafe(). Alternatively I could just drop
this patch and setup a private / alias mapping for the pmem driver to
use. It seems aliased mappings would be the safer option, but I want
to make sure I've comprehended your suggestion correctly?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-04 17:39     ` Dan Williams
@ 2018-06-04 18:08       ` Luck, Tony
  2018-06-04 18:35         ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Luck, Tony @ 2018-06-04 18:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, linux-edac, X86 ML, Christoph Hellwig, Linux MM,
	linux-fsdevel, Jan Kara

On Mon, Jun 04, 2018 at 10:39:48AM -0700, Dan Williams wrote:
> On Mon, Jun 4, 2018 at 10:08 AM, Luck, Tony <tony.luck@intel.com> wrote:
> > On Sat, Jun 02, 2018 at 10:23:20PM -0700, Dan Williams wrote:
> >> +static inline int set_mce_nospec(unsigned long pfn)
> >> +{
> >> +     int rc;
> >> +
> >> +     rc = set_memory_uc((unsigned long) __va(PFN_PHYS(pfn)), 1);
> >
> > You should really do the decoy_addr thing here that I had in mce_unmap_kpfn().
> > Putting the virtual address of the page you mustn't accidentally prefetch
> > from into a register is a pretty good way to make sure that the processor
> > does do a prefetch.
> 
> Maybe I'm misreading, but doesn't that make the page completely
> inaccessible? We still want to read pmem through the driver and the
> linear mapping with memcpy_mcsafe(). Alternatively I could just drop
> this patch and setup a private / alias mapping for the pmem driver to
> use. It seems aliased mappings would be the safer option, but I want
> to make sure I've comprehended your suggestion correctly?

I'm OK with the call to set_memory_uc() to make this uncacheable
instead of set_memory_np() to make it inaccessible.

The problem is how to achieve that.

The result of __va(PFN_PHYS(pfn) is the virtual address where the poison
page is currently mapped into the kernel. That value gets put into
register %rdi to make the call to set_memory_uc() (which goes on to
call a bunch of other functions passing the virtual address along
the way).

Now imagine an impatient super-speculative processor is waiting for
some result to decide where to jump next, and picks a path that isn't
going to be taken ... out in the weeds somewhere it runs into:

	movzbl	(%rdi), %eax

Oops ... now you just read from the address you were trying to
avoid. So we log an error. Eventually the speculation gets sorted
out and the processor knows not to signal a machine check. But the
log is sitting in a machine check bank waiting to cause an overflow
if we try to log a second error.

The decoy_addr trick in mce_unmap_kpfn() throws in the high bit
to the address passed.  The set_memory_np() code (and I assume the
set_memory_uc()) code ignores it, but it means any stray speculative
access won't point at the poison page.

-Tony

Note: this is *mostly* a problem if the poison is in the first
cache line of the page.  But you could hit other lines if the
instruction you speculatively ran into had the right offset. E.g.
to hit the third line:

	movzbl	128(%rdi), %eax

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-04 18:08       ` Luck, Tony
@ 2018-06-04 18:35         ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-04 18:35 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-nvdimm, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, linux-edac, X86 ML, Christoph Hellwig, Linux MM,
	linux-fsdevel, Jan Kara

On Mon, Jun 4, 2018 at 11:08 AM, Luck, Tony <tony.luck@intel.com> wrote:
> On Mon, Jun 04, 2018 at 10:39:48AM -0700, Dan Williams wrote:
>> On Mon, Jun 4, 2018 at 10:08 AM, Luck, Tony <tony.luck@intel.com> wrote:
>> > On Sat, Jun 02, 2018 at 10:23:20PM -0700, Dan Williams wrote:
>> >> +static inline int set_mce_nospec(unsigned long pfn)
>> >> +{
>> >> +     int rc;
>> >> +
>> >> +     rc = set_memory_uc((unsigned long) __va(PFN_PHYS(pfn)), 1);
>> >
>> > You should really do the decoy_addr thing here that I had in mce_unmap_kpfn().
>> > Putting the virtual address of the page you mustn't accidentally prefetch
>> > from into a register is a pretty good way to make sure that the processor
>> > does do a prefetch.
>>
>> Maybe I'm misreading, but doesn't that make the page completely
>> inaccessible? We still want to read pmem through the driver and the
>> linear mapping with memcpy_mcsafe(). Alternatively I could just drop
>> this patch and setup a private / alias mapping for the pmem driver to
>> use. It seems aliased mappings would be the safer option, but I want
>> to make sure I've comprehended your suggestion correctly?
>
> I'm OK with the call to set_memory_uc() to make this uncacheable
> instead of set_memory_np() to make it inaccessible.
>
> The problem is how to achieve that.
>
> The result of __va(PFN_PHYS(pfn) is the virtual address where the poison
> page is currently mapped into the kernel. That value gets put into
> register %rdi to make the call to set_memory_uc() (which goes on to
> call a bunch of other functions passing the virtual address along
> the way).
>
> Now imagine an impatient super-speculative processor is waiting for
> some result to decide where to jump next, and picks a path that isn't
> going to be taken ... out in the weeds somewhere it runs into:
>
>         movzbl  (%rdi), %eax
>
> Oops ... now you just read from the address you were trying to
> avoid. So we log an error. Eventually the speculation gets sorted
> out and the processor knows not to signal a machine check. But the
> log is sitting in a machine check bank waiting to cause an overflow
> if we try to log a second error.
>
> The decoy_addr trick in mce_unmap_kpfn() throws in the high bit
> to the address passed.  The set_memory_np() code (and I assume the
> set_memory_uc()) code ignores it, but it means any stray speculative
> access won't point at the poison page.
>
> -Tony
>
> Note: this is *mostly* a problem if the poison is in the first
> cache line of the page.  But you could hit other lines if the
> instruction you speculatively ran into had the right offset. E.g.
> to hit the third line:
>
>         movzbl  128(%rdi), %eax

Ok, makes sense and I do see now that this decoy resolves to the same
physical address once PTE_PFN_MASK is applied when we start messing
with page tables.

However, set_memory_uc() is currently not prepared for this trick as
it specifies the unmasked physical address to reserve_memtype().

        ret = reserve_memtype(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
                              _PAGE_CACHE_MODE_UC_MINUS, NULL);

...compared to set_memory_np() which does not manipulate the memtype tracking.

I'll fix up reserve_memtype() and free_memtype() to be prepared for
decoy addresses.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-04 14:31   ` Dan Williams
@ 2018-06-05 14:11     ` Michal Hocko
  2018-06-05 14:33       ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2018-06-05 14:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Mon 04-06-18 07:31:25, Dan Williams wrote:
[...]
> I'm trying to solve this real world problem when real poison is
> consumed through a dax mapping:
> 
>         mce: Uncorrected hardware memory error in user-access at af34214200
>         {1}[Hardware Error]: It has been corrected by h/w and requires
> no further action
>         mce: [Hardware Error]: Machine check events logged
>         {1}[Hardware Error]: event severity: corrected
>         Memory failure: 0xaf34214: reserved kernel page still
> referenced by 1 users
>         [..]
>         Memory failure: 0xaf34214: recovery action for reserved kernel
> page: Failed
>         mce: Memory error not recovered
> 
> ...i.e. currently all poison consumed through dax mappings is
> needlessly system fatal.

Thanks. That should be a part of the changelog. It would be great to
describe why this cannot be simply handled by hwpoison code without any
ZONE_DEVICE specific hacks? The error is recoverable so why does
hwpoison code even care?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-05 14:11     ` Michal Hocko
@ 2018-06-05 14:33       ` Dan Williams
  2018-06-06  7:39         ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-06-05 14:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 04-06-18 07:31:25, Dan Williams wrote:
> [...]
>> I'm trying to solve this real world problem when real poison is
>> consumed through a dax mapping:
>>
>>         mce: Uncorrected hardware memory error in user-access at af34214200
>>         {1}[Hardware Error]: It has been corrected by h/w and requires
>> no further action
>>         mce: [Hardware Error]: Machine check events logged
>>         {1}[Hardware Error]: event severity: corrected
>>         Memory failure: 0xaf34214: reserved kernel page still
>> referenced by 1 users
>>         [..]
>>         Memory failure: 0xaf34214: recovery action for reserved kernel
>> page: Failed
>>         mce: Memory error not recovered
>>
>> ...i.e. currently all poison consumed through dax mappings is
>> needlessly system fatal.
>
> Thanks. That should be a part of the changelog.

...added for v3:
https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html

> It would be great to
> describe why this cannot be simply handled by hwpoison code without any
> ZONE_DEVICE specific hacks? The error is recoverable so why does
> hwpoison code even care?
>

Up until we started testing hardware poison recovery for persistent
memory I assumed that the kernel did not need any new enabling to get
basic support for recovering userspace consumed poison.

However, the recovery code has a dedicated path for many different
page states (see: action_page_types). Without any changes it
incorrectly assumes that a dax mapped page is a page cache page
undergoing dma, or some other pinned operation. It also assumes that
the page must be offlined which is not correct / possible for dax
mapped pages. There is a possibility to repair poison to dax mapped
persistent memory pages, and the pages can't otherwise be offlined
because they 1:1 correspond with a physical storage block, i.e.
offlining pmem would be equivalent to punching a hole in the physical
address space.

There's also the entanglement of device-dax which guarantees a given
mapping size (4K, 2M, 1G). This requires determining the size of the
mapping encompassing a given pfn to know how much to unmap. Since dax
mapped pfns don't come from the page allocator we need to read the
page size from the page tables, not compound_order(page).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-05 14:33       ` Dan Williams
@ 2018-06-06  7:39         ` Michal Hocko
  2018-06-06 13:44           ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2018-06-06  7:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Tue 05-06-18 07:33:17, Dan Williams wrote:
> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
> > [...]
> >> I'm trying to solve this real world problem when real poison is
> >> consumed through a dax mapping:
> >>
> >>         mce: Uncorrected hardware memory error in user-access at af34214200
> >>         {1}[Hardware Error]: It has been corrected by h/w and requires
> >> no further action
> >>         mce: [Hardware Error]: Machine check events logged
> >>         {1}[Hardware Error]: event severity: corrected
> >>         Memory failure: 0xaf34214: reserved kernel page still
> >> referenced by 1 users
> >>         [..]
> >>         Memory failure: 0xaf34214: recovery action for reserved kernel
> >> page: Failed
> >>         mce: Memory error not recovered
> >>
> >> ...i.e. currently all poison consumed through dax mappings is
> >> needlessly system fatal.
> >
> > Thanks. That should be a part of the changelog.
> 
> ...added for v3:
> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
> 
> > It would be great to
> > describe why this cannot be simply handled by hwpoison code without any
> > ZONE_DEVICE specific hacks? The error is recoverable so why does
> > hwpoison code even care?
> >
> 
> Up until we started testing hardware poison recovery for persistent
> memory I assumed that the kernel did not need any new enabling to get
> basic support for recovering userspace consumed poison.
> 
> However, the recovery code has a dedicated path for many different
> page states (see: action_page_types). Without any changes it
> incorrectly assumes that a dax mapped page is a page cache page
> undergoing dma, or some other pinned operation. It also assumes that
> the page must be offlined which is not correct / possible for dax
> mapped pages. There is a possibility to repair poison to dax mapped
> persistent memory pages, and the pages can't otherwise be offlined
> because they 1:1 correspond with a physical storage block, i.e.
> offlining pmem would be equivalent to punching a hole in the physical
> address space.
> 
> There's also the entanglement of device-dax which guarantees a given
> mapping size (4K, 2M, 1G). This requires determining the size of the
> mapping encompassing a given pfn to know how much to unmap. Since dax
> mapped pfns don't come from the page allocator we need to read the
> page size from the page tables, not compound_order(page).

OK, but my question is still. Do we really want to do more on top of the
existing code and add even more special casing or it is time to rethink
the whole hwpoison design?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-06  7:39         ` Michal Hocko
@ 2018-06-06 13:44           ` Dan Williams
  2018-06-07 14:37             ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-06-06 13:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 05-06-18 07:33:17, Dan Williams wrote:
>> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
>> > [...]
>> >> I'm trying to solve this real world problem when real poison is
>> >> consumed through a dax mapping:
>> >>
>> >>         mce: Uncorrected hardware memory error in user-access at af34214200
>> >>         {1}[Hardware Error]: It has been corrected by h/w and requires
>> >> no further action
>> >>         mce: [Hardware Error]: Machine check events logged
>> >>         {1}[Hardware Error]: event severity: corrected
>> >>         Memory failure: 0xaf34214: reserved kernel page still
>> >> referenced by 1 users
>> >>         [..]
>> >>         Memory failure: 0xaf34214: recovery action for reserved kernel
>> >> page: Failed
>> >>         mce: Memory error not recovered
>> >>
>> >> ...i.e. currently all poison consumed through dax mappings is
>> >> needlessly system fatal.
>> >
>> > Thanks. That should be a part of the changelog.
>>
>> ...added for v3:
>> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
>>
>> > It would be great to
>> > describe why this cannot be simply handled by hwpoison code without any
>> > ZONE_DEVICE specific hacks? The error is recoverable so why does
>> > hwpoison code even care?
>> >
>>
>> Up until we started testing hardware poison recovery for persistent
>> memory I assumed that the kernel did not need any new enabling to get
>> basic support for recovering userspace consumed poison.
>>
>> However, the recovery code has a dedicated path for many different
>> page states (see: action_page_types). Without any changes it
>> incorrectly assumes that a dax mapped page is a page cache page
>> undergoing dma, or some other pinned operation. It also assumes that
>> the page must be offlined which is not correct / possible for dax
>> mapped pages. There is a possibility to repair poison to dax mapped
>> persistent memory pages, and the pages can't otherwise be offlined
>> because they 1:1 correspond with a physical storage block, i.e.
>> offlining pmem would be equivalent to punching a hole in the physical
>> address space.
>>
>> There's also the entanglement of device-dax which guarantees a given
>> mapping size (4K, 2M, 1G). This requires determining the size of the
>> mapping encompassing a given pfn to know how much to unmap. Since dax
>> mapped pfns don't come from the page allocator we need to read the
>> page size from the page tables, not compound_order(page).
>
> OK, but my question is still. Do we really want to do more on top of the
> existing code and add even more special casing or it is time to rethink
> the whole hwpoison design?

Well, there's the immediate problem that the current implementation is
broken for dax and then the longer term problem that the current
design appears to be too literal with a lot of custom marshaling.

At least for dax in the long term we want to offer an alternative
error handling model and get the filesystem much more involved. That
filesystem redesign work has been waiting for the reverse-block-map
effort to settle in xfs. However, that's more custom work for dax and
not a redesign that helps the core-mm more generically.

I think the unmap and SIGBUS portion of poison handling is relatively
straightforward. It's the handling of the page HWPoison page flag that
seems a bit ad hoc. The current implementation certainly was not
prepared for the concept that memory can be repaired. set_mce_nospec()
is a step in the direction of generic memory error handling.

Thoughts on other pain points in the design that are on your mind, Michal?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-06 13:44           ` Dan Williams
@ 2018-06-07 14:37             ` Michal Hocko
  2018-06-07 16:52               ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2018-06-07 14:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Wed 06-06-18 06:44:45, Dan Williams wrote:
> On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 05-06-18 07:33:17, Dan Williams wrote:
> >> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
> >> > [...]
> >> >> I'm trying to solve this real world problem when real poison is
> >> >> consumed through a dax mapping:
> >> >>
> >> >>         mce: Uncorrected hardware memory error in user-access at af34214200
> >> >>         {1}[Hardware Error]: It has been corrected by h/w and requires
> >> >> no further action
> >> >>         mce: [Hardware Error]: Machine check events logged
> >> >>         {1}[Hardware Error]: event severity: corrected
> >> >>         Memory failure: 0xaf34214: reserved kernel page still
> >> >> referenced by 1 users
> >> >>         [..]
> >> >>         Memory failure: 0xaf34214: recovery action for reserved kernel
> >> >> page: Failed
> >> >>         mce: Memory error not recovered
> >> >>
> >> >> ...i.e. currently all poison consumed through dax mappings is
> >> >> needlessly system fatal.
> >> >
> >> > Thanks. That should be a part of the changelog.
> >>
> >> ...added for v3:
> >> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
> >>
> >> > It would be great to
> >> > describe why this cannot be simply handled by hwpoison code without any
> >> > ZONE_DEVICE specific hacks? The error is recoverable so why does
> >> > hwpoison code even care?
> >> >
> >>
> >> Up until we started testing hardware poison recovery for persistent
> >> memory I assumed that the kernel did not need any new enabling to get
> >> basic support for recovering userspace consumed poison.
> >>
> >> However, the recovery code has a dedicated path for many different
> >> page states (see: action_page_types). Without any changes it
> >> incorrectly assumes that a dax mapped page is a page cache page
> >> undergoing dma, or some other pinned operation. It also assumes that
> >> the page must be offlined which is not correct / possible for dax
> >> mapped pages. There is a possibility to repair poison to dax mapped
> >> persistent memory pages, and the pages can't otherwise be offlined
> >> because they 1:1 correspond with a physical storage block, i.e.
> >> offlining pmem would be equivalent to punching a hole in the physical
> >> address space.
> >>
> >> There's also the entanglement of device-dax which guarantees a given
> >> mapping size (4K, 2M, 1G). This requires determining the size of the
> >> mapping encompassing a given pfn to know how much to unmap. Since dax
> >> mapped pfns don't come from the page allocator we need to read the
> >> page size from the page tables, not compound_order(page).
> >
> > OK, but my question is still. Do we really want to do more on top of the
> > existing code and add even more special casing or it is time to rethink
> > the whole hwpoison design?
> 
> Well, there's the immediate problem that the current implementation is
> broken for dax and then the longer term problem that the current
> design appears to be too literal with a lot of custom marshaling.
> 
> At least for dax in the long term we want to offer an alternative
> error handling model and get the filesystem much more involved. That
> filesystem redesign work has been waiting for the reverse-block-map
> effort to settle in xfs. However, that's more custom work for dax and
> not a redesign that helps the core-mm more generically.
> 
> I think the unmap and SIGBUS portion of poison handling is relatively
> straightforward. It's the handling of the page HWPoison page flag that
> seems a bit ad hoc. The current implementation certainly was not
> prepared for the concept that memory can be repaired. set_mce_nospec()
> is a step in the direction of generic memory error handling.

Agreed! Moreover random checks for HWPoison pages is just a maintenance
hell.

> Thoughts on other pain points in the design that are on your mind, Michal?

we have discussed those at LSFMM this year https://lwn.net/Articles/753261/
The main problem is that there is besically no design description so the
whole feature is very easy to break. Yours is another good example of
that. We need to get back to the drawing board and think about how to
make this more robust.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-07 14:37             ` Michal Hocko
@ 2018-06-07 16:52               ` Dan Williams
  2018-06-11  7:50                 ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-06-07 16:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Thu, Jun 7, 2018 at 7:37 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 06-06-18 06:44:45, Dan Williams wrote:
>> On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Tue 05-06-18 07:33:17, Dan Williams wrote:
>> >> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
>> >> > [...]
>> >> >> I'm trying to solve this real world problem when real poison is
>> >> >> consumed through a dax mapping:
>> >> >>
>> >> >>         mce: Uncorrected hardware memory error in user-access at af34214200
>> >> >>         {1}[Hardware Error]: It has been corrected by h/w and requires
>> >> >> no further action
>> >> >>         mce: [Hardware Error]: Machine check events logged
>> >> >>         {1}[Hardware Error]: event severity: corrected
>> >> >>         Memory failure: 0xaf34214: reserved kernel page still
>> >> >> referenced by 1 users
>> >> >>         [..]
>> >> >>         Memory failure: 0xaf34214: recovery action for reserved kernel
>> >> >> page: Failed
>> >> >>         mce: Memory error not recovered
>> >> >>
>> >> >> ...i.e. currently all poison consumed through dax mappings is
>> >> >> needlessly system fatal.
>> >> >
>> >> > Thanks. That should be a part of the changelog.
>> >>
>> >> ...added for v3:
>> >> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
>> >>
>> >> > It would be great to
>> >> > describe why this cannot be simply handled by hwpoison code without any
>> >> > ZONE_DEVICE specific hacks? The error is recoverable so why does
>> >> > hwpoison code even care?
>> >> >
>> >>
>> >> Up until we started testing hardware poison recovery for persistent
>> >> memory I assumed that the kernel did not need any new enabling to get
>> >> basic support for recovering userspace consumed poison.
>> >>
>> >> However, the recovery code has a dedicated path for many different
>> >> page states (see: action_page_types). Without any changes it
>> >> incorrectly assumes that a dax mapped page is a page cache page
>> >> undergoing dma, or some other pinned operation. It also assumes that
>> >> the page must be offlined which is not correct / possible for dax
>> >> mapped pages. There is a possibility to repair poison to dax mapped
>> >> persistent memory pages, and the pages can't otherwise be offlined
>> >> because they 1:1 correspond with a physical storage block, i.e.
>> >> offlining pmem would be equivalent to punching a hole in the physical
>> >> address space.
>> >>
>> >> There's also the entanglement of device-dax which guarantees a given
>> >> mapping size (4K, 2M, 1G). This requires determining the size of the
>> >> mapping encompassing a given pfn to know how much to unmap. Since dax
>> >> mapped pfns don't come from the page allocator we need to read the
>> >> page size from the page tables, not compound_order(page).
>> >
>> > OK, but my question is still. Do we really want to do more on top of the
>> > existing code and add even more special casing or it is time to rethink
>> > the whole hwpoison design?
>>
>> Well, there's the immediate problem that the current implementation is
>> broken for dax and then the longer term problem that the current
>> design appears to be too literal with a lot of custom marshaling.
>>
>> At least for dax in the long term we want to offer an alternative
>> error handling model and get the filesystem much more involved. That
>> filesystem redesign work has been waiting for the reverse-block-map
>> effort to settle in xfs. However, that's more custom work for dax and
>> not a redesign that helps the core-mm more generically.
>>
>> I think the unmap and SIGBUS portion of poison handling is relatively
>> straightforward. It's the handling of the page HWPoison page flag that
>> seems a bit ad hoc. The current implementation certainly was not
>> prepared for the concept that memory can be repaired. set_mce_nospec()
>> is a step in the direction of generic memory error handling.
>
> Agreed! Moreover, random checks for HWPoison pages are just a maintenance
> hell.
>
>> Thoughts on other pain points in the design that are on your mind, Michal?
>
> We have discussed those at LSFMM this year: https://lwn.net/Articles/753261/
> The main problem is that there is basically no design description, so the
> whole feature is very easy to break. Yours is another good example of
> that. We need to get back to the drawing board and think about how to
> make this more robust.

I saw that article, but to be honest I did not glean any direct
suggestions that read on these current patches. I'm interested in
discussing a redesign, but I'm not interested in leaving poison
unhandled for DAX while we figure it out. Developers are actively
testing media error handling for persistent memory applications, and
expect the current SIGBUS + BUS_MCEERR_AR kernel ABI to work for
memory errors in userspace mappings.

The article mentioned that PUD poison does not work. That may be the
case for THP, but it does work for DAX. Here's the simple test [1];
yes, more test-driven design of poison handling, as you lamented.

[1]: https://patchwork.kernel.org/patch/10451131/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-07 16:52               ` Dan Williams
@ 2018-06-11  7:50                 ` Michal Hocko
  2018-06-11 14:44                   ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2018-06-11  7:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Thu 07-06-18 09:52:22, Dan Williams wrote:
> On Thu, Jun 7, 2018 at 7:37 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 06-06-18 06:44:45, Dan Williams wrote:
> >> On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >> > On Tue 05-06-18 07:33:17, Dan Williams wrote:
> >> >> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >> >> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
> >> >> > [...]
> >> >> >> I'm trying to solve this real world problem when real poison is
> >> >> >> consumed through a dax mapping:
> >> >> >>
> >> >> >>         mce: Uncorrected hardware memory error in user-access at af34214200
> >> >> >>         {1}[Hardware Error]: It has been corrected by h/w and requires
> >> >> >> no further action
> >> >> >>         mce: [Hardware Error]: Machine check events logged
> >> >> >>         {1}[Hardware Error]: event severity: corrected
> >> >> >>         Memory failure: 0xaf34214: reserved kernel page still
> >> >> >> referenced by 1 users
> >> >> >>         [..]
> >> >> >>         Memory failure: 0xaf34214: recovery action for reserved kernel
> >> >> >> page: Failed
> >> >> >>         mce: Memory error not recovered
> >> >> >>
> >> >> >> ...i.e. currently all poison consumed through dax mappings is
> >> >> >> needlessly system fatal.
> >> >> >
> >> >> > Thanks. That should be a part of the changelog.
> >> >>
> >> >> ...added for v3:
> >> >> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
> >> >>
> >> >> > It would be great to
> >> >> > describe why this cannot be simply handled by hwpoison code without any
> >> >> > ZONE_DEVICE specific hacks? The error is recoverable so why does
> >> >> > hwpoison code even care?
> >> >> >
> >> >>
> >> >> Up until we started testing hardware poison recovery for persistent
> >> >> memory I assumed that the kernel did not need any new enabling to get
> >> >> basic support for recovering userspace consumed poison.
> >> >>
> >> >> However, the recovery code has a dedicated path for many different
> >> >> page states (see: action_page_types). Without any changes it
> >> >> incorrectly assumes that a dax mapped page is a page cache page
> >> >> undergoing dma, or some other pinned operation. It also assumes that
> >> >> the page must be offlined which is not correct / possible for dax
> >> >> mapped pages. There is a possibility to repair poison to dax mapped
> >> >> persistent memory pages, and the pages can't otherwise be offlined
> >> >> because they 1:1 correspond with a physical storage block, i.e.
> >> >> offlining pmem would be equivalent to punching a hole in the physical
> >> >> address space.
> >> >>
> >> >> There's also the entanglement of device-dax which guarantees a given
> >> >> mapping size (4K, 2M, 1G). This requires determining the size of the
> >> >> mapping encompassing a given pfn to know how much to unmap. Since dax
> >> >> mapped pfns don't come from the page allocator we need to read the
> >> >> page size from the page tables, not compound_order(page).
> >> >
> >> > OK, but my question is still. Do we really want to do more on top of the
> >> > existing code and add even more special casing or it is time to rethink
> >> > the whole hwpoison design?
> >>
> >> Well, there's the immediate problem that the current implementation is
> >> broken for dax and then the longer term problem that the current
> >> design appears to be too literal with a lot of custom marshaling.
> >>
> >> At least for dax in the long term we want to offer an alternative
> >> error handling model and get the filesystem much more involved. That
> >> filesystem redesign work has been waiting for the reverse-block-map
> >> effort to settle in xfs. However, that's more custom work for dax and
> >> not a redesign that helps the core-mm more generically.
> >>
> >> I think the unmap and SIGBUS portion of poison handling is relatively
> >> straightforward. It's the handling of the page HWPoison page flag that
> >> seems a bit ad hoc. The current implementation certainly was not
> >> prepared for the concept that memory can be repaired. set_mce_nospec()
> >> is a step in the direction of generic memory error handling.
> >
> > Agreed! Moreover, random checks for HWPoison pages are just a maintenance
> > hell.
> >
> >> Thoughts on other pain points in the design that are on your mind, Michal?
> >
> > We have discussed those at LSFMM this year: https://lwn.net/Articles/753261/
> > The main problem is that there is basically no design description, so the
> > whole feature is very easy to break. Yours is another good example of
> > that. We need to get back to the drawing board and think about how to
> > make this more robust.
> 
> I saw that article, but to be honest I did not glean any direct
> suggestions that read on these current patches. I'm interested in
> discussing a redesign, but I'm not interested in leaving poison
> unhandled for DAX while we figure it out.

Sure, but that just keeps the status quo and grows a DAX special case like
we've done for hugetlb. We should not repeat that mistake. So we had
better start figuring out the design _now_ rather than build the house of
cards even higher.

> Developers are actively
> testing media error handling for persistent memory applications, and
> expect the current SIGBUS + BUS_MCEERR_AR kernel ABI to work for
> memory errors in userspace mappings.

I do understand you want your use case to be covered, but that is exactly
the reason why we have ended up in the current state. I am not going to
nack your patches on that ground, although I would be tempted to do
so, but I would really like to see some improvements here. I am sorry I
cannot do that myself right now, but somebody with use cases _should_;
otherwise we stay in the unfortunate whack-a-mole state forever.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-11  7:50                 ` Michal Hocko
@ 2018-06-11 14:44                   ` Dan Williams
  2018-06-11 14:56                     ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2018-06-11 14:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Mon, Jun 11, 2018 at 12:50 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 07-06-18 09:52:22, Dan Williams wrote:
>> On Thu, Jun 7, 2018 at 7:37 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Wed 06-06-18 06:44:45, Dan Williams wrote:
>> >> On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> > On Tue 05-06-18 07:33:17, Dan Williams wrote:
>> >> >> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> >> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
>> >> >> > [...]
>> >> >> >> I'm trying to solve this real world problem when real poison is
>> >> >> >> consumed through a dax mapping:
>> >> >> >>
>> >> >> >>         mce: Uncorrected hardware memory error in user-access at af34214200
>> >> >> >>         {1}[Hardware Error]: It has been corrected by h/w and requires
>> >> >> >> no further action
>> >> >> >>         mce: [Hardware Error]: Machine check events logged
>> >> >> >>         {1}[Hardware Error]: event severity: corrected
>> >> >> >>         Memory failure: 0xaf34214: reserved kernel page still
>> >> >> >> referenced by 1 users
>> >> >> >>         [..]
>> >> >> >>         Memory failure: 0xaf34214: recovery action for reserved kernel
>> >> >> >> page: Failed
>> >> >> >>         mce: Memory error not recovered
>> >> >> >>
>> >> >> >> ...i.e. currently all poison consumed through dax mappings is
>> >> >> >> needlessly system fatal.
>> >> >> >
>> >> >> > Thanks. That should be a part of the changelog.
>> >> >>
>> >> >> ...added for v3:
>> >> >> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
>> >> >>
>> >> >> > It would be great to
>> >> >> > describe why this cannot be simply handled by hwpoison code without any
>> >> >> > ZONE_DEVICE specific hacks? The error is recoverable so why does
>> >> >> > hwpoison code even care?
>> >> >> >
>> >> >>
>> >> >> Up until we started testing hardware poison recovery for persistent
>> >> >> memory I assumed that the kernel did not need any new enabling to get
>> >> >> basic support for recovering userspace consumed poison.
>> >> >>
>> >> >> However, the recovery code has a dedicated path for many different
>> >> >> page states (see: action_page_types). Without any changes it
>> >> >> incorrectly assumes that a dax mapped page is a page cache page
>> >> >> undergoing dma, or some other pinned operation. It also assumes that
>> >> >> the page must be offlined which is not correct / possible for dax
>> >> >> mapped pages. There is a possibility to repair poison to dax mapped
>> >> >> persistent memory pages, and the pages can't otherwise be offlined
>> >> >> because they 1:1 correspond with a physical storage block, i.e.
>> >> >> offlining pmem would be equivalent to punching a hole in the physical
>> >> >> address space.
>> >> >>
>> >> >> There's also the entanglement of device-dax which guarantees a given
>> >> >> mapping size (4K, 2M, 1G). This requires determining the size of the
>> >> >> mapping encompassing a given pfn to know how much to unmap. Since dax
>> >> >> mapped pfns don't come from the page allocator we need to read the
>> >> >> page size from the page tables, not compound_order(page).
>> >> >
>> >> > OK, but my question is still. Do we really want to do more on top of the
>> >> > existing code and add even more special casing or it is time to rethink
>> >> > the whole hwpoison design?
>> >>
>> >> Well, there's the immediate problem that the current implementation is
>> >> broken for dax and then the longer term problem that the current
>> >> design appears to be too literal with a lot of custom marshaling.
>> >>
>> >> At least for dax in the long term we want to offer an alternative
>> >> error handling model and get the filesystem much more involved. That
>> >> filesystem redesign work has been waiting for the reverse-block-map
>> >> effort to settle in xfs. However, that's more custom work for dax and
>> >> not a redesign that helps the core-mm more generically.
>> >>
>> >> I think the unmap and SIGBUS portion of poison handling is relatively
>> >> straightforward. It's the handling of the page HWPoison page flag that
>> >> seems a bit ad hoc. The current implementation certainly was not
>> >> prepared for the concept that memory can be repaired. set_mce_nospec()
>> >> is a step in the direction of generic memory error handling.
>> >
>> > Agreed! Moreover, random checks for HWPoison pages are just a maintenance
>> > hell.
>> >
>> >> Thoughts on other pain points in the design that are on your mind, Michal?
>> >
>> > We have discussed those at LSFMM this year: https://lwn.net/Articles/753261/
>> > The main problem is that there is basically no design description, so the
>> > whole feature is very easy to break. Yours is another good example of
>> > that. We need to get back to the drawing board and think about how to
>> > make this more robust.
>>
>> I saw that article, but to be honest I did not glean any direct
>> suggestions that read on these current patches. I'm interested in
>> discussing a redesign, but I'm not interested in leaving poison
>> unhandled for DAX while we figure it out.
>
> Sure, but that just keeps the status quo and grows a DAX special case like
> we've done for hugetlb. We should not repeat that mistake. So we had
> better start figuring out the design _now_ rather than build the house of
> cards even higher.
>
>> Developers are actively
>> testing media error handling for persistent memory applications, and
>> expect the current SIGBUS + BUS_MCEERR_AR kernel ABI to work for
>> memory errors in userspace mappings.
>
> I do understand you want your use case to be covered, but that is exactly
> the reason why we have ended up in the current state. I am not going to
> nack your patches on that ground, although I would be tempted to do
> so, but I would really like to see some improvements here. I am sorry I
> cannot do that myself right now, but somebody with use cases _should_;
> otherwise we stay in the unfortunate whack-a-mole state forever.

I'm still trying to understand the next level of detail on where you
think the design should go next? Is it just the HWPoison page flag?
Are you concerned about supporting greater than PAGE_SIZE poison?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-11 14:44                   ` Dan Williams
@ 2018-06-11 14:56                     ` Michal Hocko
  2018-06-11 15:19                       ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2018-06-11 14:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel

On Mon 11-06-18 07:44:39, Dan Williams wrote:
[...]
> I'm still trying to understand the next level of detail on where you
> think the design should go next? Is it just the HWPoison page flag?
> Are you concerned about supporting greater than PAGE_SIZE poison?

I simply do not want to check for HWPoison at a zillion places and have
each type of page need its own special handling, which can go wrong very
easily. I am not clear on the details here; this is something for the
users of hwpoison to define: what are the reasonable scenarios where the
feature is useful, then turn that into a feature list that can actually
be turned into a design document. See the difference from the
let's-put-some-more-on-top approach...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-11 14:56                     ` Michal Hocko
@ 2018-06-11 15:19                       ` Dan Williams
  2018-06-11 17:35                         ` Andi Kleen
  2018-06-12  1:50                         ` Naoya Horiguchi
  0 siblings, 2 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-11 15:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-nvdimm, linux-edac, Tony Luck, Borislav Petkov,
	Jérôme Glisse, Jan Kara, H. Peter Anvin, X86 ML,
	Thomas Gleixner, Christoph Hellwig, Ross Zwisler, Matthew Wilcox,
	Ingo Molnar, Naoya Horiguchi, Souptick Joarder, Linux MM,
	linux-fsdevel, Andi Kleen

On Mon, Jun 11, 2018 at 7:56 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 11-06-18 07:44:39, Dan Williams wrote:
> [...]
>> I'm still trying to understand the next level of detail on where you
>> think the design should go next? Is it just the HWPoison page flag?
>> Are you concerned about supporting greater than PAGE_SIZE poison?
>
> I simply do not want to check for HWPoison at a zillion places and have
> each type of page need its own special handling, which can go wrong very
> easily. I am not clear on the details here; this is something for the
> users of hwpoison to define: what are the reasonable scenarios where the
> feature is useful, then turn that into a feature list that can actually
> be turned into a design document. See the difference from the
> let's-put-some-more-on-top approach...
>

So you want me to pay the toll of writing a design document justifying
all the existing use cases of HWPoison before we fix the DAX bugs, and
the design document may or may not result in any substantive change to
these patches?

Naoya or Andi, can you chime in here?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-11 15:19                       ` Dan Williams
@ 2018-06-11 17:35                         ` Andi Kleen
  2018-06-12  1:50                         ` Naoya Horiguchi
  1 sibling, 0 replies; 32+ messages in thread
From: Andi Kleen @ 2018-06-11 17:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, linux-nvdimm, linux-edac, Tony Luck,
	Borislav Petkov, Jérôme Glisse, Jan Kara,
	H. Peter Anvin, X86 ML, Thomas Gleixner, Christoph Hellwig,
	Ross Zwisler, Matthew Wilcox, Ingo Molnar, Naoya Horiguchi,
	Souptick Joarder, Linux MM, linux-fsdevel

On Mon, Jun 11, 2018 at 08:19:54AM -0700, Dan Williams wrote:
> On Mon, Jun 11, 2018 at 7:56 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 11-06-18 07:44:39, Dan Williams wrote:
> > [...]
> >> I'm still trying to understand the next level of detail on where you
> >> think the design should go next? Is it just the HWPoison page flag?
> >> Are you concerned about supporting greater than PAGE_SIZE poison?
> >
> > I simply do not want to check for HWPoison at a zillion places and have
> > each type of page need its own special handling, which can go wrong very
> > easily. I am not clear on the details here; this is something for the
> > users of hwpoison to define: what are the reasonable scenarios where the
> > feature is useful, then turn that into a feature list that can actually
> > be turned into a design document. See the difference from the
> > let's-put-some-more-on-top approach...
> >
> 
> So you want me to pay the toll of writing a design document justifying
> all the existing use cases of HWPoison before we fix the DAX bugs, and
> the design document may or may not result in any substantive change to
> these patches?
> 
> Naoya or Andi, can you chime in here?

A new document doesn't make any sense. We have the commit messages and
the code comments as design documents, and as usual the ultimate authority
is what the code does.

The guiding light for new memory recovery code is just these sentences (taken
from the beginning of the main file):

 * In general any code for handling new cases should only be added iff:
 * - You know how to test it.
 * - You have a test that can be added to mce-test
 *   https://git.kernel.org/cgit/utils/cpu/mce/mce-test.git/
 * - The case actually shows up as a frequent (top 10) page state in
 *   tools/vm/page-types when running a real workload.

Since persistent memory is so big it makes sense to add support
for it in common code paths. That is usually just kernel copies and
user space execution.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-11 15:19                       ` Dan Williams
  2018-06-11 17:35                         ` Andi Kleen
@ 2018-06-12  1:50                         ` Naoya Horiguchi
  2018-06-12  1:58                           ` Dan Williams
  2018-06-12  4:04                           ` Jane Chu
  1 sibling, 2 replies; 32+ messages in thread
From: Naoya Horiguchi @ 2018-06-12  1:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, linux-nvdimm, linux-edac, Tony Luck,
	Borislav Petkov, Jérôme Glisse, Jan Kara,
	H. Peter Anvin, X86 ML, Thomas Gleixner, Christoph Hellwig,
	Ross Zwisler, Matthew Wilcox, Ingo Molnar, Souptick Joarder,
	Linux MM, linux-fsdevel, Andi Kleen

On Mon, Jun 11, 2018 at 08:19:54AM -0700, Dan Williams wrote:
> On Mon, Jun 11, 2018 at 7:56 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 11-06-18 07:44:39, Dan Williams wrote:
> > [...]
> >> I'm still trying to understand the next level of detail on where you
> >> think the design should go next? Is it just the HWPoison page flag?
> >> Are you concerned about supporting greater than PAGE_SIZE poison?
> >
> > I simply do not want to check for HWPoison at a zillion places and have
> > each type of page need its own special handling, which can go wrong very
> > easily. I am not clear on the details here; this is something for the
> > users of hwpoison to define: what are the reasonable scenarios where the
> > feature is useful, then turn that into a feature list that can actually
> > be turned into a design document. See the difference from the
> > let's-put-some-more-on-top approach...
> >
> 
> So you want me to pay the toll of writing a design document justifying
> all the existing use cases of HWPoison before we fix the DAX bugs, and
> the design document may or may not result in any substantive change to
> these patches?
> 
> Naoya or Andi, can you chime in here?

memory_failure() does 3 things:

 - unmapping the error page from processes using it,
 - isolating the error page with PageHWPoison,
 - logging/reporting.

The unmapping part and the isolating part are quite page-type dependent,
so it seems hard to me to do them in a generic manner (supporting a new
page type always needs case-specific new code).
But I agree that we can improve the code and documentation to help
developers add support for a new page type.

About documenting: the content of Documentation/vm/hwpoison.rst has not
been updated since 2009, so some update covering the design might be
required. My current thoughts on items to update are:

  - detailing general workflow,
  - adding some about soft offline,
  - guideline for developers to support new type of memory,
  (- and anything helpful/requested.)

Making the code more readable/self-descriptive would be helpful, though
I'm not yet clear on how.

Anyway I'll find time to work on this, while now I'm testing the dax
support patches and fixing a bug I found recently.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-12  1:50                         ` Naoya Horiguchi
@ 2018-06-12  1:58                           ` Dan Williams
  2018-06-12  4:04                           ` Jane Chu
  1 sibling, 0 replies; 32+ messages in thread
From: Dan Williams @ 2018-06-12  1:58 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Michal Hocko, linux-nvdimm, linux-edac, Tony Luck,
	Borislav Petkov, Jérôme Glisse, Jan Kara,
	H. Peter Anvin, X86 ML, Thomas Gleixner, Christoph Hellwig,
	Ross Zwisler, Matthew Wilcox, Ingo Molnar, Souptick Joarder,
	Linux MM, linux-fsdevel, Andi Kleen

On Mon, Jun 11, 2018 at 6:50 PM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> On Mon, Jun 11, 2018 at 08:19:54AM -0700, Dan Williams wrote:
[..]
> Anyway I'll find time to work on this, while now I'm testing the dax
> support patches and fixing a bug I found recently.

Ok, with this and other review feedback these patches are not ready
for 4.18. I'll circle back for 4.19 and we can try again.

Thanks for taking a look!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-06-12  1:50                         ` Naoya Horiguchi
  2018-06-12  1:58                           ` Dan Williams
@ 2018-06-12  4:04                           ` Jane Chu
  1 sibling, 0 replies; 32+ messages in thread
From: Jane Chu @ 2018-06-12  4:04 UTC (permalink / raw)
  To: Naoya Horiguchi, Dan Williams
  Cc: Andi Kleen, Tony Luck, Jan Kara, Matthew Wilcox, linux-nvdimm,
	X86 ML, Michal Hocko, Linux MM, Jérôme Glisse,
	Ingo Molnar, Borislav Petkov, Souptick Joarder, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, Christoph Hellwig, linux-edac

On 6/11/2018 6:50 PM, Naoya Horiguchi wrote:

> On Mon, Jun 11, 2018 at 08:19:54AM -0700, Dan Williams wrote:
>> On Mon, Jun 11, 2018 at 7:56 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>> On Mon 11-06-18 07:44:39, Dan Williams wrote:
>>> [...]
>>>> I'm still trying to understand the next level of detail on where you
>>>> think the design should go next? Is it just the HWPoison page flag?
>>>> Are you concerned about supporting greater than PAGE_SIZE poison?
>>> I simply do not want to check for HWPoison at a zillion places and have
>>> each type of page need its own special handling, which can go wrong very
>>> easily. I am not clear on the details here; this is something for the
>>> users of hwpoison to define: what are the reasonable scenarios where the
>>> feature is useful, then turn that into a feature list that can actually
>>> be turned into a design document. See the difference from the
>>> let's-put-some-more-on-top approach...
>>>
>> So you want me to pay the toll of writing a design document justifying
>> all the existing use cases of HWPoison before we fix the DAX bugs, and
>> the design document may or may not result in any substantive change to
>> these patches?
>>
>> Naoya or Andi, can you chime in here?
> memory_failure() does 3 things:
>
>   - unmapping the error page from processes using it,
>   - isolating the error page with PageHWPoison,
>   - logging/reporting.
>
> The unmapping part and the isolating part are quite page-type dependent,
> so it seems hard to me to do them in a generic manner (supporting a new
> page type always needs case-specific new code).
> But I agree that we can improve the code and documentation to help
> developers add support for a new page type.
>
> About documenting: the content of Documentation/vm/hwpoison.rst has not
> been updated since 2009, so some update covering the design might be
> required. My current thoughts on items to update are:
>
>    - detailing general workflow,
>    - adding some about soft offline,
>    - guideline for developers to support new type of memory,
>    (- and anything helpful/requested.)
>
> Making the code more readable/self-descriptive would be helpful, though
> I'm not yet clear on how.
>
> Anyway I'll find time to work on this, while now I'm testing the dax
> support patches and fixing a bug I found recently.

Thank you. Maybe it's already on your mind, but just in case: when you
update the document, would you add these characteristics of pmem error
handling?
   . UE/poison can be repaired until the wear and tear reaches a max level
   . many user applications mmap the entire capacity, leaving no spare pages
     for swapping (unlike the volatile memory UE handling)
   . the what-you-see-is-what-you-get nature
Regarding the HWPOISON redesign, a nagging thought is that a memory UE
typically implicates a very small blast radius, less than 4KB. But it
seems that the larger the page size, the greater the 'penalty' in terms
of how much memory ends up being offlined. Is there a way to be more
frugal?

Thanks!
-jane

>
> Thanks,
> Naoya Horiguchi
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-06-12  4:04 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-03  5:22 [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
2018-06-03  5:22 ` [PATCH v2 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
2018-06-03  5:22 ` [PATCH v2 02/11] device-dax: Cleanup vm_fault de-reference chains Dan Williams
2018-06-03  5:22 ` [PATCH v2 03/11] device-dax: Enable page_mapping() Dan Williams
2018-06-03  5:23 ` [PATCH v2 04/11] device-dax: Set page->index Dan Williams
2018-06-03  5:23 ` [PATCH v2 05/11] filesystem-dax: " Dan Williams
2018-06-03  5:23 ` [PATCH v2 06/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
2018-06-03  5:23 ` [PATCH v2 07/11] x86, memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
2018-06-04 17:08   ` Luck, Tony
2018-06-04 17:39     ` Dan Williams
2018-06-04 18:08       ` Luck, Tony
2018-06-04 18:35         ` Dan Williams
2018-06-03  5:23 ` [PATCH v2 08/11] mm, memory_failure: Pass page size to kill_proc() Dan Williams
2018-06-03  5:23 ` [PATCH v2 09/11] mm, memory_failure: Fix page->mapping assumptions relative to the page lock Dan Williams
2018-06-03  5:23 ` [PATCH v2 10/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
2018-06-03  5:23 ` [PATCH v2 11/11] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
2018-06-04 12:40 ` [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Michal Hocko
2018-06-04 14:31   ` Dan Williams
2018-06-05 14:11     ` Michal Hocko
2018-06-05 14:33       ` Dan Williams
2018-06-06  7:39         ` Michal Hocko
2018-06-06 13:44           ` Dan Williams
2018-06-07 14:37             ` Michal Hocko
2018-06-07 16:52               ` Dan Williams
2018-06-11  7:50                 ` Michal Hocko
2018-06-11 14:44                   ` Dan Williams
2018-06-11 14:56                     ` Michal Hocko
2018-06-11 15:19                       ` Dan Williams
2018-06-11 17:35                         ` Andi Kleen
2018-06-12  1:50                         ` Naoya Horiguchi
2018-06-12  1:58                           ` Dan Williams
2018-06-12  4:04                           ` Jane Chu
