linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
@ 2018-05-22 14:39 Dan Williams
  2018-05-22 14:39 ` [PATCH 01/11] device-dax: convert to vmf_insert_mixed and vm_fault_t Dan Williams
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:39 UTC
  To: linux-nvdimm
  Cc: linux-edac, Tony Luck, Borislav Petkov, stable, Jan Kara,
	H. Peter Anvin, x86, Thomas Gleixner, Andi Kleen,
	Christoph Hellwig, Ross Zwisler, Matthew Wilcox, Ingo Molnar,
	Michal Hocko, Naoya Horiguchi, Jérôme Glisse,
	Wu Fengguang, Souptick Joarder, linux-mm, linux-fsdevel,
	tony.luck

As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.

In order to support reliable reverse mapping of user space addresses, add
new locking in the fsdax implementation to prevent races between
page-address_space disassociation events and the rmap performed in the
memory_failure() path. Additionally, since dev_pagemap pages are hidden
from the page allocator, add a mechanism to determine the size of the
mapping that encompasses a given poisoned pfn. Lastly, since pmem errors
can be repaired, change the protection against speculative access to
poison, mce_unmap_kpfn(), to be reversible and to otherwise allow
ongoing access from the kernel.

---

Dan Williams (11):
      device-dax: convert to vmf_insert_mixed and vm_fault_t
      device-dax: cleanup vm_fault de-reference chains
      device-dax: enable page_mapping()
      device-dax: set page->index
      filesystem-dax: set page->index
      filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock
      mm, madvise_inject_error: fix page count leak
      x86, memory_failure: introduce {set,clear}_mce_nospec()
      mm, memory_failure: pass page size to kill_proc()
      mm, memory_failure: teach memory_failure() about dev_pagemap pages
      libnvdimm, pmem: restore page attributes when clearing errors


 arch/x86/include/asm/set_memory.h         |   29 ++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ---
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 +-------
 drivers/dax/device.c                      |   91 ++++++++++++--------
 drivers/nvdimm/pmem.c                     |   26 ++++++
 drivers/nvdimm/pmem.h                     |   13 +++
 fs/dax.c                                  |  102 ++++++++++++++++++++--
 include/linux/huge_mm.h                   |    5 +
 include/linux/set_memory.h                |   14 +++
 mm/huge_memory.c                          |    4 -
 mm/madvise.c                              |   11 ++
 mm/memory-failure.c                       |  133 +++++++++++++++++++++++++++--
 12 files changed, 370 insertions(+), 111 deletions(-)


* [PATCH 01/11] device-dax: convert to vmf_insert_mixed and vm_fault_t
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
@ 2018-05-22 14:39 ` Dan Williams
  2018-05-22 14:39 ` [PATCH 02/11] device-dax: cleanup vm_fault de-reference chains Dan Williams
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:39 UTC
  To: linux-nvdimm
  Cc: Souptick Joarder, Matthew Wilcox, Ross Zwisler, hch, linux-mm,
	linux-fsdevel, tony.luck

Use the new return type vm_fault_t for the fault and huge_fault handlers.
For now, this just documents that the functions return a VM_FAULT value
rather than an errno. Once all instances are converted, vm_fault_t will
become a distinct type.

See commit 1c8f422059ae ("mm: change return type to vm_fault_t").

Previously, vm_insert_mixed() returned an error code that the driver
mapped to a VM_FAULT_* value. The new function vmf_insert_mixed() removes
this inefficiency by returning a VM_FAULT_* value directly.
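
For illustration, the translation that the driver used to open-code can
be modeled in user space as below. This is a minimal sketch rather than
kernel code: errno_to_vm_fault() is a hypothetical helper name, and the
VM_FAULT_* values are stand-ins for the definitions in the kernel
headers.

```c
#include <errno.h>

/* Illustrative stand-ins for the kernel's vm_fault_t and VM_FAULT_* codes. */
typedef unsigned int vm_fault_t;
#define VM_FAULT_OOM    0x0001u
#define VM_FAULT_SIGBUS 0x0002u
#define VM_FAULT_NOPAGE 0x0100u

/*
 * Hypothetical helper mirroring the logic removed from
 * __dev_dax_pte_fault(): -ENOMEM becomes VM_FAULT_OOM, any other error
 * except -EBUSY becomes VM_FAULT_SIGBUS, and everything else (success
 * or -EBUSY) becomes VM_FAULT_NOPAGE.
 */
static vm_fault_t errno_to_vm_fault(int rc)
{
	if (rc == -ENOMEM)
		return VM_FAULT_OOM;
	if (rc < 0 && rc != -EBUSY)
		return VM_FAULT_SIGBUS;
	return VM_FAULT_NOPAGE;
}
```

With vmf_insert_mixed() the caller no longer needs a helper like this;
__dev_dax_pte_fault() returns the insertion result directly.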

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c    |   26 +++++++++++---------------
 include/linux/huge_mm.h |    5 +++--
 mm/huge_memory.c        |    4 ++--
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index aff2c1594220..d44d98c54d0f 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -244,11 +244,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
-	int rc = VM_FAULT_SIGBUS;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
@@ -274,17 +274,11 @@ static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
-
-	if (rc == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (rc < 0 && rc != -EBUSY)
-		return VM_FAULT_SIGBUS;
-
-	return VM_FAULT_NOPAGE;
+	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
 
-static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -334,7 +328,8 @@ static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -384,13 +379,14 @@ static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	return VM_FAULT_FALLBACK;
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-static int dev_dax_huge_fault(struct vm_fault *vmf,
+static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
@@ -420,7 +416,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
 	return rc;
 }
 
-static int dev_dax_fault(struct vm_fault *vmf)
+static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
 {
 	return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
 }
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a126259bc4..d3bbf6bea9e9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
 #define _LINUX_HUGE_MM_H
 
 #include <linux/sched/coredump.h>
+#include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
@@ -46,9 +47,9 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a3a1815f8e11..6af976472a5d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -755,7 +755,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
@@ -815,7 +815,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;


* [PATCH 02/11] device-dax: cleanup vm_fault de-reference chains
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-05-22 14:39 ` [PATCH 01/11] device-dax: convert to vmf_insert_mixed and vm_fault_t Dan Williams
@ 2018-05-22 14:39 ` Dan Williams
  2018-05-22 14:39 ` [PATCH 03/11] device-dax: enable page_mapping() Dan Williams
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:39 UTC
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, tony.luck

Define a local 'vma' variable rather than repeatedly dereferencing the
passed-in 'struct vm_fault *' instance.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index d44d98c54d0f..686de08e120b 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -247,13 +247,14 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -274,13 +275,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	return vmf_insert_mixed(vma, vmf->address, pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
@@ -288,7 +290,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -310,11 +312,10 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
-	if (pmd_addr < vmf->vma->vm_start ||
-			(pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
+	if (pmd_addr < vma->vm_start || (pmd_addr + PMD_SIZE) > vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	pgoff = linear_page_index(vmf->vma, pmd_addr);
+	pgoff = linear_page_index(vma, pmd_addr);
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, PMD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -323,7 +324,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
@@ -332,6 +333,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
@@ -340,7 +342,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	unsigned int fault_size = PUD_SIZE;
 
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -362,11 +364,10 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
-	if (pud_addr < vmf->vma->vm_start ||
-			(pud_addr + PUD_SIZE) > vmf->vma->vm_end)
+	if (pud_addr < vma->vm_start || (pud_addr + PUD_SIZE) > vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	pgoff = linear_page_index(vmf->vma, pud_addr);
+	pgoff = linear_page_index(vma, pud_addr);
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, PUD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -375,7 +376,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -390,12 +391,13 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
-	struct file *filp = vmf->vma->vm_file;
+	struct vm_area_struct *vma = vmf->vma;
+	struct file *filp = vma->vm_file;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
 			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
-			vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
+			vma->vm_start, vma->vm_end, pe_size);
 
 	id = dax_read_lock();
 	switch (pe_size) {


* [PATCH 03/11] device-dax: enable page_mapping()
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-05-22 14:39 ` [PATCH 01/11] device-dax: convert to vmf_insert_mixed and vm_fault_t Dan Williams
  2018-05-22 14:39 ` [PATCH 02/11] device-dax: cleanup vm_fault de-reference chains Dan Williams
@ 2018-05-22 14:39 ` Dan Williams
  2018-05-23  9:03   ` Jan Kara
  2018-05-30 19:54   ` kbuild test robot
  2018-05-22 14:39 ` [PATCH 04/11] device-dax: set page->index Dan Williams
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:39 UTC
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, tony.luck

In support of enabling memory_failure() handling for device-dax
mappings, set the ->mapping association of pages backing device-dax
mappings. The rmap implementation requires page_mapping() to return the
address_space hosting the vmas that map the page.

The ->mapping pointer is never cleared. There is no possibility for the
page to become associated with another address_space while the device is
enabled. When the device is disabled, the 'struct page' array for the
device is destroyed and later reinitialized to zero.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   47 +++++++++++++++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 686de08e120b..8e986478d48d 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -245,13 +245,12 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
-	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
 	if (check_vma(dev_dax, vma, __func__))
@@ -273,13 +272,13 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vma, vmf->address, pfn);
+	return vmf_insert_mixed(vma, vmf->address, *pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
@@ -287,7 +286,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
 	if (check_vma(dev_dax, vma, __func__))
@@ -322,15 +320,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
@@ -338,7 +336,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
 
@@ -374,9 +371,9 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -390,9 +387,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
-	int rc, id;
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *filp = vma->vm_file;
+	unsigned long fault_size;
+	int rc, id;
+	pfn_t pfn;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
@@ -402,17 +401,33 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	id = dax_read_lock();
 	switch (pe_size) {
 	case PE_SIZE_PTE:
-		rc = __dev_dax_pte_fault(dev_dax, vmf);
+		fault_size = PAGE_SIZE;
+		rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PMD:
-		rc = __dev_dax_pmd_fault(dev_dax, vmf);
+		fault_size = PMD_SIZE;
+		rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PUD:
-		rc = __dev_dax_pud_fault(dev_dax, vmf);
+		fault_size = PUD_SIZE;
+		rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
 		break;
 	default:
 		rc = VM_FAULT_SIGBUS;
 	}
+
+	if (rc == VM_FAULT_NOPAGE) {
+		unsigned long i;
+
+		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+			struct page *page;
+
+			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+			if (page->mapping)
+				continue;
+			page->mapping = filp->f_mapping;
+		}
+	}
 	dax_read_unlock(id);
 
 	return rc;


* [PATCH 04/11] device-dax: set page->index
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (2 preceding siblings ...)
  2018-05-22 14:39 ` [PATCH 03/11] device-dax: enable page_mapping() Dan Williams
@ 2018-05-22 14:39 ` Dan Williams
  2018-05-22 14:39 ` [PATCH 05/11] filesystem-dax: " Dan Williams
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:39 UTC
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, tony.luck

In support of enabling memory_failure() handling for device-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree.
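
The index arithmetic can be sketched in user space as follows. This is
an assumption-laden model, not the kernel implementation: struct
vma_sketch and both function names are hypothetical, PAGE_SHIFT is fixed
at 12, and linear_page_index_sketch() only models the non-hugetlb case
of linear_page_index().

```c
#include <stddef.h>

#define PAGE_SHIFT 12UL
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical, reduced stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Page offset within the file that backs 'address'. */
static unsigned long linear_page_index_sketch(const struct vma_sketch *vma,
					      unsigned long address)
{
	return ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Index of the i-th base page in the fault_size-aligned block that
 * contains 'address'. fault_size must be a power of two.
 */
static unsigned long page_index_for_fault(const struct vma_sketch *vma,
					  unsigned long address,
					  unsigned long fault_size,
					  unsigned long i)
{
	unsigned long base = address & ~(fault_size - 1);

	return linear_page_index_sketch(vma, base) + i;
}
```

Aligning the address down to the fault size first means that a PMD or
PUD sized fault assigns the same ->index to a given page as a PTE sized
fault on that page would.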

The ->index value is never cleared. There is no possibility for the
page to become associated with another pgoff while the device is
enabled. When the device is disabled, the 'struct page' array for the
device is destroyed and ->index is reinitialized to zero.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 8e986478d48d..b33e45ee4f70 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -418,7 +418,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 
 	if (rc == VM_FAULT_NOPAGE) {
 		unsigned long i;
+		pgoff_t pgoff;
 
+		pgoff = linear_page_index(vma, vmf->address
+				& ~(fault_size - 1));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
@@ -426,6 +429,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 			if (page->mapping)
 				continue;
 			page->mapping = filp->f_mapping;
+			page->index = pgoff + i;
 		}
 	}
 	dax_read_unlock(id);


* [PATCH 05/11] filesystem-dax: set page->index
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (3 preceding siblings ...)
  2018-05-22 14:39 ` [PATCH 04/11] device-dax: set page->index Dan Williams
@ 2018-05-22 14:39 ` Dan Williams
  2018-05-23  8:40   ` Jan Kara
  2018-05-22 14:40 ` [PATCH 06/11] filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock Dan Williams
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:39 UTC
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	linux-mm, linux-fsdevel, tony.luck

In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.
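
As a rough user space model of that set-and-clear pairing (all names
here are hypothetical and struct page_sketch only carries the two fields
under discussion):

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct page. */
struct page_sketch {
	const void *mapping;	/* owning address_space, NULL if none */
	unsigned long index;	/* page offset within the file */
};

/* Model of association: tag each page of the entry with owner and offset. */
static void associate_sketch(struct page_sketch *pages, size_t first_pfn,
			     size_t nr, const void *mapping,
			     unsigned long index)
{
	for (size_t i = 0; i < nr; i++) {
		pages[first_pfn + i].mapping = mapping;
		pages[first_pfn + i].index = index + i;
	}
}

/* Model of disassociation: return each page to the unowned state. */
static void disassociate_sketch(struct page_sketch *pages, size_t first_pfn,
				size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		pages[first_pfn + i].mapping = NULL;
		pages[first_pfn + i].index = 0;
	}
}
```

Clearing ->index together with ->mapping keeps a page from carrying a
stale offset into its next association.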

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index aaec72ded1b6..2e4682cd7c69 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
 	for (pfn = dax_radix_pfn(entry); \
 			pfn < dax_radix_end_pfn(entry); pfn++)
 
-static void dax_associate_entry(void *entry, struct address_space *mapping)
+static void dax_associate_entry(void *entry, struct address_space *mapping,
+		struct vm_area_struct *vma, unsigned long address)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn, index;
+	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
+	index = linear_page_index(vma, address & ~(size - 1));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(page->mapping);
 		page->mapping = mapping;
+		page->index = index + i++;
 	}
 }
 
@@ -348,6 +352,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
+		page->index = 0;
 	}
 }
 
@@ -604,7 +609,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	new_entry = dax_radix_locked_entry(pfn, flags);
 	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping);
+		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {


* [PATCH 06/11] filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (4 preceding siblings ...)
  2018-05-22 14:39 ` [PATCH 05/11] filesystem-dax: " Dan Williams
@ 2018-05-22 14:40 ` Dan Williams
  2018-05-23  9:35   ` Jan Kara
  2018-05-22 14:40 ` [PATCH 07/11] mm, madvise_inject_error: fix page count leak Dan Williams
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:40 UTC
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	linux-mm, linux-fsdevel, tony.luck

Hold the page lock while invalidating mapping entries to prevent races
between rmap using the address_space and the filesystem freeing the
address_space.

This is more complicated than the simple description implies because
the dev_pagemap pages that fsdax uses do not have any concept of page
size. Size information is stored in the radix tree and can only be
safely read while holding the xa_lock. Since lock_page() cannot be
taken while holding the xa_lock, drop the xa_lock and speculatively lock
all the associated pages. Once all the pages are locked, re-take the
xa_lock and revalidate that the radix tree entry did not change.
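
The drop, lock, and revalidate sequence can be sketched with pthread
mutexes. This is only a user space analogy under stated assumptions:
tree_lock stands in for the xa_lock, page_lock[] for the per-page locks,
a pointer comparison for the radix entry comparison, and every name
below is hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Hypothetical entry covering pages [first, first + count). */
struct entry_sketch { size_t first; size_t count; };

struct slot_sketch {
	pthread_mutex_t tree_lock;		/* models the xa_lock */
	const struct entry_sketch *entry;	/* NULL once invalidated */
	pthread_mutex_t page_lock[NPAGES];	/* models lock_page() */
};

static void lock_entry_pages(struct slot_sketch *s,
			     const struct entry_sketch *e)
{
	for (size_t i = e->first; i < e->first + e->count; i++)
		pthread_mutex_lock(&s->page_lock[i]);
}

static void unlock_entry_pages(struct slot_sketch *s,
			       const struct entry_sketch *e)
{
	/* Release in reverse order of acquisition. */
	for (size_t i = e->first + e->count; i-- > e->first; )
		pthread_mutex_unlock(&s->page_lock[i]);
}

/*
 * Returns true with tree_lock and all page locks of *out held.
 * Returns false with nothing held if the entry went away.
 */
static bool lock_pages_sketch(struct slot_sketch *s,
			      const struct entry_sketch **out)
{
	const struct entry_sketch *seen, *now;

	pthread_mutex_lock(&s->tree_lock);
	seen = s->entry;
	pthread_mutex_unlock(&s->tree_lock);

	while (seen) {
		/* Page locks may sleep, so take them without tree_lock. */
		lock_entry_pages(s, seen);

		pthread_mutex_lock(&s->tree_lock);
		now = s->entry;
		if (now == seen) {
			*out = seen;	/* entry is stable */
			return true;
		}
		pthread_mutex_unlock(&s->tree_lock);
		unlock_entry_pages(s, seen);
		seen = now;		/* retry, or stop if now is NULL */
	}
	return false;
}

static void unlock_pages_sketch(struct slot_sketch *s,
				const struct entry_sketch *e)
{
	pthread_mutex_unlock(&s->tree_lock);
	unlock_entry_pages(s, e);
}
```

The revalidation step is what makes the speculative page locking safe:
if the entry changed while tree_lock was dropped, the locks taken for
the old entry are released and the loop retries against the new one.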

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 85 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2e4682cd7c69..e6d44d336283 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,6 +319,13 @@ static unsigned long dax_radix_end_pfn(void *entry)
 	for (pfn = dax_radix_pfn(entry); \
 			pfn < dax_radix_end_pfn(entry); pfn++)
 
+#define for_each_mapped_pfn_reverse(entry, pfn) \
+	for (pfn = dax_radix_end_pfn(entry) - 1; \
+			dax_entry_size(entry) \
+			&& pfn >= dax_radix_pfn(entry); \
+			pfn--)
+
+
 static void dax_associate_entry(void *entry, struct address_space *mapping,
 		struct vm_area_struct *vma, unsigned long address)
 {
@@ -497,6 +504,80 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+static bool dax_lock_pages(struct address_space *mapping, pgoff_t index,
+		void **entry)
+{
+	struct radix_tree_root *pages = &mapping->i_pages;
+	unsigned long pfn;
+	void *entry2;
+
+	xa_lock_irq(pages);
+	*entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	if (!*entry || WARN_ON_ONCE(!radix_tree_exceptional_entry(*entry))) {
+		put_unlocked_mapping_entry(mapping, index, *entry);
+		xa_unlock_irq(pages);
+		return false;
+	}
+
+	/*
+	 * In the limited case there are no races to prevent with rmap,
+	 * because rmap can not perform pfn_to_page().
+	 */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return true;
+
+	/*
+	 * Now, drop the xa_lock, grab all the page locks then validate
+	 * that the entry has not changed and return with the xa_lock
+	 * held.
+	 */
+	xa_unlock_irq(pages);
+
+	/*
+	 * Retry until the entry stabilizes or someone else invalidates
+	 * the entry;
+	 */
+	for (;;) {
+		for_each_mapped_pfn(*entry, pfn)
+			lock_page(pfn_to_page(pfn));
+
+		xa_lock_irq(pages);
+		entry2 = get_unlocked_mapping_entry(mapping, index, NULL);
+		if (!entry2 || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry2))
+				|| entry2 != *entry) {
+			put_unlocked_mapping_entry(mapping, index, entry2);
+			xa_unlock_irq(pages);
+
+			for_each_mapped_pfn_reverse(*entry, pfn)
+				unlock_page(pfn_to_page(pfn));
+
+			if (!entry2 || !radix_tree_exceptional_entry(entry2))
+				return false;
+			*entry = entry2;
+			continue;
+		}
+		break;
+	}
+
+	return true;
+}
+
+static void dax_unlock_pages(struct address_space *mapping, pgoff_t index,
+		void *entry)
+{
+	struct radix_tree_root *pages = &mapping->i_pages;
+	unsigned long pfn;
+
+	put_unlocked_mapping_entry(mapping, index, entry);
+	xa_unlock_irq(pages);
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_mapped_pfn_reverse(entry, pfn)
+		unlock_page(pfn_to_page(pfn));
+}
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -504,10 +585,8 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	void *entry;
 	struct radix_tree_root *pages = &mapping->i_pages;
 
-	xa_lock_irq(pages);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
-	if (!entry || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)))
-		goto out;
+	if (!dax_lock_pages(mapping, index, &entry))
+		return ret;
 	if (!trunc &&
 	    (radix_tree_tag_get(pages, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(pages, index, PAGECACHE_TAG_TOWRITE)))
@@ -517,8 +596,8 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	mapping->nrexceptional--;
 	ret = 1;
 out:
-	put_unlocked_mapping_entry(mapping, index, entry);
-	xa_unlock_irq(pages);
+	dax_unlock_pages(mapping, index, entry);
+
 	return ret;
 }
 /*


* [PATCH 07/11] mm, madvise_inject_error: fix page count leak
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (5 preceding siblings ...)
  2018-05-22 14:40 ` [PATCH 06/11] filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock Dan Williams
@ 2018-05-22 14:40 ` Dan Williams
  2018-05-23  4:19   ` Naoya Horiguchi
  2018-05-22 14:40 ` [PATCH 08/11] x86, memory_failure: introduce {set, clear}_mce_nospec() Dan Williams
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:40 UTC
  To: linux-nvdimm
  Cc: stable, Michal Hocko, Andi Kleen, Wu Fengguang, hch, linux-mm,
	linux-fsdevel, tony.luck

The madvise_inject_error() routine uses get_user_pages() to look up the
pfn and other information for the injected error, but it fails to
release that pin.
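
The imbalance can be pictured with a toy reference count. This is a
deliberately simplified sketch with hypothetical names; it models only
the pin taken for the lookup, not the rest of the memory_failure()
reference handling.

```c
/* Hypothetical stand-in for struct page: just a reference count. */
struct page_sketch { int refcount; };

/* Models the pin taken by the get_user_pages() lookup. */
static void get_page_sketch(struct page_sketch *p) { p->refcount++; }

/* Models put_page(). */
static void put_page_sketch(struct page_sketch *p) { p->refcount--; }

/* Lookup without a matching release: the count stays elevated. */
static int inject_leaky(struct page_sketch *p)
{
	get_page_sketch(p);
	/* ... use the page to find the pfn, then report the error ... */
	return p->refcount;
}

/* Lookup with the pin dropped once the pfn has been recorded. */
static int inject_balanced(struct page_sketch *p)
{
	get_page_sketch(p);
	/* ... record the pfn while the pin is held ... */
	put_page_sketch(p);
	return p->refcount;
}
```

An elevated count of this kind is what the "still referenced by 1
users" message in the trace below reports.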

The dax-dma-vs-truncate warning catches this failure with the following
signature:

 Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
 Memory failure: 0x208900: reserved kernel page still referenced by 1 users
 Memory failure: 0x208900: recovery action for reserved kernel page: Failed
 WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
 CPU: 37 PID: 9566 Comm: umount Tainted: G        W  OE     4.17.0-rc6+ #1900
 [..]
 RIP: 0010:dax_disassociate_entry+0x4e/0x90
 RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
 RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
 RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
 RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
 R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
 R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
 FS:  00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
 Call Trace:
  __dax_invalidate_mapping_entry+0xab/0xe0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
  truncate_inode_pages_range+0x291/0x920
  ? kmem_cache_free+0x1f8/0x300
  ? lock_acquire+0x9f/0x200
  ? truncate_inode_pages_final+0x31/0x50
  ext4_evict_inode+0x69/0x740

Cc: <stable@vger.kernel.org>
Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/madvise.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..246fa4d4eee2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
 
 
 	for (; start < end; start += PAGE_SIZE << order) {
+		unsigned long pfn;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		pfn = page_to_pfn(page);
 
 		/*
 		 * When soft offlining hugepages, after migrating the page
@@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+					pfn, start);
 
 			ret = soft_offline_page(page, MF_COUNT_INCREASED);
+			put_page(page);
 			if (ret)
 				return ret;
 			continue;
 		}
+		put_page(page);
+
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+				pfn, start);
 
-		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
+		ret = memory_failure(pfn, MF_COUNT_INCREASED);
 		if (ret)
 			return ret;
 	}


* [PATCH 08/11] x86, memory_failure: introduce {set, clear}_mce_nospec()
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (6 preceding siblings ...)
  2018-05-22 14:40 ` [PATCH 07/11] mm, madvise_inject_error: fix page count leak Dan Williams
@ 2018-05-22 14:40 ` Dan Williams
  2018-05-22 14:40 ` [PATCH 09/11] mm, memory_failure: pass page size to kill_proc() Dan Williams
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Tony Luck,
	Borislav Petkov, linux-edac, hch, linux-mm, linux-fsdevel,
	tony.luck

Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.

Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_kpfn() to set_mce_nospec(), remapping the page
as UC to hide it from speculative accesses.

Given that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().
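
For readers comparing the two approaches, here is a hedged userspace sketch of the address arithmetic involved. The constants are assumptions chosen for illustration (4 KiB pages and a direct-map base of 0xffff880000000000, which is not guaranteed for any given kernel configuration), and linear_addr(), decoy_addr() and is_canonical() are hypothetical helper names. The removed mce_unmap_kpfn() operated on a non-canonical alias with bit 63 flipped, while set_mce_nospec() passes the ordinary linear-map address to set_memory_uc().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 values for illustration only. */
#define SKETCH_PAGE_SHIFT  12
#define SKETCH_PAGE_OFFSET UINT64_C(0xffff880000000000)
#define SKETCH_BIT63       (UINT64_C(1) << 63)

/* Linear-map address of a pfn, i.e. __va(PFN_PHYS(pfn)) in the patch. */
static uint64_t linear_addr(uint64_t pfn)
{
	return SKETCH_PAGE_OFFSET + (pfn << SKETCH_PAGE_SHIFT);
}

/* Alias used by the removed mce_unmap_kpfn(): same address, bit 63 flipped. */
static uint64_t decoy_addr(uint64_t pfn)
{
	return (pfn << SKETCH_PAGE_SHIFT) + (SKETCH_PAGE_OFFSET ^ SKETCH_BIT63);
}

/* With 48-bit virtual addresses, bits 63..47 must all be equal. */
static int is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the 17 bits 63..47 */

	return top == 0 || top == 0x1ffff;
}
```

Under these assumptions, pfn 0x208900 has a canonical linear address of 0xffff880208900000 and a non-canonical decoy of 0x7fff880208900000.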

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/set_memory.h         |   29 ++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 -----------
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 ++---------------------------
 include/linux/set_memory.h                |   14 +++++++++++
 4 files changed, 46 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index bd090367236c..debc1fee1457 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -88,4 +88,33 @@ extern int kernel_set_to_readonly;
 void set_kernel_text_rw(void);
 void set_kernel_text_ro(void);
 
+#ifdef CONFIG_X86_64
+/*
+ * Mark the linear address as UC to disable speculative pre-fetches into
+ * potentially poisoned memory.
+ */
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	int rc;
+
+	rc = set_memory_uc((unsigned long) __va(PFN_PHYS(pfn)), 1);
+	if (rc)
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+	return rc;
+}
+#define set_mce_nospec set_mce_nospec
+
+/* Restore full speculative operation to the pfn. */
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return set_memory_wb((unsigned long) __va(PFN_PHYS(pfn)), 1);
+}
+#define clear_mce_nospec clear_mce_nospec
+#else
+/*
+ * Few people would run a 32-bit kernel on a machine that supports
+ * recoverable errors because they have too much memory to boot 32-bit.
+ */
+#endif
+
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 374d1aa66952..ceb67cd5918f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -113,21 +113,6 @@ static inline void mce_register_injector_chain(struct notifier_block *nb)	{ }
 static inline void mce_unregister_injector_chain(struct notifier_block *nb)	{ }
 #endif
 
-#ifndef CONFIG_X86_64
-/*
- * On 32-bit systems it would be difficult to safely unmap a poison page
- * from the kernel 1:1 map because there are no non-canonical addresses that
- * we can use to refer to the address without risking a speculative access.
- * However, this isn't much of an issue because:
- * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
- *    are only mapped into the kernel as needed
- * 2) Few people would run a 32-bit kernel on a machine that supports
- *    recoverable errors because they have too much memory to boot 32-bit.
- */
-static inline void mce_unmap_kpfn(unsigned long pfn) {}
-#define mce_unmap_kpfn mce_unmap_kpfn
-#endif
-
 struct mca_config {
 	bool dont_log_ce;
 	bool cmci_disabled;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 42cf2880d0ed..a0fbf0a8b7e6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -42,6 +42,7 @@
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/jump_label.h>
+#include <linux/set_memory.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -50,7 +51,6 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
-#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
@@ -108,10 +108,6 @@ static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn);
-#endif
-
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -602,7 +598,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 	if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
 		pfn = mce->addr >> PAGE_SHIFT;
 		if (!memory_failure(pfn, 0))
-			mce_unmap_kpfn(pfn);
+			set_mce_nospec(pfn);
 	}
 
 	return NOTIFY_OK;
@@ -1070,38 +1066,10 @@ static int do_memory_failure(struct mce *m)
 	if (ret)
 		pr_err("Memory error not recovered");
 	else
-		mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
+		set_mce_nospec(m->addr >> PAGE_SHIFT);
 	return ret;
 }
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn)
-{
-	unsigned long decoy_addr;
-
-	/*
-	 * Unmap this page from the kernel 1:1 mappings to make sure
-	 * we don't log more errors because of speculative access to
-	 * the page.
-	 * We would like to just call:
-	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
-	 * but doing that would radically increase the odds of a
-	 * speculative access to the poison page because we'd have
-	 * the virtual address of the kernel 1:1 mapping sitting
-	 * around in registers.
-	 * Instead we get tricky.  We create a non-canonical address
-	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_np() not checking whether we passed
-	 * a legal address.
-	 */
-
-	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
-	if (set_memory_np(decoy_addr, 1))
-		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-}
-#endif
-
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index da5178216da5..2a986d282a97 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -17,6 +17,20 @@ static inline int set_memory_x(unsigned long addr,  int numpages) { return 0; }
 static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
+#ifndef set_mce_nospec
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
+#ifndef clear_mce_nospec
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
 #ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
 static inline int set_memory_encrypted(unsigned long addr, int numpages)
 {


* [PATCH 09/11] mm, memory_failure: pass page size to kill_proc()
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (7 preceding siblings ...)
  2018-05-22 14:40 ` [PATCH 08/11] x86, memory_failure: introduce {set, clear}_mce_nospec() Dan Williams
@ 2018-05-22 14:40 ` Dan Williams
  2018-05-23  6:41   ` Naoya Horiguchi
  2018-05-22 14:40 ` [PATCH 10/11] mm, memory_failure: teach memory_failure() about dev_pagemap pages Dan Williams
  2018-05-22 14:40 ` [PATCH 11/11] libnvdimm, pmem: restore page attributes when clearing errors Dan Williams
  10 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:40 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Naoya Horiguchi, hch, linux-mm, linux-fsdevel, tony.luck

Given that ZONE_DEVICE / dev_pagemap pages are never assembled into
compound pages, the size determination logic in kill_proc() needs
updating for the dev_pagemap case. In preparation for dev_pagemap
support, rework memory_failure() and kill_proc() to pass / consume the page
size explicitly.
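
As a small worked example of the value being passed around, the sketch below mirrors the "compound_order(...) + PAGE_SHIFT" expression from the diff in plain userspace C. SKETCH_PAGE_SHIFT is an assumed 4 KiB base page size, and size_shift_from_order() and bytes_from_shift() are hypothetical helper names, not kernel functions.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed: 4 KiB base pages */

/* Mirrors "compound_order(compound_head(p)) + PAGE_SHIFT". */
static unsigned int size_shift_from_order(unsigned int order)
{
	return order + SKETCH_PAGE_SHIFT;
}

/* Number of bytes described by a given shift. */
static unsigned long long bytes_from_shift(unsigned int size_shift)
{
	return 1ULL << size_shift;
}
```

With these assumptions an order-0 page gives a shift of 12 (4 KiB) and an order-9 compound page gives 21 (2 MiB). This shift is the value that kill_proc() hands to force_sig_mceerr() and send_sig_mceerr().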

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory-failure.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9d142b9b86dc..42a193ee14d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -179,18 +179,16 @@ EXPORT_SYMBOL_GPL(hwpoison_filter);
  * ``action required'' if error happened in current execution context
  */
 static int kill_proc(struct task_struct *t, unsigned long addr,
-			unsigned long pfn, struct page *page, int flags)
+			unsigned long pfn, unsigned size_shift, int flags)
 {
-	short addr_lsb;
 	int ret;
 
 	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
 		pfn, t->comm, t->pid);
-	addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
 
 	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
 		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
-				       addr_lsb, current);
+				       size_shift, current);
 	} else {
 		/*
 		 * Don't use force here, it's convenient if the signal
@@ -199,7 +197,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
 		 * to SIG_IGN, but hopefully no one will do that?
 		 */
 		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
-				      addr_lsb, t);  /* synchronous? */
+				      size_shift, t);  /* synchronous? */
 	}
 	if (ret < 0)
 		pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
@@ -318,7 +316,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
  * wrong earlier.
  */
 static void kill_procs(struct list_head *to_kill, int forcekill,
-			  bool fail, struct page *page, unsigned long pfn,
+			  bool fail, unsigned size_shift, unsigned long pfn,
 			  int flags)
 {
 	struct to_kill *tk, *next;
@@ -343,7 +341,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 * process anyways.
 			 */
 			else if (kill_proc(tk->tsk, tk->addr,
-					      pfn, page, flags) < 0)
+					      pfn, size_shift, flags) < 0)
 				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
 				       pfn, tk->tsk->comm, tk->tsk->pid);
 		}
@@ -928,6 +926,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	bool unmap_success;
+	unsigned size_shift;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
@@ -1012,7 +1011,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * any accesses to the poisoned memory.
 	 */
 	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
-	kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
+	size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
+	kill_procs(&tokill, forcekill, !unmap_success, size_shift, pfn, flags);
 
 	return unmap_success;
 }


* [PATCH 10/11] mm, memory_failure: teach memory_failure() about dev_pagemap pages
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (8 preceding siblings ...)
  2018-05-22 14:40 ` [PATCH 09/11] mm, memory_failure: pass page size to kill_proc() Dan Williams
@ 2018-05-22 14:40 ` Dan Williams
  2018-05-23  6:48   ` Naoya Horiguchi
  2018-05-22 14:40 ` [PATCH 11/11] libnvdimm, pmem: restore page attributes when clearing errors Dan Williams
  10 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Jérôme Glisse,
	Matthew Wilcox, Naoya Horiguchi, Ross Zwisler, linux-mm,
	linux-fsdevel, tony.luck

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered

In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical address to file offset associations. Also
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internally remap/repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.

Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.
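
The new handler unmaps the whole mapping that covers the poisoned pfn, using the arithmetic "start = (page->index << PAGE_SHIFT) & ~(size - 1)" and reporting ilog2(size) to the killed processes. The following is a minimal userspace sketch of that arithmetic only. It assumes 4 KiB pages; unmap_start() and sketch_ilog2() are hypothetical names, the latter standing in for the kernel's ilog2().

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed: 4 KiB base pages */

/* File offset of the mapping that contains page 'index'; size is a power of two. */
static uint64_t unmap_start(uint64_t index, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (index << SKETCH_PAGE_SHIFT) & ~(size - 1);
}

/* Integer log2 of a power of two. */
static unsigned int sketch_ilog2(uint64_t size)
{
	unsigned int shift = 0;

	while (size > 1) {
		size >>= 1;
		shift++;
	}
	return shift;
}
```

For a page at index 0x300 inside a 2 MiB (PMD-sized) mapping, the unmap range starts at file offset 0x200000 and the reported shift is 21. For a 4 KiB mapping the range starts at 0x300000 and the shift is 12.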

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory-failure.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 42a193ee14d3..f95036f99a79 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -55,6 +55,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm_inline.h>
+#include <linux/memremap.h>
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include "internal.h"
@@ -1112,6 +1113,117 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	return res;
 }
 
+static unsigned long dax_mapping_size(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page_to_pgoff(page);
+	struct vm_area_struct *vma;
+	unsigned long size = 0;
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long address = vma_address(page, vma);
+		pgd_t *pgd;
+		p4d_t *p4d;
+		pud_t *pud;
+		pmd_t *pmd;
+		pte_t *pte;
+
+		pgd = pgd_offset(vma->vm_mm, address);
+		if (!pgd_present(*pgd))
+			continue;
+		p4d = p4d_offset(pgd, address);
+		if (!p4d_present(*p4d))
+			continue;
+		pud = pud_offset(p4d, address);
+		if (!pud_present(*pud))
+			continue;
+		if (pud_devmap(*pud)) {
+			size = PUD_SIZE;
+			break;
+		}
+		pmd = pmd_offset(pud, address);
+		if (!pmd_present(*pmd))
+			continue;
+		if (pmd_devmap(*pmd)) {
+			size = PMD_SIZE;
+			break;
+		}
+		pte = pte_offset_map(pmd, address);
+		if (!pte_present(*pte))
+			continue;
+		if (pte_devmap(*pte)) {
+			size = PAGE_SIZE;
+			break;
+		}
+	}
+	i_mmap_unlock_read(mapping);
+	return size;
+}
+
+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	const bool unmap_success = true;
+	unsigned long size;
+	LIST_HEAD(tokill);
+	int rc = -EBUSY;
+	loff_t start;
+
+	lock_page(page);
+	if (hwpoison_filter(page)) {
+		rc = 0;
+		goto out;
+	}
+
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		goto out;
+	default:
+		if (!page->mapping)
+			goto out;
+		break;
+	}
+
+	/*
+	 * If the page is not mapped in userspace then report it as
+	 * unhandled.
+	 */
+	size = dax_mapping_size(page);
+	if (!size) {
+		pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
+		goto out;
+	}
+
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+
+	start = (page->index << PAGE_SHIFT) & ~(size - 1);
+	unmap_mapping_range(page->mapping, start, start + size, 0);
+
+	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),
+			pfn, flags);
+	rc = 0;
+out:
+	unlock_page(page);
+	put_dev_pagemap(pgmap);
+	return rc;
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1134,6 +1246,7 @@ int memory_failure(unsigned long pfn, int flags)
 	struct page *p;
 	struct page *hpage;
 	struct page *orig_head;
+	struct dev_pagemap *pgmap;
 	int res;
 	unsigned long page_flags;
 
@@ -1146,6 +1259,10 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	pgmap = get_dev_pagemap(pfn, NULL);
+	if (pgmap)
+		return memory_failure_dev_pagemap(pfn, flags, pgmap);
+
 	p = pfn_to_page(pfn);
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);


* [PATCH 11/11] libnvdimm, pmem: restore page attributes when clearing errors
  2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (9 preceding siblings ...)
  2018-05-22 14:40 ` [PATCH 10/11] mm, memory_failure: teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-05-22 14:40 ` Dan Williams
  10 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-22 14:40 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, tony.luck

Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
has also been marked UC in PAT (page attribute table) via
set_mce_nospec() to prevent speculative retrievals of poison.

The 'HWPoison' flag is only cleared when overwriting an entire page.
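
To make the whole-page rule concrete, here is a hedged userspace sketch of the pfn range computed by hwpoison_clear() in the diff (pfn_start = PHYS_PFN(phys), pfn_end = pfn_start + PHYS_PFN(len)). It assumes 4 KiB pages and a page-aligned physical start address; unaligned starts are not modelled. struct pfn_range and cleared_pfns() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed: 4 KiB base pages */

/* Stand-in for PHYS_PFN(): bytes to page frames, rounding down. */
static uint64_t phys_pfn(uint64_t phys)
{
	return phys >> SKETCH_PAGE_SHIFT;
}

struct pfn_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/* Range of pfns whose HWPoison flag would be tested and cleared. */
static struct pfn_range cleared_pfns(uint64_t phys, uint64_t len)
{
	struct pfn_range r;

	r.start = phys_pfn(phys);
	r.end = r.start + phys_pfn(len);
	return r;
}
```

Clearing a single 512-byte sector yields an empty range, so the flag is left alone. Clearing 4096 bytes covers one pfn, and clearing 8192 + 512 bytes covers two.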

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   26 ++++++++++++++++++++++++++
 drivers/nvdimm/pmem.h |   13 +++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..04ee1fdee219 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -20,6 +20,7 @@
 #include <linux/hdreg.h>
 #include <linux/init.h>
 #include <linux/platform_device.h>
+#include <linux/set_memory.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/badblocks.h>
@@ -51,6 +52,30 @@ static struct nd_region *to_region(struct pmem_device *pmem)
 	return to_nd_region(to_dev(pmem)->parent);
 }
 
+static void hwpoison_clear(struct pmem_device *pmem,
+		phys_addr_t phys, unsigned int len)
+{
+	unsigned long pfn_start, pfn_end, pfn;
+
+	/* only pmem in the linear map supports HWPoison */
+	if (is_vmalloc_addr(pmem->virt_addr))
+		return;
+
+	pfn_start = PHYS_PFN(phys);
+	pfn_end = pfn_start + PHYS_PFN(len);
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * Note, no need to hold a get_dev_pagemap() reference
+		 * here since we're in the driver I/O path and
+		 * outstanding I/O requests pin the dev_pagemap.
+		 */
+		if (test_and_clear_pmem_poison(page))
+			clear_mce_nospec(pfn);
+	}
+}
+
 static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 		phys_addr_t offset, unsigned int len)
 {
@@ -65,6 +90,7 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	if (cleared < len)
 		rc = BLK_STS_IOERR;
 	if (cleared > 0 && cleared / 512) {
+		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
 		cleared /= 512;
 		dev_dbg(dev, "%#llx clear %ld sector%s\n",
 				(unsigned long long) sector, cleared,
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index a64ebc78b5df..59cfe13ea8a8 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __NVDIMM_PMEM_H__
 #define __NVDIMM_PMEM_H__
+#include <linux/page-flags.h>
 #include <linux/badblocks.h>
 #include <linux/types.h>
 #include <linux/pfn_t.h>
@@ -27,4 +28,16 @@ struct pmem_device {
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn);
+
+#ifdef CONFIG_MEMORY_FAILURE
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return TestClearPageHWPoison(page);
+}
+#else
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return false;
+}
+#endif
 #endif /* __NVDIMM_PMEM_H__ */


* Re: [PATCH 07/11] mm, madvise_inject_error: fix page count leak
  2018-05-22 14:40 ` [PATCH 07/11] mm, madvise_inject_error: fix page count leak Dan Williams
@ 2018-05-23  4:19   ` Naoya Horiguchi
  2018-05-24 20:55     ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Naoya Horiguchi @ 2018-05-23  4:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, stable, Michal Hocko, Andi Kleen, Wu Fengguang,
	hch, linux-mm, linux-fsdevel, tony.luck

On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
> The madvise_inject_error() routine uses get_user_pages() to look up the
> pfn and other information for the injected error, but it fails to release
> that pin.
> 
> The dax-dma-vs-truncate warning catches this failure with the following
> signature:
> 
>  Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
>  Memory failure: 0x208900: reserved kernel page still referenced by 1 users
>  Memory failure: 0x208900: recovery action for reserved kernel page: Failed
>  WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
>  CPU: 37 PID: 9566 Comm: umount Tainted: G        W  OE     4.17.0-rc6+ #1900
>  [..]
>  RIP: 0010:dax_disassociate_entry+0x4e/0x90
>  RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
>  RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
>  RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
>  RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
>  R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
>  R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
>  FS:  00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
>  Call Trace:
>   __dax_invalidate_mapping_entry+0xab/0xe0
>   dax_delete_mapping_entry+0xf/0x20
>   truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
>   truncate_inode_pages_range+0x291/0x920
>   ? kmem_cache_free+0x1f8/0x300
>   ? lock_acquire+0x9f/0x200
>   ? truncate_inode_pages_final+0x31/0x50
>   ext4_evict_inode+0x69/0x740
> 
> Cc: <stable@vger.kernel.org>
> Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  mm/madvise.c |   11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4d3c922ea1a1..246fa4d4eee2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>  
>  
>  	for (; start < end; start += PAGE_SIZE << order) {
> +		unsigned long pfn;
>  		int ret;
>  
>  		ret = get_user_pages_fast(start, 1, 0, &page);
>  		if (ret != 1)
>  			return ret;
> +		pfn = page_to_pfn(page);
>  
>  		/*
>  		 * When soft offlining hugepages, after migrating the page
> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>  
>  		if (behavior == MADV_SOFT_OFFLINE) {
>  			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> -						page_to_pfn(page), start);
> +					pfn, start);
>  
>  			ret = soft_offline_page(page, MF_COUNT_INCREASED);
> +			put_page(page);
>  			if (ret)
>  				return ret;
>  			continue;
>  		}
> +		put_page(page);

We keep the page count pinned after the isolation of the error page
in order to make sure that the error page is disabled and never reused.
This does not seem explicit enough, so a comment would be helpful.

BTW, looking at the kernel message like "Memory failure: 0x208900:
reserved kernel page still referenced by 1 users", memory_failure()
considers dev_pagemap pages as "reserved kernel pages" (MF_MSG_KERNEL).
If the memory error handler recovers a dev_pagemap page in its special way,
we can define a new action_page_types entry like MF_MSG_DAX.
Reporting like "Memory failure: 0xXXXXX: recovery action for dax page:
Failed" might be helpful from the end user's perspective.

Thanks,
Naoya Horiguchi

> +
>  		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
> -						page_to_pfn(page), start);
> +				pfn, start);
>  
> -		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
> +		ret = memory_failure(pfn, MF_COUNT_INCREASED);
>  		if (ret)
>  			return ret;
>  	}
> 


* Re: [PATCH 09/11] mm, memory_failure: pass page size to kill_proc()
  2018-05-22 14:40 ` [PATCH 09/11] mm, memory_failure: pass page size to kill_proc() Dan Williams
@ 2018-05-23  6:41   ` Naoya Horiguchi
  0 siblings, 0 replies; 26+ messages in thread
From: Naoya Horiguchi @ 2018-05-23  6:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, hch, linux-mm, linux-fsdevel, tony.luck

On Tue, May 22, 2018 at 07:40:19AM -0700, Dan Williams wrote:
> Given that ZONE_DEVICE / dev_pagemap pages are never assembled into
> compound pages, the size determination logic in kill_proc() needs
> updating for the dev_pagemap case. In preparation for dev_pagemap
> support rework memory_failure() and kill_proc() to pass / consume the page
> size explicitly.
> 
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>


* Re: [PATCH 10/11] mm, memory_failure: teach memory_failure() about dev_pagemap pages
  2018-05-22 14:40 ` [PATCH 10/11] mm, memory_failure: teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-05-23  6:48   ` Naoya Horiguchi
  0 siblings, 0 replies; 26+ messages in thread
From: Naoya Horiguchi @ 2018-05-23  6:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig,
	Jérôme Glisse, Matthew Wilcox, Ross Zwisler, linux-mm,
	linux-fsdevel, tony.luck

On Tue, May 22, 2018 at 07:40:24AM -0700, Dan Williams wrote:
>     mce: Uncorrected hardware memory error in user-access at af34214200
>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>     mce: [Hardware Error]: Machine check events logged
>     {1}[Hardware Error]: event severity: corrected
>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>     [..]
>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>     mce: Memory error not recovered
> 
> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
> dax there is no possibility to map in another page dynamically since dax
> establishes 1:1 physical address to file offset associations. Also
> dev_pagemap pages associated with NVDIMM / persistent memory devices can
> internally remap/repair addresses with poison. While memory_failure()
> assumes that it can discard typical poisoned pages and keep them
> unmapped indefinitely, dev_pagemap pages may be returned to service
> after the error is cleared.
> 
> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
> dev_pagemap pages that have poison consumed by userspace. Mark the
> memory as UC instead of unmapping it completely to allow ongoing access
> via the device driver (nd_pmem). Later, nd_pmem will grow support for
> marking the page back to WB when the error is cleared.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  mm/memory-failure.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 117 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 42a193ee14d3..f95036f99a79 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -55,6 +55,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memory_hotplug.h>
>  #include <linux/mm_inline.h>
> +#include <linux/memremap.h>
>  #include <linux/kfifo.h>
>  #include <linux/ratelimit.h>
>  #include "internal.h"
> @@ -1112,6 +1113,117 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  	return res;
>  }
>  
> +static unsigned long dax_mapping_size(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +	pgoff_t pgoff = page_to_pgoff(page);
> +	struct vm_area_struct *vma;
> +	unsigned long size = 0;
> +
> +	i_mmap_lock_read(mapping);
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +		unsigned long address = vma_address(page, vma);
> +		pgd_t *pgd;
> +		p4d_t *p4d;
> +		pud_t *pud;
> +		pmd_t *pmd;
> +		pte_t *pte;
> +
> +		pgd = pgd_offset(vma->vm_mm, address);
> +		if (!pgd_present(*pgd))
> +			continue;
> +		p4d = p4d_offset(pgd, address);
> +		if (!p4d_present(*p4d))
> +			continue;
> +		pud = pud_offset(p4d, address);
> +		if (!pud_present(*pud))
> +			continue;
> +		if (pud_devmap(*pud)) {
> +			size = PUD_SIZE;
> +			break;
> +		}
> +		pmd = pmd_offset(pud, address);
> +		if (!pmd_present(*pmd))
> +			continue;
> +		if (pmd_devmap(*pmd)) {
> +			size = PMD_SIZE;
> +			break;
> +		}
> +		pte = pte_offset_map(pmd, address);
> +		if (!pte_present(*pte))
> +			continue;
> +		if (pte_devmap(*pte)) {
> +			size = PAGE_SIZE;
> +			break;
> +		}
> +	}
> +	i_mmap_unlock_read(mapping);
> +	return size;
> +}
> +
> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> +		struct dev_pagemap *pgmap)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	const bool unmap_success = true;
> +	unsigned long size;
> +	LIST_HEAD(tokill);
> +	int rc = -EBUSY;
> +	loff_t start;
> +
> +	lock_page(page);
> +	if (hwpoison_filter(page)) {
> +		rc = 0;
> +		goto out;
> +	}
> +
> +	switch (pgmap->type) {
> +	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_PUBLIC:
> +		/*
> +		 * TODO: Handle HMM pages which may need coordination
> +		 * with device-side memory.
> +		 */
> +		goto out;
> +	default:
> +		if (!page->mapping)
> +			goto out;
> +		break;
> +	}
> +
> +	/*
> +	 * If the page is not mapped in userspace then report it as
> +	 * unhandled.
> +	 */
> +	size = dax_mapping_size(page);
> +	if (!size) {
> +		pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
> +		goto out;
> +	}
> +
> +	SetPageHWPoison(page);
> +
> +	/*
> +	 * Unlike System-RAM there is no possibility to swap in a
> +	 * different physical page at a given virtual address, so all
> +	 * userspace consumption of ZONE_DEVICE memory necessitates
> +	 * SIGBUS (i.e. MF_MUST_KILL)
> +	 */
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
> +
> +	start = (page->index << PAGE_SHIFT) & ~(size - 1);
> +	unmap_mapping_range(page->mapping, start, start + size, 0);
> +
> +	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),
> +			pfn, flags);
> +	rc = 0;
> +out:
> +	unlock_page(page);

I wrote as below in reply to 7/11

> > @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
> >  
> >               if (behavior == MADV_SOFT_OFFLINE) {
> >                       pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> > -                                             page_to_pfn(page), start);
> > +                                     pfn, start);
> >  
> >                       ret = soft_offline_page(page, MF_COUNT_INCREASED);
> > +                     put_page(page);
> >                       if (ret)
> >                               return ret;
> >                       continue;
> >               }
> > +             put_page(page);
> 
> We keep the page count pinned after the isolation of the error page
> in order to make sure that the error page is disabled and never reused.
> This seems not explicit enough, so some comment should be helpful.

... but that was poorly worded, sorry. More precisely, a refcount incremented
before calling memory_failure() is kept only when the error page is in use
as a normal lru page when an error happens on it and the error is successfully
handled. The reason for this behavior (along with avoiding the risk of
unexpected reuse) is to make sure that unpoison (the cancelling mechanism of
hwpoison) can trigger the page freeing code (__put_page() for normal pages).
But I think that this tricky behavior exists for historical reasons and
we can go without it, so I don't think you have to inherit it for new code.

(Although I'm not familiar with dax,) if a dev_pagemap page has a different
lifecycle from that of normal pages and it has its own mechanism for cancelling
a memory error, then you can simply release the page refcount at the end of
memory_failure_dev_pagemap().


I have another comment about page refcount. memory_failure() is sometimes
called with the page pinned and sometimes without, which is indicated
by the MF_COUNT_INCREASED flag. When MF_COUNT_INCREASED is not set,
memory_failure() tries to take the reference by itself.
So you might need some adjustment for the non-MF_COUNT_INCREASED case.
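
To make the two calling conventions concrete, here is a minimal userspace
sketch, not kernel code. The names sketch_page, sketch_memory_failure() and
SKETCH_MF_COUNT_INCREASED are made up for illustration, and it assumes the
handler always consumes exactly one reference, whether the caller took it or
the handler took it itself:

```c
#include <assert.h>

/* Hypothetical flag: the caller already holds a reference on the page. */
#define SKETCH_MF_COUNT_INCREASED 0x1

/* Hypothetical stand-in for struct page; only the refcount is modelled. */
struct sketch_page {
	int refcount;
};

/*
 * Handle an error on @page. If the caller did not pin the page, take a
 * reference here. In both cases exactly one reference is dropped at the
 * end, which is the "release it at the end" option for pages that have
 * their own way of cancelling a memory error.
 */
static int sketch_memory_failure(struct sketch_page *page, int flags)
{
	if (!(flags & SKETCH_MF_COUNT_INCREASED))
		page->refcount++;	/* take our own pin */

	assert(page->refcount > 0);
	/* ... error handling would run here with the page pinned ... */

	page->refcount--;		/* drop the pin, whoever took it */
	return 0;
}
```

With this shape, a caller that pinned the page and passed the flag and a
caller that did neither both leave the page at its baseline count afterwards.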

Thanks,
Naoya Horiguchi

> +	put_dev_pagemap(pgmap);
> +	return rc;
> +}
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1134,6 +1246,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	struct page *p;
>  	struct page *hpage;
>  	struct page *orig_head;
> +	struct dev_pagemap *pgmap;
>  	int res;
>  	unsigned long page_flags;
>  
> @@ -1146,6 +1259,10 @@ int memory_failure(unsigned long pfn, int flags)
>  		return -ENXIO;
>  	}
>  
> +	pgmap = get_dev_pagemap(pfn, NULL);
> +	if (pgmap)
> +		return memory_failure_dev_pagemap(pfn, flags, pgmap);
> +
>  	p = pfn_to_page(pfn);
>  	if (PageHuge(p))
>  		return memory_failure_hugetlb(pfn, flags);
> 
> 
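
As an aside on the offset arithmetic in the quoted
memory_failure_dev_pagemap(): the sketch below is a userspace illustration,
assuming 4 KiB base pages, of how a page index is turned into a byte offset
aligned to the mapping size, and how that size becomes the shift handed to
kill_procs(). SKETCH_PAGE_SHIFT, sketch_mapping_start() and
sketch_size_shift() are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/*
 * Byte offset of the start of the naturally aligned mapping that
 * contains page @index. @size must be a power of two.
 */
static uint64_t sketch_mapping_start(uint64_t index, uint64_t size)
{
	assert(size && (size & (size - 1)) == 0);
	return (index << SKETCH_PAGE_SHIFT) & ~(size - 1);
}

/* log2 of a power-of-two @size; a stand-in for the kernel's ilog2(). */
static unsigned int sketch_size_shift(uint64_t size)
{
	unsigned int shift = 0;

	assert(size && (size & (size - 1)) == 0);
	while (size >>= 1)
		shift++;
	return shift;
}
```

For page index 0x2ff inside a 2 MiB mapping this gives a start offset of
0x200000 and a shift of 21. One thing worth double-checking when reusing the
arithmetic: as far as I know, unmap_mapping_range() takes a hole length rather
than an end offset as its third argument.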

* Re: [PATCH 05/11] filesystem-dax: set page->index
  2018-05-22 14:39 ` [PATCH 05/11] filesystem-dax: " Dan Williams
@ 2018-05-23  8:40   ` Jan Kara
  2018-05-30  1:38     ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2018-05-23  8:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Ross Zwisler, linux-mm, linux-fsdevel, tony.luck, linux-xfs

On Tue 22-05-18 07:39:57, Dan Williams wrote:
> In support of enabling memory_failure() handling for filesystem-dax
> mappings, set ->index to the pgoff of the page. The rmap implementation
> requires ->index to bound the search through the vma interval tree. The
> index is set and cleared at dax_associate_entry() and
> dax_disassociate_entry() time respectively.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/dax.c |   11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index aaec72ded1b6..2e4682cd7c69 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
>  	for (pfn = dax_radix_pfn(entry); \
>  			pfn < dax_radix_end_pfn(entry); pfn++)
>  
> -static void dax_associate_entry(void *entry, struct address_space *mapping)
> +static void dax_associate_entry(void *entry, struct address_space *mapping,
> +		struct vm_area_struct *vma, unsigned long address)
>  {
> -	unsigned long pfn;
> +	unsigned long size = dax_entry_size(entry), pfn, index;
> +	int i = 0;
>  
>  	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>  		return;
>  
> +	index = linear_page_index(vma, address & ~(size - 1));
>  	for_each_mapped_pfn(entry, pfn) {
>  		struct page *page = pfn_to_page(pfn);
>  
>  		WARN_ON_ONCE(page->mapping);
>  		page->mapping = mapping;
> +		page->index = index + i++;
>  	}
>  }

Hum, this just made me think: How is this going to work with XFS reflink?
In fact is not the page->mapping association already broken by XFS reflink?
Because with reflink we can have two or more mappings pointing to the same
physical blocks (i.e., pages in DAX case)...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 03/11] device-dax: enable page_mapping()
  2018-05-22 14:39 ` [PATCH 03/11] device-dax: enable page_mapping() Dan Williams
@ 2018-05-23  9:03   ` Jan Kara
  2018-05-30 19:54   ` kbuild test robot
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2018-05-23  9:03 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, hch, linux-mm, linux-fsdevel, tony.luck

On Tue 22-05-18 07:39:47, Dan Williams wrote:
> In support of enabling memory_failure() handling for device-dax
> mappings, set the ->mapping association of pages backing device-dax
> mappings. The rmap implementation requires page_mapping() to return the
> address_space hosting the vmas that map the page.
> 
> The ->mapping pointer is never cleared. There is no possibility for the
> page to become associated with another address_space while the device is
> enabled. When the device is disabled the 'struct page' array for the
> device is destroyed / later reinitialized to zero.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
...
> @@ -402,17 +401,33 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>  	id = dax_read_lock();
>  	switch (pe_size) {
>  	case PE_SIZE_PTE:
> -		rc = __dev_dax_pte_fault(dev_dax, vmf);
> +		fault_size = PAGE_SIZE;
> +		rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
>  		break;
>  	case PE_SIZE_PMD:
> -		rc = __dev_dax_pmd_fault(dev_dax, vmf);
> +		fault_size = PMD_SIZE;
> +		rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
>  		break;
>  	case PE_SIZE_PUD:
> -		rc = __dev_dax_pud_fault(dev_dax, vmf);
> +		fault_size = PUD_SIZE;
> +		rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
>  		break;
>  	default:
>  		rc = VM_FAULT_SIGBUS;
>  	}
> +
> +	if (rc == VM_FAULT_NOPAGE) {
> +		unsigned long i;
> +
> +		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> +			struct page *page;
> +
> +			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> +			if (page->mapping)
> +				continue;
> +			page->mapping = filp->f_mapping;
> +		}
> +	}
>  	dax_read_unlock(id);

Careful here. Page fault can return VM_FAULT_NOPAGE even if we raced with
somebody modifying the PTE or if we installed a zero page. With shared DAX
mappings (and device dax does not allow anything else, if I'm right) this
does not matter, as a given file offset is guaranteed to always map to the
same page, but I think it deserves a comment.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 06/11] filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock
  2018-05-22 14:40 ` [PATCH 06/11] filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock Dan Williams
@ 2018-05-23  9:35   ` Jan Kara
  2018-05-23 13:50     ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2018-05-23  9:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Ross Zwisler, linux-mm, linux-fsdevel, tony.luck

On Tue 22-05-18 07:40:03, Dan Williams wrote:
> Hold the page lock while invalidating mapping entries to prevent races
> between rmap using the address_space and the filesystem freeing the
> address_space.
> 
> This is more complicated than the simple description implies because
> dev_pagemap pages that fsdax uses do not have any concept of page size.
> Size information is stored in the radix and can only be safely read
> while holding the xa_lock. Since lock_page() can not be taken while
> holding xa_lock, drop xa_lock and speculatively lock all the associated
> pages. Once all the pages are locked re-take the xa_lock and revalidate
> that the radix entry did not change.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

IMO this is too ugly to live. The combination of entry locks in the radix
tree and page locks is just too big a mess. And from a quick look I don't see
a reason why we could not use entry locks to protect rmap code as well -
when you have PFN for which you need to walk rmap, you can grab
rcu_read_lock(), then you can safely look at page->mapping, grab xa_lock,
verify the radix tree points where it should and grab entry lock. I agree
it's a bit complicated but for memory failure I think it is fine.
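
A rough userspace model of that sequence, for illustration only. The sketch_*
types and functions are invented here, the RCU and xa_lock steps are only
marked by comments, and where the kernel would sleep waiting for a locked
entry this sketch simply reports failure:

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_SLOTS 8

struct sketch_page;

/* Hypothetical stand-in for an address_space with its radix tree. */
struct sketch_mapping {
	struct sketch_page *slots[SKETCH_NR_SLOTS];	/* tree entries */
	bool entry_locked[SKETCH_NR_SLOTS];		/* entry lock bits */
};

/* Hypothetical stand-in for struct page. */
struct sketch_page {
	struct sketch_mapping *mapping;
	size_t index;
};

/*
 * Look up the mapping from the page, revalidate that the tree still
 * points back at this page, then take the entry lock. Returns true
 * with the entry locked, false if the association is stale or busy.
 */
static bool sketch_lock_entry_for_page(struct sketch_page *page)
{
	struct sketch_mapping *mapping;
	size_t index;

	/* kernel: rcu_read_lock() so the mapping cannot be freed under us */
	mapping = page->mapping;
	index = page->index;
	if (!mapping || index >= SKETCH_NR_SLOTS)
		return false;

	/* kernel: take xa_lock, then verify the tree points where it should */
	if (mapping->slots[index] != page || page->mapping != mapping)
		return false;

	/* kernel: wait for the entry lock; the sketch fails if it is held */
	if (mapping->entry_locked[index])
		return false;
	mapping->entry_locked[index] = true;
	return true;
}

/* Release an entry lock taken by sketch_lock_entry_for_page(). */
static void sketch_unlock_entry(struct sketch_page *page)
{
	page->mapping->entry_locked[page->index] = false;
}
```

The point of the pattern is that nothing read before the lock is trusted
until it has been checked again with the lock held.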

Or we could talk about switching everything to page locks instead of entry
locks but that isn't trivial either as we need something to serialize page
faults on even before we go into the filesystem and allocate blocks for the
fault...
								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 06/11] filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock
  2018-05-23  9:35   ` Jan Kara
@ 2018-05-23 13:50     ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-23 13:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	Linux MM, linux-fsdevel, Luck, Tony

On Wed, May 23, 2018 at 2:35 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-05-18 07:40:03, Dan Williams wrote:
>> Hold the page lock while invalidating mapping entries to prevent races
>> between rmap using the address_space and the filesystem freeing the
>> address_space.
>>
>> This is more complicated than the simple description implies because
>> dev_pagemap pages that fsdax uses do not have any concept of page size.
>> Size information is stored in the radix and can only be safely read
>> while holding the xa_lock. Since lock_page() can not be taken while
>> holding xa_lock, drop xa_lock and speculatively lock all the associated
>> pages. Once all the pages are locked re-take the xa_lock and revalidate
>> that the radix entry did not change.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> IMO this is too ugly to live.

The same thought crossed my mind...

> The combination of entry locks in the radix
> tree and page locks is just too big a mess. And from a quick look I don't see
> a reason why we could not use entry locks to protect rmap code as well -
> when you have PFN for which you need to walk rmap, you can grab
> rcu_read_lock(), then you can safely look at page->mapping, grab xa_lock,
> verify the radix tree points where it should and grab entry lock. I agree
> it's a bit complicated but for memory failure I think it is fine.

Ah, I missed this cleverness with rcu relative to keeping the
page->mapping valid. I'll take a look.

> Or we could talk about switching everything to page locks instead of entry
> locks but that isn't trivial either as we need something to serialize page
> faults on even before we go into the filesystem and allocate blocks for the
> fault...

I'd rather use entry locks everywhere and not depend on the page lock
for rmap if at all possible. Ideally lock_page is only used for
typical pages and not these dev_pagemap related structures.

* Re: [PATCH 07/11] mm, madvise_inject_error: fix page count leak
  2018-05-23  4:19   ` Naoya Horiguchi
@ 2018-05-24 20:55     ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-24 20:55 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-nvdimm, stable, Michal Hocko, Andi Kleen, Wu Fengguang,
	hch, linux-mm, linux-fsdevel, tony.luck

On Tue, May 22, 2018 at 9:19 PM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
>> The madvise_inject_error() routine uses get_user_pages() to lookup the
>> pfn and other information for injected error, but it fails to release
>> that pin.
>>
>> The dax-dma-vs-truncate warning catches this failure with the following
>> signature:
>>
>>  Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
>>  Memory failure: 0x208900: reserved kernel page still referenced by 1 users
>>  Memory failure: 0x208900: recovery action for reserved kernel page: Failed
>>  WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
>>  CPU: 37 PID: 9566 Comm: umount Tainted: G        W  OE     4.17.0-rc6+ #1900
>>  [..]
>>  RIP: 0010:dax_disassociate_entry+0x4e/0x90
>>  RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
>>  RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
>>  RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
>>  RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
>>  R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
>>  R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
>>  FS:  00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
>>  Call Trace:
>>   __dax_invalidate_mapping_entry+0xab/0xe0
>>   dax_delete_mapping_entry+0xf/0x20
>>   truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
>>   truncate_inode_pages_range+0x291/0x920
>>   ? kmem_cache_free+0x1f8/0x300
>>   ? lock_acquire+0x9f/0x200
>>   ? truncate_inode_pages_final+0x31/0x50
>>   ext4_evict_inode+0x69/0x740
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  mm/madvise.c |   11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 4d3c922ea1a1..246fa4d4eee2 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>>
>>
>>       for (; start < end; start += PAGE_SIZE << order) {
>> +             unsigned long pfn;
>>               int ret;
>>
>>               ret = get_user_pages_fast(start, 1, 0, &page);
>>               if (ret != 1)
>>                       return ret;
>> +             pfn = page_to_pfn(page);
>>
>>               /*
>>                * When soft offlining hugepages, after migrating the page
>> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>>
>>               if (behavior == MADV_SOFT_OFFLINE) {
>>                       pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>> -                                             page_to_pfn(page), start);
>> +                                     pfn, start);
>>
>>                       ret = soft_offline_page(page, MF_COUNT_INCREASED);
>> +                     put_page(page);
>>                       if (ret)
>>                               return ret;
>>                       continue;
>>               }
>> +             put_page(page);
>
> We keep the page count pinned after the isolation of the error page
> in order to make sure that the error page is disabled and never reused.
> This seems not explicit enough, so some comment should be helpful.

As far as I can see, this extra reference count to keep the page from
being reused should be taken internally by memory_failure(), not assumed from
the inject error path. I might be overlooking something, but I do not
see who is responsible for taking this extra reference in the case
where memory_failure() is called by the machine check code rather than
madvise_inject_error()?

>
> BTW, looking at the kernel message like "Memory failure: 0x208900:
> reserved kernel page still referenced by 1 users", memory_failure()
> considers dev_pagemap pages as "reserved kernel pages" (MF_MSG_KERNEL).
> If the memory error handler recovers a dev_pagemap page in its special way,
> we can define a new action_page_types entry like MF_MSG_DAX.
> Reporting like "Memory failure: 0xXXXXX: recovery action for dax page:
> Failed" might be helpful from the end user's perspective.

Sounds good, I'll take a look at this.
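
A small userspace sketch of what such a report could look like. The SKETCH_*
enum values and name tables are stand-ins and do not claim to match the
kernel's actual action_page_types layout; only the message format is taken
from the log lines quoted above:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical page types; SKETCH_MSG_DAX models the proposed entry. */
enum sketch_page_type {
	SKETCH_MSG_KERNEL,
	SKETCH_MSG_DAX,
};

/* Hypothetical recovery results. */
enum sketch_result {
	SKETCH_FAILED,
	SKETCH_RECOVERED,
};

static const char *const sketch_type_name[] = {
	[SKETCH_MSG_KERNEL]	= "reserved kernel page",
	[SKETCH_MSG_DAX]	= "dax page",
};

static const char *const sketch_result_name[] = {
	[SKETCH_FAILED]		= "Failed",
	[SKETCH_RECOVERED]	= "Recovered",
};

/* Format a report line into @buf; returns what snprintf() returns. */
static int sketch_format_report(char *buf, size_t len, unsigned long pfn,
		enum sketch_page_type type, enum sketch_result result)
{
	return snprintf(buf, len,
			"Memory failure: %#lx: recovery action for %s: %s",
			pfn, sketch_type_name[type],
			sketch_result_name[result]);
}
```

Keeping the type in a separate table means a new page class only needs a new
enum value and one string, and the report path itself stays unchanged.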

* Re: [PATCH 05/11] filesystem-dax: set page->index
  2018-05-23  8:40   ` Jan Kara
@ 2018-05-30  1:38     ` Dan Williams
  2018-05-30  8:13       ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-05-30  1:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	Linux MM, linux-fsdevel, Luck, Tony, linux-xfs, Darrick J. Wong

On Wed, May 23, 2018 at 1:40 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-05-18 07:39:57, Dan Williams wrote:
>> In support of enabling memory_failure() handling for filesystem-dax
>> mappings, set ->index to the pgoff of the page. The rmap implementation
>> requires ->index to bound the search through the vma interval tree. The
>> index is set and cleared at dax_associate_entry() and
>> dax_disassociate_entry() time respectively.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/dax.c |   11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index aaec72ded1b6..2e4682cd7c69 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
>>       for (pfn = dax_radix_pfn(entry); \
>>                       pfn < dax_radix_end_pfn(entry); pfn++)
>>
>> -static void dax_associate_entry(void *entry, struct address_space *mapping)
>> +static void dax_associate_entry(void *entry, struct address_space *mapping,
>> +             struct vm_area_struct *vma, unsigned long address)
>>  {
>> -     unsigned long pfn;
>> +     unsigned long size = dax_entry_size(entry), pfn, index;
>> +     int i = 0;
>>
>>       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>>               return;
>>
>> +     index = linear_page_index(vma, address & ~(size - 1));
>>       for_each_mapped_pfn(entry, pfn) {
>>               struct page *page = pfn_to_page(pfn);
>>
>>               WARN_ON_ONCE(page->mapping);
>>               page->mapping = mapping;
>> +             page->index = index + i++;
>>       }
>>  }
>
> Hum, this just made me think: How is this going to work with XFS reflink?
> In fact is not the page->mapping association already broken by XFS reflink?
> Because with reflink we can have two or more mappings pointing to the same
> physical blocks (i.e., pages in DAX case)...

Good question. I assume we are ok in the non-DAX reflink case because
rmap of failing / poison pages is only relative to the specific page
cache page for a given inode in the reflink. However, DAX would seem
to break this because we only get one shared 'struct page' for all
possible mappings of the physical file block. I think this means that
iterating over the rmap of "where is this page mapped" would require
iterating over the other "sibling" inodes that know about the given
physical file block.
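
A toy model of the problem, in userspace C with invented sketch_* names: each
page carries a single owner pointer, so a second file that shares the block
has nowhere to record itself. Unlike the WARN_ON_ONCE(page->mapping) in
dax_associate_entry(), which fires on any non-NULL ->mapping, this sketch
tolerates re-association by the same owner at the same index:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode's address_space. */
struct sketch_inode {
	int ino;
};

/* Hypothetical stand-in for struct page: one owner, one offset. */
struct sketch_page {
	struct sketch_inode *mapping;
	unsigned long index;
};

/*
 * Record that @page backs offset @index of @inode. Returns false when
 * the page is already claimed by a different owner or offset, which is
 * what a reflinked (shared) block would run into.
 */
static bool sketch_associate(struct sketch_page *page,
		struct sketch_inode *inode, unsigned long index)
{
	if (page->mapping &&
	    (page->mapping != inode || page->index != index))
		return false;

	page->mapping = inode;
	page->index = index;
	return true;
}
```

A reverse lookup that starts from the page can only ever find the first owner,
so any process mapping the block through the second file would be missed.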

As far as I can see reflink+dax would require teaching kernel code
paths that ->mapping may not be a singular relationship. Something
along the lines of what Jerome was presenting at LSF to create a
special value to indicate, "call back into the filesystem (or the page
owner)" to perform this operation.

In the meantime the kernel crashes when userspace accesses poisoned
pmem via DAX. I assume that reworking rmap for the dax+reflink case
should not block dax poison handling? Yell if you disagree.

* Re: [PATCH 05/11] filesystem-dax: set page->index
  2018-05-30  1:38     ` Dan Williams
@ 2018-05-30  8:13       ` Jan Kara
  2018-05-30 23:21         ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2018-05-30  8:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Christoph Hellwig, Matthew Wilcox,
	Ross Zwisler, Linux MM, linux-fsdevel, Luck, Tony, linux-xfs,
	Darrick J. Wong

On Tue 29-05-18 18:38:41, Dan Williams wrote:
> On Wed, May 23, 2018 at 1:40 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-05-18 07:39:57, Dan Williams wrote:
> >> In support of enabling memory_failure() handling for filesystem-dax
> >> mappings, set ->index to the pgoff of the page. The rmap implementation
> >> requires ->index to bound the search through the vma interval tree. The
> >> index is set and cleared at dax_associate_entry() and
> >> dax_disassociate_entry() time respectively.
> >>
> >> Cc: Jan Kara <jack@suse.cz>
> >> Cc: Christoph Hellwig <hch@lst.de>
> >> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> ---
> >>  fs/dax.c |   11 ++++++++---
> >>  1 file changed, 8 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/dax.c b/fs/dax.c
> >> index aaec72ded1b6..2e4682cd7c69 100644
> >> --- a/fs/dax.c
> >> +++ b/fs/dax.c
> >> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
> >>       for (pfn = dax_radix_pfn(entry); \
> >>                       pfn < dax_radix_end_pfn(entry); pfn++)
> >>
> >> -static void dax_associate_entry(void *entry, struct address_space *mapping)
> >> +static void dax_associate_entry(void *entry, struct address_space *mapping,
> >> +             struct vm_area_struct *vma, unsigned long address)
> >>  {
> >> -     unsigned long pfn;
> >> +     unsigned long size = dax_entry_size(entry), pfn, index;
> >> +     int i = 0;
> >>
> >>       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> >>               return;
> >>
> >> +     index = linear_page_index(vma, address & ~(size - 1));
> >>       for_each_mapped_pfn(entry, pfn) {
> >>               struct page *page = pfn_to_page(pfn);
> >>
> >>               WARN_ON_ONCE(page->mapping);
> >>               page->mapping = mapping;
> >> +             page->index = index + i++;
> >>       }
> >>  }
> >
> > Hum, this just made me think: How is this going to work with XFS reflink?
> > In fact is not the page->mapping association already broken by XFS reflink?
> > Because with reflink we can have two or more mappings pointing to the same
> > physical blocks (i.e., pages in DAX case)...
> 
> Good question. I assume we are ok in the non-DAX reflink case because
> rmap of failing / poison pages is only relative to the specific page
> cache page for a given inode in the reflink. However, DAX would seem
> to break this because we only get one shared 'struct page' for all
> possible mappings of the physical file block. I think this means that
> iterating over the rmap of "where is this page mapped" would require
> iterating over the other "sibling" inodes that know about the given
> physical file block.
> 
> As far as I can see reflink+dax would require teaching kernel code
> paths that ->mapping may not be a singular relationship. Something
> along the lines of what Jerome was presenting at LSF to create a
> special value to indicate, "call back into the filesystem (or the page
> owner)" to perform this operation.
> 
> In the meantime the kernel crashes when userspace accesses poisoned
> pmem via DAX. I assume that reworking rmap for the dax+reflink case
> should not block dax poison handling? Yell if you disagree.

The thing is, up until the get_user_pages() vs truncate series ("fs, dax: use
page->mapping to warn if truncate collides with a busy page" in
particular), DAX was perfectly fine with reflinks since we never needed
page->mapping. Now this series even adds a page->index dependency, which makes
life for rmap with reflinks even harder. So if nothing else we should at
least make sure reflinked filesystems cannot be mounted with the dax mount
option for now and seriously start looking into how to implement rmap with
reflinked files for DAX because this noticeably reduces its usefulness.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 03/11] device-dax: enable page_mapping()
  2018-05-22 14:39 ` [PATCH 03/11] device-dax: enable page_mapping() Dan Williams
  2018-05-23  9:03   ` Jan Kara
@ 2018-05-30 19:54   ` kbuild test robot
  1 sibling, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2018-05-30 19:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: kbuild-all, linux-nvdimm, hch, linux-mm, linux-fsdevel, tony.luck

[-- Attachment #1: Type: text/plain, Size: 2829 bytes --]

Hi Dan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc7 next-20180530]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/mm-Teach-memory_failure-about-ZONE_DEVICE-pages/20180525-035652
config: arm64-allmodconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   drivers//dax/device.c: In function 'dev_dax_huge_fault':
>> drivers//dax/device.c:413:8: error: too many arguments to function '__dev_dax_pud_fault'
      rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
           ^~~~~~~~~~~~~~~~~~~
   drivers//dax/device.c:380:19: note: declared here
    static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
                      ^~~~~~~~~~~~~~~~~~~

vim +/__dev_dax_pud_fault +413 drivers//dax/device.c

   386	
   387	static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
   388			enum page_entry_size pe_size)
   389	{
   390		struct vm_area_struct *vma = vmf->vma;
   391		struct file *filp = vma->vm_file;
   392		unsigned long fault_size;
   393		int rc, id;
   394		pfn_t pfn;
   395		struct dev_dax *dev_dax = filp->private_data;
   396	
   397		dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
   398				(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
   399				vma->vm_start, vma->vm_end, pe_size);
   400	
   401		id = dax_read_lock();
   402		switch (pe_size) {
   403		case PE_SIZE_PTE:
   404			fault_size = PAGE_SIZE;
   405			rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
   406			break;
   407		case PE_SIZE_PMD:
   408			fault_size = PMD_SIZE;
   409			rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
   410			break;
   411		case PE_SIZE_PUD:
   412			fault_size = PUD_SIZE;
 > 413			rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
   414			break;
   415		default:
   416			rc = VM_FAULT_SIGBUS;
   417		}
   418	
   419		if (rc == VM_FAULT_NOPAGE) {
   420			unsigned long i;
   421	
   422			for (i = 0; i < fault_size / PAGE_SIZE; i++) {
   423				struct page *page;
   424	
   425				page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
   426				if (page->mapping)
   427					continue;
   428				page->mapping = filp->f_mapping;
   429			}
   430		}
   431		dax_read_unlock(id);
   432	
   433		return rc;
   434	}
   435	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 59026 bytes --]

* Re: [PATCH 05/11] filesystem-dax: set page->index
  2018-05-30  8:13       ` Jan Kara
@ 2018-05-30 23:21         ` Dan Williams
  2018-05-31 10:08           ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-05-30 23:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	Linux MM, linux-fsdevel, Luck, Tony, linux-xfs, Darrick J. Wong

On Wed, May 30, 2018 at 1:13 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 29-05-18 18:38:41, Dan Williams wrote:
>> On Wed, May 23, 2018 at 1:40 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 22-05-18 07:39:57, Dan Williams wrote:
>> >> In support of enabling memory_failure() handling for filesystem-dax
>> >> mappings, set ->index to the pgoff of the page. The rmap implementation
>> >> requires ->index to bound the search through the vma interval tree. The
>> >> index is set and cleared at dax_associate_entry() and
>> >> dax_disassociate_entry() time respectively.
>> >>
>> >> Cc: Jan Kara <jack@suse.cz>
>> >> Cc: Christoph Hellwig <hch@lst.de>
>> >> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> >> ---
>> >>  fs/dax.c |   11 ++++++++---
>> >>  1 file changed, 8 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/fs/dax.c b/fs/dax.c
>> >> index aaec72ded1b6..2e4682cd7c69 100644
>> >> --- a/fs/dax.c
>> >> +++ b/fs/dax.c
>> >> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
>> >>       for (pfn = dax_radix_pfn(entry); \
>> >>                       pfn < dax_radix_end_pfn(entry); pfn++)
>> >>
>> >> -static void dax_associate_entry(void *entry, struct address_space *mapping)
>> >> +static void dax_associate_entry(void *entry, struct address_space *mapping,
>> >> +             struct vm_area_struct *vma, unsigned long address)
>> >>  {
>> >> -     unsigned long pfn;
>> >> +     unsigned long size = dax_entry_size(entry), pfn, index;
>> >> +     int i = 0;
>> >>
>> >>       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> >>               return;
>> >>
>> >> +     index = linear_page_index(vma, address & ~(size - 1));
>> >>       for_each_mapped_pfn(entry, pfn) {
>> >>               struct page *page = pfn_to_page(pfn);
>> >>
>> >>               WARN_ON_ONCE(page->mapping);
>> >>               page->mapping = mapping;
>> >> +             page->index = index + i++;
>> >>       }
>> >>  }
>> >
>> > Hum, this just made me think: How is this going to work with XFS reflink?
>> > In fact is not the page->mapping association already broken by XFS reflink?
>> > Because with reflink we can have two or more mappings pointing to the same
>> > physical blocks (i.e., pages in DAX case)...
>>
>> Good question. I assume we are ok in the non-DAX reflink case because
>> rmap of failing / poison pages is only relative to the specific page
>> cache page for a given inode in the reflink. However, DAX would seem
>> to break this because we only get one shared 'struct page' for all
>> possible mappings of the physical file block. I think this means that
>> iterating over the rmap of "where is this page mapped" would require
>> iterating over the other "sibling" inodes that know about the given
>> physical file block.
>>
>> As far as I can see reflink+dax would require teaching kernel code
>> paths that ->mapping may not be a singular relationship. Something
>> along the lines of what Jerome was presenting at LSF to create a
>> special value to indicate, "call back into the filesystem (or the page
>> owner)" to perform this operation.
>>
>> In the meantime the kernel crashes when userspace accesses poisoned
>> pmem via DAX. I assume that reworking rmap for the dax+reflink case
>> should not block dax poison handling? Yell if you disagree.
>
> The thing is, up until get_user_pages() vs truncate series ("fs, dax: use
> page->mapping to warn if truncate collides with a busy page" in
> particular), DAX was perfectly fine with reflinks since we never needed
> page->mapping.

Sure, but if this rmap series had come first I still would have needed
to implement ->mapping. So unless we invent a general ->mapping
replacement and switch all mapping users, it was always going to
collide with DAX eventually.

> Now this series adds even page->index dependency which makes
> life for rmap with reflinks even harder. So if nothing else we should at
> least make sure reflinked filesystems cannot be mounted with dax mount
> option for now and seriously start looking into how to implement rmap with
> reflinked files for DAX because this noticeably reduces its usefulness.

This restriction is already in place. In xfs_reflink_remap_range() we have:

        /* Don't share DAX file data for now. */
        if (IS_DAX(inode_in) || IS_DAX(inode_out))
                goto out_unlock;

All this said, perhaps we don't need to set ->link, it would just mean
a wider search through the rmap tree to find if the given page is
mapped. So, I think I can forgo setting ->link if I teach the rmap
code to search the entire ->mapping.
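
The trade-off above can be modelled in user space. The following is only a sketch under stated assumptions, not kernel code: `struct toy_vma_range`, `toy_count_candidates()` and `TOY_WHOLE_MAPPING_LAST` are hypothetical names, and a linear scan stands in for the vma interval tree. It shows that a lookup bounded by a known page offset returns only the vmas covering that offset, while a lookup without one has to treat every vma of the mapping as a candidate.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for a vma mapping file pages [pgoff_start, pgoff_last]. */
struct toy_vma_range {
	unsigned long pgoff_start;
	unsigned long pgoff_last;
};

/* Upper bound used when no page offset is known (assumption for this model). */
#define TOY_WHOLE_MAPPING_LAST ULONG_MAX

/*
 * Count the vmas whose page-offset range overlaps [first, last].
 * A linear scan keeps the model short; it yields the same candidate
 * set an interval-tree query over the same range would return.
 */
static unsigned long toy_count_candidates(const struct toy_vma_range *vmas,
		unsigned long nr, unsigned long first, unsigned long last)
{
	unsigned long i, hits = 0;

	assert(first <= last);
	for (i = 0; i < nr; i++)
		if (vmas[i].pgoff_start <= last && vmas[i].pgoff_last >= first)
			hits++;
	return hits;
}
```

Calling `toy_count_candidates(vmas, nr, index, index)` models the bounded search, and `toy_count_candidates(vmas, nr, 0, TOY_WHOLE_MAPPING_LAST)` models the wider search over the entire mapping; every extra candidate in the second case is a vma that would still have to be checked for the page.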


* Re: [PATCH 05/11] filesystem-dax: set page->index
  2018-05-30 23:21         ` Dan Williams
@ 2018-05-31 10:08           ` Jan Kara
  2018-05-31 21:49             ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2018-05-31 10:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Christoph Hellwig, Matthew Wilcox,
	Ross Zwisler, Linux MM, linux-fsdevel, Luck, Tony, linux-xfs,
	Darrick J. Wong

On Wed 30-05-18 16:21:33, Dan Williams wrote:
> On Wed, May 30, 2018 at 1:13 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 29-05-18 18:38:41, Dan Williams wrote:
> >> On Wed, May 23, 2018 at 1:40 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Tue 22-05-18 07:39:57, Dan Williams wrote:
> >> >> In support of enabling memory_failure() handling for filesystem-dax
> >> >> mappings, set ->index to the pgoff of the page. The rmap implementation
> >> >> requires ->index to bound the search through the vma interval tree. The
> >> >> index is set and cleared at dax_associate_entry() and
> >> >> dax_disassociate_entry() time respectively.
> >> >>
> >> >> Cc: Jan Kara <jack@suse.cz>
> >> >> Cc: Christoph Hellwig <hch@lst.de>
> >> >> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> >> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> >> ---
> >> >>  fs/dax.c |   11 ++++++++---
> >> >>  1 file changed, 8 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/fs/dax.c b/fs/dax.c
> >> >> index aaec72ded1b6..2e4682cd7c69 100644
> >> >> --- a/fs/dax.c
> >> >> +++ b/fs/dax.c
> >> >> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
> >> >>       for (pfn = dax_radix_pfn(entry); \
> >> >>                       pfn < dax_radix_end_pfn(entry); pfn++)
> >> >>
> >> >> -static void dax_associate_entry(void *entry, struct address_space *mapping)
> >> >> +static void dax_associate_entry(void *entry, struct address_space *mapping,
> >> >> +             struct vm_area_struct *vma, unsigned long address)
> >> >>  {
> >> >> -     unsigned long pfn;
> >> >> +     unsigned long size = dax_entry_size(entry), pfn, index;
> >> >> +     int i = 0;
> >> >>
> >> >>       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> >> >>               return;
> >> >>
> >> >> +     index = linear_page_index(vma, address & ~(size - 1));
> >> >>       for_each_mapped_pfn(entry, pfn) {
> >> >>               struct page *page = pfn_to_page(pfn);
> >> >>
> >> >>               WARN_ON_ONCE(page->mapping);
> >> >>               page->mapping = mapping;
> >> >> +             page->index = index + i++;
> >> >>       }
> >> >>  }
> >> >
> >> > Hum, this just made me think: How is this going to work with XFS reflink?
> >> > In fact is not the page->mapping association already broken by XFS reflink?
> >> > Because with reflink we can have two or more mappings pointing to the same
> >> > physical blocks (i.e., pages in DAX case)...
> >>
> >> Good question. I assume we are ok in the non-DAX reflink case because
> >> rmap of failing / poison pages is only relative to the specific page
> >> cache page for a given inode in the reflink. However, DAX would seem
> >> to break this because we only get one shared 'struct page' for all
> >> possible mappings of the physical file block. I think this means that
> >> iterating over the rmap of "where is this page mapped" would require
> >> iterating over the other "sibling" inodes that know about the given
> >> physical file block.
> >>
> >> As far as I can see reflink+dax would require teaching kernel code
> >> paths that ->mapping may not be a singular relationship. Something
> >> along the lines of what Jerome was presenting at LSF to create a
> >> special value to indicate, "call back into the filesystem (or the page
> >> owner)" to perform this operation.
> >>
> >> In the meantime the kernel crashes when userspace accesses poisoned
> >> pmem via DAX. I assume that reworking rmap for the dax+reflink case
> >> should not block dax poison handling? Yell if you disagree.
> >
> > The thing is, up until get_user_pages() vs truncate series ("fs, dax: use
> > page->mapping to warn if truncate collides with a busy page" in
> > particular), DAX was perfectly fine with reflinks since we never needed
> > page->mapping.
> 
> Sure, but if this rmap series had come first I still would have needed
> to implement ->mapping. So unless we invent a general ->mapping
> replacement and switch all mapping users, it was always going to
> collide with DAX eventually.

Yes, my comment was more in direction that life would be easier if we could
keep DAX without rmap support but I guess that's just too cumbersome.

> > Now this series adds even page->index dependency which makes
> > life for rmap with reflinks even harder. So if nothing else we should at
> > least make sure reflinked filesystems cannot be mounted with dax mount
> > option for now and seriously start looking into how to implement rmap with
> > reflinked files for DAX because this noticeably reduces its usefulness.
> 
> This restriction is already in place. In xfs_reflink_remap_range() we have:
> 
>         /* Don't share DAX file data for now. */
>         if (IS_DAX(inode_in) || IS_DAX(inode_out))
>                 goto out_unlock;

OK, good.

> All this said, perhaps we don't need to set ->link, it would just mean
> a wider search through the rmap tree to find if the given page is
> mapped. So, I think I can forgo setting ->link if I teach the rmap
> code to search the entire ->mapping.

I guess you mean ->index in the above. And now when thinking about it I don't
think it is worth the complications to avoid using ->index.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 05/11] filesystem-dax: set page->index
  2018-05-31 10:08           ` Jan Kara
@ 2018-05-31 21:49             ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-05-31 21:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	Linux MM, linux-fsdevel, Luck, Tony, linux-xfs, Darrick J. Wong

On Thu, May 31, 2018 at 3:08 AM, Jan Kara <jack@suse.cz> wrote:
[..]
>> >> As far as I can see reflink+dax would require teaching kernel code
>> >> paths that ->mapping may not be a singular relationship. Something
>> >> along the lines of what Jerome was presenting at LSF to create a
>> >> special value to indicate, "call back into the filesystem (or the page
>> >> owner)" to perform this operation.
>> >>
>> >> In the meantime the kernel crashes when userspace accesses poisoned
>> >> pmem via DAX. I assume that reworking rmap for the dax+reflink case
>> >> should not block dax poison handling? Yell if you disagree.
>> >
>> > The thing is, up until get_user_pages() vs truncate series ("fs, dax: use
>> > page->mapping to warn if truncate collides with a busy page" in
>> > particular), DAX was perfectly fine with reflinks since we never needed
>> > page->mapping.
>>
>> Sure, but if this rmap series had come first I still would have needed
>> to implement ->mapping. So unless we invent a general ->mapping
>> replacement and switch all mapping users, it was always going to
>> collide with DAX eventually.
>
> Yes, my comment was more in direction that life would be easier if we could
> keep DAX without rmap support but I guess that's just too cumbersome.

I'm open to deeper reworks later. As it stands currently just calling
madvise(..., MADV_HWPOISON) on a DAX mapping causes a page reference
to be leaked because the madvise code has no clue about proper
handling of DAX pages, and consuming real poison leads to a fatal
condition / reset.

I think fixing those bugs with the current rmap dependencies on
->mapping and ->index is step1 and step2 is a longer term solution for
dax rmap that does also allow reflink. I.e. it's an rmap > reflink
argument for now.

>
>> > Now this series adds even page->index dependency which makes
>> > life for rmap with reflinks even harder. So if nothing else we should at
>> > least make sure reflinked filesystems cannot be mounted with dax mount
>> > option for now and seriously start looking into how to implement rmap with
>> > reflinked files for DAX because this noticeably reduces its usefulness.
>>
>> This restriction is already in place. In xfs_reflink_remap_range() we have:
>>
>>         /* Don't share DAX file data for now. */
>>         if (IS_DAX(inode_in) || IS_DAX(inode_out))
>>                 goto out_unlock;
>
> OK, good.
>
>> All this said, perhaps we don't need to set ->link, it would just mean
>> a wider search through the rmap tree to find if the given page is
>> mapped. So, I think I can forgo setting ->link if I teach the rmap
>> code to search the entire ->mapping.
>
> I guess you mean ->index in the above. And now when thinking about it I don't
> think it is worth the complications to avoid using ->index.

Ok, and yes I meant ->index... sorry, too much struct page on the
brain presently.
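
For readers following the ->index discussion, the per-page assignment in the quoted patch can be modelled in user space. This is a minimal sketch, not the kernel implementation: `struct toy_vma`, `toy_linear_page_index()`, `toy_assign_indexes()` and the 4 KiB `TOY_PAGE_SHIFT` are assumptions made for illustration, and the hugetlb special case of the real linear_page_index() is left out. It mirrors the patch's arithmetic: align the faulting address down to the entry size, convert it to a file page offset, then number the pages of the entry consecutively.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12UL	/* assumes 4 KiB pages */

/* Hypothetical subset of a vma: where it starts and which file page that is. */
struct toy_vma {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Page offset within the file for a virtual address inside the vma. */
static unsigned long toy_linear_page_index(const struct toy_vma *vma,
		unsigned long address)
{
	return ((address - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Fill index[] for every page of an entry of `size` bytes (a power of
 * two) that covers `address`. Returns the number of pages written.
 */
static size_t toy_assign_indexes(const struct toy_vma *vma,
		unsigned long address, unsigned long size,
		unsigned long *index, size_t max)
{
	size_t i, nr = size >> TOY_PAGE_SHIFT;
	unsigned long base;

	assert(size && (size & (size - 1)) == 0);
	assert(nr <= max);

	/* Same alignment step as "address & ~(size - 1)" in the patch. */
	base = toy_linear_page_index(vma, address & ~(size - 1));
	for (i = 0; i < nr; i++)
		index[i] = base + i;
	return nr;
}
```

With a vma starting at 0x200000 with vm_pgoff 16, a fault at 0x201234 in a 16 KiB entry gives the four pages the indexes 16 through 19, regardless of where inside the entry the fault landed.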


end of thread, other threads:[~2018-05-31 21:49 UTC | newest]

Thread overview: 26+ messages
2018-05-22 14:39 [PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
2018-05-22 14:39 ` [PATCH 01/11] device-dax: convert to vmf_insert_mixed and vm_fault_t Dan Williams
2018-05-22 14:39 ` [PATCH 02/11] device-dax: cleanup vm_fault de-reference chains Dan Williams
2018-05-22 14:39 ` [PATCH 03/11] device-dax: enable page_mapping() Dan Williams
2018-05-23  9:03   ` Jan Kara
2018-05-30 19:54   ` kbuild test robot
2018-05-22 14:39 ` [PATCH 04/11] device-dax: set page->index Dan Williams
2018-05-22 14:39 ` [PATCH 05/11] filesystem-dax: " Dan Williams
2018-05-23  8:40   ` Jan Kara
2018-05-30  1:38     ` Dan Williams
2018-05-30  8:13       ` Jan Kara
2018-05-30 23:21         ` Dan Williams
2018-05-31 10:08           ` Jan Kara
2018-05-31 21:49             ` Dan Williams
2018-05-22 14:40 ` [PATCH 06/11] filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock Dan Williams
2018-05-23  9:35   ` Jan Kara
2018-05-23 13:50     ` Dan Williams
2018-05-22 14:40 ` [PATCH 07/11] mm, madvise_inject_error: fix page count leak Dan Williams
2018-05-23  4:19   ` Naoya Horiguchi
2018-05-24 20:55     ` Dan Williams
2018-05-22 14:40 ` [PATCH 08/11] x86, memory_failure: introduce {set,clear}_mce_nospec() Dan Williams
2018-05-22 14:40 ` [PATCH 09/11] mm, memory_failure: pass page size to kill_proc() Dan Williams
2018-05-23  6:41   ` Naoya Horiguchi
2018-05-22 14:40 ` [PATCH 10/11] mm, memory_failure: teach memory_failure() about dev_pagemap pages Dan Williams
2018-05-23  6:48   ` Naoya Horiguchi
2018-05-22 14:40 ` [PATCH 11/11] libnvdimm, pmem: restore page attributes when clearing errors Dan Williams
