linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages
@ 2018-06-08 23:50 Dan Williams
  2018-06-08 23:50 ` [PATCH v4 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-edac, Tony Luck, Borislav Petkov, Jérôme Glisse,
	Jan Kara, H. Peter Anvin, x86, Thomas Gleixner,
	Christoph Hellwig, Ross Zwisler, Matthew Wilcox, Ingo Molnar,
	Michal Hocko, Naoya Horiguchi, Souptick Joarder, linux-mm,
	linux-fsdevel, jack

Changes since v3 [1]:

* Introduce dax_lock_page(), using the radix exceptional entry lock, for
  pinning down page->mapping while memory_failure() interrogates the
  page. (Jan)

* Collect acks and reviews from Tony and Jan.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html

---

As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.

In order to support reliable reverse mapping of user space addresses:

1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.

2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.

3/ Given pmem errors can be repaired, change the speculatively accessed
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.

A side effect of this enabling is that MADV_HWPOISON becomes usable for
dax mappings, however the primary motivation is to allow the system to
survive userspace consumption of hardware-poison via dax. Specifically
the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

---

Dan Williams (12):
      device-dax: Convert to vmf_insert_mixed and vm_fault_t
      device-dax: Cleanup vm_fault de-reference chains
      device-dax: Enable page_mapping()
      device-dax: Set page->index
      filesystem-dax: Set page->index
      mm, madvise_inject_error: Let memory_failure() optionally take a page reference
      x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses
      x86/memory_failure: Introduce {set,clear}_mce_nospec()
      mm, memory_failure: Pass page size to kill_proc()
      filesystem-dax: Introduce dax_lock_page()
      mm, memory_failure: Teach memory_failure() about dev_pagemap pages
      libnvdimm, pmem: Restore page attributes when clearing errors


 arch/x86/include/asm/set_memory.h         |   42 +++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ---
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 +-------
 arch/x86/mm/pat.c                         |   16 +++
 drivers/dax/device.c                      |   97 ++++++++++++--------
 drivers/nvdimm/pmem.c                     |   26 +++++
 drivers/nvdimm/pmem.h                     |   13 +++
 fs/dax.c                                  |   92 ++++++++++++++++++-
 include/linux/dax.h                       |   15 +++
 include/linux/huge_mm.h                   |    5 +
 include/linux/mm.h                        |    1 
 include/linux/set_memory.h                |   14 +++
 mm/huge_memory.c                          |    4 -
 mm/madvise.c                              |   18 +++-
 mm/memory-failure.c                       |  143 +++++++++++++++++++++++++++--
 15 files changed, 434 insertions(+), 105 deletions(-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
@ 2018-06-08 23:50 ` Dan Williams
  2018-06-08 23:50 ` [PATCH v4 02/12] device-dax: Cleanup vm_fault de-reference chains Dan Williams
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Souptick Joarder, Matthew Wilcox, Ross Zwisler, hch, linux-mm,
	linux-fsdevel, jack

Use new return type vm_fault_t for fault and huge_fault handler. For
now, this is just documenting that the function returns a VM_FAULT value
rather than an errno.  Once all instances are converted, vm_fault_t will
become a distinct type.

Commit 1c8f422059ae ("mm: change return type to vm_fault_t")

Previously vm_insert_mixed() returned an error code which driver mapped into
VM_FAULT_* type. The new function vmf_insert_mixed() will replace this
inefficiency by returning VM_FAULT_* type.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c    |   26 +++++++++++---------------
 include/linux/huge_mm.h |    5 +++--
 mm/huge_memory.c        |    4 ++--
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index aff2c1594220..d44d98c54d0f 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -244,11 +244,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
-	int rc = VM_FAULT_SIGBUS;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
@@ -274,17 +274,11 @@ static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
-
-	if (rc == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (rc < 0 && rc != -EBUSY)
-		return VM_FAULT_SIGBUS;
-
-	return VM_FAULT_NOPAGE;
+	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
 
-static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -334,7 +328,8 @@ static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -384,13 +379,14 @@ static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	return VM_FAULT_FALLBACK;
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-static int dev_dax_huge_fault(struct vm_fault *vmf,
+static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
@@ -420,7 +416,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
 	return rc;
 }
 
-static int dev_dax_fault(struct vm_fault *vmf)
+static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
 {
 	return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
 }
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a126259bc4..d3bbf6bea9e9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
 #define _LINUX_HUGE_MM_H
 
 #include <linux/sched/coredump.h>
+#include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
@@ -46,9 +47,9 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a3a1815f8e11..6af976472a5d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -755,7 +755,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
@@ -815,7 +815,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 02/12] device-dax: Cleanup vm_fault de-reference chains
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-06-08 23:50 ` [PATCH v4 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
@ 2018-06-08 23:50 ` Dan Williams
  2018-06-11 17:12   ` Laurent Dufour
  2018-06-08 23:50 ` [PATCH v4 03/12] device-dax: Enable page_mapping() Dan Williams
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:50 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

Define a local 'vma' variable rather than repetitively de-referencing
the passed in 'struct vm_fault *' instance.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index d44d98c54d0f..686de08e120b 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -247,13 +247,14 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -274,13 +275,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	return vmf_insert_mixed(vma, vmf->address, pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
@@ -288,7 +290,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -310,11 +312,10 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
-	if (pmd_addr < vmf->vma->vm_start ||
-			(pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
+	if (pmd_addr < vma->vm_start || (pmd_addr + PMD_SIZE) > vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	pgoff = linear_page_index(vmf->vma, pmd_addr);
+	pgoff = linear_page_index(vma, pmd_addr);
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, PMD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -323,7 +324,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
@@ -332,6 +333,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
@@ -340,7 +342,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	unsigned int fault_size = PUD_SIZE;
 
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -362,11 +364,10 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
-	if (pud_addr < vmf->vma->vm_start ||
-			(pud_addr + PUD_SIZE) > vmf->vma->vm_end)
+	if (pud_addr < vma->vm_start || (pud_addr + PUD_SIZE) > vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	pgoff = linear_page_index(vmf->vma, pud_addr);
+	pgoff = linear_page_index(vma, pud_addr);
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, PUD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -375,7 +376,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -390,12 +391,13 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
-	struct file *filp = vmf->vma->vm_file;
+	struct vm_area_struct *vma = vmf->vma;
+	struct file *filp = vma->vm_file;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
 			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
-			vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
+			vma->vm_start, vma->vm_end, pe_size);
 
 	id = dax_read_lock();
 	switch (pe_size) {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 03/12] device-dax: Enable page_mapping()
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-06-08 23:50 ` [PATCH v4 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
  2018-06-08 23:50 ` [PATCH v4 02/12] device-dax: Cleanup vm_fault de-reference chains Dan Williams
@ 2018-06-08 23:50 ` Dan Williams
  2018-06-08 23:50 ` [PATCH v4 04/12] device-dax: Set page->index Dan Williams
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:50 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Jan Kara, hch, linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for device-dax
mappings, set the ->mapping association of pages backing device-dax
mappings. The rmap implementation requires page_mapping() to return the
address_space hosting the vmas that map the page.

The ->mapping pointer is never cleared. There is no possibility for the
page to become associated with another address_space while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed / later reinitialized to zero.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   55 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 38 insertions(+), 17 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 686de08e120b..7ec246549721 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -245,13 +245,12 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
-	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
 	if (check_vma(dev_dax, vma, __func__))
@@ -273,13 +272,13 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vma, vmf->address, pfn);
+	return vmf_insert_mixed(vma, vmf->address, *pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
@@ -287,7 +286,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
 	if (check_vma(dev_dax, vma, __func__))
@@ -322,15 +320,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
@@ -338,7 +336,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
 
@@ -374,14 +371,14 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	return VM_FAULT_FALLBACK;
 }
@@ -390,9 +387,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
-	int rc, id;
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *filp = vma->vm_file;
+	unsigned long fault_size;
+	int rc, id;
+	pfn_t pfn;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
@@ -402,17 +401,39 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	id = dax_read_lock();
 	switch (pe_size) {
 	case PE_SIZE_PTE:
-		rc = __dev_dax_pte_fault(dev_dax, vmf);
+		fault_size = PAGE_SIZE;
+		rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PMD:
-		rc = __dev_dax_pmd_fault(dev_dax, vmf);
+		fault_size = PMD_SIZE;
+		rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PUD:
-		rc = __dev_dax_pud_fault(dev_dax, vmf);
+		fault_size = PUD_SIZE;
+		rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
 		break;
 	default:
 		rc = VM_FAULT_SIGBUS;
 	}
+
+	if (rc == VM_FAULT_NOPAGE) {
+		unsigned long i;
+
+		/*
+		 * In the device-dax case the only possibility for a
+		 * VM_FAULT_NOPAGE result is when device-dax capacity is
+		 * mapped. No need to consider the zero page, or racing
+		 * conflicting mappings.
+		 */
+		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+			struct page *page;
+
+			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+			if (page->mapping)
+				continue;
+			page->mapping = filp->f_mapping;
+		}
+	}
 	dax_read_unlock(id);
 
 	return rc;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 04/12] device-dax: Set page->index
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (2 preceding siblings ...)
  2018-06-08 23:50 ` [PATCH v4 03/12] device-dax: Enable page_mapping() Dan Williams
@ 2018-06-08 23:50 ` Dan Williams
  2018-06-08 23:50 ` [PATCH v4 05/12] filesystem-dax: " Dan Williams
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:50 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Jan Kara, hch, linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for device-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree.

The ->index value is never cleared. There is no possibility for the
page to become associated with another pgoff while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed and ->index is reinitialized to zero.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 7ec246549721..2b120e397e08 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -418,6 +418,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 
 	if (rc == VM_FAULT_NOPAGE) {
 		unsigned long i;
+		pgoff_t pgoff;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -425,6 +426,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
+		pgoff = linear_page_index(vma, vmf->address
+				& ~(fault_size - 1));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
@@ -432,6 +435,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 			if (page->mapping)
 				continue;
 			page->mapping = filp->f_mapping;
+			page->index = pgoff + i;
 		}
 	}
 	dax_read_unlock(id);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 05/12] filesystem-dax: Set page->index
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (3 preceding siblings ...)
  2018-06-08 23:50 ` [PATCH v4 04/12] device-dax: Set page->index Dan Williams
@ 2018-06-08 23:50 ` Dan Williams
  2018-06-08 23:50 ` [PATCH v4 06/12] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Christoph Hellwig, Matthew Wilcox, Ross Zwisler, Jan Kara,
	linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index aaec72ded1b6..cccf6cad1a7a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,18 +319,27 @@ static unsigned long dax_radix_end_pfn(void *entry)
 	for (pfn = dax_radix_pfn(entry); \
 			pfn < dax_radix_end_pfn(entry); pfn++)
 
-static void dax_associate_entry(void *entry, struct address_space *mapping)
+/*
+ * TODO: for reflink+dax we need a way to associate a single page with
+ * multiple address_space instances at different linear_page_index()
+ * offsets.
+ */
+static void dax_associate_entry(void *entry, struct address_space *mapping,
+		struct vm_area_struct *vma, unsigned long address)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn, index;
+	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
+	index = linear_page_index(vma, address & ~(size - 1));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(page->mapping);
 		page->mapping = mapping;
+		page->index = index + i++;
 	}
 }
 
@@ -348,6 +357,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
+		page->index = 0;
 	}
 }
 
@@ -604,7 +614,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	new_entry = dax_radix_locked_entry(pfn, flags);
 	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping);
+		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 06/12] mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (4 preceding siblings ...)
  2018-06-08 23:50 ` [PATCH v4 05/12] filesystem-dax: " Dan Williams
@ 2018-06-08 23:50 ` Dan Williams
  2018-06-08 23:50 ` [PATCH v4 07/12] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Naoya Horiguchi, hch, linux-mm, linux-fsdevel, jack

The madvise_inject_error() routine uses get_user_pages() to lookup the
pfn and other information for injected error, but it does not release
that pin. The assumption is that failed pages should be taken out of
circulation.

However, for dax mappings it is not possible to take pages out of
circulation since they are 1:1 physically mapped as filesystem blocks,
or device-dax capacity. They also typically represent persistent memory
which has an error clearing capability.

In preparation for adding a special handler for dax mappings, shift the
responsibility of taking the page reference to memory_failure(). I.e.
drop the page reference and do not specify MF_COUNT_INCREASED to
memory_failure().

Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/madvise.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..b731933dddae 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
 
 
 	for (; start < end; start += PAGE_SIZE << order) {
+		unsigned long pfn;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		pfn = page_to_pfn(page);
 
 		/*
 		 * When soft offlining hugepages, after migrating the page
@@ -651,17 +653,27 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+					pfn, start);
 
 			ret = soft_offline_page(page, MF_COUNT_INCREASED);
 			if (ret)
 				return ret;
 			continue;
 		}
+
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+				pfn, start);
+
+		ret = memory_failure(pfn, 0);
+
+		/*
+		 * Drop the page reference taken by get_user_pages_fast(). In
+		 * the absence of MF_COUNT_INCREASED the memory_failure()
+		 * routine is responsible for pinning the page to prevent it
+		 * from being released back to the page allocator.
+		 */
+		put_page(page);
 
-		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
 		if (ret)
 			return ret;
 	}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 07/12] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (5 preceding siblings ...)
  2018-06-08 23:50 ` [PATCH v4 06/12] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
@ 2018-06-08 23:50 ` Dan Williams
  2018-06-08 23:51 ` [PATCH v4 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:50 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck,
	Borislav Petkov, linux-edac, x86, hch, linux-mm, linux-fsdevel,
	jack

In preparation for using set_memory_uc() instead set_memory_np() for
isolating poison from speculation, teach the memtype code to sanitize
physical addresses vs __PHYSICAL_MASK.

The motivation for using set_memory_uc() for this case is to allow
ongoing access to persistent memory pages via the pmem-driver +
memcpy_mcsafe() until the poison is repaired.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/pat.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 1555bd7d3449..6788ffa990f8 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -512,6 +512,17 @@ static int free_ram_pages_type(u64 start, u64 end)
 	return 0;
 }
 
+static u64 sanitize_phys(u64 address)
+{
+	/*
+	 * When changing the memtype for pages containing poison allow
+	 * for a "decoy" virtual address (bit 63 clear) passed to
+	 * set_memory_X(). __pa() on a "decoy" address results in a
+	 * physical address with it 63 set.
+	 */
+	return address & __PHYSICAL_MASK;
+}
+
 /*
  * req_type typically has one of the:
  * - _PAGE_CACHE_MODE_WB
@@ -533,6 +544,8 @@ int reserve_memtype(u64 start, u64 end, enum page_cache_mode req_type,
 	int is_range_ram;
 	int err = 0;
 
+	start = sanitize_phys(start);
+	end = sanitize_phys(end);
 	BUG_ON(start >= end); /* end is exclusive */
 
 	if (!pat_enabled()) {
@@ -609,6 +622,9 @@ int free_memtype(u64 start, u64 end)
 	if (!pat_enabled())
 		return 0;
 
+	start = sanitize_phys(start);
+	end = sanitize_phys(end);
+
 	/* Low ISA region is always mapped WB. No need to track */
 	if (x86_platform.is_untracked_pat_range(start, end))
 		return 0;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (6 preceding siblings ...)
  2018-06-08 23:50 ` [PATCH v4 07/12] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
@ 2018-06-08 23:51 ` Dan Williams
  2018-06-08 23:51 ` [PATCH v4 09/12] mm, memory_failure: Pass page size to kill_proc() Dan Williams
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Borislav Petkov,
	linux-edac, x86, Tony Luck, hch, linux-mm, linux-fsdevel, jack

Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.

Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_pfn() to set_mce_nospec() remapping the page as
UC, to hide it from speculative accesses.

Given that that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/set_memory.h         |   42 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ----------
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 ++------------------------
 include/linux/set_memory.h                |   14 ++++++++++
 4 files changed, 59 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index bd090367236c..cf5e9124b45e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -88,4 +88,46 @@ extern int kernel_set_to_readonly;
 void set_kernel_text_rw(void);
 void set_kernel_text_ro(void);
 
+#ifdef CONFIG_X86_64
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	unsigned long decoy_addr;
+	int rc;
+
+	/*
+	 * Mark the linear address as UC to make sure we don't log more
+	 * errors because of speculative access to the page.
+	 * We would like to just call:
+	 *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+	 * but doing that would radically increase the odds of a
+	 * speculative access to the poison page because we'd have
+	 * the virtual address of the kernel 1:1 mapping sitting
+	 * around in registers.
+	 * Instead we get tricky.  We create a non-canonical address
+	 * that looks just like the one we want, but has bit 63 flipped.
+	 * This relies on set_memory_uc() properly sanitizing any __pa()
+	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
+	 */
+	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+
+	rc = set_memory_uc(decoy_addr, 1);
+	if (rc)
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+	return rc;
+}
+#define set_mce_nospec set_mce_nospec
+
+/* Restore full speculative operation to the pfn. */
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+}
+#define clear_mce_nospec clear_mce_nospec
+#else
+/*
+ * Few people would run a 32-bit kernel on a machine that supports
+ * recoverable errors because they have too much memory to boot 32-bit.
+ */
+#endif
+
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 374d1aa66952..ceb67cd5918f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -113,21 +113,6 @@ static inline void mce_register_injector_chain(struct notifier_block *nb)	{ }
 static inline void mce_unregister_injector_chain(struct notifier_block *nb)	{ }
 #endif
 
-#ifndef CONFIG_X86_64
-/*
- * On 32-bit systems it would be difficult to safely unmap a poison page
- * from the kernel 1:1 map because there are no non-canonical addresses that
- * we can use to refer to the address without risking a speculative access.
- * However, this isn't much of an issue because:
- * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
- *    are only mapped into the kernel as needed
- * 2) Few people would run a 32-bit kernel on a machine that supports
- *    recoverable errors because they have too much memory to boot 32-bit.
- */
-static inline void mce_unmap_kpfn(unsigned long pfn) {}
-#define mce_unmap_kpfn mce_unmap_kpfn
-#endif
-
 struct mca_config {
 	bool dont_log_ce;
 	bool cmci_disabled;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 42cf2880d0ed..a0fbf0a8b7e6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -42,6 +42,7 @@
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/jump_label.h>
+#include <linux/set_memory.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -50,7 +51,6 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
-#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
@@ -108,10 +108,6 @@ static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn);
-#endif
-
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -602,7 +598,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 	if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
 		pfn = mce->addr >> PAGE_SHIFT;
 		if (!memory_failure(pfn, 0))
-			mce_unmap_kpfn(pfn);
+			set_mce_nospec(pfn);
 	}
 
 	return NOTIFY_OK;
@@ -1070,38 +1066,10 @@ static int do_memory_failure(struct mce *m)
 	if (ret)
 		pr_err("Memory error not recovered");
 	else
-		mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
+		set_mce_nospec(m->addr >> PAGE_SHIFT);
 	return ret;
 }
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn)
-{
-	unsigned long decoy_addr;
-
-	/*
-	 * Unmap this page from the kernel 1:1 mappings to make sure
-	 * we don't log more errors because of speculative access to
-	 * the page.
-	 * We would like to just call:
-	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
-	 * but doing that would radically increase the odds of a
-	 * speculative access to the poison page because we'd have
-	 * the virtual address of the kernel 1:1 mapping sitting
-	 * around in registers.
-	 * Instead we get tricky.  We create a non-canonical address
-	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_np() not checking whether we passed
-	 * a legal address.
-	 */
-
-	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
-	if (set_memory_np(decoy_addr, 1))
-		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-}
-#endif
-
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index da5178216da5..2a986d282a97 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -17,6 +17,20 @@ static inline int set_memory_x(unsigned long addr,  int numpages) { return 0; }
 static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
+#ifndef set_mce_nospec
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
+#ifndef clear_mce_nospec
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
 #ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
 static inline int set_memory_encrypted(unsigned long addr, int numpages)
 {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 09/12] mm, memory_failure: Pass page size to kill_proc()
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (7 preceding siblings ...)
  2018-06-08 23:51 ` [PATCH v4 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
@ 2018-06-08 23:51 ` Dan Williams
  2018-06-08 23:51 ` [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page() Dan Williams
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:51 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Naoya Horiguchi, hch, linux-mm, linux-fsdevel, jack

Given that ZONE_DEVICE / dev_pagemap pages are never assembled into
compound pages, the size determination logic in kill_proc() needs
updating for the dev_pagemap case. In preparation for dev_pagemap
support rework memory_failure() and kill_proc() to pass / consume the page
size explicitly.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory-failure.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9d142b9b86dc..42a193ee14d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -179,18 +179,16 @@ EXPORT_SYMBOL_GPL(hwpoison_filter);
  * ``action required'' if error happened in current execution context
  */
 static int kill_proc(struct task_struct *t, unsigned long addr,
-			unsigned long pfn, struct page *page, int flags)
+			unsigned long pfn, unsigned size_shift, int flags)
 {
-	short addr_lsb;
 	int ret;
 
 	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
 		pfn, t->comm, t->pid);
-	addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
 
 	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
 		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
-				       addr_lsb, current);
+				       size_shift, current);
 	} else {
 		/*
 		 * Don't use force here, it's convenient if the signal
@@ -199,7 +197,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
 		 * to SIG_IGN, but hopefully no one will do that?
 		 */
 		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
-				      addr_lsb, t);  /* synchronous? */
+				      size_shift, t);  /* synchronous? */
 	}
 	if (ret < 0)
 		pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
@@ -318,7 +316,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
  * wrong earlier.
  */
 static void kill_procs(struct list_head *to_kill, int forcekill,
-			  bool fail, struct page *page, unsigned long pfn,
+			  bool fail, unsigned size_shift, unsigned long pfn,
 			  int flags)
 {
 	struct to_kill *tk, *next;
@@ -343,7 +341,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 * process anyways.
 			 */
 			else if (kill_proc(tk->tsk, tk->addr,
-					      pfn, page, flags) < 0)
+					      pfn, size_shift, flags) < 0)
 				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
 				       pfn, tk->tsk->comm, tk->tsk->pid);
 		}
@@ -928,6 +926,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	bool unmap_success;
+	unsigned size_shift;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
@@ -1012,7 +1011,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * any accesses to the poisoned memory.
 	 */
 	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
-	kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
+	size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
+	kill_procs(&tokill, forcekill, !unmap_success, size_shift, pfn, flags);
 
 	return unmap_success;
 }

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (8 preceding siblings ...)
  2018-06-08 23:51 ` [PATCH v4 09/12] mm, memory_failure: Pass page size to kill_proc() Dan Williams
@ 2018-06-08 23:51 ` Dan Williams
  2018-06-11 15:41   ` Jan Kara
  2018-06-12 18:15   ` Ross Zwisler
  2018-06-08 23:51 ` [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
  2018-06-08 23:51 ` [PATCH v4 12/12] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
  11 siblings, 2 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:51 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

In preparation for implementing support for memory poison (media error)
handling via dax mappings, implement a lock_page() equivalent. Poison
error handling requires rmap and needs guarantees that the page->mapping
association is maintained / valid (inode not freed) for the duration of
the lookup.

In the device-dax case it is sufficient to simply hold a dev_pagemap
reference. In the filesystem-dax case we need to use the entry lock.

Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
protect against the inode being freed, and revalidates the page->mapping
association under xa_lock().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |   15 ++++++++++
 2 files changed, 91 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index cccf6cad1a7a..b7e71b108fcf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	}
 }
 
+struct page *dax_lock_page(unsigned long pfn)
+{
+	pgoff_t index;
+	struct inode *inode;
+	wait_queue_head_t *wq;
+	void *entry = NULL, **slot;
+	struct address_space *mapping;
+	struct wait_exceptional_entry_queue ewait;
+	struct page *ret = NULL, *page = pfn_to_page(pfn);
+
+	rcu_read_lock();
+	for (;;) {
+		mapping = READ_ONCE(page->mapping);
+
+		if (!mapping || !IS_DAX(mapping->host))
+			break;
+
+		/*
+		 * In the device-dax case there's no need to lock, a
+		 * struct dev_pagemap pin is sufficient to keep the
+		 * inode alive.
+		 */
+		inode = mapping->host;
+		if (S_ISCHR(inode->i_mode)) {
+			ret = page;
+			break;
+		}
+
+		xa_lock_irq(&mapping->i_pages);
+		if (mapping != page->mapping) {
+			xa_unlock_irq(&mapping->i_pages);
+			continue;
+		}
+		index = page->index;
+
+		init_wait(&ewait.wait);
+		ewait.wait.func = wake_exceptional_entry_func;
+
+		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
+				&slot);
+		if (!entry ||
+		    WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) {
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		} else if (!slot_locked(mapping, slot)) {
+			lock_slot(mapping, slot);
+			ret = page;
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		}
+
+		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
+		prepare_to_wait_exclusive(wq, &ewait.wait,
+				TASK_UNINTERRUPTIBLE);
+		xa_unlock_irq(&mapping->i_pages);
+		rcu_read_unlock();
+		schedule();
+		finish_wait(wq, &ewait.wait);
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+
+	return page;
+}
+
+void dax_unlock_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+
+	if (S_ISCHR(inode->i_mode))
+		return;
+
+	dax_unlock_mapping_entry(mapping, page->index);
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f9eb22ad341e..641cab7e1fa7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -83,6 +83,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
+struct page *dax_lock_page(unsigned long pfn);
+void dax_unlock_page(struct page *page);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -108,6 +110,19 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline struct page *dax_lock_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	if (IS_DAX(page->mapping->host))
+		return page;
+	return NULL;
+}
+
+static inline void dax_unlock_page(struct page *page)
+{
+}
 #endif
 
 int dax_read_lock(void);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (9 preceding siblings ...)
  2018-06-08 23:51 ` [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page() Dan Williams
@ 2018-06-08 23:51 ` Dan Williams
  2018-06-11 15:50   ` Jan Kara
  2018-06-12 20:14   ` Ross Zwisler
  2018-06-08 23:51 ` [PATCH v4 12/12] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
  11 siblings, 2 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Jérôme Glisse,
	Matthew Wilcox, Naoya Horiguchi, Ross Zwisler, linux-mm,
	linux-fsdevel, jack

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered

In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical address to file offset associations. Also
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internal remap/repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.

Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h  |    1 
 mm/memory-failure.c |  127 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06a4be6..566c972e03e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2669,6 +2669,7 @@ enum mf_action_page_type {
 	MF_MSG_TRUNCATED_LRU,
 	MF_MSG_BUDDY,
 	MF_MSG_BUDDY_2ND,
+	MF_MSG_DAX,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 42a193ee14d3..a5912b27fea7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -55,6 +55,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm_inline.h>
+#include <linux/memremap.h>
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include "internal.h"
@@ -513,6 +514,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_TRUNCATED_LRU]		= "already truncated LRU page",
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
+	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1112,6 +1114,126 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	return res;
 }
 
+static unsigned long dax_mapping_size(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page_to_pgoff(page);
+	struct vm_area_struct *vma;
+	unsigned long size = 0;
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long address = vma_address(page, vma);
+		pgd_t *pgd;
+		p4d_t *p4d;
+		pud_t *pud;
+		pmd_t *pmd;
+		pte_t *pte;
+
+		pgd = pgd_offset(vma->vm_mm, address);
+		if (!pgd_present(*pgd))
+			continue;
+		p4d = p4d_offset(pgd, address);
+		if (!p4d_present(*p4d))
+			continue;
+		pud = pud_offset(p4d, address);
+		if (!pud_present(*pud))
+			continue;
+		if (pud_devmap(*pud)) {
+			size = PUD_SIZE;
+			break;
+		}
+		pmd = pmd_offset(pud, address);
+		if (!pmd_present(*pmd))
+			continue;
+		if (pmd_devmap(*pmd)) {
+			size = PMD_SIZE;
+			break;
+		}
+		pte = pte_offset_map(pmd, address);
+		if (!pte_present(*pte))
+			continue;
+		if (pte_devmap(*pte)) {
+			size = PAGE_SIZE;
+			break;
+		}
+	}
+	i_mmap_unlock_read(mapping);
+
+	return size;
+}
+
+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	const bool unmap_success = true;
+	unsigned long size;
+	struct page *page;
+	LIST_HEAD(tokill);
+	int rc = -EBUSY;
+	loff_t start;
+
+	/*
+	 * Prevent the inode from being freed while we are interrogating
+	 * the address_space, typically this would be handled by
+	 * lock_page(), but dax pages do not use the page lock.
+	 */
+	page = dax_lock_page(pfn);
+	if (!page)
+		goto out;
+
+	if (hwpoison_filter(page)) {
+		rc = 0;
+		goto unlock;
+	}
+
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		goto unlock;
+	default:
+		break;
+	}
+
+	/*
+	 * If the page is not mapped in userspace then report it as
+	 * unhandled.
+	 */
+	size = dax_mapping_size(page);
+	if (!size) {
+		pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
+		goto unlock;
+	}
+
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+
+	start = (page->index << PAGE_SHIFT) & ~(size - 1);
+	unmap_mapping_range(page->mapping, start, start + size, 0);
+
+	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),
+			pfn, flags);
+	rc = 0;
+unlock:
+	dax_unlock_page(page);
+out:
+	put_dev_pagemap(pgmap);
+	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
+	return rc;
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1134,6 +1256,7 @@ int memory_failure(unsigned long pfn, int flags)
 	struct page *p;
 	struct page *hpage;
 	struct page *orig_head;
+	struct dev_pagemap *pgmap;
 	int res;
 	unsigned long page_flags;
 
@@ -1146,6 +1269,10 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	pgmap = get_dev_pagemap(pfn, NULL);
+	if (pgmap)
+		return memory_failure_dev_pagemap(pfn, flags, pgmap);
+
 	p = pfn_to_page(pfn);
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v4 12/12] libnvdimm, pmem: Restore page attributes when clearing errors
  2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (10 preceding siblings ...)
  2018-06-08 23:51 ` [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-06-08 23:51 ` Dan Williams
  11 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-08 23:51 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
has also been marked UC in PAT (page attribute table) via
set_mce_nospec() to prevent speculative retrievals of poison.

The 'HWPoison' flag is only cleared when overwriting an entire page.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   26 ++++++++++++++++++++++++++
 drivers/nvdimm/pmem.h |   13 +++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..04ee1fdee219 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -20,6 +20,7 @@
 #include <linux/hdreg.h>
 #include <linux/init.h>
 #include <linux/platform_device.h>
+#include <linux/set_memory.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/badblocks.h>
@@ -51,6 +52,30 @@ static struct nd_region *to_region(struct pmem_device *pmem)
 	return to_nd_region(to_dev(pmem)->parent);
 }
 
+static void hwpoison_clear(struct pmem_device *pmem,
+		phys_addr_t phys, unsigned int len)
+{
+	unsigned long pfn_start, pfn_end, pfn;
+
+	/* only pmem in the linear map supports HWPoison */
+	if (is_vmalloc_addr(pmem->virt_addr))
+		return;
+
+	pfn_start = PHYS_PFN(phys);
+	pfn_end = pfn_start + PHYS_PFN(len);
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * Note, no need to hold a get_dev_pagemap() reference
+		 * here since we're in the driver I/O path and
+		 * outstanding I/O requests pin the dev_pagemap.
+		 */
+		if (test_and_clear_pmem_poison(page))
+			clear_mce_nospec(pfn);
+	}
+}
+
 static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 		phys_addr_t offset, unsigned int len)
 {
@@ -65,6 +90,7 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	if (cleared < len)
 		rc = BLK_STS_IOERR;
 	if (cleared > 0 && cleared / 512) {
+		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
 		cleared /= 512;
 		dev_dbg(dev, "%#llx clear %ld sector%s\n",
 				(unsigned long long) sector, cleared,
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index a64ebc78b5df..59cfe13ea8a8 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __NVDIMM_PMEM_H__
 #define __NVDIMM_PMEM_H__
+#include <linux/page-flags.h>
 #include <linux/badblocks.h>
 #include <linux/types.h>
 #include <linux/pfn_t.h>
@@ -27,4 +28,16 @@ struct pmem_device {
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn);
+
+#ifdef CONFIG_MEMORY_FAILURE
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return TestClearPageHWPoison(page);
+}
+#else
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return false;
+}
+#endif
 #endif /* __NVDIMM_PMEM_H__ */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
  2018-06-08 23:51 ` [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page() Dan Williams
@ 2018-06-11 15:41   ` Jan Kara
  2018-06-11 16:48     ` Dan Williams
                       ` (2 more replies)
  2018-06-12 18:15   ` Ross Zwisler
  1 sibling, 3 replies; 26+ messages in thread
From: Jan Kara @ 2018-06-11 15:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, hch, linux-mm, linux-fsdevel, jack

On Fri 08-06-18 16:51:14, Dan Williams wrote:
> In preparation for implementing support for memory poison (media error)
> handling via dax mappings, implement a lock_page() equivalent. Poison
> error handling requires rmap and needs guarantees that the page->mapping
> association is maintained / valid (inode not freed) for the duration of
> the lookup.
> 
> In the device-dax case it is sufficient to simply hold a dev_pagemap
> reference. In the filesystem-dax case we need to use the entry lock.
> 
> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
> protect against the inode being freed, and revalidates the page->mapping
> association under xa_lock().
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Some comments below...

> diff --git a/fs/dax.c b/fs/dax.c
> index cccf6cad1a7a..b7e71b108fcf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>  	}
>  }
>  
> +struct page *dax_lock_page(unsigned long pfn)
> +{

Why do you return struct page here? Any reason behind that? Because struct
page exists and can be accessed through pfn_to_page() regardless of result
of this function so it looks a bit confusing. Also dax_lock_page() name
seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?

> +	pgoff_t index;
> +	struct inode *inode;
> +	wait_queue_head_t *wq;
> +	void *entry = NULL, **slot;
> +	struct address_space *mapping;
> +	struct wait_exceptional_entry_queue ewait;
> +	struct page *ret = NULL, *page = pfn_to_page(pfn);
> +
> +	rcu_read_lock();
> +	for (;;) {
> +		mapping = READ_ONCE(page->mapping);
> +
> +		if (!mapping || !IS_DAX(mapping->host))
> +			break;
> +
> +		/*
> +		 * In the device-dax case there's no need to lock, a
> +		 * struct dev_pagemap pin is sufficient to keep the
> +		 * inode alive.
> +		 */
> +		inode = mapping->host;
> +		if (S_ISCHR(inode->i_mode)) {
> +			ret = page;
> +			break;
> +		}
> +
> +		xa_lock_irq(&mapping->i_pages);
> +		if (mapping != page->mapping) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			continue;
> +		}
> +		index = page->index;
> +
> +		init_wait(&ewait.wait);
> +		ewait.wait.func = wake_exceptional_entry_func;

This initialization could be before the loop.

> +
> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
> +				&slot);
> +		if (!entry ||
> +		    WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			break;
> +		} else if (!slot_locked(mapping, slot)) {
> +			lock_slot(mapping, slot);
> +			ret = page;
> +			xa_unlock_irq(&mapping->i_pages);
> +			break;
> +		}
> +
> +		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
> +		prepare_to_wait_exclusive(wq, &ewait.wait,
> +				TASK_UNINTERRUPTIBLE);
> +		xa_unlock_irq(&mapping->i_pages);
> +		rcu_read_unlock();
> +		schedule();
> +		finish_wait(wq, &ewait.wait);
> +		rcu_read_lock();
> +	}
> +	rcu_read_unlock();

I don't like how this duplicates a lot of get_unlocked_mapping_entry().
Can we possibly factor this out similary as done for wait_event()?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-08 23:51 ` [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-06-11 15:50   ` Jan Kara
  2018-06-11 16:45     ` Dan Williams
  2018-06-12 20:14   ` Ross Zwisler
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Kara @ 2018-06-11 15:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig,
	Jérôme Glisse, Matthew Wilcox, Naoya Horiguchi,
	Ross Zwisler, linux-mm, linux-fsdevel

On Fri 08-06-18 16:51:19, Dan Williams wrote:
>     mce: Uncorrected hardware memory error in user-access at af34214200
>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>     mce: [Hardware Error]: Machine check events logged
>     {1}[Hardware Error]: event severity: corrected
>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>     [..]
>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>     mce: Memory error not recovered
> 
> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
> dax there is no possibility to map in another page dynamically since dax
> establishes 1:1 physical address to file offset associations. Also
> dev_pagemap pages associated with NVDIMM / persistent memory devices can
> internal remap/repair addresses with poison. While memory_failure()
> assumes that it can discard typical poisoned pages and keep them
> unmapped indefinitely, dev_pagemap pages may be returned to service
> after the error is cleared.
> 
> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
> dev_pagemap pages that have poison consumed by userspace. Mark the
> memory as UC instead of unmapping it completely to allow ongoing access
> via the device driver (nd_pmem). Later, nd_pmem will grow support for
> marking the page back to WB when the error is cleared.

...

> +static unsigned long dax_mapping_size(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +	pgoff_t pgoff = page_to_pgoff(page);
> +	struct vm_area_struct *vma;
> +	unsigned long size = 0;
> +
> +	i_mmap_lock_read(mapping);
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +		unsigned long address = vma_address(page, vma);
> +		pgd_t *pgd;
> +		p4d_t *p4d;
> +		pud_t *pud;
> +		pmd_t *pmd;
> +		pte_t *pte;
> +
> +		pgd = pgd_offset(vma->vm_mm, address);
> +		if (!pgd_present(*pgd))
> +			continue;
> +		p4d = p4d_offset(pgd, address);
> +		if (!p4d_present(*p4d))
> +			continue;
> +		pud = pud_offset(p4d, address);
> +		if (!pud_present(*pud))
> +			continue;
> +		if (pud_devmap(*pud)) {
> +			size = PUD_SIZE;
> +			break;
> +		}
> +		pmd = pmd_offset(pud, address);
> +		if (!pmd_present(*pmd))
> +			continue;
> +		if (pmd_devmap(*pmd)) {
> +			size = PMD_SIZE;
> +			break;
> +		}
> +		pte = pte_offset_map(pmd, address);
> +		if (!pte_present(*pte))
> +			continue;
> +		if (pte_devmap(*pte)) {
> +			size = PAGE_SIZE;
> +			break;
> +		}
> +	}
> +	i_mmap_unlock_read(mapping);
> +
> +	return size;
> +}

Correct me if I'm wrong but cannot the same pfn be mapped by different VMAs
with different granularity? I recall that if we have a fully allocated PMD
entry in the radix tree we can hand out 4k entries from inside of it just
fine... So whether dax_mapping_size() returns 4k or 2MB would be random?
Why don't we use the entry size in the radix tree when we have done all the
work and looked it up there to lock it anyway?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-11 15:50   ` Jan Kara
@ 2018-06-11 16:45     ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-11 16:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Christoph Hellwig, Jérôme Glisse,
	Matthew Wilcox, Naoya Horiguchi, Ross Zwisler, Linux MM,
	linux-fsdevel

On Mon, Jun 11, 2018 at 8:50 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 08-06-18 16:51:19, Dan Williams wrote:
>>     mce: Uncorrected hardware memory error in user-access at af34214200
>>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>>     mce: [Hardware Error]: Machine check events logged
>>     {1}[Hardware Error]: event severity: corrected
>>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>>     [..]
>>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>>     mce: Memory error not recovered
>>
>> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
>> dax there is no possibility to map in another page dynamically since dax
>> establishes 1:1 physical address to file offset associations. Also
>> dev_pagemap pages associated with NVDIMM / persistent memory devices can
>> internal remap/repair addresses with poison. While memory_failure()
>> assumes that it can discard typical poisoned pages and keep them
>> unmapped indefinitely, dev_pagemap pages may be returned to service
>> after the error is cleared.
>>
>> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
>> dev_pagemap pages that have poison consumed by userspace. Mark the
>> memory as UC instead of unmapping it completely to allow ongoing access
>> via the device driver (nd_pmem). Later, nd_pmem will grow support for
>> marking the page back to WB when the error is cleared.
>
> ...
>
>> +static unsigned long dax_mapping_size(struct page *page)
>> +{
>> +     struct address_space *mapping = page->mapping;
>> +     pgoff_t pgoff = page_to_pgoff(page);
>> +     struct vm_area_struct *vma;
>> +     unsigned long size = 0;
>> +
>> +     i_mmap_lock_read(mapping);
>> +     vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> +             unsigned long address = vma_address(page, vma);
>> +             pgd_t *pgd;
>> +             p4d_t *p4d;
>> +             pud_t *pud;
>> +             pmd_t *pmd;
>> +             pte_t *pte;
>> +
>> +             pgd = pgd_offset(vma->vm_mm, address);
>> +             if (!pgd_present(*pgd))
>> +                     continue;
>> +             p4d = p4d_offset(pgd, address);
>> +             if (!p4d_present(*p4d))
>> +                     continue;
>> +             pud = pud_offset(p4d, address);
>> +             if (!pud_present(*pud))
>> +                     continue;
>> +             if (pud_devmap(*pud)) {
>> +                     size = PUD_SIZE;
>> +                     break;
>> +             }
>> +             pmd = pmd_offset(pud, address);
>> +             if (!pmd_present(*pmd))
>> +                     continue;
>> +             if (pmd_devmap(*pmd)) {
>> +                     size = PMD_SIZE;
>> +                     break;
>> +             }
>> +             pte = pte_offset_map(pmd, address);
>> +             if (!pte_present(*pte))
>> +                     continue;
>> +             if (pte_devmap(*pte)) {
>> +                     size = PAGE_SIZE;
>> +                     break;
>> +             }
>> +     }
>> +     i_mmap_unlock_read(mapping);
>> +
>> +     return size;
>> +}
>
> Correct me if I'm wrong but cannot the same pfn be mapped by different VMAs
> with different granularity? I recall that if we have a fully allocated PMD
> entry in the radix tree we can hand out 4k entries from inside of it just
> fine...

Oh, I thought we broke up the 2M entry when that happened.

> So whether dax_mapping_size() returns 4k or 2MB would be random?
> Why don't we use the entry size in the radix tree when we have done all the
> work and looked it up there to lock it anyway?

Device-dax has no use case to populate the radix.

I think this means that we need to track the mapping size in the
memory_failure() path per vma that has the pfn mapped. I'd prefer that
over teaching device-dax to populate the radix, or teaching fs-dax to
break up huge pages when another vma wants 4K.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
  2018-06-11 15:41   ` Jan Kara
@ 2018-06-11 16:48     ` Dan Williams
  2018-06-12 18:07     ` Ross Zwisler
  2018-07-04 15:17     ` Dan Williams
  2 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-11 16:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, Christoph Hellwig, Linux MM, linux-fsdevel

On Mon, Jun 11, 2018 at 8:41 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
>> In preparation for implementing support for memory poison (media error)
>> handling via dax mappings, implement a lock_page() equivalent. Poison
>> error handling requires rmap and needs guarantees that the page->mapping
>> association is maintained / valid (inode not freed) for the duration of
>> the lookup.
>>
>> In the device-dax case it is sufficient to simply hold a dev_pagemap
>> reference. In the filesystem-dax case we need to use the entry lock.
>>
>> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> protect against the inode being freed, and revalidates the page->mapping
>> association under xa_lock().
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Some comments below...
>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index cccf6cad1a7a..b7e71b108fcf 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>>       }
>>  }
>>
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>
> Why do you return struct page here? Any reason behind that? Because struct
> page exists and can be accessed through pfn_to_page() regardless of result
> of this function so it looks a bit confusing. Also dax_lock_page() name
> seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?
>
>> +     pgoff_t index;
>> +     struct inode *inode;
>> +     wait_queue_head_t *wq;
>> +     void *entry = NULL, **slot;
>> +     struct address_space *mapping;
>> +     struct wait_exceptional_entry_queue ewait;
>> +     struct page *ret = NULL, *page = pfn_to_page(pfn);
>> +
>> +     rcu_read_lock();
>> +     for (;;) {
>> +             mapping = READ_ONCE(page->mapping);
>> +
>> +             if (!mapping || !IS_DAX(mapping->host))
>> +                     break;
>> +
>> +             /*
>> +              * In the device-dax case there's no need to lock, a
>> +              * struct dev_pagemap pin is sufficient to keep the
>> +              * inode alive.
>> +              */
>> +             inode = mapping->host;
>> +             if (S_ISCHR(inode->i_mode)) {
>> +                     ret = page;
>> +                     break;
>> +             }
>> +
>> +             xa_lock_irq(&mapping->i_pages);
>> +             if (mapping != page->mapping) {
>> +                     xa_unlock_irq(&mapping->i_pages);
>> +                     continue;
>> +             }
>> +             index = page->index;
>> +
>> +             init_wait(&ewait.wait);
>> +             ewait.wait.func = wake_exceptional_entry_func;
>
> This initialization could be before the loop.
>
>> +
>> +             entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
>> +                             &slot);
>> +             if (!entry ||
>> +                 WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) {
>> +                     xa_unlock_irq(&mapping->i_pages);
>> +                     break;
>> +             } else if (!slot_locked(mapping, slot)) {
>> +                     lock_slot(mapping, slot);
>> +                     ret = page;
>> +                     xa_unlock_irq(&mapping->i_pages);
>> +                     break;
>> +             }
>> +
>> +             wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
>> +             prepare_to_wait_exclusive(wq, &ewait.wait,
>> +                             TASK_UNINTERRUPTIBLE);
>> +             xa_unlock_irq(&mapping->i_pages);
>> +             rcu_read_unlock();
>> +             schedule();
>> +             finish_wait(wq, &ewait.wait);
>> +             rcu_read_lock();
>> +     }
>> +     rcu_read_unlock();
>
> I don't like how this duplicates a lot of get_unlocked_mapping_entry().
> Can we possibly factor this out similary as done for wait_event()?

Ok, I'll give that a shot.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 02/12] device-dax: Cleanup vm_fault de-reference chains
  2018-06-08 23:50 ` [PATCH v4 02/12] device-dax: Cleanup vm_fault de-reference chains Dan Williams
@ 2018-06-11 17:12   ` Laurent Dufour
  2018-06-11 17:14     ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Laurent Dufour @ 2018-06-11 17:12 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

On 09/06/2018 01:50, Dan Williams wrote:
> Define a local 'vma' variable rather than repetitively de-referencing
> the passed in 'struct vm_fault *' instance.

Hi Dan,

Why is this needed ?

I can't see the real benefit, having the vma deferenced from the vm_fault
structure is not obfuscating the code and it eases to follow the use of vmf->vma.

Am I missing something ?

Cheers,
Laurent.

> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/device.c |   30 ++++++++++++++++--------------
>  1 file changed, 16 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index d44d98c54d0f..686de08e120b 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -247,13 +247,14 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
>  static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
>  				struct vm_fault *vmf)
>  {
> +	struct vm_area_struct *vma = vmf->vma;
>  	struct device *dev = &dev_dax->dev;
>  	struct dax_region *dax_region;
>  	phys_addr_t phys;
>  	pfn_t pfn;
>  	unsigned int fault_size = PAGE_SIZE;
> 
> -	if (check_vma(dev_dax, vmf->vma, __func__))
> +	if (check_vma(dev_dax, vma, __func__))
>  		return VM_FAULT_SIGBUS;
> 
>  	dax_region = dev_dax->region;
> @@ -274,13 +275,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
> 
>  	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> 
> -	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> +	return vmf_insert_mixed(vma, vmf->address, pfn);
>  }
> 
>  static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
>  				struct vm_fault *vmf)
>  {
>  	unsigned long pmd_addr = vmf->address & PMD_MASK;
> +	struct vm_area_struct *vma = vmf->vma;
>  	struct device *dev = &dev_dax->dev;
>  	struct dax_region *dax_region;
>  	phys_addr_t phys;
> @@ -288,7 +290,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
>  	pfn_t pfn;
>  	unsigned int fault_size = PMD_SIZE;
> 
> -	if (check_vma(dev_dax, vmf->vma, __func__))
> +	if (check_vma(dev_dax, vma, __func__))
>  		return VM_FAULT_SIGBUS;
> 
>  	dax_region = dev_dax->region;
> @@ -310,11 +312,10 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
>  		return VM_FAULT_FALLBACK;
> 
>  	/* if we are outside of the VMA */
> -	if (pmd_addr < vmf->vma->vm_start ||
> -			(pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
> +	if (pmd_addr < vma->vm_start || (pmd_addr + PMD_SIZE) > vma->vm_end)
>  		return VM_FAULT_SIGBUS;
> 
> -	pgoff = linear_page_index(vmf->vma, pmd_addr);
> +	pgoff = linear_page_index(vma, pmd_addr);
>  	phys = dax_pgoff_to_phys(dev_dax, pgoff, PMD_SIZE);
>  	if (phys == -1) {
>  		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
> @@ -323,7 +324,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
> 
>  	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> 
> -	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
> +	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
>  			vmf->flags & FAULT_FLAG_WRITE);
>  }
> 
> @@ -332,6 +333,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  				struct vm_fault *vmf)
>  {
>  	unsigned long pud_addr = vmf->address & PUD_MASK;
> +	struct vm_area_struct *vma = vmf->vma;
>  	struct device *dev = &dev_dax->dev;
>  	struct dax_region *dax_region;
>  	phys_addr_t phys;
> @@ -340,7 +342,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  	unsigned int fault_size = PUD_SIZE;
> 
> 
> -	if (check_vma(dev_dax, vmf->vma, __func__))
> +	if (check_vma(dev_dax, vma, __func__))
>  		return VM_FAULT_SIGBUS;
> 
>  	dax_region = dev_dax->region;
> @@ -362,11 +364,10 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  		return VM_FAULT_FALLBACK;
> 
>  	/* if we are outside of the VMA */
> -	if (pud_addr < vmf->vma->vm_start ||
> -			(pud_addr + PUD_SIZE) > vmf->vma->vm_end)
> +	if (pud_addr < vma->vm_start || (pud_addr + PUD_SIZE) > vma->vm_end)
>  		return VM_FAULT_SIGBUS;
> 
> -	pgoff = linear_page_index(vmf->vma, pud_addr);
> +	pgoff = linear_page_index(vma, pud_addr);
>  	phys = dax_pgoff_to_phys(dev_dax, pgoff, PUD_SIZE);
>  	if (phys == -1) {
>  		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
> @@ -375,7 +376,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
> 
>  	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> 
> -	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
> +	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
>  			vmf->flags & FAULT_FLAG_WRITE);
>  }
>  #else
> @@ -390,12 +391,13 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>  		enum page_entry_size pe_size)
>  {
>  	int rc, id;
> -	struct file *filp = vmf->vma->vm_file;
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct file *filp = vma->vm_file;
>  	struct dev_dax *dev_dax = filp->private_data;
> 
>  	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
>  			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
> -			vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
> +			vma->vm_start, vma->vm_end, pe_size);
> 
>  	id = dax_read_lock();
>  	switch (pe_size) {
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 02/12] device-dax: Cleanup vm_fault de-reference chains
  2018-06-11 17:12   ` Laurent Dufour
@ 2018-06-11 17:14     ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-11 17:14 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: linux-nvdimm, Christoph Hellwig, Linux MM, linux-fsdevel, Jan Kara

On Mon, Jun 11, 2018 at 10:12 AM, Laurent Dufour
<ldufour@linux.vnet.ibm.com> wrote:
> On 09/06/2018 01:50, Dan Williams wrote:
>> Define a local 'vma' variable rather than repetitively de-referencing
>> the passed in 'struct vm_fault *' instance.
>
> Hi Dan,
>
> Why is this needed ?
>
> I can't see the real benefit, having the vma deferenced from the vm_fault
> structure is not obfuscating the code and it eases to follow the use of vmf->vma.
>
> Am I missing something ?

No, and now that I take another look it's just noise. I'll drop it.

Thanks for the poke.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
  2018-06-11 15:41   ` Jan Kara
  2018-06-11 16:48     ` Dan Williams
@ 2018-06-12 18:07     ` Ross Zwisler
  2018-07-04 15:20       ` Dan Williams
  2018-07-04 15:17     ` Dan Williams
  2 siblings, 1 reply; 26+ messages in thread
From: Ross Zwisler @ 2018-06-12 18:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dan Williams, linux-fsdevel, linux-mm, hch, linux-nvdimm

On Mon, Jun 11, 2018 at 05:41:46PM +0200, Jan Kara wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
> > In preparation for implementing support for memory poison (media error)
> > handling via dax mappings, implement a lock_page() equivalent. Poison
> > error handling requires rmap and needs guarantees that the page->mapping
> > association is maintained / valid (inode not freed) for the duration of
> > the lookup.
> > 
> > In the device-dax case it is sufficient to simply hold a dev_pagemap
> > reference. In the filesystem-dax case we need to use the entry lock.
> > 
> > Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
> > protect against the inode being freed, and revalidates the page->mapping
> > association under xa_lock().
> > 
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> Some comments below...
> 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index cccf6cad1a7a..b7e71b108fcf 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> >  	}
> >  }
> >  
> > +struct page *dax_lock_page(unsigned long pfn)
> > +{
> 
> Why do you return struct page here? Any reason behind that? Because struct
> page exists and can be accessed through pfn_to_page() regardless of result
> of this function so it looks a bit confusing. Also dax_lock_page() name
> seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?

It's also a bit awkward that the functions are asymmetric in their arguments:
dax_lock_page(pfn) vs dax_unlock_page(struct page)

Looking at dax_lock_page(), we only use 'pfn' to get 'page', so maybe it would
be cleaner to just always deal with struct page, i.e.:

void dax_lock_page(struct page *page);
void dax_unlock_page(struct page *page);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
  2018-06-08 23:51 ` [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page() Dan Williams
  2018-06-11 15:41   ` Jan Kara
@ 2018-06-12 18:15   ` Ross Zwisler
  2018-07-04 15:11     ` Dan Williams
  1 sibling, 1 reply; 26+ messages in thread
From: Ross Zwisler @ 2018-06-12 18:15 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, linux-fsdevel, linux-mm, jack, hch

On Fri, Jun 08, 2018 at 04:51:14PM -0700, Dan Williams wrote:
> In preparation for implementing support for memory poison (media error)
> handling via dax mappings, implement a lock_page() equivalent. Poison
> error handling requires rmap and needs guarantees that the page->mapping
> association is maintained / valid (inode not freed) for the duration of
> the lookup.
> 
> In the device-dax case it is sufficient to simply hold a dev_pagemap
> reference. In the filesystem-dax case we need to use the entry lock.
> 
> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
> protect against the inode being freed, and revalidates the page->mapping
> association under xa_lock().
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/dax.c            |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h |   15 ++++++++++
>  2 files changed, 91 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index cccf6cad1a7a..b7e71b108fcf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>  	}
>  }
>  
> +struct page *dax_lock_page(unsigned long pfn)
> +{
> +	pgoff_t index;
> +	struct inode *inode;
> +	wait_queue_head_t *wq;
> +	void *entry = NULL, **slot;
> +	struct address_space *mapping;
> +	struct wait_exceptional_entry_queue ewait;
> +	struct page *ret = NULL, *page = pfn_to_page(pfn);
> +
> +	rcu_read_lock();
> +	for (;;) {
> +		mapping = READ_ONCE(page->mapping);

Why the READ_ONCE()?

> +
> +		if (!mapping || !IS_DAX(mapping->host))

Might read better using the dax_mapping() helper.

Also, forgive my ignorance, but this implies that dev dax has page->mapping
set up and that that inode will have IS_DAX set, right?  This will let us get
past this point for device DAX, and we'll bail out at the S_ISCHR() check?

> +			break;
> +
> +		/*
> +		 * In the device-dax case there's no need to lock, a
> +		 * struct dev_pagemap pin is sufficient to keep the
> +		 * inode alive.
> +		 */
> +		inode = mapping->host;
> +		if (S_ISCHR(inode->i_mode)) {
> +			ret = page;

'ret' isn't actually used for anything in this function, we just
unconditionally return 'page'.

> +			break;
> +		}
> +
> +		xa_lock_irq(&mapping->i_pages);
> +		if (mapping != page->mapping) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			continue;
> +		}
> +		index = page->index;
> +
> +		init_wait(&ewait.wait);
> +		ewait.wait.func = wake_exceptional_entry_func;
> +
> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
> +				&slot);
> +		if (!entry ||

So if we do a lookup and there is no entry in the tree, we won't add an empty
entry and lock it, we'll just return with no entry in the tree and nothing
locked.

Then, when we call dax_unlock_page(), we'll eventually hit a WARN_ON_ONCE() in 
dax_unlock_mapping_entry() when we see entry is 0.  And, in that gap we've got
nothing locked so page faults could have happened, etc... (which would mean
that instead of WARN_ON_ONCE() for an empty entry, we'd hit it instead for an
unlocked entry).

Is that okay?  Or do we need to insert a locked empty entry here?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-08 23:51 ` [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
  2018-06-11 15:50   ` Jan Kara
@ 2018-06-12 20:14   ` Ross Zwisler
  2018-06-12 23:38     ` Dan Williams
  1 sibling, 1 reply; 26+ messages in thread
From: Ross Zwisler @ 2018-06-12 20:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig,
	Jérôme Glisse, Matthew Wilcox, Naoya Horiguchi,
	Ross Zwisler, linux-mm, linux-fsdevel

On Fri, Jun 08, 2018 at 04:51:19PM -0700, Dan Williams wrote:
>     mce: Uncorrected hardware memory error in user-access at af34214200
>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>     mce: [Hardware Error]: Machine check events logged
>     {1}[Hardware Error]: event severity: corrected
>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>     [..]
>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>     mce: Memory error not recovered
> 
> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
> dax there is no possibility to map in another page dynamically since dax
> establishes 1:1 physical address to file offset associations. Also
> dev_pagemap pages associated with NVDIMM / persistent memory devices can
> internal remap/repair addresses with poison. While memory_failure()
> assumes that it can discard typical poisoned pages and keep them
> unmapped indefinitely, dev_pagemap pages may be returned to service
> after the error is cleared.
> 
> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
> dev_pagemap pages that have poison consumed by userspace. Mark the
> memory as UC instead of unmapping it completely to allow ongoing access
> via the device driver (nd_pmem). Later, nd_pmem will grow support for
> marking the page back to WB when the error is cleared.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: J�r�me Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
<>
> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> +		struct dev_pagemap *pgmap)
> +{
> +	const bool unmap_success = true;
> +	unsigned long size;
> +	struct page *page;
> +	LIST_HEAD(tokill);
> +	int rc = -EBUSY;
> +	loff_t start;
> +
> +	/*
> +	 * Prevent the inode from being freed while we are interrogating
> +	 * the address_space, typically this would be handled by
> +	 * lock_page(), but dax pages do not use the page lock.
> +	 */
> +	page = dax_lock_page(pfn);
> +	if (!page)
> +		goto out;
> +
> +	if (hwpoison_filter(page)) {
> +		rc = 0;
> +		goto unlock;
> +	}
> +
> +	switch (pgmap->type) {
> +	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_PUBLIC:
> +		/*
> +		 * TODO: Handle HMM pages which may need coordination
> +		 * with device-side memory.
> +		 */
> +		goto unlock;
> +	default:
> +		break;
> +	}
> +
> +	/*
> +	 * If the page is not mapped in userspace then report it as
> +	 * unhandled.
> +	 */
> +	size = dax_mapping_size(page);
> +	if (!size) {
> +		pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
> +		goto unlock;
> +	}
> +
> +	SetPageHWPoison(page);
> +
> +	/*
> +	 * Unlike System-RAM there is no possibility to swap in a
> +	 * different physical page at a given virtual address, so all
> +	 * userspace consumption of ZONE_DEVICE memory necessitates
> +	 * SIGBUS (i.e. MF_MUST_KILL)
> +	 */
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);

You know "flags & MF_ACTION_REQUIRED" will always be true, so you can just
pass in MF_ACTION_REQUIRED or even just "true".

> +
> +	start = (page->index << PAGE_SHIFT) & ~(size - 1);
> +	unmap_mapping_range(page->mapping, start, start + size, 0);
> +
> +	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),

You know "flags & MF_MUST_KILL" will always be true, so you can just pass in
MF_MUST_KILL or even just "true".

Also, you can get rid of the constant "unmap_success" if you want and just
pass in false as the 3rd argument.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-12 20:14   ` Ross Zwisler
@ 2018-06-12 23:38     ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-06-12 23:38 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, linux-nvdimm, Jan Kara,
	Christoph Hellwig, Jérôme Glisse, Matthew Wilcox,
	Naoya Horiguchi, Linux MM, linux-fsdevel

On Tue, Jun 12, 2018 at 1:14 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Jun 08, 2018 at 04:51:19PM -0700, Dan Williams wrote:
>>     mce: Uncorrected hardware memory error in user-access at af34214200
>>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>>     mce: [Hardware Error]: Machine check events logged
>>     {1}[Hardware Error]: event severity: corrected
>>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>>     [..]
>>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>>     mce: Memory error not recovered
>>
>> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
>> dax there is no possibility to map in another page dynamically since dax
>> establishes 1:1 physical address to file offset associations. Also
>> dev_pagemap pages associated with NVDIMM / persistent memory devices can
>> internal remap/repair addresses with poison. While memory_failure()
>> assumes that it can discard typical poisoned pages and keep them
>> unmapped indefinitely, dev_pagemap pages may be returned to service
>> after the error is cleared.
>>
>> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
>> dev_pagemap pages that have poison consumed by userspace. Mark the
>> memory as UC instead of unmapping it completely to allow ongoing access
>> via the device driver (nd_pmem). Later, nd_pmem will grow support for
>> marking the page back to WB when the error is cleared.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Jérôme Glisse <jglisse@redhat.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
> <>
>> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> +             struct dev_pagemap *pgmap)
>> +{
>> +     const bool unmap_success = true;
>> +     unsigned long size;
>> +     struct page *page;
>> +     LIST_HEAD(tokill);
>> +     int rc = -EBUSY;
>> +     loff_t start;
>> +
>> +     /*
>> +      * Prevent the inode from being freed while we are interrogating
>> +      * the address_space, typically this would be handled by
>> +      * lock_page(), but dax pages do not use the page lock.
>> +      */
>> +     page = dax_lock_page(pfn);
>> +     if (!page)
>> +             goto out;
>> +
>> +     if (hwpoison_filter(page)) {
>> +             rc = 0;
>> +             goto unlock;
>> +     }
>> +
>> +     switch (pgmap->type) {
>> +     case MEMORY_DEVICE_PRIVATE:
>> +     case MEMORY_DEVICE_PUBLIC:
>> +             /*
>> +              * TODO: Handle HMM pages which may need coordination
>> +              * with device-side memory.
>> +              */
>> +             goto unlock;
>> +     default:
>> +             break;
>> +     }
>> +
>> +     /*
>> +      * If the page is not mapped in userspace then report it as
>> +      * unhandled.
>> +      */
>> +     size = dax_mapping_size(page);
>> +     if (!size) {
>> +             pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
>> +             goto unlock;
>> +     }
>> +
>> +     SetPageHWPoison(page);
>> +
>> +     /*
>> +      * Unlike System-RAM there is no possibility to swap in a
>> +      * different physical page at a given virtual address, so all
>> +      * userspace consumption of ZONE_DEVICE memory necessitates
>> +      * SIGBUS (i.e. MF_MUST_KILL)
>> +      */
>> +     flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>> +     collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
>
> You know "flags & MF_ACTION_REQUIRED" will always be true, so you can just
> pass in MF_ACTION_REQUIRED or even just "true".
>
>> +
>> +     start = (page->index << PAGE_SHIFT) & ~(size - 1);
>> +     unmap_mapping_range(page->mapping, start, start + size, 0);
>> +
>> +     kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),
>
> You know "flags & MF_MUST_KILL" will always be true, so you can just pass in
> MF_MUST_KILL or even just "true".
>
> Also, you can get rid of the constant "unmap_success" if you want and just
> pass in false as the 3rd argument.

I don't like reading "true" and "false" as arguments to functions,
because the immediate next question is "what does true mean"? I could
just pass MF_MUST_KILL and MF_ACTION_REQUIRED directly, but was trying
to keep some consistency with other callers in that file.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
  2018-06-12 18:15   ` Ross Zwisler
@ 2018-07-04 15:11     ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-07-04 15:11 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, linux-nvdimm, linux-fsdevel,
	Linux MM, Jan Kara, Christoph Hellwig

On Tue, Jun 12, 2018 at 11:15 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Jun 08, 2018 at 04:51:14PM -0700, Dan Williams wrote:
>> In preparation for implementing support for memory poison (media error)
>> handling via dax mappings, implement a lock_page() equivalent. Poison
>> error handling requires rmap and needs guarantees that the page->mapping
>> association is maintained / valid (inode not freed) for the duration of
>> the lookup.
>>
>> In the device-dax case it is sufficient to simply hold a dev_pagemap
>> reference. In the filesystem-dax case we need to use the entry lock.
>>
>> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> protect against the inode being freed, and revalidates the page->mapping
>> association under xa_lock().
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/dax.c            |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/dax.h |   15 ++++++++++
>>  2 files changed, 91 insertions(+)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index cccf6cad1a7a..b7e71b108fcf 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>>       }
>>  }
>>
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>> +     pgoff_t index;
>> +     struct inode *inode;
>> +     wait_queue_head_t *wq;
>> +     void *entry = NULL, **slot;
>> +     struct address_space *mapping;
>> +     struct wait_exceptional_entry_queue ewait;
>> +     struct page *ret = NULL, *page = pfn_to_page(pfn);
>> +
>> +     rcu_read_lock();
>> +     for (;;) {
>> +             mapping = READ_ONCE(page->mapping);
>
> Why the READ_ONCE()?

We're potentially racing inode teardown, so the READ_ONCE() prevents
the compiler from trying to de-reference page->mapping twice and
getting inconsistent answers.

>
>> +
>> +             if (!mapping || !IS_DAX(mapping->host))
>
> Might read better using the dax_mapping() helper.

Sure.

>
> Also, forgive my ignorance, but this implies that dev dax has page->mapping
> set up and that that inode will have IS_DAX set, right?  This will let us get
> past this point for device DAX, and we'll bail out at the S_ISCHR() check?

Yes.

>
>> +                     break;
>> +
>> +             /*
>> +              * In the device-dax case there's no need to lock, a
>> +              * struct dev_pagemap pin is sufficient to keep the
>> +              * inode alive.
>> +              */
>> +             inode = mapping->host;
>> +             if (S_ISCHR(inode->i_mode)) {
>> +                     ret = page;
>
> 'ret' isn't actually used for anything in this function, we just
> unconditionally return 'page'.
>

Yes, bug.

>> +                     break;
>> +             }
>> +
>> +             xa_lock_irq(&mapping->i_pages);
>> +             if (mapping != page->mapping) {
>> +                     xa_unlock_irq(&mapping->i_pages);
>> +                     continue;
>> +             }
>> +             index = page->index;
>> +
>> +             init_wait(&ewait.wait);
>> +             ewait.wait.func = wake_exceptional_entry_func;
>> +
>> +             entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
>> +                             &slot);
>> +             if (!entry ||
>
> So if we do a lookup and there is no entry in the tree, we won't add an empty
> entry and lock it, we'll just return with no entry in the tree and nothing
> locked.
>
> Then, when we call dax_unlock_page(), we'll eventually hit a WARN_ON_ONCE() in
> dax_unlock_mapping_entry() when we see entry is 0.  And, in that gap we've got
> nothing locked so page faults could have happened, etc... (which would mean
> that instead of WARN_ON_ONCE() for an empty entry, we'd hit it instead for an
> unlocked entry).
>
> Is that okay?  Or do we need to insert a locked empty entry here?

No, the intent was to return NULL and fail the lock, but I messed up
and unconditionally returned the page.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
  2018-06-11 15:41   ` Jan Kara
  2018-06-11 16:48     ` Dan Williams
  2018-06-12 18:07     ` Ross Zwisler
@ 2018-07-04 15:17     ` Dan Williams
  2 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-07-04 15:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, Christoph Hellwig, Linux MM, linux-fsdevel

On Mon, Jun 11, 2018 at 8:41 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
>> In preparation for implementing support for memory poison (media error)
>> handling via dax mappings, implement a lock_page() equivalent. Poison
>> error handling requires rmap and needs guarantees that the page->mapping
>> association is maintained / valid (inode not freed) for the duration of
>> the lookup.
>>
>> In the device-dax case it is sufficient to simply hold a dev_pagemap
>> reference. In the filesystem-dax case we need to use the entry lock.
>>
>> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> protect against the inode being freed, and revalidates the page->mapping
>> association under xa_lock().
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Some comments below...
>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index cccf6cad1a7a..b7e71b108fcf 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>>       }
>>  }
>>
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>
> Why do you return struct page here? Any reason behind that?

Unlike lock_page() there is no guarantee that we can lock a mapping
entry given a pfn. There is a chance that we lose a race and can't
validate the pfn to take the lock. So returning 'struct page *' was
there to indicate that we successfully validated the pfn and were able
to take the lock. I'll rework it to just return bool.

> Because struct
> page exists and can be accessed through pfn_to_page() regardless of result
> of this function so it looks a bit confusing. Also dax_lock_page() name
> seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?

Ok.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
  2018-06-12 18:07     ` Ross Zwisler
@ 2018-07-04 15:20       ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-07-04 15:20 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, Dan Williams, linux-fsdevel, Linux MM,
	Christoph Hellwig, linux-nvdimm

On Tue, Jun 12, 2018 at 11:07 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Jun 11, 2018 at 05:41:46PM +0200, Jan Kara wrote:
>> On Fri 08-06-18 16:51:14, Dan Williams wrote:
>> > In preparation for implementing support for memory poison (media error)
>> > handling via dax mappings, implement a lock_page() equivalent. Poison
>> > error handling requires rmap and needs guarantees that the page->mapping
>> > association is maintained / valid (inode not freed) for the duration of
>> > the lookup.
>> >
>> > In the device-dax case it is sufficient to simply hold a dev_pagemap
>> > reference. In the filesystem-dax case we need to use the entry lock.
>> >
>> > Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> > protect against the inode being freed, and revalidates the page->mapping
>> > association under xa_lock().
>> >
>> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>
>> Some comments below...
>>
>> > diff --git a/fs/dax.c b/fs/dax.c
>> > index cccf6cad1a7a..b7e71b108fcf 100644
>> > --- a/fs/dax.c
>> > +++ b/fs/dax.c
>> > @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>> >     }
>> >  }
>> >
>> > +struct page *dax_lock_page(unsigned long pfn)
>> > +{
>>
>> Why do you return struct page here? Any reason behind that? Because struct
>> page exists and can be accessed through pfn_to_page() regardless of result
>> of this function so it looks a bit confusing. Also dax_lock_page() name
>> seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?
>
> It's also a bit awkward that the functions are asymmetric in their arguments:
> dax_lock_page(pfn) vs dax_unlock_page(struct page)
>
> Looking at dax_lock_page(), we only use 'pfn' to get 'page', so maybe it would
> be cleaner to just always deal with struct page, i.e.:
>
> void dax_lock_page(struct page *page);
> void dax_unlock_page(struct page *page);

No, intent was to have the locking routine return the object that it
validated and then deal with that same object at unlock.
dax_lock_page() can fail to acquire a lock.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-07-04 15:20 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-08 23:50 [PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
2018-06-08 23:50 ` [PATCH v4 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
2018-06-08 23:50 ` [PATCH v4 02/12] device-dax: Cleanup vm_fault de-reference chains Dan Williams
2018-06-11 17:12   ` Laurent Dufour
2018-06-11 17:14     ` Dan Williams
2018-06-08 23:50 ` [PATCH v4 03/12] device-dax: Enable page_mapping() Dan Williams
2018-06-08 23:50 ` [PATCH v4 04/12] device-dax: Set page->index Dan Williams
2018-06-08 23:50 ` [PATCH v4 05/12] filesystem-dax: " Dan Williams
2018-06-08 23:50 ` [PATCH v4 06/12] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
2018-06-08 23:50 ` [PATCH v4 07/12] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
2018-06-08 23:51 ` [PATCH v4 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
2018-06-08 23:51 ` [PATCH v4 09/12] mm, memory_failure: Pass page size to kill_proc() Dan Williams
2018-06-08 23:51 ` [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page() Dan Williams
2018-06-11 15:41   ` Jan Kara
2018-06-11 16:48     ` Dan Williams
2018-06-12 18:07     ` Ross Zwisler
2018-07-04 15:20       ` Dan Williams
2018-07-04 15:17     ` Dan Williams
2018-06-12 18:15   ` Ross Zwisler
2018-07-04 15:11     ` Dan Williams
2018-06-08 23:51 ` [PATCH v4 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
2018-06-11 15:50   ` Jan Kara
2018-06-11 16:45     ` Dan Williams
2018-06-12 20:14   ` Ross Zwisler
2018-06-12 23:38     ` Dan Williams
2018-06-08 23:51 ` [PATCH v4 12/12] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).