linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages
@ 2018-06-04 23:11 Dan Williams
  2018-06-04 23:11 ` [PATCH v3 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
                   ` (11 more replies)
  0 siblings, 12 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:11 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-edac, Tony Luck, Borislav Petkov, Jérôme Glisse,
	Jan Kara, H. Peter Anvin, x86, Thomas Gleixner,
	Christoph Hellwig, Ross Zwisler, Matthew Wilcox, Ingo Molnar,
	Michal Hocko, Naoya Horiguchi, Souptick Joarder, linux-mm,
	linux-fsdevel, jack

Note: With this repost this series is now being actively revised during
the 4.18 merge window. Just holler if you want this to stop and wait for
4.19. Otherwise, this series completes the poison handling story for
pmem + dax with the other memcpy_mcsafe() changes that are going in this
cycle.

---

Changes since v2 [1]:
* Fix compilation failure in "device-dax: Enable page_mapping()". Update
  the function signature for __dev_dax_pud_fault() in the
  CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=n case. (0day Kbuild Robot)

* Prepare {reserve,free}_memtype() for decoy addresses

* Fix the set_mce_nospec() helper to use a decoy address. (Tony)

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016120.html

---

As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.

In order to support reliable reverse mapping of user space addresses:

1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.

2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.

3/ Given pmem errors can be repaired, change the speculatively accessed
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.

A side effect of this enabling is that MADV_HWPOISON becomes usable for
dax mappings, however the primary motivation is to allow the system to
survive userspace consumption of hardware-poison via dax. Specifically
the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered

...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

---

Dan Williams (12):
      device-dax: Convert to vmf_insert_mixed and vm_fault_t
      device-dax: Cleanup vm_fault de-reference chains
      device-dax: Enable page_mapping()
      device-dax: Set page->index
      filesystem-dax: Set page->index
      mm, madvise_inject_error: Let memory_failure() optionally take a page reference
      x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses
      x86/memory_failure: Introduce {set,clear}_mce_nospec()
      mm, memory_failure: Pass page size to kill_proc()
      mm, memory_failure: Fix page->mapping assumptions relative to the page lock
      mm, memory_failure: Teach memory_failure() about dev_pagemap pages
      libnvdimm, pmem: Restore page attributes when clearing errors


 arch/x86/include/asm/set_memory.h         |   42 ++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 --
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 -----
 arch/x86/mm/pat.c                         |   16 ++
 drivers/dax/device.c                      |   97 ++++++++-----
 drivers/nvdimm/pmem.c                     |   26 ++++
 drivers/nvdimm/pmem.h                     |   13 ++
 fs/dax.c                                  |   16 ++
 include/linux/huge_mm.h                   |    5 -
 include/linux/mm.h                        |    1 
 include/linux/set_memory.h                |   14 ++
 mm/huge_memory.c                          |    4 -
 mm/madvise.c                              |   18 ++
 mm/memory-failure.c                       |  209 ++++++++++++++++++++++++++---
 14 files changed, 395 insertions(+), 119 deletions(-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
@ 2018-06-04 23:11 ` Dan Williams
  2018-06-04 23:11 ` [PATCH v3 02/12] device-dax: Cleanup vm_fault de-reference chains Dan Williams
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:11 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Souptick Joarder, Matthew Wilcox, Ross Zwisler, hch, linux-mm,
	linux-fsdevel, jack

Use new return type vm_fault_t for fault and huge_fault handler. For
now, this is just documenting that the function returns a VM_FAULT value
rather than an errno.  Once all instances are converted, vm_fault_t will
become a distinct type.

Commit 1c8f422059ae ("mm: change return type to vm_fault_t")

Previously vm_insert_mixed() returned an error code which driver mapped into
VM_FAULT_* type. The new function vmf_insert_mixed() will replace this
inefficiency by returning VM_FAULT_* type.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c    |   26 +++++++++++---------------
 include/linux/huge_mm.h |    5 +++--
 mm/huge_memory.c        |    4 ++--
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index aff2c1594220..d44d98c54d0f 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -244,11 +244,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
-	int rc = VM_FAULT_SIGBUS;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
@@ -274,17 +274,11 @@ static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
-
-	if (rc == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (rc < 0 && rc != -EBUSY)
-		return VM_FAULT_SIGBUS;
-
-	return VM_FAULT_NOPAGE;
+	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
 
-static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -334,7 +328,8 @@ static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -384,13 +379,14 @@ static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	return VM_FAULT_FALLBACK;
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-static int dev_dax_huge_fault(struct vm_fault *vmf,
+static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
@@ -420,7 +416,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
 	return rc;
 }
 
-static int dev_dax_fault(struct vm_fault *vmf)
+static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
 {
 	return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
 }
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a126259bc4..d3bbf6bea9e9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
 #define _LINUX_HUGE_MM_H
 
 #include <linux/sched/coredump.h>
+#include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
@@ -46,9 +47,9 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a3a1815f8e11..6af976472a5d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -755,7 +755,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
@@ -815,7 +815,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 02/12] device-dax: Cleanup vm_fault de-reference chains
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-06-04 23:11 ` [PATCH v3 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
@ 2018-06-04 23:11 ` Dan Williams
  2018-06-04 23:11 ` [PATCH v3 03/12] device-dax: Enable page_mapping() Dan Williams
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:11 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

Define a local 'vma' variable rather than repetitively de-referencing
the passed in 'struct vm_fault *' instance.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index d44d98c54d0f..686de08e120b 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -247,13 +247,14 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -274,13 +275,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	return vmf_insert_mixed(vma, vmf->address, pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
@@ -288,7 +290,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -310,11 +312,10 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
-	if (pmd_addr < vmf->vma->vm_start ||
-			(pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
+	if (pmd_addr < vma->vm_start || (pmd_addr + PMD_SIZE) > vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	pgoff = linear_page_index(vmf->vma, pmd_addr);
+	pgoff = linear_page_index(vma, pmd_addr);
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, PMD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -323,7 +324,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
@@ -332,6 +333,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
@@ -340,7 +342,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	unsigned int fault_size = PUD_SIZE;
 
 
-	if (check_vma(dev_dax, vmf->vma, __func__))
+	if (check_vma(dev_dax, vma, __func__))
 		return VM_FAULT_SIGBUS;
 
 	dax_region = dev_dax->region;
@@ -362,11 +364,10 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
-	if (pud_addr < vmf->vma->vm_start ||
-			(pud_addr + PUD_SIZE) > vmf->vma->vm_end)
+	if (pud_addr < vma->vm_start || (pud_addr + PUD_SIZE) > vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	pgoff = linear_page_index(vmf->vma, pud_addr);
+	pgoff = linear_page_index(vma, pud_addr);
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, PUD_SIZE);
 	if (phys == -1) {
 		dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -375,7 +376,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -390,12 +391,13 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
-	struct file *filp = vmf->vma->vm_file;
+	struct vm_area_struct *vma = vmf->vma;
+	struct file *filp = vma->vm_file;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
 			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
-			vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
+			vma->vm_start, vma->vm_end, pe_size);
 
 	id = dax_read_lock();
 	switch (pe_size) {

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 03/12] device-dax: Enable page_mapping()
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-06-04 23:11 ` [PATCH v3 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
  2018-06-04 23:11 ` [PATCH v3 02/12] device-dax: Cleanup vm_fault de-reference chains Dan Williams
@ 2018-06-04 23:11 ` Dan Williams
  2018-06-07 15:39   ` Jan Kara
  2018-06-04 23:12 ` [PATCH v3 04/12] device-dax: Set page->index Dan Williams
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:11 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for device-dax
mappings, set the ->mapping association of pages backing device-dax
mappings. The rmap implementation requires page_mapping() to return the
address_space hosting the vmas that map the page.

The ->mapping pointer is never cleared. There is no possibility for the
page to become associated with another address_space while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed / later reinitialized to zero.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   55 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 38 insertions(+), 17 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 686de08e120b..7ec246549721 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -245,13 +245,12 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
-	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
 	if (check_vma(dev_dax, vma, __func__))
@@ -273,13 +272,13 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vma, vmf->address, pfn);
+	return vmf_insert_mixed(vma, vmf->address, *pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
@@ -287,7 +286,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
 	if (check_vma(dev_dax, vma, __func__))
@@ -322,15 +320,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
@@ -338,7 +336,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
 
@@ -374,14 +371,14 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	return VM_FAULT_FALLBACK;
 }
@@ -390,9 +387,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
-	int rc, id;
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *filp = vma->vm_file;
+	unsigned long fault_size;
+	int rc, id;
+	pfn_t pfn;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
@@ -402,17 +401,39 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	id = dax_read_lock();
 	switch (pe_size) {
 	case PE_SIZE_PTE:
-		rc = __dev_dax_pte_fault(dev_dax, vmf);
+		fault_size = PAGE_SIZE;
+		rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PMD:
-		rc = __dev_dax_pmd_fault(dev_dax, vmf);
+		fault_size = PMD_SIZE;
+		rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PUD:
-		rc = __dev_dax_pud_fault(dev_dax, vmf);
+		fault_size = PUD_SIZE;
+		rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
 		break;
 	default:
 		rc = VM_FAULT_SIGBUS;
 	}
+
+	if (rc == VM_FAULT_NOPAGE) {
+		unsigned long i;
+
+		/*
+		 * In the device-dax case the only possibility for a
+		 * VM_FAULT_NOPAGE result is when device-dax capacity is
+		 * mapped. No need to consider the zero page, or racing
+		 * conflicting mappings.
+		 */
+		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+			struct page *page;
+
+			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+			if (page->mapping)
+				continue;
+			page->mapping = filp->f_mapping;
+		}
+	}
 	dax_read_unlock(id);
 
 	return rc;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 04/12] device-dax: Set page->index
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (2 preceding siblings ...)
  2018-06-04 23:11 ` [PATCH v3 03/12] device-dax: Enable page_mapping() Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  2018-06-07 15:50   ` Jan Kara
  2018-06-04 23:12 ` [PATCH v3 05/12] filesystem-dax: " Dan Williams
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for device-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree.

The ->index value is never cleared. There is no possibility for the
page to become associated with another pgoff while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed and ->index is reinitialized to zero.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 7ec246549721..2b120e397e08 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -418,6 +418,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 
 	if (rc == VM_FAULT_NOPAGE) {
 		unsigned long i;
+		pgoff_t pgoff;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -425,6 +426,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
+		pgoff = linear_page_index(vma, vmf->address
+				& ~(fault_size - 1));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
@@ -432,6 +435,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 			if (page->mapping)
 				continue;
 			page->mapping = filp->f_mapping;
+			page->index = pgoff + i;
 		}
 	}
 	dax_read_unlock(id);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 05/12] filesystem-dax: Set page->index
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (3 preceding siblings ...)
  2018-06-04 23:12 ` [PATCH v3 04/12] device-dax: Set page->index Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  2018-06-07 16:02   ` Jan Kara
  2018-06-04 23:12 ` [PATCH v3 06/12] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	linux-mm, linux-fsdevel, jack

In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index aaec72ded1b6..cccf6cad1a7a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,18 +319,27 @@ static unsigned long dax_radix_end_pfn(void *entry)
 	for (pfn = dax_radix_pfn(entry); \
 			pfn < dax_radix_end_pfn(entry); pfn++)
 
-static void dax_associate_entry(void *entry, struct address_space *mapping)
+/*
+ * TODO: for reflink+dax we need a way to associate a single page with
+ * multiple address_space instances at different linear_page_index()
+ * offsets.
+ */
+static void dax_associate_entry(void *entry, struct address_space *mapping,
+		struct vm_area_struct *vma, unsigned long address)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn, index;
+	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
+	index = linear_page_index(vma, address & ~(size - 1));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(page->mapping);
 		page->mapping = mapping;
+		page->index = index + i++;
 	}
 }
 
@@ -348,6 +357,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
+		page->index = 0;
 	}
 }
 
@@ -604,7 +614,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	new_entry = dax_radix_locked_entry(pfn, flags);
 	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping);
+		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 06/12] mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (4 preceding siblings ...)
  2018-06-04 23:12 ` [PATCH v3 05/12] filesystem-dax: " Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  2018-06-04 23:12 ` [PATCH v3 07/12] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Naoya Horiguchi, hch, linux-mm, linux-fsdevel, jack

The madvise_inject_error() routine uses get_user_pages() to lookup the
pfn and other information for injected error, but it does not release
that pin. The assumption is that failed pages should be taken out of
circulation.

However, for dax mappings it is not possible to take pages out of
circulation since they are 1:1 physically mapped as filesystem blocks,
or device-dax capacity. They also typically represent persistent memory
which has an error clearing capability.

In preparation for adding a special handler for dax mappings, shift the
responsibility of taking the page reference to memory_failure(). I.e.
drop the page reference and do not specify MF_COUNT_INCREASED to
memory_failure().

Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/madvise.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..b731933dddae 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
 
 
 	for (; start < end; start += PAGE_SIZE << order) {
+		unsigned long pfn;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		pfn = page_to_pfn(page);
 
 		/*
 		 * When soft offlining hugepages, after migrating the page
@@ -651,17 +653,27 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+					pfn, start);
 
 			ret = soft_offline_page(page, MF_COUNT_INCREASED);
 			if (ret)
 				return ret;
 			continue;
 		}
+
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+				pfn, start);
+
+		ret = memory_failure(pfn, 0);
+
+		/*
+		 * Drop the page reference taken by get_user_pages_fast(). In
+		 * the absence of MF_COUNT_INCREASED the memory_failure()
+		 * routine is responsible for pinning the page to prevent it
+		 * from being released back to the page allocator.
+		 */
+		put_page(page);
 
-		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
 		if (ret)
 			return ret;
 	}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 07/12] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (5 preceding siblings ...)
  2018-06-04 23:12 ` [PATCH v3 06/12] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  2018-06-04 23:12 ` [PATCH v3 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck,
	Borislav Petkov, linux-edac, x86, hch, linux-mm, linux-fsdevel,
	jack

In preparation for using set_memory_uc() instead set_memory_np() for
isolating poison from speculation, teach the memtype code to sanitize
physical addresses vs __PHYSICAL_MASK.

The motivation for using set_memory_uc() for this case is to allow
ongoing access to persistent memory pages via the pmem-driver +
memcpy_mcsafe() until the poison is repaired.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/pat.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 1555bd7d3449..6788ffa990f8 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -512,6 +512,17 @@ static int free_ram_pages_type(u64 start, u64 end)
 	return 0;
 }
 
+static u64 sanitize_phys(u64 address)
+{
+	/*
+	 * When changing the memtype for pages containing poison allow
+	 * for a "decoy" virtual address (bit 63 clear) passed to
+	 * set_memory_X(). __pa() on a "decoy" address results in a
+	 * physical address with it 63 set.
+	 */
+	return address & __PHYSICAL_MASK;
+}
+
 /*
  * req_type typically has one of the:
  * - _PAGE_CACHE_MODE_WB
@@ -533,6 +544,8 @@ int reserve_memtype(u64 start, u64 end, enum page_cache_mode req_type,
 	int is_range_ram;
 	int err = 0;
 
+	start = sanitize_phys(start);
+	end = sanitize_phys(end);
 	BUG_ON(start >= end); /* end is exclusive */
 
 	if (!pat_enabled()) {
@@ -609,6 +622,9 @@ int free_memtype(u64 start, u64 end)
 	if (!pat_enabled())
 		return 0;
 
+	start = sanitize_phys(start);
+	end = sanitize_phys(end);
+
 	/* Low ISA region is always mapped WB. No need to track */
 	if (x86_platform.is_untracked_pat_range(start, end))
 		return 0;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (6 preceding siblings ...)
  2018-06-04 23:12 ` [PATCH v3 07/12] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  2018-06-07  4:42   ` Dan Williams
  2018-06-04 23:12 ` [PATCH v3 09/12] mm, memory_failure: Pass page size to kill_proc() Dan Williams
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck,
	Borislav Petkov, linux-edac, x86, hch, linux-mm, linux-fsdevel,
	jack

Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.

Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_pfn() to set_mce_nospec() remapping the page as
UC, to hide it from speculative accesses.

Given that that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/set_memory.h         |   42 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ----------
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 ++------------------------
 include/linux/set_memory.h                |   14 ++++++++++
 4 files changed, 59 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index bd090367236c..cf5e9124b45e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -88,4 +88,46 @@ extern int kernel_set_to_readonly;
 void set_kernel_text_rw(void);
 void set_kernel_text_ro(void);
 
+#ifdef CONFIG_X86_64
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	unsigned long decoy_addr;
+	int rc;
+
+	/*
+	 * Mark the linear address as UC to make sure we don't log more
+	 * errors because of speculative access to the page.
+	 * We would like to just call:
+	 *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+	 * but doing that would radically increase the odds of a
+	 * speculative access to the poison page because we'd have
+	 * the virtual address of the kernel 1:1 mapping sitting
+	 * around in registers.
+	 * Instead we get tricky.  We create a non-canonical address
+	 * that looks just like the one we want, but has bit 63 flipped.
+	 * This relies on set_memory_uc() properly sanitizing any __pa()
+	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
+	 */
+	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+
+	rc = set_memory_uc(decoy_addr, 1);
+	if (rc)
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+	return rc;
+}
+#define set_mce_nospec set_mce_nospec
+
+/* Restore full speculative operation to the pfn. */
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+}
+#define clear_mce_nospec clear_mce_nospec
+#else
+/*
+ * Few people would run a 32-bit kernel on a machine that supports
+ * recoverable errors because they have too much memory to boot 32-bit.
+ */
+#endif
+
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 374d1aa66952..ceb67cd5918f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -113,21 +113,6 @@ static inline void mce_register_injector_chain(struct notifier_block *nb)	{ }
 static inline void mce_unregister_injector_chain(struct notifier_block *nb)	{ }
 #endif
 
-#ifndef CONFIG_X86_64
-/*
- * On 32-bit systems it would be difficult to safely unmap a poison page
- * from the kernel 1:1 map because there are no non-canonical addresses that
- * we can use to refer to the address without risking a speculative access.
- * However, this isn't much of an issue because:
- * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
- *    are only mapped into the kernel as needed
- * 2) Few people would run a 32-bit kernel on a machine that supports
- *    recoverable errors because they have too much memory to boot 32-bit.
- */
-static inline void mce_unmap_kpfn(unsigned long pfn) {}
-#define mce_unmap_kpfn mce_unmap_kpfn
-#endif
-
 struct mca_config {
 	bool dont_log_ce;
 	bool cmci_disabled;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 42cf2880d0ed..a0fbf0a8b7e6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -42,6 +42,7 @@
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/jump_label.h>
+#include <linux/set_memory.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -50,7 +51,6 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
-#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
@@ -108,10 +108,6 @@ static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn);
-#endif
-
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -602,7 +598,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 	if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
 		pfn = mce->addr >> PAGE_SHIFT;
 		if (!memory_failure(pfn, 0))
-			mce_unmap_kpfn(pfn);
+			set_mce_nospec(pfn);
 	}
 
 	return NOTIFY_OK;
@@ -1070,38 +1066,10 @@ static int do_memory_failure(struct mce *m)
 	if (ret)
 		pr_err("Memory error not recovered");
 	else
-		mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
+		set_mce_nospec(m->addr >> PAGE_SHIFT);
 	return ret;
 }
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn)
-{
-	unsigned long decoy_addr;
-
-	/*
-	 * Unmap this page from the kernel 1:1 mappings to make sure
-	 * we don't log more errors because of speculative access to
-	 * the page.
-	 * We would like to just call:
-	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
-	 * but doing that would radically increase the odds of a
-	 * speculative access to the poison page because we'd have
-	 * the virtual address of the kernel 1:1 mapping sitting
-	 * around in registers.
-	 * Instead we get tricky.  We create a non-canonical address
-	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_np() not checking whether we passed
-	 * a legal address.
-	 */
-
-	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
-	if (set_memory_np(decoy_addr, 1))
-		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-}
-#endif
-
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index da5178216da5..2a986d282a97 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -17,6 +17,20 @@ static inline int set_memory_x(unsigned long addr,  int numpages) { return 0; }
 static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
+#ifndef set_mce_nospec
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
+#ifndef clear_mce_nospec
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
 #ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
 static inline int set_memory_encrypted(unsigned long addr, int numpages)
 {

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 09/12] mm, memory_failure: Pass page size to kill_proc()
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (7 preceding siblings ...)
  2018-06-04 23:12 ` [PATCH v3 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  2018-06-04 23:12 ` [PATCH v3 10/12] mm, memory_failure: Fix page->mapping assumptions relative to the page lock Dan Williams
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Naoya Horiguchi, hch, linux-mm, linux-fsdevel, jack

Given that ZONE_DEVICE / dev_pagemap pages are never assembled into
compound pages, the size determination logic in kill_proc() needs
updating for the dev_pagemap case. In preparation for dev_pagemap
support rework memory_failure() and kill_proc() to pass / consume the page
size explicitly.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory-failure.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9d142b9b86dc..42a193ee14d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -179,18 +179,16 @@ EXPORT_SYMBOL_GPL(hwpoison_filter);
  * ``action required'' if error happened in current execution context
  */
 static int kill_proc(struct task_struct *t, unsigned long addr,
-			unsigned long pfn, struct page *page, int flags)
+			unsigned long pfn, unsigned size_shift, int flags)
 {
-	short addr_lsb;
 	int ret;
 
 	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
 		pfn, t->comm, t->pid);
-	addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
 
 	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
 		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
-				       addr_lsb, current);
+				       size_shift, current);
 	} else {
 		/*
 		 * Don't use force here, it's convenient if the signal
@@ -199,7 +197,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
 		 * to SIG_IGN, but hopefully no one will do that?
 		 */
 		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
-				      addr_lsb, t);  /* synchronous? */
+				      size_shift, t);  /* synchronous? */
 	}
 	if (ret < 0)
 		pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
@@ -318,7 +316,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
  * wrong earlier.
  */
 static void kill_procs(struct list_head *to_kill, int forcekill,
-			  bool fail, struct page *page, unsigned long pfn,
+			  bool fail, unsigned size_shift, unsigned long pfn,
 			  int flags)
 {
 	struct to_kill *tk, *next;
@@ -343,7 +341,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 * process anyways.
 			 */
 			else if (kill_proc(tk->tsk, tk->addr,
-					      pfn, page, flags) < 0)
+					      pfn, size_shift, flags) < 0)
 				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
 				       pfn, tk->tsk->comm, tk->tsk->pid);
 		}
@@ -928,6 +926,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	bool unmap_success;
+	unsigned size_shift;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
@@ -1012,7 +1011,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * any accesses to the poisoned memory.
 	 */
 	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
-	kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
+	size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
+	kill_procs(&tokill, forcekill, !unmap_success, size_shift, pfn, flags);
 
 	return unmap_success;
 }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 10/12] mm, memory_failure: Fix page->mapping assumptions relative to the page lock
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (8 preceding siblings ...)
  2018-06-04 23:12 ` [PATCH v3 09/12] mm, memory_failure: Pass page size to kill_proc() Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  2018-06-04 23:12 ` [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
  2018-06-04 23:12 ` [PATCH v3 12/12] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
  11 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

The current memory_failure() implementation assumes that lock_page() is
sufficient for stabilizing page->mapping and that ->mapping->host will
not be freed. The dax implementation, on the other hand, relies on
xa_lock_irq() for stabilizing the page->mapping relationship and it is
not possible to hold the lock over current routines in the
memory_failure() path that run under lock_page().

Teach the various memory_failure() helpers to pin the address_space and
revalidate page->mapping under xa_lock_irq(mapping->i_pages).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory-failure.c |   56 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 42a193ee14d3..b6efb78ba49b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -179,12 +179,20 @@ EXPORT_SYMBOL_GPL(hwpoison_filter);
  * ``action required'' if error happened in current execution context
  */
 static int kill_proc(struct task_struct *t, unsigned long addr,
-			unsigned long pfn, unsigned size_shift, int flags)
+		struct address_space *mapping, struct page *page,
+		unsigned size_shift, int flags)
 {
-	int ret;
+	int ret = 0;
+
+	/* revalidate the page before killing the process */
+	xa_lock_irq(&mapping->i_pages);
+	if (page->mapping != mapping) {
+		xa_unlock_irq(&mapping->i_pages);
+		return 0;
+	}
 
 	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
-		pfn, t->comm, t->pid);
+			page_to_pfn(page), t->comm, t->pid);
 
 	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
 		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
@@ -199,6 +207,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
 		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
 				      size_shift, t);  /* synchronous? */
 	}
+	xa_unlock_irq(&mapping->i_pages);
 	if (ret < 0)
 		pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
 			t->comm, t->pid, ret);
@@ -316,8 +325,8 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
  * wrong earlier.
  */
 static void kill_procs(struct list_head *to_kill, int forcekill,
-			  bool fail, unsigned size_shift, unsigned long pfn,
-			  int flags)
+		bool fail, unsigned size_shift, struct address_space *mapping,
+		struct page *page, int flags)
 {
 	struct to_kill *tk, *next;
 
@@ -330,7 +339,8 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 */
 			if (fail || tk->addr_valid == 0) {
 				pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
-				       pfn, tk->tsk->comm, tk->tsk->pid);
+						page_to_pfn(page), tk->tsk->comm,
+						tk->tsk->pid);
 				force_sig(SIGKILL, tk->tsk);
 			}
 
@@ -341,9 +351,10 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 * process anyways.
 			 */
 			else if (kill_proc(tk->tsk, tk->addr,
-					      pfn, size_shift, flags) < 0)
+					      mapping, page, size_shift, flags) < 0)
 				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
-				       pfn, tk->tsk->comm, tk->tsk->pid);
+						page_to_pfn(page), tk->tsk->comm,
+						tk->tsk->pid);
 		}
 		put_task_struct(tk->tsk);
 		kfree(tk);
@@ -429,21 +440,27 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 /*
  * Collect processes when the error hit a file mapped page.
  */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
-			      struct to_kill **tkc, int force_early)
+static void collect_procs_file(struct address_space *mapping, struct page *page,
+		struct list_head *to_kill, struct to_kill **tkc,
+		int force_early)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
-	struct address_space *mapping = page->mapping;
 
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page_to_pgoff(page);
+		pgoff_t pgoff;
 		struct task_struct *t = task_early_kill(tsk, force_early);
 
 		if (!t)
 			continue;
+		xa_lock_irq(&mapping->i_pages);
+		if (page->mapping != mapping) {
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		}
+		pgoff = page_to_pgoff(page);
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
 				      pgoff) {
 			/*
@@ -456,6 +473,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 			if (vma->vm_mm == t->mm)
 				add_to_kill(t, page, vma, to_kill, tkc);
 		}
+		xa_unlock_irq(&mapping->i_pages);
 	}
 	read_unlock(&tasklist_lock);
 	i_mmap_unlock_read(mapping);
@@ -467,12 +485,12 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
  * First preallocate one tokill structure outside the spin locks,
  * so that we can kill at least one process reasonably reliable.
  */
-static void collect_procs(struct page *page, struct list_head *tokill,
-				int force_early)
+static void collect_procs(struct address_space *mapping, struct page *page,
+		struct list_head *tokill, int force_early)
 {
 	struct to_kill *tk;
 
-	if (!page->mapping)
+	if (!mapping)
 		return;
 
 	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
@@ -481,7 +499,7 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 	if (PageAnon(page))
 		collect_procs_anon(page, tokill, &tk, force_early);
 	else
-		collect_procs_file(page, tokill, &tk, force_early);
+		collect_procs_file(mapping, page, tokill, &tk, force_early);
 	kfree(tk);
 }
 
@@ -986,7 +1004,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * there's nothing that can be done.
 	 */
 	if (kill)
-		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
+		collect_procs(mapping, hpage, &tokill,
+				flags & MF_ACTION_REQUIRED);
 
 	unmap_success = try_to_unmap(hpage, ttu);
 	if (!unmap_success)
@@ -1012,7 +1031,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 */
 	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
 	size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
-	kill_procs(&tokill, forcekill, !unmap_success, size_shift, pfn, flags);
+	kill_procs(&tokill, forcekill, !unmap_success, size_shift, mapping,
+			hpage, flags);
 
 	return unmap_success;
 }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (9 preceding siblings ...)
  2018-06-04 23:12 ` [PATCH v3 10/12] mm, memory_failure: Fix page->mapping assumptions relative to the page lock Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  2018-06-07  4:22   ` Dan Williams
  2018-06-07 16:30   ` Jan Kara
  2018-06-04 23:12 ` [PATCH v3 12/12] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
  11 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Jérôme Glisse,
	Matthew Wilcox, Naoya Horiguchi, Ross Zwisler, linux-mm,
	linux-fsdevel, jack

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered

In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical address to file offset associations. Also
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internal remap/repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.

Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h  |    1 
 mm/memory-failure.c |  145 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 146 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06a4be6..566c972e03e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2669,6 +2669,7 @@ enum mf_action_page_type {
 	MF_MSG_TRUNCATED_LRU,
 	MF_MSG_BUDDY,
 	MF_MSG_BUDDY_2ND,
+	MF_MSG_DAX,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b6efb78ba49b..de0bc897d6e7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -55,6 +55,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm_inline.h>
+#include <linux/memremap.h>
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include "internal.h"
@@ -531,6 +532,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_TRUNCATED_LRU]		= "already truncated LRU page",
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
+	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1132,6 +1134,144 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	return res;
 }
 
+static unsigned long dax_mapping_size(struct address_space *mapping,
+		struct page *page)
+{
+	pgoff_t pgoff = page_to_pgoff(page);
+	struct vm_area_struct *vma;
+	unsigned long size = 0;
+
+	i_mmap_lock_read(mapping);
+	xa_lock_irq(&mapping->i_pages);
+	/* validate that @page is still linked to @mapping */
+	if (page->mapping != mapping) {
+		xa_unlock_irq(&mapping->i_pages);
+		i_mmap_unlock_read(mapping);
+			return 0;
+	}
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long address = vma_address(page, vma);
+		pgd_t *pgd;
+		p4d_t *p4d;
+		pud_t *pud;
+		pmd_t *pmd;
+		pte_t *pte;
+
+		pgd = pgd_offset(vma->vm_mm, address);
+		if (!pgd_present(*pgd))
+			continue;
+		p4d = p4d_offset(pgd, address);
+		if (!p4d_present(*p4d))
+			continue;
+		pud = pud_offset(p4d, address);
+		if (!pud_present(*pud))
+			continue;
+		if (pud_devmap(*pud)) {
+			size = PUD_SIZE;
+			break;
+		}
+		pmd = pmd_offset(pud, address);
+		if (!pmd_present(*pmd))
+			continue;
+		if (pmd_devmap(*pmd)) {
+			size = PMD_SIZE;
+			break;
+		}
+		pte = pte_offset_map(pmd, address);
+		if (!pte_present(*pte))
+			continue;
+		if (pte_devmap(*pte)) {
+			size = PAGE_SIZE;
+			break;
+		}
+	}
+	xa_unlock_irq(&mapping->i_pages);
+	i_mmap_unlock_read(mapping);
+
+	return size;
+}
+
+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	const bool unmap_success = true;
+	struct address_space *mapping;
+	unsigned long size;
+	LIST_HEAD(tokill);
+	int rc = -EBUSY;
+	loff_t start;
+
+	/*
+	 * Prevent the inode from being freed while we are interrogating
+	 * the address_space, typically this would be handled by
+	 * lock_page(), but dax pages do not use the page lock.
+	 */
+	rcu_read_lock();
+	mapping = page->mapping;
+	if (!mapping) {
+		rcu_read_unlock();
+		goto out;
+	}
+	if (!igrab(mapping->host)) {
+		mapping = NULL;
+		rcu_read_unlock();
+		goto out;
+	}
+	rcu_read_unlock();
+
+	if (hwpoison_filter(page)) {
+		rc = 0;
+		goto out;
+	}
+
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		goto out;
+	default:
+		break;
+	}
+
+	/*
+	 * If the page is not mapped in userspace then report it as
+	 * unhandled.
+	 */
+	size = dax_mapping_size(mapping, page);
+	if (!size) {
+		pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
+		goto out;
+	}
+
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(mapping, page, &tokill, flags & MF_ACTION_REQUIRED);
+
+	start = (page->index << PAGE_SHIFT) & ~(size - 1);
+	unmap_mapping_range(page->mapping, start, start + size, 0);
+
+	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),
+			mapping, page, flags);
+	rc = 0;
+out:
+	if (mapping)
+		iput(mapping->host);
+	put_dev_pagemap(pgmap);
+	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
+	return rc;
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1154,6 +1294,7 @@ int memory_failure(unsigned long pfn, int flags)
 	struct page *p;
 	struct page *hpage;
 	struct page *orig_head;
+	struct dev_pagemap *pgmap;
 	int res;
 	unsigned long page_flags;
 
@@ -1166,6 +1307,10 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	pgmap = get_dev_pagemap(pfn, NULL);
+	if (pgmap)
+		return memory_failure_dev_pagemap(pfn, flags, pgmap);
+
 	p = pfn_to_page(pfn);
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 12/12] libnvdimm, pmem: Restore page attributes when clearing errors
  2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (10 preceding siblings ...)
  2018-06-04 23:12 ` [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-06-04 23:12 ` Dan Williams
  11 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-04 23:12 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: hch, linux-mm, linux-fsdevel, jack

Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
has also been marked UC in PAT (page attribute table) via
set_mce_nospec() to prevent speculative retrievals of poison.

The 'HWPoison' flag is only cleared when overwriting an entire page.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   26 ++++++++++++++++++++++++++
 drivers/nvdimm/pmem.h |   13 +++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..04ee1fdee219 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -20,6 +20,7 @@
 #include <linux/hdreg.h>
 #include <linux/init.h>
 #include <linux/platform_device.h>
+#include <linux/set_memory.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/badblocks.h>
@@ -51,6 +52,30 @@ static struct nd_region *to_region(struct pmem_device *pmem)
 	return to_nd_region(to_dev(pmem)->parent);
 }
 
+static void hwpoison_clear(struct pmem_device *pmem,
+		phys_addr_t phys, unsigned int len)
+{
+	unsigned long pfn_start, pfn_end, pfn;
+
+	/* only pmem in the linear map supports HWPoison */
+	if (is_vmalloc_addr(pmem->virt_addr))
+		return;
+
+	pfn_start = PHYS_PFN(phys);
+	pfn_end = pfn_start + PHYS_PFN(len);
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * Note, no need to hold a get_dev_pagemap() reference
+		 * here since we're in the driver I/O path and
+		 * outstanding I/O requests pin the dev_pagemap.
+		 */
+		if (test_and_clear_pmem_poison(page))
+			clear_mce_nospec(pfn);
+	}
+}
+
 static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 		phys_addr_t offset, unsigned int len)
 {
@@ -65,6 +90,7 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	if (cleared < len)
 		rc = BLK_STS_IOERR;
 	if (cleared > 0 && cleared / 512) {
+		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
 		cleared /= 512;
 		dev_dbg(dev, "%#llx clear %ld sector%s\n",
 				(unsigned long long) sector, cleared,
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index a64ebc78b5df..59cfe13ea8a8 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __NVDIMM_PMEM_H__
 #define __NVDIMM_PMEM_H__
+#include <linux/page-flags.h>
 #include <linux/badblocks.h>
 #include <linux/types.h>
 #include <linux/pfn_t.h>
@@ -27,4 +28,16 @@ struct pmem_device {
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn);
+
+#ifdef CONFIG_MEMORY_FAILURE
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return TestClearPageHWPoison(page);
+}
+#else
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return false;
+}
+#endif
 #endif /* __NVDIMM_PMEM_H__ */

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-04 23:12 ` [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-06-07  4:22   ` Dan Williams
  2018-06-07 16:30   ` Jan Kara
  1 sibling, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-07  4:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Jérôme Glisse,
	Matthew Wilcox, Naoya Horiguchi, Ross Zwisler, Linux MM,
	linux-fsdevel

On Mon, Jun 4, 2018 at 4:12 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>     mce: Uncorrected hardware memory error in user-access at af34214200
>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>     mce: [Hardware Error]: Machine check events logged
>     {1}[Hardware Error]: event severity: corrected
>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>     [..]
>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>     mce: Memory error not recovered
>
> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
> dax there is no possibility to map in another page dynamically since dax
> establishes 1:1 physical address to file offset associations. Also
> dev_pagemap pages associated with NVDIMM / persistent memory devices can
> internal remap/repair addresses with poison. While memory_failure()
> assumes that it can discard typical poisoned pages and keep them
> unmapped indefinitely, dev_pagemap pages may be returned to service
> after the error is cleared.
>
> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
> dev_pagemap pages that have poison consumed by userspace. Mark the
> memory as UC instead of unmapping it completely to allow ongoing access
> via the device driver (nd_pmem). Later, nd_pmem will grow support for
> marking the page back to WB when the error is cleared.
>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Naoya, your thoughts on this patch? It is passing my unit tests for
filesystem-dax and device-dax.

> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/mm.h  |    1
>  mm/memory-failure.c |  145 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 146 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ac1f06a4be6..566c972e03e7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2669,6 +2669,7 @@ enum mf_action_page_type {
>         MF_MSG_TRUNCATED_LRU,
>         MF_MSG_BUDDY,
>         MF_MSG_BUDDY_2ND,
> +       MF_MSG_DAX,
>         MF_MSG_UNKNOWN,
>  };
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index b6efb78ba49b..de0bc897d6e7 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -55,6 +55,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memory_hotplug.h>
>  #include <linux/mm_inline.h>
> +#include <linux/memremap.h>
>  #include <linux/kfifo.h>
>  #include <linux/ratelimit.h>
>  #include "internal.h"
> @@ -531,6 +532,7 @@ static const char * const action_page_types[] = {
>         [MF_MSG_TRUNCATED_LRU]          = "already truncated LRU page",
>         [MF_MSG_BUDDY]                  = "free buddy page",
>         [MF_MSG_BUDDY_2ND]              = "free buddy page (2nd try)",
> +       [MF_MSG_DAX]                    = "dax page",
>         [MF_MSG_UNKNOWN]                = "unknown page",
>  };
>
> @@ -1132,6 +1134,144 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
>         return res;
>  }
>
> +static unsigned long dax_mapping_size(struct address_space *mapping,
> +               struct page *page)
> +{
> +       pgoff_t pgoff = page_to_pgoff(page);
> +       struct vm_area_struct *vma;
> +       unsigned long size = 0;
> +
> +       i_mmap_lock_read(mapping);
> +       xa_lock_irq(&mapping->i_pages);
> +       /* validate that @page is still linked to @mapping */
> +       if (page->mapping != mapping) {
> +               xa_unlock_irq(&mapping->i_pages);
> +               i_mmap_unlock_read(mapping);
> +                       return 0;
> +       }
> +       vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +               unsigned long address = vma_address(page, vma);
> +               pgd_t *pgd;
> +               p4d_t *p4d;
> +               pud_t *pud;
> +               pmd_t *pmd;
> +               pte_t *pte;
> +
> +               pgd = pgd_offset(vma->vm_mm, address);
> +               if (!pgd_present(*pgd))
> +                       continue;
> +               p4d = p4d_offset(pgd, address);
> +               if (!p4d_present(*p4d))
> +                       continue;
> +               pud = pud_offset(p4d, address);
> +               if (!pud_present(*pud))
> +                       continue;
> +               if (pud_devmap(*pud)) {
> +                       size = PUD_SIZE;
> +                       break;
> +               }
> +               pmd = pmd_offset(pud, address);
> +               if (!pmd_present(*pmd))
> +                       continue;
> +               if (pmd_devmap(*pmd)) {
> +                       size = PMD_SIZE;
> +                       break;
> +               }
> +               pte = pte_offset_map(pmd, address);
> +               if (!pte_present(*pte))
> +                       continue;
> +               if (pte_devmap(*pte)) {
> +                       size = PAGE_SIZE;
> +                       break;
> +               }
> +       }
> +       xa_unlock_irq(&mapping->i_pages);
> +       i_mmap_unlock_read(mapping);
> +
> +       return size;
> +}
> +
> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> +               struct dev_pagemap *pgmap)
> +{
> +       struct page *page = pfn_to_page(pfn);
> +       const bool unmap_success = true;
> +       struct address_space *mapping;
> +       unsigned long size;
> +       LIST_HEAD(tokill);
> +       int rc = -EBUSY;
> +       loff_t start;
> +
> +       /*
> +        * Prevent the inode from being freed while we are interrogating
> +        * the address_space, typically this would be handled by
> +        * lock_page(), but dax pages do not use the page lock.
> +        */
> +       rcu_read_lock();
> +       mapping = page->mapping;
> +       if (!mapping) {
> +               rcu_read_unlock();
> +               goto out;
> +       }
> +       if (!igrab(mapping->host)) {
> +               mapping = NULL;
> +               rcu_read_unlock();
> +               goto out;
> +       }
> +       rcu_read_unlock();
> +
> +       if (hwpoison_filter(page)) {
> +               rc = 0;
> +               goto out;
> +       }
> +
> +       switch (pgmap->type) {
> +       case MEMORY_DEVICE_PRIVATE:
> +       case MEMORY_DEVICE_PUBLIC:
> +               /*
> +                * TODO: Handle HMM pages which may need coordination
> +                * with device-side memory.
> +                */
> +               goto out;
> +       default:
> +               break;
> +       }
> +
> +       /*
> +        * If the page is not mapped in userspace then report it as
> +        * unhandled.
> +        */
> +       size = dax_mapping_size(mapping, page);
> +       if (!size) {
> +               pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
> +               goto out;
> +       }
> +
> +       SetPageHWPoison(page);
> +
> +       /*
> +        * Unlike System-RAM there is no possibility to swap in a
> +        * different physical page at a given virtual address, so all
> +        * userspace consumption of ZONE_DEVICE memory necessitates
> +        * SIGBUS (i.e. MF_MUST_KILL)
> +        */
> +       flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +       collect_procs(mapping, page, &tokill, flags & MF_ACTION_REQUIRED);
> +
> +       start = (page->index << PAGE_SHIFT) & ~(size - 1);
> +       unmap_mapping_range(page->mapping, start, start + size, 0);
> +
> +       kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, ilog2(size),
> +                       mapping, page, flags);
> +       rc = 0;
> +out:
> +       if (mapping)
> +               iput(mapping->host);
> +       put_dev_pagemap(pgmap);
> +       action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
> +       return rc;
> +}
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1154,6 +1294,7 @@ int memory_failure(unsigned long pfn, int flags)
>         struct page *p;
>         struct page *hpage;
>         struct page *orig_head;
> +       struct dev_pagemap *pgmap;
>         int res;
>         unsigned long page_flags;
>
> @@ -1166,6 +1307,10 @@ int memory_failure(unsigned long pfn, int flags)
>                 return -ENXIO;
>         }
>
> +       pgmap = get_dev_pagemap(pfn, NULL);
> +       if (pgmap)
> +               return memory_failure_dev_pagemap(pfn, flags, pgmap);
> +
>         p = pfn_to_page(pfn);
>         if (PageHuge(p))
>                 return memory_failure_hugetlb(pfn, flags);
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-04 23:12 ` [PATCH v3 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
@ 2018-06-07  4:42   ` Dan Williams
  2018-06-07 17:02     ` Luck, Tony
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-06-07  4:42 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck,
	Borislav Petkov, linux-edac, X86 ML, Christoph Hellwig, Linux MM,
	linux-fsdevel, Jan Kara

On Mon, Jun 4, 2018 at 4:12 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> Currently memory_failure() returns zero if the error was handled. On
> that result mce_unmap_kpfn() is called to zap the page out of the kernel
> linear mapping to prevent speculative fetches of potentially poisoned
> memory. However, in the case of dax mapped devmap pages the page may be
> in active permanent use by the device driver, so it cannot be unmapped
> from the kernel.
>
> Instead of marking the page not present, marking the page UC should
> be sufficient for preventing poison from being pre-fetched into the
> cache. Convert mce_unmap_pfn() to set_mce_nospec() remapping the page as
> UC, to hide it from speculative accesses.
>
> Given that that persistent memory errors can be cleared by the driver,
> include a facility to restore the page to cacheable operation,
> clear_mce_nospec().
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Tony Luck <tony.luck@intel.com>

Tony, safe to assume you are ok with this patch now that the
decoy_addr approach is back?

> Cc: Borislav Petkov <bp@alien8.de>
> Cc: <linux-edac@vger.kernel.org>
> Cc: <x86@kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/include/asm/set_memory.h         |   42 +++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ----------
>  arch/x86/kernel/cpu/mcheck/mce.c          |   38 ++------------------------
>  include/linux/set_memory.h                |   14 ++++++++++
>  4 files changed, 59 insertions(+), 50 deletions(-)
>
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index bd090367236c..cf5e9124b45e 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -88,4 +88,46 @@ extern int kernel_set_to_readonly;
>  void set_kernel_text_rw(void);
>  void set_kernel_text_ro(void);
>
> +#ifdef CONFIG_X86_64
> +static inline int set_mce_nospec(unsigned long pfn)
> +{
> +       unsigned long decoy_addr;
> +       int rc;
> +
> +       /*
> +        * Mark the linear address as UC to make sure we don't log more
> +        * errors because of speculative access to the page.
> +        * We would like to just call:
> +        *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
> +        * but doing that would radically increase the odds of a
> +        * speculative access to the poison page because we'd have
> +        * the virtual address of the kernel 1:1 mapping sitting
> +        * around in registers.
> +        * Instead we get tricky.  We create a non-canonical address
> +        * that looks just like the one we want, but has bit 63 flipped.
> +        * This relies on set_memory_uc() properly sanitizing any __pa()
> +        * results with __PHYSICAL_MASK or PTE_PFN_MASK.
> +        */
> +       decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
> +
> +       rc = set_memory_uc(decoy_addr, 1);
> +       if (rc)
> +               pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
> +       return rc;
> +}
> +#define set_mce_nospec set_mce_nospec
> +
> +/* Restore full speculative operation to the pfn. */
> +static inline int clear_mce_nospec(unsigned long pfn)
> +{
> +       return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
> +}
> +#define clear_mce_nospec clear_mce_nospec
> +#else
> +/*
> + * Few people would run a 32-bit kernel on a machine that supports
> + * recoverable errors because they have too much memory to boot 32-bit.
> + */
> +#endif
> +
>  #endif /* _ASM_X86_SET_MEMORY_H */
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index 374d1aa66952..ceb67cd5918f 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -113,21 +113,6 @@ static inline void mce_register_injector_chain(struct notifier_block *nb)  { }
>  static inline void mce_unregister_injector_chain(struct notifier_block *nb)    { }
>  #endif
>
> -#ifndef CONFIG_X86_64
> -/*
> - * On 32-bit systems it would be difficult to safely unmap a poison page
> - * from the kernel 1:1 map because there are no non-canonical addresses that
> - * we can use to refer to the address without risking a speculative access.
> - * However, this isn't much of an issue because:
> - * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
> - *    are only mapped into the kernel as needed
> - * 2) Few people would run a 32-bit kernel on a machine that supports
> - *    recoverable errors because they have too much memory to boot 32-bit.
> - */
> -static inline void mce_unmap_kpfn(unsigned long pfn) {}
> -#define mce_unmap_kpfn mce_unmap_kpfn
> -#endif
> -
>  struct mca_config {
>         bool dont_log_ce;
>         bool cmci_disabled;
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 42cf2880d0ed..a0fbf0a8b7e6 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -42,6 +42,7 @@
>  #include <linux/irq_work.h>
>  #include <linux/export.h>
>  #include <linux/jump_label.h>
> +#include <linux/set_memory.h>
>
>  #include <asm/intel-family.h>
>  #include <asm/processor.h>
> @@ -50,7 +51,6 @@
>  #include <asm/mce.h>
>  #include <asm/msr.h>
>  #include <asm/reboot.h>
> -#include <asm/set_memory.h>
>
>  #include "mce-internal.h"
>
> @@ -108,10 +108,6 @@ static struct irq_work mce_irq_work;
>
>  static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
>
> -#ifndef mce_unmap_kpfn
> -static void mce_unmap_kpfn(unsigned long pfn);
> -#endif
> -
>  /*
>   * CPU/chipset specific EDAC code can register a notifier call here to print
>   * MCE errors in a human-readable form.
> @@ -602,7 +598,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
>         if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
>                 pfn = mce->addr >> PAGE_SHIFT;
>                 if (!memory_failure(pfn, 0))
> -                       mce_unmap_kpfn(pfn);
> +                       set_mce_nospec(pfn);
>         }
>
>         return NOTIFY_OK;
> @@ -1070,38 +1066,10 @@ static int do_memory_failure(struct mce *m)
>         if (ret)
>                 pr_err("Memory error not recovered");
>         else
> -               mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
> +               set_mce_nospec(m->addr >> PAGE_SHIFT);
>         return ret;
>  }
>
> -#ifndef mce_unmap_kpfn
> -static void mce_unmap_kpfn(unsigned long pfn)
> -{
> -       unsigned long decoy_addr;
> -
> -       /*
> -        * Unmap this page from the kernel 1:1 mappings to make sure
> -        * we don't log more errors because of speculative access to
> -        * the page.
> -        * We would like to just call:
> -        *      set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
> -        * but doing that would radically increase the odds of a
> -        * speculative access to the poison page because we'd have
> -        * the virtual address of the kernel 1:1 mapping sitting
> -        * around in registers.
> -        * Instead we get tricky.  We create a non-canonical address
> -        * that looks just like the one we want, but has bit 63 flipped.
> -        * This relies on set_memory_np() not checking whether we passed
> -        * a legal address.
> -        */
> -
> -       decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
> -
> -       if (set_memory_np(decoy_addr, 1))
> -               pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
> -}
> -#endif
> -
>  /*
>   * The actual machine check handler. This only handles real
>   * exceptions when something got corrupted coming in through int 18.
> diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
> index da5178216da5..2a986d282a97 100644
> --- a/include/linux/set_memory.h
> +++ b/include/linux/set_memory.h
> @@ -17,6 +17,20 @@ static inline int set_memory_x(unsigned long addr,  int numpages) { return 0; }
>  static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
>  #endif
>
> +#ifndef set_mce_nospec
> +static inline int set_mce_nospec(unsigned long pfn)
> +{
> +       return 0;
> +}
> +#endif
> +
> +#ifndef clear_mce_nospec
> +static inline int clear_mce_nospec(unsigned long pfn)
> +{
> +       return 0;
> +}
> +#endif
> +
>  #ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
>  static inline int set_memory_encrypted(unsigned long addr, int numpages)
>  {
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 03/12] device-dax: Enable page_mapping()
  2018-06-04 23:11 ` [PATCH v3 03/12] device-dax: Enable page_mapping() Dan Williams
@ 2018-06-07 15:39   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2018-06-07 15:39 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, hch, linux-mm, linux-fsdevel, jack

On Mon 04-06-18 16:11:55, Dan Williams wrote:
> In support of enabling memory_failure() handling for device-dax
> mappings, set the ->mapping association of pages backing device-dax
> mappings. The rmap implementation requires page_mapping() to return the
> address_space hosting the vmas that map the page.
> 
> The ->mapping pointer is never cleared. There is no possibility for the
> page to become associated with another address_space while the device is
> enabled. When the device is disabled the 'struct page' array for the
> device is destroyed / later reinitialized to zero.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/dax/device.c |   55 +++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 38 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 686de08e120b..7ec246549721 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -245,13 +245,12 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
>  }
>  
>  static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
> -				struct vm_fault *vmf)
> +				struct vm_fault *vmf, pfn_t *pfn)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct device *dev = &dev_dax->dev;
>  	struct dax_region *dax_region;
>  	phys_addr_t phys;
> -	pfn_t pfn;
>  	unsigned int fault_size = PAGE_SIZE;
>  
>  	if (check_vma(dev_dax, vma, __func__))
> @@ -273,13 +272,13 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
>  		return VM_FAULT_SIGBUS;
>  	}
>  
> -	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> +	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
>  
> -	return vmf_insert_mixed(vma, vmf->address, pfn);
> +	return vmf_insert_mixed(vma, vmf->address, *pfn);
>  }
>  
>  static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
> -				struct vm_fault *vmf)
> +				struct vm_fault *vmf, pfn_t *pfn)
>  {
>  	unsigned long pmd_addr = vmf->address & PMD_MASK;
>  	struct vm_area_struct *vma = vmf->vma;
> @@ -287,7 +286,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
>  	struct dax_region *dax_region;
>  	phys_addr_t phys;
>  	pgoff_t pgoff;
> -	pfn_t pfn;
>  	unsigned int fault_size = PMD_SIZE;
>  
>  	if (check_vma(dev_dax, vma, __func__))
> @@ -322,15 +320,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
>  		return VM_FAULT_SIGBUS;
>  	}
>  
> -	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> +	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
>  
> -	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
> +	return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, *pfn,
>  			vmf->flags & FAULT_FLAG_WRITE);
>  }
>  
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>  static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
> -				struct vm_fault *vmf)
> +				struct vm_fault *vmf, pfn_t *pfn)
>  {
>  	unsigned long pud_addr = vmf->address & PUD_MASK;
>  	struct vm_area_struct *vma = vmf->vma;
> @@ -338,7 +336,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  	struct dax_region *dax_region;
>  	phys_addr_t phys;
>  	pgoff_t pgoff;
> -	pfn_t pfn;
>  	unsigned int fault_size = PUD_SIZE;
>  
>  
> @@ -374,14 +371,14 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  		return VM_FAULT_SIGBUS;
>  	}
>  
> -	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> +	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
>  
> -	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
> +	return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, *pfn,
>  			vmf->flags & FAULT_FLAG_WRITE);
>  }
>  #else
>  static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
> -				struct vm_fault *vmf)
> +				struct vm_fault *vmf, pfn_t *pfn)
>  {
>  	return VM_FAULT_FALLBACK;
>  }
> @@ -390,9 +387,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>  		enum page_entry_size pe_size)
>  {
> -	int rc, id;
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct file *filp = vma->vm_file;
> +	unsigned long fault_size;
> +	int rc, id;
> +	pfn_t pfn;
>  	struct dev_dax *dev_dax = filp->private_data;
>  
>  	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
> @@ -402,17 +401,39 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>  	id = dax_read_lock();
>  	switch (pe_size) {
>  	case PE_SIZE_PTE:
> -		rc = __dev_dax_pte_fault(dev_dax, vmf);
> +		fault_size = PAGE_SIZE;
> +		rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
>  		break;
>  	case PE_SIZE_PMD:
> -		rc = __dev_dax_pmd_fault(dev_dax, vmf);
> +		fault_size = PMD_SIZE;
> +		rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
>  		break;
>  	case PE_SIZE_PUD:
> -		rc = __dev_dax_pud_fault(dev_dax, vmf);
> +		fault_size = PUD_SIZE;
> +		rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
>  		break;
>  	default:
>  		rc = VM_FAULT_SIGBUS;
>  	}
> +
> +	if (rc == VM_FAULT_NOPAGE) {
> +		unsigned long i;
> +
> +		/*
> +		 * In the device-dax case the only possibility for a
> +		 * VM_FAULT_NOPAGE result is when device-dax capacity is
> +		 * mapped. No need to consider the zero page, or racing
> +		 * conflicting mappings.
> +		 */
> +		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> +			struct page *page;
> +
> +			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> +			if (page->mapping)
> +				continue;
> +			page->mapping = filp->f_mapping;
> +		}
> +	}
>  	dax_read_unlock(id);
>  
>  	return rc;
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 04/12] device-dax: Set page->index
  2018-06-04 23:12 ` [PATCH v3 04/12] device-dax: Set page->index Dan Williams
@ 2018-06-07 15:50   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2018-06-07 15:50 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm, hch, linux-mm, linux-fsdevel, jack

On Mon 04-06-18 16:12:01, Dan Williams wrote:
> In support of enabling memory_failure() handling for device-dax
> mappings, set ->index to the pgoff of the page. The rmap implementation
> requires ->index to bound the search through the vma interval tree.
> 
> The ->index value is never cleared. There is no possibility for the
> page to become associated with another pgoff while the device is
> enabled. When the device is disabled the 'struct page' array for the
> device is destroyed and ->index is reinitialized to zero.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 05/12] filesystem-dax: Set page->index
  2018-06-04 23:12 ` [PATCH v3 05/12] filesystem-dax: " Dan Williams
@ 2018-06-07 16:02   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2018-06-07 16:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Ross Zwisler, linux-mm, linux-fsdevel

On Mon 04-06-18 16:12:06, Dan Williams wrote:
> In support of enabling memory_failure() handling for filesystem-dax
> mappings, set ->index to the pgoff of the page. The rmap implementation
> requires ->index to bound the search through the vma interval tree. The
> index is set and cleared at dax_associate_entry() and
> dax_disassociate_entry() time respectively.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c |   16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index aaec72ded1b6..cccf6cad1a7a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -319,18 +319,27 @@ static unsigned long dax_radix_end_pfn(void *entry)
>  	for (pfn = dax_radix_pfn(entry); \
>  			pfn < dax_radix_end_pfn(entry); pfn++)
>  
> -static void dax_associate_entry(void *entry, struct address_space *mapping)
> +/*
> + * TODO: for reflink+dax we need a way to associate a single page with
> + * multiple address_space instances at different linear_page_index()
> + * offsets.
> + */
> +static void dax_associate_entry(void *entry, struct address_space *mapping,
> +		struct vm_area_struct *vma, unsigned long address)
>  {
> -	unsigned long pfn;
> +	unsigned long size = dax_entry_size(entry), pfn, index;
> +	int i = 0;
>  
>  	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>  		return;
>  
> +	index = linear_page_index(vma, address & ~(size - 1));
>  	for_each_mapped_pfn(entry, pfn) {
>  		struct page *page = pfn_to_page(pfn);
>  
>  		WARN_ON_ONCE(page->mapping);
>  		page->mapping = mapping;
> +		page->index = index + i++;
>  	}
>  }
>  
> @@ -348,6 +357,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>  		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>  		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>  		page->mapping = NULL;
> +		page->index = 0;
>  	}
>  }
>  
> @@ -604,7 +614,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  	new_entry = dax_radix_locked_entry(pfn, flags);
>  	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
>  		dax_disassociate_entry(entry, mapping, false);
> -		dax_associate_entry(new_entry, mapping);
> +		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
>  	}
>  
>  	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-04 23:12 ` [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
  2018-06-07  4:22   ` Dan Williams
@ 2018-06-07 16:30   ` Jan Kara
  2018-06-07 16:45     ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Kara @ 2018-06-07 16:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig,
	Jérôme Glisse, Matthew Wilcox, Naoya Horiguchi,
	Ross Zwisler, linux-mm, linux-fsdevel

On Mon 04-06-18 16:12:37, Dan Williams wrote:
>     mce: Uncorrected hardware memory error in user-access at af34214200
>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>     mce: [Hardware Error]: Machine check events logged
>     {1}[Hardware Error]: event severity: corrected
>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>     [..]
>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>     mce: Memory error not recovered
> 
> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
> dax there is no possibility to map in another page dynamically since dax
> establishes 1:1 physical address to file offset associations. Also
> dev_pagemap pages associated with NVDIMM / persistent memory devices can
> internal remap/repair addresses with poison. While memory_failure()
> assumes that it can discard typical poisoned pages and keep them
> unmapped indefinitely, dev_pagemap pages may be returned to service
> after the error is cleared.
> 
> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
> dev_pagemap pages that have poison consumed by userspace. Mark the
> memory as UC instead of unmapping it completely to allow ongoing access
> via the device driver (nd_pmem). Later, nd_pmem will grow support for
> marking the page back to WB when the error is cleared.

...

> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> +		struct dev_pagemap *pgmap)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	const bool unmap_success = true;
> +	struct address_space *mapping;
> +	unsigned long size;
> +	LIST_HEAD(tokill);
> +	int rc = -EBUSY;
> +	loff_t start;
> +
> +	/*
> +	 * Prevent the inode from being freed while we are interrogating
> +	 * the address_space, typically this would be handled by
> +	 * lock_page(), but dax pages do not use the page lock.
> +	 */
> +	rcu_read_lock();
> +	mapping = page->mapping;
> +	if (!mapping) {
> +		rcu_read_unlock();
> +		goto out;
> +	}
> +	if (!igrab(mapping->host)) {
> +		mapping = NULL;
> +		rcu_read_unlock();
> +		goto out;
> +	}
> +	rcu_read_unlock();

Why don't you use radix tree entry lock here instead? That is a direct
replacement of the page lock and you don't have to play games with pinning
the inode and verifying the mapping afterwards.

> +out:
> +	if (mapping)
> +		iput(mapping->host);

BTW, this would have to be prepared to do full inode deletion which I'm not
quite sure is safe from this context.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-06-07 16:30   ` Jan Kara
@ 2018-06-07 16:45     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-06-07 16:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Christoph Hellwig, Jérôme Glisse,
	Matthew Wilcox, Naoya Horiguchi, Ross Zwisler, Linux MM,
	linux-fsdevel

On Thu, Jun 7, 2018 at 9:30 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 04-06-18 16:12:37, Dan Williams wrote:
>>     mce: Uncorrected hardware memory error in user-access at af34214200
>>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>>     mce: [Hardware Error]: Machine check events logged
>>     {1}[Hardware Error]: event severity: corrected
>>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>>     [..]
>>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>>     mce: Memory error not recovered
>>
>> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
>> dax there is no possibility to map in another page dynamically since dax
>> establishes 1:1 physical address to file offset associations. Also
>> dev_pagemap pages associated with NVDIMM / persistent memory devices can
>> internal remap/repair addresses with poison. While memory_failure()
>> assumes that it can discard typical poisoned pages and keep them
>> unmapped indefinitely, dev_pagemap pages may be returned to service
>> after the error is cleared.
>>
>> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
>> dev_pagemap pages that have poison consumed by userspace. Mark the
>> memory as UC instead of unmapping it completely to allow ongoing access
>> via the device driver (nd_pmem). Later, nd_pmem will grow support for
>> marking the page back to WB when the error is cleared.
>
> ...
>
>> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> +             struct dev_pagemap *pgmap)
>> +{
>> +     struct page *page = pfn_to_page(pfn);
>> +     const bool unmap_success = true;
>> +     struct address_space *mapping;
>> +     unsigned long size;
>> +     LIST_HEAD(tokill);
>> +     int rc = -EBUSY;
>> +     loff_t start;
>> +
>> +     /*
>> +      * Prevent the inode from being freed while we are interrogating
>> +      * the address_space, typically this would be handled by
>> +      * lock_page(), but dax pages do not use the page lock.
>> +      */
>> +     rcu_read_lock();
>> +     mapping = page->mapping;
>> +     if (!mapping) {
>> +             rcu_read_unlock();
>> +             goto out;
>> +     }
>> +     if (!igrab(mapping->host)) {
>> +             mapping = NULL;
>> +             rcu_read_unlock();
>> +             goto out;
>> +     }
>> +     rcu_read_unlock();
>
> Why don't you use radix tree entry lock here instead? That is a direct
> replacement of the page lock and you don't have to play games with pinning
> the inode and verifying the mapping afterwards.

I was hoping to avoid teaching device-dax about radix entries...

However, now that I look again, we're already protected against
device-dax inode teardown by the dev_pagemap reference.

I'll switch over.

>> +out:
>> +     if (mapping)
>> +             iput(mapping->host);
>
> BTW, this would have to be prepared to do full inode deletion which I'm not
> quite sure is safe from this context.

Ah, I had not considered that, now I know for future reference.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec()
  2018-06-07  4:42   ` Dan Williams
@ 2018-06-07 17:02     ` Luck, Tony
  0 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2018-06-07 17:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, linux-edac, X86 ML, Christoph Hellwig, Linux MM,
	linux-fsdevel, Jan Kara

On Wed, Jun 06, 2018 at 09:42:28PM -0700, Dan Williams wrote:
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Tony Luck <tony.luck@intel.com>
> 
> Tony, safe to assume you are ok with this patch now that the
> decoy_addr approach is back?
> 

Yes. s/Cc/Acked-by/ for my line above.

-Tony

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-06-07 17:02 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-04 23:11 [PATCH v3 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
2018-06-04 23:11 ` [PATCH v3 01/12] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
2018-06-04 23:11 ` [PATCH v3 02/12] device-dax: Cleanup vm_fault de-reference chains Dan Williams
2018-06-04 23:11 ` [PATCH v3 03/12] device-dax: Enable page_mapping() Dan Williams
2018-06-07 15:39   ` Jan Kara
2018-06-04 23:12 ` [PATCH v3 04/12] device-dax: Set page->index Dan Williams
2018-06-07 15:50   ` Jan Kara
2018-06-04 23:12 ` [PATCH v3 05/12] filesystem-dax: " Dan Williams
2018-06-07 16:02   ` Jan Kara
2018-06-04 23:12 ` [PATCH v3 06/12] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
2018-06-04 23:12 ` [PATCH v3 07/12] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
2018-06-04 23:12 ` [PATCH v3 08/12] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
2018-06-07  4:42   ` Dan Williams
2018-06-07 17:02     ` Luck, Tony
2018-06-04 23:12 ` [PATCH v3 09/12] mm, memory_failure: Pass page size to kill_proc() Dan Williams
2018-06-04 23:12 ` [PATCH v3 10/12] mm, memory_failure: Fix page->mapping assumptions relative to the page lock Dan Williams
2018-06-04 23:12 ` [PATCH v3 11/12] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
2018-06-07  4:22   ` Dan Williams
2018-06-07 16:30   ` Jan Kara
2018-06-07 16:45     ` Dan Williams
2018-06-04 23:12 ` [PATCH v3 12/12] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).