* [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
@ 2018-07-04 21:40 Dan Williams
  2018-07-04 21:40 ` [PATCH v5 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
                   ` (11 more replies)
  0 siblings, 12 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-edac, Tony Luck, Borislav Petkov, Jérôme Glisse,
	Jan Kara, H. Peter Anvin, x86, Thomas Gleixner,
	Christoph Hellwig, Ross Zwisler, Matthew Wilcox, Ingo Molnar,
	Michal Hocko, Naoya Horiguchi, Souptick Joarder, hch,
	linux-fsdevel, linux-mm, linux-kernel, jack, ross.zwisler

Changes since v4 [1]:
* Rework dax_lock_page() to reuse get_unlocked_mapping_entry() (Jan)

* Change the calling convention to take a 'struct page *' and return
  success / failure instead of performing the pfn_to_page() internal to
  the api (Jan, Ross).

* Rename dax_lock_page() to dax_lock_mapping_entry() (Jan)

* Account for the case that a given pfn can be fsdax mapped with
  different sizes in different vmas (Jan)

* Update collect_procs() to determine the mapping size of the pfn for
  each page, given that it can be variable in the dax case.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016279.html

---

As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific handling for several
possible page states and needs new handling for poison in dax
mappings.

In order to support reliable reverse mapping of user space addresses:

1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.

2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.

3/ Given that pmem errors can be repaired, change the protection against
speculative consumption of poison, mce_unmap_kpfn(), to be reversible
and otherwise allow ongoing access from the kernel.
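
Taken together, the memory_failure() entry path for a dev_pagemap pfn
becomes, roughly (a condensed sketch of what patch 8 below implements,
error handling trimmed):

    /* memory_failure(): divert dev_pagemap pfns to the new handler */
    pgmap = get_dev_pagemap(pfn, NULL);
    if (pgmap)
            /*
             * Takes the dax entry lock, determines the per-vma mapping
             * size, unmaps the largest mapping, and sends BUS_MCEERR_AR
             * to the affected processes.
             */
            return memory_failure_dev_pagemap(pfn, flags, pgmap);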

A side effect of this enabling is that MADV_HWPOISON becomes usable for
dax mappings; however, the primary motivation is to allow the system to
survive userspace consumption of hardware poison via dax. Specifically,
the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks.

---

Dan Williams (11):
      device-dax: Convert to vmf_insert_mixed and vm_fault_t
      device-dax: Enable page_mapping()
      device-dax: Set page->index
      filesystem-dax: Set page->index
      mm, madvise_inject_error: Let memory_failure() optionally take a page reference
      mm, memory_failure: Collect mapping size in collect_procs()
      filesystem-dax: Introduce dax_lock_mapping_entry()
      mm, memory_failure: Teach memory_failure() about dev_pagemap pages
      x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses
      x86/memory_failure: Introduce {set,clear}_mce_nospec()
      libnvdimm, pmem: Restore page attributes when clearing errors


 arch/x86/include/asm/set_memory.h         |   42 ++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 --
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 -----
 arch/x86/mm/pat.c                         |   16 ++
 drivers/dax/device.c                      |   75 +++++++----
 drivers/nvdimm/pmem.c                     |   26 ++++
 drivers/nvdimm/pmem.h                     |   13 ++
 fs/dax.c                                  |  125 +++++++++++++++++-
 include/linux/dax.h                       |   24 +++
 include/linux/huge_mm.h                   |    5 -
 include/linux/mm.h                        |    1 
 include/linux/set_memory.h                |   14 ++
 mm/huge_memory.c                          |    4 -
 mm/madvise.c                              |   18 ++-
 mm/memory-failure.c                       |  201 +++++++++++++++++++++++------
 15 files changed, 483 insertions(+), 134 deletions(-)


* [PATCH v5 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
@ 2018-07-04 21:40 ` Dan Williams
  2018-07-04 21:40 ` [PATCH v5 02/11] device-dax: Enable page_mapping() Dan Williams
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Souptick Joarder, Matthew Wilcox, Ross Zwisler, hch, hch,
	linux-fsdevel, linux-mm, linux-kernel, jack, ross.zwisler

Use the new return type vm_fault_t for the fault and huge_fault
handlers. For now, this just documents that the functions return a
VM_FAULT value rather than an errno. Once all instances are converted,
vm_fault_t will become a distinct type.

See commit 1c8f422059ae ("mm: change return type to vm_fault_t").

Previously vm_insert_mixed() returned an error code which the driver
mapped into a VM_FAULT_* type. The new vmf_insert_mixed() removes this
inefficiency by returning a VM_FAULT_* type directly.
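
For readability, the conversion pattern as it lands in
__dev_dax_pte_fault() (extracted from the diff below):

    /* before: translate the errno from vm_insert_mixed() by hand */
    rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
    if (rc == -ENOMEM)
            return VM_FAULT_OOM;
    if (rc < 0 && rc != -EBUSY)
            return VM_FAULT_SIGBUS;
    return VM_FAULT_NOPAGE;

    /* after: vmf_insert_mixed() returns a vm_fault_t directly */
    return vmf_insert_mixed(vmf->vma, vmf->address, pfn);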

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c    |   26 +++++++++++---------------
 include/linux/huge_mm.h |    5 +++--
 mm/huge_memory.c        |    4 ++--
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index de2f8297a210..ad5e7b4a15dc 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -244,11 +244,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
-	int rc = VM_FAULT_SIGBUS;
 	phys_addr_t phys;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
@@ -274,17 +274,11 @@ static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 
 	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
-
-	if (rc == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (rc < 0 && rc != -EBUSY)
-		return VM_FAULT_SIGBUS;
-
-	return VM_FAULT_NOPAGE;
+	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
 
-static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -334,7 +328,8 @@ static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct device *dev = &dev_dax->dev;
@@ -384,13 +379,14 @@ static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+				struct vm_fault *vmf)
 {
 	return VM_FAULT_FALLBACK;
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-static int dev_dax_huge_fault(struct vm_fault *vmf,
+static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int rc, id;
@@ -420,7 +416,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
 	return rc;
 }
 
-static int dev_dax_fault(struct vm_fault *vmf)
+static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
 {
 	return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
 }
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a126259bc4..d3bbf6bea9e9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
 #define _LINUX_HUGE_MM_H
 
 #include <linux/sched/coredump.h>
+#include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
@@ -46,9 +47,9 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a57a14..feba371169ca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -752,7 +752,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
@@ -812,7 +812,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	spin_unlock(ptl);
 }
 
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 			pud_t *pud, pfn_t pfn, bool write)
 {
 	pgprot_t pgprot = vma->vm_page_prot;


* [PATCH v5 02/11] device-dax: Enable page_mapping()
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-07-04 21:40 ` [PATCH v5 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
@ 2018-07-04 21:40 ` Dan Williams
  2018-07-04 21:40 ` [PATCH v5 03/11] device-dax: Set page->index Dan Williams
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, hch, hch, linux-fsdevel, linux-mm, linux-kernel, jack,
	ross.zwisler

In support of enabling memory_failure() handling for device-dax
mappings, set the ->mapping association of pages backing device-dax
mappings. The rmap implementation requires page_mapping() to return the
address_space hosting the vmas that map the page.

The ->mapping pointer is never cleared. There is no possibility for the
page to become associated with another address_space while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed and later reinitialized to zero.
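
For reference, the rmap consumer that this enables looks roughly like
collect_procs_file() in mm/memory-failure.c (simplified sketch, not the
exact upstream code):

    struct address_space *mapping = page_mapping(page);
    pgoff_t pgoff = page_to_pgoff(page);
    struct vm_area_struct *vma;

    i_mmap_lock_read(mapping);
    vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
            /* every vma found here maps the poisoned page and is
             * queued for a SIGBUS via add_to_kill() */
    }
    i_mmap_unlock_read(mapping);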

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |   55 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 38 insertions(+), 17 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index ad5e7b4a15dc..95cfcfd612df 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -245,12 +245,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
-	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
 	if (check_vma(dev_dax, vmf->vma, __func__))
@@ -272,20 +271,19 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	return vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
 	if (check_vma(dev_dax, vmf->vma, __func__))
@@ -321,22 +319,21 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
+	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct device *dev = &dev_dax->dev;
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
-	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
 
@@ -373,14 +370,14 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
+	return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, *pfn,
 			vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, pfn_t *pfn)
 {
 	return VM_FAULT_FALLBACK;
 }
@@ -389,8 +386,10 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
-	int rc, id;
 	struct file *filp = vmf->vma->vm_file;
+	unsigned long fault_size;
+	int rc, id;
+	pfn_t pfn;
 	struct dev_dax *dev_dax = filp->private_data;
 
 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
@@ -400,17 +399,39 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	id = dax_read_lock();
 	switch (pe_size) {
 	case PE_SIZE_PTE:
-		rc = __dev_dax_pte_fault(dev_dax, vmf);
+		fault_size = PAGE_SIZE;
+		rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PMD:
-		rc = __dev_dax_pmd_fault(dev_dax, vmf);
+		fault_size = PMD_SIZE;
+		rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
 		break;
 	case PE_SIZE_PUD:
-		rc = __dev_dax_pud_fault(dev_dax, vmf);
+		fault_size = PUD_SIZE;
+		rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
 		break;
 	default:
 		rc = VM_FAULT_SIGBUS;
 	}
+
+	if (rc == VM_FAULT_NOPAGE) {
+		unsigned long i;
+
+		/*
+		 * In the device-dax case the only possibility for a
+		 * VM_FAULT_NOPAGE result is when device-dax capacity is
+		 * mapped. No need to consider the zero page, or racing
+		 * conflicting mappings.
+		 */
+		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+			struct page *page;
+
+			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+			if (page->mapping)
+				continue;
+			page->mapping = filp->f_mapping;
+		}
+	}
 	dax_read_unlock(id);
 
 	return rc;


* [PATCH v5 03/11] device-dax: Set page->index
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  2018-07-04 21:40 ` [PATCH v5 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
  2018-07-04 21:40 ` [PATCH v5 02/11] device-dax: Enable page_mapping() Dan Williams
@ 2018-07-04 21:40 ` Dan Williams
  2018-07-04 21:40 ` [PATCH v5 04/11] filesystem-dax: " Dan Williams
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, hch, hch, linux-fsdevel, linux-mm, linux-kernel, jack,
	ross.zwisler

In support of enabling memory_failure() handling for device-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree.

The ->index value is never cleared. There is no possibility for the
page to become associated with another pgoff while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed and ->index is reinitialized to zero.
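
Concretely, the rmap code converts ->index back into the user virtual
address that receives the SIGBUS, along the lines of __vma_address() in
mm/internal.h (simplified):

    pgoff_t pgoff = page_to_pgoff(page);    /* uses page->index */
    unsigned long addr = vma->vm_start +
                    ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);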

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 95cfcfd612df..361a11089591 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -416,6 +416,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 
 	if (rc == VM_FAULT_NOPAGE) {
 		unsigned long i;
+		pgoff_t pgoff;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -423,6 +424,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
+		pgoff = linear_page_index(vmf->vma, vmf->address
+				& ~(fault_size - 1));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
@@ -430,6 +433,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 			if (page->mapping)
 				continue;
 			page->mapping = filp->f_mapping;
+			page->index = pgoff + i;
 		}
 	}
 	dax_read_unlock(id);


* [PATCH v5 04/11] filesystem-dax: Set page->index
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (2 preceding siblings ...)
  2018-07-04 21:40 ` [PATCH v5 03/11] device-dax: Set page->index Dan Williams
@ 2018-07-04 21:40 ` Dan Williams
  2018-07-04 21:40 ` [PATCH v5 05/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Christoph Hellwig, Matthew Wilcox, Ross Zwisler, Jan Kara, hch,
	linux-fsdevel, linux-mm, linux-kernel, jack, ross.zwisler

In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.
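
A PMD-sized entry covers 512 4K pages on x86; ignoring the hugetlb
special case inside linear_page_index(), the index assignment in the
diff below expands to roughly:

    /* pgoff of the first page backing the PMD-aligned fault address */
    index = ((vmf->address & PMD_MASK) - vma->vm_start) >> PAGE_SHIFT;
    index += vma->vm_pgoff;
    /* page N within the entry then gets page->index = index + N */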

Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 641192808bb6..4de11ed463ce 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,18 +319,27 @@ static unsigned long dax_radix_end_pfn(void *entry)
 	for (pfn = dax_radix_pfn(entry); \
 			pfn < dax_radix_end_pfn(entry); pfn++)
 
-static void dax_associate_entry(void *entry, struct address_space *mapping)
+/*
+ * TODO: for reflink+dax we need a way to associate a single page with
+ * multiple address_space instances at different linear_page_index()
+ * offsets.
+ */
+static void dax_associate_entry(void *entry, struct address_space *mapping,
+		struct vm_area_struct *vma, unsigned long address)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn, index;
+	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
+	index = linear_page_index(vma, address & ~(size - 1));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(page->mapping);
 		page->mapping = mapping;
+		page->index = index + i++;
 	}
 }
 
@@ -348,6 +357,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
+		page->index = 0;
 	}
 }
 
@@ -701,7 +711,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	new_entry = dax_radix_locked_entry(pfn, flags);
 	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping);
+		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {


* [PATCH v5 05/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (3 preceding siblings ...)
  2018-07-04 21:40 ` [PATCH v5 04/11] filesystem-dax: " Dan Williams
@ 2018-07-04 21:40 ` Dan Williams
  2018-07-13  6:31   ` Naoya Horiguchi
  2018-07-04 21:40 ` [PATCH v5 06/11] mm, memory_failure: Collect mapping size in collect_procs() Dan Williams
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Naoya Horiguchi, hch, hch, linux-fsdevel, linux-mm,
	linux-kernel, jack, ross.zwisler

The madvise_inject_error() routine uses get_user_pages_fast() to look up
the pfn and other information for the injected error, but it does not
release that pin. The assumption is that failed pages should be taken
out of circulation.

However, for dax mappings it is not possible to take pages out of
circulation since they are 1:1 physically mapped as filesystem blocks or
device-dax capacity. They also typically represent persistent memory,
which has an error-clearing capability.

In preparation for adding a special handler for dax mappings, shift the
responsibility of taking the page reference to memory_failure(). I.e.
drop the page reference taken here and do not pass MF_COUNT_INCREASED to
memory_failure().
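
This is the path exercised by madvise(MADV_HWPOISON). A minimal
userspace injection test against a dax mapping could look something
like the following (illustrative only; requires CAP_SYS_ADMIN and
CONFIG_MEMORY_FAILURE, and the device path is just an example):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            int fd = open("/dev/dax0.0", O_RDWR);
            void *addr;

            if (fd < 0)
                    return 1;
            addr = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED)
                    return 1;

            /* poison the first page of the mapping */
            if (madvise(addr, 4096, MADV_HWPOISON))
                    perror("madvise");
            return 0;
    }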

Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/madvise.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..b731933dddae 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
 
 
 	for (; start < end; start += PAGE_SIZE << order) {
+		unsigned long pfn;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		pfn = page_to_pfn(page);
 
 		/*
 		 * When soft offlining hugepages, after migrating the page
@@ -651,17 +653,27 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+					pfn, start);
 
 			ret = soft_offline_page(page, MF_COUNT_INCREASED);
 			if (ret)
 				return ret;
 			continue;
 		}
+
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+				pfn, start);
+
+		ret = memory_failure(pfn, 0);
+
+		/*
+		 * Drop the page reference taken by get_user_pages_fast(). In
+		 * the absence of MF_COUNT_INCREASED the memory_failure()
+		 * routine is responsible for pinning the page to prevent it
+		 * from being released back to the page allocator.
+		 */
+		put_page(page);
 
-		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
 		if (ret)
 			return ret;
 	}


* [PATCH v5 06/11] mm, memory_failure: Collect mapping size in collect_procs()
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (4 preceding siblings ...)
  2018-07-04 21:40 ` [PATCH v5 05/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
@ 2018-07-04 21:40 ` Dan Williams
  2018-07-13  6:49   ` Naoya Horiguchi
  2018-07-04 21:41 ` [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry() Dan Williams
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:40 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Naoya Horiguchi, hch, hch, linux-fsdevel, linux-mm, linux-kernel,
	jack, ross.zwisler

In preparation for supporting memory_failure() for dax mappings, teach
collect_procs() to also determine the mapping size. Unlike typical
mappings the dax mapping size is determined by walking page-table
entries rather than using the compound-page accounting for THP pages.
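
The size ends up in siginfo as si_addr_lsb, so a dax-aware application
can distinguish a failed 2M mapping from a failed 4K page. An
illustrative userspace sketch (demo only, fprintf is not
async-signal-safe):

    #include <signal.h>
    #include <stdio.h>

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
            if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
                    fprintf(stderr, "poison at %p, range 2^%d bytes\n",
                            si->si_addr, si->si_addr_lsb);
    }

    /* installed with sigaction() and SA_SIGINFO set for SIGBUS */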

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory-failure.c |   81 +++++++++++++++++++++++++--------------------------
 1 file changed, 40 insertions(+), 41 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9d142b9b86dc..4d70753af59c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -174,22 +174,51 @@ int hwpoison_filter(struct page *p)
 EXPORT_SYMBOL_GPL(hwpoison_filter);
 
 /*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+	short size_shift;
+	char addr_valid;
+};
+
+/*
  * Send all the processes who have the page mapped a signal.
  * ``action optional'' if they are not immediately affected by the error
  * ``action required'' if error happened in current execution context
  */
-static int kill_proc(struct task_struct *t, unsigned long addr,
-			unsigned long pfn, struct page *page, int flags)
+static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
 {
-	short addr_lsb;
+	struct task_struct *t = tk->tsk;
+	short addr_lsb = tk->size_shift;
 	int ret;
 
 	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
 		pfn, t->comm, t->pid);
-	addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
 
 	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
-		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
+		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
 				       addr_lsb, current);
 	} else {
 		/*
@@ -198,7 +227,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
 		 * This could cause a loop when the user sets SIGBUS
 		 * to SIG_IGN, but hopefully no one will do that?
 		 */
-		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
+		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
 				      addr_lsb, t);  /* synchronous? */
 	}
 	if (ret < 0)
@@ -235,35 +264,6 @@ void shake_page(struct page *p, int access)
 EXPORT_SYMBOL_GPL(shake_page);
 
 /*
- * Kill all processes that have a poisoned page mapped and then isolate
- * the page.
- *
- * General strategy:
- * Find all processes having the page mapped and kill them.
- * But we keep a page reference around so that the page is not
- * actually freed yet.
- * Then stash the page away
- *
- * There's no convenient way to get back to mapped processes
- * from the VMAs. So do a brute-force search over all
- * running processes.
- *
- * Remember that machine checks are not common (or rather
- * if they are common you have other problems), so this shouldn't
- * be a performance issue.
- *
- * Also there are some races possible while we get from the
- * error detection to actually handle it.
- */
-
-struct to_kill {
-	struct list_head nd;
-	struct task_struct *tsk;
-	unsigned long addr;
-	char addr_valid;
-};
-
-/*
  * Failure handling: if we can't find or can't kill a process there's
  * not much we can do.	We just print a message and ignore otherwise.
  */
@@ -292,6 +292,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 	}
 	tk->addr = page_address_in_vma(p, vma);
 	tk->addr_valid = 1;
+	tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
 
 	/*
 	 * In theory we don't have to kill when the page was
@@ -317,9 +318,8 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
  * Also when FAIL is set do a force kill because something went
  * wrong earlier.
  */
-static void kill_procs(struct list_head *to_kill, int forcekill,
-			  bool fail, struct page *page, unsigned long pfn,
-			  int flags)
+static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
+		unsigned long pfn, int flags)
 {
 	struct to_kill *tk, *next;
 
@@ -342,8 +342,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
 			 * check for that, but we need to tell the
 			 * process anyways.
 			 */
-			else if (kill_proc(tk->tsk, tk->addr,
-					      pfn, page, flags) < 0)
+			else if (kill_proc(tk, pfn, flags) < 0)
 				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
 				       pfn, tk->tsk->comm, tk->tsk->pid);
 		}
@@ -1012,7 +1011,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * any accesses to the poisoned memory.
 	 */
 	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
-	kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
+	kill_procs(&tokill, forcekill, !unmap_success, pfn, flags);
 
 	return unmap_success;
 }


* [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry()
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (5 preceding siblings ...)
  2018-07-04 21:40 ` [PATCH v5 06/11] mm, memory_failure: Collect mapping size in collect_procs() Dan Williams
@ 2018-07-04 21:41 ` Dan Williams
  2018-07-05  1:07   ` kbuild test robot
                     ` (3 more replies)
  2018-07-04 21:41 ` [PATCH v5 08/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
                   ` (4 subsequent siblings)
  11 siblings, 4 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:41 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: hch, hch, linux-fsdevel, linux-mm, linux-kernel, jack, ross.zwisler

In preparation for implementing support for memory poison (media error)
handling via dax mappings, implement a lock_page() equivalent. Poison
handling requires rmap, and rmap needs a guarantee that the
page->mapping association stays valid (inode not freed) for the
duration of the lookup.

In the device-dax case it is sufficient to simply hold a dev_pagemap
reference. In the filesystem-dax case we need to use the entry lock.

Export the entry lock via dax_lock_mapping_entry(), which uses
rcu_read_lock() to protect against the inode being freed and
revalidates the page->mapping association under xa_lock().
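
Patch 8 then uses this as the lock_page() stand-in around the rmap
walk; condensed from memory_failure_dev_pagemap() introduced there:

    if (!dax_lock_mapping_entry(page))
            goto out;       /* could not stabilize page->mapping */

    /* page->mapping and page->index are now stable for:
     * collect_procs(), unmap_mapping_range(), kill_procs() */

    dax_unlock_mapping_entry(page);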

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |  109 ++++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/dax.h |   24 +++++++++++
 2 files changed, 127 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4de11ed463ce..57ec272038da 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -226,8 +226,8 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
  *
  * Must be called with the i_pages lock held.
  */
-static void *get_unlocked_mapping_entry(struct address_space *mapping,
-					pgoff_t index, void ***slotp)
+static void *__get_unlocked_mapping_entry(struct address_space *mapping,
+		pgoff_t index, void ***slotp, bool (*wait_fn)(void))
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
@@ -237,6 +237,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.wait.func = wake_exceptional_entry_func;
 
 	for (;;) {
+		bool revalidate;
+
 		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
 					  &slot);
 		if (!entry ||
@@ -251,14 +253,31 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		xa_unlock_irq(&mapping->i_pages);
-		schedule();
+		revalidate = wait_fn();
 		finish_wait(wq, &ewait.wait);
 		xa_lock_irq(&mapping->i_pages);
+		if (revalidate)
+			return ERR_PTR(-EAGAIN);
 	}
 }
 
-static void dax_unlock_mapping_entry(struct address_space *mapping,
-				     pgoff_t index)
+static bool entry_wait(void)
+{
+	schedule();
+	/*
+	 * Never return an ERR_PTR() from
+	 * __get_unlocked_mapping_entry(), just keep looping.
+	 */
+	return false;
+}
+
+static void *get_unlocked_mapping_entry(struct address_space *mapping,
+		pgoff_t index, void ***slotp)
+{
+	return __get_unlocked_mapping_entry(mapping, index, slotp, entry_wait);
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
 	void *entry, **slot;
 
@@ -277,7 +296,7 @@ static void dax_unlock_mapping_entry(struct address_space *mapping,
 static void put_locked_mapping_entry(struct address_space *mapping,
 		pgoff_t index)
 {
-	dax_unlock_mapping_entry(mapping, index);
+	unlock_mapping_entry(mapping, index);
 }
 
 /*
@@ -374,6 +393,84 @@ static struct page *dax_busy_page(void *entry)
 	return NULL;
 }
 
+static bool entry_wait_revalidate(void)
+{
+	rcu_read_unlock();
+	schedule();
+	rcu_read_lock();
+
+	/*
+	 * Tell __get_unlocked_mapping_entry() to take a break, we need
+	 * to revalidate page->mapping after dropping locks
+	 */
+	return true;
+}
+
+bool dax_lock_mapping_entry(struct page *page)
+{
+	pgoff_t index;
+	struct inode *inode;
+	bool did_lock = false;
+	void *entry = NULL, **slot;
+	struct address_space *mapping;
+
+	rcu_read_lock();
+	for (;;) {
+		mapping = READ_ONCE(page->mapping);
+
+		if (!dax_mapping(mapping))
+			break;
+
+		/*
+		 * In the device-dax case there's no need to lock, a
+		 * struct dev_pagemap pin is sufficient to keep the
+		 * inode alive, and we assume we have dev_pagemap pin
+		 * otherwise we would not have a valid pfn_to_page()
+		 * translation.
+		 */
+		inode = mapping->host;
+		if (S_ISCHR(inode->i_mode)) {
+			did_lock = true;
+			break;
+		}
+
+		xa_lock_irq(&mapping->i_pages);
+		if (mapping != page->mapping) {
+			xa_unlock_irq(&mapping->i_pages);
+			continue;
+		}
+		index = page->index;
+
+		entry = __get_unlocked_mapping_entry(mapping, index, &slot,
+				entry_wait_revalidate);
+		if (!entry) {
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		} else if (IS_ERR(entry)) {
+			WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
+			continue;
+		}
+		lock_slot(mapping, slot);
+		did_lock = true;
+		xa_unlock_irq(&mapping->i_pages);
+		break;
+	}
+	rcu_read_unlock();
+
+	return did_lock;
+}
+
+void dax_unlock_mapping_entry(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+
+	if (S_ISCHR(inode->i_mode))
+		return;
+
+	unlock_mapping_entry(mapping, page->index);
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
diff --git a/include/linux/dax.h b/include/linux/dax.h
index deb0f663252f..82b9856faf7f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -88,6 +88,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
+bool dax_lock_mapping_entry(struct page *page);
+void dax_unlock_mapping_entry(struct page *page);
 #else
 static inline bool bdev_dax_supported(struct block_device *bdev,
 		int blocksize)
@@ -119,6 +121,28 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+
+
+static inline bool dax_lock_mapping_entry(struct page *page)
+{
+	if (IS_DAX(page->mapping->host))
+		return true;
+	return false;
+}
+
+static inline void dax_unlock_mapping_entry(struct page *page)
+{
+}
+
+static inline struct page *dax_lock_page(unsigned long pfn)
+{
+	return NULL;
+}
+
+static inline void dax_unlock_page(struct page *page)
+{
+}
 #endif
 
 int dax_read_lock(void);


* [PATCH v5 08/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (6 preceding siblings ...)
  2018-07-04 21:41 ` [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry() Dan Williams
@ 2018-07-04 21:41 ` Dan Williams
  2018-07-13  8:52   ` Naoya Horiguchi
  2018-07-04 21:41 ` [PATCH v5 09/11] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:41 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Jérôme Glisse,
	Matthew Wilcox, Naoya Horiguchi, Ross Zwisler, hch,
	linux-fsdevel, linux-mm, linux-kernel, jack, ross.zwisler

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered

In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical address to file offset associations. Also,
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internally remap / repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.

Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h  |    1 
 mm/memory-failure.c |  124 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 123 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9ffe380..374e5e9284f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2725,6 +2725,7 @@ enum mf_action_page_type {
 	MF_MSG_TRUNCATED_LRU,
 	MF_MSG_BUDDY,
 	MF_MSG_BUDDY_2ND,
+	MF_MSG_DAX,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d70753af59c..161aa1b70212 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -55,6 +55,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memory_hotplug.h>
 #include <linux/mm_inline.h>
+#include <linux/memremap.h>
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include "internal.h"
@@ -263,6 +264,39 @@ void shake_page(struct page *p, int access)
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
+static unsigned long mapping_size(struct page *page, struct vm_area_struct *vma)
+{
+	unsigned long address = vma_address(page, vma);
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(vma->vm_mm, address);
+	if (!pgd_present(*pgd))
+		return 0;
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return 0;
+	pud = pud_offset(p4d, address);
+	if (!pud_present(*pud))
+		return 0;
+	if (pud_devmap(*pud))
+		return PUD_SIZE;
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return 0;
+	if (pmd_devmap(*pmd))
+		return PMD_SIZE;
+	pte = pte_offset_map(pmd, address);
+	if (!pte_present(*pte))
+		return 0;
+	if (pte_devmap(*pte))
+		return PAGE_SIZE;
+	return 0;
+}
+
 /*
  * Failure handling: if we can't find or can't kill a process there's
  * not much we can do.	We just print a message and ignore otherwise.
@@ -292,7 +326,10 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 	}
 	tk->addr = page_address_in_vma(p, vma);
 	tk->addr_valid = 1;
-	tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
+	if (is_zone_device_page(p))
+		tk->size_shift = ilog2(mapping_size(p, vma));
+	else
+		tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
 
 	/*
 	 * In theory we don't have to kill when the page was
@@ -300,7 +337,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 	 * likely very rare kill anyways just out of paranoia, but use
 	 * a SIGKILL because the error is not contained anymore.
 	 */
-	if (tk->addr == -EFAULT) {
+	if (tk->addr == -EFAULT || tk->size_shift == 0) {
 		pr_info("Memory failure: Unable to find user space address %lx in %s\n",
 			page_to_pfn(p), tsk->comm);
 		tk->addr_valid = 0;
@@ -514,6 +551,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_TRUNCATED_LRU]		= "already truncated LRU page",
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
+	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1111,6 +1149,83 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	return res;
 }
 
+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	const bool unmap_success = true;
+	unsigned long size = 0;
+	struct to_kill *tk;
+	LIST_HEAD(tokill);
+	int rc = -EBUSY;
+	loff_t start;
+
+	/*
+	 * Prevent the inode from being freed while we are interrogating
+	 * the address_space, typically this would be handled by
+	 * lock_page(), but dax pages do not use the page lock. This
+	 * also prevents changes to the mapping of this pfn until
+	 * poison signaling is complete.
+	 */
+	if (!dax_lock_mapping_entry(page))
+		goto out;
+
+	if (hwpoison_filter(page)) {
+		rc = 0;
+		goto unlock;
+	}
+
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		goto unlock;
+	default:
+		break;
+	}
+
+	/*
+	 * Use this flag as an indication that the dax page has been
+	 * remapped UC to prevent speculative consumption of poison.
+	 */
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+
+	list_for_each_entry(tk, &tokill, nd)
+		if (tk->size_shift)
+			size = max(size, 1UL << tk->size_shift);
+	if (size) {
+		/*
+		 * Unmap the largest mapping to avoid breaking up
+		 * device-dax mappings which are constant size. The
+		 * actual size of the mapping being torn down is
+		 * communicated in siginfo, see kill_proc()
+		 */
+		start = (page->index << PAGE_SHIFT) & ~(size - 1);
+		unmap_mapping_range(page->mapping, start, start + size, 0);
+	}
+	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
+	rc = 0;
+unlock:
+	dax_unlock_mapping_entry(page);
+out:
+	/* drop pgmap ref acquired in caller */
+	put_dev_pagemap(pgmap);
+	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
+	return rc;
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1133,6 +1248,7 @@ int memory_failure(unsigned long pfn, int flags)
 	struct page *p;
 	struct page *hpage;
 	struct page *orig_head;
+	struct dev_pagemap *pgmap;
 	int res;
 	unsigned long page_flags;
 
@@ -1145,6 +1261,10 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	pgmap = get_dev_pagemap(pfn, NULL);
+	if (pgmap)
+		return memory_failure_dev_pagemap(pfn, flags, pgmap);
+
 	p = pfn_to_page(pfn);
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);


* [PATCH v5 09/11] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (7 preceding siblings ...)
  2018-07-04 21:41 ` [PATCH v5 08/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-07-04 21:41 ` Dan Williams
  2018-07-04 21:41 ` [PATCH v5 10/11] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:41 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck,
	Borislav Petkov, linux-edac, x86, hch, hch, linux-fsdevel,
	linux-mm, linux-kernel, jack, ross.zwisler

In preparation for using set_memory_uc() instead of set_memory_np() for
isolating poison from speculation, teach the memtype code to sanitize
physical addresses against __PHYSICAL_MASK.

The motivation for using set_memory_uc() for this case is to allow
ongoing access to persistent memory pages via the pmem-driver +
memcpy_mcsafe() until the poison is repaired.
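
To see why this is needed (names taken from patch 10; a sketch of the
arithmetic, not new code): set_mce_nospec() builds a "decoy" linear-map
address with bit 63 flipped, so __pa() of the decoy yields a physical
address with bit 63 set, which reserve_memtype() / free_memtype() must
strip before doing range comparisons:

    decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
    /* __pa(decoy_addr)                   == (pfn << PAGE_SHIFT) | BIT(63) */
    /* __pa(decoy_addr) & __PHYSICAL_MASK == (pfn << PAGE_SHIFT)           */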

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/pat.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 1555bd7d3449..6788ffa990f8 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -512,6 +512,17 @@ static int free_ram_pages_type(u64 start, u64 end)
 	return 0;
 }
 
+static u64 sanitize_phys(u64 address)
+{
+	/*
+	 * When changing the memtype for pages containing poison allow
+	 * for a "decoy" virtual address (bit 63 clear) passed to
+	 * set_memory_X(). __pa() on a "decoy" address results in a
+	 * physical address with bit 63 set.
+	 */
+	return address & __PHYSICAL_MASK;
+}
+
 /*
  * req_type typically has one of the:
  * - _PAGE_CACHE_MODE_WB
@@ -533,6 +544,8 @@ int reserve_memtype(u64 start, u64 end, enum page_cache_mode req_type,
 	int is_range_ram;
 	int err = 0;
 
+	start = sanitize_phys(start);
+	end = sanitize_phys(end);
 	BUG_ON(start >= end); /* end is exclusive */
 
 	if (!pat_enabled()) {
@@ -609,6 +622,9 @@ int free_memtype(u64 start, u64 end)
 	if (!pat_enabled())
 		return 0;
 
+	start = sanitize_phys(start);
+	end = sanitize_phys(end);
+
 	/* Low ISA region is always mapped WB. No need to track */
 	if (x86_platform.is_untracked_pat_range(start, end))
 		return 0;


* [PATCH v5 10/11] x86/memory_failure: Introduce {set, clear}_mce_nospec()
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (8 preceding siblings ...)
  2018-07-04 21:41 ` [PATCH v5 09/11] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
@ 2018-07-04 21:41 ` Dan Williams
  2018-07-04 21:41 ` [PATCH v5 11/11] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
  2018-07-13  4:44 ` [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  11 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:41 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Borislav Petkov,
	linux-edac, x86, Tony Luck, hch, hch, linux-fsdevel, linux-mm,
	linux-kernel, jack, ross.zwisler

Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.

Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_kpfn() to set_mce_nospec(), remapping the page
as UC to hide it from speculative accesses.

Given that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().
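
The resulting call site in the machine check path (condensed from the
diff below):

    /* srao_decode_notifier() / do_memory_failure() */
    if (!memory_failure(pfn, 0))
            set_mce_nospec(pfn);    /* remap the 1:1 mapping UC */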

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/set_memory.h         |   42 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ----------
 arch/x86/kernel/cpu/mcheck/mce.c          |   38 ++------------------------
 include/linux/set_memory.h                |   14 ++++++++++
 4 files changed, 59 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index bd090367236c..cf5e9124b45e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -88,4 +88,46 @@ extern int kernel_set_to_readonly;
 void set_kernel_text_rw(void);
 void set_kernel_text_ro(void);
 
+#ifdef CONFIG_X86_64
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	unsigned long decoy_addr;
+	int rc;
+
+	/*
+	 * Mark the linear address as UC to make sure we don't log more
+	 * errors because of speculative access to the page.
+	 * We would like to just call:
+	 *      set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+	 * but doing that would radically increase the odds of a
+	 * speculative access to the poison page because we'd have
+	 * the virtual address of the kernel 1:1 mapping sitting
+	 * around in registers.
+	 * Instead we get tricky.  We create a non-canonical address
+	 * that looks just like the one we want, but has bit 63 flipped.
+	 * This relies on set_memory_uc() properly sanitizing any __pa()
+	 * results with __PHYSICAL_MASK or PTE_PFN_MASK.
+	 */
+	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+
+	rc = set_memory_uc(decoy_addr, 1);
+	if (rc)
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+	return rc;
+}
+#define set_mce_nospec set_mce_nospec
+
+/* Restore full speculative operation to the pfn. */
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+}
+#define clear_mce_nospec clear_mce_nospec
+#else
+/*
+ * Few people would run a 32-bit kernel on a machine that supports
+ * recoverable errors because they have too much memory to boot 32-bit.
+ */
+#endif
+
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 374d1aa66952..ceb67cd5918f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -113,21 +113,6 @@ static inline void mce_register_injector_chain(struct notifier_block *nb)	{ }
 static inline void mce_unregister_injector_chain(struct notifier_block *nb)	{ }
 #endif
 
-#ifndef CONFIG_X86_64
-/*
- * On 32-bit systems it would be difficult to safely unmap a poison page
- * from the kernel 1:1 map because there are no non-canonical addresses that
- * we can use to refer to the address without risking a speculative access.
- * However, this isn't much of an issue because:
- * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
- *    are only mapped into the kernel as needed
- * 2) Few people would run a 32-bit kernel on a machine that supports
- *    recoverable errors because they have too much memory to boot 32-bit.
- */
-static inline void mce_unmap_kpfn(unsigned long pfn) {}
-#define mce_unmap_kpfn mce_unmap_kpfn
-#endif
-
 struct mca_config {
 	bool dont_log_ce;
 	bool cmci_disabled;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c102ad51025e..42a061ce1f5d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -42,6 +42,7 @@
 #include <linux/irq_work.h>
 #include <linux/export.h>
 #include <linux/jump_label.h>
+#include <linux/set_memory.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -50,7 +51,6 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
-#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
@@ -108,10 +108,6 @@ static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn);
-#endif
-
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -602,7 +598,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 	if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
 		pfn = mce->addr >> PAGE_SHIFT;
 		if (!memory_failure(pfn, 0))
-			mce_unmap_kpfn(pfn);
+			set_mce_nospec(pfn);
 	}
 
 	return NOTIFY_OK;
@@ -1072,38 +1068,10 @@ static int do_memory_failure(struct mce *m)
 	if (ret)
 		pr_err("Memory error not recovered");
 	else
-		mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
+		set_mce_nospec(m->addr >> PAGE_SHIFT);
 	return ret;
 }
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn)
-{
-	unsigned long decoy_addr;
-
-	/*
-	 * Unmap this page from the kernel 1:1 mappings to make sure
-	 * we don't log more errors because of speculative access to
-	 * the page.
-	 * We would like to just call:
-	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
-	 * but doing that would radically increase the odds of a
-	 * speculative access to the poison page because we'd have
-	 * the virtual address of the kernel 1:1 mapping sitting
-	 * around in registers.
-	 * Instead we get tricky.  We create a non-canonical address
-	 * that looks just like the one we want, but has bit 63 flipped.
-	 * This relies on set_memory_np() not checking whether we passed
-	 * a legal address.
-	 */
-
-	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
-	if (set_memory_np(decoy_addr, 1))
-		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-}
-#endif
-
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index da5178216da5..2a986d282a97 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -17,6 +17,20 @@ static inline int set_memory_x(unsigned long addr,  int numpages) { return 0; }
 static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
+#ifndef set_mce_nospec
+static inline int set_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
+#ifndef clear_mce_nospec
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+	return 0;
+}
+#endif
+
 #ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
 static inline int set_memory_encrypted(unsigned long addr, int numpages)
 {


* [PATCH v5 11/11] libnvdimm, pmem: Restore page attributes when clearing errors
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (9 preceding siblings ...)
  2018-07-04 21:41 ` [PATCH v5 10/11] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
@ 2018-07-04 21:41 ` Dan Williams
  2018-07-13  4:44 ` [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
  11 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-04 21:41 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: hch, hch, linux-fsdevel, linux-mm, linux-kernel, jack, ross.zwisler

Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
has also been marked UC in PAT (page attribute table) via
set_mce_nospec() to prevent speculative retrievals of poison.

The 'HWPoison' flag is only cleared when overwriting an entire page.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pmem.c |   26 ++++++++++++++++++++++++++
 drivers/nvdimm/pmem.h |   13 +++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8b1fd7f1a224..55c7a69751d3 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -20,6 +20,7 @@
 #include <linux/hdreg.h>
 #include <linux/init.h>
 #include <linux/platform_device.h>
+#include <linux/set_memory.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/badblocks.h>
@@ -51,6 +52,30 @@ static struct nd_region *to_region(struct pmem_device *pmem)
 	return to_nd_region(to_dev(pmem)->parent);
 }
 
+static void hwpoison_clear(struct pmem_device *pmem,
+		phys_addr_t phys, unsigned int len)
+{
+	unsigned long pfn_start, pfn_end, pfn;
+
+	/* only pmem in the linear map supports HWPoison */
+	if (is_vmalloc_addr(pmem->virt_addr))
+		return;
+
+	pfn_start = PHYS_PFN(phys);
+	pfn_end = pfn_start + PHYS_PFN(len);
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * Note, no need to hold a get_dev_pagemap() reference
+		 * here since we're in the driver I/O path and
+		 * outstanding I/O requests pin the dev_pagemap.
+		 */
+		if (test_and_clear_pmem_poison(page))
+			clear_mce_nospec(pfn);
+	}
+}
+
 static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 		phys_addr_t offset, unsigned int len)
 {
@@ -65,6 +90,7 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	if (cleared < len)
 		rc = BLK_STS_IOERR;
 	if (cleared > 0 && cleared / 512) {
+		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
 		cleared /= 512;
 		dev_dbg(dev, "%#llx clear %ld sector%s\n",
 				(unsigned long long) sector, cleared,
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index a64ebc78b5df..59cfe13ea8a8 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __NVDIMM_PMEM_H__
 #define __NVDIMM_PMEM_H__
+#include <linux/page-flags.h>
 #include <linux/badblocks.h>
 #include <linux/types.h>
 #include <linux/pfn_t.h>
@@ -27,4 +28,16 @@ struct pmem_device {
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn);
+
+#ifdef CONFIG_MEMORY_FAILURE
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return TestClearPageHWPoison(page);
+}
+#else
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+	return false;
+}
+#endif
 #endif /* __NVDIMM_PMEM_H__ */


* Re: [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry()
  2018-07-04 21:41 ` [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry() Dan Williams
@ 2018-07-05  1:07   ` kbuild test robot
  2018-07-05  3:31   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: kbuild test robot @ 2018-07-05  1:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: kbuild-all, linux-nvdimm, hch, linux-fsdevel, linux-mm,
	linux-kernel, jack, ross.zwisler

[-- Attachment #1: Type: text/plain, Size: 3443 bytes --]

Hi Dan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc3]
[cannot apply to next-20180704]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/device-dax-Convert-to-vmf_insert_mixed-and-vm_fault_t/20180705-075150
config: x86_64-acpi-redef (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   In file included from fs/ext2/file.c:24:0:
   include/linux/dax.h: In function 'dax_lock_mapping_entry':
>> include/linux/dax.h:128:15: error: 'page' redeclared as different kind of symbol
     struct page *page = pfn_to_page(pfn);
                  ^~~~
   include/linux/dax.h:126:56: note: previous definition of 'page' was here
    static inline bool dax_lock_mapping_entry(struct page *page)
                                                           ^~~~
   In file included from arch/x86/include/asm/page.h:76:0,
                    from arch/x86/include/asm/thread_info.h:12,
                    from include/linux/thread_info.h:38,
                    from arch/x86/include/asm/preempt.h:7,
                    from include/linux/preempt.h:81,
                    from include/linux/spinlock.h:51,
                    from include/linux/seqlock.h:36,
                    from include/linux/time.h:6,
                    from fs/ext2/file.c:22:
>> include/linux/dax.h:128:34: error: 'pfn' undeclared (first use in this function); did you mean '__pfn'?
     struct page *page = pfn_to_page(pfn);
                                     ^
   include/asm-generic/memory_model.h:69:27: note: in definition of macro '__pfn_to_page'
    ({ unsigned long __pfn = (pfn);   \
                              ^~~
   include/linux/dax.h:128:22: note: in expansion of macro 'pfn_to_page'
     struct page *page = pfn_to_page(pfn);
                         ^~~~~~~~~~~
   include/linux/dax.h:128:34: note: each undeclared identifier is reported only once for each function it appears in
     struct page *page = pfn_to_page(pfn);
                                     ^
   include/asm-generic/memory_model.h:69:27: note: in definition of macro '__pfn_to_page'
    ({ unsigned long __pfn = (pfn);   \
                              ^~~
   include/linux/dax.h:128:22: note: in expansion of macro 'pfn_to_page'
     struct page *page = pfn_to_page(pfn);
                         ^~~~~~~~~~~
   In file included from fs/ext2/file.c:24:0:
   include/linux/dax.h: In function 'dax_lock_page':
>> include/linux/dax.h:141:1: warning: no return statement in function returning non-void [-Wreturn-type]
    }
    ^

vim +/page +128 include/linux/dax.h

   124	
   125	
   126	static inline bool dax_lock_mapping_entry(struct page *page)
   127	{
 > 128		struct page *page = pfn_to_page(pfn);
   129	
   130		if (IS_DAX(page->mapping->host))
   131			return true;
   132		return false;
   133	}
   134	
   135	void dax_unlock_mapping_entry(struct page *page)
   136	{
   137	}
   138	
   139	static inline struct page *dax_lock_page(unsigned long pfn)
   140	{
 > 141	}
   142	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30726 bytes --]


* Re: [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry()
  2018-07-04 21:41 ` [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry() Dan Williams
  2018-07-05  1:07   ` kbuild test robot
@ 2018-07-05  3:31   ` kbuild test robot
  2018-07-05  3:33   ` [PATCH v6] " Dan Williams
  2018-09-24 15:57   ` [PATCH v5 07/11] " Barret Rhoden
  3 siblings, 0 replies; 24+ messages in thread
From: kbuild test robot @ 2018-07-05  3:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: kbuild-all, linux-nvdimm, hch, linux-fsdevel, linux-mm,
	linux-kernel, jack, ross.zwisler

[-- Attachment #1: Type: text/plain, Size: 3863 bytes --]

Hi Dan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc3]
[cannot apply to next-20180704]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/device-dax-Convert-to-vmf_insert_mixed-and-vm_fault_t/20180705-075150
config: openrisc-or1ksim_defconfig (attached as .config)
compiler: or1k-linux-gcc (GCC) 6.0.0 20160327 (experimental)
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/mempolicy.h:11:0,
                    from init/main.c:56:
   include/linux/dax.h: In function 'dax_lock_mapping_entry':
   include/linux/dax.h:128:15: error: 'page' redeclared as different kind of symbol
     struct page *page = pfn_to_page(pfn);
                  ^~~~
   include/linux/dax.h:126:56: note: previous definition of 'page' was here
    static inline bool dax_lock_mapping_entry(struct page *page)
                                                           ^~~~
   In file included from arch/openrisc/include/asm/page.h:98:0,
                    from arch/openrisc/include/asm/processor.h:23,
                    from arch/openrisc/include/asm/thread_info.h:26,
                    from include/linux/thread_info.h:38,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/openrisc/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:81,
                    from include/linux/spinlock.h:51,
                    from include/linux/seqlock.h:36,
                    from include/linux/time.h:6,
                    from include/linux/stat.h:19,
                    from include/linux/module.h:10,
                    from init/main.c:16:
>> include/linux/dax.h:128:34: error: 'pfn' undeclared (first use in this function)
     struct page *page = pfn_to_page(pfn);
                                     ^
   include/asm-generic/memory_model.h:33:41: note: in definition of macro '__pfn_to_page'
    #define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))
                                            ^~~
>> include/linux/dax.h:128:22: note: in expansion of macro 'pfn_to_page'
     struct page *page = pfn_to_page(pfn);
                         ^~~~~~~~~~~
   include/linux/dax.h:128:34: note: each undeclared identifier is reported only once for each function it appears in
     struct page *page = pfn_to_page(pfn);
                                     ^
   include/asm-generic/memory_model.h:33:41: note: in definition of macro '__pfn_to_page'
    #define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))
                                            ^~~
>> include/linux/dax.h:128:22: note: in expansion of macro 'pfn_to_page'
     struct page *page = pfn_to_page(pfn);
                         ^~~~~~~~~~~
   In file included from include/linux/mempolicy.h:11:0,
                    from init/main.c:56:
   include/linux/dax.h: In function 'dax_lock_page':
   include/linux/dax.h:141:1: warning: no return statement in function returning non-void [-Wreturn-type]
    }
    ^

vim +/pfn +128 include/linux/dax.h

   124	
   125	
   126	static inline bool dax_lock_mapping_entry(struct page *page)
   127	{
 > 128		struct page *page = pfn_to_page(pfn);
   129	
   130		if (IS_DAX(page->mapping->host))
   131			return true;
   132		return false;
   133	}
   134	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7654 bytes --]


* [PATCH v6] filesystem-dax: Introduce dax_lock_mapping_entry()
  2018-07-04 21:41 ` [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry() Dan Williams
  2018-07-05  1:07   ` kbuild test robot
  2018-07-05  3:31   ` kbuild test robot
@ 2018-07-05  3:33   ` Dan Williams
  2018-09-24 15:57   ` [PATCH v5 07/11] " Barret Rhoden
  3 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-05  3:33 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: jack, hch, linux-fsdevel, linux-kernel, linux-mm

In preparation for implementing support for memory poison (media error)
handling via dax mappings, implement a lock_page() equivalent. Poison
error handling requires rmap and needs guarantees that the page->mapping
association is maintained / valid (inode not freed) for the duration of
the lookup.

In the device-dax case it is sufficient to simply hold a dev_pagemap
reference. In the filesystem-dax case we need to use the entry lock.

Export the entry lock via dax_lock_mapping_entry() that uses
rcu_read_lock() to protect against the inode being freed, and
revalidates the page->mapping association under xa_lock().
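
For context, the intended usage in the memory_failure() path (added by a
later patch in this series) looks roughly like this sketch:

	if (!dax_lock_mapping_entry(page))
		goto out;
	/*
	 * page->mapping and page->index are now stable and can be
	 * interrogated for rmap / poison signaling purposes.
	 */
	...
	dax_unlock_mapping_entry(page);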

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v5: fix up the kbuild robot reports.

 fs/dax.c            |  109 ++++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/dax.h |   13 ++++++
 2 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4de11ed463ce..57ec272038da 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -226,8 +226,8 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
  *
  * Must be called with the i_pages lock held.
  */
-static void *get_unlocked_mapping_entry(struct address_space *mapping,
-					pgoff_t index, void ***slotp)
+static void *__get_unlocked_mapping_entry(struct address_space *mapping,
+		pgoff_t index, void ***slotp, bool (*wait_fn)(void))
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
@@ -237,6 +237,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.wait.func = wake_exceptional_entry_func;
 
 	for (;;) {
+		bool revalidate;
+
 		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
 					  &slot);
 		if (!entry ||
@@ -251,14 +253,31 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		xa_unlock_irq(&mapping->i_pages);
-		schedule();
+		revalidate = wait_fn();
 		finish_wait(wq, &ewait.wait);
 		xa_lock_irq(&mapping->i_pages);
+		if (revalidate)
+			return ERR_PTR(-EAGAIN);
 	}
 }
 
-static void dax_unlock_mapping_entry(struct address_space *mapping,
-				     pgoff_t index)
+static bool entry_wait(void)
+{
+	schedule();
+	/*
+	 * Never return an ERR_PTR() from
+	 * __get_unlocked_mapping_entry(), just keep looping.
+	 */
+	return false;
+}
+
+static void *get_unlocked_mapping_entry(struct address_space *mapping,
+		pgoff_t index, void ***slotp)
+{
+	return __get_unlocked_mapping_entry(mapping, index, slotp, entry_wait);
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
 	void *entry, **slot;
 
@@ -277,7 +296,7 @@ static void dax_unlock_mapping_entry(struct address_space *mapping,
 static void put_locked_mapping_entry(struct address_space *mapping,
 		pgoff_t index)
 {
-	dax_unlock_mapping_entry(mapping, index);
+	unlock_mapping_entry(mapping, index);
 }
 
 /*
@@ -374,6 +393,84 @@ static struct page *dax_busy_page(void *entry)
 	return NULL;
 }
 
+static bool entry_wait_revalidate(void)
+{
+	rcu_read_unlock();
+	schedule();
+	rcu_read_lock();
+
+	/*
+	 * Tell __get_unlocked_mapping_entry() to take a break, we need
+	 * to revalidate page->mapping after dropping locks
+	 */
+	return true;
+}
+
+bool dax_lock_mapping_entry(struct page *page)
+{
+	pgoff_t index;
+	struct inode *inode;
+	bool did_lock = false;
+	void *entry = NULL, **slot;
+	struct address_space *mapping;
+
+	rcu_read_lock();
+	for (;;) {
+		mapping = READ_ONCE(page->mapping);
+
+		if (!dax_mapping(mapping))
+			break;
+
+		/*
+		 * In the device-dax case there's no need to lock, a
+		 * struct dev_pagemap pin is sufficient to keep the
+		 * inode alive, and we assume we have dev_pagemap pin
+		 * otherwise we would not have a valid pfn_to_page()
+		 * translation.
+		 */
+		inode = mapping->host;
+		if (S_ISCHR(inode->i_mode)) {
+			did_lock = true;
+			break;
+		}
+
+		xa_lock_irq(&mapping->i_pages);
+		if (mapping != page->mapping) {
+			xa_unlock_irq(&mapping->i_pages);
+			continue;
+		}
+		index = page->index;
+
+		entry = __get_unlocked_mapping_entry(mapping, index, &slot,
+				entry_wait_revalidate);
+		if (!entry) {
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		} else if (IS_ERR(entry)) {
+			WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
+			continue;
+		}
+		lock_slot(mapping, slot);
+		did_lock = true;
+		xa_unlock_irq(&mapping->i_pages);
+		break;
+	}
+	rcu_read_unlock();
+
+	return did_lock;
+}
+
+void dax_unlock_mapping_entry(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+
+	if (S_ISCHR(inode->i_mode))
+		return;
+
+	unlock_mapping_entry(mapping, page->index);
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
diff --git a/include/linux/dax.h b/include/linux/dax.h
index deb0f663252f..450b28db9533 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -88,6 +88,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
+bool dax_lock_mapping_entry(struct page *page);
+void dax_unlock_mapping_entry(struct page *page);
 #else
 static inline bool bdev_dax_supported(struct block_device *bdev,
 		int blocksize)
@@ -119,6 +121,17 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline bool dax_lock_mapping_entry(struct page *page)
+{
+	if (IS_DAX(page->mapping->host))
+		return true;
+	return false;
+}
+
+static inline void dax_unlock_mapping_entry(struct page *page)
+{
+}
 #endif
 
 int dax_read_lock(void);


* Re: [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
  2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
                   ` (10 preceding siblings ...)
  2018-07-04 21:41 ` [PATCH v5 11/11] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
@ 2018-07-13  4:44 ` Dan Williams
  11 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-13  4:44 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-edac, Tony Luck, Borislav Petkov, Jérôme Glisse,
	Jan Kara, H. Peter Anvin, X86 ML, Thomas Gleixner,
	Christoph Hellwig, Ross Zwisler, Ingo Molnar, Michal Hocko,
	Naoya Horiguchi, Souptick Joarder, linux-fsdevel, Linux MM,
	Linux Kernel Mailing List, Matthew Wilcox

On Wed, Jul 4, 2018 at 2:40 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> Changes since v4 [1]:
> * Rework dax_lock_page() to reuse get_unlocked_mapping_entry() (Jan)
>
> * Change the calling convention to take a 'struct page *' and return
>   success / failure instead of performing the pfn_to_page() internal to
>   the api (Jan, Ross).
>
> * Rename dax_lock_page() to dax_lock_mapping_entry() (Jan)
>
> * Account for the case that a given pfn can be fsdax mapped with
>   different sizes in different vmas (Jan)
>
> * Update collect_procs() to determine the mapping size of the pfn for
>   each page given it can be variable in the dax case.
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016279.html
>
> ---
>
> As it stands, memory_failure() gets thoroughly confused by dev_pagemap
> backed mappings. The recovery code has specific enabling for several
> possible page states and needs new enabling to handle poison in dax
> mappings.
>
> In order to support reliable reverse mapping of user space addresses:
>
> 1/ Add new locking in the memory_failure() rmap path to prevent races
> that would typically be handled by the page lock.
>
> 2/ Since dev_pagemap pages are hidden from the page allocator and the
> "compound page" accounting machinery, add a mechanism to determine the
> size of the mapping that encompasses a given poisoned pfn.
>
> 3/ Given pmem errors can be repaired, change the speculatively accessed
> poison protection, mce_unmap_kpfn(), to be reversible and otherwise
> allow ongoing access from the kernel.
>
> A side effect of this enabling is that MADV_HWPOISON becomes usable for
> dax mappings, however the primary motivation is to allow the system to
> survive userspace consumption of hardware-poison via dax. Specifically
> the current behavior is:
>
>     mce: Uncorrected hardware memory error in user-access at af34214200
>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>     mce: [Hardware Error]: Machine check events logged
>     {1}[Hardware Error]: event severity: corrected
>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>     [..]
>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>     mce: Memory error not recovered
>     <reboot>
>
> ...and with these changes:
>
>     Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
>     Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
>     Memory failure: 0x20cb00: recovery action for dax page: Recovered
>
> Given all the cross dependencies I propose taking this through
> nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
> folks.
>

Hi,

Any comments on this series? Matthew is patiently waiting to rebase
some of his Xarray work until the dax_lock_mapping_entry() changes hit
-next.


* Re: [PATCH v5 05/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  2018-07-04 21:40 ` [PATCH v5 05/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
@ 2018-07-13  6:31   ` Naoya Horiguchi
  2018-07-14  0:34     ` Dan Williams
  0 siblings, 1 reply; 24+ messages in thread
From: Naoya Horiguchi @ 2018-07-13  6:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Michal Hocko, hch, linux-fsdevel, linux-mm,
	linux-kernel, jack, ross.zwisler

Hello Dan,

On Wed, Jul 04, 2018 at 02:40:49PM -0700, Dan Williams wrote:
> The madvise_inject_error() routine uses get_user_pages() to lookup the
> pfn and other information for injected error, but it does not release
> that pin. The assumption is that failed pages should be taken out of
> circulation.
>
> However, for dax mappings it is not possible to take pages out of
> circulation since they are 1:1 physically mapped as filesystem blocks,
> or device-dax capacity. They also typically represent persistent memory
> which has an error clearing capability.
>
> In preparation for adding a special handler for dax mappings, shift the
> responsibility of taking the page reference to memory_failure(). I.e.
> drop the page reference and do not specify MF_COUNT_INCREASED to
> memory_failure().
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  mm/madvise.c |   18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4d3c922ea1a1..b731933dddae 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>
>
>  	for (; start < end; start += PAGE_SIZE << order) {
> +		unsigned long pfn;
>  		int ret;
>
>  		ret = get_user_pages_fast(start, 1, 0, &page);
>  		if (ret != 1)
>  			return ret;
> +		pfn = page_to_pfn(page);
>
>  		/*
>  		 * When soft offlining hugepages, after migrating the page
> @@ -651,17 +653,27 @@ static int madvise_inject_error(int behavior,
>
>  		if (behavior == MADV_SOFT_OFFLINE) {
>  			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> -						page_to_pfn(page), start);
> +					pfn, start);
>
>  			ret = soft_offline_page(page, MF_COUNT_INCREASED);
>  			if (ret)
>  				return ret;
>  			continue;
>  		}
> +
>  		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
> -						page_to_pfn(page), start);
> +				pfn, start);
> +
> +		ret = memory_failure(pfn, 0);
> +
> +		/*
> +		 * Drop the page reference taken by get_user_pages_fast(). In
> +		 * the absence of MF_COUNT_INCREASED the memory_failure()
> +		 * routine is responsible for pinning the page to prevent it
> +		 * from being released back to the page allocator.
> +		 */
> +		put_page(page);
>
> -		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);

MF_COUNT_INCREASED means that the page refcount for memory error handling
is taken by the caller, so you don't have to take one inside memory_failure().
This code doesn't follow that definition, so another refcount can be
taken in memory_failure(), for example in the normal LRU page case.
As a result the error message "Memory failure: %#lx: %s still referenced by
%d users\n" will be dumped in page_action().

So if you want to put put_page() in madvise_inject_error(), I think that

 		put_page(page);
 		ret = memory_failure(pfn, 0);

is acceptable, because the purpose of get_user_pages_fast() here is
just to get the pfn, and the refcount itself is not that important.
IOW, memory_failure() is called only with the pfn, which never changes
regardless of the page's status.
In a production system memory_failure() is called via the machine check
code without taking any page count, so I don't think this injection
interface properly mocks the real thing. I expect this flag will be
removed at some point.

Thanks,
Naoya Horiguchi


* Re: [PATCH v5 06/11] mm, memory_failure: Collect mapping size in collect_procs()
  2018-07-04 21:40 ` [PATCH v5 06/11] mm, memory_failure: Collect mapping size in collect_procs() Dan Williams
@ 2018-07-13  6:49   ` Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2018-07-13  6:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, hch, linux-fsdevel, linux-mm, linux-kernel, jack,
	ross.zwisler

On Wed, Jul 04, 2018 at 02:40:55PM -0700, Dan Williams wrote:
> In preparation for supporting memory_failure() for dax mappings, teach
> collect_procs() to also determine the mapping size. Unlike typical
> mappings the dax mapping size is determined by walking page-table
> entries rather than using the compound-page accounting for THP pages.
> 
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

> ---
>  mm/memory-failure.c |   81 +++++++++++++++++++++++++--------------------------
>  1 file changed, 40 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 9d142b9b86dc..4d70753af59c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -174,22 +174,51 @@ int hwpoison_filter(struct page *p)
>  EXPORT_SYMBOL_GPL(hwpoison_filter);
>  
>  /*
> + * Kill all processes that have a poisoned page mapped and then isolate
> + * the page.
> + *
> + * General strategy:
> + * Find all processes having the page mapped and kill them.
> + * But we keep a page reference around so that the page is not
> + * actually freed yet.
> + * Then stash the page away
> + *
> + * There's no convenient way to get back to mapped processes
> + * from the VMAs. So do a brute-force search over all
> + * running processes.
> + *
> + * Remember that machine checks are not common (or rather
> + * if they are common you have other problems), so this shouldn't
> + * be a performance issue.
> + *
> + * Also there are some races possible while we get from the
> + * error detection to actually handle it.
> + */
> +
> +struct to_kill {
> +	struct list_head nd;
> +	struct task_struct *tsk;
> +	unsigned long addr;
> +	short size_shift;
> +	char addr_valid;
> +};
> +
> +/*
>   * Send all the processes who have the page mapped a signal.
>   * ``action optional'' if they are not immediately affected by the error
>   * ``action required'' if error happened in current execution context
>   */
> -static int kill_proc(struct task_struct *t, unsigned long addr,
> -			unsigned long pfn, struct page *page, int flags)
> +static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
>  {
> -	short addr_lsb;
> +	struct task_struct *t = tk->tsk;
> +	short addr_lsb = tk->size_shift;
>  	int ret;
>  
>  	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
>  		pfn, t->comm, t->pid);
> -	addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
>  
>  	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> -		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
> +		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
>  				       addr_lsb, current);
>  	} else {
>  		/*
> @@ -198,7 +227,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
>  		 * This could cause a loop when the user sets SIGBUS
>  		 * to SIG_IGN, but hopefully no one will do that?
>  		 */
> -		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
> +		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
>  				      addr_lsb, t);  /* synchronous? */
>  	}
>  	if (ret < 0)
> @@ -235,35 +264,6 @@ void shake_page(struct page *p, int access)
>  EXPORT_SYMBOL_GPL(shake_page);
>  
>  /*
> - * Kill all processes that have a poisoned page mapped and then isolate
> - * the page.
> - *
> - * General strategy:
> - * Find all processes having the page mapped and kill them.
> - * But we keep a page reference around so that the page is not
> - * actually freed yet.
> - * Then stash the page away
> - *
> - * There's no convenient way to get back to mapped processes
> - * from the VMAs. So do a brute-force search over all
> - * running processes.
> - *
> - * Remember that machine checks are not common (or rather
> - * if they are common you have other problems), so this shouldn't
> - * be a performance issue.
> - *
> - * Also there are some races possible while we get from the
> - * error detection to actually handle it.
> - */
> -
> -struct to_kill {
> -	struct list_head nd;
> -	struct task_struct *tsk;
> -	unsigned long addr;
> -	char addr_valid;
> -};
> -
> -/*
>   * Failure handling: if we can't find or can't kill a process there's
>   * not much we can do.	We just print a message and ignore otherwise.
>   */
> @@ -292,6 +292,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>  	}
>  	tk->addr = page_address_in_vma(p, vma);
>  	tk->addr_valid = 1;
> +	tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
>  
>  	/*
>  	 * In theory we don't have to kill when the page was
> @@ -317,9 +318,8 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>   * Also when FAIL is set do a force kill because something went
>   * wrong earlier.
>   */
> -static void kill_procs(struct list_head *to_kill, int forcekill,
> -			  bool fail, struct page *page, unsigned long pfn,
> -			  int flags)
> +static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
> +		unsigned long pfn, int flags)
>  {
>  	struct to_kill *tk, *next;
>  
> @@ -342,8 +342,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
>  			 * check for that, but we need to tell the
>  			 * process anyways.
>  			 */
> -			else if (kill_proc(tk->tsk, tk->addr,
> -					      pfn, page, flags) < 0)
> +			else if (kill_proc(tk, pfn, flags) < 0)
>  				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
>  				       pfn, tk->tsk->comm, tk->tsk->pid);
>  		}
> @@ -1012,7 +1011,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	 * any accesses to the poisoned memory.
>  	 */
>  	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
> -	kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
> +	kill_procs(&tokill, forcekill, !unmap_success, pfn, flags);
>  
>  	return unmap_success;
>  }
> 
> 


* Re: [PATCH v5 08/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-07-04 21:41 ` [PATCH v5 08/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
@ 2018-07-13  8:52   ` Naoya Horiguchi
  2018-07-14  0:28     ` Dan Williams
  0 siblings, 1 reply; 24+ messages in thread
From: Naoya Horiguchi @ 2018-07-13  8:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig,
	Jérôme Glisse, Matthew Wilcox, Ross Zwisler,
	linux-fsdevel, linux-mm, linux-kernel

On Wed, Jul 04, 2018 at 02:41:06PM -0700, Dan Williams wrote:
>     mce: Uncorrected hardware memory error in user-access at af34214200
>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>     mce: [Hardware Error]: Machine check events logged
>     {1}[Hardware Error]: event severity: corrected
>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>     [..]
>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>     mce: Memory error not recovered
>
> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
> dax there is no possibility to map in another page dynamically since dax
> establishes 1:1 physical address to file offset associations. Also
> dev_pagemap pages associated with NVDIMM / persistent memory devices can
> internal remap/repair addresses with poison. While memory_failure()
> assumes that it can discard typical poisoned pages and keep them
> unmapped indefinitely, dev_pagemap pages may be returned to service
> after the error is cleared.
>
> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
> dev_pagemap pages that have poison consumed by userspace. Mark the
> memory as UC instead of unmapping it completely to allow ongoing access
> via the device driver (nd_pmem). Later, nd_pmem will grow support for
> marking the page back to WB when the error is cleared.

By the way, what happens if madvise(MADV_SOFT_OFFLINE) is called on
a dev_pagemap page?  I'm not sure whether a CMCI can be triggered on an
nvdimm device, but this injection interface is open to such a case.
Maybe simply ignoring the event is the expected behavior?

A few comments/questions below ...

>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/mm.h  |    1
>  mm/memory-failure.c |  124 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 123 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9ffe380..374e5e9284f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2725,6 +2725,7 @@ enum mf_action_page_type {
>  	MF_MSG_TRUNCATED_LRU,
>  	MF_MSG_BUDDY,
>  	MF_MSG_BUDDY_2ND,
> +	MF_MSG_DAX,
>  	MF_MSG_UNKNOWN,
>  };
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 4d70753af59c..161aa1b70212 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -55,6 +55,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memory_hotplug.h>
>  #include <linux/mm_inline.h>
> +#include <linux/memremap.h>
>  #include <linux/kfifo.h>
>  #include <linux/ratelimit.h>
>  #include "internal.h"
> @@ -263,6 +264,39 @@ void shake_page(struct page *p, int access)
>  }
>  EXPORT_SYMBOL_GPL(shake_page);
>
> +static unsigned long mapping_size(struct page *page, struct vm_area_struct *vma)
> +{
> +	unsigned long address = vma_address(page, vma);
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +
> +	pgd = pgd_offset(vma->vm_mm, address);
> +	if (!pgd_present(*pgd))
> +		return 0;
> +	p4d = p4d_offset(pgd, address);
> +	if (!p4d_present(*p4d))
> +		return 0;
> +	pud = pud_offset(p4d, address);
> +	if (!pud_present(*pud))
> +		return 0;
> +	if (pud_devmap(*pud))
> +		return PUD_SIZE;
> +	pmd = pmd_offset(pud, address);
> +	if (!pmd_present(*pmd))
> +		return 0;
> +	if (pmd_devmap(*pmd))
> +		return PMD_SIZE;
> +	pte = pte_offset_map(pmd, address);
> +	if (!pte_present(*pte))
> +		return 0;
> +	if (pte_devmap(*pte))
> +		return PAGE_SIZE;
> +	return 0;
> +}
> +

The function name looks generic, but the function seems to be specific
to the devmap case, so could you include the word 'devmap' in the name?

>  /*
>   * Failure handling: if we can't find or can't kill a process there's
>   * not much we can do.	We just print a message and ignore otherwise.
> @@ -292,7 +326,10 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>  	}
>  	tk->addr = page_address_in_vma(p, vma);
>  	tk->addr_valid = 1;
> -	tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
> +	if (is_zone_device_page(p))
> +		tk->size_shift = ilog2(mapping_size(p, vma));
> +	else
> +		tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
>
>  	/*
>  	 * In theory we don't have to kill when the page was
> @@ -300,7 +337,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>  	 * likely very rare kill anyways just out of paranoia, but use
>  	 * a SIGKILL because the error is not contained anymore.
>  	 */
> -	if (tk->addr == -EFAULT) {
> +	if (tk->addr == -EFAULT || tk->size_shift == 0) {
>  		pr_info("Memory failure: Unable to find user space address %lx in %s\n",
>  			page_to_pfn(p), tsk->comm);
>  		tk->addr_valid = 0;
> @@ -514,6 +551,7 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_TRUNCATED_LRU]		= "already truncated LRU page",
>  	[MF_MSG_BUDDY]			= "free buddy page",
>  	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
> +	[MF_MSG_DAX]			= "dax page",
>  	[MF_MSG_UNKNOWN]		= "unknown page",
>  };
>
> @@ -1111,6 +1149,83 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  	return res;
>  }
>
> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> +		struct dev_pagemap *pgmap)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	const bool unmap_success = true;
> +	unsigned long size = 0;
> +	struct to_kill *tk;
> +	LIST_HEAD(tokill);
> +	int rc = -EBUSY;
> +	loff_t start;
> +
> +	/*
> +	 * Prevent the inode from being freed while we are interrogating
> +	 * the address_space, typically this would be handled by
> +	 * lock_page(), but dax pages do not use the page lock. This
> +	 * also prevents changes to the mapping of this pfn until
> +	 * poison signaling is complete.
> +	 */
> +	if (!dax_lock_mapping_entry(page))
> +		goto out;
> +
> +	if (hwpoison_filter(page)) {
> +		rc = 0;
> +		goto unlock;
> +	}
> +
> +	switch (pgmap->type) {
> +	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_PUBLIC:
> +		/*
> +		 * TODO: Handle HMM pages which may need coordination
> +		 * with device-side memory.
> +		 */
> +		goto unlock;
> +	default:
> +		break;
> +	}
> +
> +	/*
> +	 * Use this flag as an indication that the dax page has been
> +	 * remapped UC to prevent speculative consumption of poison.
> +	 */
> +	SetPageHWPoison(page);

The number of hwpoison pages is maintained by num_poisoned_pages,
so could you call num_poisoned_pages_inc() here?

Related to this, I'm interested in whether/how unpoison_page() works
on a hwpoisoned dev_pagemap page.

Thanks,
Naoya Horiguchi

> +
> +	/*
> +	 * Unlike System-RAM there is no possibility to swap in a
> +	 * different physical page at a given virtual address, so all
> +	 * userspace consumption of ZONE_DEVICE memory necessitates
> +	 * SIGBUS (i.e. MF_MUST_KILL)
> +	 */
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
> +
> +	list_for_each_entry(tk, &tokill, nd)
> +		if (tk->size_shift)
> +			size = max(size, 1UL << tk->size_shift);
> +	if (size) {
> +		/*
> +		 * Unmap the largest mapping to avoid breaking up
> +		 * device-dax mappings which are constant size. The
> +		 * actual size of the mapping being torn down is
> +		 * communicated in siginfo, see kill_proc()
> +		 */
> +		start = (page->index << PAGE_SHIFT) & ~(size - 1);
> +		unmap_mapping_range(page->mapping, start, start + size, 0);
> +	}
> +	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
> +	rc = 0;
> +unlock:
> +	dax_unlock_mapping_entry(page);
> +out:
> +	/* drop pgmap ref acquired in caller */
> +	put_dev_pagemap(pgmap);
> +	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
> +	return rc;
> +}
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1133,6 +1248,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	struct page *p;
>  	struct page *hpage;
>  	struct page *orig_head;
> +	struct dev_pagemap *pgmap;
>  	int res;
>  	unsigned long page_flags;
>
> @@ -1145,6 +1261,10 @@ int memory_failure(unsigned long pfn, int flags)
>  		return -ENXIO;
>  	}
>
> +	pgmap = get_dev_pagemap(pfn, NULL);
> +	if (pgmap)
> +		return memory_failure_dev_pagemap(pfn, flags, pgmap);
> +
>  	p = pfn_to_page(pfn);
>  	if (PageHuge(p))
>  		return memory_failure_hugetlb(pfn, flags);
>


* Re: [PATCH v5 08/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-07-13  8:52   ` Naoya Horiguchi
@ 2018-07-14  0:28     ` Dan Williams
  2018-07-17  6:36       ` Naoya Horiguchi
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2018-07-14  0:28 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig,
	Jérôme Glisse, Matthew Wilcox, Ross Zwisler,
	linux-fsdevel, linux-mm, linux-kernel

On Fri, Jul 13, 2018 at 1:52 AM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> On Wed, Jul 04, 2018 at 02:41:06PM -0700, Dan Williams wrote:
>>     mce: Uncorrected hardware memory error in user-access at af34214200
>>     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>>     mce: [Hardware Error]: Machine check events logged
>>     {1}[Hardware Error]: event severity: corrected
>>     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
>>     [..]
>>     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
>>     mce: Memory error not recovered
>>
>> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
>> dax there is no possibility to map in another page dynamically since dax
>> establishes 1:1 physical address to file offset associations. Also
>> dev_pagemap pages associated with NVDIMM / persistent memory devices can
>> internal remap/repair addresses with poison. While memory_failure()
>> assumes that it can discard typical poisoned pages and keep them
>> unmapped indefinitely, dev_pagemap pages may be returned to service
>> after the error is cleared.
>>
>> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
>> dev_pagemap pages that have poison consumed by userspace. Mark the
>> memory as UC instead of unmapping it completely to allow ongoing access
>> via the device driver (nd_pmem). Later, nd_pmem will grow support for
>> marking the page back to WB when the error is cleared.
>
> By the way, what happens if madvise(MADV_SOFT_OFFLINE) is called on
> a dev_pagemap page?  I'm not sure that cmci can be triggered on the
> nvdimm device, but this injection interface is open for such a case.
> Maybe simply ignoring the event is an expected behavior?

Yeah, I should explicitly block attempts to MADV_SOFT_OFFLINE a
dev_pagemap page because they are never onlined. They'll never appear
on an lru and never be available via the page allocator.
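
For illustration, a minimal sketch of such a check in
madvise_inject_error() (hypothetical and untested, not necessarily what
will be posted; it assumes is_zone_device_page() is a sufficient test
for dev_pagemap backed pages):

		/*
		 * Sketch: dev_pagemap pages are never onlined, so soft
		 * offline has nothing it can do with them.
		 */
		if (behavior == MADV_SOFT_OFFLINE &&
				is_zone_device_page(page)) {
			put_page(page);
			return -EIO;
		}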

>
> A few comments/quetions below ...
>
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Jérôme Glisse <jglisse@redhat.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  include/linux/mm.h  |    1
>>  mm/memory-failure.c |  124 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 123 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a0fbb9ffe380..374e5e9284f7 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2725,6 +2725,7 @@ enum mf_action_page_type {
>>       MF_MSG_TRUNCATED_LRU,
>>       MF_MSG_BUDDY,
>>       MF_MSG_BUDDY_2ND,
>> +     MF_MSG_DAX,
>>       MF_MSG_UNKNOWN,
>>  };
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 4d70753af59c..161aa1b70212 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -55,6 +55,7 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/memory_hotplug.h>
>>  #include <linux/mm_inline.h>
>> +#include <linux/memremap.h>
>>  #include <linux/kfifo.h>
>>  #include <linux/ratelimit.h>
>>  #include "internal.h"
>> @@ -263,6 +264,39 @@ void shake_page(struct page *p, int access)
>>  }
>>  EXPORT_SYMBOL_GPL(shake_page);
>>
>> +static unsigned long mapping_size(struct page *page, struct vm_area_struct *vma)
>> +{
>> +     unsigned long address = vma_address(page, vma);
>> +     pgd_t *pgd;
>> +     p4d_t *p4d;
>> +     pud_t *pud;
>> +     pmd_t *pmd;
>> +     pte_t *pte;
>> +
>> +     pgd = pgd_offset(vma->vm_mm, address);
>> +     if (!pgd_present(*pgd))
>> +             return 0;
>> +     p4d = p4d_offset(pgd, address);
>> +     if (!p4d_present(*p4d))
>> +             return 0;
>> +     pud = pud_offset(p4d, address);
>> +     if (!pud_present(*pud))
>> +             return 0;
>> +     if (pud_devmap(*pud))
>> +             return PUD_SIZE;
>> +     pmd = pmd_offset(pud, address);
>> +     if (!pmd_present(*pmd))
>> +             return 0;
>> +     if (pmd_devmap(*pmd))
>> +             return PMD_SIZE;
>> +     pte = pte_offset_map(pmd, address);
>> +     if (!pte_present(*pte))
>> +             return 0;
>> +     if (pte_devmap(*pte))
>> +             return PAGE_SIZE;
>> +     return 0;
>> +}
>> +
>
> The function name looks generic, but this function seems to focus on
> devmap thing, so could you include the word 'devmap' in the name?

Ok.

>
>>  /*
>>   * Failure handling: if we can't find or can't kill a process there's
>>   * not much we can do.       We just print a message and ignore otherwise.
>> @@ -292,7 +326,10 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>>       }
>>       tk->addr = page_address_in_vma(p, vma);
>>       tk->addr_valid = 1;
>> -     tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
>> +     if (is_zone_device_page(p))
>> +             tk->size_shift = ilog2(mapping_size(p, vma));
>> +     else
>> +             tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
>>
>>       /*
>>        * In theory we don't have to kill when the page was
>> @@ -300,7 +337,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>>        * likely very rare kill anyways just out of paranoia, but use
>>        * a SIGKILL because the error is not contained anymore.
>>        */
>> -     if (tk->addr == -EFAULT) {
>> +     if (tk->addr == -EFAULT || tk->size_shift == 0) {
>>               pr_info("Memory failure: Unable to find user space address %lx in %s\n",
>>                       page_to_pfn(p), tsk->comm);
>>               tk->addr_valid = 0;
>> @@ -514,6 +551,7 @@ static const char * const action_page_types[] = {
>>       [MF_MSG_TRUNCATED_LRU]          = "already truncated LRU page",
>>       [MF_MSG_BUDDY]                  = "free buddy page",
>>       [MF_MSG_BUDDY_2ND]              = "free buddy page (2nd try)",
>> +     [MF_MSG_DAX]                    = "dax page",
>>       [MF_MSG_UNKNOWN]                = "unknown page",
>>  };
>>
>> @@ -1111,6 +1149,83 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
>>       return res;
>>  }
>>
>> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> +             struct dev_pagemap *pgmap)
>> +{
>> +     struct page *page = pfn_to_page(pfn);
>> +     const bool unmap_success = true;
>> +     unsigned long size = 0;
>> +     struct to_kill *tk;
>> +     LIST_HEAD(tokill);
>> +     int rc = -EBUSY;
>> +     loff_t start;
>> +
>> +     /*
>> +      * Prevent the inode from being freed while we are interrogating
>> +      * the address_space, typically this would be handled by
>> +      * lock_page(), but dax pages do not use the page lock. This
>> +      * also prevents changes to the mapping of this pfn until
>> +      * poison signaling is complete.
>> +      */
>> +     if (!dax_lock_mapping_entry(page))
>> +             goto out;
>> +
>> +     if (hwpoison_filter(page)) {
>> +             rc = 0;
>> +             goto unlock;
>> +     }
>> +
>> +     switch (pgmap->type) {
>> +     case MEMORY_DEVICE_PRIVATE:
>> +     case MEMORY_DEVICE_PUBLIC:
>> +             /*
>> +              * TODO: Handle HMM pages which may need coordination
>> +              * with device-side memory.
>> +              */
>> +             goto unlock;
>> +     default:
>> +             break;
>> +     }
>> +
>> +     /*
>> +      * Use this flag as an indication that the dax page has been
>> +      * remapped UC to prevent speculative consumption of poison.
>> +      */
>> +     SetPageHWPoison(page);
>
> The number of hwpoison pages is maintained by num_poisoned_pages,
> so you can call num_poisoned_pages_inc()?

I don't think we want these pages accounted in num_poisoned_pages().
We have the badblocks infrastructure in libnvdimm to track how many
errors and where they are located, and since they can be repaired via
driver actions I think we should track them separately.

> Related to this, I'm interested in whether/how unpoison_page() works
> on a hwpoisoned dev_pagemap page.

unpoison_page() is only triggered via freeing pages to the page
allocator, and that never happens for dev_pagemap / ZONE_DEVICE pages.


* Re: [PATCH v5 05/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  2018-07-13  6:31   ` Naoya Horiguchi
@ 2018-07-14  0:34     ` Dan Williams
  0 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2018-07-14  0:34 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-nvdimm, Michal Hocko, hch, linux-fsdevel, linux-mm,
	linux-kernel, jack, ross.zwisler

On Thu, Jul 12, 2018 at 11:31 PM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> Hello Dan,
>
> On Wed, Jul 04, 2018 at 02:40:49PM -0700, Dan Williams wrote:
>> The madvise_inject_error() routine uses get_user_pages() to lookup the
>> pfn and other information for injected error, but it does not release
>> that pin. The assumption is that failed pages should be taken out of
>> circulation.
>>
>> However, for dax mappings it is not possible to take pages out of
>> circulation since they are 1:1 physically mapped as filesystem blocks,
>> or device-dax capacity. They also typically represent persistent memory
>> which has an error clearing capability.
>>
>> In preparation for adding a special handler for dax mappings, shift the
>> responsibility of taking the page reference to memory_failure(). I.e.
>> drop the page reference and do not specify MF_COUNT_INCREASED to
>> memory_failure().
>>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  mm/madvise.c |   18 +++++++++++++++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 4d3c922ea1a1..b731933dddae 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>>
>>
>>       for (; start < end; start += PAGE_SIZE << order) {
>> +             unsigned long pfn;
>>               int ret;
>>
>>               ret = get_user_pages_fast(start, 1, 0, &page);
>>               if (ret != 1)
>>                       return ret;
>> +             pfn = page_to_pfn(page);
>>
>>               /*
>>                * When soft offlining hugepages, after migrating the page
>> @@ -651,17 +653,27 @@ static int madvise_inject_error(int behavior,
>>
>>               if (behavior == MADV_SOFT_OFFLINE) {
>>                       pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>> -                                             page_to_pfn(page), start);
>> +                                     pfn, start);
>>
>>                       ret = soft_offline_page(page, MF_COUNT_INCREASED);
>>                       if (ret)
>>                               return ret;
>>                       continue;
>>               }
>> +
>>               pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
>> -                                             page_to_pfn(page), start);
>> +                             pfn, start);
>> +
>> +             ret = memory_failure(pfn, 0);
>> +
>> +             /*
>> +              * Drop the page reference taken by get_user_pages_fast(). In
>> +              * the absence of MF_COUNT_INCREASED the memory_failure()
>> +              * routine is responsible for pinning the page to prevent it
>> +              * from being released back to the page allocator.
>> +              */
>> +             put_page(page);
>>
>> -             ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
>
> MF_COUNT_INCREASED means that the page refcount for memory error handling
> is taken by the caller so you don't have to take one inside memory_failure().
> So this code don't keep with the definition, then another refcount can be
> taken in memory_failure() in normal LRU page's case for example.
> As a result the error message "Memory failure: %#lx: %s still referenced by
> %d users\n" will be dumped in page_action().
>
> So if you want to put put_page() in madvise_inject_error(), I think that
>
>                 put_page(page);
>                 ret = memory_failure(pfn, 0);
>
> can be acceptable because the purpose of get_user_pages_fast() here is
> just getting pfn, and the refcount itself is not so important.
> IOW, memory_failure() is called only with pfn which never changes depending
> on the page's status.

Ok, I'll resend with the put_page() moved before memory_failure() to
make it clearer that memory_failure() is responsible for taking its
own reference and that there is no dependency on madvise_inject_error()
holding the reference.
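
For reference, a minimal sketch of the reordered loop body (illustrative
only; it mirrors the snippet quoted above and is not the final patch):

	pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
			pfn, start);

	/*
	 * Drop the get_user_pages_fast() reference before the call; without
	 * MF_COUNT_INCREASED, memory_failure() pins the page itself.
	 */
	put_page(page);

	ret = memory_failure(pfn, 0);
	if (ret)
		return ret;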

> In production systems memory_failure() is called via the machine check code
> without taking any page count, so I don't think this injection interface
> properly mocks the real thing. I suspect this flag will be removed at some
> point.

Ok, makes sense.


* Re: [PATCH v5 08/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  2018-07-14  0:28     ` Dan Williams
@ 2018-07-17  6:36       ` Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2018-07-17  6:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig,
	Jérôme Glisse, Matthew Wilcox, Ross Zwisler,
	linux-fsdevel, linux-mm, linux-kernel

On Fri, Jul 13, 2018 at 05:28:05PM -0700, Dan Williams wrote:
> On Fri, Jul 13, 2018 at 1:52 AM, Naoya Horiguchi
> <n-horiguchi@ah.jp.nec.com> wrote:
> > On Wed, Jul 04, 2018 at 02:41:06PM -0700, Dan Williams wrote:
...
> >> +
> >> +     /*
> >> +      * Use this flag as an indication that the dax page has been
> >> +      * remapped UC to prevent speculative consumption of poison.
> >> +      */
> >> +     SetPageHWPoison(page);
> >
> > The number of hwpoison pages is maintained by num_poisoned_pages,
> > so you can call num_poisoned_pages_inc()?
> 
> I don't think we want these pages accounted in num_poisoned_pages().
> We have the badblocks infrastructure in libnvdimm to track how many
> errors there are and where they are located, and since they can be
> repaired via driver actions I think we should track them separately.

OK.

> > Related to this, I'm interested in whether/how unpoison_page() works
> > on a hwpoisoned dev_pagemap page.
> 
> unpoison_page() is only triggered via freeing pages to the page
> allocator, and that never happens for dev_pagemap / ZONE_DEVICE pages.

Sorry, my comment was misleading.
I meant unpoison_memory() in mm/memory-failure.c, which is triggered
via debugfs:hwpoison/unpoison-pfn. This interface looks like the following:

  int unpoison_memory(unsigned long pfn)
  {
          struct page *page;
          struct page *p;
          int freeit = 0;
          static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
                                          DEFAULT_RATELIMIT_BURST);

          if (!pfn_valid(pfn))
                  return -ENXIO;

          p = pfn_to_page(pfn);
          page = compound_head(p);

          if (!PageHWPoison(p)) {
                  unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n",
                                   pfn, &unpoison_rs);
                  return 0;
          }
  ...

so I think we can add an is_zone_device_page() check at the beginning
of this function to call hwpoison_clear() introduced in patch 13/13?
Otherwise compound_head() might cause a critical issue like a
general protection fault.

Thanks,
Naoya Horiguchi
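
A minimal sketch of the check being suggested above (illustrative; the
exact placement and the hwpoison_clear() signature are assumptions, not
taken verbatim from the posted series):

          p = pfn_to_page(pfn);

          /* Assumed: divert ZONE_DEVICE pages before compound_head() runs */
          if (is_zone_device_page(p)) {
                  /* hwpoison_clear() signature and return value assumed */
                  hwpoison_clear(p);
                  return 0;
          }

          page = compound_head(p);
          ...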


* Re: [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry()
  2018-07-04 21:41 ` [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry() Dan Williams
                     ` (2 preceding siblings ...)
  2018-07-05  3:33   ` [PATCH v6] " Dan Williams
@ 2018-09-24 15:57   ` Barret Rhoden
  2018-09-27 11:13     ` Jan Kara
  3 siblings, 1 reply; 24+ messages in thread
From: Barret Rhoden @ 2018-09-24 15:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, hch, linux-fsdevel, linux-mm, linux-kernel, jack,
	ross.zwisler

Hi Dan -

On 2018-07-04 at 14:41 Dan Williams <dan.j.williams@intel.com> wrote:
[snip]
> diff --git a/fs/dax.c b/fs/dax.c
> index 4de11ed463ce..57ec272038da 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
[snip]
> +bool dax_lock_mapping_entry(struct page *page)
> +{
> +	pgoff_t index;
> +	struct inode *inode;
> +	bool did_lock = false;
> +	void *entry = NULL, **slot;
> +	struct address_space *mapping;
> +
> +	rcu_read_lock();
> +	for (;;) {
> +		mapping = READ_ONCE(page->mapping);
> +
> +		if (!dax_mapping(mapping))
> +			break;
> +
> +		/*
> +		 * In the device-dax case there's no need to lock, a
> +		 * struct dev_pagemap pin is sufficient to keep the
> +		 * inode alive, and we assume we have dev_pagemap pin
> +		 * otherwise we would not have a valid pfn_to_page()
> +		 * translation.
> +		 */
> +		inode = mapping->host;
> +		if (S_ISCHR(inode->i_mode)) {
> +			did_lock = true;
> +			break;
> +		}
> +
> +		xa_lock_irq(&mapping->i_pages);
> +		if (mapping != page->mapping) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			continue;
> +		}
> +		index = page->index;
> +
> +		entry = __get_unlocked_mapping_entry(mapping, index, &slot,
> +				entry_wait_revalidate);
> +		if (!entry) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			break;
> +		} else if (IS_ERR(entry)) {
> +			WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
> +			continue;

In the IS_ERR case, do you need to xa_unlock the mapping?  It looks
like you'll deadlock the next time around the loop.

> +		}
> +		lock_slot(mapping, slot);
> +		did_lock = true;
> +		xa_unlock_irq(&mapping->i_pages);
> +		break;
> +	}
> +	rcu_read_unlock();
> +
> +	return did_lock;
> +}
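
For reference, a minimal sketch of the unlock being asked about here
(illustrative only; Jan confirms the issue in the follow-up below and
says he will send the actual fix separately):

		} else if (IS_ERR(entry)) {
			WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
			/*
			 * Drop the lock before retrying; otherwise the
			 * xa_lock_irq() above deadlocks on the next pass.
			 */
			xa_unlock_irq(&mapping->i_pages);
			continue;
		}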


* Re: [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry()
  2018-09-24 15:57   ` [PATCH v5 07/11] " Barret Rhoden
@ 2018-09-27 11:13     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2018-09-27 11:13 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Dan Williams, linux-nvdimm, hch, linux-fsdevel, linux-mm,
	linux-kernel, jack, ross.zwisler

On Mon 24-09-18 11:57:21, Barret Rhoden wrote:
> Hi Dan -
> 
> On 2018-07-04 at 14:41 Dan Williams <dan.j.williams@intel.com> wrote:
> [snip]
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 4de11ed463ce..57ec272038da 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> [snip]
> > +bool dax_lock_mapping_entry(struct page *page)
> > +{
> > +	pgoff_t index;
> > +	struct inode *inode;
> > +	bool did_lock = false;
> > +	void *entry = NULL, **slot;
> > +	struct address_space *mapping;
> > +
> > +	rcu_read_lock();
> > +	for (;;) {
> > +		mapping = READ_ONCE(page->mapping);
> > +
> > +		if (!dax_mapping(mapping))
> > +			break;
> > +
> > +		/*
> > +		 * In the device-dax case there's no need to lock, a
> > +		 * struct dev_pagemap pin is sufficient to keep the
> > +		 * inode alive, and we assume we have dev_pagemap pin
> > +		 * otherwise we would not have a valid pfn_to_page()
> > +		 * translation.
> > +		 */
> > +		inode = mapping->host;
> > +		if (S_ISCHR(inode->i_mode)) {
> > +			did_lock = true;
> > +			break;
> > +		}
> > +
> > +		xa_lock_irq(&mapping->i_pages);
> > +		if (mapping != page->mapping) {
> > +			xa_unlock_irq(&mapping->i_pages);
> > +			continue;
> > +		}
> > +		index = page->index;
> > +
> > +		entry = __get_unlocked_mapping_entry(mapping, index, &slot,
> > +				entry_wait_revalidate);
> > +		if (!entry) {
> > +			xa_unlock_irq(&mapping->i_pages);
> > +			break;
> > +		} else if (IS_ERR(entry)) {
> > +			WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
> > +			continue;
> 
> In the IS_ERR case, do you need to xa_unlock the mapping?  It looks
> like you'll deadlock the next time around the loop.

Yep, that certainly looks wrong. I'll send a fix.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2018-09-27 11:13 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-04 21:40 [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
2018-07-04 21:40 ` [PATCH v5 01/11] device-dax: Convert to vmf_insert_mixed and vm_fault_t Dan Williams
2018-07-04 21:40 ` [PATCH v5 02/11] device-dax: Enable page_mapping() Dan Williams
2018-07-04 21:40 ` [PATCH v5 03/11] device-dax: Set page->index Dan Williams
2018-07-04 21:40 ` [PATCH v5 04/11] filesystem-dax: " Dan Williams
2018-07-04 21:40 ` [PATCH v5 05/11] mm, madvise_inject_error: Let memory_failure() optionally take a page reference Dan Williams
2018-07-13  6:31   ` Naoya Horiguchi
2018-07-14  0:34     ` Dan Williams
2018-07-04 21:40 ` [PATCH v5 06/11] mm, memory_failure: Collect mapping size in collect_procs() Dan Williams
2018-07-13  6:49   ` Naoya Horiguchi
2018-07-04 21:41 ` [PATCH v5 07/11] filesystem-dax: Introduce dax_lock_mapping_entry() Dan Williams
2018-07-05  1:07   ` kbuild test robot
2018-07-05  3:31   ` kbuild test robot
2018-07-05  3:33   ` [PATCH v6] " Dan Williams
2018-09-24 15:57   ` [PATCH v5 07/11] " Barret Rhoden
2018-09-27 11:13     ` Jan Kara
2018-07-04 21:41 ` [PATCH v5 08/11] mm, memory_failure: Teach memory_failure() about dev_pagemap pages Dan Williams
2018-07-13  8:52   ` Naoya Horiguchi
2018-07-14  0:28     ` Dan Williams
2018-07-17  6:36       ` Naoya Horiguchi
2018-07-04 21:41 ` [PATCH v5 09/11] x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses Dan Williams
2018-07-04 21:41 ` [PATCH v5 10/11] x86/memory_failure: Introduce {set, clear}_mce_nospec() Dan Williams
2018-07-04 21:41 ` [PATCH v5 11/11] libnvdimm, pmem: Restore page attributes when clearing errors Dan Williams
2018-07-13  4:44 ` [PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages Dan Williams
