[PATCH v5 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

* [PATCH v5 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE
@ 2021-05-21  3:01 Naoya Horiguchi
  2021-05-21  3:01 ` [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Naoya Horiguchi @ 2021-05-21  3:01 UTC (permalink / raw)
  To: linux-mm, Tony Luck, Aili Yao
  Cc: Andrew Morton, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, Jue Wang,
	linux-kernel

I updated the series again with the following changes based on the discussion
over v4:

  - split v4's 2/2 into two patches, as in the earlier versions,
  - switched to a "first found" approach for getting the error virtual
    address; this could report a wrong error address to applications, but
    that is rare and not critical,
  - rebased onto v5.13-rc2.

v1: https://lore.kernel.org/linux-mm/20210412224320.1747638-1-nao.horiguchi@gmail.com/T
v2 (only 3/3 is posted): https://lore.kernel.org/linux-mm/20210419023658.GA1962954@u2004/
v3: https://lore.kernel.org/linux-mm/20210421005728.1994268-1-nao.horiguchi@gmail.com/T
v4: https://lore.kernel.org/linux-mm/20210427062953.2080293-1-nao.horiguchi@gmail.com/T

Thanks,
Naoya Horiguchi

--- quote from cover letter of v1 ---

I wrote this patchset to materialize what I think is the currently
acceptable solution mentioned in the previous discussion [1].
I simply borrowed Tony's mutex patch and Aili's return code patch,
then queued another one to find the error virtual address on a
best-effort basis.  I know that this is not a perfect solution, but it
should work for some typical cases.

[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
---
Summary:

Aili Yao (1):
      mm,hwpoison: Return -EHWPOISON to denote that the page has already been poisoned

Naoya Horiguchi (1):
      mm,hwpoison: Send SIGBUS with error virutal address

Tony Luck (1):
      mm/memory-failure: Use a mutex to avoid memory_failure() races

 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h        |   5 ++
 mm/memory-failure.c            | 188 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 190 insertions(+), 16 deletions(-)

* [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-05-21  3:01 [PATCH v5 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Naoya Horiguchi
@ 2021-05-21  3:01 ` Naoya Horiguchi
  2021-05-22 22:09   ` Andrew Morton
  2021-05-26  9:47   ` Oscar Salvador
  2021-05-21  3:01 ` [PATCH v5 2/3] mm,hwpoison: Return -EHWPOISON to denote that the page has already been poisoned Naoya Horiguchi
  2021-05-21  3:01 ` [PATCH v5 3/3] mm,hwpoison: Send SIGBUS with error virutal address Naoya Horiguchi
  2 siblings, 2 replies; 9+ messages in thread
From: Naoya Horiguchi @ 2021-05-21  3:01 UTC (permalink / raw)
  To: linux-mm, Tony Luck, Aili Yao
  Cc: Andrew Morton, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, Jue Wang,
	linux-kernel

From: Tony Luck <tony.luck@intel.com>

There can be races when multiple CPUs consume poison from the same
page. The first into memory_failure() atomically sets the HWPoison
page flag and begins hunting for tasks that map this page. Eventually
it invalidates those mappings and may send a SIGBUS to the affected
tasks.

But while all that work is going on, other CPUs see a "success"
return code from memory_failure() and so they believe the error
has been handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in
a mutex.
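
The control-flow change can be sketched in user space as follows.  This is
only an illustrative model, not kernel code: fake_mutex, page_model and
handle_failure_model() are hypothetical names, and the "mutex" merely tracks
whether it is held.  The point it shows is that once the body is wrapped in a
lock, every early exit has to leave through the unlock label instead of
returning directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel mutex: only tracks whether it is held. */
struct fake_mutex {
	int depth;
};

static void fake_lock(struct fake_mutex *m)
{
	assert(m->depth == 0);
	m->depth++;
}

static void fake_unlock(struct fake_mutex *m)
{
	assert(m->depth == 1);
	m->depth--;
}

static struct fake_mutex mf_lock_model;

/* Hypothetical page state, reduced to the two facts this sketch needs. */
struct page_model {
	bool poisoned;	/* models the HWPoison page flag */
	bool busy;	/* models a page that cannot be handled */
};

/*
 * Models the control flow after the change: each early exit jumps to the
 * unlock label, so the lock is released on every path.
 */
static int handle_failure_model(struct page_model *p)
{
	int res = 0;

	fake_lock(&mf_lock_model);

	if (p->poisoned)	/* an earlier caller already handled it */
		goto unlock_mutex;
	p->poisoned = true;

	if (p->busy) {
		res = -EBUSY;
		goto unlock_mutex;
	}

	/* ... the real work (unmapping, signalling) would happen here ... */

unlock_mutex:
	fake_unlock(&mf_lock_model);
	return res;
}
```

A second caller for the same page cannot enter the body until the first one
has released the lock, so it only observes the "already poisoned" state after
the first caller's work is complete.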

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
 mm/memory-failure.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 9a7c12ace9e2..0f0b932ccbca 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -1400,6 +1400,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1423,7 +1425,7 @@ int memory_failure(unsigned long pfn, int flags)
 	struct page *hpage;
 	struct page *orig_head;
 	struct dev_pagemap *pgmap;
-	int res;
+	int res = 0;
 	unsigned long page_flags;
 	bool retry = true;
 
@@ -1443,13 +1445,18 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 try_again:
-	if (PageHuge(p))
-		return memory_failure_hugetlb(pfn, flags);
+	if (PageHuge(p)) {
+		res = memory_failure_hugetlb(pfn, flags);
+		goto unlock_mutex;
+	}
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	orig_head = hpage = compound_head(p);
@@ -1482,17 +1489,19 @@ int memory_failure(unsigned long pfn, int flags)
 				res = MF_FAILED;
 			}
 			action_result(pfn, MF_MSG_BUDDY, res);
-			return res == MF_RECOVERED ? 0 : -EBUSY;
+			res = res == MF_RECOVERED ? 0 : -EBUSY;
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
-			return -EBUSY;
+			res = -EBUSY;
 		}
+		goto unlock_mutex;
 	}
 
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
-			return -EBUSY;
+			res = -EBUSY;
+			goto unlock_mutex;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
 	}
@@ -1516,7 +1525,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageCompound(p) && compound_head(p) != orig_head) {
 		action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 	/*
@@ -1536,14 +1545,14 @@ int memory_failure(unsigned long pfn, int flags)
 		num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
-		return 0;
+		goto unlock_mutex;
 	}
 	if (hwpoison_filter(p)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	if (!PageTransTail(p) && !PageLRU(p))
@@ -1562,7 +1571,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!hwpoison_user_mappings(p, pfn, flags, &p)) {
 		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 	/*
@@ -1571,13 +1580,15 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
 		action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 identify_page_state:
 	res = identify_page_state(pfn, p, page_flags);
-out:
+unlock_page:
 	unlock_page(p);
+unlock_mutex:
+	mutex_unlock(&mf_mutex);
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
-- 
2.25.1


* [PATCH v5 2/3] mm,hwpoison: Return -EHWPOISON to denote that the page has already been poisoned
  2021-05-21  3:01 [PATCH v5 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Naoya Horiguchi
  2021-05-21  3:01 ` [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
@ 2021-05-21  3:01 ` Naoya Horiguchi
  2021-05-26 10:18   ` Oscar Salvador
  2021-05-21  3:01 ` [PATCH v5 3/3] mm,hwpoison: Send SIGBUS with error virutal address Naoya Horiguchi
  2 siblings, 1 reply; 9+ messages in thread
From: Naoya Horiguchi @ 2021-05-21  3:01 UTC (permalink / raw)
  To: linux-mm, Tony Luck, Aili Yao
  Cc: Andrew Morton, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, Jue Wang,
	linux-kernel

From: Aili Yao <yaoaili@kingsoft.com>

When memory_failure() is called with MF_ACTION_REQUIRED on a
page that has already been hwpoisoned, memory_failure() could fail
to send SIGBUS to the affected process, which results in an infinite
loop of MCEs.

Currently memory_failure() returns 0 if it's called for an already
hwpoisoned page, so the caller, kill_me_maybe(), could return
without sending SIGBUS to the current process.  An action required MCE
is raised when the current process accesses the broken memory,
so no SIGBUS means that the current process continues to run and
soon accesses the error page again, running into an MCE loop.

This issue can arise, for example, in the following scenarios:

  - Two or more threads access the poisoned page concurrently.
    If local MCE is enabled, the MCE handler handles each MCE event
    independently.  So there's a race among MCE events, and the
    second or later threads fall into the situation in question.

  - If there was a preceding memory error event and memory_failure()
    for the event failed to unmap the error page for some reason,
    the subsequent memory access to the error page triggers the
    MCE loop situation.

To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned.  This allows the memory error
handler to control how it sends signals to userspace.  And make sure
that any process touching a hwpoisoned page gets a SIGBUS even
in the "already hwpoisoned" path of memory_failure(), as is done in the
page fault path.
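
Why the return value matters to the caller can be sketched as below.  This is
a simplified user-space model, not the actual kill_me_maybe() code:
caller_sends_sigbus() is a hypothetical helper that assumes the caller returns
early, without a signal, only when memory_failure() reported success and the
fault was not in a kernel copy-in.  The fallback value for EHWPOISON is the
asm-generic one and is an assumption for libcs that do not define it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* asm-generic Linux value; fallback assumption */
#endif

/*
 * Models the caller's decision: a zero return is treated as "recovered",
 * anything else falls through to the SIGBUS path.
 */
static bool caller_sends_sigbus(int ret, bool in_kernel_copyin)
{
	if (!ret && !in_kernel_copyin)
		return false;	/* treated as recovered, no signal */
	return true;		/* falls through to sending SIGBUS */
}
```

With the old return value of 0 for an already poisoned page the model reports
no signal, which is the MCE loop case; with -EHWPOISON the caller falls
through and the process gets its SIGBUS.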

Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
ChangeLog v5:
- update patch description.
---
 mm/memory-failure.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 0f0b932ccbca..8add7cafad5e 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -1247,7 +1247,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		return -EHWPOISON;
 	}

 	num_poisoned_pages_inc();
@@ -1456,6 +1456,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		res = -EHWPOISON;
 		goto unlock_mutex;
 	}

-- 
2.25.1

* [PATCH v5 3/3] mm,hwpoison: Send SIGBUS with error virutal address
  2021-05-21  3:01 [PATCH v5 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Naoya Horiguchi
  2021-05-21  3:01 ` [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
  2021-05-21  3:01 ` [PATCH v5 2/3] mm,hwpoison: Return -EHWPOISON to denote that the page has already been poisoned Naoya Horiguchi
@ 2021-05-21  3:01 ` Naoya Horiguchi
  2021-06-03  5:10   ` HORIGUCHI NAOYA(堀口　直也)
  2 siblings, 1 reply; 9+ messages in thread
From: Naoya Horiguchi @ 2021-05-21  3:01 UTC (permalink / raw)
  To: linux-mm, Tony Luck, Aili Yao
  Cc: Andrew Morton, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, Jue Wang,
	linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Now an action required MCE on an already hwpoisoned address reliably sends a
SIGBUS to the current process, but the SIGBUS doesn't convey the error
virtual address.  That's not optimal for hwpoison-aware applications.

To fix the issue, make memory_failure() call kill_accessing_process(),
which does a page table walk to find the error virtual address.  It could
find multiple virtual addresses for the same error page, and it seems
hard to tell which virtual address is the correct one.  But that's rare
and sending an incorrect virtual address could be better than no address.
So let's report the first found virtual address for now.
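
For a huge (PMD-level) mapping, the error virtual address is derived from the
offset of the poisoned pfn within the huge page.  The arithmetic can be
sketched in user space as below.  This is an illustrative model only:
huge_mapping_error_vaddr() and the MODEL_* constants are hypothetical, and the
constants assume 4 KiB base pages with 512 base pages per huge page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT	12u	/* assumption: 4 KiB base pages */
#define MODEL_HPAGE_PMD_NR	512u	/* assumption: 512 base pages per huge page */

/*
 * If poisoned_pfn lies inside the huge page that starts at head_pfn and is
 * mapped at map_addr, store the matching virtual address and return true.
 */
static bool huge_mapping_error_vaddr(uint64_t map_addr, uint64_t head_pfn,
				     uint64_t poisoned_pfn, uint64_t *vaddr)
{
	assert(vaddr != NULL);

	if (poisoned_pfn < head_pfn ||
	    poisoned_pfn >= head_pfn + MODEL_HPAGE_PMD_NR)
		return false;

	/* Page offset inside the huge page, converted to bytes. */
	*vaddr = map_addr + ((poisoned_pfn - head_pfn) << MODEL_PAGE_SHIFT);
	return true;
}
```

For example, with a mapping at 0x7f0000200000, a head pfn of 0x1000 and a
poisoned pfn of 0x1003, the reported address is three base pages into the
mapping.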

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
change log v4 -> v5:
- switched to first found approach,
- introduced check_hwpoisoned_pmd_entry() to fix build failure on arch
  without thp support.

change log v3 -> v4:
- refactored hwpoison_pte_range to save indentation,
- updated patch description

change log v1 -> v2:
- initialize local variables in check_hwpoisoned_entry() and
  hwpoison_pte_range()
- fix and improve logic to calculate error address offset.
---
 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h        |   5 ++
 mm/memory-failure.c            | 150 ++++++++++++++++++++++++++++++++-
 3 files changed, 165 insertions(+), 3 deletions(-)

diff --git v5.13-rc2/arch/x86/kernel/cpu/mce/core.c v5.13-rc2_patched/arch/x86/kernel/cpu/mce/core.c
index bf7fe87a7e88..22791aadc085 100644
--- v5.13-rc2/arch/x86/kernel/cpu/mce/core.c
+++ v5.13-rc2_patched/arch/x86/kernel/cpu/mce/core.c
@@ -1257,19 +1257,28 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int ret;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
-	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
+	if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
 
+	/*
+	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+	 * to the current process with the proper error info, so no need to
+	 * send SIGBUS here again.
+	 */
+	if (ret == -EHWPOISON)
+		return;
+
 	if (p->mce_vaddr != (void __user *)-1l) {
 		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
 	} else {
diff --git v5.13-rc2/include/linux/swapops.h v5.13-rc2_patched/include/linux/swapops.h
index d9b7c9132c2f..98ea67fcf360 100644
--- v5.13-rc2/include/linux/swapops.h
+++ v5.13-rc2_patched/include/linux/swapops.h
@@ -323,6 +323,11 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 8add7cafad5e..137cd0f61af3 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/pagewalk.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -554,6 +555,148 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 		collect_procs_file(page, tokill, force_early);
 }
 
+struct hwp_walk {
+	struct to_kill tk;
+	unsigned long pfn;
+	int flags;
+};
+
+static void set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
+{
+	tk->addr = addr;
+	tk->size_shift = shift;
+}
+
+static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
+				unsigned long poisoned_pfn, struct to_kill *tk)
+{
+	unsigned long pfn = 0;
+
+	if (pte_present(pte)) {
+		pfn = pte_pfn(pte);
+	} else {
+		swp_entry_t swp = pte_to_swp_entry(pte);
+
+		if (is_hwpoison_entry(swp))
+			pfn = hwpoison_entry_to_pfn(swp);
+	}
+
+	if (!pfn || pfn != poisoned_pfn)
+		return 0;
+
+	set_to_kill(tk, addr, shift);
+	return 1;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int check_hwpoisoned_pmd_entry(pmd_t *pmdp, unsigned long addr,
+				      struct hwp_walk *hwp)
+{
+	pmd_t pmd = *pmdp;
+	unsigned long pfn;
+	unsigned long hwpoison_vaddr;
+
+	if (!pmd_present(pmd))
+		return 0;
+	pfn = pmd_pfn(pmd);
+	if (pfn <= hwp->pfn && hwp->pfn < pfn + HPAGE_PMD_NR) {
+		hwpoison_vaddr = addr + ((hwp->pfn - pfn) << PAGE_SHIFT);
+		set_to_kill(&hwp->tk, hwpoison_vaddr, PAGE_SHIFT);
+		return 1;
+	}
+	return 0;
+}
+#else
+static int check_hwpoisoned_pmd_entry(pmd_t *pmdp, unsigned long addr,
+				      struct hwp_walk *hwp)
+{
+	return 0;
+}
+#endif
+
+static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
+{
+	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+	int ret = 0;
+	pte_t *ptep;
+	spinlock_t *ptl;
+
+	ptl = pmd_trans_huge_lock(pmdp, walk->vma);
+	if (ptl) {
+		ret = check_hwpoisoned_pmd_entry(pmdp, addr, hwp);
+		spin_unlock(ptl);
+		goto out;
+	}
+
+	if (pmd_trans_unstable(pmdp))
+		goto out;
+
+	ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != end; ptep++, addr += PAGE_SIZE) {
+		ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
+					     hwp->pfn, &hwp->tk);
+		if (ret == 1)
+			break;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+out:
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
+			    unsigned long addr, unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+	pte_t pte = huge_ptep_get(ptep);
+	struct hstate *h = hstate_vma(walk->vma);
+
+	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
+				      hwp->pfn, &hwp->tk);
+}
+#else
+#define hwpoison_hugetlb_range	NULL
+#endif
+
+static struct mm_walk_ops hwp_walk_ops = {
+	.pmd_entry = hwpoison_pte_range,
+	.hugetlb_entry = hwpoison_hugetlb_range,
+};
+
+/*
+ * Sends SIGBUS to the current process with error info.
+ *
+ * This function is intended to handle "Action Required" MCEs on already
+ * hardware poisoned pages. They could happen, for example, when
+ * memory_failure() failed to unmap the error page at the first call, or
+ * when multiple local machine checks happened on different CPUs.
+ *
+ * MCE handler currently has no easy access to the error virtual address,
+ * so this function walks page table to find it. The returned virtual address
+ * is proper in most cases, but it could be wrong when the application
+ * process has multiple entries mapping the error page.
+ */
+static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
+				  int flags)
+{
+	int ret;
+	struct hwp_walk priv = {
+		.pfn = pfn,
+	};
+	priv.tk.tsk = p;
+
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops,
+			      (void *)&priv);
+	if (!ret && priv.tk.addr)
+		kill_proc(&priv.tk, pfn, flags);
+	mmap_read_unlock(p->mm);
+	return ret ? -EFAULT : -EHWPOISON;
+}
+
 static const char *action_name[] = {
 	[MF_IGNORED] = "Ignored",
 	[MF_FAILED] = "Failed",
@@ -1247,7 +1390,10 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return -EHWPOISON;
+		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, page_to_pfn(head), flags);
+		return res;
 	}
 
 	num_poisoned_pages_inc();
@@ -1457,6 +1603,8 @@ int memory_failure(unsigned long pfn, int flags)
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
 		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, pfn, flags);
 		goto unlock_mutex;
 	}
 
-- 
2.25.1


* Re: [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-05-21  3:01 ` [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
@ 2021-05-22 22:09   ` Andrew Morton
  2021-05-24  8:42     ` HORIGUCHI NAOYA(堀口　直也)
  2021-05-26  9:47   ` Oscar Salvador
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2021-05-22 22:09 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Tony Luck, Aili Yao, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, Jue Wang,
	linux-kernel

On Fri, 21 May 2021 12:01:54 +0900 Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:

> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.

We can reduce the scope of that mutex, which helps readability at least.

--- a/mm/memory-failure.c~mm-memory-failure-use-a-mutex-to-avoid-memory_failure-races-fix
+++ a/mm/memory-failure.c
@@ -1397,8 +1397,6 @@ out:
 	return rc;
 }
 
-static DEFINE_MUTEX(mf_mutex);
-
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1425,6 +1423,7 @@ int memory_failure(unsigned long pfn, in
 	int res = 0;
 	unsigned long page_flags;
 	bool retry = true;
+	static DEFINE_MUTEX(mf_mutex);
 
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
_


* Re: [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-05-22 22:09   ` Andrew Morton
@ 2021-05-24  8:42     ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 0 replies; 9+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2021-05-24  8:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Naoya Horiguchi, linux-mm, Tony Luck, Aili Yao, Oscar Salvador,
	David Hildenbrand, Borislav Petkov, Andy Lutomirski, Jue Wang,
	linux-kernel

On Sat, May 22, 2021 at 03:09:00PM -0700, Andrew Morton wrote:
> On Fri, 21 May 2021 12:01:54 +0900 Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:
> 
> > There can be races when multiple CPUs consume poison from the same
> > page. The first into memory_failure() atomically sets the HWPoison
> > page flag and begins hunting for tasks that map this page. Eventually
> > it invalidates those mappings and may send a SIGBUS to the affected
> > tasks.
> > 
> > But while all that work is going on, other CPUs see a "success"
> > return code from memory_failure() and so they believe the error
> > has been handled and continue executing.
> > 
> > Fix by wrapping most of the internal parts of memory_failure() in
> > a mutex.
> 
> We can reduce the scope of that mutex, which helps readability at least.

Thanks, this change is totally fine with me.

> 
> --- a/mm/memory-failure.c~mm-memory-failure-use-a-mutex-to-avoid-memory_failure-races-fix
> +++ a/mm/memory-failure.c
> @@ -1397,8 +1397,6 @@ out:
>  	return rc;
>  }
>  
> -static DEFINE_MUTEX(mf_mutex);
> -
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1425,6 +1423,7 @@ int memory_failure(unsigned long pfn, in
>  	int res = 0;
>  	unsigned long page_flags;
>  	bool retry = true;
> +	static DEFINE_MUTEX(mf_mutex);
>  
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
> _
> 

* Re: [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-05-21  3:01 ` [PATCH v5 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
  2021-05-22 22:09   ` Andrew Morton
@ 2021-05-26  9:47   ` Oscar Salvador
  1 sibling, 0 replies; 9+ messages in thread
From: Oscar Salvador @ 2021-05-26  9:47 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Tony Luck, Aili Yao, Andrew Morton, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, Jue Wang,
	linux-kernel

On Fri, May 21, 2021 at 12:01:54PM +0900, Naoya Horiguchi wrote:
> From: Tony Luck <tony.luck@intel.com>
> 
> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Reviewed-by: Borislav Petkov <bp@suse.de>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE L3

* Re: [PATCH v5 2/3] mm,hwpoison: Return -EHWPOISON to denote that the page has already been poisoned
  2021-05-21  3:01 ` [PATCH v5 2/3] mm,hwpoison: Return -EHWPOISON to denote that the page has already been poisoned Naoya Horiguchi
@ 2021-05-26 10:18   ` Oscar Salvador
  0 siblings, 0 replies; 9+ messages in thread
From: Oscar Salvador @ 2021-05-26 10:18 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Tony Luck, Aili Yao, Andrew Morton, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, Jue Wang,
	linux-kernel

On Fri, May 21, 2021 at 12:01:55PM +0900, Naoya Horiguchi wrote:
> From: Aili Yao <yaoaili@kingsoft.com>
> 
> When memory_failure() is called with MF_ACTION_REQUIRED on the
> page that has already been hwpoisoned, memory_failure() could fail
> to send SIGBUS to the affected process, which results in infinite
> loop of MCEs.
> 
> Currently memory_failure() returns 0 if it's called for already
> hwpoisoned page, then the caller, kill_me_maybe(), could return
> without sending SIGBUS to current process.  An action required MCE
> is raised when the current process accesses to the broken memory,
> so no SIGBUS means that the current process continues to run and
> access to the error page again soon, so running into MCE loop.
> 
> This issue can arise for example in the following scenarios:
> 
>   - Two or more threads access to the poisoned page concurrently.
>     If local MCE is enabled, MCE handler independently handles the
>     MCE events.  So there's a race among MCE events, and the
>     second or latter threads fall into the situation in question.
> 
>   - If there was a precedent memory error event and memory_failure()
>     for the event failed to unmap the error page for some reason,
>     the subsequent memory access to the error page triggers the
>     MCE loop situation.
> 
> To fix the issue, make memory_failure() return an error code when the
> error page has already been hwpoisoned.  This allows memory error
> handler to control how it sends signals to userspace.  And make sure
> that any process touching a hwpoisoned page should get a SIGBUS even
> in "already hwpoisoned" path of memory_failure() as is done in page
> fault path.
> 
> Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3

* Re: [PATCH v5 3/3] mm,hwpoison: Send SIGBUS with error virutal address
  2021-05-21  3:01 ` [PATCH v5 3/3] mm,hwpoison: Send SIGBUS with error virutal address Naoya Horiguchi
@ 2021-06-03  5:10   ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 0 replies; 9+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2021-06-03  5:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Tony Luck, Aili Yao, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Jue Wang, Naoya Horiguchi,
	linux-kernel

On Fri, May 21, 2021 at 12:01:56PM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Now an action required MCE in already hwpoisoned address surely sends a
> SIGBUS to current process, but the SIGBUS doesn't convey error virtual
> address.  That's not optimal for hwpoison-aware applications.
> 
> To fix the issue, make memory_failure() call kill_accessing_process(),
> that does pagetable walk to find the error virtual address.  It could
> find multiple virtual addresses for the same error page, and it seems
> hard to tell which virtual address is correct one.  But that's rare
> and sending incorrect virtual address could be better than no address.
> So let's report the first found virtual address for now.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
> change log v4 -> v5:
> - switched to first found approach,
> - introduced check_hwpoisoned_pmd_entry() to fix build failure on arch
>   without thp support.
> 
> change log v3 -> v4:
> - refactored hwpoison_pte_range to save indentation,
> - updated patch description
> 
> change log v1 -> v2:
> - initialize local variables in check_hwpoisoned_entry() and
>   hwpoison_pte_range()
> - fix and improve logic to calculate error address offset.
> ---
...
> +static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> +				  int flags)
> +{
> +	int ret;
> +	struct hwp_walk priv = {
> +		.pfn = pfn,
> +	};
> +	priv.tk.tsk = p;
> +
> +	mmap_read_lock(p->mm);
> +	ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops,
> +			      (void *)&priv);
> +	if (!ret && priv.tk.addr)

Sorry, I found a silly mistake: since v5, walk_page_range() returns 1 when it
has found at least one error virtual address, so this if-condition should be
like this.

@@ -691,7 +691,8 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
        mmap_read_lock(p->mm);
        ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops,
                              (void *)&priv);
-       if (!ret && priv.tk.addr)
+       if (ret == 1 && priv.tk.addr)
                kill_proc(&priv.tk, pfn, flags);
        mmap_read_unlock(p->mm);
        return ret ? -EFAULT : -EHWPOISON;
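
The effect of the two conditions can be illustrated with a small user-space
model.  It is only a sketch under stated assumptions: walk_entries(),
match_pfn() and struct search are hypothetical stand-ins, and the walker is
assumed to stop at the first non-zero callback result and return that value,
which is what gives the "first found" behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: non-zero result stops the walk. */
typedef int (*entry_fn)(unsigned long pfn, size_t index, void *priv);

/* Hypothetical walker: returns the first non-zero callback result, else 0. */
static int walk_entries(const unsigned long *pfns, size_t n,
			entry_fn fn, void *priv)
{
	for (size_t i = 0; i < n; i++) {
		int ret = fn(pfns[i], i, priv);

		if (ret)
			return ret;
	}
	return 0;
}

struct search {
	unsigned long target_pfn;
	size_t found_index;
	bool found;
};

/* Records the first matching entry and asks the walker to stop. */
static int match_pfn(unsigned long pfn, size_t index, void *priv)
{
	struct search *s = priv;

	if (pfn != s->target_pfn)
		return 0;
	s->found_index = index;
	s->found = true;
	return 1;
}

/* The condition as posted in v5: never true once a match stops the walk. */
static bool should_signal_v5(int ret, const struct search *s)
{
	return !ret && s->found;
}

/* The corrected condition: true exactly when a match was found. */
static bool should_signal_fixed(int ret, const struct search *s)
{
	return ret == 1 && s->found;
}
```

In this model a successful search always returns 1, so the v5 condition can
never trigger the signal, while the corrected one does.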

Andrew, this patch is now in linux-mm, so could you apply this fix onto
mmhwpoison-send-sigbus-with-error-virutal-address.patch?
Or if it's better to resend the whole patch, please let me know.

Thanks,
Naoya Horiguchi
