* [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE
@ 2021-04-12 22:43 Naoya Horiguchi
  2021-04-12 22:43 ` [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2021-04-12 22:43 UTC (permalink / raw)
  To: linux-mm, Tony Luck, Aili Yao
  Cc: Andrew Morton, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, linux-kernel

Hi,

I wrote this patchset to implement what I think is the currently
acceptable solution based on the previous discussion [1].
I simply borrowed Tony's mutex patch and Aili's return code patch,
then queued another one to find the error virtual address in a
best-effort manner.  I know that this is not a perfect solution, but
it should work for some typical cases.

My simple testing showed that this patchset seems to work as intended,
but if you have related testcases, could you please test it and give
me some feedback?

Thanks,
Naoya Horiguchi

[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
---
Summary:

Aili Yao (1):
      mm,hwpoison: return -EHWPOISON when page already

Naoya Horiguchi (1):
      mm,hwpoison: add kill_accessing_process() to find error virtual address

Tony Luck (1):
      mm/memory-failure: Use a mutex to avoid memory_failure() races

 arch/x86/kernel/cpu/mce/core.c |  13 +++-
 include/linux/swapops.h        |   5 ++
 mm/memory-failure.c            | 166 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 178 insertions(+), 6 deletions(-)


* [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-04-12 22:43 [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Naoya Horiguchi
@ 2021-04-12 22:43 ` Naoya Horiguchi
  2021-04-19 17:05   ` Borislav Petkov
  2021-04-12 22:43 ` [PATCH v1 2/3] mm,hwpoison: return -EHWPOISON when page already Naoya Horiguchi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Naoya Horiguchi @ 2021-04-12 22:43 UTC (permalink / raw)
  To: linux-mm, Tony Luck, Aili Yao
  Cc: Andrew Morton, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, linux-kernel

From: Tony Luck <tony.luck@intel.com>

There can be races when multiple CPUs consume poison from the same
page. The first into memory_failure() atomically sets the HWPoison
page flag and begins hunting for tasks that map this page. Eventually
it invalidates those mappings and may send a SIGBUS to the affected
tasks.

But while all that work is going on, other CPUs see a "success"
return code from memory_failure() and so they believe the error
has been handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in
a mutex.
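
In sketch form, the change serializes the whole handling path behind a
single mutex (do_memory_failure() below is a hypothetical helper standing
in for the existing function body, used only for illustration):

    static DEFINE_MUTEX(mf_mutex);

    int memory_failure(unsigned long pfn, int flags)
    {
            int res;

            mutex_lock(&mf_mutex);
            /*
             * All page-state checks and unmapping now happen under the
             * lock, so a second CPU consuming the same poison waits here
             * instead of seeing a premature "success" return code.
             */
            res = do_memory_failure(pfn, flags);    /* hypothetical */
            mutex_unlock(&mf_mutex);
            return res;
    }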

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
index 24210c9bd843..c1509f4b565e 100644
--- v5.12-rc5/mm/memory-failure.c
+++ v5.12-rc5_patched/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 try_again:
-	if (PageHuge(p))
-		return memory_failure_hugetlb(pfn, flags);
+	if (PageHuge(p)) {
+		res = memory_failure_hugetlb(pfn, flags);
+		goto out2;
+	}
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
 				res = MF_FAILED;
 			}
 			action_result(pfn, MF_MSG_BUDDY, res);
+			mutex_unlock(&mf_mutex);
 			return res == MF_RECOVERED ? 0 : -EBUSY;
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 	}
@@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
@@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
 		num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 	if (hwpoison_filter(p)) {
@@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
 	res = identify_page_state(pfn, p, page_flags);
 out:
 	unlock_page(p);
+out2:
+	mutex_unlock(&mf_mutex);
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
-- 
2.25.1



* [PATCH v1 2/3] mm,hwpoison: return -EHWPOISON when page already
  2021-04-12 22:43 [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Naoya Horiguchi
  2021-04-12 22:43 ` [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
@ 2021-04-12 22:43 ` Naoya Horiguchi
  2021-04-12 22:43 ` [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address Naoya Horiguchi
  2021-04-17  5:47 ` [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Aili Yao
  3 siblings, 0 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2021-04-12 22:43 UTC (permalink / raw)
  To: linux-mm, Tony Luck, Aili Yao
  Cc: Andrew Morton, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, linux-kernel

From: Aili Yao <yaoaili@kingsoft.com>

When the page is already poisoned, another memory_failure() call on the
same page now returns 0, meaning OK. For nested MCE handling, this
behavior may lead to an MCE loop. Example:

1. With LMCE enabled, two processes A and B running on different cores
X and Y respectively access the same page. The page gets corrupted when
process A accesses it, an MCE is raised to core X, and its error
handling is underway.

2. Then B accesses the page and triggers another MCE on core Y. B also
starts error handling, sees TestSetPageHWPoison return true, and gets 0
back.

3. kill_me_maybe() will check the return value:

    1244 static void kill_me_maybe(struct callback_head *cb)
    1245 {
    ...
    1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
    1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
    1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
    1257                 sync_core();
    1258                 return;
    1259         }
    ...
    1267 }

4. The error handling for B then ends, and possibly nothing happens if
kill-early is not set. Process B re-executes the instruction, takes the
MCE again, and the loop repeats. Also, the set_mce_nospec() here is not
proper; see commit fd0e786d9d09 ("x86/mm, mm/hwpoison: Don't
unconditionally unmap kernel 1:1 pages").

Other callers that care about the return value of memory_failure()
should check why they want to handle a memory error that has already
been processed. This behavior seems reasonable.
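
For illustration, a caller that needs to distinguish the new return value
could do something like the following (a minimal sketch; patch 3/3 adds a
check of this shape to kill_me_maybe()):

    ret = memory_failure(pfn, MF_ACTION_REQUIRED);
    if (ret == -EHWPOISON) {
            /*
             * The page was poisoned before this call; some other context
             * owns the recovery work, so don't treat the error as freshly
             * and successfully handled.
             */
            return;
    }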

Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
index c1509f4b565e..368ef77e01f9 100644
--- v5.12-rc5/mm/memory-failure.c
+++ v5.12-rc5_patched/mm/memory-failure.c
@@ -1228,7 +1228,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		return -EHWPOISON;
 	}
 
 	num_poisoned_pages_inc();
@@ -1438,7 +1438,7 @@ int memory_failure(unsigned long pfn, int flags)
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
 		mutex_unlock(&mf_mutex);
-		return 0;
+		return -EHWPOISON;
 	}
 
 	orig_head = hpage = compound_head(p);
-- 
2.25.1



* [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-12 22:43 [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Naoya Horiguchi
  2021-04-12 22:43 ` [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
  2021-04-12 22:43 ` [PATCH v1 2/3] mm,hwpoison: return -EHWPOISON when page already Naoya Horiguchi
@ 2021-04-12 22:43 ` Naoya Horiguchi
  2021-04-17  5:47 ` [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Aili Yao
  3 siblings, 0 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2021-04-12 22:43 UTC (permalink / raw)
  To: linux-mm, Tony Luck, Aili Yao
  Cc: Andrew Morton, Oscar Salvador, David Hildenbrand,
	Borislav Petkov, Andy Lutomirski, Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

The previous patch solves the infinite MCE loop issue when multiple
MCE events race.  The remaining issue is to make sure that all threads
processing Action Required MCEs send the SIGBUS to the current process
with the proper virtual address and error size.

This patch does a page table walk to find the error virtual address.
If the walk finds multiple virtual addresses, we can't determine which
one is correct, so we fall back to sending SIGBUS in kill_me_maybe()
without error info as we do now.  This corner case needs to be solved
in the future.
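
For readers unfamiliar with the pagewalk API used below, the general
shape is as follows (a minimal sketch with hypothetical callback and
private-data names; walk_page_range() and struct mm_walk_ops are the
real interfaces from <linux/pagewalk.h>):

    static const struct mm_walk_ops ops = {
            .pmd_entry     = my_pmd_entry,          /* hypothetical */
            .hugetlb_entry = my_hugetlb_entry,      /* hypothetical */
    };
    struct my_data priv = { .pfn = pfn };           /* hypothetical */
    int ret;

    mmap_read_lock(mm);
    /*
     * Calls back for every pmd (and hugetlb entry) in the range;
     * a nonzero return value from a callback stops the walk early.
     */
    ret = walk_page_range(mm, 0, TASK_SIZE_MAX, &ops, &priv);
    mmap_read_unlock(mm);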

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h        |   5 ++
 mm/memory-failure.c            | 147 ++++++++++++++++++++++++++++++++-
 3 files changed, 161 insertions(+), 4 deletions(-)

diff --git v5.12-rc5/arch/x86/kernel/cpu/mce/core.c v5.12-rc5_patched/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..3ce23445a48c 100644
--- v5.12-rc5/arch/x86/kernel/cpu/mce/core.c
+++ v5.12-rc5_patched/arch/x86/kernel/cpu/mce/core.c
@@ -1257,19 +1257,28 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int ret;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
-	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
+	if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
 
+	/*
+	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+	 * to the current process with the proper error info, so no need to
+	 * send it here again.
+	 */
+	if (ret == -EHWPOISON)
+		return;
+
 	if (p->mce_vaddr != (void __user *)-1l) {
 		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
 	} else {
diff --git v5.12-rc5/include/linux/swapops.h v5.12-rc5_patched/include/linux/swapops.h
index d9b7c9132c2f..98ea67fcf360 100644
--- v5.12-rc5/include/linux/swapops.h
+++ v5.12-rc5_patched/include/linux/swapops.h
@@ -323,6 +323,11 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
index 368ef77e01f9..04e002bd573a 100644
--- v5.12-rc5/mm/memory-failure.c
+++ v5.12-rc5_patched/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/pagewalk.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -554,6 +555,142 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 		collect_procs_file(page, tokill, force_early);
 }
 
+struct hwp_walk {
+	struct to_kill tk;
+	unsigned long pfn;
+	int flags;
+};
+
+static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
+{
+	/* Abort pagewalk when finding multiple mappings to the error page. */
+	if (tk->addr)
+		return 1;
+	tk->addr = addr;
+	tk->size_shift = shift;
+	return 0;
+}
+
+static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
+				unsigned long poisoned_pfn, struct to_kill *tk)
+{
+	unsigned long pfn;
+
+	if (pte_present(pte)) {
+		pfn = pte_pfn(pte);
+	} else {
+		swp_entry_t swp = pte_to_swp_entry(pte);
+
+		if (is_hwpoison_entry(swp))
+			pfn = hwpoison_entry_to_pfn(swp);
+	}
+
+	if (!pfn || pfn != poisoned_pfn)
+		return 0;
+
+	return set_to_kill(tk, addr, shift);
+}
+
+static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
+{
+	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+	int ret;
+	pte_t *ptep;
+	spinlock_t *ptl;
+
+	ptl = pmd_trans_huge_lock(pmdp, walk->vma);
+	if (ptl) {
+		pmd_t pmd = *pmdp;
+
+		if (pmd_present(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd);
+
+			if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE) {
+				unsigned long hwpoison_vaddr = addr +
+					(hwp->pfn << PAGE_SHIFT & ~PMD_MASK);
+
+				ret = set_to_kill(&hwp->tk, hwpoison_vaddr,
+						  PAGE_SHIFT);
+			}
+		}
+		spin_unlock(ptl);
+		goto out;
+	}
+
+	if (pmd_trans_unstable(pmdp))
+		goto out;
+
+	ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != end; ptep++, addr += PAGE_SIZE) {
+		ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
+					     hwp->pfn, &hwp->tk);
+		if (ret == 1)
+			break;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+out:
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
+			    unsigned long addr, unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+	pte_t pte = huge_ptep_get(ptep);
+	struct hstate *h = hstate_vma(walk->vma);
+
+	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
+				      hwp->pfn, &hwp->tk);
+}
+#else
+#define hwpoison_hugetlb_range	NULL
+#endif
+
+static struct mm_walk_ops hwp_walk_ops = {
+	.pmd_entry = hwpoison_pte_range,
+	.hugetlb_entry = hwpoison_hugetlb_range,
+};
+
+/*
+ * Sends SIGBUS to the current process with the error info.
+ *
+ * This function is intended to handle "Action Required" MCEs on already
+ * hardware poisoned pages. They could happen, for example, when
+ * memory_failure() failed to unmap the error page at the first call, or
+ * when multiple Action Optional MCE events race on different CPUs with
+ * Local MCE enabled.
+ *
+ * MCE handler currently has no easy access to the error virtual address,
+ * so this function walks page table to find it. One challenge on this is
+ * to reliably get the proper virtual address of the error to report to
+ * applications via SIGBUS. A process could map a page multiple times to
+ * different virtual addresses, then we now have no way to tell which virtual
+ * address was accessed when the Action Required MCE was generated.
+ * So in such a corner case, we now give up and fall back to sending SIGBUS
+ * with no error info.
+ */
+static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
+				  int flags)
+{
+	int ret;
+	struct hwp_walk priv = {
+		.pfn = pfn,
+	};
+	priv.tk.tsk = p;
+
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &hwp_walk_ops,
+			      (void *)&priv);
+	if (!ret && priv.tk.addr)
+		kill_proc(&priv.tk, pfn, flags);
+	mmap_read_unlock(p->mm);
+	return ret ? -EFAULT : -EHWPOISON;
+}
+
 static const char *action_name[] = {
 	[MF_IGNORED] = "Ignored",
 	[MF_FAILED] = "Failed",
@@ -1228,7 +1365,10 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return -EHWPOISON;
+		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, page_to_pfn(head), flags);
+		return res;
 	}
 
 	num_poisoned_pages_inc();
@@ -1437,8 +1577,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, pfn, flags);
 		mutex_unlock(&mf_mutex);
-		return -EHWPOISON;
+		return res;
 	}
 
 	orig_head = hpage = compound_head(p);
-- 
2.25.1



* Re: [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE
  2021-04-12 22:43 [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2021-04-12 22:43 ` [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address Naoya Horiguchi
@ 2021-04-17  5:47 ` Aili Yao
  2021-04-19  1:09   ` HORIGUCHI NAOYA(堀口 直也)
  3 siblings, 1 reply; 22+ messages in thread
From: Aili Yao @ 2021-04-17  5:47 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Tony Luck, Andrew Morton, Oscar Salvador,
	David Hildenbrand, Borislav Petkov, Andy Lutomirski,
	Naoya Horiguchi, linux-kernel, yaoaili

On Tue, 13 Apr 2021 07:43:17 +0900
Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:

> Hi,
> 
> I wrote this patchset to implement what I think is the currently
> acceptable solution based on the previous discussion [1].
> I simply borrowed Tony's mutex patch and Aili's return code patch,
> then queued another one to find the error virtual address in a
> best-effort manner.  I know that this is not a perfect solution, but
> it should work for some typical cases.
> 
> My simple testing showed that this patchset seems to work as intended,
> but if you have related testcases, could you please test it and give
> me some feedback?
> 
> Thanks,
> Naoya Horiguchi
> 
> [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
> ---
> Summary:
> 
> Aili Yao (1):
>       mm,hwpoison: return -EHWPOISON when page already
> 
> Naoya Horiguchi (1):
>       mm,hwpoison: add kill_accessing_process() to find error virtual address
> 
> Tony Luck (1):
>       mm/memory-failure: Use a mutex to avoid memory_failure() races
> 
>  arch/x86/kernel/cpu/mce/core.c |  13 +++-
>  include/linux/swapops.h        |   5 ++
>  mm/memory-failure.c            | 166 ++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 178 insertions(+), 6 deletions(-)

Hi Naoya,

Thanks for your patch and complete fix for this race issue.

I tested your patches; mostly they worked as expected, but in some cases
they failed. I checked and found some questionable places; could you help
confirm them?

1. There is a compile warning:
static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
			      unsigned long end, struct mm_walk *walk)
{
	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
	int ret;    ---- here

It seems this ret may not be initialized, and sometimes an erroneous ret may be returned?

and for this:
static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
				unsigned long poisoned_pfn, struct to_kill *tk)
{
	unsigned long pfn;

I think it had better be initialized too.

2. In the function hwpoison_pte_range():
if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE): this check seems like it should use PMD_SIZE/PAGE_SIZE or some macro like that?

3. unsigned long hwpoison_vaddr = addr + (hwp->pfn << PAGE_SHIFT & ~PMD_MASK); this seems not exactly accurate?

4. static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
{
	if (tk->addr) {    --- I am not sure about this check and whether it will lead to failure.
		return 1;
	}
In my test, it sometimes hits this branch; I don't know whether it's a multiple-entry issue or a multiple-poison issue.
When I hit this failure there was not enough log to tell, and I can't reproduce it after that.

Would you help confirm these, and if there are any changes, please post again and I will do the test again.

Thanks,
Aili Yao


* Re: [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE
  2021-04-17  5:47 ` [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Aili Yao
@ 2021-04-19  1:09   ` HORIGUCHI NAOYA(堀口 直也)
  2021-04-19  2:36     ` [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address Naoya Horiguchi
  0 siblings, 1 reply; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-04-19  1:09 UTC (permalink / raw)
  To: Aili Yao
  Cc: Naoya Horiguchi, linux-mm, Tony Luck, Andrew Morton,
	Oscar Salvador, David Hildenbrand, Borislav Petkov,
	Andy Lutomirski, linux-kernel

On Sat, Apr 17, 2021 at 01:47:51PM +0800, Aili Yao wrote:
> On Tue, 13 Apr 2021 07:43:17 +0900
> Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:
> 
> > Hi,
> > 
> > I wrote this patchset to implement what I think is the currently
> > acceptable solution based on the previous discussion [1].
> > I simply borrowed Tony's mutex patch and Aili's return code patch,
> > then queued another one to find the error virtual address in a
> > best-effort manner.  I know that this is not a perfect solution, but
> > it should work for some typical cases.
> > 
> > My simple testing showed that this patchset seems to work as intended,
> > but if you have related testcases, could you please test it and give
> > me some feedback?
> > 
> > Thanks,
> > Naoya Horiguchi
> > 
> > [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
> > ---
> > Summary:
> > 
> > Aili Yao (1):
> >       mm,hwpoison: return -EHWPOISON when page already
> > 
> > Naoya Horiguchi (1):
> >       mm,hwpoison: add kill_accessing_process() to find error virtual address
> > 
> > Tony Luck (1):
> >       mm/memory-failure: Use a mutex to avoid memory_failure() races
> > 
> >  arch/x86/kernel/cpu/mce/core.c |  13 +++-
> >  include/linux/swapops.h        |   5 ++
> >  mm/memory-failure.c            | 166 ++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 178 insertions(+), 6 deletions(-)
> 
> Hi Naoya,
> 
> Thanks for your patch and complete fix for this race issue.
> 
> I tested your patches; mostly they worked as expected, but in some cases
> they failed. I checked and found some questionable places; could you help
> confirm them?
> 
> 1. There is a compile warning:
> static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
> 			      unsigned long end, struct mm_walk *walk)
> {
> 	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
> 	int ret;    ---- here
> 
> It seems this ret may not be initialized, and sometimes an erroneous ret may be returned?

Yes, I'll initialize with 0.

> 
> and for this:
> static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
> 				unsigned long poisoned_pfn, struct to_kill *tk)
> {
> 	unsigned long pfn;
> 
> I think it had better be initialized too.

OK.

> 
> 2. In the function hwpoison_pte_range():
> if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE): this check seems like it should use PMD_SIZE/PAGE_SIZE or some macro like that?

Thanks, that's right.  HPAGE_PMD_NR seems to fit here.
We also need "#ifdef CONFIG_TRANSPARENT_HUGEPAGE" to use it.

> 
> 3. unsigned long hwpoison_vaddr = addr + (hwp->pfn << PAGE_SHIFT & ~PMD_MASK); this seems not exactly accurate?

By C operator precedence rules, this expression is evaluated in the
following order:
  1. ~ (bitwise NOT),
  2. << (bitwise left shift), and
  3. & (bitwise AND).

So this part is equivalent to ((hwp->pfn << PAGE_SHIFT) & (~PMD_MASK)),
which should properly calculate the address offset within a pmd.
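
For example, with 4kB pages and 2MB pmds (PAGE_SHIFT == 12,
~PMD_MASK == 0x1fffff) and a hypothetical hwp->pfn of 0x12345:

    hwp->pfn << PAGE_SHIFT                 = 0x12345000
    (hwp->pfn << PAGE_SHIFT) & ~PMD_MASK   =   0x145000

so hwpoison_vaddr = addr + 0x145000, where addr is the pmd-aligned
virtual base address passed to hwpoison_pte_range().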

> 
> 4. static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
> {
> 	if (tk->addr) {    --- I am not sure about this check and whether it will lead to failure.
> 		return 1;
> 	}
> In my test, it sometimes hits this branch; I don't know whether it's a multiple-entry issue or a multiple-poison issue.
> When I hit this failure there was not enough log to tell, and I can't reproduce it after that.

As you commented above, this version is buggy, and that might
be what made you see the random failure. Sorry about that.

> 
> Would you help confirm these, and if there are any changes, please post again and I will do the test again.

I'll send the fixed version of patch 3/3 soon.

Thanks,
Naoya Horiguchi


* [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-19  1:09   ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-04-19  2:36     ` Naoya Horiguchi
  2021-04-19  3:43       ` Aili Yao
  0 siblings, 1 reply; 22+ messages in thread
From: Naoya Horiguchi @ 2021-04-19  2:36 UTC (permalink / raw)
  To: Aili Yao
  Cc: linux-mm, Tony Luck, Andrew Morton, Oscar Salvador,
	David Hildenbrand, Borislav Petkov, Andy Lutomirski,
	linux-kernel,
	HORIGUCHI NAOYA(堀口 直也)

> > 2. In the function hwpoison_pte_range():
> > if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE): this check seems like it should use PMD_SIZE/PAGE_SIZE or some macro like that?
> 
> Thanks, that's right.  HPAGE_PMD_NR seems to fit here.
> We also need "#ifdef CONFIG_TRANSPARENT_HUGEPAGE" to use it.

I found that the #ifdef is not necessary because the whole
"if (ptl)" block is compiled out.  So I didn't add the #ifdef.

Here's the v2 of 3/3.

Aili, could you test with it?

Thanks,
Naoya Horiguchi

-----
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Tue, 13 Apr 2021 07:26:25 +0900
Subject: [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error
 virtual address

The previous patch solves the infinite MCE loop issue when multiple
MCE events race.  The remaining issue is to make sure that all threads
processing Action Required MCEs send the SIGBUS to the current process
with the proper virtual address and error size.

This patch does a page table walk to find the error virtual address.
If the walk finds multiple virtual addresses, we can't determine which
one is correct, so we fall back to sending SIGBUS in kill_me_maybe()
without error info as we do now.  This corner case needs to be solved
in the future.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
change log v1 -> v2:
- initialize local variables in check_hwpoisoned_entry() and
  hwpoison_pte_range()
- fix and improve logic to calculate error address offset.
---
 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h        |   5 ++
 mm/memory-failure.c            | 147 ++++++++++++++++++++++++++++++++-
 3 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..3ce23445a48c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1257,19 +1257,28 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int ret;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
-	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
+	if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
 
+	/*
+	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+	 * to the current process with the proper error info, so no need to
+	 * send it here again.
+	 */
+	if (ret == -EHWPOISON)
+		return;
+
 	if (p->mce_vaddr != (void __user *)-1l) {
 		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
 	} else {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..98ea67fcf360 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -323,6 +323,11 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 368ef77e01f9..99dd4caf43cb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/pagewalk.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -554,6 +555,142 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 		collect_procs_file(page, tokill, force_early);
 }
 
+struct hwp_walk {
+	struct to_kill tk;
+	unsigned long pfn;
+	int flags;
+};
+
+static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
+{
+	/* Abort pagewalk when finding multiple mappings to the error page. */
+	if (tk->addr)
+		return 1;
+	tk->addr = addr;
+	tk->size_shift = shift;
+	return 0;
+}
+
+static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
+				unsigned long poisoned_pfn, struct to_kill *tk)
+{
+	unsigned long pfn = 0;
+
+	if (pte_present(pte)) {
+		pfn = pte_pfn(pte);
+	} else {
+		swp_entry_t swp = pte_to_swp_entry(pte);
+
+		if (is_hwpoison_entry(swp))
+			pfn = hwpoison_entry_to_pfn(swp);
+	}
+
+	if (!pfn || pfn != poisoned_pfn)
+		return 0;
+
+	return set_to_kill(tk, addr, shift);
+}
+
+static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
+{
+	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+	int ret = 0;
+	pte_t *ptep;
+	spinlock_t *ptl;
+
+	ptl = pmd_trans_huge_lock(pmdp, walk->vma);
+	if (ptl) {
+		pmd_t pmd = *pmdp;
+
+		if (pmd_present(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd);
+
+			if (pfn <= hwp->pfn && hwp->pfn < pfn + HPAGE_PMD_NR) {
+				unsigned long hwpoison_vaddr = addr +
+					((hwp->pfn - pfn) << PAGE_SHIFT);
+
+				ret = set_to_kill(&hwp->tk, hwpoison_vaddr,
+						  PAGE_SHIFT);
+			}
+		}
+		spin_unlock(ptl);
+		goto out;
+	}
+
+	if (pmd_trans_unstable(pmdp))
+		goto out;
+
+	ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != end; ptep++, addr += PAGE_SIZE) {
+		ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
+					     hwp->pfn, &hwp->tk);
+		if (ret == 1)
+			break;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+out:
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
+			    unsigned long addr, unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+	pte_t pte = huge_ptep_get(ptep);
+	struct hstate *h = hstate_vma(walk->vma);
+
+	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
+				      hwp->pfn, &hwp->tk);
+}
+#else
+#define hwpoison_hugetlb_range	NULL
+#endif
+
+static struct mm_walk_ops hwp_walk_ops = {
+	.pmd_entry = hwpoison_pte_range,
+	.hugetlb_entry = hwpoison_hugetlb_range,
+};
+
+/*
+ * Sends SIGBUS to the current process with the error info.
+ *
+ * This function is intended to handle "Action Required" MCEs on already
+ * hardware poisoned pages. They could happen, for example, when
+ * memory_failure() failed to unmap the error page at the first call, or
+ * when multiple Action Optional MCE events race on different CPUs with
+ * Local MCE enabled.
+ *
+ * MCE handler currently has no easy access to the error virtual address,
+ * so this function walks page table to find it. One challenge on this is
+ * to reliably get the proper virtual address of the error to report to
+ * applications via SIGBUS. A process could map a page multiple times to
+ * different virtual addresses, then we now have no way to tell which virtual
+ * address was accessed when the Action Required MCE was generated.
+ * So in such a corner case, we now give up and fall back to sending SIGBUS
+ * with no error info.
+ */
+static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
+				  int flags)
+{
+	int ret;
+	struct hwp_walk priv = {
+		.pfn = pfn,
+	};
+	priv.tk.tsk = p;
+
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &hwp_walk_ops,
+			      (void *)&priv);
+	if (!ret && priv.tk.addr)
+		kill_proc(&priv.tk, pfn, flags);
+	mmap_read_unlock(p->mm);
+	return ret ? -EFAULT : -EHWPOISON;
+}
+
 static const char *action_name[] = {
 	[MF_IGNORED] = "Ignored",
 	[MF_FAILED] = "Failed",
@@ -1228,7 +1365,10 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return -EHWPOISON;
+		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, page_to_pfn(head), flags);
+		return res;
 	}
 
 	num_poisoned_pages_inc();
@@ -1437,8 +1577,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, pfn, flags);
 		mutex_unlock(&mf_mutex);
-		return -EHWPOISON;
+		return res;
 	}
 
 	orig_head = hpage = compound_head(p);
-- 
2.25.1



* Re: [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-19  2:36     ` [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address Naoya Horiguchi
@ 2021-04-19  3:43       ` Aili Yao
  0 siblings, 0 replies; 22+ messages in thread
From: Aili Yao @ 2021-04-19  3:43 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Tony Luck, Andrew Morton, Oscar Salvador,
	David Hildenbrand, Borislav Petkov, Andy Lutomirski,
	linux-kernel,
	HORIGUCHI NAOYA(堀口 直也),
	yaoaili

On Mon, 19 Apr 2021 11:36:58 +0900
Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:

> > > 2. In the function hwpoison_pte_range():
> > > if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE): this check seems like it should use PMD_SIZE/PAGE_SIZE or some macro like that?
> > 
> > Thanks, that's right.  HPAGE_PMD_NR seems to fit here.
> > We also need "#ifdef CONFIG_TRANSPARENT_HUGEPAGE" to use it.  
> 
> I found that the #ifdef is not necessary because the whole
> "if (ptl)" block is compiled out.  So I didn't add the #ifdef.
> 
> Here's the v2 of 3/3.
> 
> Aili, could you test with it?
> 
> Thanks,
> Naoya Horiguchi
> 

I tested this v2 version. In my test, these patches worked as expected
and the previous issues didn't happen again.

Tested-by: Aili Yao <yaoaili@kingsoft.com>

Thanks,
Aili Yao

> -----
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Tue, 13 Apr 2021 07:26:25 +0900
> Subject: [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error
>  virtual address
> 
> The previous patch solves the infinite MCE loop issue when multiple
> MCE events race.  The remaining issue is to make sure that all threads
> processing Action Required MCEs send the SIGBUS to the current process
> with the proper virtual address and error size.
> 
> This patch does a page table walk to find the error virtual address.
> If the walk finds multiple virtual addresses, we can't determine which
> one is correct, so we fall back to sending SIGBUS in kill_me_maybe()
> without error info as we do now.  This corner case needs to be solved
> in the future.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
> change log v1 -> v2:
> - initialize local variables in check_hwpoisoned_entry() and
>   hwpoison_pte_range()
> - fix and improve logic to calculate error address offset.
> ---
>  arch/x86/kernel/cpu/mce/core.c |  13 ++-
>  include/linux/swapops.h        |   5 ++
>  mm/memory-failure.c            | 147 ++++++++++++++++++++++++++++++++-
>  3 files changed, 161 insertions(+), 4 deletions(-)
> 




* Re: [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-04-12 22:43 ` [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
@ 2021-04-19 17:05   ` Borislav Petkov
  2021-04-20  7:46     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-04-19 17:05 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Tony Luck, Aili Yao, Andrew Morton, Oscar Salvador,
	David Hildenbrand, Andy Lutomirski, Naoya Horiguchi,
	linux-kernel

On Tue, Apr 13, 2021 at 07:43:18AM +0900, Naoya Horiguchi wrote:
> From: Tony Luck <tony.luck@intel.com>
> 
> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  mm/memory-failure.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
> index 24210c9bd843..c1509f4b565e 100644
> --- v5.12-rc5/mm/memory-failure.c
> +++ v5.12-rc5_patched/mm/memory-failure.c
> @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +static DEFINE_MUTEX(mf_mutex);
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>  		return -ENXIO;
>  	}

So the locking patterns are done in two different ways, which are
confusing when following this code:

> +	mutex_lock(&mf_mutex);
> +
>  try_again:
> -	if (PageHuge(p))
> -		return memory_failure_hugetlb(pfn, flags);
> +	if (PageHuge(p)) {
> +		res = memory_failure_hugetlb(pfn, flags);
> +		goto out2;
> +	}

You have the goto to a label where you do the unlocking (btw, pls do
s/out2/out_unlock/g;)...

> +
>  	if (TestSetPageHWPoison(p)) {
>  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
>  			pfn);
> +		mutex_unlock(&mf_mutex);
>  		return 0;

... and you have the other case where you unlock before returning.

Since you've added the label, I think *all* the unlocking should do
"goto out_unlock" instead of doing either/or.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-04-19 17:05   ` Borislav Petkov
@ 2021-04-20  7:46     ` HORIGUCHI NAOYA(堀口 直也)
  2021-04-20 10:16       ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-04-20  7:46 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Naoya Horiguchi, linux-mm, Tony Luck, Aili Yao, Andrew Morton,
	Oscar Salvador, David Hildenbrand, Andy Lutomirski, linux-kernel

On Mon, Apr 19, 2021 at 07:05:38PM +0200, Borislav Petkov wrote:
> On Tue, Apr 13, 2021 at 07:43:18AM +0900, Naoya Horiguchi wrote:
> > From: Tony Luck <tony.luck@intel.com>
> > 
> > There can be races when multiple CPUs consume poison from the same
> > page. The first into memory_failure() atomically sets the HWPoison
> > page flag and begins hunting for tasks that map this page. Eventually
> > it invalidates those mappings and may send a SIGBUS to the affected
> > tasks.
> > 
> > But while all that work is going on, other CPUs see a "success"
> > return code from memory_failure() and so they believe the error
> > has been handled and continue executing.
> > 
> > Fix by wrapping most of the internal parts of memory_failure() in
> > a mutex.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > ---
> >  mm/memory-failure.c | 19 +++++++++++++++++--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git v5.12-rc5/mm/memory-failure.c v5.12-rc5_patched/mm/memory-failure.c
> > index 24210c9bd843..c1509f4b565e 100644
> > --- v5.12-rc5/mm/memory-failure.c
> > +++ v5.12-rc5_patched/mm/memory-failure.c
> > @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> >  	return rc;
> >  }
> >  
> > +static DEFINE_MUTEX(mf_mutex);
> > +
> >  /**
> >   * memory_failure - Handle memory failure of a page.
> >   * @pfn: Page Number of the corrupted page
> > @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
> >  		return -ENXIO;
> >  	}
> 
> So the locking patterns are done in two different ways, which are
> confusing when following this code:
>
> > +	mutex_lock(&mf_mutex);
> > +
> >  try_again:
> > -	if (PageHuge(p))
> > -		return memory_failure_hugetlb(pfn, flags);
> > +	if (PageHuge(p)) {
> > +		res = memory_failure_hugetlb(pfn, flags);
> > +		goto out2;
> > +	}
> 
> You have the goto to a label where you do the unlocking (btw, pls do
> s/out2/out_unlock/g;)...
> 
> > +
> >  	if (TestSetPageHWPoison(p)) {
> >  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
> >  			pfn);
> > +		mutex_unlock(&mf_mutex);
> >  		return 0;
> 
> ... and you have the other case where you unlock before returning.
> 
> Since you've added the label, I think *all* the unlocking should do
> "goto out_unlock" instead of doing either/or.

Makes sense, so I updated the patch as below.  The existing label "out"
could be renamed in the same manner, so I replaced it with "unlock_page"
and "out2" with "unlock_mutex". I thought of using "out_unlock_{page,mutex}",
but maybe it's clear enough without the "out_" prefix.

If you have any other suggestion, please let me know.

Thanks,
Naoya Horiguchi

---
From 19dbd0e9cd2c7febf67370ac9957bdde83084316 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Tue, 20 Apr 2021 16:42:01 +0900
Subject: [PATCH 1/3] mm/memory-failure: Use a mutex to avoid memory_failure()
 races

There can be races when multiple CPUs consume poison from the same
page. The first into memory_failure() atomically sets the HWPoison
page flag and begins hunting for tasks that map this page. Eventually
it invalidates those mappings and may send a SIGBUS to the affected
tasks.

But while all that work is going on, other CPUs see a "success"
return code from memory_failure() and so they believe the error
has been handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in
a mutex.

Along with introducing an additional goto label, this patch also
aligns the return paths with the "goto out" pattern and renames
the existing label "out" to a clearer one.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..4087308e4b32 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1404,7 +1406,7 @@ int memory_failure(unsigned long pfn, int flags)
 	struct page *hpage;
 	struct page *orig_head;
 	struct dev_pagemap *pgmap;
-	int res;
+	int res = 0;
 	unsigned long page_flags;
 	bool retry = true;
 
@@ -1424,13 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 try_again:
-	if (PageHuge(p))
-		return memory_failure_hugetlb(pfn, flags);
+	if (PageHuge(p)) {
+		res = memory_failure_hugetlb(pfn, flags);
+		goto unlock_mutex;
+	}
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	orig_head = hpage = compound_head(p);
@@ -1463,17 +1470,19 @@ int memory_failure(unsigned long pfn, int flags)
 				res = MF_FAILED;
 			}
 			action_result(pfn, MF_MSG_BUDDY, res);
-			return res == MF_RECOVERED ? 0 : -EBUSY;
+			res = res == MF_RECOVERED ? 0 : -EBUSY;
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
-			return -EBUSY;
+			res = -EBUSY;
 		}
+		goto unlock_mutex;
 	}
 
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
-			return -EBUSY;
+			res = -EBUSY;
+			goto unlock_mutex;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
 	}
@@ -1497,7 +1506,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageCompound(p) && compound_head(p) != orig_head) {
 		action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 	/*
@@ -1517,14 +1526,14 @@ int memory_failure(unsigned long pfn, int flags)
 		num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
-		return 0;
+		goto unlock_mutex;
 	}
 	if (hwpoison_filter(p)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	if (!PageTransTail(p) && !PageLRU(p))
@@ -1543,7 +1552,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!hwpoison_user_mappings(p, pfn, flags, &p)) {
 		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 	/*
@@ -1552,13 +1561,15 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
 		action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
 		res = -EBUSY;
-		goto out;
+		goto unlock_page;
 	}
 
 identify_page_state:
 	res = identify_page_state(pfn, p, page_flags);
-out:
+unlock_page:
 	unlock_page(p);
+unlock_mutex:
+	mutex_unlock(&mf_mutex);
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
-- 
2.25.1


* Re: [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-04-20  7:46     ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-04-20 10:16       ` Borislav Petkov
  2021-04-21  0:57         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-04-20 10:16 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, linux-mm, Tony Luck, Aili Yao, Andrew Morton,
	Oscar Salvador, David Hildenbrand, Andy Lutomirski, linux-kernel

On Tue, Apr 20, 2021 at 07:46:26AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> If you have any other suggestion, please let me know.

Looks almost ok...

> From: Tony Luck <tony.luck@intel.com>
> Date: Tue, 20 Apr 2021 16:42:01 +0900
> Subject: [PATCH 1/3] mm/memory-failure: Use a mutex to avoid memory_failure()
>  races
> 
> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Along with introducing an additional goto label, this patch also

... avoid having "This patch" or "This commit" in the commit message.
It is tautologically useless. Also, you don't have to explain what the
patch does - that's visible hopefully. :-)

Other than that, it makes sense and the "sandwiching" looks correct:

	mutex_lock
	lock_page

	...

	unlock_page
	mutex_unlock

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-04-20 10:16       ` Borislav Petkov
@ 2021-04-21  0:57         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-04-21  0:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Naoya Horiguchi, linux-mm, Tony Luck, Aili Yao, Andrew Morton,
	Oscar Salvador, David Hildenbrand, Andy Lutomirski, linux-kernel

On Tue, Apr 20, 2021 at 12:16:57PM +0200, Borislav Petkov wrote:
> On Tue, Apr 20, 2021 at 07:46:26AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > If you have any other suggestion, please let me know.
> 
> Looks almost ok...
> 
> > From: Tony Luck <tony.luck@intel.com>
> > Date: Tue, 20 Apr 2021 16:42:01 +0900
> > Subject: [PATCH 1/3] mm/memory-failure: Use a mutex to avoid memory_failure()
> >  races
> > 
> > There can be races when multiple CPUs consume poison from the same
> > page. The first into memory_failure() atomically sets the HWPoison
> > page flag and begins hunting for tasks that map this page. Eventually
> > it invalidates those mappings and may send a SIGBUS to the affected
> > tasks.
> > 
> > But while all that work is going on, other CPUs see a "success"
> > return code from memory_failure() and so they believe the error
> > has been handled and continue executing.
> > 
> > Fix by wrapping most of the internal parts of memory_failure() in
> > a mutex.
> > 
> > Along with introducing an additional goto label, this patch also
> 
> ... avoid having "This patch" or "This commit" in the commit message.
> It is tautologically useless. Also, you don't have to explain what the
> patch does - that's visible hopefully. :-)

OK, I'll drop this paragraph from next version.

Thanks,
Naoya Horiguchi


* Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-20 15:42 ` Luck, Tony
@ 2021-04-21  1:04   ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-04-21  1:04 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Jue Wang, nao.horiguchi, akpm, bp, david, linux-kernel, linux-mm,
	luto, osalvador, yaoaili

On Tue, Apr 20, 2021 at 08:42:53AM -0700, Luck, Tony wrote:
> On Mon, Apr 19, 2021 at 06:49:15PM -0700, Jue Wang wrote:
> > On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> > ...
> > > + * This function is intended to handle "Action Required" MCEs on already
> > > + * hardware poisoned pages. They could happen, for example, when
> > > + * memory_failure() failed to unmap the error page at the first call, or
> > > + * when multiple Action Optional MCE events race on different CPUs with
> > > + * Local MCE enabled.
> > 
> > +Tony Luck
> > 
> > Hey Tony, I thought SRAO MCEs are broadcast to all cores in the system
> > since they come without an execution context, is that correct?
> > 
> > If yes, Naoya, I think we might want to remove the comment about the
> > "multiple Action Optional MCE racing" part.
> 
> Jue,
> 
> Correct. SRAO machine checks are broadcast.  But rather than remove the
> second part, just replace with "multiple local machine checks on different
> CPUs".

This looks more precise, so I replaced as such in v3.

Thanks,
Naoya Horiguchi


* RE: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-20 16:30   ` Jue Wang
@ 2021-04-20 17:15     ` Luck, Tony
  0 siblings, 0 replies; 22+ messages in thread
From: Luck, Tony @ 2021-04-20 17:15 UTC (permalink / raw)
  To: Jue Wang
  Cc: Naoya Horiguchi, Andrew Morton, Borislav Petkov, david,
	linux-kernel, linux-mm, luto,
	HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, yaoaili

> I meant in this case (racing to access the same poisoned pages), the
> page mapping should have been removed, and the hwpoison swap pte
> installed, by the winner thread?

My "mutex" patch that Horiguchi-san has included his summary series should
make this happen in most cases. Only problem is if the first thread runs into some
error and does not complete unmapping the poison page from all other tasks.

So the backup plan is to scan the page tables.

>> Well, I did try a patch that removed *all* user mappings (switched CR3 to
>> swapper_pgdir) and returned to user. Then have the resulting page fault
>> report the address. But that didn't work very well.
> Curious what didn't work well in this case? :-)

Andy Lutomirski wasn't happy with the approach. It was specifically meant
to cover the "kernel accesses poison more than once from get_user()" case.

It doesn't generalize to the case where the user accessed the poison (because
you'll just take the #PF on the instruction fetch ... everything is unmapped ...
instead of on the original data access).

-Tony



* Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-20 15:47 ` Luck, Tony
@ 2021-04-20 16:30   ` Jue Wang
  2021-04-20 17:15     ` Luck, Tony
  0 siblings, 1 reply; 22+ messages in thread
From: Jue Wang @ 2021-04-20 16:30 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Naoya Horiguchi, Andrew Morton, Borislav Petkov, david,
	linux-kernel, linux-mm, luto,
	HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, yaoaili

On Tue, Apr 20, 2021 at 8:48 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> On Mon, Apr 19, 2021 at 07:03:01PM -0700, Jue Wang wrote:
> > On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> >
> > > This patch does a page table walk to find the error virtual address.
> > > If the walk finds multiple virtual addresses, we can't determine which
> > > one is correct, so we fall back to sending SIGBUS in kill_me_maybe()
> > > without error info as we do now.  This corner case needs to be solved
> > > in the future.
> >
> > Instead of walking the page tables, I wonder what about the following idea:
> >
> > When failing to get vaddr, memory_failure just ensures the mapping is removed
> > and an hwpoisoned swap pte is put in place; or the original page is flagged with
> > PG_HWPOISONED and kept in the radix tree (e.g., for SHMEM THP).
>
> To remove the mapping, you need to know the virtual address :-)
I meant in this case (racing to access the same poisoned pages), the
page mapping should have been removed, and the hwpoison swap pte
installed, by the winner thread?

Can other racing threads rely on the subsequent #PFs to get the
correct SIGBUS with accurate vaddr semantics? Or is the goal to "give
back a correct SIGBUS with an accurate vaddr on _the first MCE on ANY
thread_"? I wonder if that goal is absolutely necessary or whether it can
be relaxed a little to take subsequent #PFs into account.
>
> Well, I did try a patch that removed *all* user mappings (switched CR3 to
> swapper_pgdir) and returned to user. Then have the resulting page fault
> report the address. But that didn't work very well.
Curious what didn't work well in this case? :-)

>
>
>
>
> > NOTE: no SIGBUS is sent to user space.
> >
> > Then do_machine_check just returns to user space to resume execution, the
> > re-execution will result in a #PF and should land to the exact page fault
> > handling code that generates a SIGBUS with the precise vaddr info:
>
> That's how SRAO (and other races) are supposed to work.
Hmm, I wonder why it doesn't apply to this race.
>
> -Tony

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-20  2:03 ` Jue Wang
  (?)
@ 2021-04-20 15:47 ` Luck, Tony
  2021-04-20 16:30   ` Jue Wang
  -1 siblings, 1 reply; 22+ messages in thread
From: Luck, Tony @ 2021-04-20 15:47 UTC (permalink / raw)
  To: Jue Wang
  Cc: nao.horiguchi, akpm, bp, david, linux-kernel, linux-mm, luto,
	naoya.horiguchi, osalvador, yaoaili

On Mon, Apr 19, 2021 at 07:03:01PM -0700, Jue Wang wrote:
> On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> 
> > This patch suggests doing a page table walk to find the error virtual
> > address.  If we find multiple virtual addresses during the walk, we
> > currently can't determine which one is correct, so we fall back to
> > sending SIGBUS in kill_me_maybe() without error info, as we do now.
> > This corner case needs to be solved in the future.
> 
> Instead of walking the page tables, I wonder about the following idea:
> 
> When it fails to get the vaddr, memory_failure just ensures the mapping
> is removed and a hwpoisoned swap pte is put in place; or the original
> page is flagged with PG_HWPOISONED and kept in the radix tree (e.g.,
> for SHMEM THP).

To remove the mapping, you need to know the virtual address :-)

Well, I did try a patch that removed *all* user mappings (switched CR3 to
swapper_pg_dir) and returned to user. Then have the resulting page fault
report the address. But that didn't work very well.

> NOTE: no SIGBUS is sent to user space.
> 
> Then do_machine_check just returns to user space to resume execution;
> the re-execution will result in a #PF and should land in the exact page
> fault handling code that generates a SIGBUS with the precise vaddr info:

That's how SRAO (and other races) are supposed to work.

-Tony

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-20  1:49 ` Jue Wang
  (?)
  (?)
@ 2021-04-20 15:42 ` Luck, Tony
  2021-04-21  1:04   ` HORIGUCHI NAOYA(堀口 直也)
  -1 siblings, 1 reply; 22+ messages in thread
From: Luck, Tony @ 2021-04-20 15:42 UTC (permalink / raw)
  To: Jue Wang
  Cc: nao.horiguchi, akpm, bp, david, linux-kernel, linux-mm, luto,
	naoya.horiguchi, osalvador, yaoaili

On Mon, Apr 19, 2021 at 06:49:15PM -0700, Jue Wang wrote:
> On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> ...
> > + * This function is intended to handle "Action Required" MCEs on already
> > + * hardware poisoned pages. They could happen, for example, when
> > + * memory_failure() failed to unmap the error page at the first call, or
> > + * when multiple Action Optional MCE events races on different CPUs with
> > + * Local MCE enabled.
> 
> +Tony Luck
> 
> Hey Tony, I thought SRAO MCEs are broadcast to all cores in the system
> since they come without an execution context; is that correct?
> 
> If yes, Naoya, I think we might want to remove the comment about the
> "multiple Action Optional MCE racing" part.

Jue,

Correct. SRAO machine checks are broadcast.  But rather than remove the
second part, just replace it with "multiple local machine checks on
different CPUs".

-Tony

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
  2021-04-20  1:49 ` Jue Wang
  (?)
@ 2021-04-20  7:51 ` HORIGUCHI NAOYA(堀口 直也)
  -1 siblings, 0 replies; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-04-20  7:51 UTC (permalink / raw)
  To: Jue Wang
  Cc: nao.horiguchi, Luck, Tony, akpm, bp, david, linux-kernel,
	linux-mm, luto, osalvador, yaoaili

On Mon, Apr 19, 2021 at 06:49:15PM -0700, Jue Wang wrote:
> On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> ...
> > + * This function is intended to handle "Action Required" MCEs on already
> > + * hardware poisoned pages. They could happen, for example, when
> > + * memory_failure() failed to unmap the error page at the first call, or
> > + * when multiple Action Optional MCE events races on different CPUs with
> > + * Local MCE enabled.
> 
> +Tony Luck
> 
> Hey Tony, I thought SRAO MCEs are broadcast to all cores in the system
> since they come without an execution context; is that correct?
> 
> If yes, Naoya, I think we might want to remove the comment about the
> "multiple Action Optional MCE racing" part.

Hi Jue,

I simply made a mistake there; I meant "when multiple Action Required MCE events race ...".
Sorry, and thank you for catching it.  This will be fixed in later versions.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
@ 2021-04-20  2:03 ` Jue Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Jue Wang @ 2021-04-20  2:03 UTC (permalink / raw)
  To: nao.horiguchi, Luck, Tony
  Cc: akpm, bp, david, linux-kernel, linux-mm, luto, naoya.horiguchi,
	osalvador, yaoaili

On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:

> This patch suggests doing a page table walk to find the error virtual
> address.  If we find multiple virtual addresses during the walk, we
> currently can't determine which one is correct, so we fall back to
> sending SIGBUS in kill_me_maybe() without error info, as we do now.
> This corner case needs to be solved in the future.

Instead of walking the page tables, I wonder about the following idea:

When it fails to get the vaddr, memory_failure just ensures the mapping
is removed and a hwpoisoned swap pte is put in place; or the original
page is flagged with PG_HWPOISONED and kept in the radix tree (e.g.,
for SHMEM THP).

NOTE: no SIGBUS is sent to user space.

Then do_machine_check just returns to user space to resume execution;
the re-execution will result in a #PF and should land in the exact page
fault handling code that generates a SIGBUS with the precise vaddr info:

https://github.com/torvalds/linux/blob/7af08140979a6e7e12b78c93b8625c8d25b084e2/mm/memory.c#L3290
https://github.com/torvalds/linux/blob/7af08140979a6e7e12b78c93b8625c8d25b084e2/mm/memory.c#L3647
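
In other words, the swap-entry path in do_swap_page() at those links does
roughly the following (a simplified sketch of the upstream code):

	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

	if (is_hwpoison_entry(entry))
		ret = VM_FAULT_HWPOISON;	/* SIGBUS with vmf->address */

so the faulting thread gets its SIGBUS with the exact user vaddr.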

Thanks,
-Jue

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
@ 2021-04-20  1:49 ` Jue Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Jue Wang @ 2021-04-20  1:49 UTC (permalink / raw)
  To: nao.horiguchi, Luck, Tony
  Cc: akpm, bp, david, linux-kernel, linux-mm, luto, naoya.horiguchi,
	osalvador, yaoaili

On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
...
> + * This function is intended to handle "Action Required" MCEs on already
> + * hardware poisoned pages. They could happen, for example, when
> + * memory_failure() failed to unmap the error page at the first call, or
> + * when multiple Action Optional MCE events races on different CPUs with
> + * Local MCE enabled.

+Tony Luck

Hey Tony, I thought SRAO MCEs are broadcast to all cores in the system
since they come without an execution context; is that correct?

If yes, Naoya, I think we might want to remove the comment about the
"multiple Action Optional MCE racing" part.

Best,
-Jue

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-04-21  1:04 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-12 22:43 [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Naoya Horiguchi
2021-04-12 22:43 ` [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races Naoya Horiguchi
2021-04-19 17:05   ` Borislav Petkov
2021-04-20  7:46     ` HORIGUCHI NAOYA(堀口 直也)
2021-04-20 10:16       ` Borislav Petkov
2021-04-21  0:57         ` HORIGUCHI NAOYA(堀口 直也)
2021-04-12 22:43 ` [PATCH v1 2/3] mm,hwpoison: return -EHWPOISON when page already Naoya Horiguchi
2021-04-12 22:43 ` [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address Naoya Horiguchi
2021-04-17  5:47 ` [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Aili Yao
2021-04-19  1:09   ` HORIGUCHI NAOYA(堀口 直也)
2021-04-19  2:36     ` [PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address Naoya Horiguchi
2021-04-19  3:43       ` Aili Yao
2021-04-20  1:49 [PATCH v1 " Jue Wang
2021-04-20  7:51 ` HORIGUCHI NAOYA(堀口 直也)
2021-04-20 15:42 ` Luck, Tony
2021-04-21  1:04   ` HORIGUCHI NAOYA(堀口 直也)
2021-04-20  2:03 Jue Wang
2021-04-20 15:47 ` Luck, Tony
2021-04-20 16:30   ` Jue Wang
2021-04-20 17:15     ` Luck, Tony
