Linux-EDAC Archive on lore.kernel.org
* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
       [not found]             ` <20210226021907.GA27861@hori.linux.bs1.fc.nec.co.jp>
@ 2021-02-26  2:59               ` Aili Yao
  2021-03-03  3:39                 ` Luck, Tony
  0 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-02-26  2:59 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Luck, Tony
  Cc: Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

Hi Naoya, Tony:
> > 
> > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > from user context ... maybe parallel calls for the same page should
> > be blocked until the first caller completes so we can:
> > a) know that pages are unmapped (if that happens)
> > b) all get the same success/fail status  
> 
> One memory_failure() call changes the target page's status and
> affects all mappings to all affected processes, so I think that
> (ideally) we don't have to block other threads (letting them
> early return seems fine).  Sometimes memory_failure() fails,
> but even in such case, PG_hwpoison is set on the page and other
> threads properly get SIGBUSs with this patch, so I think that
> we can avoid the worst scenario (like system stall by MCE loop).
> 
I agree with Naoya's point. If we block for this issue, does that change the outcome
that the process should be killed? Or is there something else that still needs to be considered?

Thanks!
Aili Yao  

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
       [not found]         ` <20210226105250.3a15e35c@alex-virtual-machine>
@ 2021-02-26 17:58           ` Luck, Tony
  2021-03-02  4:32             ` Aili Yao
  0 siblings, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-02-26 17:58 UTC (permalink / raw)
  To: Aili Yao
  Cc: Oscar Salvador,
	HORIGUCHI NAOYA(堀口 直也),
	david, akpm, bp, tglx, mingo, hpa, x86, linux-edac, linux-kernel,
	linux-mm, yangfeng1

On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote:
> Hi naoya,Oscar,david:
> > 
> > > We could use some negative value (error code) to report the reported case,
> > > then as you mentioned above, some callers need change to handle the
> > > new case, and the same is true if you use some positive value.
> > > My preference is -EHWPOISON, but other options are fine if justified well.  
> > 
> > -EHWPOISON seems like a good fit.
> > 
> I am OK with the -EHWPOISON error code, But I have one doubt here:
> When we return this -EHWPOISON error code, Does this means we have to add a new error code
> to error-base.h or errno.h? Is this easy realized?

The page already poisoned isn't really an error though. Just the result
of a race condition.  What if we added an extra argument to memory_failure()
so it can tell the caller that the specific reason for the early successful
return is that the page was already poisoned?

Something like this (untested - patch against v5.11):

-Tony

---

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..0e32c4d879fb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -637,6 +637,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 {
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
+	int already = 0;
 
 	if (!mce || !mce_usable_address(mce))
 		return NOTIFY_DONE;
@@ -646,8 +647,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		return NOTIFY_DONE;
 
 	pfn = mce->addr >> PAGE_SHIFT;
-	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn, whole_page(mce));
+	if (!memory_failure(pfn, 0, &already)) {
+		if (!already)
+			set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1245,15 +1247,19 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int already = 0;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		if (already)
+			force_sig(SIGBUS); // BETTER CODE NEEDED HERE!!!
+		else
+			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
@@ -1440,7 +1446,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	/* mce_severity() should not hand us an ACTION_REQUIRED error */
 	BUG_ON(flags & MF_ACTION_REQUIRED);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index eef4ffb6122c..24c36623e492 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = memory_failure(pfn, 0);
+	ret = memory_failure(pfn, 0, NULL);
 	return ret ? ret : count;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..88b92820465c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3045,7 +3045,7 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure(unsigned long pfn, int flags, int *already);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..bfd5151dcd3f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, 0);
+	return memory_failure(pfn, 0, NULL);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..ade1956632aa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -907,7 +907,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = memory_failure(pfn, MF_COUNT_INCREASED);
+			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
 		}
 
 		if (ret)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e9481632fcd1..e8508e4d70e5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1388,7 +1388,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks hold.
  */
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	struct page *p;
 	struct page *hpage;
@@ -1418,6 +1418,8 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
 	if (TestSetPageHWPoison(p)) {
+		if (already)
+			*already = 1;
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
 		return 0;
@@ -1624,7 +1626,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
 		else
-			memory_failure(entry.pfn, entry.flags);
+			memory_failure(entry.pfn, entry.flags, NULL);
 	}
 }
 


* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-02-26 17:58           ` Luck, Tony
@ 2021-03-02  4:32             ` Aili Yao
  0 siblings, 0 replies; 42+ messages in thread
From: Aili Yao @ 2021-03-02  4:32 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Oscar Salvador,
	HORIGUCHI NAOYA(堀口 直也),
	david, akpm, bp, tglx, mingo, hpa, x86, linux-edac, linux-kernel,
	linux-mm, yangfeng1, yaoaili

On Fri, 26 Feb 2021 09:58:37 -0800
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote:
> > Hi naoya,Oscar,david:  
> > >   
> > > > We could use some negative value (error code) to report the reported case,
> > > > then as you mentioned above, some callers need change to handle the
> > > > new case, and the same is true if you use some positive value.
> > > > My preference is -EHWPOISON, but other options are fine if justified well.    
> > > 
> > > -EHWPOISON seems like a good fit.
> > >   
> > I am OK with the -EHWPOISON error code, But I have one doubt here:
> > When we return this -EHWPOISON error code, Does this means we have to add a new error code
> > to error-base.h or errno.h? Is this easy realized?  
> 
> The page already poisoned isn't really an error though. Just the result
> of a race condition.  What if we added an extra argument to memory_failure()
> so it can tell the caller that the specific reason for the early successful
> return is that the page was already poisoned?
> 

It may not be an error, but is it reasonable to return a positive value like MF_HWPOISON?
The 0 return code doesn't seem to tell the whole story.

Your patch seems safer, but I don't know whether it's worth such multi-module modifications for this case.
It should be referred to the other maintainers and reviewers, who can give more expert suggestions.

Thanks!
Aili Yao


* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-02-26  2:59               ` [PATCH] mm,hwpoison: return -EBUSY when page already poisoned Aili Yao
@ 2021-03-03  3:39                 ` Luck, Tony
  2021-03-03  3:57                   ` Aili Yao
  0 siblings, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-03  3:39 UTC (permalink / raw)
  To: Aili Yao
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:
> Hi naoya, tony:
> > > 
> > > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > > from user context ... maybe parallel calls for the same page should
> > > be blocked until the first caller completes so we can:
> > > a) know that pages are unmapped (if that happens)
> > > b) all get the same success/fail status  
> > 
> > One memory_failure() call changes the target page's status and
> > affects all mappings to all affected processes, so I think that
> > (ideally) we don't have to block other threads (letting them
> > early return seems fine).  Sometimes memory_failure() fails,
> > but even in such case, PG_hwpoison is set on the page and other
> > threads properly get SIGBUSs with this patch, so I think that
> > we can avoid the worst scenario (like system stall by MCE loop).
> > 
> I agree with naoya's point, if we block for this issue, Does this change the result
> that the process should be killed? Or is there something other still need to be considered?

Ok ... no blocking ... I think someone in this thread suggested
scanning the page tables to make sure the poisoned page had been
unmapped.

There's a walk_page_range() function that does all the work for that.
Just need to supply some callback routines that check whether a
mapping includes the bad PFN and clear the PRESENT bit.

RFC patch below against v5.12-rc1

-Tony

From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Tue, 2 Mar 2021 15:06:33 -0800
Subject: [PATCH] x86/mce: Handle races between machine checks

When multiple CPUs hit the same poison memory there is a race. The
first CPU into memory_failure() atomically marks the page as poison
and continues processing to hunt down all the tasks that map this page
so that the virtual addresses can be marked not-present and SIGBUS
sent to the task that did the access.

Later CPUs get an early return from memory_failure() and may return
to user mode and access the poison again.

Add a new argument to memory_failure() so that it can indicate when
the race has been lost. Fix kill_me_maybe() to scan page tables in
this case to unmap pages.
---
 arch/x86/kernel/cpu/mce/core.c | 61 +++++++++++++++++++++++++++++++---
 drivers/base/memory.c          |  2 +-
 include/linux/mm.h             |  2 +-
 mm/hwpoison-inject.c           |  2 +-
 mm/madvise.c                   |  2 +-
 mm/memory-failure.c            |  6 ++--
 6 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..2c6c560f3f92 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -30,6 +30,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/kmod.h>
+#include <linux/pagewalk.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/cpu.h>
@@ -637,6 +638,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 {
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
+	int already = 0;
 
 	if (!mce || !mce_usable_address(mce))
 		return NOTIFY_DONE;
@@ -646,8 +648,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		return NOTIFY_DONE;
 
 	pfn = mce->addr >> PAGE_SHIFT;
-	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn, whole_page(mce));
+	if (!memory_failure(pfn, 0, &already)) {
+		if (!already)
+			set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1248,6 +1251,50 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 	*m = *final;
 }
 
+static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 pfn = (u64)walk->private;
+
+	if (pte_pfn(*pte) == pfn)
+		pte->pte = pte->pte & ~_PAGE_PRESENT;
+
+	return 0;
+}
+
+static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	int shift = PMD_SHIFT - PAGE_SHIFT;
+	u64 pfn = (u64)walk->private;
+
+	if (!pmd_large(*pmd))
+		return 0;
+
+	if (pmd_pfn(*pmd) >> shift == pfn >> shift)
+		pmd->pmd = pmd->pmd & ~_PAGE_PRESENT;
+
+	return 0;
+}
+
+static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	int shift = PUD_SHIFT - PAGE_SHIFT;
+	u64 pfn = (u64)walk->private;
+
+	if (!pud_large(*pud))
+		return 0;
+
+	if (pud_pfn(*pud) >> shift == pfn >> shift)
+		pud->pud = pud->pud & ~_PAGE_PRESENT;
+
+	return 0;
+}
+
+static struct mm_walk_ops walk = {
+	.pte_entry = pte_entry,
+	.pmd_entry = pmd_entry,
+	.pud_entry = pud_entry
+};
+
 static void kill_me_now(struct callback_head *ch)
 {
 	force_sig(SIGBUS);
@@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int already = 0;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		if (already)
+			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
+		else
+			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
@@ -1452,7 +1503,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	/* mce_severity() should not hand us an ACTION_REQUIRED error */
 	BUG_ON(flags & MF_ACTION_REQUIRED);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..144500983656 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = memory_failure(pfn, 0);
+	ret = memory_failure(pfn, 0, NULL);
 	return ret ? ret : count;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..beaa6e871cbe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3003,7 +3003,7 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure(unsigned long pfn, int flags, int *already);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..bfd5151dcd3f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, 0);
+	return memory_failure(pfn, 0, NULL);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index df692d2e35d4..09f569fed68d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = memory_failure(pfn, MF_COUNT_INCREASED);
+			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
 		}
 
 		if (ret)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..9a8911aa5fc9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks hold.
  */
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	struct page *p;
 	struct page *hpage;
@@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
 	if (TestSetPageHWPoison(p)) {
+		if (already)
+			*already = 1;
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
 		return 0;
@@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
 		else
-			memory_failure(entry.pfn, entry.flags);
+			memory_failure(entry.pfn, entry.flags, NULL);
 	}
 }
 
-- 
2.29.2



* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-03  3:39                 ` Luck, Tony
@ 2021-03-03  3:57                   ` Aili Yao
  2021-03-03  8:39                     ` Aili Yao
  0 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-03  3:57 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Tue, 2 Mar 2021 19:39:53 -0800
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:
> > Hi naoya, tony:  
> > > > 
> > > > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > > > from user context ... maybe parallel calls for the same page should
> > > > be blocked until the first caller completes so we can:
> > > > a) know that pages are unmapped (if that happens)
> > > > b) all get the same success/fail status    
> > > 
> > > One memory_failure() call changes the target page's status and
> > > affects all mappings to all affected processes, so I think that
> > > (ideally) we don't have to block other threads (letting them
> > > early return seems fine).  Sometimes memory_failure() fails,
> > > but even in such case, PG_hwpoison is set on the page and other
> > > threads properly get SIGBUSs with this patch, so I think that
> > > we can avoid the worst scenario (like system stall by MCE loop).
> > >   
> > I agree with naoya's point, if we block for this issue, Does this change the result
> > that the process should be killed? Or is there something other still need to be considered?  
> 
> Ok ... no blocking ... I think someone in this thread suggested
> scanning the page tables to make sure the poisoned page had been
> unmapped.
> 
> There's a walk_page_range() function that does all the work for that.
> Just need to supply some callback routines that check whether a
> mapping includes the bad PFN and clear the PRESENT bit.
> 
> RFC patch below against v5.12-rc1
> 
> -Tony
> 
> From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001
> From: Tony Luck <tony.luck@intel.com>
> Date: Tue, 2 Mar 2021 15:06:33 -0800
> Subject: [PATCH] x86/mce: Handle races between machine checks
> 
> When multiple CPUs hit the same poison memory there is a race. The
> first CPU into memory_failure() atomically marks the page as poison
> and continues processing to hunt down all the tasks that map this page
> so that the virtual addresses can be marked not-present and SIGBUS
> sent to the task that did the access.
> 
> Later CPUs get an early return from memory_failure() and may return
> to user mode and access the poison again.
> 
> Add a new argument to memory_failure() so that it can indicate when
> the race has been lost. Fix kill_me_maybe() to scan page tables in
> this case to unmap pages.

> +
>  static void kill_me_now(struct callback_head *ch)
>  {
>  	force_sig(SIGBUS);
> @@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct callback_head *cb)
>  {
>  	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
>  	int flags = MF_ACTION_REQUIRED;
> +	int already = 0;
>  
>  	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
>  
>  	if (!p->mce_ripv)
>  		flags |= MF_MUST_KILL;
>  
> -	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> +	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
>  	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> -		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> +		if (already)
> +			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
> +		else
> +			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
>  		sync_core();
>  		return;

>  	MF_MUST_KILL = 1 << 2,
>  	MF_SOFT_OFFLINE = 1 << 3,
>  };

I have one doubt here: when "walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));" is done,
how can the process that triggered this error return to user mode if it has already taken the wrong data?

-- 
Thanks!
Aili Yao


* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-03  3:57                   ` Aili Yao
@ 2021-03-03  8:39                     ` Aili Yao
  2021-03-03 15:41                       ` Luck, Tony
  0 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-03  8:39 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

Hi Tony:

> On Tue, 2 Mar 2021 19:39:53 -0800
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:  
> > > Hi naoya, tony:    
> > > > > 
> > > > > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > > > > from user context ... maybe parallel calls for the same page should
> > > > > be blocked until the first caller completes so we can:
> > > > > a) know that pages are unmapped (if that happens)
> > > > > b) all get the same success/fail status      
> > > > 
> > > > One memory_failure() call changes the target page's status and
> > > > affects all mappings to all affected processes, so I think that
> > > > (ideally) we don't have to block other threads (letting them
> > > > early return seems fine).  Sometimes memory_failure() fails,
> > > > but even in such case, PG_hwpoison is set on the page and other
> > > > threads properly get SIGBUSs with this patch, so I think that
> > > > we can avoid the worst scenario (like system stall by MCE loop).
> > > >     
> > > I agree with naoya's point, if we block for this issue, Does this change the result
> > > that the process should be killed? Or is there something other still need to be considered?    
> > 
> > Ok ... no blocking ... 

I have thought about the blocking method and the wrong-address issue with
SIGBUS; here is my opinion, which may be helpful:

For blocking: if we block here, there is some undefined work that I think
needs to be done. Since we don't know whether process B, which triggers this
error again, is early-kill or not, the previous memory_failure() call may not
have added B to the kill list. Even if B is on the kill list, the error level
for B is not set properly, as B should get an AR SIGBUS.

So we can't just wait; we must have some logic that adds process B to the
kill list, and since this is an AR error another change to the current code
is needed: we need more logic in kill_proc() or some other place.

Even if all that work is done correctly, there is one more serious scenario:
we don't even know which step the previous memory_failure() call has reached,
so the previous modification may not be useful at all. When this scenario
happens, what can we do, block or return? If we finally return, an error code
should be carried back, so we have to go through the error-processing logic,
and a signal without the right address will be sent.

As for the wrong error address in the SIGBUS: I don't think this issue
results from the patch I posted; the issue was already there before my patch.
I can't find a feasible way to get the correct address, for the same reason:
we don't know whether the page mapping is still there when we get to
kill_me_maybe(). In some cases we may get it, but there are a lot of
parallelism issues to consider, and if that fails we have to fall back to the
error branch again, so keeping the current code may be the easy option.

Any method or patch that solves this issue in a better way is OK with me; I
want this issue fixed, and in a more complete way!

-- 
Thanks!
Aili Yao


* RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-03  8:39                     ` Aili Yao
@ 2021-03-03 15:41                       ` Luck, Tony
  2021-03-04  2:16                         ` Aili Yao
  0 siblings, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-03 15:41 UTC (permalink / raw)
  To: Aili Yao
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

> For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
> I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
> we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
> to the error brach again, remaining current code may be an easy option;

My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page
tables we know that the poison page isn't mapped (note that the patch is RFC for a reason ... I'm 90% sure that it should
do a bit more than just clear the PRESENT bit).

So perhaps memory_failure() has queued a SIGBUS for this task; if so, we take it when we return from kill_me_maybe().

If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF

The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS
(just like it does if the page had been removed by an earlier UCNA/SRAO error).

At least that's the theory.

-Tony


* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-03 15:41                       ` Luck, Tony
@ 2021-03-04  2:16                         ` Aili Yao
  2021-03-04  4:19                           ` Aili Yao
  0 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-04  2:16 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Wed, 3 Mar 2021 15:41:35 +0000
"Luck, Tony" <tony.luck@intel.com> wrote:

> > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
> > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
> > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
> > to the error brach again, remaining current code may be an easy option;  
> 
> My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page
> tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should
> do a bit more that just clear the PRESENT bit).
> 
> So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe()
> 
> If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF

Got this; I had some mistaken thoughts here.


> The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS
> (just like it does if the page had been removed by an earlier UCNA/SRAO error).

If your method works, should it work like this?

                        pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
                        if (PageHuge(page)) {
                                hugetlb_count_sub(compound_nr(page), mm);
                                set_huge_swap_pte_at(mm, address,
                                                     pvmw.pte, pteval,
                                                     vma_mmu_pagesize(vma));
                        } else {
                                dec_mm_counter(mm, mm_counter(page));
                                set_pte_at(mm, address, pvmw.pte, pteval);
                        }

The page-fault handler then checks whether it's a poisoned page using is_hwpoison_entry().

-- 
Thanks!
Aili Yao


* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-04  2:16                         ` Aili Yao
@ 2021-03-04  4:19                           ` Aili Yao
  2021-03-04  6:45                             ` Aili Yao
  0 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-04  4:19 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Thu, 4 Mar 2021 10:16:53 +0800
Aili Yao <yaoaili@kingsoft.com> wrote:

> On Wed, 3 Mar 2021 15:41:35 +0000
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
> > > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
> > > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
> > > to the error branch again, remaining current code may be an easy option;
> > 
> > My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page
> > tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should
> > do a bit more than just clear the PRESENT bit).
> > 
> > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe()

And when this happens, the process will receive a SIGBUS at AO level; is that proper, given it is not an AR?

> > If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF  
> 
> Got this, I have some error thoughts here.
> 
> 
> > The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS
> > (just like it does if the page had been removed by an earlier UCNA/SRAO error).  
> 
> if your methods works, should it be like this?
> 
> 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> 1583                         if (PageHuge(page)) {
> 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> 1585                                 set_huge_swap_pte_at(mm, address,
> 1586                                                      pvmw.pte, pteval,
> 1587                                                      vma_mmu_pagesize(vma));
> 1588                         } else {
> 1589                                 dec_mm_counter(mm, mm_counter(page));
> 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> 1591                         }
> 
> the page fault check if it's a poison page using is_hwpoison_entry(),
> 

And if it works, do we need some locking mechanism before we call walk_page_range()?
If we lock, do we need to handle the case where the blocking is interrupted, as other call sites do?

-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-04  4:19                           ` Aili Yao
@ 2021-03-04  6:45                             ` Aili Yao
  2021-03-04 23:57                               ` Luck, Tony
  0 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-04  6:45 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Thu, 4 Mar 2021 12:19:41 +0800
Aili Yao <yaoaili@kingsoft.com> wrote:

> On Thu, 4 Mar 2021 10:16:53 +0800
> Aili Yao <yaoaili@kingsoft.com> wrote:
> 
> > On Wed, 3 Mar 2021 15:41:35 +0000
> > "Luck, Tony" <tony.luck@intel.com> wrote:
> >   
> > > > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
> > > > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
> > > > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
> > > > to the error branch again, remaining current code may be an easy option;
> > > 
> > > My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page
> > > tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should
> > > do a bit more than just clear the PRESENT bit).
> > > 
> > > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe()  
> 
> And when this happen, the process will receive an SIGBUS with AO level, is it proper as not an AR?
> 
> > > If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF    
> > 
> > Got this, I have some error thoughts here.
> > 
> >   
> > > The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS
> > > (just like it does if the page had been removed by an earlier UCNA/SRAO error).    
> > 
> > if your methods works, should it be like this?
> > 
> > 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > 1583                         if (PageHuge(page)) {
> > 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> > 1585                                 set_huge_swap_pte_at(mm, address,
> > 1586                                                      pvmw.pte, pteval,
> > 1587                                                      vma_mmu_pagesize(vma));
> > 1588                         } else {
> > 1589                                 dec_mm_counter(mm, mm_counter(page));
> > 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> > 1591                         }
> > 
> > the page fault check if it's a poison page using is_hwpoison_entry(),
> >   
> 
> And if it works, does we need some locking mechanism before we call walk_page_range();
> if we lock, does we need to process the blocking interrupted error as other places will do?
> 

And another thing:
Do we need a call to flush_tlb_page(vma, address) to make the PTE changes take effect?

-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-04  6:45                             ` Aili Yao
@ 2021-03-04 23:57                               ` Luck, Tony
  2021-03-05  1:30                                 ` Aili Yao
  0 siblings, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-04 23:57 UTC (permalink / raw)
  To: Aili Yao
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > if your methods works, should it be like this?
> > > 
> > > 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > 1583                         if (PageHuge(page)) {
> > > 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> > > 1585                                 set_huge_swap_pte_at(mm, address,
> > > 1586                                                      pvmw.pte, pteval,
> > > 1587                                                      vma_mmu_pagesize(vma));
> > > 1588                         } else {
> > > 1589                                 dec_mm_counter(mm, mm_counter(page));
> > > 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> > > 1591                         }
> > > 
> > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > >   
> > 
> > And if it works, does we need some locking mechanism before we call walk_page_range();
> > if we lock, does we need to process the blocking interrupted error as other places will do?
> > 
> 
> And another thing:
> Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect?

Thanks for all the pointers.  I added them to the patch (see below).
[The pmd/pud cases may need some tweaking ... but just trying to get
the 4K page case working first]

I tried testing by skipping the call to memory_failure() and just
using this new code to search the page tables for the current page and
marking it hwpoison (to simulate the case where the 2nd process gets the
early return from memory_failure()). Something is still missing because I get:

[  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
[  481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400
[  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
[  482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6
[  482.951355] memcg:ffff94a809834000
[  482.955153] aops:shmem_aops ino:3c04
[  482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison)
[  482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8
[  482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000
[  482.984310] page dumped because: still mapped when deleted

-Tony

commit e5de44560b33e2d407704243566253a70f858a59
Author: Tony Luck <tony.luck@intel.com>
Date:   Tue Mar 2 15:06:33 2021 -0800

    x86/mce: Handle races between machine checks
    
    When multiple CPUs hit the same poison memory there is a race. The
    first CPU into memory_failure() atomically marks the page as poison
    and continues processing to hunt down all the tasks that map this page
    so that the virtual addresses can be marked not-present and SIGBUS
    sent to the task that did the access.
    
    Later CPUs get an early return from memory_failure() and may return
    to user mode and access the poison again.
    
    Add a new argument to memory_failure() so that it can indicate when
    the race has been lost. Fix kill_me_maybe() to scan page tables in
    this case to unmap pages.

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..a52c6a772de2 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/init.h>
 #include <linux/kmod.h>
+#include <linux/pagewalk.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/cpu.h>
@@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 {
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
+	int already = 0;
 
 	if (!mce || !mce_usable_address(mce))
 		return NOTIFY_DONE;
@@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		return NOTIFY_DONE;
 
 	pfn = mce->addr >> PAGE_SHIFT;
-	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn, whole_page(mce));
+	if (!memory_failure(pfn, 0, &already)) {
+		if (!already)
+			set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 	*m = *final;
 }
 
+static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 pfn = (u64)walk->private;
+	struct page *page;
+	pte_t pteval;
+
+	if (pte_pfn(*pte) == pfn) {
+pr_info("pte_entry: matched pfn - mark poison & zap pte\n");
+		page = pfn_to_page(pfn);
+		lock_page(page);
+SetPageHWPoison(page);
+		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
+		dec_mm_counter(walk->mm, mm_counter(page));
+		set_pte_at(current->mm, addr, pte, pteval);
+		unlock_page(page);
+		flush_tlb_page(walk->vma, addr);
+	}
+
+	return 0;
+}
+
+static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	int shift = PMD_SHIFT - PAGE_SHIFT;
+	u64 pfn = (u64)walk->private;
+	struct page *page;
+	pte_t pteval;
+
+	if (!pmd_large(*pmd))
+		return 0;
+
+	if (pmd_pfn(*pmd) >> shift == pfn >> shift) {
+		page = pfn_to_page(pfn);
+		lock_page(page);
+		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
+		hugetlb_count_sub(compound_nr(page), walk->mm);
+		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma));
+		unlock_page(page);
+		flush_tlb_page(walk->vma, addr);
+	}
+
+	return 0;
+}
+
+static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	int shift = PUD_SHIFT - PAGE_SHIFT;
+	u64 pfn = (u64)walk->private;
+	struct page *page;
+	pte_t pteval;
+
+	if (!pud_large(*pud))
+		return 0;
+
+	if (pud_pfn(*pud) >> shift == pfn >> shift) {
+		page = pfn_to_page(pfn);
+		lock_page(page);
+		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
+		hugetlb_count_sub(compound_nr(page), walk->mm);
+		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma));
+		unlock_page(page);
+		flush_tlb_page(walk->vma, addr);
+	}
+
+	return 0;
+}
+
+static struct mm_walk_ops walk = {
+	.pte_entry = pte_entry,
+	.pmd_entry = pmd_entry,
+	.pud_entry = pud_entry
+};
+
 static void kill_me_now(struct callback_head *ch)
 {
 	force_sig(SIGBUS);
@@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int already = 0;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		if (already) {
+			mmap_read_lock(current->mm);
+			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
+			mmap_read_unlock(current->mm);
+		} else {
+			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		}
 		sync_core();
 		return;
 	}
@@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	/* mce_severity() should not hand us an ACTION_REQUIRED error */
 	BUG_ON(flags & MF_ACTION_REQUIRED);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..144500983656 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = memory_failure(pfn, 0);
+	ret = memory_failure(pfn, 0, NULL);
 	return ret ? ret : count;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..beaa6e871cbe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3003,7 +3003,7 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure(unsigned long pfn, int flags, int *already);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..bfd5151dcd3f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, 0);
+	return memory_failure(pfn, 0, NULL);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index df692d2e35d4..09f569fed68d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = memory_failure(pfn, MF_COUNT_INCREASED);
+			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
 		}
 
 		if (ret)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..9a8911aa5fc9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks hold.
  */
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	struct page *p;
 	struct page *hpage;
@@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
 	if (TestSetPageHWPoison(p)) {
+		if (already)
+			*already = 1;
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
 		return 0;
@@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
 		else
-			memory_failure(entry.pfn, entry.flags);
+			memory_failure(entry.pfn, entry.flags, NULL);
 	}
 }
 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-04 23:57                               ` Luck, Tony
@ 2021-03-05  1:30                                 ` Aili Yao
  2021-03-05  1:36                                   ` Aili Yao
  2021-03-05 15:55                                   ` [PATCH] mm,hwpoison: return -EBUSY when page already poisoned Luck, Tony
  0 siblings, 2 replies; 42+ messages in thread
From: Aili Yao @ 2021-03-05  1:30 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Thu, 4 Mar 2021 15:57:20 -0800
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > > if your methods works, should it be like this?
> > > > 
> > > > 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > > 1583                         if (PageHuge(page)) {
> > > > 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> > > > 1585                                 set_huge_swap_pte_at(mm, address,
> > > > 1586                                                      pvmw.pte, pteval,
> > > > 1587                                                      vma_mmu_pagesize(vma));
> > > > 1588                         } else {
> > > > 1589                                 dec_mm_counter(mm, mm_counter(page));
> > > > 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> > > > 1591                         }
> > > > 
> > > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > > >     
> > > 
> > > And if it works, does we need some locking mechanism before we call walk_page_range();
> > > if we lock, does we need to process the blocking interrupted error as other places will do?
> > >   
> > 
> > And another thing:
> > Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect?  
> 
> Thanks for all the pointers.  I added them to the patch (see below).
> [The pmd/pud cases may need some tweaking ... but just trying to get
> the 4K page case working first]
> 
> I tried testing by skipping the call to memory_failure() and just
> using this new code to search the page tables for current page and
> marking it hwpoison (to simulate the case where 2nd process gets the
> early return from memory_failure()). Something is still missing because I get:
> 
> [  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
> [  481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400
> [  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
> [  482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6
> [  482.951355] memcg:ffff94a809834000
> [  482.955153] aops:shmem_aops ino:3c04
> [  482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison)
> [  482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8
> [  482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000
> [  482.984310] page dumped because: still mapped when deleted

From the walk it seems we have the virtual address; can we just send a SIGBUS with it?



> commit e5de44560b33e2d407704243566253a70f858a59
> Author: Tony Luck <tony.luck@intel.com>
> Date:   Tue Mar 2 15:06:33 2021 -0800
> 
>     x86/mce: Handle races between machine checks
>     
>     When multiple CPUs hit the same poison memory there is a race. The
>     first CPU into memory_failure() atomically marks the page as poison
>     and continues processing to hunt down all the tasks that map this page
>     so that the virtual addresses can be marked not-present and SIGBUS
>     sent to the task that did the access.
>     
>     Later CPUs get an early return from memory_failure() and may return
>     to user mode and access the poison again.
>     
>     Add a new argument to memory_failure() so that it can indicate when
>     the race has been lost. Fix kill_me_maybe() to scan page tables in
>     this case to unmap pages.
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7962355436da..a52c6a772de2 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -28,8 +28,12 @@
>  #include <linux/sysfs.h>
>  #include <linux/types.h>
>  #include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include <linux/init.h>
>  #include <linux/kmod.h>
> +#include <linux/pagewalk.h>
>  #include <linux/poll.h>
>  #include <linux/nmi.h>
>  #include <linux/cpu.h>
> @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>  {
>  	struct mce *mce = (struct mce *)data;
>  	unsigned long pfn;
> +	int already = 0;
>  
>  	if (!mce || !mce_usable_address(mce))
>  		return NOTIFY_DONE;
> @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>  		return NOTIFY_DONE;
>  
>  	pfn = mce->addr >> PAGE_SHIFT;
> -	if (!memory_failure(pfn, 0)) {
> -		set_mce_nospec(pfn, whole_page(mce));
> +	if (!memory_failure(pfn, 0, &already)) {
> +		if (!already)
> +			set_mce_nospec(pfn, whole_page(mce));
>  		mce->kflags |= MCE_HANDLED_UC;
>  	}
>  
> @@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>  	*m = *final;
>  }
>  
> +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
> +{
> +	u64 pfn = (u64)walk->private;
> +	struct page *page;
> +	pte_t pteval;
> +
> +	if (pte_pfn(*pte) == pfn) {
> +pr_info("pte_entry: matched pfn - mark poison & zap pte\n");
> +		page = pfn_to_page(pfn);
> +		lock_page(page);
> +SetPageHWPoison(page);
> +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> +		dec_mm_counter(walk->mm, mm_counter(page));
> +		set_pte_at(current->mm, addr, pte, pteval);
> +		unlock_page(page);
> +		flush_tlb_page(walk->vma, addr);
> +	}
> +
> +	return 0;
> +}
> +
> +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
> +{
> +	int shift = PMD_SHIFT - PAGE_SHIFT;
> +	u64 pfn = (u64)walk->private;
> +	struct page *page;
> +	pte_t pteval;
> +
> +	if (!pmd_large(*pmd))
> +		return 0;
> +
> +	if (pmd_pfn(*pmd) >> shift == pfn >> shift) {
> +		page = pfn_to_page(pfn);
> +		lock_page(page);
> +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> +		hugetlb_count_sub(compound_nr(page), walk->mm);
> +		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma));
> +		unlock_page(page);
> +		flush_tlb_page(walk->vma, addr);
> +	}
> +
> +	return 0;
> +}
> +
> +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
> +{
> +	int shift = PUD_SHIFT - PAGE_SHIFT;
> +	u64 pfn = (u64)walk->private;
> +	struct page *page;
> +	pte_t pteval;
> +
> +	if (!pud_large(*pud))
> +		return 0;
> +
> +	if (pud_pfn(*pud) >> shift == pfn >> shift) {
> +		page = pfn_to_page(pfn);
> +		lock_page(page);
> +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> +		hugetlb_count_sub(compound_nr(page), walk->mm);
> +		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma));
> +		unlock_page(page);
> +		flush_tlb_page(walk->vma, addr);
> +	}
> +
> +	return 0;
> +}
> +
> +static struct mm_walk_ops walk = {
> +	.pte_entry = pte_entry,
> +	.pmd_entry = pmd_entry,
> +	.pud_entry = pud_entry
> +};
> +
>  static void kill_me_now(struct callback_head *ch)
>  {
>  	force_sig(SIGBUS);
> @@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb)
>  {
>  	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
>  	int flags = MF_ACTION_REQUIRED;
> +	int already = 0;
>  
>  	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
>  
>  	if (!p->mce_ripv)
>  		flags |= MF_MUST_KILL;
>  
> -	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> +	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
>  	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> -		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> +		if (already) {
> +			mmap_read_lock(current->mm);
> +			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
> +			mmap_read_unlock(current->mm);
> +		} else {
> +			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> +		}
>  		sync_core();
>  		return;
>  	}
> @@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  EXPORT_SYMBOL_GPL(do_machine_check);
>  
>  #ifndef CONFIG_MEMORY_FAILURE
> -int memory_failure(unsigned long pfn, int flags)
> +int memory_failure(unsigned long pfn, int flags, int *already)
>  {
>  	/* mce_severity() should not hand us an ACTION_REQUIRED error */
>  	BUG_ON(flags & MF_ACTION_REQUIRED);
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f35298425575..144500983656 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
>  	if (kstrtoull(buf, 0, &pfn) < 0)
>  		return -EINVAL;
>  	pfn >>= PAGE_SHIFT;
> -	ret = memory_failure(pfn, 0);
> +	ret = memory_failure(pfn, 0, NULL);
>  	return ret ? ret : count;
>  }
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77e64e3eac80..beaa6e871cbe 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3003,7 +3003,7 @@ enum mf_flags {
>  	MF_MUST_KILL = 1 << 2,
>  	MF_SOFT_OFFLINE = 1 << 3,
>  };
> -extern int memory_failure(unsigned long pfn, int flags);
> +extern int memory_failure(unsigned long pfn, int flags, int *already);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern void memory_failure_queue_kick(int cpu);
>  extern int unpoison_memory(unsigned long pfn);
> diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> index 1ae1ebc2b9b1..bfd5151dcd3f 100644
> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
>  
>  inject:
>  	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
> -	return memory_failure(pfn, 0);
> +	return memory_failure(pfn, 0, NULL);
>  }
>  
>  static int hwpoison_unpoison(void *data, u64 val)
> diff --git a/mm/madvise.c b/mm/madvise.c
> index df692d2e35d4..09f569fed68d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior,
>  		} else {
>  			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
>  				 pfn, start);
> -			ret = memory_failure(pfn, MF_COUNT_INCREASED);
> +			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
>  		}
>  
>  		if (ret)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..9a8911aa5fc9 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>   * Must run in process context (e.g. a work queue) with interrupts
>   * enabled and no spinlocks hold.
>   */
> -int memory_failure(unsigned long pfn, int flags)
> +int memory_failure(unsigned long pfn, int flags, int *already)
>  {
>  	struct page *p;
>  	struct page *hpage;
> @@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (PageHuge(p))
>  		return memory_failure_hugetlb(pfn, flags);
>  	if (TestSetPageHWPoison(p)) {
> +		if (already)
> +			*already = 1;
>  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
>  			pfn);
>  		return 0;
> @@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work)
>  		if (entry.flags & MF_SOFT_OFFLINE)
>  			soft_offline_page(entry.pfn, entry.flags);
>  		else
> -			memory_failure(entry.pfn, entry.flags);
> +			memory_failure(entry.pfn, entry.flags, NULL);
>  	}
>  }
>  

-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-05  1:30                                 ` Aili Yao
@ 2021-03-05  1:36                                   ` Aili Yao
  2021-03-05 22:11                                     ` Luck, Tony
  2021-03-05 15:55                                   ` [PATCH] mm,hwpoison: return -EBUSY when page already poisoned Luck, Tony
  1 sibling, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-05  1:36 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Fri, 5 Mar 2021 09:30:16 +0800
Aili Yao <yaoaili@kingsoft.com> wrote:

> On Thu, 4 Mar 2021 15:57:20 -0800
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > > > if your methods works, should it be like this?
> > > > > 
> > > > > 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > > > 1583                         if (PageHuge(page)) {
> > > > > 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> > > > > 1585                                 set_huge_swap_pte_at(mm, address,
> > > > > 1586                                                      pvmw.pte, pteval,
> > > > > 1587                                                      vma_mmu_pagesize(vma));
> > > > > 1588                         } else {
> > > > > 1589                                 dec_mm_counter(mm, mm_counter(page));
> > > > > 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> > > > > 1591                         }
> > > > > 
> > > > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > > > >     
> > > > 
> > > > And if it works, does we need some locking mechanism before we call walk_page_range();
> > > > if we lock, does we need to process the blocking interrupted error as other places will do?
> > > >   
> > > 
> > > And another thing:
> > > Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect?  
> > 
> > Thanks for all the pointers.  I added them to the patch (see below).
> > [The pmd/pud cases may need some tweaking ... but just trying to get
> > the 4K page case working first]
> > 
> > I tried testing by skipping the call to memory_failure() and just
> > using this new code to search the page tables for current page and
> > marking it hwpoison (to simulate the case where 2nd process gets the
> > early return from memory_failure(). Something is still missing because I get:
> > 
> > [  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
> > [  481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400
> > [  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
> > [  482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6
> > [  482.951355] memcg:ffff94a809834000
> > [  482.955153] aops:shmem_aops ino:3c04
> > [  482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison)
> > [  482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8
> > [  482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000
> > [  482.984310] page dumped because: still mapped when deleted
> 
> From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?

Does this walk work for the other memory-failure error cases too? Can it be applied to the if (p->mce_vaddr != (void __user *)-1l) branch
in kill_me_maybe()?

> > commit e5de44560b33e2d407704243566253a70f858a59
> > Author: Tony Luck <tony.luck@intel.com>
> > Date:   Tue Mar 2 15:06:33 2021 -0800
> > 
> >     x86/mce: Handle races between machine checks
> >     
> >     When multiple CPUs hit the same poison memory there is a race. The
> >     first CPU into memory_failure() atomically marks the page as poison
> >     and continues processing to hunt down all the tasks that map this page
> >     so that the virtual addresses can be marked not-present and SIGBUS
> >     sent to the task that did the access.
> >     
> >     Later CPUs get an early return from memory_failure() and may return
> >     to user mode and access the poison again.
> >     
> >     Add a new argument to memory_failure() so that it can indicate when
> >     the race has been lost. Fix kill_me_maybe() to scan page tables in
> >     this case to unmap pages.
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index 7962355436da..a52c6a772de2 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -28,8 +28,12 @@
> >  #include <linux/sysfs.h>
> >  #include <linux/types.h>
> >  #include <linux/slab.h>
> > +#include <linux/hugetlb.h>
> > +#include <linux/swap.h>
> > +#include <linux/swapops.h>
> >  #include <linux/init.h>
> >  #include <linux/kmod.h>
> > +#include <linux/pagewalk.h>
> >  #include <linux/poll.h>
> >  #include <linux/nmi.h>
> >  #include <linux/cpu.h>
> > @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
> >  {
> >  	struct mce *mce = (struct mce *)data;
> >  	unsigned long pfn;
> > +	int already = 0;
> >  
> >  	if (!mce || !mce_usable_address(mce))
> >  		return NOTIFY_DONE;
> > @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
> >  		return NOTIFY_DONE;
> >  
> >  	pfn = mce->addr >> PAGE_SHIFT;
> > -	if (!memory_failure(pfn, 0)) {
> > -		set_mce_nospec(pfn, whole_page(mce));
> > +	if (!memory_failure(pfn, 0, &already)) {
> > +		if (!already)
> > +			set_mce_nospec(pfn, whole_page(mce));
> >  		mce->kflags |= MCE_HANDLED_UC;
> >  	}
> >  
> > @@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> >  	*m = *final;
> >  }
> >  
> > +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
> > +{
> > +	u64 pfn = (u64)walk->private;
> > +	struct page *page;
> > +	pte_t pteval;
> > +
> > +	if (pte_pfn(*pte) == pfn) {
> > +pr_info("pte_entry: matched pfn - mark poison & zap pte\n");
> > +		page = pfn_to_page(pfn);
> > +		lock_page(page);
> > +SetPageHWPoison(page);
> > +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> > +		dec_mm_counter(walk->mm, mm_counter(page));
> > +		set_pte_at(current->mm, addr, pte, pteval);
> > +		unlock_page(page);
> > +		flush_tlb_page(walk->vma, addr);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
> > +{
> > +	int shift = PMD_SHIFT - PAGE_SHIFT;
> > +	u64 pfn = (u64)walk->private;
> > +	struct page *page;
> > +	pte_t pteval;
> > +
> > +	if (!pmd_large(*pmd))
> > +		return 0;
> > +
> > +	if (pmd_pfn(*pmd) >> shift == pfn >> shift) {
> > +		page = pfn_to_page(pfn);
> > +		lock_page(page);
> > +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> > +		hugetlb_count_sub(compound_nr(page), walk->mm);
> > +		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma));
> > +		unlock_page(page);
> > +		flush_tlb_page(walk->vma, addr);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
> > +{
> > +	int shift = PUD_SHIFT - PAGE_SHIFT;
> > +	u64 pfn = (u64)walk->private;
> > +	struct page *page;
> > +	pte_t pteval;
> > +
> > +	if (!pud_large(*pud))
> > +		return 0;
> > +
> > +	if (pud_pfn(*pud) >> shift == pfn >> shift) {
> > +		page = pfn_to_page(pfn);
> > +		lock_page(page);
> > +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> > +		hugetlb_count_sub(compound_nr(page), walk->mm);
> > +		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma));
> > +		unlock_page(page);
> > +		flush_tlb_page(walk->vma, addr);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static struct mm_walk_ops walk = {
> > +	.pte_entry = pte_entry,
> > +	.pmd_entry = pmd_entry,
> > +	.pud_entry = pud_entry
> > +};
> > +
> >  static void kill_me_now(struct callback_head *ch)
> >  {
> >  	force_sig(SIGBUS);
> > @@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb)
> >  {
> >  	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> >  	int flags = MF_ACTION_REQUIRED;
> > +	int already = 0;
> >  
> >  	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
> >  
> >  	if (!p->mce_ripv)
> >  		flags |= MF_MUST_KILL;
> >  
> > -	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > +	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
> >  	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > -		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> > +		if (already) {
> > +			mmap_read_lock(current->mm);
> > +			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
> > +			mmap_read_unlock(current->mm);
> > +		} else {
> > +			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> > +		}
> >  		sync_core();
> >  		return;
> >  	}
> > @@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >  EXPORT_SYMBOL_GPL(do_machine_check);
> >  
> >  #ifndef CONFIG_MEMORY_FAILURE
> > -int memory_failure(unsigned long pfn, int flags)
> > +int memory_failure(unsigned long pfn, int flags, int *already)
> >  {
> >  	/* mce_severity() should not hand us an ACTION_REQUIRED error */
> >  	BUG_ON(flags & MF_ACTION_REQUIRED);
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index f35298425575..144500983656 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
> >  	if (kstrtoull(buf, 0, &pfn) < 0)
> >  		return -EINVAL;
> >  	pfn >>= PAGE_SHIFT;
> > -	ret = memory_failure(pfn, 0);
> > +	ret = memory_failure(pfn, 0, NULL);
> >  	return ret ? ret : count;
> >  }
> >  
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 77e64e3eac80..beaa6e871cbe 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3003,7 +3003,7 @@ enum mf_flags {
> >  	MF_MUST_KILL = 1 << 2,
> >  	MF_SOFT_OFFLINE = 1 << 3,
> >  };
> > -extern int memory_failure(unsigned long pfn, int flags);
> > +extern int memory_failure(unsigned long pfn, int flags, int *already);
> >  extern void memory_failure_queue(unsigned long pfn, int flags);
> >  extern void memory_failure_queue_kick(int cpu);
> >  extern int unpoison_memory(unsigned long pfn);
> > diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> > index 1ae1ebc2b9b1..bfd5151dcd3f 100644
> > --- a/mm/hwpoison-inject.c
> > +++ b/mm/hwpoison-inject.c
> > @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
> >  
> >  inject:
> >  	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
> > -	return memory_failure(pfn, 0);
> > +	return memory_failure(pfn, 0, NULL);
> >  }
> >  
> >  static int hwpoison_unpoison(void *data, u64 val)
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index df692d2e35d4..09f569fed68d 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior,
> >  		} else {
> >  			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
> >  				 pfn, start);
> > -			ret = memory_failure(pfn, MF_COUNT_INCREASED);
> > +			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
> >  		}
> >  
> >  		if (ret)
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 24210c9bd843..9a8911aa5fc9 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> >   * Must run in process context (e.g. a work queue) with interrupts
> >   * enabled and no spinlocks hold.
> >   */
> > -int memory_failure(unsigned long pfn, int flags)
> > +int memory_failure(unsigned long pfn, int flags, int *already)
> >  {
> >  	struct page *p;
> >  	struct page *hpage;
> > @@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags)
> >  	if (PageHuge(p))
> >  		return memory_failure_hugetlb(pfn, flags);
> >  	if (TestSetPageHWPoison(p)) {
> > +		if (already)
> > +			*already = 1;
> >  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
> >  			pfn);
> >  		return 0;
> > @@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work)
> >  		if (entry.flags & MF_SOFT_OFFLINE)
> >  			soft_offline_page(entry.pfn, entry.flags);
> >  		else
> > -			memory_failure(entry.pfn, entry.flags);
> > +			memory_failure(entry.pfn, entry.flags, NULL);
> >  	}
> >  }
> >  
> 



-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-05  1:30                                 ` Aili Yao
  2021-03-05  1:36                                   ` Aili Yao
@ 2021-03-05 15:55                                   ` Luck, Tony
  2021-03-10  6:10                                     ` Aili Yao
  1 sibling, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-05 15:55 UTC (permalink / raw)
  To: Aili Yao
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

> From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?

If the walk wins the race and the pte for the poisoned page is still valid, then yes.

But we could have:

CPU1                            CPU2
memory_failure sets poison
bit for struct page


rmap finds page in task
on CPU2 and sets PTE
to not-valid-poison

                                memory_failure returns
                                early because struct page
                                already marked as poison

                                walk page tables looking
                                for mapping - don't find it

-Tony

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-05  1:36                                   ` Aili Yao
@ 2021-03-05 22:11                                     ` Luck, Tony
  2021-03-08  6:45                                       ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-05 22:11 UTC (permalink / raw)
  To: Aili Yao
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

This whole page table walking patch is trying to work around the
races caused by multiple calls to memory_failure() for the same
page.

Maybe better to just avoid the races.  The comment right above
memory_failure says:

 * Must run in process context (e.g. a work queue) with interrupts
 * enabled and no spinlocks hold.

So it should be safe to grab and hold a mutex.  See patch below.

-Tony

commit 8dd0dbe7d595e02647e9c2c76c03341a9f6bd7b9
Author: Tony Luck <tony.luck@intel.com>
Date:   Fri Mar 5 10:40:48 2021 -0800

    Use a mutex to avoid memory_failure() races

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..c1509f4b565e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 try_again:
-	if (PageHuge(p))
-		return memory_failure_hugetlb(pfn, flags);
+	if (PageHuge(p)) {
+		res = memory_failure_hugetlb(pfn, flags);
+		goto out2;
+	}
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
 				res = MF_FAILED;
 			}
 			action_result(pfn, MF_MSG_BUDDY, res);
+			mutex_unlock(&mf_mutex);
 			return res == MF_RECOVERED ? 0 : -EBUSY;
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 	}
@@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
@@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
 		num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 	if (hwpoison_filter(p)) {
@@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
 	res = identify_page_state(pfn, p, page_flags);
 out:
 	unlock_page(p);
+out2:
+	mutex_unlock(&mf_mutex);
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-05 22:11                                     ` Luck, Tony
@ 2021-03-08  6:45                                       ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-08 18:54                                         ` Luck, Tony
  0 siblings, 1 reply; 42+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-08  6:45 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Aili Yao, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

On Fri, Mar 05, 2021 at 02:11:43PM -0800, Luck, Tony wrote:
> This whole page table walking patch is trying to work around the
> races caused by multiple calls to memory_failure() for the same
> page.
> 
> Maybe better to just avoid the races.  The comment right above
> memory_failure says:
> 
>  * Must run in process context (e.g. a work queue) with interrupts
>  * enabled and no spinlocks hold.
> 
> So it should be safe to grab and hold a mutex.  See patch below.

The mutex approach looks simpler and safer, so I'm fine with it.

> 
> -Tony
> 
> commit 8dd0dbe7d595e02647e9c2c76c03341a9f6bd7b9
> Author: Tony Luck <tony.luck@intel.com>
> Date:   Fri Mar 5 10:40:48 2021 -0800
> 
>     Use a mutex to avoid memory_failure() races
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..c1509f4b565e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +static DEFINE_MUTEX(mf_mutex);
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>  		return -ENXIO;
>  	}
>  
> +	mutex_lock(&mf_mutex);

Is it better to take the mutex before the memory_failure_dev_pagemap() block?
Or do we not need to protect against races for device memory?

Thanks,
Naoya Horiguchi

* RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-08  6:45                                       ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-08 18:54                                         ` Luck, Tony
  2021-03-08 22:38                                           ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-08 18:54 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Aili Yao, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

>> So it should be safe to grab and hold a mutex.  See patch below.
>
> The mutex approach looks simpler and safer, so I'm fine with it.

Thanks.  Is that an "Acked-by:"?

>>  /**
>>   * memory_failure - Handle memory failure of a page.
>>   * @pfn: Page Number of the corrupted page
>> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>>  		return -ENXIO;
>>  	}
>>  
>> +	mutex_lock(&mf_mutex);
>
> Is it better to take the mutex before the memory_failure_dev_pagemap() block?
> Or do we not need to protect against races for device memory?

No races (recovery is only attempted for errors in normal memory).

-Tony

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-08 18:54                                         ` Luck, Tony
@ 2021-03-08 22:38                                           ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-08 22:55                                             ` [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races Luck, Tony
  0 siblings, 1 reply; 42+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-08 22:38 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Aili Yao, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

On Mon, Mar 08, 2021 at 06:54:02PM +0000, Luck, Tony wrote:
> >> So it should be safe to grab and hold a mutex.  See patch below.
> >
> > The mutex approach looks simpler and safer, so I'm fine with it.
> 
> Thanks.  Is that an "Acked-by:"?

Not yet; I intended to add it after the full patch is submitted
(with your Signed-off-by and commit log).

> 
> >>  /**
> >>   * memory_failure - Handle memory failure of a page.
> >>   * @pfn: Page Number of the corrupted page
> >> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
> >>  		return -ENXIO;
> >>  	}
> >>  
> >> +	mutex_lock(&mf_mutex);
> >
> > Is it better to take the mutex before the memory_failure_dev_pagemap() block?
> > Or do we not need to protect against races for device memory?
> 
> No races (recovery is only attempted for errors in normal memory).

OK, thanks.

- Naoya

* [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-03-08 22:38                                           ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-08 22:55                                             ` Luck, Tony
  2021-03-08 23:42                                               ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-09  2:04                                               ` Aili Yao
  0 siblings, 2 replies; 42+ messages in thread
From: Luck, Tony @ 2021-03-08 22:55 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Aili Yao, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

There can be races when multiple CPUs consume poison from the same
page. The first into memory_failure() atomically sets the HWPoison
page flag and begins hunting for tasks that map this page. Eventually
it invalidates those mappings and may send a SIGBUS to the affected
tasks.

But while all that work is going on, other CPUs see a "success"
return code from memory_failure() and so they believe the error
has been handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in
a mutex.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 mm/memory-failure.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..c1509f4b565e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 try_again:
-	if (PageHuge(p))
-		return memory_failure_hugetlb(pfn, flags);
+	if (PageHuge(p)) {
+		res = memory_failure_hugetlb(pfn, flags);
+		goto out2;
+	}
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
 				res = MF_FAILED;
 			}
 			action_result(pfn, MF_MSG_BUDDY, res);
+			mutex_unlock(&mf_mutex);
 			return res == MF_RECOVERED ? 0 : -EBUSY;
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 	}
@@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
@@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
 		num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 	if (hwpoison_filter(p)) {
@@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
 	res = identify_page_state(pfn, p, page_flags);
 out:
 	unlock_page(p);
+out2:
+	mutex_unlock(&mf_mutex);
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
-- 
2.29.2


* Re: [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-03-08 22:55                                             ` [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races Luck, Tony
@ 2021-03-08 23:42                                               ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-09  2:04                                               ` Aili Yao
  1 sibling, 0 replies; 42+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-08 23:42 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Aili Yao, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

On Mon, Mar 08, 2021 at 02:55:04PM -0800, Luck, Tony wrote:
> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Thanks!

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

* Re: [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-03-08 22:55                                             ` [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races Luck, Tony
  2021-03-08 23:42                                               ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-09  2:04                                               ` Aili Yao
  2021-03-09  6:04                                                 ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-09  2:04 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Mon, 8 Mar 2021 14:55:04 -0800
"Luck, Tony" <tony.luck@intel.com> wrote:

> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  mm/memory-failure.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..c1509f4b565e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +static DEFINE_MUTEX(mf_mutex);
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>  		return -ENXIO;
>  	}
>  
> +	mutex_lock(&mf_mutex);
> +
>  try_again:
> -	if (PageHuge(p))
> -		return memory_failure_hugetlb(pfn, flags);
> +	if (PageHuge(p)) {
> +		res = memory_failure_hugetlb(pfn, flags);
> +		goto out2;
> +	}
> +
>  	if (TestSetPageHWPoison(p)) {
>  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
>  			pfn);
> +		mutex_unlock(&mf_mutex);
>  		return 0;
>  	}
>  
> @@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
>  				res = MF_FAILED;
>  			}
>  			action_result(pfn, MF_MSG_BUDDY, res);
> +			mutex_unlock(&mf_mutex);
>  			return res == MF_RECOVERED ? 0 : -EBUSY;
>  		} else {
>  			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> +			mutex_unlock(&mf_mutex);
>  			return -EBUSY;
>  		}
>  	}
> @@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (PageTransHuge(hpage)) {
>  		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
>  			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> +			mutex_unlock(&mf_mutex);
>  			return -EBUSY;
>  		}
>  		VM_BUG_ON_PAGE(!page_count(p), p);
> @@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
>  		num_poisoned_pages_dec();
>  		unlock_page(p);
>  		put_page(p);
> +		mutex_unlock(&mf_mutex);
>  		return 0;
>  	}
>  	if (hwpoison_filter(p)) {
> @@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
>  			num_poisoned_pages_dec();
>  		unlock_page(p);
>  		put_page(p);
> +		mutex_unlock(&mf_mutex);
>  		return 0;
>  	}
>  
> @@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
>  	res = identify_page_state(pfn, p, page_flags);
>  out:
>  	unlock_page(p);
> +out2:
> +	mutex_unlock(&mf_mutex);
>  	return res;
>  }
>  EXPORT_SYMBOL_GPL(memory_failure);

If others are OK with this method, then I am OK too.
But I have two concerns; could you take them into account:

1. When memory_failure() returns 0 for the race condition, kill_me_maybe() goes into this branch:
	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
		sync_core();
		return;
	}

We placed set_mce_nospec() here for a reason; please see commit fd0e786d9d09024f67b.

2. When memory_failure() returns 0, we may return to the user process, which may re-execute the instruction that triggered the previous fault. This behavior
assumes an implicit dependency that the related PTE has been correctly set; if it has not been correctly set, it will lead to an infinite loop again.

-- 
Thanks!
Aili Yao

* Re: [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-03-09  2:04                                               ` Aili Yao
@ 2021-03-09  6:04                                                 ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-09  6:35                                                   ` [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned Aili Yao
  2021-03-09  6:38                                                   ` [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races Aili Yao
  0 siblings, 2 replies; 42+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-09  6:04 UTC (permalink / raw)
  To: Aili Yao
  Cc: Luck, Tony, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1

On Tue, Mar 09, 2021 at 10:04:21AM +0800, Aili Yao wrote:
> On Mon, 8 Mar 2021 14:55:04 -0800
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > There can be races when multiple CPUs consume poison from the same
> > page. The first into memory_failure() atomically sets the HWPoison
> > page flag and begins hunting for tasks that map this page. Eventually
> > it invalidates those mappings and may send a SIGBUS to the affected
> > tasks.
> > 
> > But while all that work is going on, other CPUs see a "success"
> > return code from memory_failure() and so they believe the error
> > has been handled and continue executing.
> > 
> > Fix by wrapping most of the internal parts of memory_failure() in
> > a mutex.
> > 
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
...
> 
> If others are OK with this method, then I am OK too.
> But I have two concerns, May you take into account:
> 
> 1. When memory_failure() returns 0 for the race condition, kill_me_maybe() goes into this branch:
> 	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> 		sync_core();
> 		return;
> 	}
> 
> We placed set_mce_nospec() here for a reason; please see commit fd0e786d9d09024f67b.
> 
> 2. When memory_failure() returns 0, we may return to the user process, which may re-execute the instruction that triggered the previous fault. This behavior
> assumes an implicit dependency that the related PTE has been correctly set; if it has not been correctly set, it will lead to an infinite loop again.

These seem to be separate issues from memory_failure()'s concurrency issue,
so I'm still expecting your patch to be merged. Maybe you want
to update it based on the discussion (once it's concluded)?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-09  6:04                                                 ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-09  6:35                                                   ` Aili Yao
  2021-03-09  8:28                                                     ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-09  6:38                                                   ` [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races Aili Yao
  1 sibling, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-09  6:35 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Luck, Tony, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

When a page is already poisoned, another memory_failure() call on the
same page currently returns 0, meaning OK. For nested MCE handling this
behavior can lead to an MCE loop. Example:

1. When LMCE is enabled, and two processes A and B running on
different cores X and Y both access the same page, the page becomes
corrupted when process A accesses it; an MCE is raised to core X and
error handling is underway.

2. Then B accesses the page and triggers another MCE on core Y. It also
starts error handling, sees TestSetPageHWPoison() return true, and 0 is
returned.

3. kill_me_maybe() will check the return value:

1244 static void kill_me_maybe(struct callback_head *cb)
1245 {

1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
p->mce_whole_page);
1257                 sync_core();
1258                 return;
1259         }

1267 }

4. Error handling for B ends, and nothing may happen if kill-early is
not set. Process B will re-execute the instruction, take the MCE again,
and the loop begins. The set_mce_nospec() here is also not proper; see
commit fd0e786d9d09 ("x86/mm, mm/hwpoison: Don't unconditionally unmap
kernel 1:1 pages").

Other callers that care about the return value of memory_failure()
should check why they want to handle a memory error that has already
been handled. This behavior seems reasonable.

Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..b6bc77460ee1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1228,7 +1228,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		return -EBUSY;
 	}
 
 	num_poisoned_pages_inc();
@@ -1430,7 +1430,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
-		return 0;
+		return -EBUSY;
 	}
 
 	orig_head = hpage = compound_head(p);
-- 
2.25.1
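
[Editor's note: the semantics of the change can be illustrated in userspace
(this is a sketch, not kernel code; `memory_failure_sketch` is a hypothetical
name, with an atomic flag standing in for the per-page PG_hwpoison bit):]

```c
#include <errno.h>
#include <stdatomic.h>

/* Stand-in for the per-page PG_hwpoison bit. */
static atomic_flag page_hwpoison = ATOMIC_FLAG_INIT;

/* Sketch of memory_failure() after this patch: the first caller
 * atomically claims the page and handles the error; any later caller
 * finds the bit already set and now gets -EBUSY instead of 0, so the
 * MCE code knows the error was not freshly recovered by this call. */
int memory_failure_sketch(void)
{
	if (atomic_flag_test_and_set(&page_hwpoison))
		return -EBUSY;	/* already hardware poisoned */
	/* ... unmap the page, send SIGBUS to mappers, etc. ... */
	return 0;
}
```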

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races
  2021-03-09  6:04                                                 ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-09  6:35                                                   ` [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned Aili Yao
@ 2021-03-09  6:38                                                   ` Aili Yao
  1 sibling, 0 replies; 42+ messages in thread
From: Aili Yao @ 2021-03-09  6:38 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Luck, Tony, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Tue, 9 Mar 2021 06:04:41 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:

> ...
> > 
> > If others are OK with this method, then I am OK too.
> > But I have two concerns, May you take into account:
> > 
> > 1. The memory_failure with 0 return code for race condition, then the kill_me_maybe() goes into branch:
> > 	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> > 		sync_core();
> > 		return;
> > 	}
> > 
> > while we place set_mce_nospec() here is for a reason, please see commit fd0e786d9d09024f67b.
> > 
> > 2. When memory_failure return 0 and maybe return to user process, and it may re-execute the instruction triggering previous fault, this behavior
> > assume an implicit dependence that the related pte has been correctly set. or if not correctlily set, it will lead to infinite loop again.  
> 
> These seem to be separate issues from memory_failure()'s concurrency issue,
> so I'm still expecting that your patch is to be merged. Maybe do you want
> to update it based on the discussion (if it's concluded)?
> 
> Thanks,
> Naoya Horiguchi

I have submitted a v2 patch; please help review it.

Thanks!
 
-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-09  6:35                                                   ` [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned Aili Yao
@ 2021-03-09  8:28                                                     ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-09 20:01                                                       ` Luck, Tony
  2021-03-10  8:01                                                       ` Aili Yao
  0 siblings, 2 replies; 42+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-09  8:28 UTC (permalink / raw)
  To: Aili Yao
  Cc: Luck, Tony, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1

On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote:
> When the page is already poisoned, another memory_failure() call in the
> same page now return 0, meaning OK. For nested memory mce handling, this
> behavior may lead to mce looping, Example:
> 
> 1.When LCME is enabled, and there are two processes A && B running on
> different core X && Y separately, which will access one same page, then
> the page corrupted when process A access it, a MCE will be rasied to
> core X and the error process is just underway.
> 
> 2.Then B access the page and trigger another MCE to core Y, it will also
> do error process, it will see TestSetPageHWPoison be true, and 0 is
> returned.
> 
> 3.The kill_me_maybe will check the return:
> 
> 1244 static void kill_me_maybe(struct callback_head *cb)
> 1245 {
> 
> 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> p->mce_whole_page);
> 1257                 sync_core();
> 1258                 return;
> 1259         }
> 
> 1267 }
> 
> 4. The error process for B will end, and may nothing happened if
> kill-early is not set, The process B will re-excute instruction and get
> into mce again and then loop happens. And also the set_mce_nospec()
> here is not proper, may refer to commit fd0e786d9d09 ("x86/mm,
> mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").
> 
> For other cases which care the return value of memory_failure() should
> check why they want to process a memory error which have already been
> processed. This behavior seems reasonable.

Other reviewers shared ideas about the return value, but actually
I'm not sure which one is best (EBUSY is not that bad).
What we need to fix the reported issue is to return a non-zero value
for the "already poisoned" case (the value itself is not so important). 

Other callers of memory_failure() (mostly test programs) could see
the change of return value, but they can already see EBUSY now and
anyway they should check dmesg for more detail about why failed,
so the impact of the change is not so big.

> 
> Signed-off-by: Aili Yao <yaoaili@kingsoft.com>

Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-09  8:28                                                     ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-09 20:01                                                       ` Luck, Tony
  2021-03-10  8:05                                                         ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-13  1:55                                                         ` Jue Wang
  2021-03-10  8:01                                                       ` Aili Yao
  1 sibling, 2 replies; 42+ messages in thread
From: Luck, Tony @ 2021-03-09 20:01 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Aili Yao, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

On Tue, Mar 09, 2021 at 08:28:24AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote:
> > When the page is already poisoned, another memory_failure() call in the
> > same page now return 0, meaning OK. For nested memory mce handling, this
> > behavior may lead to mce looping, Example:
> > 
> > 1.When LCME is enabled, and there are two processes A && B running on
> > different core X && Y separately, which will access one same page, then
> > the page corrupted when process A access it, a MCE will be rasied to
> > core X and the error process is just underway.
> > 
> > 2.Then B access the page and trigger another MCE to core Y, it will also
> > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > returned.
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> > p->mce_whole_page);
> > 1257                 sync_core();
> > 1258                 return;
> > 1259         }
> > 
> > 1267 }
> > 
> > 4. The error process for B will end, and may nothing happened if
> > kill-early is not set, The process B will re-excute instruction and get
> > into mce again and then loop happens. And also the set_mce_nospec()
> > here is not proper, may refer to commit fd0e786d9d09 ("x86/mm,
> > mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").
> > 
> > For other cases which care the return value of memory_failure() should
> > check why they want to process a memory error which have already been
> > processed. This behavior seems reasonable.
> 
> Other reviewers shared ideas about the returned value, but actually
> I'm not sure which the best one is (EBUSY is not that bad).
> What we need to fix the reported issue is to return non-zero value
> for "already poisoned" case (the value itself is not so important). 
> 
> Other callers of memory_failure() (mostly test programs) could see
> the change of return value, but they can already see EBUSY now and
> anyway they should check dmesg for more detail about why failed,
> so the impact of the change is not so big.
> 
> > 
> > Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
> 
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

I think that both this and my "add a mutex" patch are
too simplistic for this complex problem :-(

When multiple CPUs race to call memory_failure() for the same
page we need the following results:

1) Poison page should be marked not-present in all tasks
	I think the mutex patch achieves this as long as
	memory_failure() doesn't hit an error[1].

2) All tasks that were executing an instruction that was accessing
   the poison location should see a SIGBUS with virtual address and
   BUS_MCEERR_AR signature in siginfo.
	Neither solution achieves this. The -EBUSY return ensures
	that there is a SIGBUS for the tasks that get the -EBUSY
	return, but no siginfo details.
	Just the mutex patch *might* have BUS_MCEERR_AO signature
	to the race losing tasks, but only if they have PF_MCE_EARLY
	set (so says the comment in kill_proc() ... but I don't
	see the code checking for that bit).

#2 seems hard to achieve ... there are inherent races that mean the
AO SIGBUS could have been queued to the task before it even hits
the poison.

Maybe should include a non-action:

3) A task should only see one SIGBUS per poison?
	Not sure if this is achievable either ... what if the task
	has the same page mapped multiple times?

-Tony

[1] still looking at why my futex injection test ends with a "reserved
kernel page still referenced by 1 users"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-05 15:55                                   ` [PATCH] mm,hwpoison: return -EBUSY when page already poisoned Luck, Tony
@ 2021-03-10  6:10                                     ` Aili Yao
  2021-03-11  8:55                                       ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-10  6:10 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili

On Fri, 5 Mar 2021 15:55:25 +0000
"Luck, Tony" <tony.luck@intel.com> wrote:

> > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?  
> 
> If the walk wins the race and the pte for the poisoned page is still valid, then yes.
> 
> But we could have:
> 
> CPU1                            CPU2
> memory_failure sets poison
> bit for struct page
> 
> 
> rmap finds page in task
> on CPU2 and sets PTE
> to not-valid-poison
> 
>                                 memory_failure returns
>                                 early because struct page
>                                 already marked as poison
> 
>                                 walk page tables looking
>                                 for mapping - don't find it
> 
> -Tony

While I don't think there is a race condition, if you really think sending the pfn in
SIGBUS is not proper, the following patch may be one way to fix it.
I copied your abandoned code, made a small modification, and it has just passed
my simple test.

Also note this is an RFC version, only relevant if you think the pfn in SIGBUS is not right.

Thanks!

From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001
From: Aili Yao <yaoaili@kingsoft.com>
Date: Wed, 10 Mar 2021 13:59:18 +0800
Subject: [PATCH] x86/mce: fix invalid SIGBUS address

Walk the current process's page tables and compare the entries with the pfn.
1. Only tested for normal pages and 2MB hugetlb pages;
2. 1GB hugetlb and transparent huge pages are not supported currently;
3. Other failure modes may not be recognized; this is an RFC version.

---
 arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..65d7ef7 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/init.h>
 #include <linux/kmod.h>
+#include <linux/pagewalk.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/cpu.h>
@@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 	/* mce_clear_state will clear *final, save locally for use later */
 	*m = *final;
 }
+static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+
+	if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))
+		goto find;
+	else if (pte_pfn(*pte) == pfn)
+		goto find;
+
+	return 0;
+find:
+	buff[0] = addr;
+	buff[1] = PAGE_SHIFT;
+	return true;
+}
+
+extern bool is_hugetlb_entry_hwpoisoned(pte_t pte);
+
+static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	int shift = PMD_SHIFT;
+	pte_t pte =  huge_ptep_get(ptep);
+
+	if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
+		goto find;
+
+	if (pte_pfn(*ptep) == pfn)
+		goto find;
+
+	return 0;
+find:
+	buff[0] = addr;
+	buff[1] = shift;
+	return true;
+}
+
+static struct mm_walk_ops walk = {
+	.pte_entry = mc_pte_entry,
+	.hugetlb_entry	= mc_hugetlb_range
+};
+
+void mc_memory_failure_error(struct task_struct *p, unsigned long pfn)
+{
+	u64 buff[2] = {pfn, 0};
+	struct page *page;
+	int ret = -1;
+
+	page = pfn_to_page(pfn);
+	if (!page)
+		goto force_sigbus;
+
+	if (is_zone_device_page(page))
+		goto force_sigbus;
+
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
+	mmap_read_unlock(p->mm);
+
+	if (ret && buff[0]) {
+		pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+		buff[0], p->comm, p->pid);
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
+	} else {
+force_sigbus:
+		pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+		pfn, p->comm, p->pid);
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT);
+	}
+
+}
 
 static void kill_me_now(struct callback_head *ch)
 {
@@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 
 	if (p->mce_vaddr != (void __user *)-1l) {
-		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
-		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
+		mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);
 	} else {
 		pr_err("Memory error not recovered");
 		kill_me_now(cb);
-- 
1.8.3.1



-- 
Thanks!
Aili Yao
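
[Editor's note: the pfn<->vaddr question the RFC patch answers in-kernel has a
userspace analogue in the documented /proc/<pid>/pagemap interface, sketched
below (`pagemap_entry` is a hypothetical name): each virtual page has a 64-bit
entry whose bit 63 means "present" and whose low 55 bits hold the pfn (the pfn
field reads as zero without CAP_SYS_ADMIN):]

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Read the pagemap entry for one virtual address of the current
 * process. Returns 0 on any failure (e.g. /proc not mounted). */
uint64_t pagemap_entry(uintptr_t vaddr)
{
	uint64_t entry = 0;
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0 || psz <= 0)
		return 0;
	/* One 8-byte entry per virtual page, indexed by page number. */
	if (pread(fd, &entry, sizeof(entry),
		  (off_t)(vaddr / (uintptr_t)psz) * 8) != (ssize_t)sizeof(entry))
		entry = 0;
	close(fd);
	return entry;
}
```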

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-09  8:28                                                     ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-09 20:01                                                       ` Luck, Tony
@ 2021-03-10  8:01                                                       ` Aili Yao
  1 sibling, 0 replies; 42+ messages in thread
From: Aili Yao @ 2021-03-10  8:01 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Luck, Tony, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili,
	sunhao2

On Tue, 9 Mar 2021 08:28:24 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:

> On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote:
> > When the page is already poisoned, another memory_failure() call in the
> > same page now return 0, meaning OK. For nested memory mce handling, this
> > behavior may lead to mce looping, Example:
> > 
> > 1.When LCME is enabled, and there are two processes A && B running on
> > different core X && Y separately, which will access one same page, then
> > the page corrupted when process A access it, a MCE will be rasied to
> > core X and the error process is just underway.
> > 
> > 2.Then B access the page and trigger another MCE to core Y, it will also
> > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > returned.
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> > p->mce_whole_page);
> > 1257                 sync_core();
> > 1258                 return;
> > 1259         }
> > 
> > 1267 }
> > 
> > 4. The error process for B will end, and may nothing happened if
> > kill-early is not set, The process B will re-excute instruction and get
> > into mce again and then loop happens. And also the set_mce_nospec()
> > here is not proper, may refer to commit fd0e786d9d09 ("x86/mm,
> > mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").
> > 
> > For other cases which care the return value of memory_failure() should
> > check why they want to process a memory error which have already been
> > processed. This behavior seems reasonable.  
> 
> Other reviewers shared ideas about the returned value, but actually
> I'm not sure which the best one is (EBUSY is not that bad).
> What we need to fix the reported issue is to return non-zero value
> for "already poisoned" case (the value itself is not so important). 
> 
> Other callers of memory_failure() (mostly test programs) could see
> the change of return value, but they can already see EBUSY now and
> anyway they should check dmesg for more detail about why failed,
> so the impact of the change is not so big.
> 
> > 
> > Signed-off-by: Aili Yao <yaoaili@kingsoft.com>  
> 
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Thanks,
> Naoya Horiguchi

Thanks!

And I found my mail was lost on the mailing list!

-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-09 20:01                                                       ` Luck, Tony
@ 2021-03-10  8:05                                                         ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-13  1:55                                                         ` Jue Wang
  1 sibling, 0 replies; 42+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-10  8:05 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Aili Yao, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

On Tue, Mar 09, 2021 at 12:01:40PM -0800, Luck, Tony wrote:
> On Tue, Mar 09, 2021 at 08:28:24AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote:
> > > When the page is already poisoned, another memory_failure() call in the
> > > same page now return 0, meaning OK. For nested memory mce handling, this
> > > behavior may lead to mce looping, Example:
> > > 
> > > 1.When LCME is enabled, and there are two processes A && B running on
> > > different core X && Y separately, which will access one same page, then
> > > the page corrupted when process A access it, a MCE will be rasied to
> > > core X and the error process is just underway.
> > > 
> > > 2.Then B access the page and trigger another MCE to core Y, it will also
> > > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > > returned.
> > > 
> > > 3.The kill_me_maybe will check the return:
> > > 
> > > 1244 static void kill_me_maybe(struct callback_head *cb)
> > > 1245 {
> > > 
> > > 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > > 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > > 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> > > p->mce_whole_page);
> > > 1257                 sync_core();
> > > 1258                 return;
> > > 1259         }
> > > 
> > > 1267 }
> > > 
> > > 4. The error process for B will end, and may nothing happened if
> > > kill-early is not set, The process B will re-excute instruction and get
> > > into mce again and then loop happens. And also the set_mce_nospec()
> > > here is not proper, may refer to commit fd0e786d9d09 ("x86/mm,
> > > mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").
> > > 
> > > For other cases which care the return value of memory_failure() should
> > > check why they want to process a memory error which have already been
> > > processed. This behavior seems reasonable.
> > 
> > Other reviewers shared ideas about the returned value, but actually
> > I'm not sure which the best one is (EBUSY is not that bad).
> > What we need to fix the reported issue is to return non-zero value
> > for "already poisoned" case (the value itself is not so important). 
> > 
> > Other callers of memory_failure() (mostly test programs) could see
> > the change of return value, but they can already see EBUSY now and
> > anyway they should check dmesg for more detail about why failed,
> > so the impact of the change is not so big.
> > 
> > > 
> > > Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
> > 
> > Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> I think that both this and my "add a mutex" patch are both
> too simplistic for this complex problem :-(
> 
> When multiple CPUs race to call memory_failure() for the same
> page we need the following results:
> 
> 1) Poison page should be marked not-present in all tasks
> 	I think the mutex patch achieves this as long as
> 	memory_failure() doesn't hit an error[1].

My assumption is that reserved kernel pages are not supposed to be mapped into any
process, so once memory_failure() judges a page as such, we never convert any page
table entry into a hwpoison entry, is that correct?  So my question is why some
user-mapped page was judged to be a "reserved kernel page".  Does futex allow such a situation?

I personally tried some testcases crossing futex and hwpoison, but I couldn't
reproduce the "reserved kernel page" case.  If possible, could you provide me
with a little more detail about your testcase?

> 
> 2) All tasks that were executing an instruction that was accessing
>    the poison location should see a SIGBUS with virtual address and
>    BUS_MCEERR_AR signature in siginfo.
> 	Neither solution achieves this. The -EBUSY return ensures
> 	that there is a SIGBUS for the tasks that get the -EBUSY
> 	return, but no siginfo details.

Yes, that's not yet perfect but avoiding MCE loop is a progress.

> 	Just the mutex patch *might* have BUS_MCEERR_AO signature
> 	to the race losing tasks, but only if they have PF_MCE_EARLY
> 	set (so says the comment in kill_proc() ... but I don't
> 	see the code checking for that bit).

commit 30c9cf49270 might explain this: task_early_kill() gets to call
find_early_kill_thread() (which checks PF_MCE_EARLY) in this case.

> 
> #2 seems hard to achieve ... there are inherent races that mean the
> AO SIGBUS could have been queued to the task before it even hits
> the poison.

So I feel that we might want some change in memory_failure() to send
SIGBUS(BUS_MCEERR_AR) to the "race losing tasks" within the new mutex.
I agree that how we find the error address is also a problem.
For now, I still have no better idea than a page table walk.

> 
> Maybe should include a non-action:
> 
> 3) A task should only see one SIGBUS per poison?
> 	Not sure if this is achievable either ... what if the task
> 	has the same page mapped multiple times?

My thought is that hwpoison-aware applications could have a dedicated thread
for SIGBUS handling, so it's better to be prepared for multiple signals for
the same error event.

Thanks,
Naoya Horiguchi
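
[Editor's note: the decoding such a dedicated SIGBUS handler thread performs
uses the documented siginfo fields; a sketch (`decode_mceerr` and
`struct mce_info` are hypothetical names) distinguishing action-required
(BUS_MCEERR_AR, the task consumed the poison) from action-optional
(BUS_MCEERR_AO, early-kill notification), with si_addr_lsb giving log2 of the
affected region's size:]

```c
#define _GNU_SOURCE
#include <signal.h>

struct mce_info {
	int action_required;	/* 1 for BUS_MCEERR_AR, 0 for _AO */
	void *addr;		/* faulting virtual address */
	long bytes;		/* size of the affected region */
};

/* Returns 0 and fills *out if si describes a memory-error SIGBUS,
 * -1 for any other signal (e.g. an ordinary BUS_ADRALN). */
int decode_mceerr(const siginfo_t *si, struct mce_info *out)
{
	if (si->si_signo != SIGBUS ||
	    (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO))
		return -1;
	out->action_required = (si->si_code == BUS_MCEERR_AR);
	out->addr = si->si_addr;
	out->bytes = 1L << si->si_addr_lsb;
	return 0;
}
```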

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-10  6:10                                     ` Aili Yao
@ 2021-03-11  8:55                                       ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-11 11:23                                         ` Aili Yao
                                                           ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-11  8:55 UTC (permalink / raw)
  To: Aili Yao
  Cc: Luck, Tony, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1

On Wed, Mar 10, 2021 at 02:10:42PM +0800, Aili Yao wrote:
> On Fri, 5 Mar 2021 15:55:25 +0000
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?  
> > 
> > If the walk wins the race and the pte for the poisoned page is still valid, then yes.
> > 
> > But we could have:
> > 
> > CPU1                            CPU2
> > memory_failure sets poison
> > bit for struct page
> > 
> > 
> > rmap finds page in task
> > on CPU2 and sets PTE
> > to not-valid-poison
> > 
> >                                 memory_failure returns
> >                                 early because struct page
> >                                 already marked as poison
> > 
> >                                 walk page tables looking
> >                                 for mapping - don't find it
> > 
> > -Tony
> 
> While I don't think there is a race condition, and if you really think the pfn with SIGBUS is not
> proper, I think following patch maybe one way.
> I copy your abandon code, and make a little modification, and just now it pass
> my simple test.
> 
> And also this is a RFC version, only valid if you think the pfn with SIGBUS is not right.
> 
> Thanks!
> 
> From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001
> From: Aili Yao <yaoaili@kingsoft.com>
> Date: Wed, 10 Mar 2021 13:59:18 +0800
> Subject: [PATCH] x86/mce: fix invalid SIGBUS address
> 
> walk the current process pte and compare with the pfn;
> 1. only test for normal page and 2M hugetlb page;
> 2. 1G hugetlb and transparentHuge is not support currently;
> 3. May other fails is not recognized, This is a RFC version.
> 
> ---
>  arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 80 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index db4afc5..65d7ef7 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -28,8 +28,12 @@
>  #include <linux/sysfs.h>
>  #include <linux/types.h>
>  #include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include <linux/init.h>
>  #include <linux/kmod.h>
> +#include <linux/pagewalk.h>
>  #include <linux/poll.h>
>  #include <linux/nmi.h>
>  #include <linux/cpu.h>

Maybe requiring many dependencies like this implies that you might be better
off doing the below in mm/memory-failure.c instead of in this file.

> @@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>  	/* mce_clear_state will clear *final, save locally for use later */
>  	*m = *final;
>  }
> +static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
> +{
> +	u64 *buff = (u64 *)walk->private;
> +	u64 pfn = buff[0];
> +
> +	if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))
> +		goto find;
> +	else if (pte_pfn(*pte) == pfn)
> +		goto find;
> +
> +	return 0;
> +find:
> +	buff[0] = addr;
> +	buff[1] = PAGE_SHIFT;
> +	return true;

Returning true means you stop walking when you find the first entry pointing
to the given pfn. But there could be multiple such entries, so if the MCE SRAR is
triggered by a memory access at a larger address among the hwpoisoned entries, the
returned virtual address might be wrong.

> +}
> +
> +extern bool is_hugetlb_entry_hwpoisoned(pte_t pte);
> +
> +static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask,
> +				 unsigned long addr, unsigned long end,
> +				 struct mm_walk *walk)
> +{
> +	u64 *buff = (u64 *)walk->private;
> +	u64 pfn = buff[0];
> +	int shift = PMD_SHIFT;
> +	pte_t pte =  huge_ptep_get(ptep);
> +
> +	if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
> +		goto find;
> +
> +	if (pte_pfn(*ptep) == pfn)
> +		goto find;
> +
> +	return 0;
> +find:
> +	buff[0] = addr;
> +	buff[1] = shift;
> +	return true;
> +}
> +
> +static struct mm_walk_ops walk = {
> +	.pte_entry = mc_pte_entry,
> +	.hugetlb_entry	= mc_hugetlb_range
> +};
> +
> +void mc_memory_failure_error(struct task_struct *p, unsigned long pfn)
> +{
> +	u64 buff[2] = {pfn, 0};
> +	struct page *page;
> +	int ret = -1;
> +
> +	page = pfn_to_page(pfn);
> +	if (!page)
> +		goto force_sigbus;
> +
> +	if (is_zone_device_page(page))
> +		goto force_sigbus;
> +
> +	mmap_read_lock(p->mm);
> +	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
> +	mmap_read_unlock(p->mm);
> +
> +	if (ret && buff[0]) {
> +		pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> +		buff[0], p->comm, p->pid);
> +		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
> +	} else {
> +force_sigbus:
> +		pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> +		pfn, p->comm, p->pid);
> +		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT);
> +	}
> +
> +}
>  
>  static void kill_me_now(struct callback_head *ch)
>  {
> @@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb)
>  	}
>  
>  	if (p->mce_vaddr != (void __user *)-1l) {
> -		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> -			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
> -		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> +		mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);

I guess that p->mce_vaddr stores the virtual address of the error here.
If so, sending SIGBUS with that address looks sufficient, as we do now, so why
do you walk the page table to find the error virtual address?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-11  8:55                                       ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-11 11:23                                         ` Aili Yao
  2021-03-11 17:05                                         ` Luck, Tony
  2021-03-17  7:48                                         ` Aili Yao
  2 siblings, 0 replies; 42+ messages in thread
From: Aili Yao @ 2021-03-11 11:23 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Luck, Tony, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili,
	sunhao2

On Thu, 11 Mar 2021 08:55:30 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:

> On Wed, Mar 10, 2021 at 02:10:42PM +0800, Aili Yao wrote:
> > On Fri, 5 Mar 2021 15:55:25 +0000
> > "Luck, Tony" <tony.luck@intel.com> wrote:
> >   
> > > > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?    
> > > 
> > > If the walk wins the race and the pte for the poisoned page is still valid, then yes.
> > > 
> > > But we could have:
> > > 
> > > CPU1                            CPU2
> > > memory_failure sets poison
> > > bit for struct page
> > > 
> > > 
> > > rmap finds page in task
> > > on CPU2 and sets PTE
> > > to not-valid-poison
> > > 
> > >                                 memory_failure returns
> > >                                 early because struct page
> > >                                 already marked as poison
> > > 
> > >                                 walk page tables looking
> > >                                 for mapping - don't find it
> > > 
> > > -Tony  
> > 
> > While I don't think there is a race condition, and if you really think the pfn with SIGBUS is not
> > proper, I think following patch maybe one way.
> > I copy your abandon code, and make a little modification, and just now it pass
> > my simple test.
> > 
> > And also this is a RFC version, only valid if you think the pfn with SIGBUS is not right.
> > 
> > Thanks!
> > 
> > From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001
> > From: Aili Yao <yaoaili@kingsoft.com>
> > Date: Wed, 10 Mar 2021 13:59:18 +0800
> > Subject: [PATCH] x86/mce: fix invalid SIGBUS address
> > 
> > walk the current process pte and compare with the pfn;
> > 1. only test for normal page and 2M hugetlb page;
> > 2. 1G hugetlb and transparentHuge is not support currently;
> > 3. May other fails is not recognized, This is a RFC version.
> > 
> > ---
> >  arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 80 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index db4afc5..65d7ef7 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -28,8 +28,12 @@
> >  #include <linux/sysfs.h>
> >  #include <linux/types.h>
> >  #include <linux/slab.h>
> > +#include <linux/hugetlb.h>
> > +#include <linux/swap.h>
> > +#include <linux/swapops.h>
> >  #include <linux/init.h>
> >  #include <linux/kmod.h>
> > +#include <linux/pagewalk.h>
> >  #include <linux/poll.h>
> >  #include <linux/nmi.h>
> >  #include <linux/cpu.h>  
> 
> Maybe requiring many dependencies like this implies that you might be better
> to do below in mm/memory-failure.c instead of in this file.

Yes, agreed, I will change this. Thanks!
 
> > @@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> >  	/* mce_clear_state will clear *final, save locally for use later */
> >  	*m = *final;
> >  }
> > +static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
> > +{
> > +	u64 *buff = (u64 *)walk->private;
> > +	u64 pfn = buff[0];
> > +
> > +	if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))
> > +		goto find;
> > +	else if (pte_pfn(*pte) == pfn)
> > +		goto find;
> > +
> > +	return 0;
> > +find:
> > +	buff[0] = addr;
> > +	buff[1] = PAGE_SHIFT;
> > +	return true;  
> 
> Returning true means you stop walking when you find the first entry pointing
> to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> triggered by memory access to the larger address in hwpoisoned entries, the
> returned virtual address might be wrong.

Yes, we need to consider multiple poisoned page entries; I will fix this. Thanks for
your suggestion!


> > +}
> > +
> > +extern bool is_hugetlb_entry_hwpoisoned(pte_t pte);
> > +
> > +static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask,
> > +				 unsigned long addr, unsigned long end,
> > +				 struct mm_walk *walk)
> > +{
> > +	u64 *buff = (u64 *)walk->private;
> > +	u64 pfn = buff[0];
> > +	int shift = PMD_SHIFT;
> > +	pte_t pte =  huge_ptep_get(ptep);
> > +
> > +	if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
> > +		goto find;
> > +
> > +	if (pte_pfn(*ptep) == pfn)
> > +		goto find;
> > +
> > +	return 0;
> > +find:
> > +	buff[0] = addr;
> > +	buff[1] = shift;
> > +	return true;
> > +}
> > +
> > +static struct mm_walk_ops walk = {
> > +	.pte_entry = mc_pte_entry,
> > +	.hugetlb_entry	= mc_hugetlb_range
> > +};
> > +
> > +void mc_memory_failure_error(struct task_struct *p, unsigned long pfn)
> > +{
> > +	u64 buff[2] = {pfn, 0};
> > +	struct page *page;
> > +	int ret = -1;
> > +
> > +	page = pfn_to_page(pfn);
> > +	if (!page)
> > +		goto force_sigbus;
> > +
> > +	if (is_zone_device_page(page))
> > +		goto force_sigbus;
> > +
> > +	mmap_read_lock(p->mm);
> > +	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
> > +	mmap_read_unlock(p->mm);
> > +
> > +	if (ret && buff[0]) {
> > +		pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > +		buff[0], p->comm, p->pid);
> > +		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
> > +	} else {
> > +force_sigbus:
> > +		pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > +		pfn, p->comm, p->pid);
> > +		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT);
> > +	}
> > +
> > +}
> >  
> >  static void kill_me_now(struct callback_head *ch)
> >  {
> > @@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb)
> >  	}
> >  
> >  	if (p->mce_vaddr != (void __user *)-1l) {
> > -		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > -			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
> > -		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> > +		mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);  
> 
> I guess that p->mce_vaddr stores the virtual address of the error here.
> If so, sending SIGBUS with the address looks enough as we do now, so why
> do you walk page table to find the error virtual address?

I checked the code again; yes, I should have placed mc_memory_failure_error() in the else branch. But it seems p->mce_vaddr is not
correctly initialized: in my test it has a zero value, so the code goes into the if (p->mce_vaddr != (void __user *)-1l) branch. It
seems this is another issue that needs fixing.

And starting from p->mce_vaddr there is possibly a better way to get the vaddr; I will dig more into this.

-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-11  8:55                                       ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-11 11:23                                         ` Aili Yao
@ 2021-03-11 17:05                                         ` Luck, Tony
  2021-03-12  5:55                                           ` Aili Yao
  2021-03-17  7:48                                         ` Aili Yao
  2 siblings, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-11 17:05 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Aili Yao
  Cc: Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1

> I guess that p->mce_vaddr stores the virtual address of the error here.
> If so, sending SIGBUS with the address looks enough as we do now, so why
> do you walk page table to find the error virtual address?

p->mce_vaddr only has the virtual address for the COPYIN case. In that code
path we decode the kernel instruction that hit the fault in order to find the virtual
address. That's easy because:

1) The kernel RIP is known to be good (can't page fault etc. on kernel address).
2) There are only a half dozen instructions used by the kernel for get_user() or
     copy_from_user().

When the machine check happens during user execution accessing poison data
we only have the physical address (from MCi_ADDR).

-Tony

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-11 17:05                                         ` Luck, Tony
@ 2021-03-12  5:55                                           ` Aili Yao
  2021-03-12 16:29                                             ` Luck, Tony
  0 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-12  5:55 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, yaoaili, sunhao2

On Thu, 11 Mar 2021 17:05:53 +0000
"Luck, Tony" <tony.luck@intel.com> wrote:

> > I guess that p->mce_vaddr stores the virtual address of the error here.
> > If so, sending SIGBUS with the address looks enough as we do now, so why
> > do you walk page table to find the error virtual address?  
> 
> p->mce_vaddr only has the virtual address for the COPYIN case. In that code
> path we decode the kernel instruction that hit the fault in order to find the virtual
> address. That's easy because:
> 
> 1) The kernel RIP is known to be good (can't page fault etc. on kernel address).
> 2) There are only a half dozen instructions used by the kernel for get_user() or
>      copy_from_user().
> 
> When the machine check happens during user execution accessing poison data
> we only have the physical address (from MCi_ADDR).
> 
> -Tony

Sorry to interrupt, as I am really confused here:
If it's a copyin case, has the page been mapped for the current process?
Will memory_failure() find it and unmap it? If it succeeds, will the current task be
signaled with the correct vaddr and shift?

Maybe mce_vaddr is set correctly, but we may lose the correct page shift?

And for the copyin case, don't we need to call set_mce_nospec()?

-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-12  5:55                                           ` Aili Yao
@ 2021-03-12 16:29                                             ` Luck, Tony
  2021-03-12 23:48                                               ` Luck, Tony
  0 siblings, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-12 16:29 UTC (permalink / raw)
  To: Aili Yao
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, sunhao2

> Sorry to interrupt as I am really confused here:
> If it's a copyin case, has the page been mapped for the current process?

Yes. The kernel has the current process mapped to the low address range
and code like get_user(), put_user() "simply" reaches down to those addresses
(maybe not so simply, as the access needs to be within a STAC/CLAC section
of code on modern CPUs, and the access instruction needs an entry in EXTABLE
so that the page fault handler can fix it if there isn't a page mapped at the user
address).

> will memory_failure() find it and unmap it? if succeed, then the current will be
> signaled with correct vaddr and shift?

That's a very good question.  I didn't see a SIGBUS when I first wrote this code,
hence all the p->mce_vaddr.  But now I'm
a) not sure why there wasn't a signal
b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS

> Maybe the mce_vaddr is set correctly, but we may lost the correct page shift?
Yes. There is a hard-coded PAGE_SHIFT for this case - which may not match the actual page size.

> And for copyin case, we don't need to call set_mce_nospec()?
Yes. We should.

Thanks for your good questions.

-Tony

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-12 16:29                                             ` Luck, Tony
@ 2021-03-12 23:48                                               ` Luck, Tony
  2021-03-16  6:42                                                 ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-17  0:29                                                 ` Luck, Tony
  0 siblings, 2 replies; 42+ messages in thread
From: Luck, Tony @ 2021-03-12 23:48 UTC (permalink / raw)
  To: Aili Yao
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, sunhao2

>> will memory_failure() find it and unmap it? if succeed, then the current will be
>> signaled with correct vaddr and shift?
>
> That's a very good question.  I didn't see a SIGBUS when I first wrote this code,
> hence all the p->mce_vaddr.  But now I'm
> a) not sure why there wasn't a signal
> b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS

Tests on upstream kernel today show that memory_failure() is both unmapping the page
and sending a SIGBUS.


My biggest issue with the KERNEL_COPYIN recovery path is that we don't have code
to mark the page not present while we are still in do_machine_check().  That's resulted
in recovery working for simple cases where there is a single get_user() call followed by
an error return if that failed. But more complex cases require more machine checks and
a touching faith that the kernel will eventually give up trying (spoiler: it sometimes doesn't).

Thanks to the decode of the instruction we do have the virtual address. So we just need
a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
of a "not-present" value. Maybe a different poison type from the one we get from
memory_failure() so that the #PF code can recognize this as a special case and do any
other work that we avoided because we were in #MC context.

-Tony

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-09 20:01                                                       ` Luck, Tony
  2021-03-10  8:05                                                         ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-13  1:55                                                         ` Jue Wang
  1 sibling, 0 replies; 42+ messages in thread
From: Jue Wang @ 2021-03-13  1:55 UTC (permalink / raw)
  To: tony.luck
  Cc: akpm, bp, david, hpa, linux-edac, linux-kernel, linux-mm, mingo,
	naoya.horiguchi, osalvador, tglx, x86, yangfeng1, yaoaili

I believe the mutex-type patch has its own value in protecting 
memory_failure() from other inherent races, e.g., races around 
split_huge_page() where concurrent MCEs happen to different 4k pages 
under the same THP.[1] This can realistically happen given the physical
locality clustering effect of memory errors.

Thanks,
-Jue

[1] The split fails due to a page reference taken by another concurrent 
call into memory_failure() on the same THP.
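
As a toy illustration of that serialization idea, here is a hedged user-space model (toy_memory_failure() and page_hwpoison are invented stand-ins, not the real kernel symbols): one mutex held across the whole handler lets a concurrent second caller observe the poison flag consistently and bail out with -EBUSY, which is exactly the behavior the patch in this thread's subject aims for.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Toy user-space model of the mutex approach discussed in this thread.
 * toy_memory_failure() and page_hwpoison stand in for memory_failure()
 * and PageHWPoison(); the point is only that holding one mutex across
 * the whole handler means a second, concurrent caller either waits for
 * the first to finish or sees the poison flag set and returns -EBUSY.
 */
static pthread_mutex_t mf_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool page_hwpoison;	/* stand-in for PageHWPoison(p) */

static int toy_memory_failure(void)
{
	int ret = 0;

	pthread_mutex_lock(&mf_mutex);
	if (page_hwpoison)
		ret = -EBUSY;		/* page already poisoned */
	else
		page_hwpoison = true;	/* handle the error exactly once */
	pthread_mutex_unlock(&mf_mutex);
	return ret;
}
```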


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-12 23:48                                               ` Luck, Tony
@ 2021-03-16  6:42                                                 ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-16  7:54                                                   ` Aili Yao
  2021-03-17  0:29                                                 ` Luck, Tony
  1 sibling, 1 reply; 42+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-03-16  6:42 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Aili Yao, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, sunhao2

On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote:
> >> will memory_failure() find it and unmap it? if succeed, then the current will be
> >> signaled with correct vaddr and shift?
> >
> > That's a very good question.  I didn't see a SIGBUS when I first wrote this code,
> > hence all the p->mce_vaddr.  But now I'm
> > a) not sure why there wasn't a signal

We have a recent change around this behavior, which might be an answer for you.
See commit 30c9cf492704 ("mm,hwpoison: send SIGBUS to PF_MCE_EARLY processes on action required events").

> > b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS
> 
> Tests on upstream kernel today show that memory_failure() is both unmapping the page
> and sending a SIGBUS.

Sorry if I misunderstood the exact problem, but if the point is to allow
user processes to find the poisoned virtual address without SIGBUS, one possible
way is to expose hwpoison entries via /proc/pid/pagemap (a draft patch is attached below).
With this patch, processes can easily find hwpoison entries without any new interface.

> 
> My biggest issue with the KERNEL_COPYIN recovery path is that we don't have code
> to mark the page not present while we are still in do_machine_check(). 

It sounds to me that even if we have such code, it just narrows down the race window
between multiple MCEs on different CPUs. Or does it completely prevent the race?
(Another thread could touch the poisoned page just before the first thread marks
the page non-present?)

> That's resulted
> in recovery working for simple cases where there is a single get_user() call followed by
> an error return if that failed. But more complex cases require more machine checks and
> a touching faith that the kernel will eventually give up trying (spoiler: it sometimes doesn't).
> 
> Thanks to the decode of the instruction we do have the virtual address. So we just need
> a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
> of a "not-present" value. Maybe a different poison type from the one we get from
> memory_failure() so that the #PF code can recognize this as a special case and do any
> other work that we avoided because we were in #MC context.

As you answered in another email, p->mce_vaddr is set only in the KERNEL_COPYIN case,
so if p->mce_vaddr were available also for generic SRAR MCEs (I'm assuming that we are
talking about races between generic SRAR MCEs, not only the KERNEL_COPYIN case),
that might be helpful, although it might be hard to implement.
And I'm afraid that walking the page table could find the wrong virtual address if a process
has multiple entries mapping the single page. We could send multiple SIGBUSs for such a case,
but maybe that's not an optimal solution.

Thanks,
Naoya Horiguchi

----
From 147449a97e2ea3420ac3523f13523f5d30a13614 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Tue, 16 Mar 2021 14:22:21 +0900
Subject: [PATCH] pagemap: expose hwpoison entry

not-signed-off-by-yet: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 fs/proc/task_mmu.c      |  6 ++++++
 include/linux/swapops.h | 12 ++++++++++++
 tools/vm/page-types.c   |  5 ++++-
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 602e3a52884d..08cea209bae7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1300,6 +1300,7 @@ struct pagemapread {
 #define PM_PFRAME_MASK		GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
 #define PM_SOFT_DIRTY		BIT_ULL(55)
 #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
+#define PM_HWPOISON	       	BIT_ULL(60)
 #define PM_FILE			BIT_ULL(61)
 #define PM_SWAP			BIT_ULL(62)
 #define PM_PRESENT		BIT_ULL(63)
@@ -1385,6 +1386,11 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
 
+		if (is_hwpoison_entry(entry)) {
+			page = hwpoison_entry_to_page(entry);
+			flags |= PM_HWPOISON;
+		}
+
 		if (is_device_private_entry(entry))
 			page = device_private_entry_to_page(entry);
 	}
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..1b9dedbd06ab 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -323,6 +323,13 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+	struct page *p = pfn_to_page(swp_offset(entry));
+	WARN_ON(!PageHWPoison(p));
+	return p;
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
@@ -345,6 +352,11 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
 	return 0;
 }
 
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 }
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 0517c744b04e..1160d5a14955 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -53,6 +53,7 @@
 #define PM_SWAP_OFFSET(x)	(((x) & PM_PFRAME_MASK) >> MAX_SWAPFILES_SHIFT)
 #define PM_SOFT_DIRTY		(1ULL << 55)
 #define PM_MMAP_EXCLUSIVE	(1ULL << 56)
+#define PM_HWPOISON		(1ULL << 60)
 #define PM_FILE			(1ULL << 61)
 #define PM_SWAP			(1ULL << 62)
 #define PM_PRESENT		(1ULL << 63)
@@ -311,6 +312,8 @@ static unsigned long pagemap_pfn(uint64_t val)
 
 	if (val & PM_PRESENT)
 		pfn = PM_PFRAME(val);
+	else if (val & PM_HWPOISON)
+		pfn = PM_SWAP_OFFSET(val);
 	else
 		pfn = 0;
 
@@ -742,7 +745,7 @@ static void walk_vma(unsigned long index, unsigned long count)
 			pfn = pagemap_pfn(buf[i]);
 			if (pfn)
 				walk_pfn(index + i, pfn, 1, buf[i]);
-			if (buf[i] & PM_SWAP)
+			else if (buf[i] & PM_SWAP)
 				walk_swap(index + i, buf[i]);
 		}
 
-- 
2.25.1
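
For reference, a user-space sketch of how a process would decode the proposed bit after reading a 64-bit entry from /proc/pid/pagemap. Note the hedge: PM_HWPOISON at bit 60 exists only in the draft patch above and is not part of the upstream pagemap ABI, while bits 55/56 and 61-63 are documented flags.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Flag layout of a 64-bit /proc/pid/pagemap entry.  Bits 55/56 and
 * 61-63 follow the documented pagemap ABI; PM_HWPOISON (bit 60) is
 * only the proposal from the draft patch above and is used here on
 * that assumption.
 */
#define PM_PFRAME_BITS		55
#define PM_PFRAME_MASK		((1ULL << PM_PFRAME_BITS) - 1)
#define PM_SOFT_DIRTY		(1ULL << 55)
#define PM_MMAP_EXCLUSIVE	(1ULL << 56)
#define PM_HWPOISON		(1ULL << 60)	/* proposed, not upstream */
#define PM_FILE			(1ULL << 61)
#define PM_SWAP			(1ULL << 62)
#define PM_PRESENT		(1ULL << 63)

/* With the draft patch, a poisoned page shows up as a non-present
 * swap-type entry with the proposed flag set. */
static bool entry_hwpoisoned(uint64_t entry)
{
	return !(entry & PM_PRESENT) && (entry & PM_HWPOISON);
}

/* The pfn field is only meaningful for present entries (and is only
 * populated for readers with CAP_SYS_ADMIN). */
static uint64_t entry_pfn(uint64_t entry)
{
	return (entry & PM_PRESENT) ? (entry & PM_PFRAME_MASK) : 0;
}
```

A process scanning its own mappings after an error would read 8 bytes per page from the pagemap file and apply these checks to each entry.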

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-16  6:42                                                 ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-16  7:54                                                   ` Aili Yao
  0 siblings, 0 replies; 42+ messages in thread
From: Aili Yao @ 2021-03-16  7:54 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Luck, Tony, Oscar Salvador, david, akpm, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1, sunhao2,
	yaoaili

> As you answered in another email, p->mce_vaddr is set only on KERNEL_COPYIN case,
> then if p->mce_vaddr is available also for generic SRAR MCE (I'm assuming that we are
> talking about issues on race between generic SRAR MCE not only for KERNEL_COPYIN case),
> that might be helpful, although it might be hard to implement.
> And I'm afraid that walking page table could find the wrong virtual address if a process
> have multiple entries to the single page. We could send multiple SIGBUSs for such case,
> but maybe that's not an optimal solution.

Yes, agreed; I can't find a way to address this multiple-entries case that makes sure we get
the exactly correct virtual address.

But I will still post a v2 RFC patch for this issue, for discussion purposes only!
Thanks!

> ----
> From 147449a97e2ea3420ac3523f13523f5d30a13614 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Tue, 16 Mar 2021 14:22:21 +0900
> Subject: [PATCH] pagemap: expose hwpoison entry
> 
> not-signed-off-by-yet: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  fs/proc/task_mmu.c      |  6 ++++++
>  include/linux/swapops.h | 12 ++++++++++++
>  tools/vm/page-types.c   |  5 ++++-
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 602e3a52884d..08cea209bae7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1300,6 +1300,7 @@ struct pagemapread {
>  #define PM_PFRAME_MASK		GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
>  #define PM_SOFT_DIRTY		BIT_ULL(55)
>  #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
> +#define PM_HWPOISON	       	BIT_ULL(60)
>  #define PM_FILE			BIT_ULL(61)
>  #define PM_SWAP			BIT_ULL(62)
>  #define PM_PRESENT		BIT_ULL(63)
> @@ -1385,6 +1386,11 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		if (is_migration_entry(entry))
>  			page = migration_entry_to_page(entry);
>  
> +		if (is_hwpoison_entry(entry)) {
> +			page = hwpoison_entry_to_page(entry);
> +			flags |= PM_HWPOISON;
> +		}
> +
>  		if (is_device_private_entry(entry))
>  			page = device_private_entry_to_page(entry);
>  	}
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index d9b7c9132c2f..1b9dedbd06ab 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -323,6 +323,13 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
>  	return swp_type(entry) == SWP_HWPOISON;
>  }
>  
> +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
> +{
> +	struct page *p = pfn_to_page(swp_offset(entry));
> +	WARN_ON(!PageHWPoison(p));
> +	return p;
> +}
> +
>  static inline void num_poisoned_pages_inc(void)
>  {
>  	atomic_long_inc(&num_poisoned_pages);
> @@ -345,6 +352,11 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
>  	return 0;
>  }
>  
> +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
> +{
> +	return NULL;
> +}
> +
>  static inline void num_poisoned_pages_inc(void)
>  {
>  }
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 0517c744b04e..1160d5a14955 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -53,6 +53,7 @@
>  #define PM_SWAP_OFFSET(x)	(((x) & PM_PFRAME_MASK) >> MAX_SWAPFILES_SHIFT)
>  #define PM_SOFT_DIRTY		(1ULL << 55)
>  #define PM_MMAP_EXCLUSIVE	(1ULL << 56)
> +#define PM_HWPOISON		(1ULL << 60)
>  #define PM_FILE			(1ULL << 61)
>  #define PM_SWAP			(1ULL << 62)
>  #define PM_PRESENT		(1ULL << 63)
> @@ -311,6 +312,8 @@ static unsigned long pagemap_pfn(uint64_t val)
>  
>  	if (val & PM_PRESENT)
>  		pfn = PM_PFRAME(val);
> +	else if (val & PM_HWPOISON)
> +		pfn = PM_SWAP_OFFSET(val);
>  	else
>  		pfn = 0;
>  
> @@ -742,7 +745,7 @@ static void walk_vma(unsigned long index, unsigned long count)
>  			pfn = pagemap_pfn(buf[i]);
>  			if (pfn)
>  				walk_pfn(index + i, pfn, 1, buf[i]);
> -			if (buf[i] & PM_SWAP)
> +			else if (buf[i] & PM_SWAP)
>  				walk_swap(index + i, buf[i]);
>  		}
>  



-- 
Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-12 23:48                                               ` Luck, Tony
  2021-03-16  6:42                                                 ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-03-17  0:29                                                 ` Luck, Tony
  2021-03-17  9:07                                                   ` Aili Yao
  1 sibling, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2021-03-17  0:29 UTC (permalink / raw)
  To: Aili Yao, Andy Lutomirski
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, sunhao2

On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote:
> Thanks to the decode of the instruction we do have the virtual address. So we just need
> a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
> of a "not-present" value. Maybe a different poison type from the one we get from
> memory_failure() so that the #PF code can recognize this as a special case and do any
> other work that we avoided because we were in #MC context.

There is no such thing as the safe walk I describe above. In a multi-threaded
application, other threads might munmap(2) bits of the address
space, which could free some of the p4d->pud->pmd->pte tables.

But the pgd table is guaranteed to exist as long as any of the threads
in the process exist. Which gave rise to a new approach (partial credit
to Dave Hansen who helped me understand why a more extreme patch that
I wanted to use wasn't working ... and he pointed out the pgd persistence).

N.B. The code below DOES NOT WORK. My test application tried to do a write(2)
syscall with some poison in the buffer. The goal is that write should return a
short count (with only the bytes from the buffer leading up to the poison
written to the file). Currently the process gets a surprise SIGSEGV in
some glibc code :-(

The code would also need a bunch more work to handle the fact that
in a multi-threaded application multiple threads might consume the
poison, and some innocent threads not consuming the poison may also end
up drawn into the melee in the page fault handler.


The big picture.
---------------

This approach passes the buck for the recovery from the #MC handler
(which has very limited options to do anything useful) to the #PF
handler (which is a much friendlier environment where locks and mutexes
can be obtained as needed).

The mechanism for this buck passing is for the #MC handler to set a
reserved bit in the PGD entry for the user address where the machine
check happened, flush the TLB for that address, and then return from
the #MC handler to the kernel get_user()/copy_from_user() code which
will re-execute the instruction to access the poison address, but this
time take a #PF because of the reserved bit in the PGD.
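
The PGD trick itself is just integer bit manipulation. A stand-alone sketch on a raw 64-bit table-entry value, mirroring the pgd_set_reserved()/pgd_clr_reserved() helpers in the patch below (the assumption being that bit 51 is a reserved bit in x86-64 paging-structure entries on CPUs whose MAXPHYADDR is at most 51, which is what makes the hardware raise #PF with the RSVD error-code bit):

```c
#include <stdint.h>

/*
 * Stand-alone version of the reserved-bit trick: setting a bit that the
 * paging hardware treats as reserved (bit 51 here, assuming a CPU with
 * MAXPHYADDR <= 51) forces the next access through this PGD entry to
 * fault with the RSVD error code, handing recovery to the #PF path.
 */
#define PAGE_BIT_RESERVED1	51
#define PAGE_RESERVED1		(1ULL << PAGE_BIT_RESERVED1)

static uint64_t pgd_set_reserved(uint64_t pgd)
{
	return pgd | PAGE_RESERVED1;
}

static uint64_t pgd_clr_reserved(uint64_t pgd)
{
	return pgd & ~PAGE_RESERVED1;
}
```

The set/clear pair is idempotent on the other bits, so the #PF handler can restore the original entry exactly by clearing the flag again.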

There's a couple of changes bundled in here to help my debugging:
1) Turn off UCNA calls to memory_failure() ... just avoids races
   and makes tests more deterministic for now
2) Disable "fast string" support ... avoid "REP MOVS" copy routines
   since I'm testing on an old system that treats poison consumed
   by REP MOVS as fatal.

Posted here for general comments on the approach.  On the plus
side it avoids taking multiple machine checks in the kernel when
it is accessing user data. I think it can also meet the goal set
by Andy Lutomirski of avoiding SIGBUS from syscalls that happen
to touch user poison. They should see either short counts for
calls like write(2) that may partially succeed or -EFAULT if
the system call totally failed.

-Tony

---

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f24d7ef8fffa..e533eaf20834 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,6 +23,7 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
+#define _PAGE_BIT_RESERVED1	51	/* Reserved bit */
 #define _PAGE_BIT_SOFTW4	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
@@ -56,6 +57,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_RESERVED1	(_AT(pteval_t, 1) << _PAGE_BIT_RESERVED1)
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 #define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
 #define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7f7200021bd1..269e8ee04c11 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -45,4 +45,24 @@ void __noreturn handle_stack_overflow(const char *message,
 				      unsigned long fault_address);
 #endif
 
+static inline void pgd_set_reserved(pgd_t *pgdp, unsigned long addr)
+{
+	pgd_t	pgd;
+
+	pgdp += pgd_index(addr);
+	pgd = *pgdp;
+	pgd.pgd |= _PAGE_RESERVED1;
+	WRITE_ONCE(*pgdp, pgd);
+}
+
+static inline void pgd_clr_reserved(pgd_t *pgdp, unsigned long addr)
+{
+	pgd_t	pgd;
+
+	pgdp += pgd_index(addr);
+	pgd = *pgdp;
+	pgd.pgd &= ~_PAGE_RESERVED1;
+	WRITE_ONCE(*pgdp, pgd);
+}
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 0e422a544835..974e1eb5d1aa 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -287,6 +287,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	 */
 	if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) {
 		rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
+misc_enable = 0;
 		if (!(misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING)) {
 			pr_info("Disabled fast string operations\n");
 			setup_clear_cpu_cap(X86_FEATURE_REP_GOOD);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..41bedc961928 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -635,6 +635,7 @@ static struct notifier_block early_nb = {
 static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 			      void *data)
 {
+#if 0
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
 
@@ -650,7 +651,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
-
+#endif
 	return NOTIFY_OK;
 }
 
@@ -1443,8 +1444,18 @@ noinstr void do_machine_check(struct pt_regs *regs)
 				mce_panic("Failed kernel mode recovery", &m, msg);
 		}
 
-		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+		/*
+		 * Machine check while copying from user space. Note that we
+		 * do not call fixup_exception(). Instead we ensure a page fault
+		 * when we return to the faulting instruction.
+		 * Let the page fault handler do the rest of the
+		 * recovery.
+		 */
+		if (m.kflags & MCE_IN_KERNEL_COPYIN) {
+			flush_tlb_one_user((unsigned long)current->mce_vaddr);
+			pgd_set_reserved(current->mm->pgd, (unsigned long)current->mce_vaddr);
+			current->mce_addr = m.addr;
+		}
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 83df991314c5..da917945150d 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -282,7 +282,6 @@ static int error_context(struct mce *m, struct pt_regs *regs)
 		return IN_KERNEL_RECOV;
 	}
 	if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
-		m->kflags |= MCE_IN_KERNEL_RECOV;
 		m->kflags |= MCE_IN_KERNEL_COPYIN;
 		return IN_KERNEL_RECOV;
 	}
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 77b9b2a3b5c8..e0e71ca023ce 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
  */
 SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
 	movl %edx,%ecx
-	cmp $X86_TRAP_MC,%eax		/* check if X86_TRAP_MC */
-	je 3f
 1:	rep movsb
 2:	mov %ecx,%eax
 	ASM_CLAC
 	ret
 
-	/*
-	 * Return zero to pretend that this copy succeeded. This
-	 * is counter-intuitive, but needed to prevent the code
-	 * in lib/iov_iter.c from retrying and running back into
-	 * the poison cache line again. The machine check handler
-	 * will ensure that a SIGBUS is sent to the task.
-	 */
-3:	xorl %eax,%eax
-	ASM_CLAC
-	ret
-
 	_ASM_EXTABLE_CPY(1b, 2b)
 SYM_CODE_END(.Lcopy_user_handle_tail)
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a73347e2cdfc..49232264e893 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1245,9 +1245,13 @@ void do_user_addr_fault(struct pt_regs *regs,
 	/*
 	 * Reserved bits are never expected to be set on
 	 * entries in the user portion of the page tables.
+	 * Except when machine check handling of a copy from
+	 * user passed the buck to #PF.
 	 */
-	if (unlikely(error_code & X86_PF_RSVD))
-		pgtable_bad(regs, error_code, address);
+	if (unlikely(error_code & X86_PF_RSVD)) {
+		if (!current->mce_vaddr)
+			pgtable_bad(regs, error_code, address);
+	}
 
 	/*
 	 * If SMAP is on, check for invalid kernel (supervisor) access to user
@@ -1291,6 +1295,15 @@ void do_user_addr_fault(struct pt_regs *regs,
 			local_irq_enable();
 	}
 
+	if (current->mce_vaddr) {
+		pgd_clr_reserved(current->mm->pgd,
+				 (unsigned long)current->mce_vaddr);
+		memory_failure(current->mce_addr >> PAGE_SHIFT, 0);
+		current->mce_vaddr = 0;
+		error_code &= ~X86_PF_RSVD;
+		pr_info("#PF: maybe fixed? Try to continue\n");
+	}
+
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
 	if (error_code & X86_PF_WRITE)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-11  8:55                                       ` HORIGUCHI NAOYA(堀口 直也)
  2021-03-11 11:23                                         ` Aili Yao
  2021-03-11 17:05                                         ` Luck, Tony
@ 2021-03-17  7:48                                         ` Aili Yao
  2021-03-17  8:23                                           ` Aili Yao
  2 siblings, 1 reply; 42+ messages in thread
From: Aili Yao @ 2021-03-17  7:48 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也),
	Luck, Tony, akpm
  Cc: Oscar Salvador, david, bp, tglx, mingo, hpa, x86, linux-edac,
	linux-kernel, linux-mm, yangfeng1, yaoaili, sunhao2


> 
> Returning true means you stop walking when you find the first entry pointing
> to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> triggered by memory access to the larger address in hwpoisoned entries, the
> returned virtual address might be wrong.
> 

I can't find a way to fix this; maybe the virtual address is contained in a
related register, but this is really beyond my knowledge.

This is a v2 RFC patch, add support for thp and 1G huge page errors.

Thanks
Aili Yao

From 31b685609610b3b06c8fd98d866913dbfeb7e159 Mon Sep 17 00:00:00 2001
From: Aili Yao <yaoaili@kingsoft.com>
Date: Wed, 17 Mar 2021 15:34:15 +0800
Subject: [PATCH] fix invalid SIGBUS address for recovery fail

Walk the current process pages and compare with the pfn, then get the
user address and related page_shift.

For thp pages, we can only split anonymous thp pages, so I think there may
be no race condition when walking and searching for the thp error page in
that case. For a non-anonymous thp, the page flags and ptes will not change,
so when the code reaches this point, whether for a non-anonymous thp page or
after a recovery failure on an anonymous thp, the page status should not
change; I am not completely sure about this.

For the case where we don't find the related virtual address, sending a
BUS_MCEERR_AR signal with the invalid address NULL may be a better option,
but I am not sure.

And this may report the wrong virtual address if the process has multiple
entries for the same page; I haven't found a way to get that right.

There may be other issues I have not recognized.
---
 arch/x86/kernel/cpu/mce/core.c |  16 ++---
 include/linux/mm.h             |   1 +
 mm/memory-failure.c            | 145 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 152 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..1bf21cc 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/init.h>
 #include <linux/kmod.h>
+#include <linux/pagewalk.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/cpu.h>
@@ -1246,7 +1250,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
-	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
+	pr_err("Uncorrected hardware memory error in user-access at %llx, %llx", p->mce_addr, p->mce_vaddr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
@@ -1258,14 +1262,8 @@ static void kill_me_maybe(struct callback_head *cb)
 		return;
 	}
 
-	if (p->mce_vaddr != (void __user *)-1l) {
-		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
-		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
-	} else {
-		pr_err("Memory error not recovered");
-		kill_me_now(cb);
-	}
+	memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);
+
 }
 
 static void queue_task_work(struct mce *m, int kill_current_task)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8..cff2f02 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3046,6 +3046,7 @@ enum mf_flags {
 	MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
+extern void memory_failure_error(struct task_struct *p, unsigned long pfn);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06f0061..aaf99a7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,8 +56,10 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/pagewalk.h>
 #include "internal.h"
 #include "ras/ras_event.h"
+#include <linux/delay.h>
 
 int sysctl_memory_failure_early_kill __read_mostly = 0;
 
@@ -240,7 +242,7 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
 	int ret = 0;
 
 	pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			pfn, t->comm, t->pid);
+			tk->addr, t->comm, t->pid);
 
 	if (flags & MF_ACTION_REQUIRED) {
 		WARN_ON_ONCE(t != current);
@@ -1417,6 +1419,7 @@ int memory_failure(unsigned long pfn, int flags)
 try_again:
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
@@ -1553,6 +1556,146 @@ int memory_failure(unsigned long pfn, int flags)
 }
 EXPORT_SYMBOL_GPL(memory_failure);
 
+static int pte_range(pte_t *ptep, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	pte_t pte = *ptep;
+
+	if (!pte_none(pte) && !pte_present(pte)) {
+		swp_entry_t swp_temp = pte_to_swp_entry(pte);
+		
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn) {
+			printk("%s, %d \n", __FUNCTION__, __LINE__);
+			goto find;
+		}
+	} else if (pte_pfn(pte) == pfn) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		goto find;
+	}
+
+	return 0;
+
+find:
+	buff[0] = addr;
+	buff[1] = PAGE_SHIFT;
+	return true;
+}
+
+static int pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	struct page *page = (struct page *)buff[0];
+	u64 pfn = buff[1];
+	pmd_t pmd = *pmdp;
+
+	if (likely(!pmd_trans_huge(pmd)))
+		return 0;
+
+	if (pmd_none(pmd) || !pmd_present(pmd))
+		return 0;
+
+	if (pmd_page(pmd) != page)
+		return 0;
+
+	for (; addr != end; page++, addr += PAGE_SIZE) {
+		if (page_to_pfn(page) == pfn) {
+			printk("%s, %d \n", __FUNCTION__, __LINE__);
+			buff[0] = addr;
+			buff[1] = PAGE_SHIFT;
+			return true;
+		}
+	}
+
+	return 0;
+}
+
+static int hugetlb_range(pte_t *ptep, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	pte_t pte =  huge_ptep_get(ptep);
+	struct page *page = pfn_to_page(pfn);
+
+	if (!huge_pte_none(pte) && !pte_present(pte)) {
+		swp_entry_t swp_temp = pte_to_swp_entry(pte);
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn) {
+			printk("%s, %d \n", __FUNCTION__, __LINE__);
+			goto find;
+		}
+	}
+	if (pte_pfn(pte) == pfn) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		goto find;
+	}
+	return 0;
+
+find:
+	buff[0] = addr;
+	buff[1] = (huge_page_size(page_hstate(page)) > PMD_SIZE) ? PUD_SHIFT : PMD_SHIFT;
+	return true;
+}
+
+void memory_failure_error(struct task_struct *p, unsigned long pfn)
+{
+	u64 buff[2] = {0};
+	struct page *page;
+	int ret = -1;
+	struct mm_walk_ops walk = {0};
+
+	if (p->mce_vaddr != (void __user *)-1l && p->mce_vaddr != (void __user *)0) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
+		return;
+	}
+
+	page = pfn_to_page(pfn);
+	if (!page)
+		goto force_sigbus;
+
+	if (is_zone_device_page(page))
+		goto force_sigbus;
+
+	page = compound_head(page);
+
+	if (PageHuge(page)) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		walk.hugetlb_entry = hugetlb_range;
+		buff[0] = page_to_pfn(page);
+	} else if (PageTransHuge(page)) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		walk.pmd_entry = pmd_range;
+		buff[0] = (u64)page;
+		buff[1] = pfn;
+	} else {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		walk.pte_entry = pte_range;
+		buff[0] = pfn;
+	}
+	msleep(1000);
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
+	mmap_read_unlock(p->mm);
+
+	pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+	pfn, p->comm, p->pid);
+
+	if (ret) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
+	} else {
+force_sigbus:
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)0, PAGE_SHIFT);
+	}
+}
+EXPORT_SYMBOL_GPL(memory_failure_error);
+
 #define MEMORY_FAILURE_FIFO_ORDER	4
 #define MEMORY_FAILURE_FIFO_SIZE	(1 << MEMORY_FAILURE_FIFO_ORDER)
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-17  7:48                                         ` Aili Yao
@ 2021-03-17  8:23                                           ` Aili Yao
  0 siblings, 0 replies; 42+ messages in thread
From: Aili Yao @ 2021-03-17  8:23 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Luck, Tony, akpm, Oscar Salvador, david, bp, tglx, mingo, hpa,
	x86, linux-edac, linux-kernel, linux-mm, yangfeng1, sunhao2,
	yaoaili


> > 
> > Returning true means you stop walking when you find the first entry pointing
> > to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> > triggered by memory access to the larger address in hwpoisoned entries, the
> > returned virtual address might be wrong.
> >   
> 
> I can't find the way to fix this, maybe the virtual address is contained in
> related register, but this is really beyong my knowledge.
> 
> This is a v2 RFC patch, add support for thp and 1G huge page errors.
> 

Sorry for the debug info and other unclean modifications.

Here is a clean one.

Thanks
Aili Yao

From 2289276ba943cdcddbf3b5b2cdbcaff78690e2e8 Mon Sep 17 00:00:00 2001
From: Aili Yao <yaoaili@kingsoft.com>
Date: Wed, 17 Mar 2021 16:12:41 +0800
Subject: [PATCH] fix invalid SIGBUS address for recovery fail

Walk the current process pages and compare with the pfn, then get the
user address and related page_shift.

For thp pages, we can only split anonymous thp pages, so I think there may
be no race condition when walking and searching for the thp error page in
that case. For a non-anonymous thp, the page flags and ptes will not change,
so when the code reaches this point, whether for a non-anonymous thp page or
after a recovery failure on an anonymous thp, the page status should not
change; I am not completely sure about this.

For the case where we don't find the related virtual address, sending a
BUS_MCEERR_AR signal with the invalid address NULL may be a better option,
but I am not sure.

And this may report the wrong virtual address if the process has multiple
entries for the same page; I haven't found a way to get that right.

There may be other issues I have not recognized.
---
 arch/x86/kernel/cpu/mce/core.c |  12 +---
 include/linux/mm.h             |   1 +
 mm/memory-failure.c            | 127 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 131 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..4cb873c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1246,7 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
-	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
+	pr_err("Uncorrected hardware memory error in user-access at %llx\n", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
@@ -1258,14 +1258,8 @@ static void kill_me_maybe(struct callback_head *cb)
 		return;
 	}
 
-	if (p->mce_vaddr != (void __user *)-1l) {
-		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
-		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
-	} else {
-		pr_err("Memory error not recovered");
-		kill_me_now(cb);
-	}
+	memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);
+
 }
 
 static void queue_task_work(struct mce *m, int kill_current_task)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8..cff2f02 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3046,6 +3046,7 @@ enum mf_flags {
 	MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
+extern void memory_failure_error(struct task_struct *p, unsigned long pfn);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06f0061..359b42f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/pagewalk.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -1553,6 +1554,132 @@ int memory_failure(unsigned long pfn, int flags)
 }
 EXPORT_SYMBOL_GPL(memory_failure);
 
+static int pte_range(pte_t *ptep, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	pte_t pte = *ptep;
+
+	if (!pte_none(pte) && !pte_present(pte)) {
+		swp_entry_t swp_temp = pte_to_swp_entry(pte);
+
+		if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn)
+			goto find;
+	} else if (pte_pfn(pte) == pfn) {
+		goto find;
+	}
+
+	return 0;
+
+find:
+	buff[0] = addr;
+	buff[1] = PAGE_SHIFT;
+	return true;
+}
+
+static int pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	struct page *page = (struct page *)buff[0];
+	u64 pfn = buff[1];
+	pmd_t pmd = *pmdp;
+
+	if (likely(!pmd_trans_huge(pmd)))
+		return 0;
+
+	if (pmd_none(pmd) || !pmd_present(pmd))
+		return 0;
+
+	if (pmd_page(pmd) != page)
+		return 0;
+
+	for (; addr != end; page++, addr += PAGE_SIZE) {
+		if (page_to_pfn(page) == pfn) {
+			buff[0] = addr;
+			buff[1] = PAGE_SHIFT;
+			return true;
+		}
+	}
+
+	return 0;
+}
+
+static int hugetlb_range(pte_t *ptep, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	pte_t pte =  huge_ptep_get(ptep);
+	struct page *page = pfn_to_page(pfn);
+
+	if (!huge_pte_none(pte) && !pte_present(pte)) {
+		swp_entry_t swp_temp = pte_to_swp_entry(pte);
+
+		if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn)
+			goto find;
+	}
+	if (pte_pfn(pte) == pfn)
+		goto find;
+
+	return 0;
+
+find:
+	buff[0] = addr;
+	buff[1] = (huge_page_size(page_hstate(page)) > PMD_SIZE) ? PUD_SHIFT : PMD_SHIFT;
+	return true;
+}
+
+void memory_failure_error(struct task_struct *p, unsigned long pfn)
+{
+	u64 buff[2] = {0};
+	struct page *page;
+	int ret = -1;
+	struct mm_walk_ops walk = {0};
+
+	if (p->mce_vaddr != (void __user *)-1l && p->mce_vaddr != (void __user *)0) {
+		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
+		return;
+	}
+
+	page = pfn_to_page(pfn);
+	if (!page)
+		goto force_sigbus;
+
+	if (is_zone_device_page(page))
+		goto force_sigbus;
+
+	page = compound_head(page);
+
+	if (PageHuge(page)) {
+		walk.hugetlb_entry = hugetlb_range;
+		buff[0] = page_to_pfn(page);
+	} else if (PageTransHuge(page)) {
+		walk.pmd_entry = pmd_range;
+		buff[0] = (u64)page;
+		buff[1] = pfn;
+	} else {
+		walk.pte_entry = pte_range;
+		buff[0] = pfn;
+	}
+
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
+	mmap_read_unlock(p->mm);
+
+	pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+	pfn, p->comm, p->pid);
+
+	if (ret) {
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
+	} else {
+force_sigbus:
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)0, PAGE_SHIFT);
+	}
+}
+EXPORT_SYMBOL_GPL(memory_failure_error);
+
 #define MEMORY_FAILURE_FIFO_ORDER	4
 #define MEMORY_FAILURE_FIFO_SIZE	(1 << MEMORY_FAILURE_FIFO_ORDER)
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
  2021-03-17  0:29                                                 ` Luck, Tony
@ 2021-03-17  9:07                                                   ` Aili Yao
  0 siblings, 0 replies; 42+ messages in thread
From: Aili Yao @ 2021-03-17  9:07 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski,
	HORIGUCHI NAOYA( 堀口 直也),
	Oscar Salvador, david, akpm, bp, tglx, mingo, hpa, x86,
	linux-edac, linux-kernel, linux-mm, yangfeng1, sunhao2, yaoaili

On Tue, 16 Mar 2021 17:29:39 -0700
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote:
> > Thanks to the decode of the instruction we do have the virtual address. So we just need
> > a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
> > of a "not-present" value. Maybe a different poison type from the one we get from
> > memory_failure() so that the #PF code can recognize this as a special case and do any
> > other work that we avoided because we were in #MC context.  
> 
> There is no such thing as the safe walk I describe above. In a multi
> threaded application other threads might munmap(2) bits of the address
> space which could free some of the p4d->pud->pmd->pte tables.
> 
> But the pgd table is guaranteed to exist as long as any of the threads
> in the process exist. Which gave rise to a new approach (partial credit
> to Dave Hansen who helped me understand why a more extreme patch that
> I wanted to use wasn't working ... and he pointed out the pgd persistence).
> 
> N.B. The code below DOES NOT WORK. My test application tried to do a write(2)
> syscall with some poison in the buffer. The goal is that write should return a
> short count (with only the bytes from the buffer leading up to the poison
> written to the file). Currently the process gets a surprise SIGSEGV in
> some glibc code :-(
> 
> The code would also need a bunch more work to handle the fact that
> in a multi-threaded application multiple threads might consume the
> poison, and some innocent threads not consuming the poison may also end
> up drawn into the melee in the page fault handler.
> 
> 
> The big picture.
> ---------------
> 
> This approach passes the buck for the recovery from the #MC handler
> (which has very limited options to do anything useful) to the #PF
> handler (which is a much friendlier environment where locks and mutexes
> can be obtained as needed).
> 
> The mechanism for this buck passing is for the #MC handler to set a
> reserved bit in the PGD entry for the user address where the machine
> check happened, flush the TLB for that address, and then return from
> the #MC handler to the kernel get_user()/copy_from_user() code which
> will re-execute the instruction to access the poison address, but this
> time take a #PF because of the reserved bit in the PGD.
> 
> There's a couple of changes bundled in here to help my debugging:
> 1) Turn off UCNA calls to memory_failure() ... just avoids races
>    and makes tests more deterministic for now
> 2) Disable "fast string" support ... avoid "REP MOVS" copy routines
>    since I'm testing on an old system that treats poison consumed
>    by REP MOVS as fatal.
> 
> Posted here for general comments on the approach.  On the plus
> side it avoids taking multiple machine checks in the kernel when
> it is accessing user data. I think it can also meet the goal set
> by Andy Lutomirski of avoiding SIGBUS from syscalls that happen
> to touch user poison. They should see either short counts for
> calls like write(2) that may partially succeed or -EFAULT if
> the system call totally failed.
> 

I have another view here, possibly wrong; posting it for discussion:
When the process is signaled SIGBUS with BUS_MCEERR_AR, I think it should be from
a read: with the data read, the process can proceed, e.g. process the data or make a decision.
With poison, the data can't be returned, the process doesn't know what to do, and killing it
is the first option.

While for the copyin case, the read is executed by the kernel, and on poison the kernel will refuse to
execute the current read and any further operation. For the process, it seems it has a chance to proceed.

If just an error code is returned, the process may or may not care, and it may not handle the error correctly.
It seems the worst case here is that the process touches the poisoned page again, triggers another MCE, and
accordingly gets killed.

It's not that bad?

Thanks
Aili Yao

> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index f24d7ef8fffa..e533eaf20834 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -23,6 +23,7 @@
>  #define _PAGE_BIT_SOFTW2	10	/* " */
>  #define _PAGE_BIT_SOFTW3	11	/* " */
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
> +#define _PAGE_BIT_RESERVED1	51	/* Reserved bit */
>  #define _PAGE_BIT_SOFTW4	58	/* available for programmer */
>  #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
>  #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
> @@ -56,6 +57,7 @@
>  #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
>  #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
>  #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
> +#define _PAGE_RESERVED1	(_AT(pteval_t, 1) << _PAGE_BIT_RESERVED1)
>  #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>  #define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
>  #define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index 7f7200021bd1..269e8ee04c11 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -45,4 +45,24 @@ void __noreturn handle_stack_overflow(const char *message,
>  				      unsigned long fault_address);
>  #endif
>  
> +static inline void pgd_set_reserved(pgd_t *pgdp, unsigned long addr)
> +{
> +	pgd_t	pgd;
> +
> +	pgdp += pgd_index(addr);
> +	pgd = *pgdp;
> +	pgd.pgd |= _PAGE_RESERVED1;
> +	WRITE_ONCE(*pgdp, pgd);
> +}
> +
> +static inline void pgd_clr_reserved(pgd_t *pgdp, unsigned long addr)
> +{
> +	pgd_t	pgd;
> +
> +	pgdp += pgd_index(addr);
> +	pgd = *pgdp;
> +	pgd.pgd &= ~_PAGE_RESERVED1;
> +	WRITE_ONCE(*pgdp, pgd);
> +}
> +
>  #endif /* _ASM_X86_TRAPS_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 0e422a544835..974e1eb5d1aa 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -287,6 +287,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>  	 */
>  	if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) {
>  		rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
> +misc_enable = 0;
>  		if (!(misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING)) {
>  			pr_info("Disabled fast string operations\n");
>  			setup_clear_cpu_cap(X86_FEATURE_REP_GOOD);
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7962355436da..41bedc961928 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -635,6 +635,7 @@ static struct notifier_block early_nb = {
>  static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>  			      void *data)
>  {
> +#if 0
>  	struct mce *mce = (struct mce *)data;
>  	unsigned long pfn;
>  
> @@ -650,7 +651,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>  		set_mce_nospec(pfn, whole_page(mce));
>  		mce->kflags |= MCE_HANDLED_UC;
>  	}
> -
> +#endif
>  	return NOTIFY_OK;
>  }
>  
> @@ -1443,8 +1444,18 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  				mce_panic("Failed kernel mode recovery", &m, msg);
>  		}
>  
> -		if (m.kflags & MCE_IN_KERNEL_COPYIN)
> -			queue_task_work(&m, kill_current_task);
> +		/*
> +		 * Machine check while copying from user space. Note that we
> +		 * do not call fixup_exception(). Instead we ensure a page fault
> +		 * when we return to the faulting instruction.
> +		 * Let the page fault handler do the rest of the
> +		 * recovery.
> +		 */
> +		if (m.kflags & MCE_IN_KERNEL_COPYIN) {
> +			flush_tlb_one_user((unsigned long)current->mce_vaddr);
> +			pgd_set_reserved(current->mm->pgd, (unsigned long)current->mce_vaddr);
> +			current->mce_addr = m.addr;
> +		}
>  	}
>  out:
>  	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> index 83df991314c5..da917945150d 100644
> --- a/arch/x86/kernel/cpu/mce/severity.c
> +++ b/arch/x86/kernel/cpu/mce/severity.c
> @@ -282,7 +282,6 @@ static int error_context(struct mce *m, struct pt_regs *regs)
>  		return IN_KERNEL_RECOV;
>  	}
>  	if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
> -		m->kflags |= MCE_IN_KERNEL_RECOV;
>  		m->kflags |= MCE_IN_KERNEL_COPYIN;
>  		return IN_KERNEL_RECOV;
>  	}
> diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> index 77b9b2a3b5c8..e0e71ca023ce 100644
> --- a/arch/x86/lib/copy_user_64.S
> +++ b/arch/x86/lib/copy_user_64.S
> @@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
>   */
>  SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
>  	movl %edx,%ecx
> -	cmp $X86_TRAP_MC,%eax		/* check if X86_TRAP_MC */
> -	je 3f
>  1:	rep movsb
>  2:	mov %ecx,%eax
>  	ASM_CLAC
>  	ret
>  
> -	/*
> -	 * Return zero to pretend that this copy succeeded. This
> -	 * is counter-intuitive, but needed to prevent the code
> -	 * in lib/iov_iter.c from retrying and running back into
> -	 * the poison cache line again. The machine check handler
> -	 * will ensure that a SIGBUS is sent to the task.
> -	 */
> -3:	xorl %eax,%eax
> -	ASM_CLAC
> -	ret
> -
>  	_ASM_EXTABLE_CPY(1b, 2b)
>  SYM_CODE_END(.Lcopy_user_handle_tail)
>  
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index a73347e2cdfc..49232264e893 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1245,9 +1245,13 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	/*
>  	 * Reserved bits are never expected to be set on
>  	 * entries in the user portion of the page tables.
> +	 * Except when machine check handling of a copy from
> +	 * user passed the buck to #PF.
>  	 */
> -	if (unlikely(error_code & X86_PF_RSVD))
> -		pgtable_bad(regs, error_code, address);
> +	if (unlikely(error_code & X86_PF_RSVD)) {
> +		if (!current->mce_vaddr)
> +			pgtable_bad(regs, error_code, address);
> +	}
>  
>  	/*
>  	 * If SMAP is on, check for invalid kernel (supervisor) access to user
> @@ -1291,6 +1295,15 @@ void do_user_addr_fault(struct pt_regs *regs,
>  			local_irq_enable();
>  	}
>  
> +	if (current->mce_vaddr) {
> +		pgd_clr_reserved(current->mm->pgd,
> +				 (unsigned long)current->mce_vaddr);
> +		memory_failure(current->mce_addr >> PAGE_SHIFT, 0);
> +		current->mce_vaddr = 0;
> +		error_code &= ~X86_PF_RSVD;
> +		pr_info("#PF: maybe fixed? Try to continue\n");
> +	}
> +
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  
>  	if (error_code & X86_PF_WRITE)



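The hand-off this debug patch sketches can be summarized as a small state machine: #MC records the faulting user address and marks its page-table entry with a reserved bit instead of fixing up in place, so the retried instruction takes a #PF with X86_PF_RSVD set; the page fault handler then recognizes the marker, clears it, and runs memory_failure(). Below is a userspace simulation of that hand-off, not kernel code — the struct, bit positions, and function names are illustrative stand-ins for the kernel state the patch touches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel state involved. */
#define PGD_RSVD_BIT (1ULL << 51)   /* illustrative reserved-bit position */
#define X86_PF_RSVD  (1u << 3)

struct task {
    uint64_t  pgd_entry;          /* page-table entry covering mce_vaddr */
    uintptr_t mce_vaddr;          /* user address that took the MCE, 0 = none */
    uint64_t  mce_addr;           /* physical address from the MCE record */
    bool      memory_failure_ran; /* did #PF hand off to memory_failure()? */
};

/* #MC path: poison the page-table lookup so the retried instruction
 * faults, then let #PF finish the recovery (stands in for the
 * MCE_IN_KERNEL_COPYIN branch in do_machine_check()). */
static void machine_check_copyin(struct task *t, uintptr_t vaddr, uint64_t paddr)
{
    t->mce_vaddr = vaddr;
    t->mce_addr  = paddr;
    t->pgd_entry |= PGD_RSVD_BIT;   /* pgd_set_reserved() in the patch */
    /* flush_tlb_one_user(vaddr) would go here */
}

/* #PF path: recognize the hand-off, undo the reserved marking, call
 * memory_failure() on the poisoned pfn, and strip X86_PF_RSVD so the
 * rest of the fault path treats this as an ordinary fault. */
static unsigned int user_addr_fault(struct task *t, unsigned int error_code)
{
    if ((error_code & X86_PF_RSVD) && !t->mce_vaddr) {
        /* genuine page-table corruption: real kernel calls pgtable_bad() */
    }
    if (t->mce_vaddr) {
        t->pgd_entry &= ~PGD_RSVD_BIT;  /* pgd_clr_reserved() */
        t->memory_failure_ran = true;   /* memory_failure(mce_addr >> PAGE_SHIFT, 0) */
        t->mce_vaddr = 0;
        error_code &= ~X86_PF_RSVD;
    }
    return error_code;
}
```

One consequence this makes visible: the recovery now depends on the #PF actually being taken and on current->mce_vaddr surviving until then, which is where the synchronization questions in this thread come from.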
-- 
Thanks!
Aili Yao


Thread overview: 42+ messages
     [not found] <20210224151619.67c29731@alex-virtual-machine>
     [not found] ` <20210224103105.GA16368@linux>
     [not found]   ` <20210225114329.4e1a41c6@alex-virtual-machine>
     [not found]     ` <20210225112818.GA10141@hori.linux.bs1.fc.nec.co.jp>
     [not found]       ` <20210225113930.GA7227@localhost.localdomain>
     [not found]         ` <20210225123806.GA15006@hori.linux.bs1.fc.nec.co.jp>
     [not found]           ` <20210225181542.GA178925@agluck-desk2.amr.corp.intel.com>
     [not found]             ` <20210226021907.GA27861@hori.linux.bs1.fc.nec.co.jp>
2021-02-26  2:59               ` [PATCH] mm,hwpoison: return -EBUSY when page already poisoned Aili Yao
2021-03-03  3:39                 ` Luck, Tony
2021-03-03  3:57                   ` Aili Yao
2021-03-03  8:39                     ` Aili Yao
2021-03-03 15:41                       ` Luck, Tony
2021-03-04  2:16                         ` Aili Yao
2021-03-04  4:19                           ` Aili Yao
2021-03-04  6:45                             ` Aili Yao
2021-03-04 23:57                               ` Luck, Tony
2021-03-05  1:30                                 ` Aili Yao
2021-03-05  1:36                                   ` Aili Yao
2021-03-05 22:11                                     ` Luck, Tony
2021-03-08  6:45                                       ` HORIGUCHI NAOYA(堀口 直也)
2021-03-08 18:54                                         ` Luck, Tony
2021-03-08 22:38                                           ` HORIGUCHI NAOYA(堀口 直也)
2021-03-08 22:55                                             ` [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races Luck, Tony
2021-03-08 23:42                                               ` HORIGUCHI NAOYA(堀口 直也)
2021-03-09  2:04                                               ` Aili Yao
2021-03-09  6:04                                                 ` HORIGUCHI NAOYA(堀口 直也)
2021-03-09  6:35                                                   ` [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned Aili Yao
2021-03-09  8:28                                                     ` HORIGUCHI NAOYA(堀口 直也)
2021-03-09 20:01                                                       ` Luck, Tony
2021-03-10  8:05                                                         ` HORIGUCHI NAOYA(堀口 直也)
2021-03-13  1:55                                                         ` Jue Wang
2021-03-10  8:01                                                       ` Aili Yao
2021-03-09  6:38                                                   ` [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races Aili Yao
2021-03-05 15:55                                   ` [PATCH] mm,hwpoison: return -EBUSY when page already poisoned Luck, Tony
2021-03-10  6:10                                     ` Aili Yao
2021-03-11  8:55                                       ` HORIGUCHI NAOYA(堀口 直也)
2021-03-11 11:23                                         ` Aili Yao
2021-03-11 17:05                                         ` Luck, Tony
2021-03-12  5:55                                           ` Aili Yao
2021-03-12 16:29                                             ` Luck, Tony
2021-03-12 23:48                                               ` Luck, Tony
2021-03-16  6:42                                                 ` HORIGUCHI NAOYA(堀口 直也)
2021-03-16  7:54                                                   ` Aili Yao
2021-03-17  0:29                                                 ` Luck, Tony
2021-03-17  9:07                                                   ` Aili Yao
2021-03-17  7:48                                         ` Aili Yao
2021-03-17  8:23                                           ` Aili Yao
     [not found]         ` <20210226105250.3a15e35c@alex-virtual-machine>
2021-02-26 17:58           ` Luck, Tony
2021-03-02  4:32             ` Aili Yao
