linux-mm.kvack.org archive mirror
* [PATCH 0/2] Fix infinite machine check loop in futex_wait_setup()
@ 2021-01-08 22:22 Tony Luck
  2021-01-08 22:22 ` [PATCH 1/2] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Tony Luck @ 2021-01-08 22:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

Linux can now recover from machine checks where kernel code is
doing get_user() to access application memory. But there isn't
a way to distinguish whether get_user() failed because of a page
fault or a machine check.

Thus there is a problem if any kernel code thinks it can retry
an access after doing something that would fix the page fault.

One such example (I'm sure there are more) is in futex_wait_setup(),
where an attempt is made to read the futex with page faults disabled.
Then comes a retry (after dropping a lock so that page faults are safe):


        ret = get_futex_value_locked(&uval, uaddr);

        if (ret) {
                queue_unlock(*hb);

                ret = get_user(uval, uaddr);

It would be good to avoid deliberately taking a second machine
check (especially as the recovery code does really bad things
and ends up in an infinite loop!).

My proposal is to add a new function arch_memory_failure()
that can be called after get_user() returns -EFAULT to allow
graceful recovery.
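
Concretely, the retry site quoted above would gain one check. This is
the shape of the change in part 2 below (the comment is added here just
for illustration):

        ret = get_futex_value_locked(&uval, uaddr);

        if (ret) {
                queue_unlock(*hb);

                /* Poisoned memory, not a page fault: a retry would just trap again */
                if (arch_memory_failure(uaddr))
                        return ret;

                ret = get_user(uval, uaddr);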

Futex reviewers: I just have one new call (that fixes my test
case). If you could point out other places this is needed,
that would be most helpful.

Patch roadmap:

Part 1: Add code to avoid the infinite loop in the machine check
code. Just panic if code runs into the same machine check a second
time. This should make it much easier to debug other places where
this happens.

Part 2: Add arch_memory_failure() and use it in futex_wait_setup().
[Suggestions gladly accepted for the current best way to handle the
#defines etc. to define an arch specific function to be used in
generic code]

Tony Luck (2):
  x86/mce: Avoid infinite loop for copy from user recovery
  futex, x86/mce: Avoid double machine checks

 arch/x86/include/asm/mmu.h     |  7 +++++++
 arch/x86/kernel/cpu/mce/core.c | 17 ++++++++++++++++-
 include/linux/mm.h             |  4 ++++
 include/linux/sched.h          |  3 ++-
 kernel/futex.c                 |  3 +++
 5 files changed, 32 insertions(+), 2 deletions(-)

-- 
2.21.1




* [PATCH 1/2] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-08 22:22 [PATCH 0/2] Fix infinite machine check loop in futex_wait_setup() Tony Luck
@ 2021-01-08 22:22 ` Tony Luck
  2021-01-08 22:22 ` [PATCH 2/2] futex, x86/mce: Avoid double machine checks Tony Luck
  2021-01-11 21:44 ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Tony Luck
  2 siblings, 0 replies; 52+ messages in thread
From: Tony Luck @ 2021-01-08 22:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets things up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check, queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called, it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);
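
To see why this never terminates: task_work_add() pushes the
callback_head onto a singly linked list, so queueing the same node a
second time leaves work->next pointing back at the node itself. Below
is a stand-alone sketch of that failure in plain C, with
task_work_add() reduced to a bare list push and a loop cap added so
the demo exits (the kernel loop above has no such cap):

	#include <stdio.h>

	struct callback_head {
		struct callback_head *next;
		void (*func)(struct callback_head *);
	};

	static struct callback_head *head;

	/* simplified task_work_add(): push the node onto the list */
	static void add(struct callback_head *work)
	{
		work->next = head;
		head = work;
	}

	static void kill_me_maybe(struct callback_head *cb) { }

	int main(void)
	{
		static struct callback_head mce_kill_me = { .func = kill_me_maybe };
		struct callback_head *work, *next;
		int n = 0;

		add(&mce_kill_me);	/* first machine check */
		add(&mce_kill_me);	/* second: mce_kill_me.next == &mce_kill_me */

		work = head;
		do {			/* same structure as the loop above */
			next = work->next;
			work->func(work);
			work = next;
		} while (work && ++n < 5);	/* capped only in this demo */

		printf("list never drained: work still pending after %d passes\n", n);
		return 0;
	}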

Add a "mce_busy" flag bit to detect this situation and panic
when it happens.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 7 ++++++-
 include/linux/sched.h          | 3 ++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..1bf11213e093 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1246,6 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_busy = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1268,6 +1269,7 @@ static void kill_me_maybe(struct callback_head *cb)
 
 static void queue_task_work(struct mce *m, int kill_current_task)
 {
+	current->mce_busy = 1;
 	current->mce_addr = m->addr;
 	current->mce_kflags = m->kflags;
 	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
@@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs)
 				mce_panic("Failed kernel mode recovery", &m, msg);
 		}
 
-		if (m.kflags & MCE_IN_KERNEL_COPYIN)
+		if (m.kflags & MCE_IN_KERNEL_COPYIN) {
+			if (current->mce_busy)
+				mce_panic("Multiple copyin", &m, msg);
 			queue_task_work(&m, kill_current_task);
+		}
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..a763a76eac57 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1360,7 +1360,8 @@ struct task_struct {
 	u64				mce_addr;
 	__u64				mce_ripv : 1,
 					mce_whole_page : 1,
-					__mce_reserved : 62;
+					mce_busy : 1,
+					__mce_reserved : 61;
 	struct callback_head		mce_kill_me;
 #endif
 
-- 
2.21.1




* [PATCH 2/2] futex, x86/mce: Avoid double machine checks
  2021-01-08 22:22 [PATCH 0/2] Fix infinite machine check loop in futex_wait_setup() Tony Luck
  2021-01-08 22:22 ` [PATCH 1/2] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
@ 2021-01-08 22:22 ` Tony Luck
  2021-01-08 22:47   ` Peter Zijlstra
  2021-01-11 21:44 ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Tony Luck
  2 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2021-01-08 22:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

futex_wait_setup() first tries to read the user value with page faults
disabled (because it holds a lock, and so cannot sleep). If that read
fails it drops the lock and tries again.

But there are now two reasons why the user space read can fail. Either:
1) legacy case of a page fault, in which case it is reasonable to retry
2) machine check on the user address, bad idea to re-read

Add some infrastructure to differentiate these cases.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/mmu.h     |  7 +++++++
 arch/x86/kernel/cpu/mce/core.c | 10 ++++++++++
 include/linux/mm.h             |  4 ++++
 kernel/futex.c                 |  3 +++
 4 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..a46c78381388 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -66,4 +66,11 @@ typedef struct {
 void leave_mm(int cpu);
 #define leave_mm leave_mm
 
+#if defined(CONFIG_X86_MCE) && defined(CONFIG_MEMORY_FAILURE)
+#undef arch_memory_failure
+#define arch_memory_failure x86_memory_failure
+#endif
+
+bool x86_memory_failure(u32 __user *addr);
+
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 1bf11213e093..b27aa30290bb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1236,6 +1236,16 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 	*m = *final;
 }
 
+bool x86_memory_failure(u32 __user *addr)
+{
+	if (current->mce_busy == 0)
+		return false;
+
+	WARN_ON(current->mce_vaddr != addr);
+
+	return true;
+}
+
 static void kill_me_now(struct callback_head *ch)
 {
 	force_sig(SIGBUS);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..470708a71dd3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3177,5 +3177,9 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+#ifndef arch_memory_failure
+#define arch_memory_failure(vaddr)	(0)
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index c47d1015d759..8fa2fc854026 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2658,6 +2658,9 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 	if (ret) {
 		queue_unlock(*hb);
 
+		if (arch_memory_failure(uaddr))
+			return ret;
+
 		ret = get_user(uval, uaddr);
 		if (ret)
 			return ret;
-- 
2.21.1




* Re: [PATCH 2/2] futex, x86/mce: Avoid double machine checks
  2021-01-08 22:22 ` [PATCH 2/2] futex, x86/mce: Avoid double machine checks Tony Luck
@ 2021-01-08 22:47   ` Peter Zijlstra
  2021-01-08 23:08     ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2021-01-08 22:47 UTC (permalink / raw)
  To: Tony Luck
  Cc: Borislav Petkov, x86, Andrew Morton, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

On Fri, Jan 08, 2021 at 02:22:51PM -0800, Tony Luck wrote:
> futex_wait_setup() first tries to read the user value with page faults
> disabled (because it holds a lock, and so cannot sleep). If that read
> fails it drops the lock and tries again.
> 
> But there are now two reasons why the user space read can fail. Either:
> 1) legacy case of a page fault, in which case it is reasonable to retry
> 2) machine check on the user address, bad idea to re-read
> 
> Add some infrastructure to differentiate these cases.

> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2658,6 +2658,9 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
>  	if (ret) {
>  		queue_unlock(*hb);
>  
> +		if (arch_memory_failure(uaddr))
> +			return ret;
> +
>  		ret = get_user(uval, uaddr);
>  		if (ret)
>  			return ret;


I think this is horrid; why can't we have it return something different
than -EFAULT instead?



* RE: [PATCH 2/2] futex, x86/mce: Avoid double machine checks
  2021-01-08 22:47   ` Peter Zijlstra
@ 2021-01-08 23:08     ` Luck, Tony
  2021-01-08 23:14       ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-08 23:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, x86, Andrew Morton, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

> I think this is horrid; why can't we have it return something different
> than -EFAULT instead?

I did consider this ... but it appears that architectures aren't unified in the
return value from get_user().

Here's another function involved in the futex call chain leading to this:

static int get_futex_value_locked(u32 *dest, u32 __user *from)
{
        int ret;

        pagefault_disable();
        ret = __get_user(*dest, from);
        pagefault_enable();

        return ret ? -EFAULT : 0;
}

It seems like the expectation here is just "zero or not" and we
don't care what the "not" value is ... just turn it into -EFAULT.

-Tony



* Re: [PATCH 2/2] futex, x86/mce: Avoid double machine checks
  2021-01-08 23:08     ` Luck, Tony
@ 2021-01-08 23:14       ` Peter Zijlstra
  2021-01-08 23:20         ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2021-01-08 23:14 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86, Andrew Morton, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

On Fri, Jan 08, 2021 at 11:08:58PM +0000, Luck, Tony wrote:
> > I think this is horrid; why can't we have it return something different
> > than -EFAULT instead?
> 
> I did consider this ... but it appears that architectures aren't unified in the
> return value from get_user().

But surely none are currently returning -EMEMERR or whatever name we
come up with.

> Here's another function involved in the futex call chain leading to this:
> 
> static int get_futex_value_locked(u32 *dest, u32 __user *from)
> {
>         int ret;
> 
>         pagefault_disable();
>         ret = __get_user(*dest, from);
>         pagefault_enable();
> 
>         return ret ? -EFAULT : 0;
> }
> 
> It seems like the expectation here is just "zero or not" and we
> don't care what the "not" value is ... just turn it into -EFAULT.

Yeah, saw that, but that should be trivially fixable I'm thinking.



* RE: [PATCH 2/2] futex, x86/mce: Avoid double machine checks
  2021-01-08 23:14       ` Peter Zijlstra
@ 2021-01-08 23:20         ` Luck, Tony
  0 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2021-01-08 23:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, x86, Andrew Morton, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

> Yeah, saw that, but that should be trivially fixable I'm thinking.

Trivial, maybe ... but then follows the audit of other get_user() calls:

git grep get_user\( | wc -l
2003

:-(

-Tony



* [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup()
  2021-01-08 22:22 [PATCH 0/2] Fix infinite machine check loop in futex_wait_setup() Tony Luck
  2021-01-08 22:22 ` [PATCH 1/2] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
  2021-01-08 22:22 ` [PATCH 2/2] futex, x86/mce: Avoid double machine checks Tony Luck
@ 2021-01-11 21:44 ` Tony Luck
  2021-01-11 21:44   ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
                     ` (4 more replies)
  2 siblings, 5 replies; 52+ messages in thread
From: Tony Luck @ 2021-01-11 21:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

Linux can now recover from machine checks where kernel code is
doing get_user() to access application memory. But there isn't
a way to distinguish whether get_user() failed because of a page
fault or a machine check.

Thus there is a problem if any kernel code thinks it can retry
an access after doing something that would fix the page fault.

One such example (I'm sure there are more) is in futex_wait_setup(),
where an attempt is made to read the futex with page faults disabled.
Then comes a retry (after dropping a lock so that page faults are safe):


        ret = get_futex_value_locked(&uval, uaddr);

        if (ret) {
                queue_unlock(*hb);

                ret = get_user(uval, uaddr);

It would be good to avoid deliberately taking a second machine
check (especially as the recovery code does really bad things
and ends up in an infinite loop!).

V2 (thanks to feedback from PeterZ) fixes this by changing get_user() to
return -ENXIO ("No such device or address") for the case where a machine
check occurred. Peter left it open which error code to use (suggesting
"-EMEMERR or whatever name we come up with"). I think the existing ENXIO
error code seems appropriate (the address being accessed has effectively
gone away). But I don't have a strong attachment if anyone thinks we
need a new code.

Callers can check for ENXIO in paths where the access would be
retried so they can avoid a second machine check.
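
For example, the futex retry site ends up looking like this (this is
what part 3 below implements; the comment is added for illustration):

        ret = get_futex_value_locked(&uval, uaddr);

        if (ret) {
                queue_unlock(*hb);

                /* -ENXIO means machine check: do not touch uaddr again */
                if (ret == -ENXIO)
                        return ret;

                ret = get_user(uval, uaddr);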

Patch roadmap:

Part 1 (unchanged since v1):
Add code to avoid the infinite loop in the machine check
code. Just panic if code runs into the same machine check a second
time. This should make it much easier to debug other places where
this happens.

Part 2: Change recovery path for get_user() to return -ENXIO

Part 3: Fix the one case in futex code that my test case hits (I'm
sure there are more).

TBD: There are a few places in arch/x86 code that test "ret == -EFAULT"
or have "switch (ret) { case -EFAULT: }" that may benefit from an
additional check for -ENXIO. For now those will continue to crash
(just like every pre-v5.10 kernel crashed when get_user() touched
poison).


Tony Luck (3):
  x86/mce: Avoid infinite loop for copy from user recovery
  x86/mce: Add new return value to get_user() for machine check
  futex, x86/mce: Avoid double machine checks

 arch/x86/kernel/cpu/mce/core.c | 7 ++++++-
 arch/x86/lib/getuser.S         | 8 +++++++-
 arch/x86/mm/extable.c          | 1 +
 include/linux/sched.h          | 3 ++-
 kernel/futex.c                 | 5 ++++-
 5 files changed, 20 insertions(+), 4 deletions(-)

-- 
2.21.1




* [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-11 21:44 ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Tony Luck
@ 2021-01-11 21:44   ` Tony Luck
  2021-01-11 22:11     ` Andy Lutomirski
  2021-01-14 20:22     ` Borislav Petkov
  2021-01-11 21:44   ` [PATCH v2 2/3] x86/mce: Add new return value to get_user() for machine check Tony Luck
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 52+ messages in thread
From: Tony Luck @ 2021-01-11 21:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets things up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check, queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called, it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);

Add a "mce_busy" flag bit to detect this situation and panic
when it happens.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 7 ++++++-
 include/linux/sched.h          | 3 ++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..1bf11213e093 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1246,6 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_busy = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1268,6 +1269,7 @@ static void kill_me_maybe(struct callback_head *cb)
 
 static void queue_task_work(struct mce *m, int kill_current_task)
 {
+	current->mce_busy = 1;
 	current->mce_addr = m->addr;
 	current->mce_kflags = m->kflags;
 	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
@@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs)
 				mce_panic("Failed kernel mode recovery", &m, msg);
 		}
 
-		if (m.kflags & MCE_IN_KERNEL_COPYIN)
+		if (m.kflags & MCE_IN_KERNEL_COPYIN) {
+			if (current->mce_busy)
+				mce_panic("Multiple copyin", &m, msg);
 			queue_task_work(&m, kill_current_task);
+		}
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..a763a76eac57 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1360,7 +1360,8 @@ struct task_struct {
 	u64				mce_addr;
 	__u64				mce_ripv : 1,
 					mce_whole_page : 1,
-					__mce_reserved : 62;
+					mce_busy : 1,
+					__mce_reserved : 61;
 	struct callback_head		mce_kill_me;
 #endif
 
-- 
2.21.1




* [PATCH v2 2/3] x86/mce: Add new return value to get_user() for machine check
  2021-01-11 21:44 ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Tony Luck
  2021-01-11 21:44   ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
@ 2021-01-11 21:44   ` Tony Luck
  2021-01-11 21:44   ` [PATCH v2 3/3] futex, x86/mce: Avoid double machine checks Tony Luck
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Tony Luck @ 2021-01-11 21:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

When an exception occurs during any of the get_user() functions,
fixup_exception() passes the trap number of the exception in regs->ax
to the fixup code.

Check for X86_TRAP_MC and return -ENXIO ("No such device or address")
so that callers can take action to avoid repeating an access to an
address that has an uncorrectable error.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/lib/getuser.S | 8 +++++++-
 arch/x86/mm/extable.c  | 1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index fa1bc2104b32..c49a449fced6 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -17,7 +17,7 @@
  *
  * Inputs:	%[r|e]ax contains the address.
  *
- * Outputs:	%[r|e]ax is error code (0 or -EFAULT)
+ * Outputs:	%[r|e]ax is error code (0 or -EFAULT or -ENXIO)
  *		%[r|e]dx contains zero-extended value
  *		%ecx contains the high half for 32-bit __get_user_8
  *
@@ -34,6 +34,7 @@
 #include <asm/asm.h>
 #include <asm/smap.h>
 #include <asm/export.h>
+#include <asm/trapnr.h>
 
 #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
 
@@ -168,8 +169,13 @@ SYM_CODE_START_LOCAL(.Lbad_get_user_clac)
 	ASM_CLAC
 bad_get_user:
 	xor %edx,%edx
+	cmpl $X86_TRAP_MC,%eax
+	je mce_get_user
 	mov $(-EFAULT),%_ASM_AX
 	ret
+mce_get_user:
+	mov $(-ENXIO),%_ASM_AX
+	ret
 SYM_CODE_END(.Lbad_get_user_clac)
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index b93d6cd08a7f..ac4fcb820c40 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -77,6 +77,7 @@ __visible bool ex_handler_uaccess(const struct exception_table_entry *fixup,
 {
 	WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in user access. Non-canonical address?");
 	regs->ip = ex_fixup_addr(fixup);
+	regs->ax = trapnr;
 	return true;
 }
 EXPORT_SYMBOL(ex_handler_uaccess);
-- 
2.21.1




* [PATCH v2 3/3] futex, x86/mce: Avoid double machine checks
  2021-01-11 21:44 ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Tony Luck
  2021-01-11 21:44   ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
  2021-01-11 21:44   ` [PATCH v2 2/3] x86/mce: Add new return value to get_user() for machine check Tony Luck
@ 2021-01-11 21:44   ` Tony Luck
  2021-01-14 17:22   ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Andy Lutomirski
  2021-01-15  0:38   ` [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
  4 siblings, 0 replies; 52+ messages in thread
From: Tony Luck @ 2021-01-11 21:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

futex_wait_setup() first tries to read the user value with page faults
disabled (because it holds a lock, and so cannot sleep). If that read
fails it drops the lock and tries again.

But there are now two reasons why the user space read can fail. Either:
1) legacy case of a page fault, in which case it is reasonable to retry
2) machine check on the user address, bad idea to re-read

Check for the ENXIO return code from the first get_user() call and
immediately return an error without re-reading the futex.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 kernel/futex.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index c47d1015d759..b11166712a9f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -723,7 +723,7 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from)
 	ret = __get_user(*dest, from);
 	pagefault_enable();
 
-	return ret ? -EFAULT : 0;
+	return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0;
 }
 
 
@@ -2658,6 +2658,9 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 	if (ret) {
 		queue_unlock(*hb);
 
+		if (ret == -ENXIO)
+			return ret;
+
 		ret = get_user(uval, uaddr);
 		if (ret)
 			return ret;
-- 
2.21.1




* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-11 21:44   ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
@ 2021-01-11 22:11     ` Andy Lutomirski
  2021-01-11 22:20       ` Luck, Tony
  2021-01-14 20:22     ` Borislav Petkov
  1 sibling, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2021-01-11 22:11 UTC (permalink / raw)
  To: Tony Luck
  Cc: Borislav Petkov, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm


> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote:
> 
> Recovery action when get_user() triggers a machine check uses the fixup
> path to make get_user() return -EFAULT.  Also queue_task_work() sets things up
> so that kill_me_maybe() will be called on return to user mode to send a
> SIGBUS to the current process.
> 
> But there are places in the kernel where the code assumes that this
> EFAULT return was simply because of a page fault. The code takes some
> action to fix that, and then retries the access. This results in a second
> machine check.
> 
> While processing this second machine check, queue_task_work() is called
> again. But since this uses the same callback_head structure that
> was used in the first call, the net result is an entry on the
> current->task_works list that points to itself.

Is this happening in pagefault_disable context or normal sleepable fault context?  If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it.

Yes, I realize this is messy, but maybe it’s not that messy. Conceptually, we just (famous last words) need to arrange for an MCE with IF=1 to switch off the IST stack and run like a normal exception.


* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-11 22:11     ` Andy Lutomirski
@ 2021-01-11 22:20       ` Luck, Tony
  2021-01-12 17:00         ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-11 22:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote:
> 
> > On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote:
> > 
> > Recovery action when get_user() triggers a machine check uses the fixup
> > path to make get_user() return -EFAULT.  Also queue_task_work() sets things up
> > so that kill_me_maybe() will be called on return to user mode to send a
> > SIGBUS to the current process.
> > 
> > But there are places in the kernel where the code assumes that this
> > EFAULT return was simply because of a page fault. The code takes some
> > action to fix that, and then retries the access. This results in a second
> > machine check.
> > 
> > While processing this second machine check, queue_task_work() is called
> > again. But since this uses the same callback_head structure that
> > was used in the first call, the net result is an entry on the
> > current->task_works list that points to itself.
> 
> Is this happening in pagefault_disable context or normal sleepable fault context?  If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it.

The first machine check is in pagefault_disable() context.

static int get_futex_value_locked(u32 *dest, u32 __user *from)
{
        int ret;

        pagefault_disable();
        ret = __get_user(*dest, from);
        pagefault_enable();

        return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0;
}

-Tony



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-11 22:20       ` Luck, Tony
@ 2021-01-12 17:00         ` Andy Lutomirski
  2021-01-12 17:16           ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2021-01-12 17:00 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra,
	Darren Hart, Andy Lutomirski, LKML, linux-edac, Linux-MM

> On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote:
>
> On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote:
>>
>>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote:
>>>
>>> Recovery action when get_user() triggers a machine check uses the fixup
>>> path to make get_user() return -EFAULT.  Also queue_task_work() sets things up
>>> so that kill_me_maybe() will be called on return to user mode to send a
>>> SIGBUS to the current process.
>>>
>>> But there are places in the kernel where the code assumes that this
>>> EFAULT return was simply because of a page fault. The code takes some
>>> action to fix that, and then retries the access. This results in a second
>>> machine check.
>>>
>>> While processing this second machine check, queue_task_work() is called
>>> again. But since this uses the same callback_head structure that
>>> was used in the first call, the net result is an entry on the
>>> current->task_works list that points to itself.
>>
>> Is this happening in pagefault_disable context or normal sleepable fault context?  If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it.
>
> The first machine check is in pagefault_disable() context.
>
> static int get_futex_value_locked(u32 *dest, u32 __user *from)
> {
>        int ret;
>
>        pagefault_disable();
>        ret = __get_user(*dest, from);

I have very mixed feelings as to whether we should even try to recover
from the first failure like this.  If we actually want to try to
recover, perhaps we should instead arrange for the second MCE to
recover successfully instead of panicking.

--Andy


>        pagefault_enable();
>
>        return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0;
> }
>
> -Tony



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-12 17:00         ` Andy Lutomirski
@ 2021-01-12 17:16           ` Luck, Tony
  2021-01-12 17:21             ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-12 17:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra,
	Darren Hart, LKML, linux-edac, Linux-MM

On Tue, Jan 12, 2021 at 09:00:14AM -0800, Andy Lutomirski wrote:
> > On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote:
> >
> > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote:
> >>
> >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote:
> >>>
> >>> Recovery action when get_user() triggers a machine check uses the fixup
> >>> path to make get_user() return -EFAULT.  Also queue_task_work() sets things up
> >>> so that kill_me_maybe() will be called on return to user mode to send a
> >>> SIGBUS to the current process.
> >>>
> >>> But there are places in the kernel where the code assumes that this
> >>> EFAULT return was simply because of a page fault. The code takes some
> >>> action to fix that, and then retries the access. This results in a second
> >>> machine check.
> >>>
> >>> While processing this second machine check, queue_task_work() is called
> >>> again. But since this uses the same callback_head structure that
> >>> was used in the first call, the net result is an entry on the
> >>> current->task_works list that points to itself.
> >>
> >> Is this happening in pagefault_disable context or normal sleepable fault context?  If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it.
> >
> > The first machine check is in pagefault_disable() context.
> >
> > static int get_futex_value_locked(u32 *dest, u32 __user *from)
> > {
> >        int ret;
> >
> >        pagefault_disable();
> >        ret = __get_user(*dest, from);
> 
> I have very mixed feelings as to whether we should even try to recover
> from the first failure like this.  If we actually want to try to
> recover, perhaps we should instead arrange for the second MCE to
> recover successfully instead of panicking.

Well we obviously have to "recover" from the first machine check
in order to get to the second. Are you saying you don't like the
different return value from get_user()?

In my initial playing around with this I just had the second machine
check simply skip the task_work_add(). This worked for this case, but
only because there wasn't a third, fourth, etc. access to the poisoned
data. If the caller keeps peeking, then we'll keep taking more machine
checks - possibly forever.

Even if we do recover with just one extra machine check ... that's one
more than was necessary.

-Tony



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-12 17:16           ` Luck, Tony
@ 2021-01-12 17:21             ` Andy Lutomirski
  2021-01-12 18:23               ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2021-01-12 17:21 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM

On Tue, Jan 12, 2021 at 9:16 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> On Tue, Jan 12, 2021 at 09:00:14AM -0800, Andy Lutomirski wrote:
> > > On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote:
> > >
> > > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote:
> > >>
> > >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote:
> > >>>
> > >>> Recovery action when get_user() triggers a machine check uses the fixup
> > >>> path to make get_user() return -EFAULT.  Also queue_task_work() sets things up
> > >>> so that kill_me_maybe() will be called on return to user mode to send a
> > >>> SIGBUS to the current process.
> > >>>
> > >>> But there are places in the kernel where the code assumes that this
> > >>> EFAULT return was simply because of a page fault. The code takes some
> > >>> action to fix that, and then retries the access. This results in a second
> > >>> machine check.
> > >>>
> > >>> While processing this second machine check, queue_task_work() is called
> > >>> again. But since this uses the same callback_head structure that
> > >>> was used in the first call, the net result is an entry on the
> > >>> current->task_works list that points to itself.
> > >>
> > >> Is this happening in pagefault_disable context or normal sleepable fault context?  If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it.
> > >
> > > The first machine check is in pagefault_disable() context.
> > >
> > > static int get_futex_value_locked(u32 *dest, u32 __user *from)
> > > {
> > >        int ret;
> > >
> > >        pagefault_disable();
> > >        ret = __get_user(*dest, from);
> >
> > I have very mixed feelings as to whether we should even try to recover
> > from the first failure like this.  If we actually want to try to
> > recover, perhaps we should instead arrange for the second MCE to
> > recover successfully instead of panicking.
>
> Well we obviously have to "recover" from the first machine check
> in order to get to the second. Are you saying you don't like the
> different return value from get_user()?
>
> In my initial playing around with this I just had the second machine
> check simply skip the task_work_add(). This worked for this case, but
> only because there wasn't a third, fourth, etc. access to the poisoned
> data. If the caller keeps peeking, then we'll keep taking more machine
> checks - possibly forever.
>
> Even if we do recover with just one extra machine check ... that's one
> more than was necessary.

Well, we need to do *something* when the first __get_user() trips the
#MC.  It would be nice if we could actually fix up the page tables
inside the #MC handler, but, if we're in a pagefault_disable() context
we might have locks held.  Heck, we could have the pagetable lock
held, be inside NMI, etc.  Skipping the task_work_add() might actually
make sense if we get a second one.

We won't actually infinite loop in pagefault_disable() context -- if
we would, then we would also infinite loop just from a regular page
fault, too.



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-12 17:21             ` Andy Lutomirski
@ 2021-01-12 18:23               ` Luck, Tony
  2021-01-12 18:57                 ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-12 18:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra,
	Darren Hart, LKML, linux-edac, Linux-MM

On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote:
> Well, we need to do *something* when the first __get_user() trips the
> #MC.  It would be nice if we could actually fix up the page tables
> inside the #MC handler, but, if we're in a pagefault_disable() context
> we might have locks held.  Heck, we could have the pagetable lock
> held, be inside NMI, etc.  Skipping the task_work_add() might actually
> make sense if we get a second one.
> 
> We won't actually infinite loop in pagefault_disable() context -- if
> we would, then we would also infinite loop just from a regular page
> fault, too.

Fixing the page tables inside the #MC handler to unmap the poison
page would indeed be a good solution. But, as you point out, not possible
because of locks.

Could we take a more drastic approach? We know that in this case the kernel
is accessing a user address for the current process. Could the machine
check handler just rewrite %cr3 to point to a kernel-only page table[1]?
I.e. unmap the entire current user process.

Then any subsequent access to user space will page fault. Maybe have a
flag in the task structure to help the #PF handler understand what just
happened.

The code we execute in the task_work handler can restore %cr3

-Tony

[1] Does such a thing already exist? If not, I'd need some help/pointers
to create it.



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-12 18:23               ` Luck, Tony
@ 2021-01-12 18:57                 ` Andy Lutomirski
  2021-01-12 20:52                   ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2021-01-12 18:57 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM

On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote:
> > Well, we need to do *something* when the first __get_user() trips the
> > #MC.  It would be nice if we could actually fix up the page tables
> > inside the #MC handler, but, if we're in a pagefault_disable() context
> > we might have locks held.  Heck, we could have the pagetable lock
> > held, be inside NMI, etc.  Skipping the task_work_add() might actually
> > make sense if we get a second one.
> >
> > We won't actually infinite loop in pagefault_disable() context -- if
> > we would, then we would also infinite loop just from a regular page
> > fault, too.
>
> Fixing the page tables inside the #MC handler to unmap the poison
> page would indeed be a good solution. But, as you point out, not possible
> because of locks.
>
> > Could we take a more drastic approach? We know that in this case the kernel
> is accessing a user address for the current process. Could the machine
> > check handler just rewrite %cr3 to point to a kernel-only page table[1]?
> I.e. unmap the entire current user process.

That seems scary, especially if we're in the middle of a context
switch when this happens.  We *could* make it work, but I'm not at all
convinced it's wise.

>
> Then any subsequent access to user space will page fault. Maybe have a
> flag in the task structure to help the #PF handler understand what just
> happened.
>
> The code we execute in the task_work handler can restore %cr3

This would need to be integrated with something much more local IMO.
Maybe it could be scoped to pagefault_disable()/pagefault_enable()?

--Andy



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-12 18:57                 ` Andy Lutomirski
@ 2021-01-12 20:52                   ` Luck, Tony
  2021-01-12 22:04                     ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-12 20:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra,
	Darren Hart, LKML, linux-edac, Linux-MM

On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote:
> >
> > On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote:
> > > Well, we need to do *something* when the first __get_user() trips the
> > > #MC.  It would be nice if we could actually fix up the page tables
> > > inside the #MC handler, but, if we're in a pagefault_disable() context
> > > we might have locks held.  Heck, we could have the pagetable lock
> > > held, be inside NMI, etc.  Skipping the task_work_add() might actually
> > > make sense if we get a second one.
> > >
> > > We won't actually infinite loop in pagefault_disable() context -- if
> > > we would, then we would also infinite loop just from a regular page
> > > fault, too.
> >
> > Fixing the page tables inside the #MC handler to unmap the poison
> > page would indeed be a good solution. But, as you point out, not possible
> > because of locks.
> >
> > Could we take a more drastic approach? We know that in this case the kernel
> > is accessing a user address for the current process. Could the machine
> > check handler just rewrite %cr3 to point to a kernel-only page table[1]?
> > I.e. unmap the entire current user process.
> 
> That seems scary, especially if we're in the middle of a context
> switch when this happens.  We *could* make it work, but I'm not at all
> convinced it's wise.

Scary? It's terrifying!

But we know that the fault happened in a get_user() or copy_from_user() call
(i.e. an RIP with an extable recovery address).  Does context switch
access user memory?

-Tony



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-12 20:52                   ` Luck, Tony
@ 2021-01-12 22:04                     ` Andy Lutomirski
  2021-01-13  1:50                       ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2021-01-12 22:04 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM


> On Jan 12, 2021, at 12:52 PM, Luck, Tony <tony.luck@intel.com> wrote:
> 
> On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote:
>>> 
>>> On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote:
>>>> Well, we need to do *something* when the first __get_user() trips the
>>>> #MC.  It would be nice if we could actually fix up the page tables
>>>> inside the #MC handler, but, if we're in a pagefault_disable() context
>>>> we might have locks held.  Heck, we could have the pagetable lock
>>>> held, be inside NMI, etc.  Skipping the task_work_add() might actually
>>>> make sense if we get a second one.
>>>> 
>>>> We won't actually infinite loop in pagefault_disable() context -- if
>>>> we would, then we would also infinite loop just from a regular page
>>>> fault, too.
>>> 
>>> Fixing the page tables inside the #MC handler to unmap the poison
>>> page would indeed be a good solution. But, as you point out, not possible
>>> because of locks.
>>> 
>>> Could we take a more drastic approach? We know that in this case the kernel
>>> is accessing a user address for the current process. Could the machine
>>> check handler just rewrite %cr3 to point to a kernel-only page table[1]?
>>> I.e. unmap the entire current user process.
>> 
>> That seems scary, especially if we're in the middle of a context
>> switch when this happens.  We *could* make it work, but I'm not at all
>> convinced it's wise.
> 
> Scary? It's terrifying!
> 
> But we know that the fault happened in a get_user() or copy_from_user() call
> (i.e. an RIP with an extable recovery address).  Does context switch
> access user memory?

No, but NMI can.

The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI.

What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it.

Can we do the cleanup from an interrupt?  IPI-to-self might be a credible approach, if so.



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-12 22:04                     ` Andy Lutomirski
@ 2021-01-13  1:50                       ` Luck, Tony
  2021-01-13  4:15                         ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-13  1:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM

On Tue, Jan 12, 2021 at 02:04:55PM -0800, Andy Lutomirski wrote:
> > But we know that the fault happened in a get_user() or copy_from_user() call
> > (i.e. an RIP with an extable recovery address).  Does context switch
> > access user memory?
> 
> No, but NMI can.
> 
> The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI.
> 
> What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it.
> 
> Can we do the cleanup from an interrupt?  IPI-to-self might be a credible approach, if so.

You seem to be looking for a solution that is entirely contained within
the machine check handling code. Willing to allow for repeated machine
checks from the same poison address in order to achieve that.

I'm opposed to multiple machine checks.  Willing to make some changes
in core code to avoid repeated access to the same poison location.

We need a tie-breaker.

-Tony



* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-13  1:50                       ` Luck, Tony
@ 2021-01-13  4:15                         ` Andy Lutomirski
  2021-01-13 10:00                           ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2021-01-13  4:15 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM



> On Jan 12, 2021, at 5:50 PM, Luck, Tony <tony.luck@intel.com> wrote:
> 
> On Tue, Jan 12, 2021 at 02:04:55PM -0800, Andy Lutomirski wrote:
>>> But we know that the fault happened in a get_user() or copy_from_user() call
>>> (i.e. an RIP with an extable recovery address).  Does context switch
>>> access user memory?
>> 
>> No, but NMI can.
>> 
>> The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI.
>> 
>> What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it.
>> 
>> Can we do the cleanup from an interrupt?  IPI-to-self might be a credible approach, if so.
> 
> You seem to be looking for a solution that is entirely contained within
> the machine check handling code. Willing to allow for repeated machine
> checks from the same poison address in order to achieve that.
> 
> I'm opposed to multiple machine checks.  Willing to make some changes
> in core code to avoid repeated access to the same poison location.

How about more questions before the showdown?

If we made core code changes to avoid this, what would they be?  We really can do user access from NMI and #DB, and those can happen in horrible places. We could have the pagetable lock held, be in the middle of CoWing the very page we tripped over, etc. I think we really can’t count on being able to write to the PTEs from #MC. Similarly, we might have IRQs off, so we can’t broadcast a TLB flush. And we might be in the middle of entry, exit, or CR3 switches, and I don’t see a particularly clean way to write CR3 without risking accidentally returning to user mode with the wrong CR3.

So I’m sort of at a loss as to what we can do.  All user memory accessors can handle failure, and they will all avoid infinite looping.  If we can tolerate repeated MCE, we can survive.  But there’s not a whole lot we can do from these horrible contexts.

Hmm.  Maybe if we have SMAP we could play with EFLAGS.AC?  I can imagine this having various regrettable side effects, though.


* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-13  4:15                         ` Andy Lutomirski
@ 2021-01-13 10:00                           ` Borislav Petkov
  2021-01-13 16:06                             ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-01-13 10:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Luck, Tony, Andy Lutomirski, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM

On Tue, Jan 12, 2021 at 08:15:46PM -0800, Andy Lutomirski wrote:
> So I’m sort of at a loss as to what we can do.

I think you guys have veered off into the weeds with this. The current
solution - modulo error messages not destined for humans :) - is not so
bad, considering the whole MCA trainwreck.

Or am I missing something completely unacceptable?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* RE: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-13 10:00                           ` Borislav Petkov
@ 2021-01-13 16:06                             ` Luck, Tony
  2021-01-13 16:19                               ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-13 16:06 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Andrew Morton, Peter Zijlstra,
	Darren Hart, LKML, linux-edac, Linux-MM

> I think you guys have veered off into the weeds with this. The current
> solution - modulo error messages not destined for humans :) - is not so
> bad, considering the whole MCA trainwreck.
>
> Or am I missing something completely unacceptable?

Maybe the other difference in approach between Andy and me is whether to
go for a solution that covers all the corner cases, or just make an incremental
improvement that allows for recovery in some useful subset of remaining fatal
cases, but still dies in other cases.

I'm happy to replace error messages with ones that are more descriptive and
helpful to humans.

-Tony


* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-13 16:06                             ` Luck, Tony
@ 2021-01-13 16:19                               ` Borislav Petkov
  2021-01-13 16:32                                 ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-01-13 16:19 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM

On Wed, Jan 13, 2021 at 04:06:09PM +0000, Luck, Tony wrote:
> Maybe the other difference in approach between Andy and me is whether to
> go for a solution that covers all the corner cases, or just make an incremental
> improvement that allows for recovery in some useful subset of remaining fatal
> cases, but still dies in other cases.

Does that mean more core code surgery?

> I'm happy to replace error messages with ones that are more descriptive and
> helpful to humans.

Yap, that: "Multiple copyin" with something more understandable to users...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* RE: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-13 16:19                               ` Borislav Petkov
@ 2021-01-13 16:32                                 ` Luck, Tony
  2021-01-13 17:35                                   ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-13 16:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM

>> Maybe the other difference in approach between Andy and me is whether to
>> go for a solution that covers all the corner cases, or just make an incremental
>> improvement that allows for recovery in some useful subset of remaining fatal
>> cases, but still dies in other cases.
>
> Does that mean more core code surgery?

Yes. I need to look at other user accesses inside pagefault_disable/enable()
as likely spots where the code may continue after a machine check and
retry the access.  So expect some more "if (ret == ENXIO) { do something to give up gracefully }"

>> I'm happy to replace error messages with ones that are more descriptive and
>> helpful to humans.
>
> Yap, that: "Multiple copyin" with something more understandable to users...

I'll work on it. We tend not to have essay length messages as panic() strings. But I can
add a comment in the code there so that people who grep whatever panic message
we choose can get more details on what happened and what to do.

-Tony


* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-13 16:32                                 ` Luck, Tony
@ 2021-01-13 17:35                                   ` Borislav Petkov
  0 siblings, 0 replies; 52+ messages in thread
From: Borislav Petkov @ 2021-01-13 17:35 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Andrew Morton,
	Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM

On Wed, Jan 13, 2021 at 04:32:54PM +0000, Luck, Tony wrote:
> I'll work on it. We tend not to have essay length messages as panic() strings. But I can
> add a comment in the code there so that people who grep whatever panic message
> we choose can get more details on what happened and what to do.

I did not mean to write an essay but to simply make it more palatable
for "normal" users. Something like:

"PANIC: Second machine check exception while accessing user data."

or along those lines to explain *why* the machine panics.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup()
  2021-01-11 21:44 ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Tony Luck
                     ` (2 preceding siblings ...)
  2021-01-11 21:44   ` [PATCH v2 3/3] futex, x86/mce: Avoid double machine checks Tony Luck
@ 2021-01-14 17:22   ` Andy Lutomirski
  2021-01-15  0:38   ` [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
  4 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2021-01-14 17:22 UTC (permalink / raw)
  To: Tony Luck
  Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra,
	Darren Hart, Andy Lutomirski, LKML, linux-edac, Linux-MM

On Mon, Jan 11, 2021 at 1:45 PM Tony Luck <tony.luck@intel.com> wrote:
>
> Linux can now recover from machine checks where kernel code is
> doing get_user() to access application memory. But there isn't
> a way to distinguish whether get_user() failed because of a page
> fault or a machine check.
>
> Thus there is a problem if any kernel code thinks it can retry
> an access after doing something that would fix the page fault.
>
> One such example (I'm sure there are more) is in futex_wait_setup()
> where an attempt is made to read the futex with page faults disabled. Then
> a retry follows (after dropping a lock so page faults are safe):
>
>
>         ret = get_futex_value_locked(&uval, uaddr);
>
>         if (ret) {
>                 queue_unlock(*hb);
>
>                 ret = get_user(uval, uaddr);
>
> It would be good to avoid deliberately taking a second machine
> check (especially as the recovery code does really bad things
> and ends up in an infinite loop!).
>
> V2 (thanks to feedback from PeterZ) fixes this by changing get_user() to
> return -ENXIO ("No such device or address") for the case where a machine
> check occurred. Peter left it open which error code to use (suggesting
> "-EMEMERR or whatever name we come up with"). I think the existing ENXIO
> error code seems appropriate (the address being accessed has effectively
> gone away). But I don't have a strong attachment if anyone thinks we
> need a new code.
>
> Callers can check for ENXIO in paths where the access would be
> retried so they can avoid a second machine check.
>

I don't love this -- I'm concerned that there will be some code path
that expects a failing get_user() to return -EFAULT, not -ENXIO.
Also, get_user() *can* return -EFAULT when it hits bad memory even
with your patch if the recovery code manages to yank the PTE before
get_user().

So I tend to think that the machine check code should arrange to
survive some reasonable number of duplicate machine checks.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-11 21:44   ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
  2021-01-11 22:11     ` Andy Lutomirski
@ 2021-01-14 20:22     ` Borislav Petkov
  2021-01-14 21:05       ` Luck, Tony
  1 sibling, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-01-14 20:22 UTC (permalink / raw)
  To: Tony Luck
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Mon, Jan 11, 2021 at 01:44:50PM -0800, Tony Luck wrote:
> @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  				mce_panic("Failed kernel mode recovery", &m, msg);
>  		}
>  
> -		if (m.kflags & MCE_IN_KERNEL_COPYIN)
> +		if (m.kflags & MCE_IN_KERNEL_COPYIN) {
> +			if (current->mce_busy)
> +				mce_panic("Multiple copyin", &m, msg);

So this: we're currently busy handling the first MCE, so why must we
panic?

Can we simply ignore all follow-up MCEs to that page?

I.e., the page will get poisoned eventually and that poisoning is
currently executing so all following MCEs are simply nothing new and we
can ignore them.

It's not like we're going to corrupt more data - we already are
"corrupting" the whole 4K.

Am I making sense?

Because if we do this, we won't have to pay attention to any get_user()
callers and whatnot - we simply ignore and the solution is simple and
you won't have to touch any get_user() callers...
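
Roughly, as a sketch of the idea only (reusing the mce_busy flag from
your patch), queue_task_work() could do:

	/* Follow-up #MC to the page we are already handling: ignore it */
	if (current->mce_busy &&
	    (current->mce_addr >> PAGE_SHIFT) == (m->addr >> PAGE_SHIFT))
		return;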

Hmmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-14 20:22     ` Borislav Petkov
@ 2021-01-14 21:05       ` Luck, Tony
  0 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2021-01-14 21:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Thu, Jan 14, 2021 at 09:22:13PM +0100, Borislav Petkov wrote:
> On Mon, Jan 11, 2021 at 01:44:50PM -0800, Tony Luck wrote:
> > @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >  				mce_panic("Failed kernel mode recovery", &m, msg);
> >  		}
> >  
> > -		if (m.kflags & MCE_IN_KERNEL_COPYIN)
> > +		if (m.kflags & MCE_IN_KERNEL_COPYIN) {
> > +			if (current->mce_busy)
> > +				mce_panic("Multiple copyin", &m, msg);
> 
> So this: we're currently busy handling the first MCE, so why must we
> panic?
> 
> Can we simply ignore all follow-up MCEs to that page?

If we s/all/some/, you are saying the same as Andy:
> So I tend to think that the machine check code should arrange to
> survive some reasonable number of duplicate machine checks.

> I.e., the page will get poisoned eventually and that poisoning is
> currently executing so all following MCEs are simply nothing new and we
> can ignore them.
> 
> It's not like we're going to corrupt more data - we already are
> "corrupting" whole 4K.
> 
> Am I making sense?
> 
> Because if we do this, we won't have to pay attention to any get_user()
> callers and whatnot - we simply ignore and the solution is simple and
> you won't have to touch any get_user() callers...

Changing get_user() is a can of worms. I don't think it's a very big can.
Perhaps two or three dozen places where code needs to change to account
for the -ENXIO return ... but since it touches a bunch of different subsystems
it is likely to take a while to get everyone in agreement.

I'll try out this new approach, and if it works, I'll post a v3 patch.

Thanks

-Tony


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-11 21:44 ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Tony Luck
                     ` (3 preceding siblings ...)
  2021-01-14 17:22   ` [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() Andy Lutomirski
@ 2021-01-15  0:38   ` Tony Luck
  2021-01-15 15:27     ` Borislav Petkov
  4 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2021-01-15  0:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart,
	Andy Lutomirski, linux-kernel, linux-edac, linux-mm

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);
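
To illustrate (assuming nothing else was queued on the list in
between), the second task_work_add() call leaves the list as:

	current->task_works == &current->mce_kill_me
	current->mce_kill_me.next == &current->mce_kill_me

so in the loop above work->next == work, "work = next" never advances,
and the loop never terminates.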

Add a "mce_busy" counter so that task_work_add() is only called once
per faulty page in this task.

Do not allow too many repeated machine checks, or machine checks to
a different page from the first.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

V3: Thanks to extensive commentary from Andy & Boris

Throws out the changes to get_user() and subsequent changes to core
code. Everything is now handled in the machine check code. Downside is
that we can (and do) take multiple machine checks from a single poisoned
page before generic kernel code finally gets the message that a page is
really and truly gone (but all the failed get_user() calls still return
the legacy -EFAULT code, so none of that code will ever mistakenly use
a value from a bad page). But even on an old machine that does broadcast
interrupts for each machine check, things survive multiple cycles of my
test injection into a futex operation.

I picked "10" as the magic upper limit for how many times the machine
check code will allow a fault from the same page before deciding to
panic.  We can bike shed that value if you like.

 arch/x86/kernel/cpu/mce/core.c | 27 ++++++++++++++++++++-------
 include/linux/sched.h          |  1 +
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..25daf6517dc9 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1246,6 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1266,12 +1267,24 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	if (current->mce_count++ == 0) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	if (current->mce_count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	if (current->mce_count > 1 || (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (current->mce_count > 1)
+		return;
 
 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1427,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1432,7 +1445,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..386366c9c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1362,6 +1362,7 @@ struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES
-- 
2.21.1



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-15  0:38   ` [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
@ 2021-01-15 15:27     ` Borislav Petkov
  2021-01-15 19:34       ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-01-15 15:27 UTC (permalink / raw)
  To: Tony Luck
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Thu, Jan 14, 2021 at 04:38:17PM -0800, Tony Luck wrote:
> Recovery action when get_user() triggers a machine check uses the fixup
> path to make get_user() return -EFAULT.  Also queue_task_work() sets up
> so that kill_me_maybe() will be called on return to user mode to send a
> SIGBUS to the current process.
> 
> But there are places in the kernel where the code assumes that this
> EFAULT return was simply because of a page fault. The code takes some
> action to fix that, and then retries the access. This results in a second
> machine check.
> 
> While processing this second machine check queue_task_work() is called
> again. But since this uses the same callback_head structure that
> was used in the first call, the net result is an entry on the
> current->task_works list that points to itself. When task_work_run()
> is called it loops forever in this code:
> 
> 		do {
> 			next = work->next;
> 			work->func(work);
> 			work = next;
> 			cond_resched();
> 		} while (work);
> 
> Add a "mce_busy" counter so that task_work_add() is only called once
> per faulty page in this task.

Yeah, that sentence can be removed now too.

> Do not allow too many repeated machine checks, or machine checks to
> a different page from the first.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> V3: Thanks to extensive commentary from Andy & Boris
> 
> Throws out the changes to get_user() and subsequent changes to core
> code. Everything is now handled in the machine check code. Downside is
> that we can (and do) take multiple machine checks from a single poisoned
> page before generic kernel code finally gets the message that a page is
> really and truly gone (but all the failed get_user() calls still return
> the legacy -EFAULT code, so none of that code will ever mistakenly use
> a value from a bad page). But even on an old machine that does broadcast
> interrupts for each machine check, things survive multiple cycles of my
> test injection into a futex operation.

Nice.

> 
> I picked "10" as the magic upper limit for how many times the machine
> check code will allow a fault from the same page before deciding to
> panic.  We can bike shed that value if you like.
> 
>  arch/x86/kernel/cpu/mce/core.c | 27 ++++++++++++++++++++-------
>  include/linux/sched.h          |  1 +
>  2 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 13d3f1cbda17..25daf6517dc9 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1246,6 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb)
>  	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
>  	int flags = MF_ACTION_REQUIRED;
>  
> +	p->mce_count = 0;
>  	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
>  
>  	if (!p->mce_ripv)
> @@ -1266,12 +1267,24 @@ static void kill_me_maybe(struct callback_head *cb)
>  	}
>  }
>  
> -static void queue_task_work(struct mce *m, int kill_current_task)
> +static void queue_task_work(struct mce *m, char *msg, int kill_current_task)

So this function gets called in the user mode MCE case too:

	if ((m.cs & 3) == 3) {

		queue_task_work(&m, msg, kill_current_task);
	}

Do we want to panic for multiple MCEs to different addresses in user
mode?

I don't think so - that should go down the memory failure page
offlining path...

> -	current->mce_addr = m->addr;
> -	current->mce_kflags = m->kflags;
> -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> -	current->mce_whole_page = whole_page(m);
> +	if (current->mce_count++ == 0) {
> +		current->mce_addr = m->addr;
> +		current->mce_kflags = m->kflags;
> +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> +		current->mce_whole_page = whole_page(m);
> +	}
> +

	/* Magic number should be large enough */

> +	if (current->mce_count > 10)
> +		mce_panic("Too many machine checks while accessing user data", m, msg);
> +
> +	if (current->mce_count > 1 || (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> +		mce_panic("Machine checks to different user pages", m, msg);

Will this second part of the test expression, after the "||" ever hit?

You do above in the first branch:

	if (current->mce_count++ == 0) {

		...

		current->mce_addr = m->addr;

and ->mce_count becomes 1.

In that case that

	(current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT)

gets tested but that won't ever be true because ->mce_addr = ->addr
above.

And then, for other values of mce_count, mce_count > 1 will hit.

In any case, what are you trying to catch with this? Two get_user() to
different pages both catching MCEs?

> +
> +	/* Do not call task_work_add() more than once */
> +	if (current->mce_count > 1)
> +		return;

That won't happen either, AFAICT. It'll panic above.

Regardless, I like how this is all confined to the MCE code and there's
no need to touch stuff outside...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-15 15:27     ` Borislav Petkov
@ 2021-01-15 19:34       ` Luck, Tony
  2021-01-15 20:51         ` [PATCH v4] " Luck, Tony
  2021-01-18 15:39         ` [PATCH v3] " Borislav Petkov
  0 siblings, 2 replies; 52+ messages in thread
From: Luck, Tony @ 2021-01-15 19:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Fri, Jan 15, 2021 at 04:27:54PM +0100, Borislav Petkov wrote:
> On Thu, Jan 14, 2021 at 04:38:17PM -0800, Tony Luck wrote:
> > Add a "mce_busy" counter so that task_work_add() is only called once
> > per faulty page in this task.
> 
> Yeah, that sentence can be removed now too.

I will update it with the new name "mce_count" and some details.

> > -static void queue_task_work(struct mce *m, int kill_current_task)
> > +static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
> 
> So this function gets called in the user mode MCE case too:
> 
> 	if ((m.cs & 3) == 3) {
> 
> 		queue_task_work(&m, msg, kill_current_task);
> 	}
> 
> Do we want to panic for multiple MCEs to different addresses in user
> mode?

In the user mode case we should only bump mce_count to "1"
before task_work() gets called. It shouldn't hurt to do the
same checks. Maybe it will catch something weird - like an NMI
handler on return from the machine check doing a get_user() that
hits another machine check during the return from this machine check.

AndyL has made me extra paranoid. :-)

> > -	current->mce_addr = m->addr;
> > -	current->mce_kflags = m->kflags;
> > -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > -	current->mce_whole_page = whole_page(m);
> > +	if (current->mce_count++ == 0) {
> > +		current->mce_addr = m->addr;
> > +		current->mce_kflags = m->kflags;
> > +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > +		current->mce_whole_page = whole_page(m);
> > +	}
> > +
> 
> 	/* Magic number should be large enough */
>
> > +	if (current->mce_count > 10)

Will add a similar comment here ... and to other tests in this function
since it may not be obvious to me next year what I was thinking now :-)

> > +	if (current->mce_count > 10)
> > +		mce_panic("Too many machine checks while accessing user data", m, msg);
> > +
> > +	if (current->mce_count > 1 || (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > +		mce_panic("Machine checks to different user pages", m, msg);
> 
> Will this second part of the test expression, after the "||" ever hit?

No :-( This code is wrong. Should be "&&" not "||". Then it makes more sense.
Will fix for v4.

> In any case, what are you trying to catch with this? Two get_user() to
> different pages both catching MCEs?

Yes. Trying to catch two accesses to different pages. Need to do this
because kill_me_maybe() is only going to offline one page.

I'm not expecting that this would ever hit.  It means that calling code
took a machine check on one page and get_user() said -EFAULT. Then the
code decided to access a different page *and* that other page also triggered
a machine check.

> > +	/* Do not call task_work_add() more than once */
> > +	if (current->mce_count > 1)
> > +		return;
> 
> That won't happen either, AFAICT. It'll panic above.

With the s/||/&&/ above, we can get here.
> 
> Regardless, I like how this is all confined to the MCE code and there's
> no need to touch stuff outside...

Thanks for the review.

-Tony


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-15 19:34       ` Luck, Tony
@ 2021-01-15 20:51         ` Luck, Tony
  2021-01-15 23:23           ` Luck, Tony
  2021-01-18 15:39         ` [PATCH v3] " Borislav Petkov
  1 sibling, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-15 20:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before task_work() is called. The first machine check saves the address information
and calls task_work_add(). Subsequent machine checks before that task_work
callback is executed check that the address is in the same page as the first
machine check (since the callback will offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever,
enforce a limit of 10.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

V4:
	Fix bug with "||" where I meant "&&"
	Update stale commit comment referring to mce_busy field
	in task structure (now called mce_count).
	Add some comments to queue_task_work()

 arch/x86/kernel/cpu/mce/core.c | 31 ++++++++++++++++++++++++-------
 include/linux/sched.h          |  1 +
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..5460c146edb5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
 
@@ -1246,6 +1247,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1266,12 +1268,27 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	/* First call, save all the details */
+	if (current->mce_count++ == 0) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
+	if (current->mce_count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (current->mce_count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (current->mce_count > 1)
+		return;
 
 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1431,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1432,7 +1449,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..386366c9c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1362,6 +1362,7 @@ struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES
-- 
2.21.1



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-15 20:51         ` [PATCH v4] " Luck, Tony
@ 2021-01-15 23:23           ` Luck, Tony
  2021-01-19 10:56             ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-15 23:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
>  static void kill_me_now(struct callback_head *ch)
>  {
> +	p->mce_count = 0;
>  	force_sig(SIGBUS);
>  }

Brown paper bag time ... I just pasted that line from kill_me_maybe()
and I thought I did a re-compile ... but obviously not since it gives

error: ‘p’ undeclared (first use in this function)

Option a) (just like kill_me_maybe)

struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);

Option b) (simpler ... not sure why PeterZ did the container_of thing)

	current->mce_count = 0;

-Tony


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-15 19:34       ` Luck, Tony
  2021-01-15 20:51         ` [PATCH v4] " Luck, Tony
@ 2021-01-18 15:39         ` Borislav Petkov
  1 sibling, 0 replies; 52+ messages in thread
From: Borislav Petkov @ 2021-01-18 15:39 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Fri, Jan 15, 2021 at 11:34:35AM -0800, Luck, Tony wrote:
> In the user mode case we should only bump mce_count to "1"
> before task_work() gets called.

Ok, right, it should not be possible to trigger a second MCE while
queue_task_work() runs when it is a user MCE. The handler itself won't
touch the page with the hw error so our assumption is that it'll get
poisoned.

If it doesn't, I guess the memory failure code will kill the process
yadda yadda...

> It shouldn't hurt to do the same checks. Maybe it will catch something
> weird - like an NMI handler on return from the machine check doing a
> get_user() that hits another machine check during the return from this
> machine check.

Eww.

> AndyL has made me extra paranoid. :-)

Yeah, he comes up with the nuttiest scenarios. :-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-15 23:23           ` Luck, Tony
@ 2021-01-19 10:56             ` Borislav Petkov
  2021-01-19 23:57               ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-01-19 10:56 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Fri, Jan 15, 2021 at 03:23:46PM -0800, Luck, Tony wrote:
> On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> >  static void kill_me_now(struct callback_head *ch)
> >  {
> > +	p->mce_count = 0;
> >  	force_sig(SIGBUS);
> >  }
> 
> Brown paper bag time ... I just pasted that line from kill_me_maybe()
> and I thought I did a re-compile ... but obviously not since it gives
> 
> error: ‘p’ undeclared (first use in this function)
> 
> Option a) (just like kill_me_maybe)
> 
> struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> 
> Option b) (simpler ... not sure why PeterZ did the container_of thing)
> 
> 	current->mce_count = 0;

Right, he says it is the canonical way to get it out of callback_head.
I don't think current will change while the #MC handler runs but we can
adhere to the design pattern here and do container_of() ...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-19 10:56             ` Borislav Petkov
@ 2021-01-19 23:57               ` Luck, Tony
  2021-01-20 12:18                 ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-19 23:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Tue, Jan 19, 2021 at 11:56:32AM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 03:23:46PM -0800, Luck, Tony wrote:
> > On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> > >  static void kill_me_now(struct callback_head *ch)
> > >  {
> > > +	p->mce_count = 0;
> > >  	force_sig(SIGBUS);
> > >  }
> > 
> > Brown paper bag time ... I just pasted that line from kill_me_maybe()
> > and I thought I did a re-compile ... but obviously not since it gives
> > 
> > error: ‘p’ undeclared (first use in this function)
> > 
> > Option a) (just like kill_me_maybe)
> > 
> > struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> > 
> > Option b) (simpler ... not sure why PeterZ did the container_of thing)
> > 
> > 	current->mce_count = 0;
> 
> Right, he says it is the canonical way to get it out of callback_head.
> I don't think current will change while the #MC handler runs but we can
> adhere to the design pattern here and do container_of() ...

Ok ... I'll use the canonical way.

But now I've run into a weird issue. I'd run some basic tests with a
dozen machine checks in each of:
1) user access
2) kernel copyin
3) futex (multiple accesses from kernel before task_work())

and it passed my tests before I posted.

But the real validation folks took my patch and found that it has
destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
more times). System either hangs or panics. Generally before 100
injection/consumption cycles.

Their tests are still just doing one at a time (i.e. complete recovery
of one machine check before injecting the next error). So there aren't
any complicated race conditions.

So if you see anything obviously broken, let me know. Otherwise I'll
be poking around at the patch to figure out what is wrong.

-Tony


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-19 23:57               ` Luck, Tony
@ 2021-01-20 12:18                 ` Borislav Petkov
  2021-01-20 17:17                   ` Luck, Tony
  2021-01-21 21:09                   ` Luck, Tony
  0 siblings, 2 replies; 52+ messages in thread
From: Borislav Petkov @ 2021-01-20 12:18 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Tue, Jan 19, 2021 at 03:57:59PM -0800, Luck, Tony wrote:
> But the real validation folks took my patch and found that it has
> destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
> more times). System either hangs or panics. Generally before 100
> injection/consumption cycles.

Hmm, that mce_count increment and check stuff might turn out to be
fragile in the face of a race condition. But I don't see it...

> So if you see anything obviously broken, let me know. Otherwise I'll
> be poking around at the patch to figure out what is wrong.

Yah, some printk sprinkling might be a good start. But debugging in that
atomic context is always nasty. ;-\

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-20 12:18                 ` Borislav Petkov
@ 2021-01-20 17:17                   ` Luck, Tony
  2021-01-21 21:09                   ` Luck, Tony
  1 sibling, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2021-01-20 17:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

> Yah, some printk sprinkling might be a good start. But debugging in that
> atomic context is always nasty. ;-\

I did some very light printk sprinkling (one message in queue_task_work() in atomic
context, one each in kill_me_now() and kill_me_maybe() to check when task_work
actually called them).

Cases 1 & 2 (user & normal copyin) now work just fine (hundreds of iterations).

Case 3 (my futex test) just hangs with only two characters making it to the serial port "[ "

Deeply strange.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-20 12:18                 ` Borislav Petkov
  2021-01-20 17:17                   ` Luck, Tony
@ 2021-01-21 21:09                   ` Luck, Tony
  2021-01-25 22:55                     ` [PATCH v5] " Luck, Tony
  1 sibling, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-21 21:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Wed, Jan 20, 2021 at 01:18:12PM +0100, Borislav Petkov wrote:
> On Tue, Jan 19, 2021 at 03:57:59PM -0800, Luck, Tony wrote:
> > But the real validation folks took my patch and found that it has
> > destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
> > more times). System either hangs or panics. Generally before 100
> > injection/consumption cycles.
> 
> Hmm, that mce_count increment and check stuff might turn out to be
> fragile in the face of a race condition. But I don't see it...

I can't see what the race is either.  As far as I can see the flow
looks like this (for a machine check in user mode):

	user-mode

		do_machine_check() [in atomic context on IST etc.]

			queue_task_work()
				increments current->mce_count
				first (only) call saves address etc. and calls task_work_add()
		return from machine check

Plausibly we might context switch to another process here. We could even
resume on a different CPU.

		do_task_work() [in process context]

			kill_me_maybe()
				sets mce_count back to zero, ready for another #MC
				memory_failure()
				send SIGBUS

The kernel get_user() case is similar, but may take extra machine checks
before the calling code decides get_user() really is going to keep saying -EFAULT.
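
In the same notation, the kernel get_user() case with the patch below
looks roughly like this (a sketch based on my futex test):

	kernel get_user() [page faults disabled]

		do_machine_check()
			queue_task_work()
				mce_count becomes 1, details saved, task_work_add()
		fixup makes get_user() return -EFAULT

	caller drops locks, retries get_user() [page faults enabled]

		do_machine_check()
			queue_task_work()
				mce_count becomes 2, same page, so just return

	return to user mode

		kill_me_maybe()
			sets mce_count back to zero
			memory_failure(), send SIGBUS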


But, on a whim, I changed the type of mce_count from "int" to "atomic_t" and
fixed up the increment & clear to use atomic_inc_return() and atomic_set().
See updated patch below.

I can't see why this should have made any difference. But the "int" version
crashes in a few dozen machine checks. The "atomic_t" version has run many
thousands of machine checks (10213 in the current boot according to /proc/interrupts)
with no issues.

-Tony

From d1b003f1de92f87c8dc53dd710fd79ad6e277364 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Wed, 20 Jan 2021 12:40:50 -0800
Subject: [PATCH] x86/mce: Avoid infinite loop for copy from user recovery

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before task_work() is called. The first machine check saves the address information
and calls task_work_add(). Subsequent machine checks before that task_work
callback is executed check that the address is in the same page as the first
machine check (since the callback will offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever,
enforce a limit of 10.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 35 +++++++++++++++++++++++++++-------
 include/linux/sched.h          |  1 +
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..4a8660058b67 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	atomic_set(&p->mce_count, 0);
 	force_sig(SIGBUS);
 }
 
@@ -1246,6 +1249,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	atomic_set(&p->mce_count, 0);
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1266,12 +1270,29 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	int count = atomic_inc_return(&current->mce_count);
+
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (count > 1)
+		return;
 
 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1435,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1432,7 +1453,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..5727d59ab737 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1362,6 +1362,7 @@ struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	atomic_t			mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES
-- 
2.21.1



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-21 21:09                   ` Luck, Tony
@ 2021-01-25 22:55                     ` Luck, Tony
  2021-01-26 11:03                       ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-25 22:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Thu, Jan 21, 2021 at 01:09:59PM -0800, Luck, Tony wrote:
> On Wed, Jan 20, 2021 at 01:18:12PM +0100, Borislav Petkov wrote:
> But, on a whim, I changed the type of mce_count from "int" to "atomic_t" and
> fixed up the increment & clear to use atomic_inc_return() and atomic_set().
> See updated patch below.
> 
> I can't see why this should have made any difference. But the "int" version
> crashes in a few dozen machine checks. The "atomic_t" version has run many
> thousands of machine checks (10213 in the current boot according to /proc/interrupts)
> with no issues.

And now I've changed it back to non-atomic (but keeping the
slightly cleaner looking code style that I used for the atomic
version).  This one also works for thousands of injections and
recoveries.  Maybe take it now before it stops working again :-)

-Tony

From: Tony Luck <tony.luck@intel.com>

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before task_work() is called. The first machine check saves the address information
and calls task_work_add(). Subsequent machine checks before that task_work
callback is executed check that the address is in the same page as the first
machine check (since the callback will offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever,
enforce a limit of 10.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 35 +++++++++++++++++++++++++++-------
 include/linux/sched.h          |  1 +
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..30dedffd6f88 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
 
@@ -1246,6 +1249,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1266,12 +1270,29 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	int count = ++current->mce_count;
+
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (count > 1)
+		return;
 
 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1435,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1432,7 +1453,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..386366c9c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1362,6 +1362,7 @@ struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-25 22:55                     ` [PATCH v5] " Luck, Tony
@ 2021-01-26 11:03                       ` Borislav Petkov
  2021-01-26 22:36                         ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-01-26 11:03 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> And now I've changed it back to non-atomic (but keeping the
> slightly cleaner looking code style that I used for the atomic
> version).  This one also works for thousands of injections and
> recoveries.  Maybe take it now before it stops working again :-)

Hmm, so the only differences I see between your v4 and this are:

-@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
+@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
  
  static void kill_me_now(struct callback_head *ch)
  {
++	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
++
 +	p->mce_count = 0;
  	force_sig(SIGBUS);
  }

Could the container_of() macro have changed something?

Because we don't know yet (right?) why it would fail. Would it read
stale ->mce_count data? If so, then a barrier is missing somewhere.

Or what is the failure exactly?

Because if I take it now without us knowing what the issue is, it will
start failing somewhere - Murphy's our friend - and then we'll have to
deal with breaking people's boxes. Not fun.

The other difference is:

@@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
 -	current->mce_kflags = m->kflags;
 -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
 -	current->mce_whole_page = whole_page(m);
++	int count = ++current->mce_count;
++
 +	/* First call, save all the details */
-+	if (current->mce_count++ == 0) {
++	if (count == 1) {
 +		current->mce_addr = m->addr;
 +		current->mce_kflags = m->kflags;
 +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);

Hmm, a local variable and a pre-increment. Can that have an effect somehow?

> +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */

Typo: likely.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-26 11:03                       ` Borislav Petkov
@ 2021-01-26 22:36                         ` Luck, Tony
  2021-01-28 17:57                           ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-01-26 22:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Tue, Jan 26, 2021 at 12:03:14PM +0100, Borislav Petkov wrote:
> On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> > And now I've changed it back to non-atomic (but keeping the
> > slightly cleaner looking code style that I used for the atomic
> > version).  This one also works for thousands of injections and
> > recoveries.  Maybe take it now before it stops working again :-)
> 
> Hmm, so the only differences I see between your v4 and this are:
> 
> -@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> +@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>   
>   static void kill_me_now(struct callback_head *ch)
>   {
> ++	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
> ++
>  +	p->mce_count = 0;
>   	force_sig(SIGBUS);
>   }
> 
> Could the container_of() macro have changed something?

That change was to fix my brown paper bag moment (the code does not
compile without a variable named "p" in scope to be used on the
next line).

> Because we don't know yet (right?) why it would fail. Would it read
> stale ->mce_count data? If so, then a barrier is missing somewhere.

I don't see how a barrier would make a difference. In the common case
all this code is executed on the same logical CPU. The return from
do_machine_check() tries to return to user mode and finds that there
is some "task_work" to execute first.

In some cases Linux might context switch to something else. Perhaps
this task even gets picked up by another CPU to run the queued
task_work functions.  But I imagine that the context switch should act
as a barrier ... shouldn't it?

> Or what is the failure exactly?

After a few cycles of the test injection to user mode, I saw an
overflow in the machine check bank. As if it hadn't been cleared
from the previous iteration ... but all the banks are cleared as
soon as we find that the machine check is recoverable, which happens
a while before getting to the code I changed.

When the tests were failing, code was on top of v5.11-rc3. Latest
experiments moved to -rc5.  There's just a tracing fix from
PeterZ between rc3 and rc5 to mce/core.c:

737495361d44 ("x86/mce: Remove explicit/superfluous tracing")

which doesn't appear to be a candidate for the problems I saw.

> Because if I take it now without us knowing what the issue is, it will
> start failing somewhere - Murphy's our friend - and then we'll have to
> deal with breaking people's boxes. Not fun.

Fair point.

> The other difference is:
> 
> @@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
>  -	current->mce_kflags = m->kflags;
>  -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
>  -	current->mce_whole_page = whole_page(m);
> ++	int count = ++current->mce_count;
> ++
>  +	/* First call, save all the details */
> -+	if (current->mce_count++ == 0) {
> ++	if (count == 1) {
>  +		current->mce_addr = m->addr;
>  +		current->mce_kflags = m->kflags;
>  +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> 
> Hmm, a local variable and a pre-increment. Can that have an effect somehow?

This is the bit that changed during my detour using atomic_t mce_count.
I added the local variable to capture the value from atomic_inc_return(), then
used it later, instead of a bunch of atomic_read() calls.

I kept it this way because "if (count == 1)" is marginally easier to read
than "if (current->mce_count++ == 0)"

> > +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
> 
> Typo: likely.

Oops. Fixed.

-Tony


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-26 22:36                         ` Luck, Tony
@ 2021-01-28 17:57                           ` Borislav Petkov
  2021-02-01 18:58                             ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-01-28 17:57 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Tue, Jan 26, 2021 at 02:36:05PM -0800, Luck, Tony wrote:
> In some cases Linux might context switch to something else. Perhaps
> this task even gets picked up by another CPU to run the queued
> task_work functions.  But I imagine that the context switch should act
> as a barrier ... shouldn't it?

I'm given to understand that the #MC from user is likely to schedule, and
a context switch has a barrier character.

> After a few cycles of the test injection to user mode, I saw an
> overflow in the machine check bank. As if it hadn't been cleared
> from the previous iteration ...

This sounds weird. As if something else is happening which we haven't
thought of yet...

> When the tests were failing, code was on top of v5.11-rc3. Latest
> experiments moved to -rc5.  There's just a tracing fix from
> PeterZ between rc3 and rc5 to mce/core.c:
> 
> 737495361d44 ("x86/mce: Remove explicit/superfluous tracing")
> 
> which doesn't appear to be a candidate for the problems I saw.

Doesn't look like it.

> This is the bit that changed during my detour using atomic_t mce_count.
> I added the local variable to capture the value from atomic_inc_return(), then
> used it later, instead of a bunch of atomic_read() calls.
> 
> I kept it this way because "if (count == 1)" is marginally easier to read
> than "if (current->mce_count++ == 0)"

Right.

So still no explanation why it would fail before. ;-\

Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
if you apply the patch on -rc3 and it explodes and if you apply the same
patch on -rc5 and it works, then that could be a start... Yeah, don't
have a better idea here. :-\

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-01-28 17:57                           ` Borislav Petkov
@ 2021-02-01 18:58                             ` Luck, Tony
  2021-02-02 11:01                               ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-02-01 18:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Thu, Jan 28, 2021 at 06:57:35PM +0100, Borislav Petkov wrote:
> Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
> if you apply the patch on -rc3 and it explodes and if you apply the same
> patch on -rc5 and it works, then that could be a start... Yeah, don't
> have a better idea here. :-\

I tried reproducing (applied the original patch I posted back to -rc3) and
the same issue stubbornly refused to show up again.

But I did hit something with the same signature (overflow bit set in
bank 1) while running my futex test (which has two processes mapping
the poison page).  This time I *do* understand what happened.  The test
failed when the two processes were running on the two hyperthreads of
the same core. Seeing overflow in this case is understandable because
bank 1 MSRs on my test machine are shared between the HT threads. When
I run the test again using taskset(1) to allow running only on
thread 0 of each core, it keeps going for hundreds of iterations.
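
(taskset -c <cpulist> does this from the shell; inside the test program
the equivalent would be roughly the sketch below, where the CPU number
chosen is an assumption about this particular machine's topology:)

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        static void pin_to_cpu(int cpu)
        {
                cpu_set_t set;

                CPU_ZERO(&set);
                CPU_SET(cpu, &set);     /* e.g. 0 = thread 0 of core 0 */
                if (sched_setaffinity(0, sizeof(set), &set))
                        perror("sched_setaffinity");
        }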

I'm not sure I can stitch together how this overflow also happened for
my single process test. Maybe a migration from one HT thread to the
other at an awkward moment?

-Tony



* Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-02-01 18:58                             ` Luck, Tony
@ 2021-02-02 11:01                               ` Borislav Petkov
  2021-02-02 16:04                                 ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-02-02 11:01 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Mon, Feb 01, 2021 at 10:58:12AM -0800, Luck, Tony wrote:
> On Thu, Jan 28, 2021 at 06:57:35PM +0100, Borislav Petkov wrote:
> > Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
> > if you apply the patch on -rc3 and it explodes and if you apply the same
> > patch on -rc5 and it works, then that could be a start... Yeah, don't
> > have a better idea here. :-\
> 
> I tried reproducing (applied the original patch I posted back to -rc3) and
> the same issue stubbornly refused to show up again.
> 
> But I did hit something with the same signature (overflow bit set in
> bank 1) while running my futex test (which has two processes mapping
> the poison page).  This time I *do* understand what happened.  The test
> failed when the two processes were running on the two hyperthreads of
> the same core. Seeing overflow in this case is understandable because
> bank 1 MSRs on my test machine are shared between the HT threads. When
> I run the test again using taskset(1) to allow running only on
> thread 0 of each core, it keeps going for hundreds of iterations.
> 
> I'm not sure I can stitch together how this overflow also happened for
> my single process test. Maybe a migration from one HT thread to the
> other at an awkward moment?

Sounds plausible.

And the much more important question is, what is the code supposed to
do when that overflow *actually* happens in real life? Because IINM,
an overflow condition on the same page would mean killing the task to
contain the error and not killing the machine...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* RE: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-02-02 11:01                               ` Borislav Petkov
@ 2021-02-02 16:04                                 ` Luck, Tony
  2021-02-02 21:06                                   ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-02-02 16:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

> And the much more important question is, what is the code supposed to
> do when that overflow *actually* happens in real life? Because IINM,
> an overflow condition on the same page would mean killing the task to
> contain the error and not killing the machine...

Correct. In the cases I've actually hit, the second machine check is on the
same address as the first. But from a recovery perspective Linux is going
to take away the whole page anyway ... so not complaining if the second
(or subsequent) access is within the same page makes sense (and that's
what the patch does).

The code can't handle it if a subsequent #MC is to a different page (because
we only have a single spot in the task structure to store the physical page
address).  But that looks adequate. If the code is wildly accessing different
pages *and* getting machine checks from those different pages ... then
something is very seriously wrong with the system.
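
The single-page limitation boils down to this check from the patch:

        /* second or later #MC must be in the same page as the first */
        if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
                mce_panic("Machine checks to different user pages", m, msg);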

-Tony



* Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-02-02 16:04                                 ` Luck, Tony
@ 2021-02-02 21:06                                   ` Borislav Petkov
  2021-02-02 22:12                                     ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2021-02-02 21:06 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

On Tue, Feb 02, 2021 at 04:04:17PM +0000, Luck, Tony wrote:
> > And the much more important question is, what is the code supposed to
> > do when that overflow *actually* happens in real life? Because IINM,
> > an overflow condition on the same page would mean killing the task to
> > contain the error and not killing the machine...
> 
> Correct. In the cases I've actually hit, the second machine check is on the
> same address as the first. But from a recovery perspective Linux is going
> to take away the whole page anyway ... so not complaining if the second
> (or subsequent) access is within the same page makes sense (and that's
> what the patch does).
> 
> The code can't handle it if a subsequent #MC is to a different page (because
> we only have a single spot in the task structure to store the physical page
> address).  But that looks adequate. If the code is wildly accessing different
> pages *and* getting machine checks from those different pages ... then
> something is very seriously wrong with the system.

Right, that's clear.

But I meant something else: somewhere upthread this thread had:

"But the real validation folks took my patch and found that it has
destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
more times). System either hangs or panics. Generally before 100
injection/consumption cycles."

To which I asked:

"Or what is the failure exactly?"

and you said:

"After a few cycles of the test injection to user mode, I saw an
overflow in the machine check bank. As if it hadn't been cleared from
the previous iteration ... but all the banks are cleared as soon as we
find that the machine check is recoverable, well before getting to
the code I changed."

And you also said:

"I tried reporoducing (applied the original patch I posted back to -rc3)
and the same issue stubbornly refused to show up again."

So that "system hang or panic" which the validation folks triggered,
that cannot be reproduced anymore? Did they run the latest version of
the patch?

And the overflow issue is something else which the patch handles fine,
as you say.

So that original mysterious hang is still unsolved.

Or am I missing something and am misrepresenting the situation?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* RE: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
  2021-02-02 21:06                                   ` Borislav Petkov
@ 2021-02-02 22:12                                     ` Luck, Tony
  0 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2021-02-02 22:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski,
	linux-kernel, linux-edac, linux-mm

> So that "system hang or panic" which the validation folks triggered,
> that cannot be reproduced anymore? Did they run the latest version of
> the patch?

I will get the validation folks to run the latest version (and play around with
hyperthreading if they see problems).

-Tony


* Re: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-09-13 21:52       ` [PATCH v3] " Luck, Tony
@ 2021-09-14  8:28         ` Borislav Petkov
  0 siblings, 0 replies; 52+ messages in thread
From: Borislav Petkov @ 2021-09-14  8:28 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song,
	huangcun, x86, linux-edac, linux-mm, linux-kernel

On Mon, Sep 13, 2021 at 02:52:39PM -0700, Luck, Tony wrote:
> Also mark queue_task_work() as "noinstr" (as reported by the kernel test
> robot <lkp@intel.com>).

Yeah, that's not enough - I have a patchset in the works for all this so
I'm going to drop your annotation.

> Cc: <stable@vger.kernel.org>
> Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work")

Ah ok, that one makes sense.

> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> > What about a Fixes: tag?
> 
> Added a Fixes tag.
> 
> Also added "noinstr" to queue_task_work() per a kernel robot report.
> 
> Also rewrote the commit message (based on questions raised against v2).

Thanks - very much appreciated and it reads really good!

> > I guess backporting this to the respective kernels is predicated upon
> > the existence of those other "places" in the kernel where code assumes
> > the EFAULT was because of a #PF.
> 
> Not really. I don't expect to change any kernel code that just bounces
> off the same machine check a few times. This patch does work best in
> conjunction with patches 2 & 3 (unchanged, not reposted here). But it
> will fix some old issues even without those two.

Ok, got it.

/me queues.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery
  2021-09-13  9:24     ` Borislav Petkov
@ 2021-09-13 21:52       ` Luck, Tony
  2021-09-14  8:28         ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2021-09-13 21:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song,
	huangcun, x86, linux-edac, linux-mm, linux-kernel

There are two cases for machine check recovery:
1) The machine check was triggered by ring3 (application) code.
   This is the simpler case. The machine check handler simply queues
   work to be executed on return to user. That code unmaps the page
   from all users and arranges to send a SIGBUS to the task that
   triggered the poison.
2) The machine check was triggered in kernel code that is covered by
   an extable entry.
   In this case the machine check handler still queues a work entry to
   unmap the page, etc. but this will not be called right away because
   the #MC handler returns to the fixup code address in the extable
   entry.

Problems occur if the kernel triggers another machine check before the
return-to-user path processes the first queued work item.

Specifically the work is queued using the "mce_kill_me" callback
structure in the task struct for the current thread. Attempting to queue
a second work item using this same callback results in a loop in the
linked list of work functions to call. So when the kernel does return to
user it enters an infinite loop, processing the same entry forever.
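
The corruption is easy to see with a simplified model of the task_work
list (all names below are made up for illustration):

	struct cb { struct cb *next; };

	static struct cb *pending;		/* models task->task_works */

	static void add_work(struct cb *n)	/* models task_work_add() */
	{
		n->next = pending;
		pending = n;
	}

	/*
	 * Calling add_work(&me) twice leaves me.next pointing at me
	 * itself, so a flush loop that walks ->next never terminates.
	 */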

There are some legitimate scenarios where the kernel may take a second
machine check before returning to the user.

1) Some code (e.g. futex) first tries a get_user() with page faults
   disabled. If this fails, the code retries with page faults enabled
   expecting that this will resolve the page fault.
2) Copy from user code retries a copy in byte-at-a-time mode to check
   whether any additional bytes can be copied (sketched below).
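
A rough sketch of that byte-at-a-time tail retry (illustrative only; the
real implementation lives in the x86 copy_user routines):

	/* the bulk copy faulted partway; probe the rest byte by byte */
	while (len) {
		char c;

		if (__get_user(c, from))	/* may hit the poison again */
			break;
		*to++ = c;
		from++;
		len--;
	}
	return len;			/* bytes that could not be copied */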

On the other side of the fence are some bad drivers that do not check
the return value from individual get_user() calls and may access
multiple user addresses without noticing that some/all calls have
failed.

Fix by adding a counter (current->mce_count) to keep track of repeated
machine checks before task_work() is called. First machine check saves
the address information and calls task_work_add(). Subsequent machine
checks before that task_work callback is executed check that the address
is in the same page as the first machine check (since the callback will
offline exactly one page).

Expected worst case is four machine checks before moving on (e.g. one
user access with page faults disabled, then a repeat to the same address
with page faults enabled ... repeat in copy tail bytes). Just in case
there is some code that loops forever, enforce a limit of 10.

Also mark queue_task_work() as "noinstr" (as reported by the kernel test
robot <lkp@intel.com>).

Cc: <stable@vger.kernel.org>
Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work")
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

> What about a Fixes: tag?

Added a Fixes tag.

Also added "noinstr" to queue_task_work() per a kernel robot report.

Also rewrote the commit message (based on questions raised against v2).

> I guess backporting this to the respective kernels is predicated upon
> the existence of those other "places" in the kernel where code assumes
> the EFAULT was because of a #PF.

Not really. I don't expect to change any kernel code that just bounces
off the same machine check a few times. This patch does work best in
conjunction with patches 2 & 3 (unchanged, not reposted here). But it
will fix some old issues even without those two.

 arch/x86/kernel/cpu/mce/core.c | 43 +++++++++++++++++++++++++---------
 include/linux/sched.h          |  1 +
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 8cb7816d03b4..9891b4070a61 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1253,6 +1253,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
 
@@ -1262,6 +1265,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	int flags = MF_ACTION_REQUIRED;
 	int ret;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1290,17 +1294,34 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static noinstr void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	int count = ++current->mce_count;
 
-	if (kill_current_task)
-		current->mce_kill_me.func = kill_me_now;
-	else
-		current->mce_kill_me.func = kill_me_maybe;
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+
+		if (kill_current_task)
+			current->mce_kill_me.func = kill_me_now;
+		else
+			current->mce_kill_me.func = kill_me_maybe;
+	}
+
+	/* Ten is likely overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (count > 1)
+		return;
 
 	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
 }
@@ -1438,7 +1459,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1456,7 +1477,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e12b524426b0..39039ce8ac4c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,7 @@ struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES
-- 
2.31.1




