All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/4] Fix machine check recovery for copy_from_user
@ 2021-03-26  0:02 Tony Luck
  2021-03-26  0:02 ` [PATCH 1/4] x86/mce: Fix copyin code to return -EFAULT on machine check Tony Luck
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Tony Luck @ 2021-03-26  0:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-kernel, linux-mm, Andy Lutomirski,
	Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

Maybe this is the way forward?  I made some poor choices before
to treat poison consumption in the kernel when accessing user data
(get_user() or copy_from_user()) ... in particular assuming that
the right action was sending a SIGBUS to the task as if it had
synchronously accessed the poison location.

First three patches may need to be combined (or broken up differently)
for bisectablilty. But they are presented separately here since they
touch separate parts of the problem.

Second part is definitley incomplete. But I'd like to check that it
is the right approach before expending more brain cells in the maze
of nested macros that is lib/iov_iter.c

Last part has been posted before. It covers the case where the kernel
takes more than one swing at reading poison data before returning to
user.

Tony Luck (4):
  x86/mce: Fix copyin code to return -EFAULT on machine check.
  mce/iter: Check for copyin failure & return error up stack
  mce/copyin: fix to not SIGBUS when copying from user hits poison
  x86/mce: Avoid infinite loop for copy from user recovery

 arch/x86/kernel/cpu/mce/core.c     | 63 +++++++++++++++++++++---------
 arch/x86/kernel/cpu/mce/severity.c |  2 -
 arch/x86/lib/copy_user_64.S        | 18 +++++----
 fs/iomap/buffered-io.c             |  8 +++-
 include/linux/sched.h              |  2 +-
 include/linux/uio.h                |  2 +-
 lib/iov_iter.c                     | 15 ++++++-
 7 files changed, 77 insertions(+), 33 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] x86/mce: Fix copyin code to return -EFAULT on machine check.
  2021-03-26  0:02 [RFC 0/4] Fix machine check recovery for copy_from_user Tony Luck
@ 2021-03-26  0:02 ` Tony Luck
  2021-04-06 19:24   ` Borislav Petkov
  2021-03-26  0:02 ` [PATCH 2/4] mce/iter: Check for copyin failure & return error up stack Tony Luck
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Tony Luck @ 2021-03-26  0:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-kernel, linux-mm, Andy Lutomirski,
	Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

When copy from user fails due to a machine check on poison reading
user data it should return an error code.

---

Separate patch just now, but likely needs to be combined with patches
to iteration code for bisection safety.
---
 arch/x86/lib/copy_user_64.S | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 77b9b2a3b5c8..2987118c541a 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -14,6 +14,7 @@
 #include <asm/alternative-asm.h>
 #include <asm/asm.h>
 #include <asm/smap.h>
+#include <asm/errno.h>
 #include <asm/export.h>
 #include <asm/trapnr.h>
 
@@ -237,18 +238,21 @@ SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
 	cmp $X86_TRAP_MC,%eax		/* check if X86_TRAP_MC */
 	je 3f
 1:	rep movsb
-2:	mov %ecx,%eax
+	mov %ecx,%eax
+	ASM_CLAC
+	ret
+
+2:
+	cmp $X86_TRAP_MC,%eax
+	je 3f
+	mov %ecx,%eax
 	ASM_CLAC
 	ret
 
 	/*
-	 * Return zero to pretend that this copy succeeded. This
-	 * is counter-intuitive, but needed to prevent the code
-	 * in lib/iov_iter.c from retrying and running back into
-	 * the poison cache line again. The machine check handler
-	 * will ensure that a SIGBUS is sent to the task.
+	 * Return -EFAULT for the machine check cases
 	 */
-3:	xorl %eax,%eax
+3:	movl $-EFAULT,%eax
 	ASM_CLAC
 	ret
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/4] mce/iter: Check for copyin failure & return error up stack
  2021-03-26  0:02 [RFC 0/4] Fix machine check recovery for copy_from_user Tony Luck
  2021-03-26  0:02 ` [PATCH 1/4] x86/mce: Fix copyin code to return -EFAULT on machine check Tony Luck
@ 2021-03-26  0:02 ` Tony Luck
  2021-03-26  0:02 ` [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison Tony Luck
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Tony Luck @ 2021-03-26  0:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-kernel, linux-mm, Andy Lutomirski,
	Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

Check for failure from low level copy from user code. Doing this
requires some type changes from the unsigned "size_t" so some
signed type (so that "if (xxx < 0)" works!). I picked "loff_t"
but there may be some other more appropriate type.

Very likely more places need to be changed. These changes based
on a machine check copying user data for a write(2) system call
to an xfs file system where the stack trace looks like:

[  149.445163]  ? copyin+0x2d/0x40
[  149.448687]  ? iov_iter_copy_from_user_atomic+0xd6/0x3a0
[  149.454625]  ? iomap_write_actor+0xc7/0x190
[  149.459319]  ? iomap_apply+0x12a/0x440
[  149.463508]  ? iomap_write_begin+0x530/0x530
[  149.468276]  ? iomap_write_begin+0x530/0x530
[  149.473041]  ? iomap_file_buffered_write+0x62/0x90
[  149.478390]  ? iomap_write_begin+0x530/0x530
[  149.483157]  ? xfs_file_buffered_write+0xe0/0x340 [xfs]
[  149.489393]  ? new_sync_write+0x122/0x1b0
[  149.493879]  ? vfs_write+0x20a/0x350
[  149.497888]  ? ksys_write+0x5f/0xe0
[  149.501784]  ? do_syscall_64+0x33/0x40

---

Needs to be bundled with other patches for bisection safety
---
 fs/iomap/buffered-io.c |  8 +++++++-
 include/linux/uio.h    |  2 +-
 lib/iov_iter.c         | 15 +++++++++++++--
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 414769a6ad11..6cc28e3f32ff 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -757,7 +757,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
-		size_t copied;		/* Bytes copied from user */
+		loff_t copied;		/* Bytes copied from user */
 
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
@@ -791,6 +791,12 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 
+		if (copied < 0) {
+			unlock_page(page);
+			put_page(page);
+			return copied;
+		}
+
 		copied = iomap_write_end(inode, pos, bytes, copied, page, iomap,
 				srcmap);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 27ff8eb786dc..c2d82e8e309c 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -111,7 +111,7 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)
 	};
 }
 
-size_t iov_iter_copy_from_user_atomic(struct page *page,
+loff_t iov_iter_copy_from_user_atomic(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f66c62aa7154..57e2f81c51ee 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -16,13 +16,18 @@
 #define PIPE_PARANOIA /* for now */
 
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
-	size_t left;					\
+	loff_t left;					\
 	size_t wanted = n;				\
 	__p = i->iov;					\
 	__v.iov_len = min(n, __p->iov_len - skip);	\
 	if (likely(__v.iov_len)) {			\
 		__v.iov_base = __p->iov_base + skip;	\
 		left = (STEP);				\
+		if (left < 0) {				\
+			wanted = left;			\
+			n = 0;				\
+			goto err;			\
+		}					\
 		__v.iov_len -= left;			\
 		skip += __v.iov_len;			\
 		n -= __v.iov_len;			\
@@ -36,10 +41,16 @@
 			continue;			\
 		__v.iov_base = __p->iov_base;		\
 		left = (STEP);				\
+		if (left < 0) {				\
+			wanted = left;			\
+			n = 0;				\
+			break;				\
+		}					\
 		__v.iov_len -= left;			\
 		skip = __v.iov_len;			\
 		n -= __v.iov_len;			\
 	}						\
+err:							\
 	n = wanted - n;					\
 }
 
@@ -975,7 +986,7 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-size_t iov_iter_copy_from_user_atomic(struct page *page,
+loff_t iov_iter_copy_from_user_atomic(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes)
 {
 	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-03-26  0:02 [RFC 0/4] Fix machine check recovery for copy_from_user Tony Luck
  2021-03-26  0:02 ` [PATCH 1/4] x86/mce: Fix copyin code to return -EFAULT on machine check Tony Luck
  2021-03-26  0:02 ` [PATCH 2/4] mce/iter: Check for copyin failure & return error up stack Tony Luck
@ 2021-03-26  0:02 ` Tony Luck
  2021-04-07 21:18   ` Borislav Petkov
  2021-03-26  0:02 ` [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
  2021-04-08  2:13 ` [RFC 0/4] Fix machine check recovery for copy_from_user Aili Yao
  4 siblings, 1 reply; 24+ messages in thread
From: Tony Luck @ 2021-03-26  0:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-kernel, linux-mm, Andy Lutomirski,
	Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

Andy Lutomirski pointed out that sending SIGBUS to tasks that
hit poison in the kernel copying syscall parameters from user
address space is not the right semantic.

So stop doing that. Add a new kill_me_never() call back that
simply unmaps and offlines the poison page.

current-mce_vaddr is no longer used, so drop this field

---

Needs to be combined with other patches for bisectability
---
 arch/x86/kernel/cpu/mce/core.c     | 35 ++++++++++++++++--------------
 arch/x86/kernel/cpu/mce/severity.c |  2 --
 include/linux/sched.h              |  1 -
 3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..1570310cadab 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1263,32 +1263,32 @@ static void kill_me_maybe(struct callback_head *cb)
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
-	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
 
-	if (p->mce_vaddr != (void __user *)-1l) {
-		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
-	} else {
-		pr_err("Memory error not recovered");
-		kill_me_now(cb);
-	}
+	pr_err("Memory error not recovered");
+	kill_me_now(cb);
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void kill_me_never(struct callback_head *cb)
+{
+	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
+
+	pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
+		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+}
+
+static void queue_task_work(struct mce *m, void (*func)(struct callback_head *))
 {
 	current->mce_addr = m->addr;
 	current->mce_kflags = m->kflags;
 	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
 	current->mce_whole_page = whole_page(m);
-
-	if (kill_current_task)
-		current->mce_kill_me.func = kill_me_now;
-	else
-		current->mce_kill_me.func = kill_me_maybe;
+	current->mce_kill_me.func = func;
 
 	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
 }
@@ -1426,7 +1426,10 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		if (kill_current_task)
+			queue_task_work(&m, kill_me_now);
+		else
+			queue_task_work(&m, kill_me_maybe);
 
 	} else {
 		/*
@@ -1444,7 +1447,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, kill_me_never);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 83df991314c5..47810d12f040 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -251,8 +251,6 @@ static bool is_copy_from_user(struct pt_regs *regs)
 	if (fault_in_kernel_space(addr))
 		return false;
 
-	current->mce_vaddr = (void __user *)addr;
-
 	return true;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef00bb22164c..2d213b52730c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1358,7 +1358,6 @@ struct task_struct {
 #endif
 
 #ifdef CONFIG_X86_MCE
-	void __user			*mce_vaddr;
 	__u64				mce_kflags;
 	u64				mce_addr;
 	__u64				mce_ripv : 1,
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-03-26  0:02 [RFC 0/4] Fix machine check recovery for copy_from_user Tony Luck
                   ` (2 preceding siblings ...)
  2021-03-26  0:02 ` [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison Tony Luck
@ 2021-03-26  0:02 ` Tony Luck
  2021-04-08 13:36   ` Borislav Petkov
  2021-04-08  2:13 ` [RFC 0/4] Fix machine check recovery for copy_from_user Aili Yao
  4 siblings, 1 reply; 24+ messages in thread
From: Tony Luck @ 2021-03-26  0:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-kernel, linux-mm, Andy Lutomirski,
	Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

	    do {
		    next = work->next;
		    work->func(work);
		    work = next;
		    cond_resched();
	    } while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before task_work() is called. First machine check saves the address information
and calls task_work_add(). Subsequent machine checks before that task_work
call back is executed check that the address is in the same page as the first
machine check (since the callback will offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same addrsss with
page faults enabled). Just in case there is some code that loops forever
enforce a limit of 10.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 40 ++++++++++++++++++++++++++--------
 include/linux/sched.h          |  1 +
 2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 1570310cadab..999fd7f0330b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1250,6 +1250,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
 
@@ -1258,6 +1261,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1277,18 +1281,36 @@ static void kill_me_never(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 
+	p->mce_count = 0;
 	pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
 	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 }
 
-static void queue_task_work(struct mce *m, void (*func)(struct callback_head *))
+static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
-	current->mce_kill_me.func = func;
+	int count = ++current->mce_count;
+
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+		current->mce_kill_me.func = func;
+	}
+
+	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (count > 1)
+		return;
 
 	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
 }
@@ -1427,9 +1449,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
 		if (kill_current_task)
-			queue_task_work(&m, kill_me_now);
+			queue_task_work(&m, msg, kill_me_now);
 		else
-			queue_task_work(&m, kill_me_maybe);
+			queue_task_work(&m, msg, kill_me_maybe);
 
 	} else {
 		/*
@@ -1447,7 +1469,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_me_never);
+			queue_task_work(&m, msg, kill_me_never);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2d213b52730c..8f9dc91498cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1364,6 +1364,7 @@ struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] x86/mce: Fix copyin code to return -EFAULT on machine check.
  2021-03-26  0:02 ` [PATCH 1/4] x86/mce: Fix copyin code to return -EFAULT on machine check Tony Luck
@ 2021-04-06 19:24   ` Borislav Petkov
  0 siblings, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2021-04-06 19:24 UTC (permalink / raw)
  To: Tony Luck
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

On Thu, Mar 25, 2021 at 05:02:32PM -0700, Tony Luck wrote:
> When copy from user fails due to a machine check on poison reading
> user data it should return an error code.
> 
> ---
> 
> Separate patch just now, but likely needs to be combined with patches
> to iteration code for bisection safety.
> ---
>  arch/x86/lib/copy_user_64.S | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> index 77b9b2a3b5c8..2987118c541a 100644
> --- a/arch/x86/lib/copy_user_64.S
> +++ b/arch/x86/lib/copy_user_64.S
> @@ -14,6 +14,7 @@
>  #include <asm/alternative-asm.h>
>  #include <asm/asm.h>
>  #include <asm/smap.h>
> +#include <asm/errno.h>
>  #include <asm/export.h>
>  #include <asm/trapnr.h>
>  
> @@ -237,18 +238,21 @@ SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
>  	cmp $X86_TRAP_MC,%eax		/* check if X86_TRAP_MC */
>  	je 3f
>  1:	rep movsb
> -2:	mov %ecx,%eax
> +	mov %ecx,%eax
> +	ASM_CLAC
> +	ret
> +
> +2:
> +	cmp $X86_TRAP_MC,%eax
> +	je 3f
> +	mov %ecx,%eax
>  	ASM_CLAC
>  	ret
>  
>  	/*
> -	 * Return zero to pretend that this copy succeeded. This
> -	 * is counter-intuitive, but needed to prevent the code
> -	 * in lib/iov_iter.c from retrying and running back into
> -	 * the poison cache line again. The machine check handler
> -	 * will ensure that a SIGBUS is sent to the task.
> +	 * Return -EFAULT for the machine check cases
>  	 */
> -3:	xorl %eax,%eax
> +3:	movl $-EFAULT,%eax
>  	ASM_CLAC
>  	ret

Right, looks ok after swapping all that
fault-handlers-rerouting-based-on-labels thing back in. You'd need to
update the blubber in the comment above it that we're returning -EFAULT
now too.

Also, you might wanna converge the exit paths to a single "out" label
if the unconditional JMPs are not that big of a deal (pasting the whole
thing and not as a diff because diff is unreadable):

SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
        movl %edx,%ecx
        cmp $X86_TRAP_MC,%eax           /* check if X86_TRAP_MC */
        je 3f

1:      rep movsb
        mov %ecx,%eax
        jmp out

2:      cmp $X86_TRAP_MC,%eax
        je 3f
        mov %ecx,%eax
        jmp out

        /* Return -EFAULT for the machine check cases */
3:      movl $-EFAULT,%eax

out:
        ASM_CLAC
        ret

        _ASM_EXTABLE_CPY(1b, 2b)
SYM_CODE_END(.Lcopy_user_handle_tail)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-03-26  0:02 ` [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison Tony Luck
@ 2021-04-07 21:18   ` Borislav Petkov
  2021-04-07 21:43     ` Luck, Tony
  0 siblings, 1 reply; 24+ messages in thread
From: Borislav Petkov @ 2021-04-07 21:18 UTC (permalink / raw)
  To: Tony Luck
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> Andy Lutomirski pointed out that sending SIGBUS to tasks that
> hit poison in the kernel copying syscall parameters from user
> address space is not the right semantic.

What does that mean exactly?

From looking at the code, that is this conditional:

        if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
                m->kflags |= MCE_IN_KERNEL_RECOV;
                m->kflags |= MCE_IN_KERNEL_COPYIN;

so what does the above have to do with syscall params?

If it is about us being in ring 0 and touching user memory and eating
poison in same *user* memory while doing so, then sure, that makes
sense.

> So stop doing that. Add a new kill_me_never() call back that
> simply unmaps and offlines the poison page.

Right, that's the same as handling poisoned user memory.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-07 21:18   ` Borislav Petkov
@ 2021-04-07 21:43     ` Luck, Tony
  2021-04-08  8:49       ` Borislav Petkov
  0 siblings, 1 reply; 24+ messages in thread
From: Luck, Tony @ 2021-04-07 21:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > hit poison in the kernel copying syscall parameters from user
> > address space is not the right semantic.
> 
> What does that mean exactly?

Andy said that a task could check a memory range for poison by
doing:

	ret = write(fd, buf, size);
	if (ret == size) {
		memory range is all good
	}

That doesn't work if the kernel sends a SIGBUS.

It doesn't seem a likely scenario ... but Andy is correct that
the above ought to work.

> 
> From looking at the code, that is this conditional:
> 
>         if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
>                 m->kflags |= MCE_IN_KERNEL_RECOV;
>                 m->kflags |= MCE_IN_KERNEL_COPYIN;
> 
> so what does the above have to do with syscall params?

Most "copy from user" instances are the result of a system call parameter
(e.g. "buf" in the write(2) example above).

> If it is about us being in ring 0 and touching user memory and eating
> poison in same *user* memory while doing so, then sure, that makes
> sense.

Yes. This is for kernel reading memory belongng to "current" task.

> > So stop doing that. Add a new kill_me_never() call back that
> > simply unmaps and offlines the poison page.
> 
> Right, that's the same as handling poisoned user memory.

Same in that the page gets unmapped. Different in that there
is no SIGBUS if the kernel did the access for the user.

-Tony

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/4] Fix machine check recovery for copy_from_user
  2021-03-26  0:02 [RFC 0/4] Fix machine check recovery for copy_from_user Tony Luck
                   ` (3 preceding siblings ...)
  2021-03-26  0:02 ` [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
@ 2021-04-08  2:13 ` Aili Yao
  2021-04-08 14:39   ` Luck, Tony
  4 siblings, 1 reply; 24+ messages in thread
From: Aili Yao @ 2021-04-08  2:13 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski,
	HORIGUCHI NAOYA( 堀口 直也),
	yaoaili

On Thu, 25 Mar 2021 17:02:31 -0700
Tony Luck <tony.luck@intel.com> wrote:

> Maybe this is the way forward?  I made some poor choices before
> to treat poison consumption in the kernel when accessing user data
> (get_user() or copy_from_user()) ... in particular assuming that
> the right action was sending a SIGBUS to the task as if it had
> synchronously accessed the poison location.
> 
> First three patches may need to be combined (or broken up differently)
> for bisectablilty. But they are presented separately here since they
> touch separate parts of the problem.
> 
> Second part is definitley incomplete. But I'd like to check that it
> is the right approach before expending more brain cells in the maze
> of nested macros that is lib/iov_iter.c
> 
> Last part has been posted before. It covers the case where the kernel
> takes more than one swing at reading poison data before returning to
> user.
> 
> Tony Luck (4):
>   x86/mce: Fix copyin code to return -EFAULT on machine check.
>   mce/iter: Check for copyin failure & return error up stack
>   mce/copyin: fix to not SIGBUS when copying from user hits poison
>   x86/mce: Avoid infinite loop for copy from user recovery
> 
>  arch/x86/kernel/cpu/mce/core.c     | 63 +++++++++++++++++++++---------
>  arch/x86/kernel/cpu/mce/severity.c |  2 -
>  arch/x86/lib/copy_user_64.S        | 18 +++++----
>  fs/iomap/buffered-io.c             |  8 +++-
>  include/linux/sched.h              |  2 +-
>  include/linux/uio.h                |  2 +-
>  lib/iov_iter.c                     | 15 ++++++-
>  7 files changed, 77 insertions(+), 33 deletions(-)
> 

I have one scenario, may you take into account:

If one copyin case occurs, write() returned by your patch, the user process may
check the return values, for errors, it may exit the process, then the error page
will be freed, and then the page maybe alloced to other process or to kernel itself,
then code will initialize it and this will trigger one SRAO, if it's used by kernel,
we may do nothing for this, and kernel may still touch it, and lead to one panic.

Is this we expect? 

Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-07 21:43     ` Luck, Tony
@ 2021-04-08  8:49       ` Borislav Petkov
  2021-04-08 17:08         ` Luck, Tony
  0 siblings, 1 reply; 24+ messages in thread
From: Borislav Petkov @ 2021-04-08  8:49 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

On Wed, Apr 07, 2021 at 02:43:10PM -0700, Luck, Tony wrote:
> On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> > On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > > hit poison in the kernel copying syscall parameters from user
> > > address space is not the right semantic.
> > 
> > What does that mean exactly?
> 
> Andy said that a task could check a memory range for poison by
> doing:
> 
> 	ret = write(fd, buf, size);
> 	if (ret == size) {
> 		memory range is all good
> 	}
> 
> That doesn't work if the kernel sends a SIGBUS.
> 
> It doesn't seem a likely scenario ... but Andy is correct that
> the above ought to work.

We need to document properly what this is aiming to fix. He said
something yesterday along the lines of kthread_use_mm() hitting a SIGBUS
when a kthread "attaches" to an address space. I'm still unclear as to
how exactly that happens - there are only a handful of kthread_use_mm()
users in the tree...

> Yes. This is for kernel reading memory belongng to "current" task.

Provided "current" is really the task to which the poison page belongs.
That kthread_use_mm() thing sounded like the wrong task gets killed. But that
needs more details.

> Same in that the page gets unmapped. Different in that there
> is no SIGBUS if the kernel did the access for the user.

What is even the actual use case with sending tasks SIGBUS on poison
consumption? KVM? Others?

Are we documenting somewhere: "if your process gets a SIGBUS and this
and that, which means your page got offlined, you should do this and
that to recover"?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-03-26  0:02 ` [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
@ 2021-04-08 13:36   ` Borislav Petkov
  2021-04-08 16:06     ` Luck, Tony
  0 siblings, 1 reply; 24+ messages in thread
From: Borislav Petkov @ 2021-04-08 13:36 UTC (permalink / raw)
  To: Tony Luck
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

On Thu, Mar 25, 2021 at 05:02:35PM -0700, Tony Luck wrote:
...
> Expected worst case is two machine checks before moving on (e.g. one user
> access with page faults disabled, then a repeat to the same addrsss with
> page faults enabled). Just in case there is some code that loops forever
> enforce a limit of 10.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 40 ++++++++++++++++++++++++++--------
>  include/linux/sched.h          |  1 +
>  2 files changed, 32 insertions(+), 9 deletions(-)

What I'm still unclear on, does this new version address that
"mysterious" hang or panic which the validation team triggered or you
haven't checked yet?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [RFC 0/4] Fix machine check recovery for copy_from_user
  2021-04-08  2:13 ` [RFC 0/4] Fix machine check recovery for copy_from_user Aili Yao
@ 2021-04-08 14:39   ` Luck, Tony
  2021-04-09  6:49     ` Aili Yao
  0 siblings, 1 reply; 24+ messages in thread
From: Luck, Tony @ 2021-04-08 14:39 UTC (permalink / raw)
  To: Aili Yao, Borislav Petkov
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski,
	HORIGUCHI NAOYA( 堀口 直也)

> I have one scenario, may you take into account:
>
> If one copyin case occurs, write() returned by your patch, the user process may
> check the return values, for errors, it may exit the process, then the error page
> will be freed, and then the page maybe alloced to other process or to kernel itself,
> then code will initialize it and this will trigger one SRAO, if it's used by kernel,
> we may do nothing for this, and kernel may still touch it, and lead to one panic.

In this case kill_me_never() calls memory_failure() with flags == 0. I think (hope!)
that means that it will unmap the page from the task, but will not send a signal.

When the task exits the PTE for this page has the swap/poison signature, so the
page is not freed for re-use.

-Tony

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
  2021-04-08 13:36   ` Borislav Petkov
@ 2021-04-08 16:06     ` Luck, Tony
  0 siblings, 0 replies; 24+ messages in thread
From: Luck, Tony @ 2021-04-08 16:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

> What I'm still unclear on, does this new version address that
> "mysterious" hang or panic which the validation team triggered or you
> haven't checked yet?

No :-(

They are triggering some case where multiple threads in a process hit the same
poison, and somehow memory_failure() fails to complete offlining the page. At this
point any other threads that hit that page get the early return from memory_failure
(because the page flags say it is poisoned) ... and so we loop.

But the "recover from cases where multiple machine checks happen
simultaneously" case is orthogonal to the "do the right thing to recover
when the kernel touches poison at a user address". So I think we can
tackle them separately

-Tony

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-08  8:49       ` Borislav Petkov
@ 2021-04-08 17:08         ` Luck, Tony
  2021-04-13 10:07           ` Borislav Petkov
  0 siblings, 1 reply; 24+ messages in thread
From: Luck, Tony @ 2021-04-08 17:08 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

On Thu, Apr 08, 2021 at 10:49:58AM +0200, Borislav Petkov wrote:
> On Wed, Apr 07, 2021 at 02:43:10PM -0700, Luck, Tony wrote:
> > On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> > > On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > > > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > > > hit poison in the kernel copying syscall parameters from user
> > > > address space is not the right semantic.
> > > 
> > > What does that mean exactly?
> > 
> > Andy said that a task could check a memory range for poison by
> > doing:
> > 
> > 	ret = write(fd, buf, size);
> > 	if (ret == size) {
> > 		memory range is all good
> > 	}
> > 
> > That doesn't work if the kernel sends a SIGBUS.
> > 
> > It doesn't seem a likely scenario ... but Andy is correct that
> > the above ought to work.
> 
> We need to document properly what this is aiming to fix. He said
> something yesterday along the lines of kthread_use_mm() hitting a SIGBUS
> when a kthread "attaches" to an address space. I'm still unclear as to
> how exactly that happens - there are only a handful of kthread_use_mm()
> users in the tree...

Also not clear to me either ... but sending a SIGBUS to a kthread isn't
going to do anything useful. So avoiding doing that is another worthy
goal.

> > Yes. This is for kernel reading memory belongng to "current" task.
> 
> Provided "current" is really the task to which the poison page belongs.
> That kthread_use_mm() thing sounded like the wrong task gets killed. But that
> needs more details.

With these patches nothing gets killed when kernel touches user poison.
If this is in a regular system call then these patches will return EFAULT
to the user (but now that I see EHWPOISON exists that looks like a better
choice - so applications can distinguish the "I just used an invalid address in
a parameter to a syscall" from "This isn't my fault, the memory broke".

> > Same in that the page gets unmapped. Different in that there
> > is no SIGBUS if the kernel did the access for the user.
> 
> What is even the actual use case with sending tasks SIGBUS on poison
> consumption? KVM? Others?

KVM apparently passes a machine check into the guest. Though it seems
to be misisng the MCG_STATUS information to tell the guest whether this
is an "Action Required" machine check, or an "Action Optional" (i.e.
whether the poison was found synchonously by execution of the current
instruction, or asynchronously).

> Are we documenting somewhere: "if your process gets a SIGBUS and this
> and that, which means your page got offlined, you should do this and
> that to recover"?

There is the ancient Documentation/vm/hwpoison.rst from 2009 ... nothing
seems wrong in that, but could use some updates.  I don't know how much
detail we might want to go into on recovery stratgies for applications.
In terms of production s/w there was one ISV who prototyped recovery
for their application but last time I checked didn't enable it in the
production version.

Essentially it boils down to:
SIGBUS handler gets additional data giving virtual address that has gone away

1) Can the application replace the lost page?
	Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
	and fill with replacement data. This case can return from SIGBUS
	handler to re-execute failed instruction
2) Can the application continue in degraded mode w/o the lost page?
	Hunt down pointers to lost page and update structures to say
	"this data lost". Use siglongjmp() to go to preset recovery path
3) Can the application shut down gracefully?
	Record details of the lost page. Inform next-of-kin. Exit.
4) Default - just exit

-Tony

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/4] Fix machine check recovery for copy_from_user
  2021-04-08 14:39   ` Luck, Tony
@ 2021-04-09  6:49     ` Aili Yao
  0 siblings, 0 replies; 24+ messages in thread
From: Aili Yao @ 2021-04-09  6:49 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, x86, linux-mm, Andy Lutomirski,
	HORIGUCHI NAOYA( 堀口 直也),
	yaoaili

On Thu, 8 Apr 2021 14:39:09 +0000
"Luck, Tony" <tony.luck@intel.com> wrote:

> > I have one scenario, may you take into account:
> >
> > If one copyin case occurs, write() returned by your patch, the user process may
> > check the return values, for errors, it may exit the process, then the error page
> > will be freed, and then the page maybe alloced to other process or to kernel itself,
> > then code will initialize it and this will trigger one SRAO, if it's used by kernel,
> > we may do nothing for this, and kernel may still touch it, and lead to one panic.  
> 
> In this case kill_me_never() calls memory_failure() with flags == 0. I think (hope!)
> that means that it will unmap the page from the task, but will not send a signal.
> 
> When the task exits the PTE for this page has the swap/poison signature, so the
> page is not freed for re-use.
> 
> -Tony

Oh, Yes, Sorry for my rudeness and error-understandings, I just happen to can't control my emotions and get confused for some other things.

Thanks!
Aili Yao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-08 17:08         ` Luck, Tony
@ 2021-04-13 10:07           ` Borislav Petkov
  2021-04-13 16:13             ` Luck, Tony
  0 siblings, 1 reply; 24+ messages in thread
From: Borislav Petkov @ 2021-04-13 10:07 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

On Thu, Apr 08, 2021 at 10:08:52AM -0700, Luck, Tony wrote:
> Also not clear to me either ... but sending a SIGBUS to a kthread isn't
> going to do anything useful. So avoiding doing that is another worthy
> goal.

Ack.

> With these patches nothing gets killed when kernel touches user poison.
> If this is in a regular system call then these patches will return EFAULT
> to the user (but now that I see EHWPOISON exists that looks like a better
> choice - so applications can distinguish the "I just used an invalid address in
> a parameter to a syscall" from "This isn't my fault, the memory broke".

Yap, makes sense.

> KVM apparently passes a machine check into the guest.

Ah, there it is:

static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
{
        send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk);
}

> Though it seems to be misisng the MCG_STATUS information to tell
> the guest whether this is an "Action Required" machine check, or an
> "Action Optional" (i.e. whether the poison was found synchonously by
> execution of the current instruction, or asynchronously).

Yeah, someone needs to deal with that sooner or later, considering how
virt is becoming ubiquitous.

> There is the ancient Documentation/vm/hwpoison.rst from 2009 ... nothing
> seems wrong in that, but could use some updates.

Ah yap, that.

So what I'm missing with all this fun is, yeah, sure, we have this
facility out there but who's using it? Is anyone even using it at all?

If so, does it even make sense, does it need improvements, etc?

Because from where I stand it all looks like we do all these fancy
recovery things but is userspace even paying attention or using them or
whatever...

> I don't know how much detail we might want to go into on recovery
> stratgies for applications.

Perhaps an example or two how userspace is supposed to use it...

> In terms of production s/w there was one ISV who prototyped recovery
> for their application but last time I checked didn't enable it in the
> production version.

See, one and it hasn't even enabled it. So it all feels like a lot of
wasted energy to me to do all those and keep 'em working. But maybe I'm
missing the big picture and someone would come and say, no, Boris, we
use this and this and that is our feedback...

> Essentially it boils down to:
> SIGBUS handler gets additional data giving virtual address that has gone away
> 
> 1) Can the application replace the lost page?
> 	Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
> 	and fill with replacement data. This case can return from SIGBUS
> 	handler to re-execute failed instruction
> 2) Can the application continue in degraded mode w/o the lost page?
> 	Hunt down pointers to lost page and update structures to say
> 	"this data lost". Use siglongjmp() to go to preset recovery path
> 3) Can the application shut down gracefully?
> 	Record details of the lost page. Inform next-of-kin. Exit.
> 4) Default - just exit

Right and this should probably start with "Hey userspace folks, here's
what you can do when you get a hwpoison signal and here's how we
envision this recovery to work" and then we can all discuss and converge
on an agreeable solution which is actually used and there will be
selftests and so on and so on...

But what the h*ll do I know.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-13 10:07           ` Borislav Petkov
@ 2021-04-13 16:13             ` Luck, Tony
  2021-04-14 13:05               ` Borislav Petkov
  0 siblings, 1 reply; 24+ messages in thread
From: Luck, Tony @ 2021-04-13 16:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

Even if no applications ever do anything with it, it is still useful to avoid
crashing the whole system and just terminate one application/guest.

> If so, does it even make sense, does it need improvements, etc?

There's one more item on my long term TODO list. Add fixups so that
copy_to_user() from poison in the page cache doesn't crash, but just
checks to see if the page was clean .. .in which case re-read from the
filesystem into a different physical page and retire the old page ... the
read can now succeed. If the page is dirty, then fail the read (and retire
the page ... need to make sure filesystem knows the data for the page
was lost so subsequent reads return -EIO or something).

Page cache occupies enough memory that it is a big enough
source of system crashes that could be avoided. I'm not sure
if there are any other obvious cases after this ... it all gets into
diminishing returns ... not really worth it to handle a case that
only occupies 0.00002% of memory.

> Because from where I stand it all looks like we do all these fancy
> recovery things but is userspace even paying attention or using them or
> whatever...

See above. With core counts continuing to increase, the cloud service
providers really want to see fewer events that crash the whole physical
machine (taking down dozens, or hundreds, of guest VMs).

-Tony

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-13 16:13             ` Luck, Tony
@ 2021-04-14 13:05               ` Borislav Petkov
  0 siblings, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2021-04-14 13:05 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
	HORIGUCHI NAOYA( 堀口 直也)

On Tue, Apr 13, 2021 at 04:13:03PM +0000, Luck, Tony wrote:
> Even if no applications ever do anything with it, it is still useful to avoid
> crashing the whole system and just terminate one application/guest.

True.

> There's one more item on my long term TODO list. Add fixups so that
> copy_to_user() from poison in the page cache doesn't crash, but just
> checks to see if the page was clean .. .in which case re-read from the
> filesystem into a different physical page and retire the old page ... the
> read can now succeed. If the page is dirty, then fail the read (and retire
> the page ... need to make sure filesystem knows the data for the page
> was lost so subsequent reads return -EIO or something).

Makes sense.

> Page cache occupies enough memory that it is a big enough
> source of system crashes that could be avoided. I'm not sure
> if there are any other obvious cases after this ... it all gets into
> diminishing returns ... not really worth it to handle a case that
> only occupies 0.00002% of memory.

Ack.

> See above. With core counts continuing to increase, the cloud service
> providers really want to see fewer events that crash the whole physical
> machine (taking down dozens, or hundreds, of guest VMs).

Yap.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
@ 2021-04-19 20:32 Jue Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Jue Wang @ 2021-04-19 20:32 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, linux-kernel, linux-mm, luto,
	HORIGUCHI NAOYA(堀口 直也),
	x86, yaoaili

On Thu, 8 Apr 2021 10:08:52 -0700, Tony Luck wrote:
> KVM apparently passes a machine check into the guest. Though it seems
> to be misisng the MCG_STATUS information to tell the guest whether this
> is an "Action Required" machine check, or an "Action Optional" (i.e.
> whether the poison was found synchonously by execution of the current
> instruction, or asynchronously).

The KVM_X86_SET_MCE ioctl takes a parameter of struct kvm_x86_mce, hypervisor
can set with necessary semantics.

1140 #ifdef KVM_CAP_MCE
1141 /* x86 MCE */
1142 struct kvm_x86_mce {
1143         __u64 status;
1144         __u64 addr;
1145         __u64 misc;
1146         __u64 mcg_status;
1147         __u8 bank;
1148         __u8 pad1[7];
1149         __u64 pad2[3];
1150 };
1151 #endif

> > Are we documenting somewhere: "if your process gets a SIGBUS and this
> > and that, which means your page got offlined, you should do this and
> > that to recover"?

> Essentially it boils down to:
> SIGBUS handler gets additional data giving virtual address that has gone away

> 1) Can the application replace the lost page?
> Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
> and fill with replacement data. This case can return from SIGBUS
> handler to re-execute failed instruction
> 2) Can the application continue in degraded mode w/o the lost page?
> Hunt down pointers to lost page and update structures to say
> "this data lost". Use siglongjmp() to go to preset recovery path
> 3) Can the application shut down gracefully?
> Record details of the lost page. Inform next-of-kin. Exit.
> 4) Default - just exit
Two possible addition to these great points:
5) If for some reason the page cannot be unmapped (e.g.,
either losing to much memory like hugetlbfs 1G pages, or
THP split failure for SHMEM THP), kernel maintains a
consistent semantic (i.e., MCE SIGBUS with vaddr) to all future
accesses from user space, by leaving the hwpoisoned page
mapped or in the radix tree.
6). If for some reason the vaddr is not available upon the
first MCE recovery and page is unmapped, kernel provides
correct semantic (MCE SIGBUS with vaddr) in subsequent
page faults from user space accessing the same vaddr.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-14 14:46   ` Jue Wang
@ 2021-04-14 15:35     ` Borislav Petkov
  0 siblings, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2021-04-14 15:35 UTC (permalink / raw)
  To: Jue Wang
  Cc: linux-kernel, linux-mm, luto,
	HORIGUCHI NAOYA(堀口 直也),
	Luck, Tony, x86, yaoaili

On Wed, Apr 14, 2021 at 07:46:49AM -0700, Jue Wang wrote:
> I can see this is useful in other types of domains, e.g., on multi-tenant cloud
> servers where many VMs are collocated on the same host,
> with proper recovery + live migration, a single MCE would only affect a single
> VM at most.
> 
> Another type of generic use case may be services that can tolerate
> abrupt crash,
> i.e., they periodically save checkpoints to persistent storage or are stateless
> services in nature and are managed by some process manager to automatically
> restart and resume from where the work was left at when crashed.

Yap, thanks for those.

So I do see a disconnect between us doing those features in the kernel
and not really seeing how people use them. So this helps, I guess the VM
angle will become important real soon - if not already - so hopefully
we'll get more feedback in the future.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-14 13:10 ` Borislav Petkov
@ 2021-04-14 14:46   ` Jue Wang
  2021-04-14 15:35     ` Borislav Petkov
  0 siblings, 1 reply; 24+ messages in thread
From: Jue Wang @ 2021-04-14 14:46 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, linux-mm, luto,
	HORIGUCHI NAOYA(堀口 直也),
	Luck, Tony, x86, yaoaili

On Wed, Apr 14, 2021 at 6:10 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Apr 13, 2021 at 10:47:21PM -0700, Jue Wang wrote:
> > This path is when EPT #PF finds accesses to a hwpoisoned page and
> > sends SIGBUS to user space (KVM exits into user space) with the same
> > semantic as if regular #PF found access to a hwpoisoned page.
> >
> > The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.
> >
> > We are in process to launch a product with MCE recovery capability in
> > a KVM based virtualization product and plan to expand the scope of the
> > application of it in the near future.
>
> Any pointers to code or is this all non-public? Any text on what that
> product does with the MCEs?

These are non-public at this point.

User-facing docs and blog post are expected to be released towards the
launch (i.e., in 3-4 months from now).
>
> > The in-memory database and analytical domain are definitely using it.
> > A couple examples:
> > SAP HANA - as we've tested and planned to launch as a strategic
> > enterprise use case with MCE recovery capability in our product
> > SQL server - https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors
>
> Aha, so they register callbacks for the processes to exec on a memory
> error. Good to know, thanks for those.
My other 2 cents:

I can see this is useful in other types of domains, e.g., on multi-tenant cloud
servers where many VMs are collocated on the same host,
with proper recovery + live migration, a single MCE would only affect a single
VM at most.

Another type of generic use case may be services that can tolerate
abrupt crash,
i.e., they periodically save checkpoints to persistent storage or are stateless
services in nature and are managed by some process manager to automatically
restart and resume from where the work was left at when crashed.

Thanks,
-Jue
>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
  2021-04-14  5:47 ` Jue Wang
  (?)
@ 2021-04-14 13:10 ` Borislav Petkov
  2021-04-14 14:46   ` Jue Wang
  -1 siblings, 1 reply; 24+ messages in thread
From: Borislav Petkov @ 2021-04-14 13:10 UTC (permalink / raw)
  To: Jue Wang
  Cc: linux-kernel, linux-mm, luto, naoya.horiguchi, tony.luck, x86, yaoaili

On Tue, Apr 13, 2021 at 10:47:21PM -0700, Jue Wang wrote:
> This path is when EPT #PF finds accesses to a hwpoisoned page and
> sends SIGBUS to user space (KVM exits into user space) with the same
> semantic as if regular #PF found access to a hwpoisoned page.
> 
> The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.
> 
> We are in process to launch a product with MCE recovery capability in
> a KVM based virtualization product and plan to expand the scope of the
> application of it in the near future.

Any pointers to code or is this all non-public? Any text on what that
product does with the MCEs?

> The in-memory database and analytical domain are definitely using it.
> A couple examples:
> SAP HANA - as we've tested and planned to launch as a strategic
> enterprise use case with MCE recovery capability in our product
> SQL server - https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors

Aha, so they register callbacks for the processes to exec on a memory
error. Good to know, thanks for those.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
@ 2021-04-14  5:47 ` Jue Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Jue Wang @ 2021-04-14  5:47 UTC (permalink / raw)
  To: bp; +Cc: linux-kernel, linux-mm, luto, naoya.horiguchi, tony.luck, x86, yaoaili

On Tue, 13 Apr 2021 12:07:22 +0200, Petkov, Borislav wrote:

>> KVM apparently passes a machine check into the guest.

> Ah, there it is:

> static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
> {
>         send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk);
> }

This path is when EPT #PF finds accesses to a hwpoisoned page and
sends SIGBUS to user space (KVM exits into user space) with the same
semantic as if regular #PF found access to a hwpoisoned page.

The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.

We are in process to launch a product with MCE recovery capability in
a KVM based virtualization product and plan to expand the scope of the
application of it in the near future.

> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

The in-memory database and analytical domain are definitely using it.
A couple examples:
SAP HANA - as we've tested and planned to launch as a strategic
enterprise use case with MCE recovery capability in our product
SQL server - https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors


Cheers,
-Jue

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
@ 2021-04-14  5:47 ` Jue Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Jue Wang @ 2021-04-14  5:47 UTC (permalink / raw)
  To: bp; +Cc: linux-kernel, linux-mm, luto, naoya.horiguchi, tony.luck, x86, yaoaili

On Tue, 13 Apr 2021 12:07:22 +0200, Petkov, Borislav wrote:

>> KVM apparently passes a machine check into the guest.

> Ah, there it is:

> static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
> {
>         send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk);
> }

This path is when EPT #PF finds accesses to a hwpoisoned page and
sends SIGBUS to user space (KVM exits into user space) with the same
semantic as if regular #PF found access to a hwpoisoned page.

The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.

We are in process to launch a product with MCE recovery capability in
a KVM based virtualization product and plan to expand the scope of the
application of it in the near future.

> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

The in-memory database and analytical domain are definitely using it.
A couple examples:
SAP HANA - as we've tested and planned to launch as a strategic
enterprise use case with MCE recovery capability in our product
SQL server - https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors


Cheers,
-Jue


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-04-19 20:35 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-26  0:02 [RFC 0/4] Fix machine check recovery for copy_from_user Tony Luck
2021-03-26  0:02 ` [PATCH 1/4] x86/mce: Fix copyin code to return -EFAULT on machine check Tony Luck
2021-04-06 19:24   ` Borislav Petkov
2021-03-26  0:02 ` [PATCH 2/4] mce/iter: Check for copyin failure & return error up stack Tony Luck
2021-03-26  0:02 ` [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison Tony Luck
2021-04-07 21:18   ` Borislav Petkov
2021-04-07 21:43     ` Luck, Tony
2021-04-08  8:49       ` Borislav Petkov
2021-04-08 17:08         ` Luck, Tony
2021-04-13 10:07           ` Borislav Petkov
2021-04-13 16:13             ` Luck, Tony
2021-04-14 13:05               ` Borislav Petkov
2021-03-26  0:02 ` [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
2021-04-08 13:36   ` Borislav Petkov
2021-04-08 16:06     ` Luck, Tony
2021-04-08  2:13 ` [RFC 0/4] Fix machine check recovery for copy_from_user Aili Yao
2021-04-08 14:39   ` Luck, Tony
2021-04-09  6:49     ` Aili Yao
2021-04-14  5:47 [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison Jue Wang
2021-04-14  5:47 ` Jue Wang
2021-04-14 13:10 ` Borislav Petkov
2021-04-14 14:46   ` Jue Wang
2021-04-14 15:35     ` Borislav Petkov
2021-04-19 20:32 Jue Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.