* Re: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
@ 2021-04-19 21:28 ` Jue Wang
0 siblings, 0 replies; 6+ messages in thread
From: Jue Wang @ 2021-04-19 21:28 UTC (permalink / raw)
To: tony.luck; +Cc: bp, linux-kernel, linux-mm, luto, naoya.horiguchi, x86, yaoaili
On Thu, 25 Mar 2021 17:02:35 -0700, Tony Luck wrote:
...
> But there are places in the kernel where the code assumes that this
> EFAULT return was simply because of a page fault. The code takes some
> action to fix that, and then retries the access. This results in a second
> machine check.
What about returning EHWPOISON instead of EFAULT and updating the callers
to handle EHWPOISON explicitly, i.e., not retry but give up on the page?
My main concern is that the strong assumptions that the kernel can't hit more
than a fixed number of poisoned cache lines before turning to user space
may simply not be true.
When a DIMM goes bad, it can easily affect an entire bank or an entire RAM
device chip. Even with memory interleaving, it's possible that a kernel control
path touches lots of poisoned cache lines in the buffer it is working through.
^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
@ 2021-04-19 21:41 ` Luck, Tony
From: Luck, Tony @ 2021-04-19 21:41 UTC (permalink / raw)
To: Jue Wang; +Cc: bp, linux-kernel, linux-mm, luto, naoya.horiguchi, x86, yaoaili

>> But there are places in the kernel where the code assumes that this
>> EFAULT return was simply because of a page fault. The code takes some
>> action to fix that, and then retries the access. This results in a second
>> machine check.
>
> What about return EHWPOISON instead of EFAULT and update the callers
> to handle EHWPOISON explicitly: i.e., not retry but give up on the page?

That seems like a good idea to me. But I got some pushback when I started
on this path earlier with some patches to the futex code. But back then I
wasn't using an error return of EHWPOISON ... possibly the code would look
less hacky with that explicitly called out.

The futex case was specifically for code using pagefault_disable(). Likely
all the other callers would need to be audited (but there are only a few
dozen places, so not too big of a deal).

> My main concern is that the strong assumptions that the kernel can't hit more
> than a fixed number of poisoned cache lines before turning to user space
> may simply not be true.

Agreed.

> When DIMM goes bad, it can easily affect an entire bank or entire ram device
> chip. Even with memory interleaving, it's possible that a kernel control path
> touches lots of poisoned cache lines in the buffer it is working through.

These larger failures have other problems ... dozens of unrelated pages may
be affected. In a perfect world Linux would be told on the first error that
this is just one of many errors ... and be given a list. But in the real
world that isn't likely to happen :-(

-Tony
* [RFC 0/4] Fix machine check recovery for copy_from_user
@ 2021-03-26  0:02 Tony Luck
From: Tony Luck @ 2021-03-26 0:02 UTC (permalink / raw)
To: Borislav Petkov
Cc: Tony Luck, x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
    HORIGUCHI NAOYA (堀口 直也)

Maybe this is the way forward? I made some poor choices before to treat
poison consumption in the kernel when accessing user data (get_user() or
copy_from_user()) ... in particular assuming that the right action was
sending a SIGBUS to the task as if it had synchronously accessed the
poison location.

First three patches may need to be combined (or broken up differently) for
bisectability. But they are presented separately here since they touch
separate parts of the problem.

Second part is definitely incomplete. But I'd like to check that it is the
right approach before expending more brain cells in the maze of nested
macros that is lib/iov_iter.c

Last part has been posted before. It covers the case where the kernel
takes more than one swing at reading poison data before returning to user.

Tony Luck (4):
  x86/mce: Fix copyin code to return -EFAULT on machine check.
  mce/iter: Check for copyin failure & return error up stack
  mce/copyin: fix to not SIGBUS when copying from user hits poison
  x86/mce: Avoid infinite loop for copy from user recovery

 arch/x86/kernel/cpu/mce/core.c     | 63 +++++++++++++++++++++---------
 arch/x86/kernel/cpu/mce/severity.c |  2 -
 arch/x86/lib/copy_user_64.S        | 18 +++++----
 fs/iomap/buffered-io.c             |  8 +++-
 include/linux/sched.h              |  2 +-
 include/linux/uio.h                |  2 +-
 lib/iov_iter.c                     | 15 ++++++-
 7 files changed, 77 insertions(+), 33 deletions(-)

-- 
2.29.2
* [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
@ 2021-03-26  0:02 ` Tony Luck
From: Tony Luck @ 2021-03-26 0:02 UTC (permalink / raw)
To: Borislav Petkov
Cc: Tony Luck, x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
    HORIGUCHI NAOYA (堀口 直也)

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT. Also queue_task_work() sets up so
that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that was used
in the first call, the net result is an entry on the current->task_works
list that points to itself. When task_work_run() is called it loops
forever in this code:

	do {
		next = work->next;
		work->func(work);
		work = next;
		cond_resched();
	} while (work);

Add a counter (current->mce_count) to keep track of repeated machine
checks before task_work() is called. First machine check saves the address
information and calls task_work_add(). Subsequent machine checks before
that task_work callback is executed check that the address is in the same
page as the first machine check (since the callback will offline exactly
one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever
enforce a limit of 10.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 40 ++++++++++++++++++++++++++--------
 include/linux/sched.h          |  1 +
 2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 1570310cadab..999fd7f0330b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1250,6 +1250,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
 
@@ -1258,6 +1261,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1277,18 +1281,36 @@ static void kill_me_never(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 
+	p->mce_count = 0;
 	pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
 	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 }
 
-static void queue_task_work(struct mce *m, void (*func)(struct callback_head *))
+static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
-	current->mce_kill_me.func = func;
+	int count = ++current->mce_count;
+
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+		current->mce_kill_me.func = func;
+	}
+
+	/* Ten is likely overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (count > 1)
+		return;
 
 	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
 }
@@ -1427,9 +1449,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
 		if (kill_current_task)
-			queue_task_work(&m, kill_me_now);
+			queue_task_work(&m, msg, kill_me_now);
 		else
-			queue_task_work(&m, kill_me_maybe);
+			queue_task_work(&m, msg, kill_me_maybe);
 
 	} else {
 		/*
@@ -1447,7 +1469,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_me_never);
+			queue_task_work(&m, msg, kill_me_never);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2d213b52730c..8f9dc91498cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1364,6 +1364,7 @@ struct task_struct {
 			mce_whole_page : 1,
 			__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES
-- 
2.29.2
* Re: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
@ 2021-04-08 13:36 ` Borislav Petkov
From: Borislav Petkov @ 2021-04-08 13:36 UTC (permalink / raw)
To: Tony Luck
Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
    HORIGUCHI NAOYA (堀口 直也)

On Thu, Mar 25, 2021 at 05:02:35PM -0700, Tony Luck wrote:
...
> Expected worst case is two machine checks before moving on (e.g. one user
> access with page faults disabled, then a repeat to the same address with
> page faults enabled). Just in case there is some code that loops forever
> enforce a limit of 10.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 40 ++++++++++++++++++++++++++--------
>  include/linux/sched.h          |  1 +
>  2 files changed, 32 insertions(+), 9 deletions(-)

What I'm still unclear on: does this new version address that
"mysterious" hang or panic which the validation team triggered, or
haven't you checked yet?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
* RE: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
@ 2021-04-08 16:06 ` Luck, Tony
From: Luck, Tony @ 2021-04-08 16:06 UTC (permalink / raw)
To: Borislav Petkov
Cc: x86, linux-kernel, linux-mm, Andy Lutomirski, Aili Yao,
    HORIGUCHI NAOYA (堀口 直也)

> What I'm still unclear on, does this new version address that
> "mysterious" hang or panic which the validation team triggered or you
> haven't checked yet?

No :-(

They are triggering some case where multiple threads in a process hit the
same poison, and somehow memory_failure() fails to complete offlining the
page. At this point any other threads that hit that page get the early
return from memory_failure() (because the page flags say it is poisoned)
... and so we loop.

But the "recover from cases where multiple machine checks happen
simultaneously" case is orthogonal to the "do the right thing to recover
when the kernel touches poison at a user address" one. So I think we can
tackle them separately.

-Tony
End of thread, other threads: [~2021-04-19 21:41 UTC | newest]

Thread overview: 6+ messages:

2021-03-26  0:02 [RFC 0/4] Fix machine check recovery for copy_from_user Tony Luck
2021-03-26  0:02 ` [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
2021-04-08 13:36   ` Borislav Petkov
2021-04-08 16:06     ` Luck, Tony
2021-04-19 21:28 [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery Jue Wang
2021-04-19 21:41   ` Luck, Tony