* [PATCH 0/3] More machine check recovery fixes @ 2021-07-06 19:06 Tony Luck 2021-07-06 19:06 ` [PATCH 1/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck ` (3 more replies) 0 siblings, 4 replies; 45+ messages in thread From: Tony Luck @ 2021-07-06 19:06 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-kernel Fix a couple of issues in machine check handling 1) A repeated machine check inside the kernel without calling the task work function between machine checks it will go into an infinite loop 2) Machine checks in kernel functions copying data from user addresses send SIGBUS to the user as if the application had consumed the poison. But this is wrong. The user should see either an -EFAULT error return or a reduced byte count (in the case of write(2)). Tony Luck (3): x86/mce: Change to not send SIGBUS error during copy from user x86/mce: Avoid infinite loop for copy from user recovery x86/mce: Drop copyin special case for #MC arch/x86/kernel/cpu/mce/core.c | 62 ++++++++++++++++++++++++---------- arch/x86/lib/copy_user_64.S | 13 ------- include/linux/sched.h | 1 + 3 files changed, 45 insertions(+), 31 deletions(-) -- 2.29.2 ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 1/3] x86/mce: Change to not send SIGBUS error during copy from user 2021-07-06 19:06 [PATCH 0/3] More machine check recovery fixes Tony Luck @ 2021-07-06 19:06 ` Tony Luck 2021-07-06 19:06 ` [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck ` (2 subsequent siblings) 3 siblings, 0 replies; 45+ messages in thread From: Tony Luck @ 2021-07-06 19:06 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-kernel Sending a SIGBUS for a copy from user is not the correct semantic. System calls should return -EFAULT (or a short count for write(2)). Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/cpu/mce/core.c | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 22791aadc085..dd03971e5ad5 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1265,7 +1265,7 @@ static void kill_me_maybe(struct callback_head *cb) flags |= MF_MUST_KILL; ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags); - if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { + if (!ret) { set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); sync_core(); return; @@ -1279,25 +1279,27 @@ static void kill_me_maybe(struct callback_head *cb) if (ret == -EHWPOISON) return; - if (p->mce_vaddr != (void __user *)-1l) { - force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); - } else { - pr_err("Memory error not recovered"); - kill_me_now(cb); - } + pr_err("Memory error not recovered"); + kill_me_now(cb); +} + +static void kill_me_never(struct callback_head *cb) +{ + struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); + + pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr); + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0)) + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); } -static void queue_task_work(struct mce *m, int kill_current_task) +static void queue_task_work(struct mce *m, void (*func)(struct callback_head *)) { current->mce_addr = m->addr; current->mce_kflags = m->kflags; current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); current->mce_whole_page = whole_page(m); - if (kill_current_task) - current->mce_kill_me.func = kill_me_now; - else - current->mce_kill_me.func = kill_me_maybe; + current->mce_kill_me.func = func; task_work_add(current, ¤t->mce_kill_me, TWA_RESUME); } @@ -1435,7 +1437,10 @@ noinstr void do_machine_check(struct pt_regs *regs) /* If this triggers there is no way to recover. Die hard. */ BUG_ON(!on_thread_stack() || !user_mode(regs)); - queue_task_work(&m, kill_current_task); + if (kill_current_task) + queue_task_work(&m, kill_me_now); + else + queue_task_work(&m, kill_me_maybe); } else { /* @@ -1453,7 +1458,7 @@ noinstr void do_machine_check(struct pt_regs *regs) } if (m.kflags & MCE_IN_KERNEL_COPYIN) - queue_task_work(&m, kill_current_task); + queue_task_work(&m, kill_me_never); } out: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); -- 2.29.2 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-07-06 19:06 [PATCH 0/3] More machine check recovery fixes Tony Luck 2021-07-06 19:06 ` [PATCH 1/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck @ 2021-07-06 19:06 ` Tony Luck 2021-07-06 19:06 ` [PATCH 3/3] x86/mce: Drop copyin special case for #MC Tony Luck 2021-08-18 0:29 ` [PATCH v2 0/3] More machine check recovery fixes Tony Luck 3 siblings, 0 replies; 45+ messages in thread From: Tony Luck @ 2021-07-06 19:06 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-kernel Recovery action when get_user() triggers a machine check uses the fixup path to make get_user() return -EFAULT. Also queue_task_work() sets up so that kill_me_maybe() will be called on return to user mode to send a SIGBUS to the current process. But there are places in the kernel where the code assumes that this EFAULT return was simply because of a page fault. The code takes some action to fix that, and then retries the access. This results in a second machine check. While processing this second machine check queue_task_work() is called again. But since this uses the same callback_head structure that was used in the first call, the net result is an entry on the current->task_works list that points to itself. When task_work_run() is called it loops forever in this code: do { next = work->next; work->func(work); work = next; cond_resched(); } while (work); Add a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work call back is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is two machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same addrsss with page faults enabled). Just in case there is some code that loops forever enforce a limit of 10. Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/cpu/mce/core.c | 39 ++++++++++++++++++++++++++-------- include/linux/sched.h | 1 + 2 files changed, 31 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index dd03971e5ad5..957ec60cd2a8 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1250,6 +1250,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin static void kill_me_now(struct callback_head *ch) { + struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me); + + p->mce_count = 0; force_sig(SIGBUS); } @@ -1259,6 +1262,7 @@ static void kill_me_maybe(struct callback_head *cb) int flags = MF_ACTION_REQUIRED; int ret; + p->mce_count = 0; pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); if (!p->mce_ripv) @@ -1287,19 +1291,36 @@ static void kill_me_never(struct callback_head *cb) { struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); + p->mce_count = 0; pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr); if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0)) set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); } -static void queue_task_work(struct mce *m, void (*func)(struct callback_head *)) +static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *)) { - current->mce_addr = m->addr; - current->mce_kflags = m->kflags; - current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); - current->mce_whole_page = whole_page(m); + int count = ++current->mce_count; + + /* First call, save all the details */ + if (count == 1) { + current->mce_addr = m->addr; + current->mce_kflags = m->kflags; + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); + current->mce_whole_page = whole_page(m); + current->mce_kill_me.func = func; + } - current->mce_kill_me.func = func; + /* Ten is likley overkill. Don't expect more than two faults before task_work() */ + if (count > 10) + mce_panic("Too many machine checks while accessing user data", m, msg); + + /* Second or later call, make sure page address matches the one from first call */ + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT)) + mce_panic("Machine checks to different user pages", m, msg); + + /* Do not call task_work_add() more than once */ + if (count > 1) + return; task_work_add(current, ¤t->mce_kill_me, TWA_RESUME); } @@ -1438,9 +1459,9 @@ noinstr void do_machine_check(struct pt_regs *regs) BUG_ON(!on_thread_stack() || !user_mode(regs)); if (kill_current_task) - queue_task_work(&m, kill_me_now); + queue_task_work(&m, msg, kill_me_now); else - queue_task_work(&m, kill_me_maybe); + queue_task_work(&m, msg, kill_me_maybe); } else { /* @@ -1458,7 +1479,7 @@ noinstr void do_machine_check(struct pt_regs *regs) } if (m.kflags & MCE_IN_KERNEL_COPYIN) - queue_task_work(&m, kill_me_never); + queue_task_work(&m, msg, kill_me_never); } out: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); diff --git a/include/linux/sched.h b/include/linux/sched.h index ec8d07d88641..f6935787e7e8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1394,6 +1394,7 @@ struct task_struct { mce_whole_page : 1, __mce_reserved : 62; struct callback_head mce_kill_me; + int mce_count; #endif #ifdef CONFIG_KRETPROBES -- 2.29.2 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 3/3] x86/mce: Drop copyin special case for #MC 2021-07-06 19:06 [PATCH 0/3] More machine check recovery fixes Tony Luck 2021-07-06 19:06 ` [PATCH 1/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck 2021-07-06 19:06 ` [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck @ 2021-07-06 19:06 ` Tony Luck 2021-08-18 0:29 ` [PATCH v2 0/3] More machine check recovery fixes Tony Luck 3 siblings, 0 replies; 45+ messages in thread From: Tony Luck @ 2021-07-06 19:06 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-kernel Fixes to the iterator code to handle faults that are not on page boundaries mean that the special case for machine check during copy from user is no longer needed. Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/lib/copy_user_64.S | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index 57b79c577496..2797e630b9b1 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string) */ SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail) movl %edx,%ecx - cmp $X86_TRAP_MC,%eax /* check if X86_TRAP_MC */ - je 3f 1: rep movsb 2: mov %ecx,%eax ASM_CLAC ret - /* - * Return zero to pretend that this copy succeeded. This - * is counter-intuitive, but needed to prevent the code - * in lib/iov_iter.c from retrying and running back into - * the poison cache line again. The machine check handler - * will ensure that a SIGBUS is sent to the task. - */ -3: xorl %eax,%eax - ASM_CLAC - ret - _ASM_EXTABLE_CPY(1b, 2b) SYM_CODE_END(.Lcopy_user_handle_tail) -- 2.29.2 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH v2 0/3] More machine check recovery fixes 2021-07-06 19:06 [PATCH 0/3] More machine check recovery fixes Tony Luck ` (2 preceding siblings ...) 2021-07-06 19:06 ` [PATCH 3/3] x86/mce: Drop copyin special case for #MC Tony Luck @ 2021-08-18 0:29 ` Tony Luck 2021-08-18 0:29 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck ` (3 more replies) 3 siblings, 4 replies; 45+ messages in thread From: Tony Luck @ 2021-08-18 0:29 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel, Tony Luck Fix a couple of issues in machine check handling 1) A repeated machine check inside the kernel without calling the task work function between machine checks it will go into an infinite loop 2) Machine checks in kernel functions copying data from user addresses send SIGBUS to the user as if the application had consumed the poison. But this is wrong. The user should see either an -EFAULT error return or a reduced byte count (in the case of write(2)). My latest tests have been on v4.14-rc6 with this patch (that's already in -mm) applied: https://lore.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev Changes since v1: 1) Fix bug in kill_me_never() that forgot to clear p->mce_count so repeated recovery in the same task would trigger the panic for "Machine checks to different user pages" [Note to Jue Wang ... this *might* be why your test that injects two errors into the same buffer passed to a write(2) syscall failed with this message] 2) Re-order patches so that "Avoid infinite loop" can be backported to stable. Note that the other two parts of this series depend upon Al Viro's extensive re-work to lib/iov_iter.c ... so don't try to backport those without also picking up Al's work. Tony Luck (3): x86/mce: Avoid infinite loop for copy from user recovery x86/mce: Change to not send SIGBUS error during copy from user x86/mce: Drop copyin special case for #MC arch/x86/kernel/cpu/mce/core.c | 62 ++++++++++++++++++++++++---------- arch/x86/lib/copy_user_64.S | 13 ------- include/linux/sched.h | 1 + 3 files changed, 45 insertions(+), 31 deletions(-) base-commit: 7c60610d476766e128cc4284bb6349732cbd6606 -- 2.29.2 ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-18 0:29 ` [PATCH v2 0/3] More machine check recovery fixes Tony Luck @ 2021-08-18 0:29 ` Tony Luck 2021-08-20 17:31 ` Borislav Petkov 2021-09-13 9:24 ` Borislav Petkov 2021-08-18 0:29 ` [PATCH v2 2/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck ` (2 subsequent siblings) 3 siblings, 2 replies; 45+ messages in thread From: Tony Luck @ 2021-08-18 0:29 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel, Tony Luck Recovery action when get_user() triggers a machine check uses the fixup path to make get_user() return -EFAULT. Also queue_task_work() sets up so that kill_me_maybe() will be called on return to user mode to send a SIGBUS to the current process. But there are places in the kernel where the code assumes that this EFAULT return was simply because of a page fault. The code takes some action to fix that, and then retries the access. This results in a second machine check. While processing this second machine check queue_task_work() is called again. But since this uses the same callback_head structure that was used in the first call, the net result is an entry on the current->task_works list that points to itself. When task_work_run() is called it loops forever in this code: do { next = work->next; work->func(work); work = next; cond_resched(); } while (work); Add a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work call back is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is two machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same addrsss with page faults enabled). Just in case there is some code that loops forever enforce a limit of 10. Cc: <stable@vger.kernel.org> Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/cpu/mce/core.c | 43 +++++++++++++++++++++++++--------- include/linux/sched.h | 1 + 2 files changed, 33 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 22791aadc085..94830ee9581c 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1250,6 +1250,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin static void kill_me_now(struct callback_head *ch) { + struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me); + + p->mce_count = 0; force_sig(SIGBUS); } @@ -1259,6 +1262,7 @@ static void kill_me_maybe(struct callback_head *cb) int flags = MF_ACTION_REQUIRED; int ret; + p->mce_count = 0; pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); if (!p->mce_ripv) @@ -1287,17 +1291,34 @@ static void kill_me_maybe(struct callback_head *cb) } } -static void queue_task_work(struct mce *m, int kill_current_task) +static void queue_task_work(struct mce *m, char *msg, int kill_current_task) { - current->mce_addr = m->addr; - current->mce_kflags = m->kflags; - current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); - current->mce_whole_page = whole_page(m); + int count = ++current->mce_count; - if (kill_current_task) - current->mce_kill_me.func = kill_me_now; - else - current->mce_kill_me.func = kill_me_maybe; + /* First call, save all the details */ + if (count == 1) { + current->mce_addr = m->addr; + current->mce_kflags = m->kflags; + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); + current->mce_whole_page = whole_page(m); + + if (kill_current_task) + current->mce_kill_me.func = kill_me_now; + else + current->mce_kill_me.func = kill_me_maybe; + } + + /* Ten is likley overkill. Don't expect more than two faults before task_work() */ + if (count > 10) + mce_panic("Too many machine checks while accessing user data", m, msg); + + /* Second or later call, make sure page address matches the one from first call */ + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT)) + mce_panic("Machine checks to different user pages", m, msg); + + /* Do not call task_work_add() more than once */ + if (count > 1) + return; task_work_add(current, ¤t->mce_kill_me, TWA_RESUME); } @@ -1435,7 +1456,7 @@ noinstr void do_machine_check(struct pt_regs *regs) /* If this triggers there is no way to recover. Die hard. */ BUG_ON(!on_thread_stack() || !user_mode(regs)); - queue_task_work(&m, kill_current_task); + queue_task_work(&m, msg, kill_current_task); } else { /* @@ -1453,7 +1474,7 @@ noinstr void do_machine_check(struct pt_regs *regs) } if (m.kflags & MCE_IN_KERNEL_COPYIN) - queue_task_work(&m, kill_current_task); + queue_task_work(&m, msg, kill_current_task); } out: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); diff --git a/include/linux/sched.h b/include/linux/sched.h index ec8d07d88641..f6935787e7e8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1394,6 +1394,7 @@ struct task_struct { mce_whole_page : 1, __mce_reserved : 62; struct callback_head mce_kill_me; + int mce_count; #endif #ifdef CONFIG_KRETPROBES -- 2.29.2 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-18 0:29 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck @ 2021-08-20 17:31 ` Borislav Petkov 2021-08-20 18:59 ` Luck, Tony 2021-09-13 9:24 ` Borislav Petkov 1 sibling, 1 reply; 45+ messages in thread From: Borislav Petkov @ 2021-08-20 17:31 UTC (permalink / raw) To: Tony Luck Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote: > @@ -1287,17 +1291,34 @@ static void kill_me_maybe(struct callback_head *cb) > } > } > > -static void queue_task_work(struct mce *m, int kill_current_task) > +static void queue_task_work(struct mce *m, char *msg, int kill_current_task) > { > - current->mce_addr = m->addr; > - current->mce_kflags = m->kflags; > - current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); > - current->mce_whole_page = whole_page(m); > + int count = ++current->mce_count; > > - if (kill_current_task) > - current->mce_kill_me.func = kill_me_now; > - else > - current->mce_kill_me.func = kill_me_maybe; > + /* First call, save all the details */ > + if (count == 1) { > + current->mce_addr = m->addr; > + current->mce_kflags = m->kflags; > + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); > + current->mce_whole_page = whole_page(m); > + > + if (kill_current_task) > + current->mce_kill_me.func = kill_me_now; > + else > + current->mce_kill_me.func = kill_me_maybe; > + } > + > + /* Ten is likley overkill. Don't expect more than two faults before task_work() */ "likely" > + if (count > 10) > + mce_panic("Too many machine checks while accessing user data", m, msg); Ok, aren't we too nasty here? Why should we panic the whole box even with 10 MCEs? It is still user memory... IOW, why not: if (count > 10) current->mce_kill_me.func = kill_me_now; and when we return, that user process dies immediately. > + /* Second or later call, make sure page address matches the one from first call */ > + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT)) > + mce_panic("Machine checks to different user pages", m, msg); Same question here. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-20 17:31 ` Borislav Petkov @ 2021-08-20 18:59 ` Luck, Tony 2021-08-20 19:27 ` Borislav Petkov 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-08-20 18:59 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Fri, Aug 20, 2021 at 07:31:43PM +0200, Borislav Petkov wrote: > On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote: > > + /* Ten is likley overkill. Don't expect more than two faults before task_work() */ > > "likely" Oops. > > > + if (count > 10) > > + mce_panic("Too many machine checks while accessing user data", m, msg); > > Ok, aren't we too nasty here? Why should we panic the whole box even > with 10 MCEs? It is still user memory... > > IOW, why not: > > if (count > 10) > current->mce_kill_me.func = kill_me_now; > > and when we return, that user process dies immediately. It's the "when we return" part that is the problem here. Logical trace looks like: user-syscall: kernel does get_user() or copyin(), hits user poison address machine check sees that this was kernel get_user()/copyin() and uses extable to "return" to exception path still in kernel, see that get_user() or copyin() failed Kernel does another get_user() or copyin() (maybe the first was inside a pagefault_disable() region, and kernel is trying again to see if the error was a fixable page fault. But that wasn't the problem so ... machine check sees that this was kernel get_user()/copyin() and uses extable to "return" to exception path still in kernel ... but persistently thinks that just trying again might fix it. machine check sees that this was kernel get_user()/copyin() and uses extable to "return" to exception path still in kernel ... this time for sure! get_user() machine check sees that this was kernel get_user()/copyin() and uses extable to "return" to exception path still in kernel ... but you may see the pattern get_user() machine check sees that this was kernel get_user()/copyin() and uses extable to "return" to exception path I'm bored typing this, but the kernel may not ever give up machine check sees that this was kernel get_user()/copyin() and uses extable to "return" to exception path I.e. the kernel doesn't ever get to call current->mce_kill_me.func() I do have tests that show as many as 4 consecutive machine checks before the kernel gives up trying and returns to the user to complete recovery. Maybe the message could be clearer? mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg); > > > + /* Second or later call, make sure page address matches the one from first call */ > > + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT)) > > + mce_panic("Machine checks to different user pages", m, msg); > > Same question here. Not quite the same answer ... but similar. We could in theory handle multiple different machine check addresses by turning the "mce_addr" field in the task structure into an array and saving each address so that when the kernel eventually gives up poking at poison and tries to return to user kill_me_maybe() could loop through them and deal with each poison page. I don't think this can happen. Jue Wang suggested that multiple poisoned pages passed to a single write(2) syscall might trigger this panic (and because of a bug in my earlier version, he managed to trigger this "different user pages" panic). But this fixed up version survives the "Jue test". -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-20 18:59 ` Luck, Tony @ 2021-08-20 19:27 ` Borislav Petkov 2021-08-20 20:23 ` Luck, Tony 2021-08-20 20:33 ` Luck, Tony 0 siblings, 2 replies; 45+ messages in thread From: Borislav Petkov @ 2021-08-20 19:27 UTC (permalink / raw) To: Luck, Tony Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote: > It's the "when we return" part that is the problem here. Logical > trace looks like: > > user-syscall: > > kernel does get_user() or copyin(), hits user poison address > > machine check > sees that this was kernel get_user()/copyin() and > uses extable to "return" to exception path > > still in kernel, see that get_user() or copyin() failed > > Kernel does another get_user() or copyin() (maybe the first I forgot all the details we were talking at the time but there's no way to tell the kernel to back off here, is it? As in: there was an MCE while trying to access this user memory, you should not do get_user anymore. You did add that * Return zero to pretend that this copy succeeded. This * is counter-intuitive, but needed to prevent the code * in lib/iov_iter.c from retrying and running back into which you're removing with the last patch so I'm confused. IOW, the problem is that with repeated MCEs while the kernel is accessing that memory, it should be the kernel which should back off. And then we should kill that process too but apparently we don't even come to that. > Maybe the message could be clearer? > > mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg); That's not my point - it is rather: this is a recoverable error because it is in user memory even if it is the kernel which tries to access it. And maybe we should not panic the whole box but try to cordon off the faulty memory only and poison it after having killed the process using it... > Not quite the same answer ... but similar. We could in theory handle > multiple different machine check addresses by turning the "mce_addr" > field in the task structure into an array and saving each address so > that when the kernel eventually gives up poking at poison and tries > to return to user kill_me_maybe() could loop through them and deal > with each poison page. Yes, I like the aspect of making the kernel give up poking at poison and when we return we should kill the process and poison all pages collected so that the error source is hopefully contained. But again, I think the important thing is how to make the kernel to back off quicker so that we can poison the pages at all... -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-20 19:27 ` Borislav Petkov @ 2021-08-20 20:23 ` Luck, Tony 2021-08-21 4:51 ` Tony Luck 2021-08-22 14:36 ` Borislav Petkov 2021-08-20 20:33 ` Luck, Tony 1 sibling, 2 replies; 45+ messages in thread From: Luck, Tony @ 2021-08-20 20:23 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Fri, Aug 20, 2021 at 09:27:44PM +0200, Borislav Petkov wrote: > On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote: > > It's the "when we return" part that is the problem here. Logical > > trace looks like: > > > > user-syscall: > > > > kernel does get_user() or copyin(), hits user poison address > > > > machine check > > sees that this was kernel get_user()/copyin() and > > uses extable to "return" to exception path > > > > still in kernel, see that get_user() or copyin() failed > > > > Kernel does another get_user() or copyin() (maybe the first > > I forgot all the details we were talking at the time but there's no way > to tell the kernel to back off here, is it? > > As in: there was an MCE while trying to access this user memory, you > should not do get_user anymore. You did add that > > * Return zero to pretend that this copy succeeded. This > * is counter-intuitive, but needed to prevent the code > * in lib/iov_iter.c from retrying and running back into > > which you're removing with the last patch so I'm confused. > > IOW, the problem is that with repeated MCEs while the kernel is > accessing that memory, it should be the kernel which should back off. > And then we should kill that process too but apparently we don't even > come to that. My first foray into this just tried to fix the futex() case ... which is one of the first copy is with pagefault_disable() set, then try again with it clear. My attempt there was to make get_user() return -EHWPOISON, and change the futex code to give up immediatley when it saw that code. AndyL (and maybe others) barfed all over that (and rightly so ... there are thousands of get_user() and copyin() calls ... making sure all of them did the right thing with a new error code would be a huge effort. Very error prone (because testing all these paths is hard). New direction was just deal with the fact that we might take more than one machine check before the kernel is finished poking at the poison. > > > Maybe the message could be clearer? > > > > mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg); > > That's not my point - it is rather: this is a recoverable error because > it is in user memory even if it is the kernel which tries to access it. > And maybe we should not panic the whole box but try to cordon off the > faulty memory only and poison it after having killed the process using > it... To recover we need to have some other place to jump to (besides the normal extable error return ... which isn't working if we find ourselves in this situation) when we hit a fault covered by an extable entry. And also know how many machine checks is "normal" before taking the other path. For futex(2) things resolve in two machine checks (one with pagefault_disable() and then one without). For write(2) I see up to four machine cehcks (I didn't do a detailed track ... but I think it is because copyin() does a retry to see if a failure in a many-bytes-at-atime copy might be able to get a few more bytes by doing byte-at-a-time). The logical spot for the alternate return would be to unwind the stack back to the syscall entry point, and then force an error return from there. But that seems a perilous path ... what if some code between syscall entry and the copyin() grabbed a mutex? We would also need to know about that and release it as part of the recovery. Another failed approach was to mark the page not present in the page tables of the process accessing the poison. That would get us out of the machine check loop. But walking page tables and changing them while still in machine check context can't be done in a safe way (the process may be multi-threaded and other threads could still be running on other cores). Bottom line is that I don't think this panic can actually happen unless there is some buggy kernel code that retries get_user() or copyin() indefinitely. Probably the same for the two different addresses case ... though I'm not 100% confident about that. There could be some ioctl() that peeks at two parts of a passed in structure, and the user might pass in a structure that spans across a page boundary with both pages poisoned. But that would only hit if the driver code ignored the failure of the first get_user() and blindly tried the second. So I'd count that as a critically bad driver bug. -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-20 20:23 ` Luck, Tony @ 2021-08-21 4:51 ` Tony Luck 2021-08-21 21:51 ` Al Viro 2021-08-22 14:36 ` Borislav Petkov 1 sibling, 1 reply; 45+ messages in thread From: Tony Luck @ 2021-08-21 4:51 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, HORIGUCHI NAOYA(堀口 直也), Oscar Salvador, Youquan Song, huangcun, X86-ML, Linux Edac Mailing List, Linux-MM, Linux Kernel Mailing List On Fri, Aug 20, 2021 at 1:25 PM Luck, Tony <tony.luck@intel.com> wrote: > Probably the same for the two different addresses case ... though I'm > not 100% confident about that. There could be some ioctl() that peeks > at two parts of a passed in structure, and the user might pass in a > structure that spans across a page boundary with both pages poisoned. > But that would only hit if the driver code ignored the failure of the > first get_user() and blindly tried the second. So I'd count that as a > critically bad driver bug. Or maybe driver writers are just evil :-( for (i = 0; i < len; i++) { tx_wait(10); get_user(dsp56k_host_interface.data.b[1], bin++); get_user(dsp56k_host_interface.data.b[2], bin++); get_user(dsp56k_host_interface.data.b[3], bin++); } -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-21 4:51 ` Tony Luck @ 2021-08-21 21:51 ` Al Viro 0 siblings, 0 replies; 45+ messages in thread From: Al Viro @ 2021-08-21 21:51 UTC (permalink / raw) To: Tony Luck Cc: Borislav Petkov, Jue Wang, Ding Hui, HORIGUCHI NAOYA(堀口 直也), Oscar Salvador, Youquan Song, huangcun, X86-ML, Linux Edac Mailing List, Linux-MM, Linux Kernel Mailing List On Fri, Aug 20, 2021 at 09:51:41PM -0700, Tony Luck wrote: > On Fri, Aug 20, 2021 at 1:25 PM Luck, Tony <tony.luck@intel.com> wrote: > > Probably the same for the two different addresses case ... though I'm > > not 100% confident about that. There could be some ioctl() that peeks > > at two parts of a passed in structure, and the user might pass in a > > structure that spans across a page boundary with both pages poisoned. > > But that would only hit if the driver code ignored the failure of the > > first get_user() and blindly tried the second. So I'd count that as a > > critically bad driver bug. > > Or maybe driver writers are just evil :-( > > for (i = 0; i < len; i++) { > tx_wait(10); > get_user(dsp56k_host_interface.data.b[1], bin++); > get_user(dsp56k_host_interface.data.b[2], bin++); > get_user(dsp56k_host_interface.data.b[3], bin++); > } Almost any unchecked get_user()/put_user() is a bug. Fortunately, there's not a lot of them <greps> 93 for put_user() and 73 for get_user(). _Some_ of the former variety might be legitimate, but most should be taken out and shot. And dsp56k should be taken out and shot, period ;-/ This is far from the worst in there... ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-20 20:23 ` Luck, Tony 2021-08-21 4:51 ` Tony Luck @ 2021-08-22 14:36 ` Borislav Petkov 1 sibling, 0 replies; 45+ messages in thread From: Borislav Petkov @ 2021-08-22 14:36 UTC (permalink / raw) To: Luck, Tony Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Fri, Aug 20, 2021 at 01:23:46PM -0700, Luck, Tony wrote: > To recover we need to have some other place to jump to (besides the > normal extable error return ... which isn't working if we find ourselves > in this situation) when we hit a fault covered by an extable entry. And > also know how many machine checks is "normal" before taking the other path. Hohumm, we're on the same page here. ... > Bottom line is that I don't think this panic can actually happen unless > there is some buggy kernel code that retries get_user() or copyin() > indefinitely. You know how such statements of "well, this should not really happen in practice" get disproved by, well, practice. :-) I guess we'll see anyway what actually happens in practice. > Probably the same for the two different addresses case ... though I'm > not 100% confident about that. There could be some ioctl() that peeks > at two parts of a passed in structure, and the user might pass in a > structure that spans across a page boundary with both pages poisoned. > But that would only hit if the driver code ignored the failure of the > first get_user() and blindly tried the second. So I'd count that as a > critically bad driver bug. Right. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-20 19:27 ` Borislav Petkov 2021-08-20 20:23 ` Luck, Tony @ 2021-08-20 20:33 ` Luck, Tony 2021-08-22 14:46 ` Borislav Petkov 1 sibling, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-08-20 20:33 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Fri, Aug 20, 2021 at 09:27:44PM +0200, Borislav Petkov wrote: > On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote: > As in: there was an MCE while trying to access this user memory, you > should not do get_user anymore. You did add that > > * Return zero to pretend that this copy succeeded. This > * is counter-intuitive, but needed to prevent the code > * in lib/iov_iter.c from retrying and running back into > > which you're removing with the last patch so I'm confused. Forget to address this part in the earlier reply. My original code that forced a zero return has a hack. It allowed recovery to complete, but only because there was going to be a SIGBUS. There were some unplesant side effects. E.g. on a write syscall the file size was updated as if the write had succeeded. That would be very confusing for anyone trying to clean up afterwards as the file would have good data that was copied from the user up to the point where the machine check interrupted things. Then NUL bytes after (because the kernel clears pages that are allocated into the page cache). The new version (thanks to All fixing iov_iter.c) now does exactly what POSIX says should happen. If I have a buffer with poison at offset 213, and I do this: ret = write(fd, buf, 512); Then the return from write is 213, and the first 213 bytes from the buffer appear in the file, and the file size is incremented by 213 (assuming the write started with the lseek offset at the original size of the file). -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-20 20:33 ` Luck, Tony @ 2021-08-22 14:46 ` Borislav Petkov 2021-08-23 15:24 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Borislav Petkov @ 2021-08-22 14:46 UTC (permalink / raw) To: Luck, Tony Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Fri, Aug 20, 2021 at 01:33:56PM -0700, Luck, Tony wrote: > The new version (thanks to All fixing iov_iter.c) now does > exactly what POSIX says should happen. If I have a buffer > with poison at offset 213, and I do this: > > ret = write(fd, buf, 512); > > Then the return from write is 213, and the first 213 bytes > from the buffer appear in the file, and the file size is > incremented by 213 (assuming the write started with the lseek > offset at the original size of the file). ... and the user still gets a SIGBUS so that it gets a chance to handle the encountered poison? I.e., not retry the write for the remaining 512 - 213 bytes? If so, do we document that somewhere so that application writers can know what they should do in such cases? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-22 14:46 ` Borislav Petkov @ 2021-08-23 15:24 ` Luck, Tony 0 siblings, 0 replies; 45+ messages in thread From: Luck, Tony @ 2021-08-23 15:24 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Sun, Aug 22, 2021 at 04:46:14PM +0200, Borislav Petkov wrote: > On Fri, Aug 20, 2021 at 01:33:56PM -0700, Luck, Tony wrote: > > The new version (thanks to All fixing iov_iter.c) now does > > exactly what POSIX says should happen. If I have a buffer > > with poison at offset 213, and I do this: > > > > ret = write(fd, buf, 512); > > > > Then the return from write is 213, and the first 213 bytes > > from the buffer appear in the file, and the file size is > > incremented by 213 (assuming the write started with the lseek > > offset at the original size of the file). > > ... and the user still gets a SIGBUS so that it gets a chance to handle > the encountered poison? I.e., not retry the write for the remaining 512 > - 213 bytes? Whether the user gets a SIGBUS depends on what they do next. In a typical user loop trying to do a write: while (nbytes) { ret = write(fd, buf, nbytes); if (ret == -1) return ret; buf += ret; nbytes -= ret; } The next iteration after the short write caused by the machine check will return ret == -1, errno = EFAULT. Andy Lutomirski convinced me that the kernel should not send a SIGBUS to an application when the kernel accesses the poison in user memory. If the user tries to access the page with the poison directly they'll get a SIGBUS (page was unmapped so user gets a #PF, but the x86 fault handler sees that the page was unmapped because of poison, so sends a SIGBUS). > If so, do we document that somewhere so that application writers can > know what they should do in such cases? Applications see a failed write ... they should do whatever they would normally do for a failed write. -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-08-18 0:29 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck 2021-08-20 17:31 ` Borislav Petkov @ 2021-09-13 9:24 ` Borislav Petkov 2021-09-13 21:52 ` [PATCH v3] " Luck, Tony 1 sibling, 1 reply; 45+ messages in thread From: Borislav Petkov @ 2021-09-13 9:24 UTC (permalink / raw) To: Tony Luck Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote: > Recovery action when get_user() triggers a machine check uses the fixup > path to make get_user() return -EFAULT. Also queue_task_work() sets up > so that kill_me_maybe() will be called on return to user mode to send > a SIGBUS to the current process. > > But there are places in the kernel where the code assumes that this > EFAULT return was simply because of a page fault. The code takes some > action to fix that, and then retries the access. This results in a second > machine check. > > While processing this second machine check queue_task_work() is called > again. But since this uses the same callback_head structure that was used > in the first call, the net result is an entry on the current->task_works > list that points to itself. When task_work_run() is called it loops > forever in this code: > > do { > next = work->next; > work->func(work); > work = next; > cond_resched(); > } while (work); > > Add a counter (current->mce_count) to keep track of repeated machine > checks before task_work() is called. First machine check saves the address > information and calls task_work_add(). Subsequent machine checks before > that task_work call back is executed check that the address is in the > same page as the first machine check (since the callback will offline > exactly one page). > > Expected worst case is two machine checks before moving on (e.g. one user > access with page faults disabled, then a repeat to the same addrsss with > page faults enabled). Just in case there is some code that loops forever > enforce a limit of 10. > > Cc: <stable@vger.kernel.org> What about a Fixes: tag? I guess backporting this to the respective kernels is predicated upon the existence of those other "places" in the kernel where code assumes the EFAULT was because of a #PF. Hmmm? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery 2021-09-13 9:24 ` Borislav Petkov @ 2021-09-13 21:52 ` Luck, Tony 2021-09-14 8:28 ` Borislav Petkov 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-09-13 21:52 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel There are two cases for machine check recovery: 1) The machine check was triggered by ring3 (application) code. This is the simpler case. The machine check handler simply queues work to be executed on return to user. That code unmaps the page from all users and arranges to send a SIGBUS to the task that triggered the poison. 2) The machine check was triggered in kernel code that is covered by an extable entry. In this case the machine check handler still queues a work entry to unmap the page, etc. but this will not be called right away because the #MC handler returns to the fix up code address in the extable entry. Problems occur if the kernel triggers another machine check before the return to user processes the first queued work item. Specifically the work is queued using the "mce_kill_me" callback structure in the task struct for the current thread. Attempting to queue a second work item using this same callback results in a loop in the linked list of work functions to call. So when the kernel does return to user it enters an infinite loop processing the same entry for ever. There are some legitimate scenarios where the kernel may take a second machine check before returning to the user. 1) Some code (e.g. futex) first tries a get_user() with page faults disabled. If this fails, the code retries with page faults enabled expecting that this will resolve the page fault. 2) Copy from user code retries a copy in byte-at-time mode to check whether any additional bytes can be copied. On the other side of the fence are some bad drivers that do not check the return value from individual get_user() calls and may access multiple user addresses without noticing that some/all calls have failed. Fix by adding a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work call back is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is four machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled ... repeat in copy tail bytes). Just in case there is some code that loops forever enforce a limit of 10. Also mark queue_task_work() as "noinstr" (as reported kernel test robot <lkp@intel.com>) Cc: <stable@vger.kernel.org> Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Signed-off-by: Tony Luck <tony.luck@intel.com> --- > What about a Fixes: tag? Added a Fixes tag. Also added "noinstr" to queue_task_work() per a kernel robot report. Also re-wrote the commit comment (based on questions raised against v2) > I guess backporting this to the respective kernels is predicated upon > the existence of those other "places" in the kernel where code assumes > the EFAULT was because of a #PF. Not really. I don't expect to change any kernel code that just bounces off the same machine check a few times. This patch does work best in conjunction with patches 2 & 3 (unchanged, not reposted here). But it will fix some old issues even without those two. arch/x86/kernel/cpu/mce/core.c | 43 +++++++++++++++++++++++++--------- include/linux/sched.h | 1 + 2 files changed, 33 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 8cb7816d03b4..9891b4070a61 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1253,6 +1253,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin static void kill_me_now(struct callback_head *ch) { + struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me); + + p->mce_count = 0; force_sig(SIGBUS); } @@ -1262,6 +1265,7 @@ static void kill_me_maybe(struct callback_head *cb) int flags = MF_ACTION_REQUIRED; int ret; + p->mce_count = 0; pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); if (!p->mce_ripv) @@ -1290,17 +1294,34 @@ static void kill_me_maybe(struct callback_head *cb) } } -static void queue_task_work(struct mce *m, int kill_current_task) +static noinstr void queue_task_work(struct mce *m, char *msg, int kill_current_task) { - current->mce_addr = m->addr; - current->mce_kflags = m->kflags; - current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); - current->mce_whole_page = whole_page(m); + int count = ++current->mce_count; - if (kill_current_task) - current->mce_kill_me.func = kill_me_now; - else - current->mce_kill_me.func = kill_me_maybe; + /* First call, save all the details */ + if (count == 1) { + current->mce_addr = m->addr; + current->mce_kflags = m->kflags; + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); + current->mce_whole_page = whole_page(m); + + if (kill_current_task) + current->mce_kill_me.func = kill_me_now; + else + current->mce_kill_me.func = kill_me_maybe; + } + + /* Ten is likley overkill. Don't expect more than two faults before task_work() */ + if (count > 10) + mce_panic("Too many machine checks while accessing user data", m, msg); + + /* Second or later call, make sure page address matches the one from first call */ + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT)) + mce_panic("Machine checks to different user pages", m, msg); + + /* Do not call task_work_add() more than once */ + if (count > 1) + return; task_work_add(current, ¤t->mce_kill_me, TWA_RESUME); } @@ -1438,7 +1459,7 @@ noinstr void do_machine_check(struct pt_regs *regs) /* If this triggers there is no way to recover. Die hard. */ BUG_ON(!on_thread_stack() || !user_mode(regs)); - queue_task_work(&m, kill_current_task); + queue_task_work(&m, msg, kill_current_task); } else { /* @@ -1456,7 +1477,7 @@ noinstr void do_machine_check(struct pt_regs *regs) } if (m.kflags & MCE_IN_KERNEL_COPYIN) - queue_task_work(&m, kill_current_task); + queue_task_work(&m, msg, kill_current_task); } out: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); diff --git a/include/linux/sched.h b/include/linux/sched.h index e12b524426b0..39039ce8ac4c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1471,6 +1471,7 @@ struct task_struct { mce_whole_page : 1, __mce_reserved : 62; struct callback_head mce_kill_me; + int mce_count; #endif #ifdef CONFIG_KRETPROBES -- 2.31.1 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery 2021-09-13 21:52 ` [PATCH v3] " Luck, Tony @ 2021-09-14 8:28 ` Borislav Petkov 0 siblings, 0 replies; 45+ messages in thread From: Borislav Petkov @ 2021-09-14 8:28 UTC (permalink / raw) To: Luck, Tony Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Mon, Sep 13, 2021 at 02:52:39PM -0700, Luck, Tony wrote: > Also mark queue_task_work() as "noinstr" (as reported kernel test robot > <lkp@intel.com>) Yeah, that's not enough - I have a patchset in the works for all this so I'm going to drop your annotation. > Cc: <stable@vger.kernel.org> > Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Ah ok, that one makes sense. > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > > > What about a Fixes: tag? > > Added a Fixes tag. > > Also added "noinstr" to queue_task_work() per a kernel robot report. > > Also re-wrote the commit comment (based on questions raised against v2) Thanks - very much appreciated and it reads really good! > > I guess backporting this to the respective kernels is predicated upon > > the existence of those other "places" in the kernel where code assumes > > the EFAULT was because of a #PF. > > Not really. I don't expect to change any kernel code that just bounces > off the same machine check a few times. This patch does work best in > conjunction with patches 2 & 3 (unchanged, not reposted here). But it > will fix some old issues even without those two. Ok, got it. /me queues. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v2 2/3] x86/mce: Change to not send SIGBUS error during copy from user 2021-08-18 0:29 ` [PATCH v2 0/3] More machine check recovery fixes Tony Luck 2021-08-18 0:29 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck @ 2021-08-18 0:29 ` Tony Luck 2021-08-18 0:29 ` [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC Tony Luck 2021-08-18 16:14 ` [PATCH v2 0/3] More machine check recovery fixes Luck, Tony 3 siblings, 0 replies; 45+ messages in thread From: Tony Luck @ 2021-08-18 0:29 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel, Tony Luck Sending a SIGBUS for a copy from user is not the correct semantic. System calls should return -EFAULT (or a short count for write(2)). Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/cpu/mce/core.c | 35 +++++++++++++++++++--------------- 1 file changed, 20 insertions(+), 15 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 94830ee9581c..957ec60cd2a8 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1269,7 +1269,7 @@ static void kill_me_maybe(struct callback_head *cb) flags |= MF_MUST_KILL; ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags); - if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { + if (!ret) { set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); sync_core(); return; @@ -1283,15 +1283,21 @@ static void kill_me_maybe(struct callback_head *cb) if (ret == -EHWPOISON) return; - if (p->mce_vaddr != (void __user *)-1l) { - force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); - } else { - pr_err("Memory error not recovered"); - kill_me_now(cb); - } + pr_err("Memory error not recovered"); + kill_me_now(cb); +} + +static void kill_me_never(struct callback_head *cb) +{ + struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); + + p->mce_count = 0; + pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr); + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0)) + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); } -static void queue_task_work(struct mce *m, char *msg, int kill_current_task) +static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *)) { int count = ++current->mce_count; @@ -1301,11 +1307,7 @@ static void queue_task_work(struct mce *m, char *msg, int kill_current_task) current->mce_kflags = m->kflags; current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); current->mce_whole_page = whole_page(m); - - if (kill_current_task) - current->mce_kill_me.func = kill_me_now; - else - current->mce_kill_me.func = kill_me_maybe; + current->mce_kill_me.func = func; } /* Ten is likley overkill. Don't expect more than two faults before task_work() */ @@ -1456,7 +1458,10 @@ noinstr void do_machine_check(struct pt_regs *regs) /* If this triggers there is no way to recover. Die hard. */ BUG_ON(!on_thread_stack() || !user_mode(regs)); - queue_task_work(&m, msg, kill_current_task); + if (kill_current_task) + queue_task_work(&m, msg, kill_me_now); + else + queue_task_work(&m, msg, kill_me_maybe); } else { /* @@ -1474,7 +1479,7 @@ noinstr void do_machine_check(struct pt_regs *regs) } if (m.kflags & MCE_IN_KERNEL_COPYIN) - queue_task_work(&m, msg, kill_current_task); + queue_task_work(&m, msg, kill_me_never); } out: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); -- 2.29.2 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC 2021-08-18 0:29 ` [PATCH v2 0/3] More machine check recovery fixes Tony Luck 2021-08-18 0:29 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck 2021-08-18 0:29 ` [PATCH v2 2/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck @ 2021-08-18 0:29 ` Tony Luck 2021-09-20 9:13 ` Borislav Petkov 2021-08-18 16:14 ` [PATCH v2 0/3] More machine check recovery fixes Luck, Tony 3 siblings, 1 reply; 45+ messages in thread From: Tony Luck @ 2021-08-18 0:29 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel, Tony Luck Fixes to the iterator code to handle faults that are not on page boundaries mean that the special case for machine check during copy from user is no longer needed. Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/lib/copy_user_64.S | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index 57b79c577496..2797e630b9b1 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string) */ SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail) movl %edx,%ecx - cmp $X86_TRAP_MC,%eax /* check if X86_TRAP_MC */ - je 3f 1: rep movsb 2: mov %ecx,%eax ASM_CLAC ret - /* - * Return zero to pretend that this copy succeeded. This - * is counter-intuitive, but needed to prevent the code - * in lib/iov_iter.c from retrying and running back into - * the poison cache line again. The machine check handler - * will ensure that a SIGBUS is sent to the task. - */ -3: xorl %eax,%eax - ASM_CLAC - ret - _ASM_EXTABLE_CPY(1b, 2b) SYM_CODE_END(.Lcopy_user_handle_tail) -- 2.29.2 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC 2021-08-18 0:29 ` [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC Tony Luck @ 2021-09-20 9:13 ` Borislav Petkov 2021-09-20 16:18 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Borislav Petkov @ 2021-09-20 9:13 UTC (permalink / raw) To: Tony Luck Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Youquan Song, huangcun, x86, linux-edac, linux-mm, linux-kernel On Tue, Aug 17, 2021 at 05:29:42PM -0700, Tony Luck wrote: > Fixes to the iterator code to handle faults Can we name some of those fixes here pls? We will forget which those were in the future so it would be good to leave some breadcrumbs at least... Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC 2021-09-20 9:13 ` Borislav Petkov @ 2021-09-20 16:18 ` Luck, Tony 2021-09-20 16:37 ` Borislav Petkov 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-09-20 16:18 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Song, Youquan, huangcun, x86, linux-edac, linux-mm, linux-kernel > Can we name some of those fixes here pls? Some/all of this bunch from Al Viro: a180bd1d7e16 iov_iter: remove uaccess_kernel() warning from iov_iter_init() 6852df126699 csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller 2a510a744beb clean up copy_mc_pipe_to_iter() 893839fd5733 pipe_zero(): we don't need no stinkin' kmap_atomic()... 2495bdcc86dc iov_iter: clean csum_and_copy_...() primitives up a bit 55ca375c5dcc copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases c1d4d6a9ae88 copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases 4b179e9a9c7c iterate_xarray(): only of the first iteration we might get offset != 0 a6e4ec7bfd32 pull handling of ->iov_offset into iterate_{iovec,bvec,xarray} 7baa5099002f iov_iter: make iterator callbacks use base and len instead of iovec 622838f3fde2 iov_iter: make the amount already copied available to iterator callbacks 21b56c847753 iov_iter: get rid of separate bvec and xarray callbacks 1b4fb5ffd79b iov_iter: teach iterate_{bvec,xarray}() about possible short copies 7491a2bf64e3 iterate_bvec(): expand bvec.h macro forest, massage a bit 5c67aa90cd5c iov_iter: unify iterate_iovec and iterate_kvec 7a1bcb5d255d iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec f5da83545f4e iterate_and_advance(): get rid of magic in case when n is 0 594e450b3f44 csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter() f0b65f39ac50 iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant e4f8df86798a [xarray] iov_iter_npages(): just use DIV_ROUND_UP() 66531c65aa25 iov_iter_npages(): don't bother with iterate_all_kinds() 3d671ca62a08 get rid of iterate_all_kinds() in iov_iter_get_pages()/iov_iter_get_pages_alloc() 610c7a71543d iov_iter_gap_alignment(): get rid of iterate_all_kinds() 9221d2e37b72 iov_iter_alignment(): don't bother with iterate_all_kinds() 8409a0d261e2 sanitize iov_iter_fault_in_readable() 185ac4d43669 iov_iter: optimize iov_iter_advance() for iovec and kvec 8cd54c1c8480 iov_iter: separate direction from flavour 556351c1c09a iov_iter_advance(): don't modify ->iov_offset for ITER_DISCARD 28f38db7edbf iov_iter: reorder handling of flavours in primitives 4b6c132b7da6 iov_iter: switch ..._full() variants of primitives to use of iov_iter_revert() 3b3fc051cd2c iov_iter_advance(): use consistent semantics for move past the end 0e8f0d674015 [xarray] iov_iter_fault_in_readable() should do nothing in xarray case a506abc7b644 copy_page_to_iter(): fix ITER_DISCARD case 08aa64796016 teach copy_page_to_iter() to handle compound pages 66cd071a1f83 iov_iter: Remove iov_iter_for_each_range() -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC 2021-09-20 16:18 ` Luck, Tony @ 2021-09-20 16:37 ` Borislav Petkov 2021-09-20 16:43 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Borislav Petkov @ 2021-09-20 16:37 UTC (permalink / raw) To: Luck, Tony Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Song, Youquan, huangcun, x86, linux-edac, linux-mm, linux-kernel On Mon, Sep 20, 2021 at 04:18:58PM +0000, Luck, Tony wrote: > > Can we name some of those fixes here pls? > > Some/all of this bunch from Al Viro: Is this how you generated that list, per chance? $ git log --oneline v5.14 -- lib/iov_iter.c ? Output looks at least similar to what you've pasted... -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC 2021-09-20 16:37 ` Borislav Petkov @ 2021-09-20 16:43 ` Luck, Tony 0 siblings, 0 replies; 45+ messages in thread From: Luck, Tony @ 2021-09-20 16:43 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Song, Youquan, huangcun, x86, linux-edac, linux-mm, linux-kernel > Is this how you generated that list, per chance? > > $ git log --oneline v5.14 -- lib/iov_iter.c Almost. I had "v5.14 ^v5.13" to stop git from going back further than when I think Al started applying those fixes. -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH v2 0/3] More machine check recovery fixes 2021-08-18 0:29 ` [PATCH v2 0/3] More machine check recovery fixes Tony Luck ` (2 preceding siblings ...) 2021-08-18 0:29 ` [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC Tony Luck @ 2021-08-18 16:14 ` Luck, Tony 3 siblings, 0 replies; 45+ messages in thread From: Luck, Tony @ 2021-08-18 16:14 UTC (permalink / raw) To: Borislav Petkov Cc: Jue Wang, Ding Hui, naoya.horiguchi, osalvador, Song, Youquan, huangcun, x86, linux-edac, linux-mm, linux-kernel > Changes since v1: > 1) Fix bug in kill_me_never() that forgot to clear p->mce_count so > repeated recovery in the same task would trigger the panic for > "Machine checks to different user pages" > [Note to Jue Wang ... this *might* be why your test that injects > two errors into the same buffer passed to a write(2) syscall > failed with this message] I recreated Jue's specific test today with uncorrected errors in two pages passed to a write(2) syscall. buf = alloc(2 pages); inject(buf + 0x440); inject*buf + 0x11c0); n = write(fd, buf, 8K); Result was that the write returned 0x440 (i.e. bytes written up to the first poison location). -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 0/2] Fix infinite machine check loop in futex_wait_setup() @ 2021-01-08 22:22 Tony Luck 2021-01-11 21:44 ` [PATCH v2 0/3] " Tony Luck 0 siblings, 1 reply; 45+ messages in thread From: Tony Luck @ 2021-01-08 22:22 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski, linux-kernel, linux-edac, linux-mm Linux can now recover from machine checks where kernel code is doing get_user() to access application memory. But there isn't a way to distinguish whether get_user() failed because of a page fault or a machine check. Thus there is a problem if any kernel code thinks it can retry an access after doing something that would fix the page fault. One such example (I'm sure there are more) is in futex_wait_setup() where an attempt to read the futex with page faults disabled. Then a retry (after dropping a lock so page faults are safe): ret = get_futex_value_locked(&uval, uaddr); if (ret) { queue_unlock(*hb); ret = get_user(uval, uaddr); It would be good to avoid deliberately taking a second machine check (especially as the recovery code does really bad things and ends up in an infinite loop!). My proposal is to add a new function arch_memory_failure() that can be called after get_user() returns -EFAULT to allow graceful recovery. Futex reviewers: I just have one new call (that fixes my test case). If you could point out other places this is needed, that would be most helpful. Patch roadmap: Part 1: Add code to avoid the infinite loop in the machine check code. Just panic if code runs into the same machine check a second time. This should make it much easier to debug other places where this happens. Part 2: Add arch_memory_failure() and use it in futex_wait_setup(). [Suggestions gladly accepted for the current best way to handle the #defines etc. to define an arch specific function to be used in generic code] Tony Luck (2): x86/mce: Avoid infinite loop for copy from user recovery futex, x86/mce: Avoid double machine checks arch/x86/include/asm/mmu.h | 7 +++++++ arch/x86/kernel/cpu/mce/core.c | 17 ++++++++++++++++- include/linux/mm.h | 4 ++++ include/linux/sched.h | 3 ++- kernel/futex.c | 3 +++ 5 files changed, 32 insertions(+), 2 deletions(-) -- 2.21.1 ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v2 0/3] Fix infinite machine check loop in futex_wait_setup() 2021-01-08 22:22 [PATCH 0/2] Fix infinite machine check loop in futex_wait_setup() Tony Luck @ 2021-01-11 21:44 ` Tony Luck 2021-01-11 21:44 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck 0 siblings, 1 reply; 45+ messages in thread From: Tony Luck @ 2021-01-11 21:44 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski, linux-kernel, linux-edac, linux-mm Linux can now recover from machine checks where kernel code is doing get_user() to access application memory. But there isn't a way to distinguish whether get_user() failed because of a page fault or a machine check. Thus there is a problem if any kernel code thinks it can retry an access after doing something that would fix the page fault. One such example (I'm sure there are more) is in futex_wait_setup() where an attempt to read the futex with page faults disabled. Then a retry (after dropping a lock so page faults are safe): ret = get_futex_value_locked(&uval, uaddr); if (ret) { queue_unlock(*hb); ret = get_user(uval, uaddr); It would be good to avoid deliberately taking a second machine check (especially as the recovery code does really bad things and ends up in an infinite loop!). V2 (thanks to feedback from PeterZ) fixes this by changing get_user() to return -ENXIO ("No such device or address") for the case where a machine check occurred. Peter left it open which error code to use (suggesting "-EMEMERR or whatever name we come up with"). I think the existing ENXIO error code seems appropriate (the address being accessed has effectively gone away). But I don't have a strong attachment if anyone thinks we need a new code. Callers can check for ENXIO in paths where the access would be retried so they can avoid a second machine check. Patch roadmap: Part 1 (unchanged since v1): Add code to avoid the infinite loop in the machine check code. Just panic if code runs into the same machine check a second time. This should make it much easier to debug other places where this happens. Part 2: Change recovery path for get_user() to return -ENXIO Part 3: Fix the one case in futex code that my test case hits (I'm sure there are more). TBD: There are a few places in arch/x86 code that test "ret == -EFAULT" or have "switch (ret) { case -EFAULT: }" that may benefit from an additional check for -ENXIO. For now those will continue to crash (just like every pre-v5.10 kernel crashed when get_user() touched poison). Tony Luck (3): x86/mce: Avoid infinite loop for copy from user recovery x86/mce: Add new return value to get_user() for machine check futex, x86/mce: Avoid double machine checks arch/x86/kernel/cpu/mce/core.c | 7 ++++++- arch/x86/lib/getuser.S | 8 +++++++- arch/x86/mm/extable.c | 1 + include/linux/sched.h | 3 ++- kernel/futex.c | 5 ++++- 5 files changed, 20 insertions(+), 4 deletions(-) -- 2.21.1 ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-11 21:44 ` [PATCH v2 0/3] " Tony Luck @ 2021-01-11 21:44 ` Tony Luck 2021-01-11 22:11 ` Andy Lutomirski 2021-01-14 20:22 ` Borislav Petkov 0 siblings, 2 replies; 45+ messages in thread From: Tony Luck @ 2021-01-11 21:44 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski, linux-kernel, linux-edac, linux-mm Recovery action when get_user() triggers a machine check uses the fixup path to make get_user() return -EFAULT. Also queue_task_work() sets up so that kill_me_maybe() will be called on return to user mode to send a SIGBUS to the current process. But there are places in the kernel where the code assumes that this EFAULT return was simply because of a page fault. The code takes some action to fix that, and then retries the access. This results in a second machine check. While processing this second machine check queue_task_work() is called again. But since this uses the same callback_head structure that was used in the first call, the net result is an entry on the current->task_works list that points to itself. When task_work_run() is called it loops forever in this code: do { next = work->next; work->func(work); work = next; cond_resched(); } while (work); Add a "mce_busy" flag bit to detect this situation and panic when it happens. Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/cpu/mce/core.c | 7 ++++++- include/linux/sched.h | 3 ++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 13d3f1cbda17..1bf11213e093 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1246,6 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb) struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; + p->mce_busy = 0; pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); if (!p->mce_ripv) @@ -1268,6 +1269,7 @@ static void kill_me_maybe(struct callback_head *cb) static void queue_task_work(struct mce *m, int kill_current_task) { + current->mce_busy = 1; current->mce_addr = m->addr; current->mce_kflags = m->kflags; current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs) mce_panic("Failed kernel mode recovery", &m, msg); } - if (m.kflags & MCE_IN_KERNEL_COPYIN) + if (m.kflags & MCE_IN_KERNEL_COPYIN) { + if (current->mce_busy) + mce_panic("Multiple copyin", &m, msg); queue_task_work(&m, kill_current_task); + } } out: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); diff --git a/include/linux/sched.h b/include/linux/sched.h index 6e3a5eeec509..a763a76eac57 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1360,7 +1360,8 @@ struct task_struct { u64 mce_addr; __u64 mce_ripv : 1, mce_whole_page : 1, - __mce_reserved : 62; + mce_busy : 1, + __mce_reserved : 61; struct callback_head mce_kill_me; #endif -- 2.21.1 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-11 21:44 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck @ 2021-01-11 22:11 ` Andy Lutomirski 2021-01-11 22:20 ` Luck, Tony 2021-01-14 20:22 ` Borislav Petkov 1 sibling, 1 reply; 45+ messages in thread From: Andy Lutomirski @ 2021-01-11 22:11 UTC (permalink / raw) To: Tony Luck Cc: Borislav Petkov, x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski, linux-kernel, linux-edac, linux-mm > On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: > > Recovery action when get_user() triggers a machine check uses the fixup > path to make get_user() return -EFAULT. Also queue_task_work() sets up > so that kill_me_maybe() will be called on return to user mode to send a > SIGBUS to the current process. > > But there are places in the kernel where the code assumes that this > EFAULT return was simply because of a page fault. The code takes some > action to fix that, and then retries the access. This results in a second > machine check. > > While processing this second machine check queue_task_work() is called > again. But since this uses the same callback_head structure that > was used in the first call, the net result is an entry on the > current->task_works list that points to itself. Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. Yes, I realize this is messy, but maybe it’s not that messy. Conceptually, we just (famous last words) need to arrange for an MCE with IF=1 to switch off the IST stack and run like a normal exception. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-11 22:11 ` Andy Lutomirski @ 2021-01-11 22:20 ` Luck, Tony 2021-01-12 17:00 ` Andy Lutomirski 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-01-11 22:20 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski, linux-kernel, linux-edac, linux-mm On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: > > > On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: > > > > Recovery action when get_user() triggers a machine check uses the fixup > > path to make get_user() return -EFAULT. Also queue_task_work() sets up > > so that kill_me_maybe() will be called on return to user mode to send a > > SIGBUS to the current process. > > > > But there are places in the kernel where the code assumes that this > > EFAULT return was simply because of a page fault. The code takes some > > action to fix that, and then retries the access. This results in a second > > machine check. > > > > While processing this second machine check queue_task_work() is called > > again. But since this uses the same callback_head structure that > > was used in the first call, the net result is an entry on the > > current->task_works list that points to itself. > > Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. The first machine check is in pagefault_disable() context. static int get_futex_value_locked(u32 *dest, u32 __user *from) { int ret; pagefault_disable(); ret = __get_user(*dest, from); pagefault_enable(); return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0; } -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-11 22:20 ` Luck, Tony @ 2021-01-12 17:00 ` Andy Lutomirski 2021-01-12 17:16 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Andy Lutomirski @ 2021-01-12 17:00 UTC (permalink / raw) To: Luck, Tony Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski, LKML, linux-edac, Linux-MM > On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote: > > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: >> >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: >>> >>> Recovery action when get_user() triggers a machine check uses the fixup >>> path to make get_user() return -EFAULT. Also queue_task_work() sets up >>> so that kill_me_maybe() will be called on return to user mode to send a >>> SIGBUS to the current process. >>> >>> But there are places in the kernel where the code assumes that this >>> EFAULT return was simply because of a page fault. The code takes some >>> action to fix that, and then retries the access. This results in a second >>> machine check. >>> >>> While processing this second machine check queue_task_work() is called >>> again. But since this uses the same callback_head structure that >>> was used in the first call, the net result is an entry on the >>> current->task_works list that points to itself. >> >> Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. > > The first machine check is in pagefault_disable() context. > > static int get_futex_value_locked(u32 *dest, u32 __user *from) > { > int ret; > > pagefault_disable(); > ret = __get_user(*dest, from); I have very mixed feelings as to whether we should even try to recover from the first failure like this. If we actually want to try to recover, perhaps we should instead arrange for the second MCE to recover successfully instead of panicking. --Andy > pagefault_enable(); > > return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0; > } > > -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-12 17:00 ` Andy Lutomirski @ 2021-01-12 17:16 ` Luck, Tony 2021-01-12 17:21 ` Andy Lutomirski 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-01-12 17:16 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Tue, Jan 12, 2021 at 09:00:14AM -0800, Andy Lutomirski wrote: > > On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote: > > > > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: > >> > >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: > >>> > >>> Recovery action when get_user() triggers a machine check uses the fixup > >>> path to make get_user() return -EFAULT. Also queue_task_work() sets up > >>> so that kill_me_maybe() will be called on return to user mode to send a > >>> SIGBUS to the current process. > >>> > >>> But there are places in the kernel where the code assumes that this > >>> EFAULT return was simply because of a page fault. The code takes some > >>> action to fix that, and then retries the access. This results in a second > >>> machine check. > >>> > >>> While processing this second machine check queue_task_work() is called > >>> again. But since this uses the same callback_head structure that > >>> was used in the first call, the net result is an entry on the > >>> current->task_works list that points to itself. > >> > >> Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. > > > > The first machine check is in pagefault_disable() context. > > > > static int get_futex_value_locked(u32 *dest, u32 __user *from) > > { > > int ret; > > > > pagefault_disable(); > > ret = __get_user(*dest, from); > > I have very mixed feelings as to whether we should even try to recover > from the first failure like this. If we actually want to try to > recover, perhaps we should instead arrange for the second MCE to > recover successfully instead of panicking. Well we obviously have to "recover" from the first machine check in order to get to the second. Are you saying you don't like the different return value from get_user()? In my initial playing around with this I just had the second machine check simply skip the task_work_add(). This worked for this case, but only because there wasn't a third, fourth, etc. access to the poisoned data. If the caller keeps peeking, then we'll keep taking more machine checks - possibly forever. Even if we do recover with just one extra machine check ... that's one more than was necessary. -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-12 17:16 ` Luck, Tony @ 2021-01-12 17:21 ` Andy Lutomirski 2021-01-12 18:23 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Andy Lutomirski @ 2021-01-12 17:21 UTC (permalink / raw) To: Luck, Tony Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Tue, Jan 12, 2021 at 9:16 AM Luck, Tony <tony.luck@intel.com> wrote: > > On Tue, Jan 12, 2021 at 09:00:14AM -0800, Andy Lutomirski wrote: > > > On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote: > > > > > > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: > > >> > > >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: > > >>> > > >>> Recovery action when get_user() triggers a machine check uses the fixup > > >>> path to make get_user() return -EFAULT. Also queue_task_work() sets up > > >>> so that kill_me_maybe() will be called on return to user mode to send a > > >>> SIGBUS to the current process. > > >>> > > >>> But there are places in the kernel where the code assumes that this > > >>> EFAULT return was simply because of a page fault. The code takes some > > >>> action to fix that, and then retries the access. This results in a second > > >>> machine check. > > >>> > > >>> While processing this second machine check queue_task_work() is called > > >>> again. But since this uses the same callback_head structure that > > >>> was used in the first call, the net result is an entry on the > > >>> current->task_works list that points to itself. > > >> > > >> Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. > > > > > > The first machine check is in pagefault_disable() context. > > > > > > static int get_futex_value_locked(u32 *dest, u32 __user *from) > > > { > > > int ret; > > > > > > pagefault_disable(); > > > ret = __get_user(*dest, from); > > > > I have very mixed feelings as to whether we should even try to recover > > from the first failure like this. If we actually want to try to > > recover, perhaps we should instead arrange for the second MCE to > > recover successfully instead of panicking. > > Well we obviously have to "recover" from the first machine check > in order to get to the second. Are you saying you don't like the > different return value from get_user()? > > In my initial playing around with this I just had the second machine > check simply skip the task_work_add(). This worked for this case, but > only because there wasn't a third, fourth, etc. access to the poisoned > data. If the caller keeps peeking, then we'll keep taking more machine > checks - possibly forever. > > Even if we do recover with just one extra machine check ... that's one > more than was necessary. Well, we need to do *something* when the first __get_user() trips the #MC. It would be nice if we could actually fix up the page tables inside the #MC handler, but, if we're in a pagefault_disable() context we might have locks held. Heck, we could have the pagetable lock held, be inside NMI, etc. Skipping the task_work_add() might actually make sense if we get a second one. We won't actually infinite loop in pagefault_disable() context -- if we would, then we would also infinite loop just from a regular page fault, too. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-12 17:21 ` Andy Lutomirski @ 2021-01-12 18:23 ` Luck, Tony 2021-01-12 18:57 ` Andy Lutomirski 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-01-12 18:23 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: > Well, we need to do *something* when the first __get_user() trips the > #MC. It would be nice if we could actually fix up the page tables > inside the #MC handler, but, if we're in a pagefault_disable() context > we might have locks held. Heck, we could have the pagetable lock > held, be inside NMI, etc. Skipping the task_work_add() might actually > make sense if we get a second one. > > We won't actually infinite loop in pagefault_disable() context -- if > we would, then we would also infinite loop just from a regular page > fault, too. Fixing the page tables inside the #MC handler to unmap the poison page would indeed be a good solution. But, as you point out, not possible because of locks. Could we take a more drastic approach? We know that this case the kernel is accessing a user address for the current process. Could the machine check handler just re-write %cr3 to point to a kernel-only page table[1]. I.e. unmap the entire current user process. Then any subsequent access to user space will page fault. Maybe have a flag in the task structure to help the #PF handler understand what just happened. The code we execute in the task_work handler can restore %cr3 -Tony [1] Does such a thing already exist? If not, I'd need some help/pointers to create it. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-12 18:23 ` Luck, Tony @ 2021-01-12 18:57 ` Andy Lutomirski 2021-01-12 20:52 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Andy Lutomirski @ 2021-01-12 18:57 UTC (permalink / raw) To: Luck, Tony Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote: > > On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: > > Well, we need to do *something* when the first __get_user() trips the > > #MC. It would be nice if we could actually fix up the page tables > > inside the #MC handler, but, if we're in a pagefault_disable() context > > we might have locks held. Heck, we could have the pagetable lock > > held, be inside NMI, etc. Skipping the task_work_add() might actually > > make sense if we get a second one. > > > > We won't actually infinite loop in pagefault_disable() context -- if > > we would, then we would also infinite loop just from a regular page > > fault, too. > > Fixing the page tables inside the #MC handler to unmap the poison > page would indeed be a good solution. But, as you point out, not possible > because of locks. > > Could we take a more drastic approach? We know that this case the kernel > is accessing a user address for the current process. Could the machine > check handler just re-write %cr3 to point to a kernel-only page table[1]. > I.e. unmap the entire current user process. That seems scary, especially if we're in the middle of a context switch when this happens. We *could* make it work, but I'm not at all convinced it's wise. > > Then any subsequent access to user space will page fault. Maybe have a > flag in the task structure to help the #PF handler understand what just > happened. > > The code we execute in the task_work handler can restore %cr3 This would need to be integrated with something much more local IMO. Maybe it could be scoped to pagefault_disable()/pagefault_enable()? --Andy ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-12 18:57 ` Andy Lutomirski @ 2021-01-12 20:52 ` Luck, Tony 2021-01-12 22:04 ` Andy Lutomirski 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-01-12 20:52 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote: > On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: > > > Well, we need to do *something* when the first __get_user() trips the > > > #MC. It would be nice if we could actually fix up the page tables > > > inside the #MC handler, but, if we're in a pagefault_disable() context > > > we might have locks held. Heck, we could have the pagetable lock > > > held, be inside NMI, etc. Skipping the task_work_add() might actually > > > make sense if we get a second one. > > > > > > We won't actually infinite loop in pagefault_disable() context -- if > > > we would, then we would also infinite loop just from a regular page > > > fault, too. > > > > Fixing the page tables inside the #MC handler to unmap the poison > > page would indeed be a good solution. But, as you point out, not possible > > because of locks. > > > > Could we take a more drastic approach? We know that this case the kernel > > is accessing a user address for the current process. Could the machine > > check handler just re-write %cr3 to point to a kernel-only page table[1]. > > I.e. unmap the entire current user process. > > That seems scary, especially if we're in the middle of a context > switch when this happens. We *could* make it work, but I'm not at all > convinced it's wise. Scary? It's terrifying! But we know that the fault happend in a get_user() or copy_from_user() call (i.e. an RIP with an extable recovery address). Does context switch access user memory? -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-12 20:52 ` Luck, Tony @ 2021-01-12 22:04 ` Andy Lutomirski 2021-01-13 1:50 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Andy Lutomirski @ 2021-01-12 22:04 UTC (permalink / raw) To: Luck, Tony Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM > On Jan 12, 2021, at 12:52 PM, Luck, Tony <tony.luck@intel.com> wrote: > > On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote: >>> On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote: >>> >>> On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: >>>> Well, we need to do *something* when the first __get_user() trips the >>>> #MC. It would be nice if we could actually fix up the page tables >>>> inside the #MC handler, but, if we're in a pagefault_disable() context >>>> we might have locks held. Heck, we could have the pagetable lock >>>> held, be inside NMI, etc. Skipping the task_work_add() might actually >>>> make sense if we get a second one. >>>> >>>> We won't actually infinite loop in pagefault_disable() context -- if >>>> we would, then we would also infinite loop just from a regular page >>>> fault, too. >>> >>> Fixing the page tables inside the #MC handler to unmap the poison >>> page would indeed be a good solution. But, as you point out, not possible >>> because of locks. >>> >>> Could we take a more drastic approach? We know that this case the kernel >>> is accessing a user address for the current process. Could the machine >>> check handler just re-write %cr3 to point to a kernel-only page table[1]. >>> I.e. unmap the entire current user process. >> >> That seems scary, especially if we're in the middle of a context >> switch when this happens. We *could* make it work, but I'm not at all >> convinced it's wise. > > Scary? It's terrifying! > > But we know that the fault happend in a get_user() or copy_from_user() call > (i.e. an RIP with an extable recovery address). Does context switch > access user memory? No, but NMI can. The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI. What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it. Can we do the cleanup from an interrupt? IPI-to-self might be a credible approach, if so. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-12 22:04 ` Andy Lutomirski @ 2021-01-13 1:50 ` Luck, Tony 2021-01-13 4:15 ` Andy Lutomirski 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-01-13 1:50 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Tue, Jan 12, 2021 at 02:04:55PM -0800, Andy Lutomirski wrote: > > But we know that the fault happend in a get_user() or copy_from_user() call > > (i.e. an RIP with an extable recovery address). Does context switch > > access user memory? > > No, but NMI can. > > The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI. > > What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it. > > Can we do the cleanup from an interrupt? IPI-to-self might be a credible approach, if so. You seem to be looking for a solution that is entirely contained within the machine check handling code. Willing to allow for repeated machine checks from the same poison address in order to achieve that. I'm opposed to mutliple machine checks. Willing to make some changes in core code to avoid repeated access to the same poison location. We need a tie-breaker. -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-13 1:50 ` Luck, Tony @ 2021-01-13 4:15 ` Andy Lutomirski 2021-01-13 10:00 ` Borislav Petkov 0 siblings, 1 reply; 45+ messages in thread From: Andy Lutomirski @ 2021-01-13 4:15 UTC (permalink / raw) To: Luck, Tony Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM > On Jan 12, 2021, at 5:50 PM, Luck, Tony <tony.luck@intel.com> wrote: > > On Tue, Jan 12, 2021 at 02:04:55PM -0800, Andy Lutomirski wrote: >>> But we know that the fault happend in a get_user() or copy_from_user() call >>> (i.e. an RIP with an extable recovery address). Does context switch >>> access user memory? >> >> No, but NMI can. >> >> The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI. >> >> What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it. >> >> Can we do the cleanup from an interrupt? IPI-to-self might be a credible approach, if so. > > You seem to be looking for a solution that is entirely contained within > the machine check handling code. Willing to allow for repeated machine > checks from the same poison address in order to achieve that. > > I'm opposed to mutliple machine checks. Willing to make some changes > in core code to avoid repeated access to the same poison location. How about more questions before the showdown? If we made core code changes to avoid this, what would they be? We really can do user access from NMI and #DB, and those can happen in horrible places. We could have the pagetable lock held, be in the middle of CoWing the very page we tripped over, etc. I think we really can’t count on being able to write to the PTEs from #MC. Similarly, we might have IRQs off, so we can’t broadcast a TLB flush. And we might be in the middle of entry, exit, or CR3 switches, and I don’t see a particularly clean way to write CR3 without risking accidentally returning to user mode with the wrong CR3. So I’m sort of at a loss as to what we can do. All user memory accessors can handle failure, and they will all avoid infinite looping. If we can tolerate repeated MCE, we can survive. But there’s not a whole lot we can do from these horrible contexts. Hmm. Maybe if we have SMAP we could play with EFLAGS.AC? I can imagine this having various regrettable side effects, though. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-13 4:15 ` Andy Lutomirski @ 2021-01-13 10:00 ` Borislav Petkov 2021-01-13 16:06 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Borislav Petkov @ 2021-01-13 10:00 UTC (permalink / raw) To: Andy Lutomirski Cc: Luck, Tony, Andy Lutomirski, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Tue, Jan 12, 2021 at 08:15:46PM -0800, Andy Lutomirski wrote: > So I’m sort of at a loss as to what we can do. I think you guys have veered off into the weeds with this. The current solution - modulo error messages not destined for humans :) - is not soo bad, considering the whole MCA trainwreck. Or am I missing something completely unacceptable? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-13 10:00 ` Borislav Petkov @ 2021-01-13 16:06 ` Luck, Tony 2021-01-13 16:19 ` Borislav Petkov 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-01-13 16:06 UTC (permalink / raw) To: Borislav Petkov, Andy Lutomirski Cc: Andy Lutomirski, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM > I think you guys have veered off into the weeds with this. The current > solution - modulo error messages not destined for humans :) - is not soo > bad, considering the whole MCA trainwreck. > > Or am I missing something completely unacceptable? Maybe the other difference in approach between Andy and me is whether to go for a solution that covers all the corner cases, or just make an incremental improvement that allows for recover in some useful subset of remaining fatal cases, but still dies in other cases. I'm happy to replace error messages with ones that are more descriptive and helpful to humans. -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-13 16:06 ` Luck, Tony @ 2021-01-13 16:19 ` Borislav Petkov 2021-01-13 16:32 ` Luck, Tony 0 siblings, 1 reply; 45+ messages in thread From: Borislav Petkov @ 2021-01-13 16:19 UTC (permalink / raw) To: Luck, Tony Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Wed, Jan 13, 2021 at 04:06:09PM +0000, Luck, Tony wrote: > Maybe the other difference in approach between Andy and me is whether to > go for a solution that covers all the corner cases, or just make an incremental > improvement that allows for recover in some useful subset of remaining fatal > cases, but still dies in other cases. Does that mean more core code surgery? > I'm happy to replace error messages with ones that are more descriptive and > helpful to humans. Yap, that: "Multiple copyin" with something more understandable to users... Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-13 16:19 ` Borislav Petkov @ 2021-01-13 16:32 ` Luck, Tony 2021-01-13 17:35 ` Borislav Petkov 0 siblings, 1 reply; 45+ messages in thread From: Luck, Tony @ 2021-01-13 16:32 UTC (permalink / raw) To: Borislav Petkov Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM >> Maybe the other difference in approach between Andy and me is whether to >> go for a solution that covers all the corner cases, or just make an incremental >> improvement that allows for recover in some useful subset of remaining fatal >> cases, but still dies in other cases. > > Does that mean more core code surgery? Yes. I need to look at other user access inside pagefault_disable/enable() as likely spots where the code may continue after a machine check and retry the access. So expect some more "if (ret == ENXIO) { do something to give up gracefully }" >> I'm happy to replace error messages with ones that are more descriptive and >> helpful to humans. > > Yap, that: "Multiple copyin" with something more understandable to users... I'll work on it. We tend not to have essay length messages as panic() strings. But I can add a comment in the code there so that people who grep whatever panic message we choose can get more details on what happened and what to do. -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-13 16:32 ` Luck, Tony @ 2021-01-13 17:35 ` Borislav Petkov 0 siblings, 0 replies; 45+ messages in thread From: Borislav Petkov @ 2021-01-13 17:35 UTC (permalink / raw) To: Luck, Tony Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Andrew Morton, Peter Zijlstra, Darren Hart, LKML, linux-edac, Linux-MM On Wed, Jan 13, 2021 at 04:32:54PM +0000, Luck, Tony wrote: > I'll work on it. We tend not to have essay length messages as panic() strings. But I can > add a comment in the code there so that people who grep whatever panic message > we choose can get more details on what happened and what to do. I did not mean to write an essay but to simply make it more palatable for "normal" users. Something like: "PANIC: Second machine check exception while accessing user data." or along those lines to explain *why* the machine panics. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-11 21:44 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck 2021-01-11 22:11 ` Andy Lutomirski @ 2021-01-14 20:22 ` Borislav Petkov 2021-01-14 21:05 ` Luck, Tony 1 sibling, 1 reply; 45+ messages in thread From: Borislav Petkov @ 2021-01-14 20:22 UTC (permalink / raw) To: Tony Luck Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski, linux-kernel, linux-edac, linux-mm On Mon, Jan 11, 2021 at 01:44:50PM -0800, Tony Luck wrote: > @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs) > mce_panic("Failed kernel mode recovery", &m, msg); > } > > - if (m.kflags & MCE_IN_KERNEL_COPYIN) > + if (m.kflags & MCE_IN_KERNEL_COPYIN) { > + if (current->mce_busy) > + mce_panic("Multiple copyin", &m, msg); So this: we're currently busy handling the first MCE, why do we must panic? Can we simply ignore all follow-up MCEs to that page? I.e., the page will get poisoned eventually and that poisoning is currently executing so all following MCEs are simply nothing new and we can ignore them. It's not like we're going to corrupt more data - we already are "corrupting" whole 4K. Am I making sense? Because if we do this, we won't have to pay attention to any get_user() callers and whatnot - we simply ignore and the solution is simple and you won't have to touch any get_user() callers... Hmmm? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery 2021-01-14 20:22 ` Borislav Petkov @ 2021-01-14 21:05 ` Luck, Tony 0 siblings, 0 replies; 45+ messages in thread From: Luck, Tony @ 2021-01-14 21:05 UTC (permalink / raw) To: Borislav Petkov Cc: x86, Andrew Morton, Peter Zijlstra, Darren Hart, Andy Lutomirski, linux-kernel, linux-edac, linux-mm On Thu, Jan 14, 2021 at 09:22:13PM +0100, Borislav Petkov wrote: > On Mon, Jan 11, 2021 at 01:44:50PM -0800, Tony Luck wrote: > > @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs) > > mce_panic("Failed kernel mode recovery", &m, msg); > > } > > > > - if (m.kflags & MCE_IN_KERNEL_COPYIN) > > + if (m.kflags & MCE_IN_KERNEL_COPYIN) { > > + if (current->mce_busy) > > + mce_panic("Multiple copyin", &m, msg); > > So this: we're currently busy handling the first MCE, why do we must > panic? > > Can we simply ignore all follow-up MCEs to that page? If we s/all/some/ you are saying the same as Andy: > So I tend to think that the machine check code should arrange to > survive some reasonable number of duplicate machine checks. > I.e., the page will get poisoned eventually and that poisoning is > currently executing so all following MCEs are simply nothing new and we > can ignore them. > > It's not like we're going to corrupt more data - we already are > "corrupting" whole 4K. > > Am I making sense? > > Because if we do this, we won't have to pay attention to any get_user() > callers and whatnot - we simply ignore and the solution is simple and > you won't have to touch any get_user() callers... Changing get_user() is a can of worms. I don't think its a very big can. Perhaps two or three dozen places where code needs to change to account for the -ENXIO return ... but touching a bunch of different subsystems it is likley to take a while to get everyone in agreement. I'll try out this new approach, and if it works, I'll post a v3 patch. Thanks -Tony ^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2021-09-20 16:43 UTC | newest] Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-07-06 19:06 [PATCH 0/3] More machine check recovery fixes Tony Luck 2021-07-06 19:06 ` [PATCH 1/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck 2021-07-06 19:06 ` [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck 2021-07-06 19:06 ` [PATCH 3/3] x86/mce: Drop copyin special case for #MC Tony Luck 2021-08-18 0:29 ` [PATCH v2 0/3] More machine check recovery fixes Tony Luck 2021-08-18 0:29 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck 2021-08-20 17:31 ` Borislav Petkov 2021-08-20 18:59 ` Luck, Tony 2021-08-20 19:27 ` Borislav Petkov 2021-08-20 20:23 ` Luck, Tony 2021-08-21 4:51 ` Tony Luck 2021-08-21 21:51 ` Al Viro 2021-08-22 14:36 ` Borislav Petkov 2021-08-20 20:33 ` Luck, Tony 2021-08-22 14:46 ` Borislav Petkov 2021-08-23 15:24 ` Luck, Tony 2021-09-13 9:24 ` Borislav Petkov 2021-09-13 21:52 ` [PATCH v3] " Luck, Tony 2021-09-14 8:28 ` Borislav Petkov 2021-08-18 0:29 ` [PATCH v2 2/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck 2021-08-18 0:29 ` [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC Tony Luck 2021-09-20 9:13 ` Borislav Petkov 2021-09-20 16:18 ` Luck, Tony 2021-09-20 16:37 ` Borislav Petkov 2021-09-20 16:43 ` Luck, Tony 2021-08-18 16:14 ` [PATCH v2 0/3] More machine check recovery fixes Luck, Tony -- strict thread matches above, loose matches on Subject: below -- 2021-01-08 22:22 [PATCH 0/2] Fix infinite machine check loop in futex_wait_setup() Tony Luck 2021-01-11 21:44 ` [PATCH v2 0/3] " Tony Luck 2021-01-11 21:44 ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck 2021-01-11 22:11 ` Andy Lutomirski 2021-01-11 22:20 ` Luck, Tony 2021-01-12 17:00 ` Andy Lutomirski 2021-01-12 17:16 ` Luck, Tony 2021-01-12 17:21 ` Andy Lutomirski 2021-01-12 18:23 ` Luck, Tony 2021-01-12 18:57 ` Andy Lutomirski 2021-01-12 20:52 ` Luck, Tony 2021-01-12 22:04 ` Andy Lutomirski 2021-01-13 1:50 ` Luck, Tony 2021-01-13 4:15 ` Andy Lutomirski 2021-01-13 10:00 ` Borislav Petkov 2021-01-13 16:06 ` Luck, Tony 2021-01-13 16:19 ` Borislav Petkov 2021-01-13 16:32 ` Luck, Tony 2021-01-13 17:35 ` Borislav Petkov 2021-01-14 20:22 ` Borislav Petkov 2021-01-14 21:05 ` Luck, Tony
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).