* [PATCH] x86/nmi: Fix some races in NMI uaccess
From: Andy Lutomirski @ 2018-08-27 23:04 UTC
To: x86
Cc: Borislav Petkov, Rik van Riel, Jann Horn, LKML, Andy Lutomirski,
    stable, Peter Zijlstra, Nadav Amit

In NMI context, we might be in the middle of context switching or in
the middle of switch_mm_irqs_off().  In either case, CR3 might not
match current->mm, which could cause copy_from_user_nmi() and
friends to read the wrong memory.

Fix it by adding a new nmi_uaccess_okay() helper and checking it in
copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.

Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

The 0day bot is still chewing on this, but I've tested it a bit
locally and it seems to do the right thing.  I've never observed the
bug it fixes, but it does appear to fix a bug unless I've missed
something.  It's also a prerequisite for Nadav's fixmap bugfix.
 arch/x86/events/core.c          |  2 +-
 arch/x86/include/asm/tlbflush.h | 16 ++++++++++++++++
 arch/x86/lib/usercopy.c         |  5 +++++
 arch/x86/mm/tlb.c               |  3 +++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5f4829f10129..dfb2f7c0d019 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2465,7 +2465,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs

 	perf_callchain_store(entry, regs->ip);

-	if (!current->mm)
+	if (!nmi_uaccess_okay())
 		return;

 	if (perf_callchain_user32(regs, entry))
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 89a73bc31622..b23b2625793b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -230,6 +230,22 @@ struct tlb_state {
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

+/*
+ * Blindly accessing user memory from NMI context can be dangerous
+ * if we're in the middle of switching the current user task or
+ * switching the loaded mm.  It can also be dangerous if we
+ * interrupted some kernel code that was temporarily using a
+ * different mm.
+ */
+static inline bool nmi_uaccess_okay(void)
+{
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+	struct mm_struct *current_mm = current->mm;
+
+	return current_mm && loaded_mm == current_mm &&
+		loaded_mm->pgd == __va(read_cr3_pa());
+}
+
 /* Initialize cr4 shadow for this CPU. */
 static inline void cr4_init_shadow(void)
 {
diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index c8c6ad0d58b8..3f435d7fca5e 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -7,6 +7,8 @@
 #include <linux/uaccess.h>
 #include <linux/export.h>

+#include <asm/tlbflush.h>
+
 /*
  * We rely on the nested NMI work to allow atomic faults from the NMI path; the
  * nested NMI paths are careful to preserve CR2.
@@ -19,6 +21,9 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
 	if (__range_not_ok(from, n, TASK_SIZE))
 		return n;

+	if (!nmi_uaccess_okay())
+		return n;
+
 	/*
 	 * Even though this function is typically called from NMI/IRQ context
 	 * disable pagefaults so that its behaviour is consistent even when
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 457b281b9339..f4b41d5a93dd 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -345,6 +345,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 */
		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
	} else {
+		/* Let NMI code know that CR3 may not match expectations. */
+		this_cpu_write(cpu_tlbstate.loaded_mm, NULL);
+
		/* The new ASID is already up to date. */
		load_new_mm_cr3(next->pgd, new_asid, false);

-- 
2.17.1
* Re: [PATCH] x86/nmi: Fix some races in NMI uaccess
From: Jann Horn @ 2018-08-27 23:12 UTC
To: Andy Lutomirski, Alexei Starovoitov, Daniel Borkmann
Cc: the arch/x86 maintainers, Borislav Petkov, Rik van Riel, kernel list,
    stable, Peter Zijlstra, Nadav Amit

On Tue, Aug 28, 2018 at 1:04 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> In NMI context, we might be in the middle of context switching or in
> the middle of switch_mm_irqs_off().  In either case, CR3 might not
> match current->mm, which could cause copy_from_user_nmi() and
> friends to read the wrong memory.
>
> Fix it by adding a new nmi_uaccess_okay() helper and checking it in
> copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.

What about eBPF probes (which I think can be attached to kprobe points
/ tracepoints / perf events) that perform userspace reads / userspace
writes / kernel reads?  Can those run in NMI context, and if so, do
they also need special handling?
* Re: [PATCH] x86/nmi: Fix some races in NMI uaccess
From: Andy Lutomirski @ 2018-08-27 23:25 UTC
To: Jann Horn
Cc: Andy Lutomirski, Alexei Starovoitov, Daniel Borkmann,
    the arch/x86 maintainers, Borislav Petkov, Rik van Riel, kernel list,
    stable, Peter Zijlstra, Nadav Amit

On Mon, Aug 27, 2018 at 4:12 PM, Jann Horn <jannh@google.com> wrote:
> On Tue, Aug 28, 2018 at 1:04 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> In NMI context, we might be in the middle of context switching or in
>> the middle of switch_mm_irqs_off().  In either case, CR3 might not
>> match current->mm, which could cause copy_from_user_nmi() and
>> friends to read the wrong memory.
>>
>> Fix it by adding a new nmi_uaccess_okay() helper and checking it in
>> copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.
>
> What about eBPF probes (which I think can be attached to kprobe points
> / tracepoints / perf events) that perform userspace reads / userspace
> writes / kernel reads?  Can those run in NMI context, and if so, do
> they also need special handling?

I assume they can run in NMI context, which might be problematic in
and of itself.  For example, does BPF adequately protect against a
BPF program accessing a map while bpf(2) is modifying it?  It seems
like bpf_prog_active is intended to serve this purpose.

But I don't see any obvious mechanism for eBPF programs to read user
memory.
* Re: [PATCH] x86/nmi: Fix some races in NMI uaccess
From: Jann Horn @ 2018-08-27 23:34 UTC
To: Andy Lutomirski
Cc: Alexei Starovoitov, Daniel Borkmann, the arch/x86 maintainers,
    Borislav Petkov, Rik van Riel, kernel list, stable, Peter Zijlstra,
    Nadav Amit

On Tue, Aug 28, 2018 at 1:26 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Mon, Aug 27, 2018 at 4:12 PM, Jann Horn <jannh@google.com> wrote:
> > On Tue, Aug 28, 2018 at 1:04 AM Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >> In NMI context, we might be in the middle of context switching or in
> >> the middle of switch_mm_irqs_off().  In either case, CR3 might not
> >> match current->mm, which could cause copy_from_user_nmi() and
> >> friends to read the wrong memory.
> >>
> >> Fix it by adding a new nmi_uaccess_okay() helper and checking it in
> >> copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.
> >
> > What about eBPF probes (which I think can be attached to kprobe points
> > / tracepoints / perf events) that perform userspace reads / userspace
> > writes / kernel reads?  Can those run in NMI context, and if so, do
> > they also need special handling?
>
> I assume they can run in NMI context, which might be problematic in
> and of itself.  For example, does BPF adequately protect against a
> BPF program accessing a map while bpf(2) is modifying it?  It seems
> like bpf_prog_active is intended to serve this purpose.
>
> But I don't see any obvious mechanism for eBPF programs to read user
> memory.

Look in kernel/trace/bpf_trace.c, which defines a bunch of eBPF
helpers that can only be called from privileged eBPF code.  Ah, but I
misremembered: the userspace write helper does have a guard against
interrupts; just the arbitrary read helper doesn't.
BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
{
	int ret;

	ret = probe_kernel_read(dst, unsafe_ptr, size);
	if (unlikely(ret < 0))
		memset(dst, 0, size);

	return ret;
}

[...]

BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
	   u32, size)
{
	/*
	 * Ensure we're in user context which is safe for the helper to
	 * run. This helper has no business in a kthread.
	 *
	 * access_ok() should prevent writing to non-user memory, but in
	 * some situations (nommu, temporary switch, etc) access_ok() does
	 * not provide enough validation, hence the check on KERNEL_DS.
	 */
	if (unlikely(in_interrupt() ||
		     current->flags & (PF_KTHREAD | PF_EXITING)))
		return -EPERM;
	if (unlikely(uaccess_kernel()))
		return -EPERM;
	if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
		return -EPERM;

	return probe_kernel_write(unsafe_ptr, src, size);
}
* Re: [PATCH] x86/nmi: Fix some races in NMI uaccess
From: Rik van Riel @ 2018-08-28 01:31 UTC
To: Andy Lutomirski, x86
Cc: Borislav Petkov, Jann Horn, LKML, stable, Peter Zijlstra, Nadav Amit

On Mon, 2018-08-27 at 16:04 -0700, Andy Lutomirski wrote:

> +++ b/arch/x86/mm/tlb.c
> @@ -345,6 +345,9 @@ void switch_mm_irqs_off(struct mm_struct *prev,
> struct mm_struct *next,
>  		 */
>  		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
> TLB_FLUSH_ALL);
>  	} else {
> +		/* Let NMI code know that CR3 may not match
> expectations. */

I don't get it.  This is in the "ASID is up to date, do not
need a TLB flush" path.

In what case do we have a TLB that is fully up to date, but
a CR3 that does not match expectations?

Doesn't the CR3 check in nmi_uaccess_okay() already catch the
window of time where CR3 has already been switched over
to that of the next task?

What is special about this path wrt nmi_uaccess_okay() that is
not also true for the need_flush branch right above it?

What am I missing?

> +		this_cpu_write(cpu_tlbstate.loaded_mm, NULL);
> +
>  		/* The new ASID is already up to date. */
>  		load_new_mm_cr3(next->pgd, new_asid, false);

-- 
All Rights Reversed.
* Re: [PATCH] x86/nmi: Fix some races in NMI uaccess
From: Andy Lutomirski @ 2018-08-28 02:10 UTC
To: Rik van Riel
Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Jann Horn, LKML, stable,
    Peter Zijlstra, Nadav Amit

On Mon, Aug 27, 2018 at 6:31 PM, Rik van Riel <riel@surriel.com> wrote:
> On Mon, 2018-08-27 at 16:04 -0700, Andy Lutomirski wrote:
>
>> +++ b/arch/x86/mm/tlb.c
>> @@ -345,6 +345,9 @@ void switch_mm_irqs_off(struct mm_struct *prev,
>> struct mm_struct *next,
>>  		 */
>>  		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
>> TLB_FLUSH_ALL);
>>  	} else {
>> +		/* Let NMI code know that CR3 may not match
>> expectations. */
>
> I don't get it.  This is in the "ASID is up to date, do not
> need a TLB flush" path.
>
> In what case do we have a TLB that is fully up to date, but
> a CR3 that does not match expectations?
>
> Doesn't the CR3 check in nmi_uaccess_okay() already catch the
> window of time where CR3 has already been switched over
> to that of the next task?
>
> What is special about this path wrt nmi_uaccess_okay() that is
> not also true for the need_flush branch right above it?
>
> What am I missing?

Nothing.  My patch is buggy.  ETOLITTLESLEEP.

I could drop this part of the patch entirely.  Or I could drop the
loaded_mm->pgd == __va(read_cr3_pa()) check and instead make sure that
loaded_mm is NULL at any point at which loaded_mm might not match CR3.
The latter will be faster in any (hypothetical) virtualization
environment where CR3 reads trap.  I don't know if we have any such
cases where perf works and we care about performance, though.

--Andy
* Re: [PATCH] x86/nmi: Fix some races in NMI uaccess
From: Rik van Riel @ 2018-08-28 13:50 UTC
To: Andy Lutomirski
Cc: X86 ML, Borislav Petkov, Jann Horn, LKML, stable, Peter Zijlstra,
    Nadav Amit

On Mon, 2018-08-27 at 19:10 -0700, Andy Lutomirski wrote:
> On Mon, Aug 27, 2018 at 6:31 PM, Rik van Riel <riel@surriel.com>
> wrote:
>
> > What is special about this path wrt nmi_uaccess_okay() that is
> > not also true for the need_flush branch right above it?
> >
> > What am I missing?
>
> Nothing.  My patch is buggy.  ETOLITTLESLEEP.
>
> I could drop this part of the patch entirely.  Or I could drop the
> loaded_mm->pgd == __va(read_cr3_pa()) check and instead make sure that
> loaded_mm is NULL at any point at which loaded_mm might not match CR3.
> The latter will be faster in any (hypothetical) virtualization
> environment where CR3 reads trap.  I don't know if we have any such
> cases where perf works and we care about performance, though.

Moving that loaded_mm = NULL assignment up a few lines, so it comes
before the "if (need_flush)" test and covers both branches, should
cover that, indeed.

-- 
All Rights Reversed.
* [PATCH v2] x86/nmi: Fix some races in NMI uaccess
From: Rik van Riel @ 2018-08-28 17:56 UTC
To: Andy Lutomirski
Cc: x86, Borislav Petkov, Jann Horn, LKML, stable, Peter Zijlstra,
    Nadav Amit

On Mon, 27 Aug 2018 16:04:16 -0700
Andy Lutomirski <luto@kernel.org> wrote:

> The 0day bot is still chewing on this, but I've tested it a bit locally
> and it seems to do the right thing.

Hi Andy,

the version of the patch below should fix the bug we talked about
in email yesterday.  It should automatically cover kernel threads
in lazy TLB mode, because current->mm will be NULL, while
cpu_tlbstate.loaded_mm should never be NULL.

---8<---
From: Andy Lutomirski <luto@kernel.org>
Subject: x86/nmi: Fix some races in NMI uaccess

In NMI context, we might be in the middle of context switching or in
the middle of switch_mm_irqs_off().  In either case, CR3 might not
match current->mm, which could cause copy_from_user_nmi() and
friends to read the wrong memory.

Fix it by adding a new nmi_uaccess_okay() helper and checking it in
copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.
Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/events/core.c          |  2 +-
 arch/x86/include/asm/tlbflush.h | 15 +++++++++++++++
 arch/x86/lib/usercopy.c         |  5 +++++
 arch/x86/mm/tlb.c               |  9 ++++++++-
 4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5f4829f10129..dfb2f7c0d019 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2465,7 +2465,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs

 	perf_callchain_store(entry, regs->ip);

-	if (!current->mm)
+	if (!nmi_uaccess_okay())
 		return;

 	if (perf_callchain_user32(regs, entry))
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 29c9da6c62fc..dafe649b18e1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -246,6 +246,21 @@ struct tlb_state {
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

+/*
+ * Blindly accessing user memory from NMI context can be dangerous
+ * if we're in the middle of switching the current user task or
+ * switching the loaded mm.  It can also be dangerous if we
+ * interrupted some kernel code that was temporarily using a
+ * different mm.
+ */
+static inline bool nmi_uaccess_okay(void)
+{
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+	struct mm_struct *current_mm = current->mm;
+
+	return (loaded_mm == current_mm);
+}
+
 /* Initialize cr4 shadow for this CPU. */
 static inline void cr4_init_shadow(void)
 {
diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index c8c6ad0d58b8..3f435d7fca5e 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -7,6 +7,8 @@
 #include <linux/uaccess.h>
 #include <linux/export.h>

+#include <asm/tlbflush.h>
+
 /*
  * We rely on the nested NMI work to allow atomic faults from the NMI path; the
  * nested NMI paths are careful to preserve CR2.
@@ -19,6 +21,9 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
 	if (__range_not_ok(from, n, TASK_SIZE))
 		return n;

+	if (!nmi_uaccess_okay())
+		return n;
+
 	/*
 	 * Even though this function is typically called from NMI/IRQ context
 	 * disable pagefaults so that its behaviour is consistent even when
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9517d1b2a281..5b75e2fed2b6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -305,6 +305,14 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

 	choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

+	/*
+	 * Ensure cpu_tlbstate.loaded_mm differs from current->mm
+	 * until the context switch is complete, so NMI handlers
+	 * do not try to access userspace. See nmi_uaccess_okay.
+	 */
+	this_cpu_write(cpu_tlbstate.loaded_mm, next);
+	smp_wmb();
+
 	if (need_flush) {
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
@@ -335,7 +343,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	if (next != &init_mm)
 		this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);

-	this_cpu_write(cpu_tlbstate.loaded_mm, next);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
 }
* Re: [PATCH v2] x86/nmi: Fix some races in NMI uaccess
From: Andy Lutomirski @ 2018-08-29 03:46 UTC
To: Rik van Riel
Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Jann Horn, LKML, stable,
    Peter Zijlstra, Nadav Amit

On Tue, Aug 28, 2018 at 10:56 AM, Rik van Riel <riel@surriel.com> wrote:
> On Mon, 27 Aug 2018 16:04:16 -0700
> Andy Lutomirski <luto@kernel.org> wrote:
>
>> The 0day bot is still chewing on this, but I've tested it a bit locally
>> and it seems to do the right thing.
>
> Hi Andy,
>
> the version of the patch below should fix the bug we talked about
> in email yesterday.  It should automatically cover kernel threads
> in lazy TLB mode, because current->mm will be NULL, while
> cpu_tlbstate.loaded_mm should never be NULL.

That's better than mine.  I tweaked it a bit and added some debugging,
and I got this:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=dd956eba16646fd0b15c3c0741269dfd84452dac

I made the loaded_mm handling a little more conservative to make it
more obvious that switch_mm_irqs_off() is safe regardless of exactly
when it gets called relative to switching current.
* Re: [PATCH v2] x86/nmi: Fix some races in NMI uaccess
From: Rik van Riel @ 2018-08-29 15:17 UTC
To: Andy Lutomirski
Cc: X86 ML, Borislav Petkov, Jann Horn, LKML, stable, Peter Zijlstra,
    Nadav Amit

On Tue, 2018-08-28 at 20:46 -0700, Andy Lutomirski wrote:
> On Tue, Aug 28, 2018 at 10:56 AM, Rik van Riel <riel@surriel.com>
> wrote:
> > On Mon, 27 Aug 2018 16:04:16 -0700
> > Andy Lutomirski <luto@kernel.org> wrote:
> >
> > > The 0day bot is still chewing on this, but I've tested it a bit
> > > locally and it seems to do the right thing.
> >
> > Hi Andy,
> >
> > the version of the patch below should fix the bug we talked about
> > in email yesterday.  It should automatically cover kernel threads
> > in lazy TLB mode, because current->mm will be NULL, while
> > cpu_tlbstate.loaded_mm should never be NULL.
>
> That's better than mine.  I tweaked it a bit and added some
> debugging, and I got this:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=dd956eba16646fd0b15c3c0741269dfd84452dac
>
> I made the loaded_mm handling a little more conservative to make it
> more obvious that switch_mm_irqs_off() is safe regardless of exactly
> when it gets called relative to switching current.

I am not convinced that the dance of writing
cpu_tlbstate.loaded_mm twice, with a barrier on
each end, is useful or necessary.

At the time switch_mm_irqs_off() returns, nmi_uaccess_okay()
will still return false, because we have not switched
"current" to the task that owns the next mm_struct yet.

We just have to make sure to:
1) Change cpu_tlbstate.loaded_mm before we manipulate CR3, and
2) Change "current" only once enough of the mm stuff has been
   switched; __switch_to() seems to get that right.

Between the time switch_mm_irqs_off() sets cpu_tlbstate
to the next mm and __switch_to() moves over current,
nmi_uaccess_okay() will return false.

-- 
All Rights Reversed.
* Re: [PATCH v2] x86/nmi: Fix some races in NMI uaccess
From: Andy Lutomirski @ 2018-08-29 15:36 UTC
To: Rik van Riel
Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Jann Horn, LKML, stable,
    Peter Zijlstra, Nadav Amit

On Wed, Aug 29, 2018 at 8:17 AM, Rik van Riel <riel@surriel.com> wrote:
> On Tue, 2018-08-28 at 20:46 -0700, Andy Lutomirski wrote:
>> On Tue, Aug 28, 2018 at 10:56 AM, Rik van Riel <riel@surriel.com>
>> wrote:
>> > On Mon, 27 Aug 2018 16:04:16 -0700
>> > Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> > > The 0day bot is still chewing on this, but I've tested it a bit
>> > > locally and it seems to do the right thing.
>> >
>> > Hi Andy,
>> >
>> > the version of the patch below should fix the bug we talked about
>> > in email yesterday.  It should automatically cover kernel threads
>> > in lazy TLB mode, because current->mm will be NULL, while
>> > cpu_tlbstate.loaded_mm should never be NULL.
>>
>> That's better than mine.  I tweaked it a bit and added some
>> debugging, and I got this:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=dd956eba16646fd0b15c3c0741269dfd84452dac
>>
>> I made the loaded_mm handling a little more conservative to make it
>> more obvious that switch_mm_irqs_off() is safe regardless of exactly
>> when it gets called relative to switching current.
>
> I am not convinced that the dance of writing
> cpu_tlbstate.loaded_mm twice, with a barrier on
> each end, is useful or necessary.
>
> At the time switch_mm_irqs_off() returns, nmi_uaccess_okay()
> will still return false, because we have not switched
> "current" to the task that owns the next mm_struct yet.
>
> We just have to make sure to:
> 1) Change cpu_tlbstate.loaded_mm before we manipulate CR3, and
> 2) Change "current" only once enough of the mm stuff has been
>    switched; __switch_to() seems to get that right.
>
> Between the time switch_mm_irqs_off() sets cpu_tlbstate
> to the next mm and __switch_to() moves over current,
> nmi_uaccess_okay() will return false.

All true, but I think it stops working as soon as someone starts
calling switch_mm_irqs_off() for some other reason, such as during
text_poke().  And that was the original motivation for this patch.
* Re: [PATCH v2] x86/nmi: Fix some races in NMI uaccess
From: Rik van Riel @ 2018-08-29 15:49 UTC
To: Andy Lutomirski
Cc: X86 ML, Borislav Petkov, Jann Horn, LKML, stable, Peter Zijlstra,
    Nadav Amit

On Wed, 2018-08-29 at 08:36 -0700, Andy Lutomirski wrote:
> On Wed, Aug 29, 2018 at 8:17 AM, Rik van Riel <riel@surriel.com>
> wrote:
> > On Tue, 2018-08-28 at 20:46 -0700, Andy Lutomirski wrote:
> > > On Tue, Aug 28, 2018 at 10:56 AM, Rik van Riel <riel@surriel.com>
> > > wrote:
> > > > On Mon, 27 Aug 2018 16:04:16 -0700
> > > > Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > > The 0day bot is still chewing on this, but I've tested it a
> > > > > bit locally and it seems to do the right thing.
> > > >
> > > > Hi Andy,
> > > >
> > > > the version of the patch below should fix the bug we talked
> > > > about in email yesterday.  It should automatically cover kernel
> > > > threads in lazy TLB mode, because current->mm will be NULL,
> > > > while cpu_tlbstate.loaded_mm should never be NULL.
> > >
> > > That's better than mine.  I tweaked it a bit and added some
> > > debugging, and I got this:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=dd956eba16646fd0b15c3c0741269dfd84452dac
> > >
> > > I made the loaded_mm handling a little more conservative to make
> > > it more obvious that switch_mm_irqs_off() is safe regardless of
> > > exactly when it gets called relative to switching current.
> >
> > I am not convinced that the dance of writing
> > cpu_tlbstate.loaded_mm twice, with a barrier on
> > each end, is useful or necessary.
> >
> > At the time switch_mm_irqs_off() returns, nmi_uaccess_okay()
> > will still return false, because we have not switched
> > "current" to the task that owns the next mm_struct yet.
> >
> > We just have to make sure to:
> > 1) Change cpu_tlbstate.loaded_mm before we manipulate CR3, and
> > 2) Change "current" only once enough of the mm stuff has been
> >    switched; __switch_to() seems to get that right.
> >
> > Between the time switch_mm_irqs_off() sets cpu_tlbstate
> > to the next mm and __switch_to() moves over current,
> > nmi_uaccess_okay() will return false.
>
> All true, but I think it stops working as soon as someone starts
> calling switch_mm_irqs_off() for some other reason, such as during
> text_poke().  And that was the original motivation for this patch.

How does calling switch_mm_irqs_off() for text_poke()
change around current->mm and cpu_tlbstate.loaded_mm?

Does current->mm stay the same throughout the entire
text_poke() chain, while cpu_tlbstate.loaded_mm is the
only thing that is changed out?

If so, then yes, the double assignment is indeed necessary.
Good point.

-- 
All Rights Reversed.
* Re: [PATCH v2] x86/nmi: Fix some races in NMI uaccess
From: Andy Lutomirski @ 2018-08-29 16:14 UTC
To: Rik van Riel
Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Jann Horn, LKML, stable,
    Peter Zijlstra, Nadav Amit

On Wed, Aug 29, 2018 at 8:49 AM, Rik van Riel <riel@surriel.com> wrote:
> On Wed, 2018-08-29 at 08:36 -0700, Andy Lutomirski wrote:
>> On Wed, Aug 29, 2018 at 8:17 AM, Rik van Riel <riel@surriel.com>
>> wrote:
>> > On Tue, 2018-08-28 at 20:46 -0700, Andy Lutomirski wrote:
>> > > On Tue, Aug 28, 2018 at 10:56 AM, Rik van Riel <riel@surriel.com>
>> > > wrote:
>> > > > On Mon, 27 Aug 2018 16:04:16 -0700
>> > > > Andy Lutomirski <luto@kernel.org> wrote:
>> > > >
>> > > > > The 0day bot is still chewing on this, but I've tested it a
>> > > > > bit locally and it seems to do the right thing.
>> > > >
>> > > > Hi Andy,
>> > > >
>> > > > the version of the patch below should fix the bug we talked
>> > > > about in email yesterday.  It should automatically cover
>> > > > kernel threads in lazy TLB mode, because current->mm will be
>> > > > NULL, while cpu_tlbstate.loaded_mm should never be NULL.
>> > >
>> > > That's better than mine.  I tweaked it a bit and added some
>> > > debugging, and I got this:
>> > >
>> > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=dd956eba16646fd0b15c3c0741269dfd84452dac
>> > >
>> > > I made the loaded_mm handling a little more conservative to make
>> > > it more obvious that switch_mm_irqs_off() is safe regardless of
>> > > exactly when it gets called relative to switching current.
>> >
>> > I am not convinced that the dance of writing
>> > cpu_tlbstate.loaded_mm twice, with a barrier on
>> > each end, is useful or necessary.
>> >
>> > At the time switch_mm_irqs_off() returns, nmi_uaccess_okay()
>> > will still return false, because we have not switched
>> > "current" to the task that owns the next mm_struct yet.
>> >
>> > We just have to make sure to:
>> > 1) Change cpu_tlbstate.loaded_mm before we manipulate CR3, and
>> > 2) Change "current" only once enough of the mm stuff has been
>> >    switched; __switch_to() seems to get that right.
>> >
>> > Between the time switch_mm_irqs_off() sets cpu_tlbstate
>> > to the next mm and __switch_to() moves over current,
>> > nmi_uaccess_okay() will return false.
>>
>> All true, but I think it stops working as soon as someone starts
>> calling switch_mm_irqs_off() for some other reason, such as during
>> text_poke().  And that was the original motivation for this patch.
>
> How does calling switch_mm_irqs_off() for text_poke()
> change around current->mm and cpu_tlbstate.loaded_mm?
>
> Does current->mm stay the same throughout the entire
> text_poke() chain, while cpu_tlbstate.loaded_mm is the
> only thing that is changed out?

This is exactly what happens.  It seemed considerably more complicated
and error-prone to fiddle with current->mm.  Instead the idea was to
turn off IRQs, get NMI to stay out of the way, and put everything back
the way it was before the scheduler, any syscalls, etc. could notice
that we were messing around.

--Andy