* [PATCH -tip 0/2] x86/kprobes: Fix 2 issues related to text_poke_bp and optprobe
@ 2019-11-27  5:56 Masami Hiramatsu
  2019-11-27  5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
  2019-11-27  5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu
  0 siblings, 2 replies; 19+ messages in thread
From: Masami Hiramatsu @ 2019-11-27  5:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, x86, linux-kernel, bristot,
	jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel,
	jpoimboe, jeyu, alexei.starovoitov, Masami Hiramatsu

Hi,

Here are fixes for two bugs I hit while running ftracetest without the
function tracer. While investigating, I found two different bugs.

The first is a timing bug caused by updating and syncing a global
variable in the wrong order in text_poke_bp_batch(). The kernel can
panic if an int3 is hit between bp_patching.vec = NULL and
bp_patching.nr_entries = 0: the stores are in the wrong order and
there is no synchronization between them. Steve suggested we can fix
it by reordering the stores and adding a sync_core() between them.

The second bug is in the optprobe code and is caused by updating a
flag in the wrong order. Currently kprobes updates the optimized flag
before unoptimizing the code. If the kprobe is hit in this
intermediate state, execution can go back from int3 into the middle of
the modified instruction and cause a kernel panic. This is fixed by
updating the flag only after the code has been unoptimized.

Thank you,

---

Masami Hiramatsu (2):
      x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
      kprobes: Set unoptimized flag after unoptimizing code

 arch/x86/kernel/alternative.c |    8 +++++++-
 kernel/kprobes.c              |    4 +++-
 2 files changed, 10 insertions(+), 2 deletions(-)

--
Masami Hiramatsu (Linaro) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-11-27 5:56 [PATCH -tip 0/2] x86/kprobes: Fix 2 issues related to text_poke_bp and optprobe Masami Hiramatsu @ 2019-11-27 5:56 ` Masami Hiramatsu 2019-12-02 9:15 ` Peter Zijlstra ` (2 more replies) 2019-11-27 5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu 1 sibling, 3 replies; 19+ messages in thread From: Masami Hiramatsu @ 2019-11-27 5:56 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, Peter Zijlstra, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov, Masami Hiramatsu ftracetest multiple_kprobes.tc testcase hit a following NULL pointer exception. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.4.0-rc8+ #23 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 RIP: 0010:poke_int3_handler+0x39/0x100 Code: 5b 5d c3 f6 87 88 00 00 00 03 75 f2 48 8b 87 80 00 00 00 48 89 fb 48 8d 68 ff 48 8b 05 80 98 72 01 83 fa 01 0f 8f 93 00 00 00 <48> 63 10 48 81 c2 00 00 00 81 48 39 d5 75 c5 0f b6 50 08 8d 4a 34 RSP: 0018:ffffc900001a8eb8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffc900001a8ee8 RCX: ffffffff81a00b57 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900001a8ee8 RBP: ffffffff81027635 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88807d980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000007a970000 CR4: 
00000000000006a0
Call Trace:
 <IRQ>
 do_int3+0xd/0xf0
 int3+0x42/0x50
RIP: 0010:sched_clock+0x6/0x10

eu-addr2line told me that poke_int3_handler+0x39 was alternatives:958.

static inline void *text_poke_addr(struct text_poke_loc *tp)
{
	return _stext + tp->rel_addr;	<------ Here is line #958
}

This seems to be caused by tp (bp_patching.vec) being NULL while
bp_patching.nr_entries != 0. There is a small window in which this can
happen, because there is no synchronization between the zeroing of
bp_patching.nr_entries and the clearing of bp_patching.vec.

Steve suggested we could fix this by adding a sync_core(), because int3
is handled with interrupts disabled, and the on_each_cpu() requires all
CPUs to have had their interrupts enabled.

Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations")
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 arch/x86/kernel/alternative.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 4552795a8df4..9505096e2cd1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.
 	 */
-	bp_patching.vec = NULL;
 	bp_patching.nr_entries = 0;
+	/*
+	 * This sync_core() ensures that all int3 handlers in progress
+	 * have finished. This allows poke_int3_handler() after this to
+	 * avoid touching bp_patching.vec by checking nr_entries == 0.
+	 */
+	text_poke_sync();
+	bp_patching.vec = NULL;
 }
 
 void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-11-27 5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu @ 2019-12-02 9:15 ` Peter Zijlstra 2019-12-02 11:50 ` Masami Hiramatsu 2019-12-04 8:33 ` [tip: core/kprobes] x86/alternatives: " tip-bot2 for Masami Hiramatsu 2019-12-09 14:39 ` [PATCH -tip 1/2] x86/alternative: " Peter Zijlstra 2 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2019-12-02 9:15 UTC (permalink / raw) To: Masami Hiramatsu Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote: > ftracetest multiple_kprobes.tc testcase hit a following NULL pointer > exception. > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0 > Oops: 0000 [#1] PREEMPT SMP PTI > CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.4.0-rc8+ #23 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > RIP: 0010:poke_int3_handler+0x39/0x100 > Code: 5b 5d c3 f6 87 88 00 00 00 03 75 f2 48 8b 87 80 00 00 00 48 89 fb 48 8d 68 ff 48 8b 05 80 98 72 01 83 fa 01 0f 8f 93 00 00 00 <48> 63 10 48 81 c2 00 00 00 81 48 39 d5 75 c5 0f b6 50 08 8d 4a 34 > RSP: 0018:ffffc900001a8eb8 EFLAGS: 00010046 > RAX: 0000000000000000 RBX: ffffc900001a8ee8 RCX: ffffffff81a00b57 > RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900001a8ee8 > RBP: ffffffff81027635 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > FS: 0000000000000000(0000) 
GS:ffff88807d980000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000000 CR3: 000000007a970000 CR4: 00000000000006a0 > Call Trace: > <IRQ> > do_int3+0xd/0xf0 > int3+0x42/0x50 > RIP: 0010:sched_clock+0x6/0x10 > > eu-addr2line told that poke_int3_handler+0x39 was alternatives:958. > > static inline void *text_poke_addr(struct text_poke_loc *tp) > { > return _stext + tp->rel_addr; <------ Here is line #958 > } > > This seems like caused by the tp (bp_patching.vec) was NULL but > bp_patching.nr_entries != 0. There is a small chance to do > this, because we have no sync after zeroing bp_patching.nr_entries > before clearing bp_patching.vec. > > Steve suggested we could fix this by adding sync_core, because int3 > is done with interrupts disabled, and the on_each_cpu() requires > all CPUs to have had their interrupts enabled. > > Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations") > Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> > --- > arch/x86/kernel/alternative.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c > index 4552795a8df4..9505096e2cd1 100644 > --- a/arch/x86/kernel/alternative.c > +++ b/arch/x86/kernel/alternative.c > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > * sync_core() implies an smp_mb() and orders this store against > * the writing of the new instruction. > */ > - bp_patching.vec = NULL; > bp_patching.nr_entries = 0; > + /* > + * This sync_core () ensures that all int3 handlers in progress > + * have finished. This allows poke_int3_handler () after this to > + * avoid touching bp_paching.vec by checking nr_entries == 0. > + */ > + text_poke_sync(); > + bp_patching.vec = NULL; > } Hurm.. is there no way we can merge that with the 'last' text_poke_sync() ? 
It seems a little daft to do 2 back-to-back IPI things like that. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-12-02 9:15 ` Peter Zijlstra @ 2019-12-02 11:50 ` Masami Hiramatsu 2019-12-02 13:43 ` Peter Zijlstra 0 siblings, 1 reply; 19+ messages in thread From: Masami Hiramatsu @ 2019-12-02 11:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov On Mon, 2 Dec 2019 10:15:19 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote: > > ftracetest multiple_kprobes.tc testcase hit a following NULL pointer > > exception. > > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x0000) - not-present page > > PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0 > > Oops: 0000 [#1] PREEMPT SMP PTI > > CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.4.0-rc8+ #23 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > > RIP: 0010:poke_int3_handler+0x39/0x100 > > Code: 5b 5d c3 f6 87 88 00 00 00 03 75 f2 48 8b 87 80 00 00 00 48 89 fb 48 8d 68 ff 48 8b 05 80 98 72 01 83 fa 01 0f 8f 93 00 00 00 <48> 63 10 48 81 c2 00 00 00 81 48 39 d5 75 c5 0f b6 50 08 8d 4a 34 > > RSP: 0018:ffffc900001a8eb8 EFLAGS: 00010046 > > RAX: 0000000000000000 RBX: ffffc900001a8ee8 RCX: ffffffff81a00b57 > > RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900001a8ee8 > > RBP: ffffffff81027635 R08: 0000000000000000 R09: 0000000000000000 > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > > FS: 0000000000000000(0000) GS:ffff88807d980000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000000000000000 CR3: 000000007a970000 CR4: 
00000000000006a0 > > Call Trace: > > <IRQ> > > do_int3+0xd/0xf0 > > int3+0x42/0x50 > > RIP: 0010:sched_clock+0x6/0x10 > > > > eu-addr2line told that poke_int3_handler+0x39 was alternatives:958. > > > > static inline void *text_poke_addr(struct text_poke_loc *tp) > > { > > return _stext + tp->rel_addr; <------ Here is line #958 > > } > > > > This seems like caused by the tp (bp_patching.vec) was NULL but > > bp_patching.nr_entries != 0. There is a small chance to do > > this, because we have no sync after zeroing bp_patching.nr_entries > > before clearing bp_patching.vec. > > > > Steve suggested we could fix this by adding sync_core, because int3 > > is done with interrupts disabled, and the on_each_cpu() requires > > all CPUs to have had their interrupts enabled. > > > > Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations") > > Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> > > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> > > --- > > arch/x86/kernel/alternative.c | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c > > index 4552795a8df4..9505096e2cd1 100644 > > --- a/arch/x86/kernel/alternative.c > > +++ b/arch/x86/kernel/alternative.c > > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > > * sync_core() implies an smp_mb() and orders this store against > > * the writing of the new instruction. > > */ > > - bp_patching.vec = NULL; > > bp_patching.nr_entries = 0; > > + /* > > + * This sync_core () ensures that all int3 handlers in progress > > + * have finished. This allows poke_int3_handler () after this to > > + * avoid touching bp_paching.vec by checking nr_entries == 0. > > + */ > > + text_poke_sync(); > > + bp_patching.vec = NULL; > > } > > Hurm.. is there no way we can merge that with the 'last' > text_poke_sync() ? 
> It seems a little daft to do 2 back-to-back IPI
> things like that.

Maybe we can add a NULL check of bp_patching.vec in poke_int3_handler(),
but that does not ensure fundamental safety, because the array pointed
to by bp_patching.vec can itself be released while poke_int3_handler()
is accessing it.

When I hit a similar issue with unregister_kprobe(), I used
synchronize_rcu() to ensure all cores had passed through the scheduler
once. That avoids sending an IPI, but takes a longer time.

Thank you,

--
Masami Hiramatsu <mhiramat@kernel.org>
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-12-02 11:50 ` Masami Hiramatsu @ 2019-12-02 13:43 ` Peter Zijlstra 2019-12-02 14:39 ` Masami Hiramatsu 0 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2019-12-02 13:43 UTC (permalink / raw) To: Masami Hiramatsu Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov On Mon, Dec 02, 2019 at 08:50:12PM +0900, Masami Hiramatsu wrote: > On Mon, 2 Dec 2019 10:15:19 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote: > > > --- a/arch/x86/kernel/alternative.c > > > +++ b/arch/x86/kernel/alternative.c > > > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > > > * sync_core() implies an smp_mb() and orders this store against > > > * the writing of the new instruction. > > > */ > > > - bp_patching.vec = NULL; > > > bp_patching.nr_entries = 0; > > > + /* > > > + * This sync_core () ensures that all int3 handlers in progress > > > + * have finished. This allows poke_int3_handler () after this to > > > + * avoid touching bp_paching.vec by checking nr_entries == 0. > > > + */ > > > + text_poke_sync(); > > > + bp_patching.vec = NULL; > > > } > > > > Hurm.. is there no way we can merge that with the 'last' > > text_poke_sync() ? It seems a little daft to do 2 back-to-back IPI > > things like that. > > Maybe we can add a NULL check of bp_patchig.vec in poke_int3_handler() > but it doesn't ensure the fundamental safeness, because the array > pointed by bp_patching.vec itself can be released while > poke_int3_handler() accesses it. 
No, what I mean is something like:

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 30e86730655c..347a234a7c52 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1119,17 +1119,13 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * Third step: replace the first byte (int3) by the first byte of
 	 * replacing opcode.
 	 */
-	for (do_sync = 0, i = 0; i < nr_entries; i++) {
+	for (i = 0; i < nr_entries; i++) {
 		if (tp[i].text[0] == INT3_INSN_OPCODE)
 			continue;
 
 		text_poke(text_poke_addr(&tp[i]), tp[i].text, INT3_INSN_SIZE);
-		do_sync++;
 	}
 
-	if (do_sync)
-		text_poke_sync();
-
 	/*
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.

Or is that unsafe?
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-12-02 13:43 ` Peter Zijlstra @ 2019-12-02 14:39 ` Masami Hiramatsu 0 siblings, 0 replies; 19+ messages in thread From: Masami Hiramatsu @ 2019-12-02 14:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov On Mon, 2 Dec 2019 14:43:54 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Dec 02, 2019 at 08:50:12PM +0900, Masami Hiramatsu wrote: > > On Mon, 2 Dec 2019 10:15:19 +0100 > > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote: > > > > > --- a/arch/x86/kernel/alternative.c > > > > +++ b/arch/x86/kernel/alternative.c > > > > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > > > > * sync_core() implies an smp_mb() and orders this store against > > > > * the writing of the new instruction. > > > > */ > > > > - bp_patching.vec = NULL; > > > > bp_patching.nr_entries = 0; > > > > + /* > > > > + * This sync_core () ensures that all int3 handlers in progress > > > > + * have finished. This allows poke_int3_handler () after this to > > > > + * avoid touching bp_paching.vec by checking nr_entries == 0. > > > > + */ > > > > + text_poke_sync(); > > > > + bp_patching.vec = NULL; > > > > } > > > > > > Hurm.. is there no way we can merge that with the 'last' > > > text_poke_sync() ? It seems a little daft to do 2 back-to-back IPI > > > things like that. > > > > Maybe we can add a NULL check of bp_patchig.vec in poke_int3_handler() > > but it doesn't ensure the fundamental safeness, because the array > > pointed by bp_patching.vec itself can be released while > > poke_int3_handler() accesses it. 
>
> No, what I mean is something like:
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 30e86730655c..347a234a7c52 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1119,17 +1119,13 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>  	 * Third step: replace the first byte (int3) by the first byte of
>  	 * replacing opcode.
>  	 */
> -	for (do_sync = 0, i = 0; i < nr_entries; i++) {
> +	for (i = 0; i < nr_entries; i++) {
>  		if (tp[i].text[0] == INT3_INSN_OPCODE)
>  			continue;
>
>  		text_poke(text_poke_addr(&tp[i]), tp[i].text, INT3_INSN_SIZE);
> -		do_sync++;
>  	}
>
> -	if (do_sync)
> -		text_poke_sync();
> -
>  	/*
>  	 * sync_core() implies an smp_mb() and orders this store against
>  	 * the writing of the new instruction.
>
> Or is that unsafe ?

OK, let's check it.

text_poke_bp_batch() {
	update vec
	update nr_entries
	smp_wmb()
	write int3
	text_poke_sync()
	write rest_bytes
	text_poke_sync() if rest_bytes
	write first_byte
	text_poke_sync() if first_byte	... (*)
	update nr_entries
	text_poke_sync()		... (**)
	update vec
}

Before (*), the first byte can be either the new opcode or int3, so
poke_int3_handler() can be called. But at that point nr_entries != 0,
so poke_int3_handler() correctly emulates the new instruction.

Before (**), all int3s should have been removed, so nr_entries must not
be accessed, EXCEPT in the case where we are writing int3.

If we just remove (*) as you suggest, poke_int3_handler() can see
nr_entries == 0 before (**). So it is still unsafe.

I also considered skipping (**) when no first byte was written, since
(*) ensures the target address (text) no longer hits int3. However,
that is also unsafe, because another int3 (from kprobes) may have been
hit while we are updating nr_entries and vec.

Thank you,

--
Masami Hiramatsu <mhiramat@kernel.org>
* [tip: core/kprobes] x86/alternatives: Sync bp_patching update for avoiding NULL pointer exception
  2019-11-27  5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
  2019-12-02  9:15 ` Peter Zijlstra
@ 2019-12-04  8:33 ` tip-bot2 for Masami Hiramatsu
  2019-12-09 14:39 ` [PATCH -tip 1/2] x86/alternative: " Peter Zijlstra
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot2 for Masami Hiramatsu @ 2019-12-04 8:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Steven Rostedt (VMware), Alexei Starovoitov, Masami Hiramatsu,
	Andy Lutomirski, Borislav Petkov, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, bristot, Ingo Molnar, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     285a54efe3861976af9d15e85ff8c91a78d1407b
Gitweb:        https://git.kernel.org/tip/285a54efe3861976af9d15e85ff8c91a78d1407b
Author:        Masami Hiramatsu <mhiramat@kernel.org>
AuthorDate:    Wed, 27 Nov 2019 14:56:52 +09:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 27 Nov 2019 07:44:25 +01:00

x86/alternatives: Sync bp_patching update for avoiding NULL pointer exception

ftracetest multiple_kprobes.tc testcase hits the following NULL pointer
exception:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 RIP: 0010:poke_int3_handler+0x39/0x100
 Call Trace:
  <IRQ>
  do_int3+0xd/0xf0
  int3+0x42/0x50
  RIP: 0010:sched_clock+0x6/0x10

poke_int3_handler+0x39 was alternatives:958:

  static inline void *text_poke_addr(struct text_poke_loc *tp)
  {
          return _stext + tp->rel_addr; <------ Here is line #958
  }

This seems to be caused by tp (bp_patching.vec) being NULL but
bp_patching.nr_entries != 0. There is a small chance for this to happen,
because we have no synchronization between the zeroing of
bp_patching.nr_entries and the clearing of bp_patching.vec.
Steve suggested we could fix this by adding sync_core(), because int3 is done with interrupts disabled, and the on_each_cpu() requires all CPUs to have had their interrupts enabled. [ mingo: Edited the comments and the changelog. ] Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations") Link: https://lkml.kernel.org/r/157483421229.25881.15314414408559963162.stgit@devnote2 Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/kernel/alternative.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 4552795..30e8673 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * sync_core() implies an smp_mb() and orders this store against * the writing of the new instruction. */ - bp_patching.vec = NULL; bp_patching.nr_entries = 0; + /* + * This sync_core () call ensures that all INT3 handlers in progress + * have finished. This allows poke_int3_handler() after this to + * avoid touching bp_paching.vec by checking nr_entries == 0. + */ + text_poke_sync(); + bp_patching.vec = NULL; } void text_poke_loc_init(struct text_poke_loc *tp, void *addr, ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-11-27 5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu 2019-12-02 9:15 ` Peter Zijlstra 2019-12-04 8:33 ` [tip: core/kprobes] x86/alternatives: " tip-bot2 for Masami Hiramatsu @ 2019-12-09 14:39 ` Peter Zijlstra 2019-12-10 16:44 ` Masami Hiramatsu 2 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2019-12-09 14:39 UTC (permalink / raw) To: Masami Hiramatsu Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote: > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c > index 4552795a8df4..9505096e2cd1 100644 > --- a/arch/x86/kernel/alternative.c > +++ b/arch/x86/kernel/alternative.c > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > * sync_core() implies an smp_mb() and orders this store against > * the writing of the new instruction. > */ > - bp_patching.vec = NULL; > bp_patching.nr_entries = 0; > + /* > + * This sync_core () ensures that all int3 handlers in progress > + * have finished. This allows poke_int3_handler () after this to > + * avoid touching bp_paching.vec by checking nr_entries == 0. > + */ > + text_poke_sync(); > + bp_patching.vec = NULL; > } How's something like this instead? Under the assumption that it is rare to actually hit the INT3 and even more rare to actually hit this race, the below should be a lot cheaper. 
--- arch/x86/kernel/alternative.c | 69 +++++++++++++++++++++++++++++++++---------- 1 file changed, 53 insertions(+), 16 deletions(-) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 30e86730655c..12f2d193109d 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -953,6 +953,8 @@ static struct bp_patching_desc { int nr_entries; } bp_patching; +static atomic_t bp_handlers; + static inline void *text_poke_addr(struct text_poke_loc *tp) { return _stext + tp->rel_addr; @@ -973,8 +975,8 @@ NOKPROBE_SYMBOL(patch_cmp); int notrace poke_int3_handler(struct pt_regs *regs) { struct text_poke_loc *tp; + int nr, len, ret = 0; void *ip; - int len; /* * Having observed our INT3 instruction, we now must observe @@ -987,12 +989,21 @@ int notrace poke_int3_handler(struct pt_regs *regs) * Idem for other elements in bp_patching. */ smp_rmb(); - - if (likely(!bp_patching.nr_entries)) + if (!READ_ONCE(bp_patching.nr_entries)) return 0; + atomic_inc(&bp_handlers); + /* + * 'ACQUIRE', everything happens after the increment. + */ + smp_mb__after_atomic(); + + nr = smp_load_acquire(&bp_patching.nr_entries); + if (likely(!nr)) + goto out; + if (user_mode(regs)) - return 0; + goto out; /* * Discount the INT3. See text_poke_bp_batch(). @@ -1002,16 +1013,16 @@ int notrace poke_int3_handler(struct pt_regs *regs) /* * Skip the binary search if there is a single member in the vector. */ - if (unlikely(bp_patching.nr_entries > 1)) { - tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries, + if (unlikely(nr > 1)) { + tp = bsearch(ip, bp_patching.vec, nr, sizeof(struct text_poke_loc), patch_cmp); if (!tp) - return 0; + goto out; } else { tp = bp_patching.vec; if (text_poke_addr(tp) != ip) - return 0; + goto out; } len = text_opcode_size(tp->opcode); @@ -1023,7 +1034,7 @@ int notrace poke_int3_handler(struct pt_regs *regs) * Someone poked an explicit INT3, they'll want to handle it, * do not consume. 
*/ - return 0; + goto out; case CALL_INSN_OPCODE: int3_emulate_call(regs, (long)ip + tp->rel32); @@ -1038,7 +1049,14 @@ int notrace poke_int3_handler(struct pt_regs *regs) BUG(); } - return 1; + ret = 1; +out: + /* + * 'RELEASE", everything happens before the decrement. + */ + smp_mb__before_atomic(); + atomic_dec(&bp_handlers); + return ret; } NOKPROBE_SYMBOL(poke_int3_handler); @@ -1076,7 +1094,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries lockdep_assert_held(&text_mutex); bp_patching.vec = tp; - bp_patching.nr_entries = nr_entries; + /* + * bp_patching.vec = tp nr = bp_patching.nr_entries + * REL ACQ + * bp_patching.nr_entries = nr_entries tp = bp_patching.vec[] + */ + smp_store_release(&bp_patching.nr_entries, nr_entries); /* * Corresponding read barrier in int3 notifier for making sure the @@ -1134,13 +1157,27 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * sync_core() implies an smp_mb() and orders this store against * the writing of the new instruction. */ - bp_patching.nr_entries = 0; + WRITE_ONCE(bp_patching.nr_entries, 0); /* - * This sync_core () call ensures that all INT3 handlers in progress - * have finished. This allows poke_int3_handler() after this to - * avoid touching bp_paching.vec by checking nr_entries == 0. + * nr_entries = 0 bp_handlers++ + * MB MB + * VAL = bp_handlers nr = nr_entries + */ + smp_mb(); + /* + * Guarantee all poke_int3_handler()s that have observed + * @bp_patching.nr_enties have completed before we clear + * bp_patching.vec. + * + * We can't do this before text_poke_sync() because then there + * might still be observable INT3 instructions. + */ + atomic_cond_read_acquire(&bp_handlers, !VAL); + /* + * bp_handlers == 0 tp = bp_patching.vec[] + * ACQ MB + * bp_patching.vec = NULL bp_handlers--; */ - text_poke_sync(); bp_patching.vec = NULL; } ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-09 14:39 ` [PATCH -tip 1/2] x86/alternative: " Peter Zijlstra
@ 2019-12-10 16:44 ` Masami Hiramatsu
  2019-12-10 17:32 ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Masami Hiramatsu @ 2019-12-10 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

Hi Peter,

On Mon, 9 Dec 2019 15:39:40 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote:
>
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index 4552795a8df4..9505096e2cd1 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
> >  	 * sync_core() implies an smp_mb() and orders this store against
> >  	 * the writing of the new instruction.
> >  	 */
> > -	bp_patching.vec = NULL;
> >  	bp_patching.nr_entries = 0;
> > +	/*
> > +	 * This sync_core () ensures that all int3 handlers in progress
> > +	 * have finished. This allows poke_int3_handler () after this to
> > +	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> > +	 */
> > +	text_poke_sync();
> > +	bp_patching.vec = NULL;
> >  }
>
> How's something like this instead? Under the assumption that it is rare
> to actually hit the INT3 and even more rare to actually hit this race,
> the below should be a lot cheaper.

Ah, this reminds me of the atomic-refcounter method from my kpatch idea
and from module unloading.

This looks good, but I feel it is a bit complicated. If we use an atomic
counter (and a spin-wait) here, can we instead use atomic_inc_not_zero()
at the start of poke_int3_handler() to check whether bp_patching is in
operation? I think that makes things simpler, like below.
--------- atomic_t bp_refcnt; poke_int3_handler() { smp_rmb(); if (!READ_ONCE(bp_patching.nr_entries)) return 0; if (!atomic_inc_not_zero(&bp_refcnt)) return 0; smp_mb__after_atomic(); [use bp_patching] atomic_dec(&bp_refcnt); } text_poke_bp_batch() { bp_patching.vec = tp; bp_patching.nr_entries = nr_entries; smp_wmb(); atomic_inc(&bp_refcnt); ... atomic_dec(&bp_refcnt); /* wait for all running poke_int3_handler(). */ atomic_cond_read_acquire(&bp_refcnt, !VAL); bp_patching.vec = NULL; bp_patching.nr_entries = 0; } --------- Thank you, > > --- > arch/x86/kernel/alternative.c | 69 +++++++++++++++++++++++++++++++++---------- > 1 file changed, 53 insertions(+), 16 deletions(-) > > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c > index 30e86730655c..12f2d193109d 100644 > --- a/arch/x86/kernel/alternative.c > +++ b/arch/x86/kernel/alternative.c > @@ -953,6 +953,8 @@ static struct bp_patching_desc { > int nr_entries; > } bp_patching; > > +static atomic_t bp_handlers; > + > static inline void *text_poke_addr(struct text_poke_loc *tp) > { > return _stext + tp->rel_addr; > @@ -973,8 +975,8 @@ NOKPROBE_SYMBOL(patch_cmp); > int notrace poke_int3_handler(struct pt_regs *regs) > { > struct text_poke_loc *tp; > + int nr, len, ret = 0; > void *ip; > - int len; > > /* > * Having observed our INT3 instruction, we now must observe > @@ -987,12 +989,21 @@ int notrace poke_int3_handler(struct pt_regs *regs) > * Idem for other elements in bp_patching. > */ > smp_rmb(); > - > - if (likely(!bp_patching.nr_entries)) > + if (!READ_ONCE(bp_patching.nr_entries)) > return 0; > > + atomic_inc(&bp_handlers); > + /* > + * 'ACQUIRE', everything happens after the increment. > + */ > + smp_mb__after_atomic(); > + > + nr = smp_load_acquire(&bp_patching.nr_entries); > + if (likely(!nr)) > + goto out; > + > if (user_mode(regs)) > - return 0; > + goto out; > > /* > * Discount the INT3. See text_poke_bp_batch(). 
> @@ -1002,16 +1013,16 @@ int notrace poke_int3_handler(struct pt_regs *regs) > /* > * Skip the binary search if there is a single member in the vector. > */ > - if (unlikely(bp_patching.nr_entries > 1)) { > - tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries, > + if (unlikely(nr > 1)) { > + tp = bsearch(ip, bp_patching.vec, nr, > sizeof(struct text_poke_loc), > patch_cmp); > if (!tp) > - return 0; > + goto out; > } else { > tp = bp_patching.vec; > if (text_poke_addr(tp) != ip) > - return 0; > + goto out; > } > > len = text_opcode_size(tp->opcode); > @@ -1023,7 +1034,7 @@ int notrace poke_int3_handler(struct pt_regs *regs) > * Someone poked an explicit INT3, they'll want to handle it, > * do not consume. > */ > - return 0; > + goto out; > > case CALL_INSN_OPCODE: > int3_emulate_call(regs, (long)ip + tp->rel32); > @@ -1038,7 +1049,14 @@ int notrace poke_int3_handler(struct pt_regs *regs) > BUG(); > } > > - return 1; > + ret = 1; > +out: > + /* > + * 'RELEASE", everything happens before the decrement. > + */ > + smp_mb__before_atomic(); > + atomic_dec(&bp_handlers); > + return ret; > } > NOKPROBE_SYMBOL(poke_int3_handler); > > @@ -1076,7 +1094,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > lockdep_assert_held(&text_mutex); > > bp_patching.vec = tp; > - bp_patching.nr_entries = nr_entries; > + /* > + * bp_patching.vec = tp nr = bp_patching.nr_entries > + * REL ACQ > + * bp_patching.nr_entries = nr_entries tp = bp_patching.vec[] > + */ > + smp_store_release(&bp_patching.nr_entries, nr_entries); > > /* > * Corresponding read barrier in int3 notifier for making sure the > @@ -1134,13 +1157,27 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > * sync_core() implies an smp_mb() and orders this store against > * the writing of the new instruction. 
> */ > - bp_patching.nr_entries = 0; > + WRITE_ONCE(bp_patching.nr_entries, 0); > /* > - * This sync_core () call ensures that all INT3 handlers in progress > - * have finished. This allows poke_int3_handler() after this to > - * avoid touching bp_paching.vec by checking nr_entries == 0. > + * nr_entries = 0 bp_handlers++ > + * MB MB > + * VAL = bp_handlers nr = nr_entries > + */ > + smp_mb(); > + /* > + * Guarantee all poke_int3_handler()s that have observed > + * @bp_patching.nr_enties have completed before we clear > + * bp_patching.vec. > + * > + * We can't do this before text_poke_sync() because then there > + * might still be observable INT3 instructions. > + */ > + atomic_cond_read_acquire(&bp_handlers, !VAL); > + /* > + * bp_handlers == 0 tp = bp_patching.vec[] > + * ACQ MB > + * bp_patching.vec = NULL bp_handlers--; > */ > - text_poke_sync(); > bp_patching.vec = NULL; > } > -- Masami Hiramatsu <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-12-10 16:44 ` Masami Hiramatsu @ 2019-12-10 17:32 ` Peter Zijlstra 2019-12-11 0:09 ` Peter Zijlstra 0 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2019-12-10 17:32 UTC (permalink / raw) To: Masami Hiramatsu Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov On Wed, Dec 11, 2019 at 01:44:01AM +0900, Masami Hiramatsu wrote: > This looks good, but I feel it is a bit complicated. > > If we use atomic (and spin-wait) here, can we use atomic_inc_not_zero() > in the poke_int3_handler() at first for making sure the bp_batching is > under operation or not? > I think it makes things simpler, like below. > > --------- > atomic_t bp_refcnt; > > poke_int3_handler() > { > smp_rmb(); > if (!READ_ONCE(bp_patching.nr_entries)) > return 0; > if (!atomic_inc_not_zero(&bp_refcnt)) > return 0; > smp_mb__after_atomic(); > [use bp_patching] > atomic_dec(&bp_refcnt); > } > > text_poke_bp_batch() > { > bp_patching.vec = tp; > bp_patching.nr_entries = nr_entries; > smp_wmb(); > atomic_inc(&bp_refcnt); > ... > atomic_dec(&bp_refcnt); > /* wait for all running poke_int3_handler(). */ > atomic_cond_read_acquire(&bp_refcnt, !VAL); > bp_patching.vec = NULL; > bp_patching.nr_entries = 0; > } I feel that is actually more complicated... Let me try to see if I can simplify things. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-12-10 17:32 ` Peter Zijlstra @ 2019-12-11 0:09 ` Peter Zijlstra 2019-12-11 8:09 ` Masami Hiramatsu 0 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2019-12-11 0:09 UTC (permalink / raw) To: Masami Hiramatsu Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov On Tue, Dec 10, 2019 at 06:32:09PM +0100, Peter Zijlstra wrote: > I feel that is actually more complicated... Let me try to see if I can > simplify things. How is this then? --- arch/x86/kernel/alternative.c | 84 +++++++++++++++++++++++++++---------------- 1 file changed, 53 insertions(+), 31 deletions(-) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 30e86730655c..34360ca301a2 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -948,10 +948,29 @@ struct text_poke_loc { const u8 text[POKE_MAX_OPCODE_SIZE]; }; -static struct bp_patching_desc { +struct bp_patching_desc { struct text_poke_loc *vec; int nr_entries; -} bp_patching; + atomic_t refs; +}; + +static struct bp_patching_desc *bp_desc; + +static inline struct bp_patching_desc *try_get_desc(struct bp_patching_desc **descp) +{ + struct bp_patching_desc *desc = READ_ONCE(*descp); /* rcu_dereference */ + + if (!desc || !atomic_inc_not_zero(&desc->refs)) + return NULL; + + return desc; +} + +static inline void put_desc(struct bp_patching_desc *desc) +{ + smp_mb__before_atomic(); + atomic_dec(&desc->refs); +} static inline void *text_poke_addr(struct text_poke_loc *tp) { @@ -972,26 +991,26 @@ NOKPROBE_SYMBOL(patch_cmp); int notrace poke_int3_handler(struct pt_regs *regs) { + struct bp_patching_desc *desc; struct text_poke_loc *tp; + int len, ret = 0; void *ip; - int len; + + if (user_mode(regs)) + return 0; /* * Having observed our INT3 instruction, we now must observe - * 
bp_patching.nr_entries. + * bp_desc: * - * nr_entries != 0 INT3 + * bp_desc = desc INT3 * WMB RMB - * write INT3 if (nr_entries) - * - * Idem for other elements in bp_patching. + * write INT3 if (desc) */ smp_rmb(); - if (likely(!bp_patching.nr_entries)) - return 0; - - if (user_mode(regs)) + desc = try_get_desc(&bp_desc); + if (!desc) return 0; /* @@ -1002,16 +1021,16 @@ int notrace poke_int3_handler(struct pt_regs *regs) /* * Skip the binary search if there is a single member in the vector. */ - if (unlikely(bp_patching.nr_entries > 1)) { - tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries, + if (unlikely(desc->nr_entries > 1)) { + tp = bsearch(ip, desc->vec, desc->nr_entries, sizeof(struct text_poke_loc), patch_cmp); if (!tp) - return 0; + goto out_put; } else { - tp = bp_patching.vec; + tp = desc->vec; if (text_poke_addr(tp) != ip) - return 0; + goto out_put; } len = text_opcode_size(tp->opcode); @@ -1023,7 +1042,7 @@ int notrace poke_int3_handler(struct pt_regs *regs) * Someone poked an explicit INT3, they'll want to handle it, * do not consume. 
*/ - return 0; + goto out_put; case CALL_INSN_OPCODE: int3_emulate_call(regs, (long)ip + tp->rel32); @@ -1038,7 +1057,11 @@ int notrace poke_int3_handler(struct pt_regs *regs) BUG(); } - return 1; + ret = 1; + +out_put: + put_desc(desc); + return ret; } NOKPROBE_SYMBOL(poke_int3_handler); @@ -1069,14 +1092,18 @@ static int tp_vec_nr; */ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) { + struct bp_patching_desc desc = { + .vec = tp, + .nr_entries = nr_entries, + .refs = ATOMIC_INIT(1), + }; unsigned char int3 = INT3_INSN_OPCODE; unsigned int i; int do_sync; lockdep_assert_held(&text_mutex); - bp_patching.vec = tp; - bp_patching.nr_entries = nr_entries; + smp_store_release(&bp_desc, &desc); /* rcu_assign_pointer */ /* * Corresponding read barrier in int3 notifier for making sure the @@ -1131,17 +1158,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries text_poke_sync(); /* - * sync_core() implies an smp_mb() and orders this store against - * the writing of the new instruction. + * Remove and synchronize_rcu(), except we have a very primitive + * refcount based completion. */ - bp_patching.nr_entries = 0; - /* - * This sync_core () call ensures that all INT3 handlers in progress - * have finished. This allows poke_int3_handler() after this to - * avoid touching bp_paching.vec by checking nr_entries == 0. - */ - text_poke_sync(); - bp_patching.vec = NULL; + WRITE_ONCE(bp_desc, NULL); /* RCU_INIT_POINTER */ + if (!atomic_dec_and_test(&desc.refs)) + atomic_cond_read_acquire(&desc.refs, !VAL); } void text_poke_loc_init(struct text_poke_loc *tp, void *addr, ^ permalink raw reply related [flat|nested] 19+ messages in thread
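[Editorial aside: the lifetime rule in Peter's v2 patch — readers may use bp_desc only after atomic_inc_not_zero() on its refcount succeeds, and the writer revokes it by clearing the pointer, dropping the initial reference, and waiting for zero — can be sketched in user-space C11 atomics. A hedged, single-threaded model, not the kernel code: inc_not_zero() reimplements atomic_inc_not_zero() for illustration, and all names are assumptions of this sketch.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct desc {
	int nr_entries;
	atomic_int refs;	/* 1 initial ref owned by the writer */
};

static _Atomic(struct desc *) bp_desc;

/* models atomic_inc_not_zero(): take a reference only if refs != 0 */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);
	do {
		if (old == 0)
			return false;	/* already dead; cannot be revived */
	} while (!atomic_compare_exchange_weak_explicit(v, &old, old + 1,
			memory_order_acquire, memory_order_relaxed));
	return true;
}

static struct desc *try_get_desc(void)
{
	struct desc *d = atomic_load_explicit(&bp_desc, memory_order_acquire);

	if (!d || !inc_not_zero(&d->refs))
		return NULL;
	return d;
}

static void put_desc(struct desc *d)
{
	atomic_fetch_sub_explicit(&d->refs, 1, memory_order_release);
}

static void publish(struct desc *d)
{
	/* models smp_store_release(&bp_desc, &desc), i.e. rcu_assign_pointer */
	atomic_store_explicit(&bp_desc, d, memory_order_release);
}

/* returns once every reader that got a reference has dropped it */
static void retract(struct desc *d)
{
	atomic_store_explicit(&bp_desc, NULL, memory_order_relaxed);
	/* drop the initial ref; wait only if readers still hold the desc */
	if (atomic_fetch_sub_explicit(&d->refs, 1, memory_order_acq_rel) != 1)
		while (atomic_load_explicit(&d->refs, memory_order_acquire) != 0)
			;
}
```

[This is the "very primitive refcount based completion" the patch comment mentions: in the common case no handler is in flight, the dec-and-test hits zero, and retraction is a single store plus one atomic op.]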
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-12-11 0:09 ` Peter Zijlstra @ 2019-12-11 8:09 ` Masami Hiramatsu 2019-12-11 9:12 ` Daniel Bristot de Oliveira 0 siblings, 1 reply; 19+ messages in thread From: Masami Hiramatsu @ 2019-12-11 8:09 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov Hi Peter, On Wed, 11 Dec 2019 01:09:43 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Dec 10, 2019 at 06:32:09PM +0100, Peter Zijlstra wrote: > > > I feel that is actually more complicated... Let me try to see if I can > > simplify things. > > How is this then? This looks perfectly good to me :) Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Thank you! > > --- > arch/x86/kernel/alternative.c | 84 +++++++++++++++++++++++++++---------------- > 1 file changed, 53 insertions(+), 31 deletions(-) > > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c > index 30e86730655c..34360ca301a2 100644 > --- a/arch/x86/kernel/alternative.c > +++ b/arch/x86/kernel/alternative.c > @@ -948,10 +948,29 @@ struct text_poke_loc { > const u8 text[POKE_MAX_OPCODE_SIZE]; > }; > > -static struct bp_patching_desc { > +struct bp_patching_desc { > struct text_poke_loc *vec; > int nr_entries; > -} bp_patching; > + atomic_t refs; > +}; > + > +static struct bp_patching_desc *bp_desc; > + > +static inline struct bp_patching_desc *try_get_desc(struct bp_patching_desc **descp) > +{ > + struct bp_patching_desc *desc = READ_ONCE(*descp); /* rcu_dereference */ > + > + if (!desc || !atomic_inc_not_zero(&desc->refs)) > + return NULL; > + > + return desc; > +} > + > +static inline void put_desc(struct bp_patching_desc *desc) > +{ > + smp_mb__before_atomic(); > + atomic_dec(&desc->refs); > +} > > static inline void *text_poke_addr(struct text_poke_loc *tp) > { > @@ -972,26 +991,26 
@@ NOKPROBE_SYMBOL(patch_cmp); > > int notrace poke_int3_handler(struct pt_regs *regs) > { > + struct bp_patching_desc *desc; > struct text_poke_loc *tp; > + int len, ret = 0; > void *ip; > - int len; > + > + if (user_mode(regs)) > + return 0; > > /* > * Having observed our INT3 instruction, we now must observe > - * bp_patching.nr_entries. > + * bp_desc: > * > - * nr_entries != 0 INT3 > + * bp_desc = desc INT3 > * WMB RMB > - * write INT3 if (nr_entries) > - * > - * Idem for other elements in bp_patching. > + * write INT3 if (desc) > */ > smp_rmb(); > > - if (likely(!bp_patching.nr_entries)) > - return 0; > - > - if (user_mode(regs)) > + desc = try_get_desc(&bp_desc); > + if (!desc) > return 0; > > /* > @@ -1002,16 +1021,16 @@ int notrace poke_int3_handler(struct pt_regs *regs) > /* > * Skip the binary search if there is a single member in the vector. > */ > - if (unlikely(bp_patching.nr_entries > 1)) { > - tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries, > + if (unlikely(desc->nr_entries > 1)) { > + tp = bsearch(ip, desc->vec, desc->nr_entries, > sizeof(struct text_poke_loc), > patch_cmp); > if (!tp) > - return 0; > + goto out_put; > } else { > - tp = bp_patching.vec; > + tp = desc->vec; > if (text_poke_addr(tp) != ip) > - return 0; > + goto out_put; > } > > len = text_opcode_size(tp->opcode); > @@ -1023,7 +1042,7 @@ int notrace poke_int3_handler(struct pt_regs *regs) > * Someone poked an explicit INT3, they'll want to handle it, > * do not consume. 
> */ > - return 0; > + goto out_put; > > case CALL_INSN_OPCODE: > int3_emulate_call(regs, (long)ip + tp->rel32); > @@ -1038,7 +1057,11 @@ int notrace poke_int3_handler(struct pt_regs *regs) > BUG(); > } > > - return 1; > + ret = 1; > + > +out_put: > + put_desc(desc); > + return ret; > } > NOKPROBE_SYMBOL(poke_int3_handler); > > @@ -1069,14 +1092,18 @@ static int tp_vec_nr; > */ > static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) > { > + struct bp_patching_desc desc = { > + .vec = tp, > + .nr_entries = nr_entries, > + .refs = ATOMIC_INIT(1), > + }; > unsigned char int3 = INT3_INSN_OPCODE; > unsigned int i; > int do_sync; > > lockdep_assert_held(&text_mutex); > > - bp_patching.vec = tp; > - bp_patching.nr_entries = nr_entries; > + smp_store_release(&bp_desc, &desc); /* rcu_assign_pointer */ > > /* > * Corresponding read barrier in int3 notifier for making sure the > @@ -1131,17 +1158,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries > text_poke_sync(); > > /* > - * sync_core() implies an smp_mb() and orders this store against > - * the writing of the new instruction. > + * Remove and synchronize_rcu(), except we have a very primitive > + * refcount based completion. > */ > - bp_patching.nr_entries = 0; > - /* > - * This sync_core () call ensures that all INT3 handlers in progress > - * have finished. This allows poke_int3_handler() after this to > - * avoid touching bp_paching.vec by checking nr_entries == 0. > - */ > - text_poke_sync(); > - bp_patching.vec = NULL; > + WRITE_ONCE(bp_desc, NULL); /* RCU_INIT_POINTER */ > + if (!atomic_dec_and_test(&desc.refs)) > + atomic_cond_read_acquire(&desc.refs, !VAL); > } > > void text_poke_loc_init(struct text_poke_loc *tp, void *addr, -- Masami Hiramatsu <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception 2019-12-11 8:09 ` Masami Hiramatsu @ 2019-12-11 9:12 ` Daniel Bristot de Oliveira 0 siblings, 0 replies; 19+ messages in thread From: Daniel Bristot de Oliveira @ 2019-12-11 9:12 UTC (permalink / raw) To: Masami Hiramatsu, Peter Zijlstra Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov On 11/12/2019 09:09, Masami Hiramatsu wrote: > Hi Peter, > > On Wed, 11 Dec 2019 01:09:43 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > >> On Tue, Dec 10, 2019 at 06:32:09PM +0100, Peter Zijlstra wrote: >> >>> I feel that is actually more complicated... Let me try to see if I can >>> simplify things. >> How is this then? > This looks perfectly good to me :) > > Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> too! Thanks! -- Daniel ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code 2019-11-27 5:56 [PATCH -tip 0/2] x86/kprobes: Fix 2 issues related to text_poke_bp and optprobe Masami Hiramatsu 2019-11-27 5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu @ 2019-11-27 5:57 ` Masami Hiramatsu 2019-11-27 6:19 ` Alexei Starovoitov 2019-12-04 8:33 ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu 1 sibling, 2 replies; 19+ messages in thread From: Masami Hiramatsu @ 2019-11-27 5:57 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, Peter Zijlstra, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu, alexei.starovoitov, Masami Hiramatsu Set the unoptimized flag only after confirming the code is completely unoptimized. Without this fix, when a kprobe hits the intermediate modified instruction while unoptimizing (the first byte is replaced by int3, but the latter bytes are still a jump address operand), it can return to a middle byte of the modified code, which causes an invalid instruction exception in the kernel. Usually this is a rare case, but if we put a probe on a function called during text patching, it always causes a kernel panic as below. 
(text_poke() is used for patching the code in optprobe) # echo p text_poke+5 > kprobe_events # echo 1 > events/kprobes/enable # echo 0 > events/kprobes/enable invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Workqueue: events kprobe_optimizer RIP: 0010:text_poke+0x9/0x50 Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89 RSP: 0018:ffffc90000343df0 EFLAGS: 00010686 RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000 RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796 RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0 R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0 FS: 0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0 Call Trace: arch_unoptimize_kprobe+0x22/0x28 arch_unoptimize_kprobes+0x39/0x87 kprobe_optimizer+0x6e/0x290 process_one_work+0x2a0/0x610 worker_thread+0x28/0x3d0 ? process_one_work+0x610/0x610 kthread+0x10d/0x130 ? kthread_park+0x80/0x80 ret_from_fork+0x3a/0x50 Modules linked in: ---[ end trace 83b34b22a228711b ]--- This can happen even if we blacklist text_poke() and other functions, because there is a small time window in which the intermediate code is visible to other CPUs. 
Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> --- kernel/kprobes.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 53534aa258a6..34e28b236d68 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -510,6 +510,8 @@ static void do_unoptimize_kprobes(void) arch_unoptimize_kprobes(&unoptimizing_list, &freeing_list); /* Loop free_list for disarming */ list_for_each_entry_safe(op, tmp, &freeing_list, list) { + /* Switching from detour code to origin */ + op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED; /* Disarm probes if marked disabled */ if (kprobe_disabled(&op->kp)) arch_disarm_kprobe(&op->kp); @@ -649,6 +651,7 @@ static void force_unoptimize_kprobe(struct optimized_kprobe *op) { lockdep_assert_cpus_held(); arch_unoptimize_kprobe(op); + op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED; if (kprobe_disabled(&op->kp)) arch_disarm_kprobe(&op->kp); } @@ -676,7 +679,6 @@ static void unoptimize_kprobe(struct kprobe *p, bool force) return; } - op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED; if (!list_empty(&op->list)) { /* Dequeue from the optimization queue */ list_del_init(&op->list); ^ permalink raw reply related [flat|nested] 19+ messages in thread
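[Editorial aside: the ordering bug this patch fixes — clearing KPROBE_FLAG_OPTIMIZED before the jump is actually replaced — can be illustrated with a small user-space state model. This is purely a toy sketch under assumed semantics (the real int3/jump handling is far more involved); every name below is illustrative, not from the kernel.]

```c
#include <assert.h>
#include <stdbool.h>

/* Three text states during unoptimization: the full jump, the
 * intermediate int3-plus-stale-operand window, and the original insn. */
enum text_state { TEXT_JUMP, TEXT_INT3_STALE_TAIL, TEXT_ORIG };

struct probe {
	bool optimized;		/* models KPROBE_FLAG_OPTIMIZED */
	enum text_state text;
};

/* A probe hit is handled safely when the flag agrees with the text:
 * while the flag still says "optimized", the handler knows the bytes
 * after the first one are not a real instruction and steers around
 * them; once the flag is clear, the text must already be restored. */
static bool hit_is_safe(const struct probe *p)
{
	if (p->optimized)
		return true;
	return p->text == TEXT_ORIG;
}

/* Run the unoptimize sequence; report whether a hit landing in the
 * intermediate window would have been handled safely. */
static bool unoptimize(struct probe *p, bool clear_flag_first)
{
	bool safe;

	if (clear_flag_first)
		p->optimized = false;		/* the pre-fix order */
	p->text = TEXT_INT3_STALE_TAIL;		/* arch_unoptimize_kprobe() window */
	safe = hit_is_safe(p);			/* probe hit inside the window */
	p->text = TEXT_ORIG;			/* text fully restored */
	p->optimized = false;			/* the fixed order: clear flag last */
	return safe;
}
```

[With the old order, a hit in the window resumes into the stale jump operand — the invalid-opcode oops in the changelog; with the fix, the flag still reads as optimized inside the window, so the hit is handled correctly.]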
* Re: [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code 2019-11-27 5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu @ 2019-11-27 6:19 ` Alexei Starovoitov 2019-11-27 6:49 ` Ingo Molnar 2019-11-27 6:56 ` Masami Hiramatsu 2019-12-04 8:33 ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu 1 sibling, 2 replies; 19+ messages in thread From: Alexei Starovoitov @ 2019-11-27 6:19 UTC (permalink / raw) To: Masami Hiramatsu Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu On Wed, Nov 27, 2019 at 02:57:04PM +0900, Masami Hiramatsu wrote: > Fix to set unoptimized flag after confirming the code is completely > unoptimized. Without this fix, when a kprobe hits the intermediate > modified instruction (the first byte is replaced by int3, but > latter bytes still be a jump address operand) while unoptimizing, > it can return to the middle byte of the modified code. And it causes > an invalid instruction exception in the kernel. > > Usually, this is a rare case, but if we put a probe on the function > called while text patching, it always causes a kernel panic as below. 
> (text_poke() is used for patching the code in optprobe) > > # echo p text_poke+5 > kprobe_events > # echo 1 > events/kprobes/enable > # echo 0 > events/kprobes/enable > invalid opcode: 0000 [#1] PREEMPT SMP PTI > CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > Workqueue: events kprobe_optimizer > RIP: 0010:text_poke+0x9/0x50 > Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89 > RSP: 0018:ffffc90000343df0 EFLAGS: 00010686 > RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000 > RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796 > RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0 > R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0 > FS: 0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0 > Call Trace: > arch_unoptimize_kprobe+0x22/0x28 > arch_unoptimize_kprobes+0x39/0x87 > kprobe_optimizer+0x6e/0x290 > process_one_work+0x2a0/0x610 > worker_thread+0x28/0x3d0 > ? process_one_work+0x610/0x610 > kthread+0x10d/0x130 > ? kthread_park+0x80/0x80 > ret_from_fork+0x3a/0x50 > Modules linked in: > ---[ end trace 83b34b22a228711b ]--- > > This can happen even if we blacklist text_poke() and other functions, > because there is a small time window which showing the intermediate > code to other CPUs. > > Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing") > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Awesome. It fixes the crash for me. 
Tested-by: Alexei Starovoitov <ast@kernel.org> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code 2019-11-27 6:19 ` Alexei Starovoitov @ 2019-11-27 6:49 ` Ingo Molnar 2019-12-02 21:55 ` Alexei Starovoitov 2019-11-27 6:56 ` Masami Hiramatsu 1 sibling, 1 reply; 19+ messages in thread From: Ingo Molnar @ 2019-11-27 6:49 UTC (permalink / raw) To: Alexei Starovoitov Cc: Masami Hiramatsu, Steven Rostedt, Peter Zijlstra, x86, linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu * Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Wed, Nov 27, 2019 at 02:57:04PM +0900, Masami Hiramatsu wrote: > > Fix to set unoptimized flag after confirming the code is completely > > unoptimized. Without this fix, when a kprobe hits the intermediate > > modified instruction (the first byte is replaced by int3, but > > latter bytes still be a jump address operand) while unoptimizing, > > it can return to the middle byte of the modified code. And it causes > > an invalid instruction exception in the kernel. > > > > Usually, this is a rare case, but if we put a probe on the function > > called while text patching, it always causes a kernel panic as below. 
> > (text_poke() is used for patching the code in optprobe) > > > > # echo p text_poke+5 > kprobe_events > > # echo 1 > events/kprobes/enable > > # echo 0 > events/kprobes/enable > > invalid opcode: 0000 [#1] PREEMPT SMP PTI > > CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > > Workqueue: events kprobe_optimizer > > RIP: 0010:text_poke+0x9/0x50 > > Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89 > > RSP: 0018:ffffc90000343df0 EFLAGS: 00010686 > > RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000 > > RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796 > > RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000 > > R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0 > > R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0 > > FS: 0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0 > > Call Trace: > > arch_unoptimize_kprobe+0x22/0x28 > > arch_unoptimize_kprobes+0x39/0x87 > > kprobe_optimizer+0x6e/0x290 > > process_one_work+0x2a0/0x610 > > worker_thread+0x28/0x3d0 > > ? process_one_work+0x610/0x610 > > kthread+0x10d/0x130 > > ? kthread_park+0x80/0x80 > > ret_from_fork+0x3a/0x50 > > Modules linked in: > > ---[ end trace 83b34b22a228711b ]--- > > > > This can happen even if we blacklist text_poke() and other functions, > > because there is a small time window which showing the intermediate > > code to other CPUs. > > > > Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing") > > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> > > Awesome. It fixes the crash for me. 
> Tested-by: Alexei Starovoitov <ast@kernel.org> Thanks guys - I just pushed out a rebased tree, based on an upstream version that has both the BPF tree and most x86 trees merged, into tip:WIP.core/kprobes. This includes these two fixes as well. Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code 2019-11-27 6:49 ` Ingo Molnar @ 2019-12-02 21:55 ` Alexei Starovoitov 0 siblings, 0 replies; 19+ messages in thread From: Alexei Starovoitov @ 2019-12-02 21:55 UTC (permalink / raw) To: Ingo Molnar Cc: Masami Hiramatsu, Steven Rostedt, Peter Zijlstra, X86 ML, LKML, bristot, jbaron, Linus Torvalds, Thomas Gleixner, Nadav Amit, H. Peter Anvin, Andy Lutomirski, Ard Biesheuvel, Josh Poimboeuf, Jessica Yu On Tue, Nov 26, 2019 at 10:49 PM Ingo Molnar <mingo@kernel.org> wrote: > > > * Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > On Wed, Nov 27, 2019 at 02:57:04PM +0900, Masami Hiramatsu wrote: > > > Fix to set unoptimized flag after confirming the code is completely > > > unoptimized. Without this fix, when a kprobe hits the intermediate > > > modified instruction (the first byte is replaced by int3, but > > > latter bytes still be a jump address operand) while unoptimizing, > > > it can return to the middle byte of the modified code. And it causes > > > an invalid instruction exception in the kernel. > > > > > > Usually, this is a rare case, but if we put a probe on the function > > > called while text patching, it always causes a kernel panic as below. 
> > > (text_poke() is used for patching the code in optprobe) > > > > > > # echo p text_poke+5 > kprobe_events > > > # echo 1 > events/kprobes/enable > > > # echo 0 > events/kprobes/enable > > > invalid opcode: 0000 [#1] PREEMPT SMP PTI > > > CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29 > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > > > Workqueue: events kprobe_optimizer > > > RIP: 0010:text_poke+0x9/0x50 > > > Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89 > > > RSP: 0018:ffffc90000343df0 EFLAGS: 00010686 > > > RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000 > > > RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796 > > > RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000 > > > R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0 > > > R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0 > > > FS: 0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0 > > > Call Trace: > > > arch_unoptimize_kprobe+0x22/0x28 > > > arch_unoptimize_kprobes+0x39/0x87 > > > kprobe_optimizer+0x6e/0x290 > > > process_one_work+0x2a0/0x610 > > > worker_thread+0x28/0x3d0 > > > ? process_one_work+0x610/0x610 > > > kthread+0x10d/0x130 > > > ? kthread_park+0x80/0x80 > > > ret_from_fork+0x3a/0x50 > > > Modules linked in: > > > ---[ end trace 83b34b22a228711b ]--- > > > > > > This can happen even if we blacklist text_poke() and other functions, > > > because there is a small time window which showing the intermediate > > > code to other CPUs. 
> > > > > > Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing") > > > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> > > > > Awesome. It fixes the crash for me. > > Tested-by: Alexei Starovoitov <ast@kernel.org> > > Thanks guys - I just pushed out a rebased tree, based on an upstream > version that has both the BPF tree and most x86 trees merged, into > tip:WIP.core/kprobes. This includes these two fixes as well. fwiw. I merged WIP.core/kprobes into my local tree and have been testing everything with it. No issues found. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code
  2019-11-27 6:19 ` Alexei Starovoitov
  2019-11-27 6:49   ` Ingo Molnar
@ 2019-11-27 6:56   ` Masami Hiramatsu
  1 sibling, 0 replies; 19+ messages in thread
From: Masami Hiramatsu @ 2019-11-27 6:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, x86, linux-kernel,
	bristot, jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel,
	jpoimboe, jeyu

On Tue, 26 Nov 2019 22:19:11 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Nov 27, 2019 at 02:57:04PM +0900, Masami Hiramatsu wrote:
> > Fix to set the unoptimized flag after confirming the code is completely
> > unoptimized. Without this fix, when a kprobe hits the intermediate
> > modified instruction (the first byte is replaced by int3, but the
> > later bytes are still a jump address operand) while unoptimizing,
> > it can return to the middle byte of the modified code. This causes
> > an invalid instruction exception in the kernel.
> >
> > Usually, this is a rare case, but if we put a probe on a function
> > called during text patching, it always causes a kernel panic as below.
> > (text_poke() is used for patching the code in optprobe)
> >
> > # echo p text_poke+5 > kprobe_events
> > # echo 1 > events/kprobes/enable
> > # echo 0 > events/kprobes/enable
> > invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> > Workqueue: events kprobe_optimizer
> > RIP: 0010:text_poke+0x9/0x50
> > Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89
> > RSP: 0018:ffffc90000343df0 EFLAGS: 00010686
> > RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000
> > RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796
> > RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0
> > R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0
> > FS: 0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0
> > Call Trace:
> >  arch_unoptimize_kprobe+0x22/0x28
> >  arch_unoptimize_kprobes+0x39/0x87
> >  kprobe_optimizer+0x6e/0x290
> >  process_one_work+0x2a0/0x610
> >  worker_thread+0x28/0x3d0
> >  ? process_one_work+0x610/0x610
> >  kthread+0x10d/0x130
> >  ? kthread_park+0x80/0x80
> >  ret_from_fork+0x3a/0x50
> > Modules linked in:
> > ---[ end trace 83b34b22a228711b ]---
> >
> > This can happen even if we blacklist text_poke() and other functions,
> > because there is a small time window during which the intermediate
> > code is shown to other CPUs.
> >
> > Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
> > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
>
> Awesome. It fixes the crash for me.
> Tested-by: Alexei Starovoitov <ast@kernel.org>

That's good news :) Thank you for testing!

--
Masami Hiramatsu <mhiramat@kernel.org>
* [tip: core/kprobes] kprobes: Set unoptimized flag after unoptimizing code
  2019-11-27 5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu
  2019-11-27 6:19   ` Alexei Starovoitov
@ 2019-12-04 8:33   ` tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 19+ messages in thread
From: tip-bot2 for Masami Hiramatsu @ 2019-12-04 8:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Alexei Starovoitov, Masami Hiramatsu, Andy Lutomirski,
	Borislav Petkov, Linus Torvalds, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, bristot, Ingo Molnar, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     f66c0447cca1281116224d474cdb37d6a18e4b5b
Gitweb:        https://git.kernel.org/tip/f66c0447cca1281116224d474cdb37d6a18e4b5b
Author:        Masami Hiramatsu <mhiramat@kernel.org>
AuthorDate:    Wed, 27 Nov 2019 14:57:04 +09:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 27 Nov 2019 07:44:25 +01:00

kprobes: Set unoptimized flag after unoptimizing code

Set the unoptimized flag after confirming the code is completely
unoptimized. Without this fix, when a kprobe hits the intermediate
modified instruction (the first byte is replaced by an INT3, but
later bytes can still be a jump address operand) while unoptimizing,
it can return to the middle byte of the modified code, which causes
an invalid instruction exception in the kernel.

Usually, this is a rare case, but if we put a probe on the function
call while text patching, it always causes a kernel panic as below:

 # echo p text_poke+5 > kprobe_events
 # echo 1 > events/kprobes/enable
 # echo 0 > events/kprobes/enable

 invalid opcode: 0000 [#1] PREEMPT SMP PTI
 RIP: 0010:text_poke+0x9/0x50
 Call Trace:
  arch_unoptimize_kprobe+0x22/0x28
  arch_unoptimize_kprobes+0x39/0x87
  kprobe_optimizer+0x6e/0x290
  process_one_work+0x2a0/0x610
  worker_thread+0x28/0x3d0
  ? process_one_work+0x610/0x610
  kthread+0x10d/0x130
  ? kthread_park+0x80/0x80
  ret_from_fork+0x3a/0x50

 text_poke() is used for patching the code in optprobes.

 This can happen even if we blacklist text_poke() and other functions,
 because there is a small time window during which we show the
 intermediate code to other CPUs.

 [ mingo: Edited the changelog. ]

Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
Link: https://lkml.kernel.org/r/157483422375.25881.13508326028469515760.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/kprobes.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 53534aa..34e28b2 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -510,6 +510,8 @@ static void do_unoptimize_kprobes(void)
 	arch_unoptimize_kprobes(&unoptimizing_list, &freeing_list);
 	/* Loop free_list for disarming */
 	list_for_each_entry_safe(op, tmp, &freeing_list, list) {
+		/* Switching from detour code to origin */
+		op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 		/* Disarm probes if marked disabled */
 		if (kprobe_disabled(&op->kp))
 			arch_disarm_kprobe(&op->kp);
@@ -649,6 +651,7 @@ static void force_unoptimize_kprobe(struct optimized_kprobe *op)
 {
 	lockdep_assert_cpus_held();
 	arch_unoptimize_kprobe(op);
+	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 	if (kprobe_disabled(&op->kp))
 		arch_disarm_kprobe(&op->kp);
 }
@@ -676,7 +679,6 @@ static void unoptimize_kprobe(struct kprobe *p, bool force)
 		return;
 	}

-	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 	if (!list_empty(&op->list)) {
 		/* Dequeue from the optimization queue */
 		list_del_init(&op->list);
end of thread, other threads:[~2019-12-11 9:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-27 5:56 [PATCH -tip 0/2] x86/kprobes: Fix 2 issues related to text_poke_bp and optprobe Masami Hiramatsu
2019-11-27 5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
2019-12-02 9:15   ` Peter Zijlstra
2019-12-02 11:50     ` Masami Hiramatsu
2019-12-02 13:43       ` Peter Zijlstra
2019-12-02 14:39         ` Masami Hiramatsu
2019-12-04 8:33   ` [tip: core/kprobes] x86/alternatives: " tip-bot2 for Masami Hiramatsu
2019-12-09 14:39   ` [PATCH -tip 1/2] x86/alternative: " Peter Zijlstra
2019-12-10 16:44     ` Masami Hiramatsu
2019-12-10 17:32       ` Peter Zijlstra
2019-12-11 0:09         ` Peter Zijlstra
2019-12-11 8:09           ` Masami Hiramatsu
2019-12-11 9:12             ` Daniel Bristot de Oliveira
2019-11-27 5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu
2019-11-27 6:19   ` Alexei Starovoitov
2019-11-27 6:49     ` Ingo Molnar
2019-12-02 21:55       ` Alexei Starovoitov
2019-11-27 6:56   ` Masami Hiramatsu
2019-12-04 8:33 ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu