linux-kernel.vger.kernel.org archive mirror
* [PATCH -tip 0/2] x86/kprobes: Fix 2 issues related to text_poke_bp and optprobe
@ 2019-11-27  5:56 Masami Hiramatsu
  2019-11-27  5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
  2019-11-27  5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu
  0 siblings, 2 replies; 19+ messages in thread
From: Masami Hiramatsu @ 2019-11-27  5:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, x86, linux-kernel, bristot,
	jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel,
	jpoimboe, jeyu, alexei.starovoitov, Masami Hiramatsu

Hi,

Here are fixes for the issues I hit while testing ftracetest
without the function tracer. While investigating, I found there
were 2 different bugs.

The 1st bug is a timing bug caused by a wrong global variable
update order and missing syncing in text_poke_bp_batch(). It can
cause a kernel panic if an int3 is hit in between
bp_patching.vec = NULL and bp_patching.nr_entries = 0; the two
stores are in the wrong order and there is no synchronization
between them. Steve suggested we can fix it by reordering the
stores and adding a sync_core() between them.
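
As a rough sketch (simplified, not the exact code; the real change
is in the 1st patch), the tail of text_poke_bp_batch() currently
does:

	bp_patching.vec = NULL;
	/* an int3 hit here sees nr_entries != 0 but vec == NULL */
	bp_patching.nr_entries = 0;

and the fix reorders this to clear nr_entries first and run a
text_poke_sync() before NULLing the vector, so no in-flight
poke_int3_handler() can still see the stale state.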

The 2nd bug is in the optprobe code and is caused by a wrong flag
update order. Currently, kprobes clears the optimized flag before
unoptimizing the code. But if the kprobe is hit while the code is
in the intermediate, partially unoptimized state, execution can go
back from the int3 into the middle of the modified instruction and
cause a kernel panic. This can be fixed by clearing the flag after
the code has been unoptimized.
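
Again as a rough sketch (simplified; the actual changes are in the
2nd patch), the ordering problem is:

	/* current */
	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;	/* looks unoptimized... */
	arch_unoptimize_kprobe(op);		/* ...while text is in transition */

	/* fixed */
	arch_unoptimize_kprobe(op);		/* restore the text first */
	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;	/* then clear the flag */

If the probe is hit while arch_unoptimize_kprobe() is still
rewriting the text but the flag is already cleared, the int3 is
handled as a plain kprobe and execution can resume in the middle of
the stale jump address operand.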

Thank you,

---

Masami Hiramatsu (2):
      x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
      kprobes: Set unoptimized flag after unoptimizing code


 arch/x86/kernel/alternative.c |    8 +++++++-
 kernel/kprobes.c              |    4 +++-
 2 files changed, 10 insertions(+), 2 deletions(-)

--
Masami Hiramatsu (Linaro) <mhiramat@kernel.org>


* [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-11-27  5:56 [PATCH -tip 0/2] x86/kprobes: Fix 2 issues related to text_poke_bp and optprobe Masami Hiramatsu
@ 2019-11-27  5:56 ` Masami Hiramatsu
  2019-12-02  9:15   ` Peter Zijlstra
                     ` (2 more replies)
  2019-11-27  5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu
  1 sibling, 3 replies; 19+ messages in thread
From: Masami Hiramatsu @ 2019-11-27  5:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, x86, linux-kernel, bristot,
	jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel,
	jpoimboe, jeyu, alexei.starovoitov, Masami Hiramatsu

The ftracetest multiple_kprobes.tc testcase hits the following NULL
pointer exception.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.4.0-rc8+ #23
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
RIP: 0010:poke_int3_handler+0x39/0x100
Code: 5b 5d c3 f6 87 88 00 00 00 03 75 f2 48 8b 87 80 00 00 00 48 89 fb 48 8d 68 ff 48 8b 05 80 98 72 01 83 fa 01 0f 8f 93 00 00 00 <48> 63 10 48 81 c2 00 00 00 81 48 39 d5 75 c5 0f b6 50 08 8d 4a 34
RSP: 0018:ffffc900001a8eb8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffc900001a8ee8 RCX: ffffffff81a00b57
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900001a8ee8
RBP: ffffffff81027635 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88807d980000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000007a970000 CR4: 00000000000006a0
Call Trace:
 <IRQ>
 do_int3+0xd/0xf0
 int3+0x42/0x50
RIP: 0010:sched_clock+0x6/0x10

eu-addr2line told us that poke_int3_handler+0x39 was alternatives:958.

static inline void *text_poke_addr(struct text_poke_loc *tp)
{
        return _stext + tp->rel_addr; <------ Here is line #958
}

This seems to be caused by tp (bp_patching.vec) being NULL but
bp_patching.nr_entries != 0. There is a small chance for this to
happen, because we have no synchronization between the zeroing of
bp_patching.nr_entries and the clearing of bp_patching.vec.

Steve suggested we could fix this by adding a sync_core(), because int3
is handled with interrupts disabled, and on_each_cpu() requires
all CPUs to have had their interrupts enabled.
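
For reference, text_poke_sync() in this tree is essentially just a
wrapper (slightly paraphrased here) around the IPI broadcast:

	void text_poke_sync(void)
	{
		on_each_cpu(do_sync_core, NULL, 1);
	}

so waiting for it to return also guarantees that every CPU has left
poke_int3_handler(), since that handler runs with interrupts
disabled and on_each_cpu() needs each CPU to re-enable interrupts to
take the IPI.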

Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations")
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 arch/x86/kernel/alternative.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 4552795a8df4..9505096e2cd1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.
 	 */
-	bp_patching.vec = NULL;
 	bp_patching.nr_entries = 0;
+	/*
+	 * This sync_core() ensures that all int3 handlers in progress
+	 * have finished. This allows poke_int3_handler() after this to
+	 * avoid touching bp_patching.vec by checking nr_entries == 0.
+	 */
+	text_poke_sync();
+	bp_patching.vec = NULL;
 }
 
 void text_poke_loc_init(struct text_poke_loc *tp, void *addr,



* [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code
  2019-11-27  5:56 [PATCH -tip 0/2] x86/kprobes: Fix 2 issues related to text_poke_bp and optprobe Masami Hiramatsu
  2019-11-27  5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
@ 2019-11-27  5:57 ` Masami Hiramatsu
  2019-11-27  6:19   ` Alexei Starovoitov
  2019-12-04  8:33   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  1 sibling, 2 replies; 19+ messages in thread
From: Masami Hiramatsu @ 2019-11-27  5:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, x86, linux-kernel, bristot,
	jbaron, torvalds, tglx, namit, hpa, luto, ard.biesheuvel,
	jpoimboe, jeyu, alexei.starovoitov, Masami Hiramatsu

Set the unoptimized flag only after confirming the code is completely
unoptimized. Without this fix, when a kprobe hits the intermediate
modified instruction (the first byte is replaced by an int3, but the
later bytes can still be a jump address operand) while unoptimizing,
it can return to the middle byte of the modified code, which causes
an invalid instruction exception in the kernel.

Usually, this is a rare case, but if we put a probe on a function
called while text patching, it always causes a kernel panic as below.
(text_poke() is used for patching the code in optprobes.)

 # echo p text_poke+5 > kprobe_events
 # echo 1 > events/kprobes/enable
 # echo 0 > events/kprobes/enable
 invalid opcode: 0000 [#1] PREEMPT SMP PTI
 CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
 Workqueue: events kprobe_optimizer
 RIP: 0010:text_poke+0x9/0x50
 Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89
 RSP: 0018:ffffc90000343df0 EFLAGS: 00010686
 RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000
 RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796
 RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0
 R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0
 FS:  0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0
 Call Trace:
  arch_unoptimize_kprobe+0x22/0x28
  arch_unoptimize_kprobes+0x39/0x87
  kprobe_optimizer+0x6e/0x290
  process_one_work+0x2a0/0x610
  worker_thread+0x28/0x3d0
  ? process_one_work+0x610/0x610
  kthread+0x10d/0x130
  ? kthread_park+0x80/0x80
  ret_from_fork+0x3a/0x50
 Modules linked in:
 ---[ end trace 83b34b22a228711b ]---

This can happen even if we blacklist text_poke() and other functions,
because there is a small time window during which the intermediate
code is visible to other CPUs.
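
For reference, a sketch of the x86 side (assuming the usual optprobe
implementation in arch/x86/kernel/kprobes/opt.c, paraphrased here):
arch_unoptimize_kprobe() restores the original bytes with roughly

	insn_buff[0] = BREAKPOINT_INSTRUCTION;	/* keep int3 for the kprobe */
	memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
	text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
		     op->optinsn.insn);

While that text_poke_bp() is in progress, the first byte is already
int3 but the following bytes can still be the old jump address
operand. If the kprobe already looks unoptimized at that point, the
int3 is handled as a regular kprobe and execution resumes right after
the first original instruction, i.e. inside the stale operand bytes.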

Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 kernel/kprobes.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 53534aa258a6..34e28b236d68 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -510,6 +510,8 @@ static void do_unoptimize_kprobes(void)
 	arch_unoptimize_kprobes(&unoptimizing_list, &freeing_list);
 	/* Loop free_list for disarming */
 	list_for_each_entry_safe(op, tmp, &freeing_list, list) {
+		/* Switching from detour code to origin */
+		op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 		/* Disarm probes if marked disabled */
 		if (kprobe_disabled(&op->kp))
 			arch_disarm_kprobe(&op->kp);
@@ -649,6 +651,7 @@ static void force_unoptimize_kprobe(struct optimized_kprobe *op)
 {
 	lockdep_assert_cpus_held();
 	arch_unoptimize_kprobe(op);
+	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 	if (kprobe_disabled(&op->kp))
 		arch_disarm_kprobe(&op->kp);
 }
@@ -676,7 +679,6 @@ static void unoptimize_kprobe(struct kprobe *p, bool force)
 		return;
 	}
 
-	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 	if (!list_empty(&op->list)) {
 		/* Dequeue from the optimization queue */
 		list_del_init(&op->list);



* Re: [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code
  2019-11-27  5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu
@ 2019-11-27  6:19   ` Alexei Starovoitov
  2019-11-27  6:49     ` Ingo Molnar
  2019-11-27  6:56     ` Masami Hiramatsu
  2019-12-04  8:33   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  1 sibling, 2 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2019-11-27  6:19 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, x86, linux-kernel,
	bristot, jbaron, torvalds, tglx, namit, hpa, luto,
	ard.biesheuvel, jpoimboe, jeyu

On Wed, Nov 27, 2019 at 02:57:04PM +0900, Masami Hiramatsu wrote:
> Fix to set unoptimized flag after confirming the code is completely
> unoptimized. Without this fix, when a kprobe hits the intermediate
> modified instruction (the first byte is replaced by int3, but
> latter bytes still be a jump address operand) while unoptimizing,
> it can return to the middle byte of the modified code. And it causes
> an invalid instruction exception in the kernel.
> 
> Usually, this is a rare case, but if we put a probe on the function
> called while text patching, it always causes a kernel panic as below.
> (text_poke() is used for patching the code in optprobe)
> 
>  # echo p text_poke+5 > kprobe_events
>  # echo 1 > events/kprobes/enable
>  # echo 0 > events/kprobes/enable
>  invalid opcode: 0000 [#1] PREEMPT SMP PTI
>  CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>  Workqueue: events kprobe_optimizer
>  RIP: 0010:text_poke+0x9/0x50
>  Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89
>  RSP: 0018:ffffc90000343df0 EFLAGS: 00010686
>  RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000
>  RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796
>  RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000
>  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0
>  R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0
>  FS:  0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0
>  Call Trace:
>   arch_unoptimize_kprobe+0x22/0x28
>   arch_unoptimize_kprobes+0x39/0x87
>   kprobe_optimizer+0x6e/0x290
>   process_one_work+0x2a0/0x610
>   worker_thread+0x28/0x3d0
>   ? process_one_work+0x610/0x610
>   kthread+0x10d/0x130
>   ? kthread_park+0x80/0x80
>   ret_from_fork+0x3a/0x50
>  Modules linked in:
>  ---[ end trace 83b34b22a228711b ]---
> 
> This can happen even if we blacklist text_poke() and other functions,
> because there is a small time window which showing the intermediate
> code to other CPUs.
> 
> Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>

Awesome. It fixes the crash for me.
Tested-by: Alexei Starovoitov <ast@kernel.org>



* Re: [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code
  2019-11-27  6:19   ` Alexei Starovoitov
@ 2019-11-27  6:49     ` Ingo Molnar
  2019-12-02 21:55       ` Alexei Starovoitov
  2019-11-27  6:56     ` Masami Hiramatsu
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2019-11-27  6:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Masami Hiramatsu, Steven Rostedt, Peter Zijlstra, x86,
	linux-kernel, bristot, jbaron, torvalds, tglx, namit, hpa, luto,
	ard.biesheuvel, jpoimboe, jeyu


* Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Nov 27, 2019 at 02:57:04PM +0900, Masami Hiramatsu wrote:
> > Fix to set unoptimized flag after confirming the code is completely
> > unoptimized. Without this fix, when a kprobe hits the intermediate
> > modified instruction (the first byte is replaced by int3, but
> > latter bytes still be a jump address operand) while unoptimizing,
> > it can return to the middle byte of the modified code. And it causes
> > an invalid instruction exception in the kernel.
> > 
> > Usually, this is a rare case, but if we put a probe on the function
> > called while text patching, it always causes a kernel panic as below.
> > (text_poke() is used for patching the code in optprobe)
> > 
> >  # echo p text_poke+5 > kprobe_events
> >  # echo 1 > events/kprobes/enable
> >  # echo 0 > events/kprobes/enable
> >  invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >  CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> >  Workqueue: events kprobe_optimizer
> >  RIP: 0010:text_poke+0x9/0x50
> >  Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89
> >  RSP: 0018:ffffc90000343df0 EFLAGS: 00010686
> >  RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000
> >  RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796
> >  RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000
> >  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0
> >  R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0
> >  FS:  0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0
> >  Call Trace:
> >   arch_unoptimize_kprobe+0x22/0x28
> >   arch_unoptimize_kprobes+0x39/0x87
> >   kprobe_optimizer+0x6e/0x290
> >   process_one_work+0x2a0/0x610
> >   worker_thread+0x28/0x3d0
> >   ? process_one_work+0x610/0x610
> >   kthread+0x10d/0x130
> >   ? kthread_park+0x80/0x80
> >   ret_from_fork+0x3a/0x50
> >  Modules linked in:
> >  ---[ end trace 83b34b22a228711b ]---
> > 
> > This can happen even if we blacklist text_poke() and other functions,
> > because there is a small time window which showing the intermediate
> > code to other CPUs.
> > 
> > Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
> > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> 
> Awesome. It fixes the crash for me.
> Tested-by: Alexei Starovoitov <ast@kernel.org>

Thanks guys - I just pushed out a rebased tree, based on an upstream 
version that has both the BPF tree and most x86 trees merged, into 
tip:WIP.core/kprobes. This includes these two fixes as well.

Thanks,

	Ingo


* Re: [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code
  2019-11-27  6:19   ` Alexei Starovoitov
  2019-11-27  6:49     ` Ingo Molnar
@ 2019-11-27  6:56     ` Masami Hiramatsu
  1 sibling, 0 replies; 19+ messages in thread
From: Masami Hiramatsu @ 2019-11-27  6:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, x86, linux-kernel,
	bristot, jbaron, torvalds, tglx, namit, hpa, luto,
	ard.biesheuvel, jpoimboe, jeyu

On Tue, 26 Nov 2019 22:19:11 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Nov 27, 2019 at 02:57:04PM +0900, Masami Hiramatsu wrote:
> > Fix to set unoptimized flag after confirming the code is completely
> > unoptimized. Without this fix, when a kprobe hits the intermediate
> > modified instruction (the first byte is replaced by int3, but
> > latter bytes still be a jump address operand) while unoptimizing,
> > it can return to the middle byte of the modified code. And it causes
> > an invalid instruction exception in the kernel.
> > 
> > Usually, this is a rare case, but if we put a probe on the function
> > called while text patching, it always causes a kernel panic as below.
> > (text_poke() is used for patching the code in optprobe)
> > 
> >  # echo p text_poke+5 > kprobe_events
> >  # echo 1 > events/kprobes/enable
> >  # echo 0 > events/kprobes/enable
> >  invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >  CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> >  Workqueue: events kprobe_optimizer
> >  RIP: 0010:text_poke+0x9/0x50
> >  Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89
> >  RSP: 0018:ffffc90000343df0 EFLAGS: 00010686
> >  RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000
> >  RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796
> >  RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000
> >  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0
> >  R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0
> >  FS:  0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0
> >  Call Trace:
> >   arch_unoptimize_kprobe+0x22/0x28
> >   arch_unoptimize_kprobes+0x39/0x87
> >   kprobe_optimizer+0x6e/0x290
> >   process_one_work+0x2a0/0x610
> >   worker_thread+0x28/0x3d0
> >   ? process_one_work+0x610/0x610
> >   kthread+0x10d/0x130
> >   ? kthread_park+0x80/0x80
> >   ret_from_fork+0x3a/0x50
> >  Modules linked in:
> >  ---[ end trace 83b34b22a228711b ]---
> > 
> > This can happen even if we blacklist text_poke() and other functions,
> > because there is a small time window which showing the intermediate
> > code to other CPUs.
> > 
> > Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
> > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> 
> Awesome. It fixes the crash for me.
> Tested-by: Alexei Starovoitov <ast@kernel.org>

That's good news :)

Thank you for testing!

-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-11-27  5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
@ 2019-12-02  9:15   ` Peter Zijlstra
  2019-12-02 11:50     ` Masami Hiramatsu
  2019-12-04  8:33   ` [tip: core/kprobes] x86/alternatives: " tip-bot2 for Masami Hiramatsu
  2019-12-09 14:39   ` [PATCH -tip 1/2] x86/alternative: " Peter Zijlstra
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2019-12-02  9:15 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote:
> ftracetest multiple_kprobes.tc testcase hit a following NULL pointer
> exception.
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.4.0-rc8+ #23
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> RIP: 0010:poke_int3_handler+0x39/0x100
> Code: 5b 5d c3 f6 87 88 00 00 00 03 75 f2 48 8b 87 80 00 00 00 48 89 fb 48 8d 68 ff 48 8b 05 80 98 72 01 83 fa 01 0f 8f 93 00 00 00 <48> 63 10 48 81 c2 00 00 00 81 48 39 d5 75 c5 0f b6 50 08 8d 4a 34
> RSP: 0018:ffffc900001a8eb8 EFLAGS: 00010046
> RAX: 0000000000000000 RBX: ffffc900001a8ee8 RCX: ffffffff81a00b57
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900001a8ee8
> RBP: ffffffff81027635 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffff88807d980000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000007a970000 CR4: 00000000000006a0
> Call Trace:
>  <IRQ>
>  do_int3+0xd/0xf0
>  int3+0x42/0x50
> RIP: 0010:sched_clock+0x6/0x10
> 
> eu-addr2line told that poke_int3_handler+0x39 was alternatives:958.
> 
> static inline void *text_poke_addr(struct text_poke_loc *tp)
> {
>         return _stext + tp->rel_addr; <------ Here is line #958
> }
> 
> This seems like caused by the tp (bp_patching.vec) was NULL but
> bp_patching.nr_entries != 0. There is a small chance to do
> this, because we have no sync after zeroing bp_patching.nr_entries
> before clearing bp_patching.vec.
> 
> Steve suggested we could fix this by adding sync_core, because int3
> is done with interrupts disabled, and the on_each_cpu() requires
> all CPUs to have had their interrupts enabled.
> 
> Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations")
> Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> ---
>  arch/x86/kernel/alternative.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 4552795a8df4..9505096e2cd1 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>  	 * sync_core() implies an smp_mb() and orders this store against
>  	 * the writing of the new instruction.
>  	 */
> -	bp_patching.vec = NULL;
>  	bp_patching.nr_entries = 0;
> +	/*
> +	 * This sync_core () ensures that all int3 handlers in progress
> +	 * have finished. This allows poke_int3_handler () after this to
> +	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> +	 */
> +	text_poke_sync();
> +	bp_patching.vec = NULL;
>  }

Hurm.. is there no way we can merge that with the 'last'
text_poke_sync() ? It seems a little daft to do 2 back-to-back IPI
things like that.


* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-02  9:15   ` Peter Zijlstra
@ 2019-12-02 11:50     ` Masami Hiramatsu
  2019-12-02 13:43       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Masami Hiramatsu @ 2019-12-02 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

On Mon, 2 Dec 2019 10:15:19 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote:
> > ftracetest multiple_kprobes.tc testcase hit a following NULL pointer
> > exception.
> > 
> > BUG: kernel NULL pointer dereference, address: 0000000000000000
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0
> > Oops: 0000 [#1] PREEMPT SMP PTI
> > CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.4.0-rc8+ #23
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> > RIP: 0010:poke_int3_handler+0x39/0x100
> > Code: 5b 5d c3 f6 87 88 00 00 00 03 75 f2 48 8b 87 80 00 00 00 48 89 fb 48 8d 68 ff 48 8b 05 80 98 72 01 83 fa 01 0f 8f 93 00 00 00 <48> 63 10 48 81 c2 00 00 00 81 48 39 d5 75 c5 0f b6 50 08 8d 4a 34
> > RSP: 0018:ffffc900001a8eb8 EFLAGS: 00010046
> > RAX: 0000000000000000 RBX: ffffc900001a8ee8 RCX: ffffffff81a00b57
> > RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900001a8ee8
> > RBP: ffffffff81027635 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > FS:  0000000000000000(0000) GS:ffff88807d980000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000000 CR3: 000000007a970000 CR4: 00000000000006a0
> > Call Trace:
> >  <IRQ>
> >  do_int3+0xd/0xf0
> >  int3+0x42/0x50
> > RIP: 0010:sched_clock+0x6/0x10
> > 
> > eu-addr2line told that poke_int3_handler+0x39 was alternatives:958.
> > 
> > static inline void *text_poke_addr(struct text_poke_loc *tp)
> > {
> >         return _stext + tp->rel_addr; <------ Here is line #958
> > }
> > 
> > This seems like caused by the tp (bp_patching.vec) was NULL but
> > bp_patching.nr_entries != 0. There is a small chance to do
> > this, because we have no sync after zeroing bp_patching.nr_entries
> > before clearing bp_patching.vec.
> > 
> > Steve suggested we could fix this by adding sync_core, because int3
> > is done with interrupts disabled, and the on_each_cpu() requires
> > all CPUs to have had their interrupts enabled.
> > 
> > Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations")
> > Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> > ---
> >  arch/x86/kernel/alternative.c |    8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index 4552795a8df4..9505096e2cd1 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
> >  	 * sync_core() implies an smp_mb() and orders this store against
> >  	 * the writing of the new instruction.
> >  	 */
> > -	bp_patching.vec = NULL;
> >  	bp_patching.nr_entries = 0;
> > +	/*
> > +	 * This sync_core () ensures that all int3 handlers in progress
> > +	 * have finished. This allows poke_int3_handler () after this to
> > +	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> > +	 */
> > +	text_poke_sync();
> > +	bp_patching.vec = NULL;
> >  }
> 
> Hurm.. is there no way we can merge that with the 'last'
> text_poke_sync() ? It seems a little daft to do 2 back-to-back IPI
> things like that.

Maybe we can add a NULL check of bp_patching.vec in poke_int3_handler(),
but that doesn't ensure the fundamental safety, because the array
pointed to by bp_patching.vec itself can be released while
poke_int3_handler() is accessing it.

When I hit a similar issue with unregister_kprobe(), I used
synchronize_rcu() to ensure all cores had passed through the scheduler
once. That avoids sending the IPI but takes a longer time.
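
Just to make that alternative concrete, a hypothetical sketch (not a
proposal; it relies on poke_int3_handler() running with interrupts
disabled, a non-preemptible region that synchronize_rcu() has to wait
out):

	bp_patching.nr_entries = 0;
	/* wait until no CPU can still be inside poke_int3_handler() */
	synchronize_rcu();
	bp_patching.vec = NULL;

This trades the extra IPI for a (potentially much longer)
grace-period wait.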

Thank you,
-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-02 11:50     ` Masami Hiramatsu
@ 2019-12-02 13:43       ` Peter Zijlstra
  2019-12-02 14:39         ` Masami Hiramatsu
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2019-12-02 13:43 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

On Mon, Dec 02, 2019 at 08:50:12PM +0900, Masami Hiramatsu wrote:
> On Mon, 2 Dec 2019 10:15:19 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote:

> > > --- a/arch/x86/kernel/alternative.c
> > > +++ b/arch/x86/kernel/alternative.c
> > > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
> > >  	 * sync_core() implies an smp_mb() and orders this store against
> > >  	 * the writing of the new instruction.
> > >  	 */
> > > -	bp_patching.vec = NULL;
> > >  	bp_patching.nr_entries = 0;
> > > +	/*
> > > +	 * This sync_core () ensures that all int3 handlers in progress
> > > +	 * have finished. This allows poke_int3_handler () after this to
> > > +	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> > > +	 */
> > > +	text_poke_sync();
> > > +	bp_patching.vec = NULL;
> > >  }
> > 
> > Hurm.. is there no way we can merge that with the 'last'
> > text_poke_sync() ? It seems a little daft to do 2 back-to-back IPI
> > things like that.
> 
> Maybe we can add a NULL check of bp_patchig.vec in poke_int3_handler()
> but it doesn't ensure the fundamental safeness, because the array
> pointed by bp_patching.vec itself can be released while
> poke_int3_handler() accesses it.

No, what I mean is something like:

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 30e86730655c..347a234a7c52 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1119,17 +1119,13 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * Third step: replace the first byte (int3) by the first byte of
 	 * replacing opcode.
 	 */
-	for (do_sync = 0, i = 0; i < nr_entries; i++) {
+	for (i = 0; i < nr_entries; i++) {
 		if (tp[i].text[0] == INT3_INSN_OPCODE)
 			continue;
 
 		text_poke(text_poke_addr(&tp[i]), tp[i].text, INT3_INSN_SIZE);
-		do_sync++;
 	}
 
-	if (do_sync)
-		text_poke_sync();
-
 	/*
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.


Or is that unsafe ?


* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-02 13:43       ` Peter Zijlstra
@ 2019-12-02 14:39         ` Masami Hiramatsu
  0 siblings, 0 replies; 19+ messages in thread
From: Masami Hiramatsu @ 2019-12-02 14:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

On Mon, 2 Dec 2019 14:43:54 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Dec 02, 2019 at 08:50:12PM +0900, Masami Hiramatsu wrote:
> > On Mon, 2 Dec 2019 10:15:19 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote:
> 
> > > > --- a/arch/x86/kernel/alternative.c
> > > > +++ b/arch/x86/kernel/alternative.c
> > > > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
> > > >  	 * sync_core() implies an smp_mb() and orders this store against
> > > >  	 * the writing of the new instruction.
> > > >  	 */
> > > > -	bp_patching.vec = NULL;
> > > >  	bp_patching.nr_entries = 0;
> > > > +	/*
> > > > +	 * This sync_core () ensures that all int3 handlers in progress
> > > > +	 * have finished. This allows poke_int3_handler () after this to
> > > > +	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> > > > +	 */
> > > > +	text_poke_sync();
> > > > +	bp_patching.vec = NULL;
> > > >  }
> > > 
> > > Hurm.. is there no way we can merge that with the 'last'
> > > text_poke_sync() ? It seems a little daft to do 2 back-to-back IPI
> > > things like that.
> > 
> > Maybe we can add a NULL check of bp_patchig.vec in poke_int3_handler()
> > but it doesn't ensure the fundamental safeness, because the array
> > pointed by bp_patching.vec itself can be released while
> > poke_int3_handler() accesses it.
> 
> No, what I mean is something like:
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 30e86730655c..347a234a7c52 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1119,17 +1119,13 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>  	 * Third step: replace the first byte (int3) by the first byte of
>  	 * replacing opcode.
>  	 */
> -	for (do_sync = 0, i = 0; i < nr_entries; i++) {
> +	for (i = 0; i < nr_entries; i++) {
>  		if (tp[i].text[0] == INT3_INSN_OPCODE)
>  			continue;
>  
>  		text_poke(text_poke_addr(&tp[i]), tp[i].text, INT3_INSN_SIZE);
> -		do_sync++;
>  	}
>  
> -	if (do_sync)
> -		text_poke_sync();
> -
>  	/*
>  	 * sync_core() implies an smp_mb() and orders this store against
>  	 * the writing of the new instruction.
> 
> 
> Or is that unsafe ?

OK, let's check it. 

text_poke_bp_batch() {
  update vec
  update nr_entries
  smp_wmb()
  write int3
  text_poke_sync()
  write rest_bytes
  text_poke_sync() if rest_bytes
  write first_byte
  text_poke_sync() if first_byte ... (*)
  update nr_entries
  text_poke_sync() ... (**)
  update vec
}

Before (*), the first byte can be either the new opcode or int3, so
poke_int3_handler() can be called. But at that point
nr_entries != 0, so poke_int3_handler() correctly emulates
the new instruction.

Before (**), all int3s should have been removed, so nr_entries must
not be accessed, EXCEPT in the int3-writing case.

If we just remove (*) as you say, poke_int3_handler()
can see nr_entries == 0 before (**). So it is still unsafe.

I considered another way: skipping (**) if !first_byte,
since (*) ensures the target address (text) no longer hits int3.
However, this is also unsafe because another int3 (from kprobes) may
be hit while nr_entries and vec are being updated.


Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code
  2019-11-27  6:49     ` Ingo Molnar
@ 2019-12-02 21:55       ` Alexei Starovoitov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2019-12-02 21:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Masami Hiramatsu, Steven Rostedt, Peter Zijlstra, X86 ML, LKML,
	bristot, jbaron, Linus Torvalds, Thomas Gleixner, Nadav Amit,
	H. Peter Anvin, Andy Lutomirski, Ard Biesheuvel, Josh Poimboeuf,
	Jessica Yu

On Tue, Nov 26, 2019 at 10:49 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > On Wed, Nov 27, 2019 at 02:57:04PM +0900, Masami Hiramatsu wrote:
> > > Fix to set unoptimized flag after confirming the code is completely
> > > unoptimized. Without this fix, when a kprobe hits the intermediate
> > > modified instruction (the first byte is replaced by int3, but
> > > latter bytes still be a jump address operand) while unoptimizing,
> > > it can return to the middle byte of the modified code. And it causes
> > > an invalid instruction exception in the kernel.
> > >
> > > Usually, this is a rare case, but if we put a probe on the function
> > > called while text patching, it always causes a kernel panic as below.
> > > (text_poke() is used for patching the code in optprobe)
> > >
> > >  # echo p text_poke+5 > kprobe_events
> > >  # echo 1 > events/kprobes/enable
> > >  # echo 0 > events/kprobes/enable
> > >  invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > >  CPU: 7 PID: 137 Comm: kworker/7:1 Not tainted 5.4.0-rc8+ #29
> > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> > >  Workqueue: events kprobe_optimizer
> > >  RIP: 0010:text_poke+0x9/0x50
> > >  Code: 01 00 00 5b 5d 41 5c 41 5d c3 89 c0 0f b7 4c 02 fe 66 89 4c 05 fe e9 31 ff ff ff e8 71 ac 03 00 90 55 48 89 f5 53 cc 30 cb fd <1e> ec 08 8b 05 72 98 31 01 85 c0 75 11 48 83 c4 08 48 89 ee 48 89
> > >  RSP: 0018:ffffc90000343df0 EFLAGS: 00010686
> > >  RAX: 0000000000000000 RBX: ffffffff81025796 RCX: 0000000000000000
> > >  RDX: 0000000000000004 RSI: ffff88807c983148 RDI: ffffffff81025796
> > >  RBP: ffff88807c983148 R08: 0000000000000001 R09: 0000000000000000
> > >  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82284fe0
> > >  R13: ffff88807c983138 R14: ffffffff82284ff0 R15: 0ffff88807d9eee0
> > >  FS:  0000000000000000(0000) GS:ffff88807d9c0000(0000) knlGS:0000000000000000
> > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >  CR2: 000000000058158b CR3: 000000007b372000 CR4: 00000000000006a0
> > >  Call Trace:
> > >   arch_unoptimize_kprobe+0x22/0x28
> > >   arch_unoptimize_kprobes+0x39/0x87
> > >   kprobe_optimizer+0x6e/0x290
> > >   process_one_work+0x2a0/0x610
> > >   worker_thread+0x28/0x3d0
> > >   ? process_one_work+0x610/0x610
> > >   kthread+0x10d/0x130
> > >   ? kthread_park+0x80/0x80
> > >   ret_from_fork+0x3a/0x50
> > >  Modules linked in:
> > >  ---[ end trace 83b34b22a228711b ]---
> > >
> > > This can happen even if we blacklist text_poke() and other functions,
> > > because there is a small time window which showing the intermediate
> > > code to other CPUs.
> > >
> > > Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
> > > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> >
> > Awesome. It fixes the crash for me.
> > Tested-by: Alexei Starovoitov <ast@kernel.org>
>
> Thanks guys - I just pushed out a rebased tree, based on an upstream
> version that has both the BPF tree and most x86 trees merged, into
> tip:WIP.core/kprobes. This includes these two fixes as well.

fwiw. I merged WIP.core/kprobes into my local tree and have been
testing everything with it. No issues found.


* [tip: core/kprobes] kprobes: Set unoptimized flag after unoptimizing code
  2019-11-27  5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu
  2019-11-27  6:19   ` Alexei Starovoitov
@ 2019-12-04  8:33   ` tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 19+ messages in thread
From: tip-bot2 for Masami Hiramatsu @ 2019-12-04  8:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexei Starovoitov, Masami Hiramatsu, Andy Lutomirski,
	Borislav Petkov, Linus Torvalds, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, bristot, Ingo Molnar, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     f66c0447cca1281116224d474cdb37d6a18e4b5b
Gitweb:        https://git.kernel.org/tip/f66c0447cca1281116224d474cdb37d6a18e4b5b
Author:        Masami Hiramatsu <mhiramat@kernel.org>
AuthorDate:    Wed, 27 Nov 2019 14:57:04 +09:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 27 Nov 2019 07:44:25 +01:00

kprobes: Set unoptimized flag after unoptimizing code

Set the unoptimized flag after confirming the code is completely
unoptimized. Without this fix, when a kprobe hits the intermediate
modified instruction (the first byte is replaced by an INT3, but
later bytes can still be a jump address operand) while unoptimizing,
it can return to the middle byte of the modified code, which causes
an invalid instruction exception in the kernel.

Usually, this is a rare case, but if we put a probe on a function
called while text patching, it always causes a kernel panic as below:

 # echo p text_poke+5 > kprobe_events
 # echo 1 > events/kprobes/enable
 # echo 0 > events/kprobes/enable

invalid opcode: 0000 [#1] PREEMPT SMP PTI
 RIP: 0010:text_poke+0x9/0x50
 Call Trace:
  arch_unoptimize_kprobe+0x22/0x28
  arch_unoptimize_kprobes+0x39/0x87
  kprobe_optimizer+0x6e/0x290
  process_one_work+0x2a0/0x610
  worker_thread+0x28/0x3d0
  ? process_one_work+0x610/0x610
  kthread+0x10d/0x130
  ? kthread_park+0x80/0x80
  ret_from_fork+0x3a/0x50

text_poke() is used for patching the code in optprobes.

This can happen even if we blacklist text_poke() and other functions,
because there is a small time window during which we show the intermediate
code to other CPUs.

 [ mingo: Edited the changelog. ]

Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
Link: https://lkml.kernel.org/r/157483422375.25881.13508326028469515760.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/kprobes.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 53534aa..34e28b2 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -510,6 +510,8 @@ static void do_unoptimize_kprobes(void)
 	arch_unoptimize_kprobes(&unoptimizing_list, &freeing_list);
 	/* Loop free_list for disarming */
 	list_for_each_entry_safe(op, tmp, &freeing_list, list) {
+		/* Switching from detour code to origin */
+		op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 		/* Disarm probes if marked disabled */
 		if (kprobe_disabled(&op->kp))
 			arch_disarm_kprobe(&op->kp);
@@ -649,6 +651,7 @@ static void force_unoptimize_kprobe(struct optimized_kprobe *op)
 {
 	lockdep_assert_cpus_held();
 	arch_unoptimize_kprobe(op);
+	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 	if (kprobe_disabled(&op->kp))
 		arch_disarm_kprobe(&op->kp);
 }
@@ -676,7 +679,6 @@ static void unoptimize_kprobe(struct kprobe *p, bool force)
 		return;
 	}
 
-	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 	if (!list_empty(&op->list)) {
 		/* Dequeue from the optimization queue */
 		list_del_init(&op->list);


* [tip: core/kprobes] x86/alternatives: Sync bp_patching update for avoiding NULL pointer exception
  2019-11-27  5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
  2019-12-02  9:15   ` Peter Zijlstra
@ 2019-12-04  8:33   ` tip-bot2 for Masami Hiramatsu
  2019-12-09 14:39   ` [PATCH -tip 1/2] x86/alternative: " Peter Zijlstra
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot2 for Masami Hiramatsu @ 2019-12-04  8:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Steven Rostedt (VMware),
	Alexei Starovoitov, Masami Hiramatsu, Andy Lutomirski,
	Borislav Petkov, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	bristot, Ingo Molnar, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     285a54efe3861976af9d15e85ff8c91a78d1407b
Gitweb:        https://git.kernel.org/tip/285a54efe3861976af9d15e85ff8c91a78d1407b
Author:        Masami Hiramatsu <mhiramat@kernel.org>
AuthorDate:    Wed, 27 Nov 2019 14:56:52 +09:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 27 Nov 2019 07:44:25 +01:00

x86/alternatives: Sync bp_patching update for avoiding NULL pointer exception

ftracetest multiple_kprobes.tc testcase hits the following NULL pointer
exception:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 PGD 800000007bf60067 P4D 800000007bf60067 PUD 7bf5f067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 RIP: 0010:poke_int3_handler+0x39/0x100
 Call Trace:
  <IRQ>
  do_int3+0xd/0xf0
  int3+0x42/0x50
  RIP: 0010:sched_clock+0x6/0x10

poke_int3_handler+0x39 was alternatives:958:

  static inline void *text_poke_addr(struct text_poke_loc *tp)
  {
          return _stext + tp->rel_addr; <------ Here is line #958
  }

This seems to be caused by tp (bp_patching.vec) being NULL but
bp_patching.nr_entries != 0. There is a small chance for this
to happen, because we have no synchronization between the zeroing
of bp_patching.nr_entries and the clearing of bp_patching.vec.

Steve suggested we could fix this by adding sync_core(), because int3
is done with interrupts disabled, and the on_each_cpu() requires
all CPUs to have had their interrupts enabled.

 [ mingo: Edited the comments and the changelog. ]

Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: c0213b0ac03c ("x86/alternative: Batch of patch operations")
Link: https://lkml.kernel.org/r/157483421229.25881.15314414408559963162.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/alternative.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 4552795..30e8673 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.
 	 */
-	bp_patching.vec = NULL;
 	bp_patching.nr_entries = 0;
+	/*
+	 * This sync_core() call ensures that all INT3 handlers in progress
+	 * have finished. This allows poke_int3_handler() after this to
+	 * avoid touching bp_patching.vec by checking nr_entries == 0.
+	 */
+	text_poke_sync();
+	bp_patching.vec = NULL;
 }
 
 void text_poke_loc_init(struct text_poke_loc *tp, void *addr,


* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-11-27  5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
  2019-12-02  9:15   ` Peter Zijlstra
  2019-12-04  8:33   ` [tip: core/kprobes] x86/alternatives: " tip-bot2 for Masami Hiramatsu
@ 2019-12-09 14:39   ` Peter Zijlstra
  2019-12-10 16:44     ` Masami Hiramatsu
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2019-12-09 14:39 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote:

> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 4552795a8df4..9505096e2cd1 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>  	 * sync_core() implies an smp_mb() and orders this store against
>  	 * the writing of the new instruction.
>  	 */
> -	bp_patching.vec = NULL;
>  	bp_patching.nr_entries = 0;
> +	/*
> +	 * This sync_core () ensures that all int3 handlers in progress
> +	 * have finished. This allows poke_int3_handler () after this to
> +	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> +	 */
> +	text_poke_sync();
> +	bp_patching.vec = NULL;
>  }

How's something like this instead? Under the assumption that it is rare
to actually hit the INT3 and even more rare to actually hit this race,
the below should be a lot cheaper.

---
 arch/x86/kernel/alternative.c | 69 +++++++++++++++++++++++++++++++++----------
 1 file changed, 53 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 30e86730655c..12f2d193109d 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -953,6 +953,8 @@ static struct bp_patching_desc {
 	int nr_entries;
 } bp_patching;
 
+static atomic_t bp_handlers;
+
 static inline void *text_poke_addr(struct text_poke_loc *tp)
 {
 	return _stext + tp->rel_addr;
@@ -973,8 +975,8 @@ NOKPROBE_SYMBOL(patch_cmp);
 int notrace poke_int3_handler(struct pt_regs *regs)
 {
 	struct text_poke_loc *tp;
+	int nr, len, ret = 0;
 	void *ip;
-	int len;
 
 	/*
 	 * Having observed our INT3 instruction, we now must observe
@@ -987,12 +989,21 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 	 * Idem for other elements in bp_patching.
 	 */
 	smp_rmb();
-
-	if (likely(!bp_patching.nr_entries))
+	if (!READ_ONCE(bp_patching.nr_entries))
 		return 0;
 
+	atomic_inc(&bp_handlers);
+	/*
+	 * 'ACQUIRE', everything happens after the increment.
+	 */
+	smp_mb__after_atomic();
+
+	nr = smp_load_acquire(&bp_patching.nr_entries);
+	if (likely(!nr))
+		goto out;
+
 	if (user_mode(regs))
-		return 0;
+		goto out;
 
 	/*
 	 * Discount the INT3. See text_poke_bp_batch().
@@ -1002,16 +1013,16 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 	/*
 	 * Skip the binary search if there is a single member in the vector.
 	 */
-	if (unlikely(bp_patching.nr_entries > 1)) {
-		tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries,
+	if (unlikely(nr > 1)) {
+		tp = bsearch(ip, bp_patching.vec, nr,
 			     sizeof(struct text_poke_loc),
 			     patch_cmp);
 		if (!tp)
-			return 0;
+			goto out;
 	} else {
 		tp = bp_patching.vec;
 		if (text_poke_addr(tp) != ip)
-			return 0;
+			goto out;
 	}
 
 	len = text_opcode_size(tp->opcode);
@@ -1023,7 +1034,7 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 		 * Someone poked an explicit INT3, they'll want to handle it,
 		 * do not consume.
 		 */
-		return 0;
+		goto out;
 
 	case CALL_INSN_OPCODE:
 		int3_emulate_call(regs, (long)ip + tp->rel32);
@@ -1038,7 +1049,14 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 		BUG();
 	}
 
-	return 1;
+	ret = 1;
+out:
+	/*
+	 * 'RELEASE", everything happens before the decrement.
+	 */
+	smp_mb__before_atomic();
+	atomic_dec(&bp_handlers);
+	return ret;
 }
 NOKPROBE_SYMBOL(poke_int3_handler);
 
@@ -1076,7 +1094,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	lockdep_assert_held(&text_mutex);
 
 	bp_patching.vec = tp;
-	bp_patching.nr_entries = nr_entries;
+	/*
+	 * bp_patching.vec = tp			nr = bp_patching.nr_entries
+	 * REL					ACQ
+	 * bp_patching.nr_entries = nr_entries	tp = bp_patching.vec[]
+	 */
+	smp_store_release(&bp_patching.nr_entries, nr_entries);
 
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
@@ -1134,13 +1157,27 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.
 	 */
-	bp_patching.nr_entries = 0;
+	WRITE_ONCE(bp_patching.nr_entries, 0);
 	/*
-	 * This sync_core () call ensures that all INT3 handlers in progress
-	 * have finished. This allows poke_int3_handler() after this to
-	 * avoid touching bp_paching.vec by checking nr_entries == 0.
+	 * nr_entries = 0	bp_handlers++
+	 * MB			MB
+	 * VAL = bp_handlers	nr = nr_entries
+	 */
+	smp_mb();
+	/*
+	 * Guarantee all poke_int3_handler()s that have observed
+	 * @bp_patching.nr_enties have completed before we clear
+	 * bp_patching.vec.
+	 *
+	 * We can't do this before text_poke_sync() because then there
+	 * might still be observable INT3 instructions.
+	 */
+	atomic_cond_read_acquire(&bp_handlers, !VAL);
+	/*
+	 * bp_handlers == 0		tp = bp_patching.vec[]
+	 * ACQ				MB
+	 * bp_patching.vec = NULL	bp_handlers--;
 	 */
-	text_poke_sync();
 	bp_patching.vec = NULL;
 }
 


* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-09 14:39   ` [PATCH -tip 1/2] x86/alternative: " Peter Zijlstra
@ 2019-12-10 16:44     ` Masami Hiramatsu
  2019-12-10 17:32       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Masami Hiramatsu @ 2019-12-10 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

Hi Peter,

On Mon, 9 Dec 2019 15:39:40 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2019 at 02:56:52PM +0900, Masami Hiramatsu wrote:
> 
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index 4552795a8df4..9505096e2cd1 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -1134,8 +1134,14 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
> >  	 * sync_core() implies an smp_mb() and orders this store against
> >  	 * the writing of the new instruction.
> >  	 */
> > -	bp_patching.vec = NULL;
> >  	bp_patching.nr_entries = 0;
> > +	/*
> > +	 * This sync_core () ensures that all int3 handlers in progress
> > +	 * have finished. This allows poke_int3_handler () after this to
> > +	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> > +	 */
> > +	text_poke_sync();
> > +	bp_patching.vec = NULL;
> >  }
> 
> How's something like this instead? Under the assumption that it is rare
> to actually hit the INT3 and even more rare to actually hit this race,
> the below should be a lot cheaper.

Ah, this reminds me of my atomic-refcounter idea for kpatch and
module unloading.

This looks good, but I feel it is a bit complicated.

If we use an atomic (and spin-wait) here, can we use
atomic_inc_not_zero() at the start of poke_int3_handler() to make
sure bp_patching is under operation or not?
I think it makes things simpler, like below.

---------
atomic_t bp_refcnt;

poke_int3_handler()
{
	smp_rmb();
	if (!READ_ONCE(bp_patching.nr_entries))
		return 0;
	if (!atomic_inc_not_zero(&bp_refcnt))
		return 0;
	smp_mb__after_atomic();
	[use bp_patching]
	atomic_dec(&bp_refcnt);
}

text_poke_bp_batch()
{
	bp_patching.vec = tp;
	bp_patching.nr_entries = nr_entries;
	smp_wmb();
	atomic_inc(&bp_refcnt);
	...
	atomic_dec(&bp_refcnt);
	/* wait for all running poke_int3_handler(). */
	atomic_cond_read_acquire(&bp_refcnt, !VAL);
	bp_patching.vec = NULL;
	bp_patching.nr_entries = 0;
}
---------

Thank you,


> 
> ---
>  arch/x86/kernel/alternative.c | 69 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 53 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 30e86730655c..12f2d193109d 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -953,6 +953,8 @@ static struct bp_patching_desc {
>  	int nr_entries;
>  } bp_patching;
>  
> +static atomic_t bp_handlers;
> +
>  static inline void *text_poke_addr(struct text_poke_loc *tp)
>  {
>  	return _stext + tp->rel_addr;
> @@ -973,8 +975,8 @@ NOKPROBE_SYMBOL(patch_cmp);
>  int notrace poke_int3_handler(struct pt_regs *regs)
>  {
>  	struct text_poke_loc *tp;
> +	int nr, len, ret = 0;
>  	void *ip;
> -	int len;
>  
>  	/*
>  	 * Having observed our INT3 instruction, we now must observe
> @@ -987,12 +989,21 @@ int notrace poke_int3_handler(struct pt_regs *regs)
>  	 * Idem for other elements in bp_patching.
>  	 */
>  	smp_rmb();
> -
> -	if (likely(!bp_patching.nr_entries))
> +	if (!READ_ONCE(bp_patching.nr_entries))
>  		return 0;
>  
> +	atomic_inc(&bp_handlers);
> +	/*
> +	 * 'ACQUIRE', everything happens after the increment.
> +	 */
> +	smp_mb__after_atomic();
> +
> +	nr = smp_load_acquire(&bp_patching.nr_entries);
> +	if (likely(!nr))
> +		goto out;
> +
>  	if (user_mode(regs))
> -		return 0;
> +		goto out;
>  
>  	/*
>  	 * Discount the INT3. See text_poke_bp_batch().
> @@ -1002,16 +1013,16 @@ int notrace poke_int3_handler(struct pt_regs *regs)
>  	/*
>  	 * Skip the binary search if there is a single member in the vector.
>  	 */
> -	if (unlikely(bp_patching.nr_entries > 1)) {
> -		tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries,
> +	if (unlikely(nr > 1)) {
> +		tp = bsearch(ip, bp_patching.vec, nr,
>  			     sizeof(struct text_poke_loc),
>  			     patch_cmp);
>  		if (!tp)
> -			return 0;
> +			goto out;
>  	} else {
>  		tp = bp_patching.vec;
>  		if (text_poke_addr(tp) != ip)
> -			return 0;
> +			goto out;
>  	}
>  
>  	len = text_opcode_size(tp->opcode);
> @@ -1023,7 +1034,7 @@ int notrace poke_int3_handler(struct pt_regs *regs)
>  		 * Someone poked an explicit INT3, they'll want to handle it,
>  		 * do not consume.
>  		 */
> -		return 0;
> +		goto out;
>  
>  	case CALL_INSN_OPCODE:
>  		int3_emulate_call(regs, (long)ip + tp->rel32);
> @@ -1038,7 +1049,14 @@ int notrace poke_int3_handler(struct pt_regs *regs)
>  		BUG();
>  	}
>  
> -	return 1;
> +	ret = 1;
> +out:
> +	/*
> +	 * 'RELEASE", everything happens before the decrement.
> +	 */
> +	smp_mb__before_atomic();
> +	atomic_dec(&bp_handlers);
> +	return ret;
>  }
>  NOKPROBE_SYMBOL(poke_int3_handler);
>  
> @@ -1076,7 +1094,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>  	lockdep_assert_held(&text_mutex);
>  
>  	bp_patching.vec = tp;
> -	bp_patching.nr_entries = nr_entries;
> +	/*
> +	 * bp_patching.vec = tp			nr = bp_patching.nr_entries
> +	 * REL					ACQ
> +	 * bp_patching.nr_entries = nr_entries	tp = bp_patching.vec[]
> +	 */
> +	smp_store_release(&bp_patching.nr_entries, nr_entries);
>  
>  	/*
>  	 * Corresponding read barrier in int3 notifier for making sure the
> @@ -1134,13 +1157,27 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>  	 * sync_core() implies an smp_mb() and orders this store against
>  	 * the writing of the new instruction.
>  	 */
> -	bp_patching.nr_entries = 0;
> +	WRITE_ONCE(bp_patching.nr_entries, 0);
>  	/*
> -	 * This sync_core () call ensures that all INT3 handlers in progress
> -	 * have finished. This allows poke_int3_handler() after this to
> -	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> +	 * nr_entries = 0	bp_handlers++
> +	 * MB			MB
> +	 * VAL = bp_handlers	nr = nr_entries
> +	 */
> +	smp_mb();
> +	/*
> +	 * Guarantee all poke_int3_handler()s that have observed
> +	 * @bp_patching.nr_enties have completed before we clear
> +	 * bp_patching.vec.
> +	 *
> +	 * We can't do this before text_poke_sync() because then there
> +	 * might still be observable INT3 instructions.
> +	 */
> +	atomic_cond_read_acquire(&bp_handlers, !VAL);
> +	/*
> +	 * bp_handlers == 0		tp = bp_patching.vec[]
> +	 * ACQ				MB
> +	 * bp_patching.vec = NULL	bp_handlers--;
>  	 */
> -	text_poke_sync();
>  	bp_patching.vec = NULL;
>  }
>  


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-10 16:44     ` Masami Hiramatsu
@ 2019-12-10 17:32       ` Peter Zijlstra
  2019-12-11  0:09         ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2019-12-10 17:32 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

On Wed, Dec 11, 2019 at 01:44:01AM +0900, Masami Hiramatsu wrote:

> This looks good, but I feel it is a bit complicated.
> 
> If we use an atomic counter (and spin-wait) here, can we instead use
> atomic_inc_not_zero() at the start of poke_int3_handler() to check
> whether bp_patching is currently under operation?
> I think that makes things simpler, like below.
> 
> ---------
> atomic_t bp_refcnt;
> 
> poke_int3_handler()
> {
> 	smp_rmb();
> 	if (!READ_ONCE(bp_patching.nr_entries))
> 		return 0;
> 	if (!atomic_inc_not_zero(&bp_refcnt))
> 		return 0;
> 	smp_mb__after_atomic();
> 	[use bp_patching]
> 	atomic_dec(&bp_refcnt);
> }
> 
> text_poke_bp_batch()
> {
> 	bp_patching.vec = tp;
> 	bp_patching.nr_entries = nr_entries;
> 	smp_wmb();
> 	atomic_inc(&bp_refcnt);
> 	...
> 	atomic_dec(&bp_refcnt);
> 	/* wait for all running poke_int3_handler(). */
> 	atomic_cond_read_acquire(&bp_refcnt, !VAL);
> 	bp_patching.vec = NULL;
> 	bp_patching.nr_entries = 0;
> }

I feel that is actually more complicated...  Let me try to see if I can
simplify things.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-10 17:32       ` Peter Zijlstra
@ 2019-12-11  0:09         ` Peter Zijlstra
  2019-12-11  8:09           ` Masami Hiramatsu
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2019-12-11  0:09 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

On Tue, Dec 10, 2019 at 06:32:09PM +0100, Peter Zijlstra wrote:

> I feel that is actually more complicated...  Let me try to see if I can
> simplify things.

How is this then?

---
 arch/x86/kernel/alternative.c | 84 +++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 30e86730655c..34360ca301a2 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -948,10 +948,29 @@ struct text_poke_loc {
 	const u8 text[POKE_MAX_OPCODE_SIZE];
 };
 
-static struct bp_patching_desc {
+struct bp_patching_desc {
 	struct text_poke_loc *vec;
 	int nr_entries;
-} bp_patching;
+	atomic_t refs;
+};
+
+static struct bp_patching_desc *bp_desc;
+
+static inline struct bp_patching_desc *try_get_desc(struct bp_patching_desc **descp)
+{
+	struct bp_patching_desc *desc = READ_ONCE(*descp); /* rcu_dereference */
+
+	if (!desc || !atomic_inc_not_zero(&desc->refs))
+		return NULL;
+
+	return desc;
+}
+
+static inline void put_desc(struct bp_patching_desc *desc)
+{
+	smp_mb__before_atomic();
+	atomic_dec(&desc->refs);
+}
 
 static inline void *text_poke_addr(struct text_poke_loc *tp)
 {
@@ -972,26 +991,26 @@ NOKPROBE_SYMBOL(patch_cmp);
 
 int notrace poke_int3_handler(struct pt_regs *regs)
 {
+	struct bp_patching_desc *desc;
 	struct text_poke_loc *tp;
+	int len, ret = 0;
 	void *ip;
-	int len;
+
+	if (user_mode(regs))
+		return 0;
 
 	/*
 	 * Having observed our INT3 instruction, we now must observe
-	 * bp_patching.nr_entries.
+	 * bp_desc:
 	 *
-	 *	nr_entries != 0			INT3
+	 *	bp_desc = desc			INT3
 	 *	WMB				RMB
-	 *	write INT3			if (nr_entries)
-	 *
-	 * Idem for other elements in bp_patching.
+	 *	write INT3			if (desc)
 	 */
 	smp_rmb();
 
-	if (likely(!bp_patching.nr_entries))
-		return 0;
-
-	if (user_mode(regs))
+	desc = try_get_desc(&bp_desc);
+	if (!desc)
 		return 0;
 
 	/*
@@ -1002,16 +1021,16 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 	/*
 	 * Skip the binary search if there is a single member in the vector.
 	 */
-	if (unlikely(bp_patching.nr_entries > 1)) {
-		tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries,
+	if (unlikely(desc->nr_entries > 1)) {
+		tp = bsearch(ip, desc->vec, desc->nr_entries,
 			     sizeof(struct text_poke_loc),
 			     patch_cmp);
 		if (!tp)
-			return 0;
+			goto out_put;
 	} else {
-		tp = bp_patching.vec;
+		tp = desc->vec;
 		if (text_poke_addr(tp) != ip)
-			return 0;
+			goto out_put;
 	}
 
 	len = text_opcode_size(tp->opcode);
@@ -1023,7 +1042,7 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 		 * Someone poked an explicit INT3, they'll want to handle it,
 		 * do not consume.
 		 */
-		return 0;
+		goto out_put;
 
 	case CALL_INSN_OPCODE:
 		int3_emulate_call(regs, (long)ip + tp->rel32);
@@ -1038,7 +1057,11 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 		BUG();
 	}
 
-	return 1;
+	ret = 1;
+
+out_put:
+	put_desc(desc);
+	return ret;
 }
 NOKPROBE_SYMBOL(poke_int3_handler);
 
@@ -1069,14 +1092,18 @@ static int tp_vec_nr;
  */
 static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
+	struct bp_patching_desc desc = {
+		.vec = tp,
+		.nr_entries = nr_entries,
+		.refs = ATOMIC_INIT(1),
+	};
 	unsigned char int3 = INT3_INSN_OPCODE;
 	unsigned int i;
 	int do_sync;
 
 	lockdep_assert_held(&text_mutex);
 
-	bp_patching.vec = tp;
-	bp_patching.nr_entries = nr_entries;
+	smp_store_release(&bp_desc, &desc); /* rcu_assign_pointer */
 
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
@@ -1131,17 +1158,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		text_poke_sync();
 
 	/*
-	 * sync_core() implies an smp_mb() and orders this store against
-	 * the writing of the new instruction.
+	 * Remove and synchronize_rcu(), except we have a very primitive
+	 * refcount based completion.
 	 */
-	bp_patching.nr_entries = 0;
-	/*
-	 * This sync_core () call ensures that all INT3 handlers in progress
-	 * have finished. This allows poke_int3_handler() after this to
-	 * avoid touching bp_paching.vec by checking nr_entries == 0.
-	 */
-	text_poke_sync();
-	bp_patching.vec = NULL;
+	WRITE_ONCE(bp_desc, NULL); /* RCU_INIT_POINTER */
+	if (!atomic_dec_and_test(&desc.refs))
+		atomic_cond_read_acquire(&desc.refs, !VAL);
 }
 
 void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
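
To make the moving parts easier to follow, here is a minimal user-space
C11 model of the refcounted on-stack descriptor lifecycle implemented by
try_get_desc(), put_desc() and text_poke_bp_batch() above. Names and
details are illustrative assumptions, not the kernel code.

---------
#include <stdatomic.h>
#include <stddef.h>

struct desc {
	atomic_int refs;
	/* vec and nr_entries would live here */
};

static _Atomic(struct desc *) cur_desc;

/* Handler side: pin the descriptor, or learn that patching is over. */
static struct desc *try_get(void)
{
	struct desc *d = atomic_load_explicit(&cur_desc, memory_order_acquire);
	int old;

	if (!d)
		return NULL;

	old = atomic_load_explicit(&d->refs, memory_order_relaxed);
	do {
		if (old == 0)
			return NULL;	/* teardown already started */
	} while (!atomic_compare_exchange_weak_explicit(&d->refs, &old, old + 1,
							memory_order_acquire,
							memory_order_relaxed));
	return d;
}

static void put_ref(struct desc *d)
{
	atomic_fetch_sub_explicit(&d->refs, 1, memory_order_release);
}

/* Patcher side: publish, patch, unpublish, then wait out late handlers. */
static void patch_batch(void)
{
	struct desc d = { .refs = 1 };	/* the patcher's own reference */

	atomic_store_explicit(&cur_desc, &d, memory_order_release);
	/* ... write INT3, patch the instructions, sync the other CPUs ... */
	atomic_store_explicit(&cur_desc, NULL, memory_order_relaxed);

	/* Drop our reference; wait for in-flight handlers to drop theirs. */
	if (atomic_fetch_sub_explicit(&d.refs, 1, memory_order_acq_rel) != 1)
		while (atomic_load_explicit(&d.refs, memory_order_acquire) != 0)
			;	/* cpu_relax() in the kernel */
}
---------

The NULL store plus the wait-for-zero is what the "very primitive
refcount based completion" comment above refers to: it plays the role
synchronize_rcu() would play if the descriptor were a real RCU-protected
object, which is also why the loads and stores of cur_desc mirror
rcu_dereference() and rcu_assign_pointer().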

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-11  0:09         ` Peter Zijlstra
@ 2019-12-11  8:09           ` Masami Hiramatsu
  2019-12-11  9:12             ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 19+ messages in thread
From: Masami Hiramatsu @ 2019-12-11  8:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, bristot, jbaron,
	torvalds, tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

Hi Peter,

On Wed, 11 Dec 2019 01:09:43 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Dec 10, 2019 at 06:32:09PM +0100, Peter Zijlstra wrote:
> 
> > I feel that is actually more complicated...  Let me try to see if I can
> > simplify things.
> 
> How is this then?

This looks perfectly good to me :)

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you!

> 
> ---
>  arch/x86/kernel/alternative.c | 84 +++++++++++++++++++++++++++----------------
>  1 file changed, 53 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 30e86730655c..34360ca301a2 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -948,10 +948,29 @@ struct text_poke_loc {
>  	const u8 text[POKE_MAX_OPCODE_SIZE];
>  };
>  
> -static struct bp_patching_desc {
> +struct bp_patching_desc {
>  	struct text_poke_loc *vec;
>  	int nr_entries;
> -} bp_patching;
> +	atomic_t refs;
> +};
> +
> +static struct bp_patching_desc *bp_desc;
> +
> +static inline struct bp_patching_desc *try_get_desc(struct bp_patching_desc **descp)
> +{
> +	struct bp_patching_desc *desc = READ_ONCE(*descp); /* rcu_dereference */
> +
> +	if (!desc || !atomic_inc_not_zero(&desc->refs))
> +		return NULL;
> +
> +	return desc;
> +}
> +
> +static inline void put_desc(struct bp_patching_desc *desc)
> +{
> +	smp_mb__before_atomic();
> +	atomic_dec(&desc->refs);
> +}
>  
>  static inline void *text_poke_addr(struct text_poke_loc *tp)
>  {
> @@ -972,26 +991,26 @@ NOKPROBE_SYMBOL(patch_cmp);
>  
>  int notrace poke_int3_handler(struct pt_regs *regs)
>  {
> +	struct bp_patching_desc *desc;
>  	struct text_poke_loc *tp;
> +	int len, ret = 0;
>  	void *ip;
> -	int len;
> +
> +	if (user_mode(regs))
> +		return 0;
>  
>  	/*
>  	 * Having observed our INT3 instruction, we now must observe
> -	 * bp_patching.nr_entries.
> +	 * bp_desc:
>  	 *
> -	 *	nr_entries != 0			INT3
> +	 *	bp_desc = desc			INT3
>  	 *	WMB				RMB
> -	 *	write INT3			if (nr_entries)
> -	 *
> -	 * Idem for other elements in bp_patching.
> +	 *	write INT3			if (desc)
>  	 */
>  	smp_rmb();
>  
> -	if (likely(!bp_patching.nr_entries))
> -		return 0;
> -
> -	if (user_mode(regs))
> +	desc = try_get_desc(&bp_desc);
> +	if (!desc)
>  		return 0;
>  
>  	/*
> @@ -1002,16 +1021,16 @@ int notrace poke_int3_handler(struct pt_regs *regs)
>  	/*
>  	 * Skip the binary search if there is a single member in the vector.
>  	 */
> -	if (unlikely(bp_patching.nr_entries > 1)) {
> -		tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries,
> +	if (unlikely(desc->nr_entries > 1)) {
> +		tp = bsearch(ip, desc->vec, desc->nr_entries,
>  			     sizeof(struct text_poke_loc),
>  			     patch_cmp);
>  		if (!tp)
> -			return 0;
> +			goto out_put;
>  	} else {
> -		tp = bp_patching.vec;
> +		tp = desc->vec;
>  		if (text_poke_addr(tp) != ip)
> -			return 0;
> +			goto out_put;
>  	}
>  
>  	len = text_opcode_size(tp->opcode);
> @@ -1023,7 +1042,7 @@ int notrace poke_int3_handler(struct pt_regs *regs)
>  		 * Someone poked an explicit INT3, they'll want to handle it,
>  		 * do not consume.
>  		 */
> -		return 0;
> +		goto out_put;
>  
>  	case CALL_INSN_OPCODE:
>  		int3_emulate_call(regs, (long)ip + tp->rel32);
> @@ -1038,7 +1057,11 @@ int notrace poke_int3_handler(struct pt_regs *regs)
>  		BUG();
>  	}
>  
> -	return 1;
> +	ret = 1;
> +
> +out_put:
> +	put_desc(desc);
> +	return ret;
>  }
>  NOKPROBE_SYMBOL(poke_int3_handler);
>  
> @@ -1069,14 +1092,18 @@ static int tp_vec_nr;
>   */
>  static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
>  {
> +	struct bp_patching_desc desc = {
> +		.vec = tp,
> +		.nr_entries = nr_entries,
> +		.refs = ATOMIC_INIT(1),
> +	};
>  	unsigned char int3 = INT3_INSN_OPCODE;
>  	unsigned int i;
>  	int do_sync;
>  
>  	lockdep_assert_held(&text_mutex);
>  
> -	bp_patching.vec = tp;
> -	bp_patching.nr_entries = nr_entries;
> +	smp_store_release(&bp_desc, &desc); /* rcu_assign_pointer */
>  
>  	/*
>  	 * Corresponding read barrier in int3 notifier for making sure the
> @@ -1131,17 +1158,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>  		text_poke_sync();
>  
>  	/*
> -	 * sync_core() implies an smp_mb() and orders this store against
> -	 * the writing of the new instruction.
> +	 * Remove and synchronize_rcu(), except we have a very primitive
> +	 * refcount based completion.
>  	 */
> -	bp_patching.nr_entries = 0;
> -	/*
> -	 * This sync_core () call ensures that all INT3 handlers in progress
> -	 * have finished. This allows poke_int3_handler() after this to
> -	 * avoid touching bp_paching.vec by checking nr_entries == 0.
> -	 */
> -	text_poke_sync();
> -	bp_patching.vec = NULL;
> +	WRITE_ONCE(bp_desc, NULL); /* RCU_INIT_POINTER */
> +	if (!atomic_dec_and_test(&desc.refs))
> +		atomic_cond_read_acquire(&desc.refs, !VAL);
>  }
>  
>  void text_poke_loc_init(struct text_poke_loc *tp, void *addr,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception
  2019-12-11  8:09           ` Masami Hiramatsu
@ 2019-12-11  9:12             ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-12-11  9:12 UTC (permalink / raw)
  To: Masami Hiramatsu, Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, x86, linux-kernel, jbaron, torvalds,
	tglx, namit, hpa, luto, ard.biesheuvel, jpoimboe, jeyu,
	alexei.starovoitov

On 11/12/2019 09:09, Masami Hiramatsu wrote:
> Hi Peter,
> 
> On Wed, 11 Dec 2019 01:09:43 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Tue, Dec 10, 2019 at 06:32:09PM +0100, Peter Zijlstra wrote:
>>
>>> I feel that is actually more complicated...  Let me try to see if I can
>>> simplify things.
>> How is this then?
> This looks perfectly good to me :)
> 
> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>

too!

Thanks!
-- Daniel


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-12-11  9:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-27  5:56 [PATCH -tip 0/2] x86/kprobes: Fix 2 issues related to text_poke_bp and optprobe Masami Hiramatsu
2019-11-27  5:56 ` [PATCH -tip 1/2] x86/alternative: Sync bp_patching update for avoiding NULL pointer exception Masami Hiramatsu
2019-12-02  9:15   ` Peter Zijlstra
2019-12-02 11:50     ` Masami Hiramatsu
2019-12-02 13:43       ` Peter Zijlstra
2019-12-02 14:39         ` Masami Hiramatsu
2019-12-04  8:33   ` [tip: core/kprobes] x86/alternatives: " tip-bot2 for Masami Hiramatsu
2019-12-09 14:39   ` [PATCH -tip 1/2] x86/alternative: " Peter Zijlstra
2019-12-10 16:44     ` Masami Hiramatsu
2019-12-10 17:32       ` Peter Zijlstra
2019-12-11  0:09         ` Peter Zijlstra
2019-12-11  8:09           ` Masami Hiramatsu
2019-12-11  9:12             ` Daniel Bristot de Oliveira
2019-11-27  5:57 ` [PATCH -tip 2/2] kprobes: Set unoptimized flag after unoptimizing code Masami Hiramatsu
2019-11-27  6:19   ` Alexei Starovoitov
2019-11-27  6:49     ` Ingo Molnar
2019-12-02 21:55       ` Alexei Starovoitov
2019-11-27  6:56     ` Masami Hiramatsu
2019-12-04  8:33   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
