linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* KRETPROBES are broken since kernel 5.8
@ 2020-12-09 21:50 Adam Zabrocki
  2020-12-10  1:25 ` Masami Hiramatsu
  0 siblings, 1 reply; 6+ messages in thread
From: Adam Zabrocki @ 2020-12-09 21:50 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, Naveen N. Rao,
	Anil S Keshavamurthy, David S. Miller, Masami Hiramatsu,
	Solar Designer

Hello,

Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
when #DB exception was raised, entry to the NMI was not fully performed. Among
others, the following logic was executed:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589

    if (!user_mode(regs)) {
        rcu_nmi_enter();
        preempt_disable();
    }

In some older kernels function ist_enter() was called instead. Inside this
function we can see the following logic:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91

    if (user_mode(regs)) {
        RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
    } else {
        /*
         * We might have interrupted pretty much anything.  In
         * fact, if we're a machine check, we can even interrupt
         * NMI processing.  We don't want in_nmi() to return true,
         * but we need to notify RCU.
         */
        rcu_nmi_enter();
    }

    preempt_disable();

As the comment says "We don't want in_nmi() to return true, but we need to
notify RCU.". However, since kernel 5.8 the logic of how interrupts are handled
was modified and currently we have this (function "exc_int3"):
https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630

    /*
     * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
     * can trigger INT3, hence poke_int3_handler() must be done
     * before. If the entry came from kernel mode, then use nmi_enter()
     * because the INT3 could have been hit in any context including
     * NMI.
     */
    if (user_mode(regs)) {
        idtentry_enter_user(regs);
        instrumentation_begin();
        do_int3_user(regs);
        instrumentation_end();
        idtentry_exit_user(regs);
    } else {
        nmi_enter();
        instrumentation_begin();
        trace_hardirqs_off_finish();
        if (!do_int3(regs))
            die("int3", regs, 0);
        if (regs->flags & X86_EFLAGS_IF)
            trace_hardirqs_on_prepare();
        instrumentation_end();
        nmi_exit();
    }

The root of unlucky change comes from this commit:

https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5

which later was modified by this commit:

https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2

and this is what we currently have in all kernels since 5.8. Essentially,
KRETPROBES are not working since these commits. We have the following logic:

asm_exc_int3() -> exc_int3():
                    |
    ----------------|
    |
    v
...
nmi_enter();
...
if (!do_int3(regs))
       |
  -----|
  |
  v
do_int3() -> kprobe_int3_handler():
                    |
    ----------------|
    |
    v
...
if (!p->pre_handler || !p->pre_handler(p, regs))
                             |
    -------------------------|
    |
    v
...
pre_handler_kretprobe():
...
    if (unlikely(in_nmi())) {
        rp->nmissed++;
        return 0;
    }

Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before
invoking any registered kprobe verifies if it is not in NMI via in_nmi() call.

This bug was masked by OPTIMIZER which was turning most of the KPROBES to be
FTRACE so essentially if someone registered KRETPROBE, buggy code was not
invoked (FTRACE code was executed instead). However, the optimizer was changed
and can't optimize as many functions anymore (probably another bug which might
be worth to investigate) and this bug is more visible.

Thanks,
Adam

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: KRETPROBES are broken since kernel 5.8
  2020-12-09 21:50 KRETPROBES are broken since kernel 5.8 Adam Zabrocki
@ 2020-12-10  1:25 ` Masami Hiramatsu
  2020-12-10  7:17   ` Adam Zabrocki
  0 siblings, 1 reply; 6+ messages in thread
From: Masami Hiramatsu @ 2020-12-10  1:25 UTC (permalink / raw)
  To: Adam Zabrocki
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, Naveen N. Rao,
	Anil S Keshavamurthy, David S. Miller, Solar Designer

Hi Adam,

Thank you for reporting and debugging, actually we had investigated this
issue in Aug. Please read this thread.

https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad2825d3@trendmicro.com

We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI
context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about
kprobe nesting"). Yeah, it will be in the v5.10.

Could you try to test whether these commits can solve the issue?
If it is OK, we should backport those to stable tree.

Thank you,

On Wed, 9 Dec 2020 22:50:01 +0100
Adam Zabrocki <pi3@pi3.com.pl> wrote:

> Hello,
> 
> Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
> when #DB exception was raised, entry to the NMI was not fully performed. Among
> others, the following logic was executed:
> https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589
> 
>     if (!user_mode(regs)) {
>         rcu_nmi_enter();
>         preempt_disable();
>     }
> 
> In some older kernels function ist_enter() was called instead. Inside this
> function we can see the following logic:
> https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91
> 
>     if (user_mode(regs)) {
>         RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
>     } else {
>         /*
>          * We might have interrupted pretty much anything.  In
>          * fact, if we're a machine check, we can even interrupt
>          * NMI processing.  We don't want in_nmi() to return true,
>          * but we need to notify RCU.
>          */
>         rcu_nmi_enter();
>     }
> 
>     preempt_disable();
> 
> As the comment says "We don't want in_nmi() to return true, but we need to
> notify RCU.". However, since kernel 5.8 the logic of how interrupts are handled
> was modified and currently we have this (function "exc_int3"):
> https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630
> 
>     /*
>      * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
>      * can trigger INT3, hence poke_int3_handler() must be done
>      * before. If the entry came from kernel mode, then use nmi_enter()
>      * because the INT3 could have been hit in any context including
>      * NMI.
>      */
>     if (user_mode(regs)) {
>         idtentry_enter_user(regs);
>         instrumentation_begin();
>         do_int3_user(regs);
>         instrumentation_end();
>         idtentry_exit_user(regs);
>     } else {
>         nmi_enter();
>         instrumentation_begin();
>         trace_hardirqs_off_finish();
>         if (!do_int3(regs))
>             die("int3", regs, 0);
>         if (regs->flags & X86_EFLAGS_IF)
>             trace_hardirqs_on_prepare();
>         instrumentation_end();
>         nmi_exit();
>     }
> 
> The root of unlucky change comes from this commit:
> 
> https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5
> 
> which later was modified by this commit:
> 
> https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2
> 
> and this is what we currently have in all kernels since 5.8. Essentially,
> KRETPROBES are not working since these commits. We have the following logic:
> 
> asm_exc_int3() -> exc_int3():
>                     |
>     ----------------|
>     |
>     v
> ...
> nmi_enter();
> ...
> if (!do_int3(regs))
>        |
>   -----|
>   |
>   v
> do_int3() -> kprobe_int3_handler():
>                     |
>     ----------------|
>     |
>     v
> ...
> if (!p->pre_handler || !p->pre_handler(p, regs))
>                              |
>     -------------------------|
>     |
>     v
> ...
> pre_handler_kretprobe():
> ...
>     if (unlikely(in_nmi())) {
>         rp->nmissed++;
>         return 0;
>     }
> 
> Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before
> invoking any registered kprobe verifies if it is not in NMI via in_nmi() call.
> 
> This bug was masked by OPTIMIZER which was turning most of the KPROBES to be
> FTRACE so essentially if someone registered KRETPROBE, buggy code was not
> invoked (FTRACE code was executed instead). However, the optimizer was changed
> and can't optimize as many functions anymore (probably another bug which might
> be worth to investigate) and this bug is more visible.
> 
> Thanks,
> Adam
> 
> -- 
> pi3 (pi3ki31ny) - pi3 (at) itsec pl
> http://pi3.com.pl
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: KRETPROBES are broken since kernel 5.8
  2020-12-10  1:25 ` Masami Hiramatsu
@ 2020-12-10  7:17   ` Adam Zabrocki
  2020-12-10 13:09     ` Masami Hiramatsu
  0 siblings, 1 reply; 6+ messages in thread
From: Adam Zabrocki @ 2020-12-10  7:17 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, Naveen N. Rao,
	Anil S Keshavamurthy, David S. Miller, Solar Designer

Hi,

On Thu, Dec 10, 2020 at 10:25:07AM +0900, Masami Hiramatsu wrote:
> Hi Adam,
> 
> Thank you for reporting and debugging, actually we had investigated this
> issue in Aug. Please read this thread.
> 
> https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad2825d3@trendmicro.com
> 

Thanks for the link. I've read the entire history of proposed fix - very 
informative :)

> We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI
> context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about
> kprobe nesting"). Yeah, it will be in the v5.10.
> 
> Could you try to test whether these commits can solve the issue?

I've applied these commits on the top of kernel 5.9.7 and verified that it 
works well. Non-optimized KRETPROBES are back to life.

However, there might be another issue which I wanted to brought / discuss - 
problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. 
'ftrace_enable_sysctl' function was correctly optimized without any problems. 
Since 5.9 it can't be optimized anynmore. I didn't see any changes in the 
sources regarding the optimizer, neither function itself.
When I looked at the generated vmlinux binary, I can see that GCC generated 
padding at the end of this function using INT3 opcode:

...
ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194 <ftrace_enable_sysctl+0x114>
ffffffff81305296:       cc                      int3
ffffffff81305297:       cc                      int3
ffffffff81305298:       cc                      int3
ffffffff81305299:       cc                      int3
ffffffff8130529a:       cc                      int3
ffffffff8130529b:       cc                      int3
ffffffff8130529c:       cc                      int3
ffffffff8130529d:       cc                      int3
ffffffff8130529e:       cc                      int3
ffffffff8130529f:       cc                      int3

Such padding didn't exist in this function in generated images for older 
kernels. Nevertheless, such padding is not unusual and it's pretty common.

Optimizer logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():

    /* Decode instructions */
    addr = paddr - offset;
    while (addr < paddr - offset + size) { /* Decode until function end */
        unsigned long recovered_insn;
        if (search_exception_tables(addr))
            /*
             * Since some fixup code will jumps into this function,
             * we can't optimize kprobe in this function.
             */
            return 0;
        recovered_insn = recover_probed_instruction(buf, addr);
        if (!recovered_insn)
            return 0;
        kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
        insn_get_length(&insn);
        /* Another subsystem puts a breakpoint */
        if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
            return 0;
        /* Recover address */
        insn.kaddr = (void *)addr;
        insn.next_byte = (void *)(addr + insn.length);
        /* Check any instructions don't jump into target */
        if (insn_is_indirect_jump(&insn) ||
            insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
                     DISP32_SIZE))
            return 0;
        addr += insn.length;
    }

One of the check tries to protect from the situation when another subsystem 
puts a breakpoint there as well:

        /* Another subsystem puts a breakpoint */
        if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
            return 0;

However, that's not the case here. INT3_INSN_OPCODE is placed at the end of 
the function as padding (and protect from NOP-padding problems).

I wonder, if optimizer should take this special case into account? Is it worth 
to still optimize such functions when they are padded with INT3?

> If it is OK, we should backport those to stable tree.

Agreed.

Thanks,
Adam

> Thank you,
> 
> On Wed, 9 Dec 2020 22:50:01 +0100
> Adam Zabrocki <pi3@pi3.com.pl> wrote:
> 
> > Hello,
> > 
> > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
> > when #DB exception was raised, entry to the NMI was not fully performed. Among
> > others, the following logic was executed:
> > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589
> > 
> >     if (!user_mode(regs)) {
> >         rcu_nmi_enter();
> >         preempt_disable();
> >     }
> > 
> > In some older kernels function ist_enter() was called instead. Inside this
> > function we can see the following logic:
> > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91
> > 
> >     if (user_mode(regs)) {
> >         RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> >     } else {
> >         /*
> >          * We might have interrupted pretty much anything.  In
> >          * fact, if we're a machine check, we can even interrupt
> >          * NMI processing.  We don't want in_nmi() to return true,
> >          * but we need to notify RCU.
> >          */
> >         rcu_nmi_enter();
> >     }
> > 
> >     preempt_disable();
> > 
> > As the comment says "We don't want in_nmi() to return true, but we need to
> > notify RCU.". However, since kernel 5.8 the logic of how interrupts are handled
> > was modified and currently we have this (function "exc_int3"):
> > https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630
> > 
> >     /*
> >      * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
> >      * can trigger INT3, hence poke_int3_handler() must be done
> >      * before. If the entry came from kernel mode, then use nmi_enter()
> >      * because the INT3 could have been hit in any context including
> >      * NMI.
> >      */
> >     if (user_mode(regs)) {
> >         idtentry_enter_user(regs);
> >         instrumentation_begin();
> >         do_int3_user(regs);
> >         instrumentation_end();
> >         idtentry_exit_user(regs);
> >     } else {
> >         nmi_enter();
> >         instrumentation_begin();
> >         trace_hardirqs_off_finish();
> >         if (!do_int3(regs))
> >             die("int3", regs, 0);
> >         if (regs->flags & X86_EFLAGS_IF)
> >             trace_hardirqs_on_prepare();
> >         instrumentation_end();
> >         nmi_exit();
> >     }
> > 
> > The root of unlucky change comes from this commit:
> > 
> > https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5
> > 
> > which later was modified by this commit:
> > 
> > https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2
> > 
> > and this is what we currently have in all kernels since 5.8. Essentially,
> > KRETPROBES are not working since these commits. We have the following logic:
> > 
> > asm_exc_int3() -> exc_int3():
> >                     |
> >     ----------------|
> >     |
> >     v
> > ...
> > nmi_enter();
> > ...
> > if (!do_int3(regs))
> >        |
> >   -----|
> >   |
> >   v
> > do_int3() -> kprobe_int3_handler():
> >                     |
> >     ----------------|
> >     |
> >     v
> > ...
> > if (!p->pre_handler || !p->pre_handler(p, regs))
> >                              |
> >     -------------------------|
> >     |
> >     v
> > ...
> > pre_handler_kretprobe():
> > ...
> >     if (unlikely(in_nmi())) {
> >         rp->nmissed++;
> >         return 0;
> >     }
> > 
> > Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before
> > invoking any registered kprobe verifies if it is not in NMI via in_nmi() call.
> > 
> > This bug was masked by OPTIMIZER which was turning most of the KPROBES to be
> > FTRACE so essentially if someone registered KRETPROBE, buggy code was not
> > invoked (FTRACE code was executed instead). However, the optimizer was changed
> > and can't optimize as many functions anymore (probably another bug which might
> > be worth to investigate) and this bug is more visible.
> > 
> > Thanks,
> > Adam
> > 
> > -- 
> > pi3 (pi3ki31ny) - pi3 (at) itsec pl
> > http://pi3.com.pl
> > 
> 
> 
> -- 
> Masami Hiramatsu <mhiramat@kernel.org>

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: KRETPROBES are broken since kernel 5.8
  2020-12-10  7:17   ` Adam Zabrocki
@ 2020-12-10 13:09     ` Masami Hiramatsu
  2020-12-10 17:14       ` Adam Zabrocki
  0 siblings, 1 reply; 6+ messages in thread
From: Masami Hiramatsu @ 2020-12-10 13:09 UTC (permalink / raw)
  To: Adam Zabrocki
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, Naveen N. Rao,
	Anil S Keshavamurthy, David S. Miller, Solar Designer

Hi Adam,

On Thu, 10 Dec 2020 08:17:42 +0100
Adam Zabrocki <pi3@pi3.com.pl> wrote:

> Hi,
> 
> On Thu, Dec 10, 2020 at 10:25:07AM +0900, Masami Hiramatsu wrote:
> > Hi Adam,
> > 
> > Thank you for reporting and debugging, actually we had investigated this
> > issue in Aug. Please read this thread.
> > 
> > https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad2825d3@trendmicro.com
> > 
> 
> Thanks for the link. I've read the entire history of proposed fix - very 
> informative :)
> 
> > We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI
> > context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about
> > kprobe nesting"). Yeah, it will be in the v5.10.
> > 
> > Could you try to test whether these commits can solve the issue?
> 
> I've applied these commits on the top of kernel 5.9.7 and verified that it 
> works well. Non-optimized KRETPROBES are back to life.

Good!

> 
> However, there might be another issue which I wanted to brought / discuss - 
> problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. 
> 'ftrace_enable_sysctl' function was correctly optimized without any problems. 

Did you check it on other functions? Did you see it only on the "ftrace_enable_sysctl"?

> Since 5.9 it can't be optimized anynmore. I didn't see any changes in the 
> sources regarding the optimizer, neither function itself.
> When I looked at the generated vmlinux binary, I can see that GCC generated 
> padding at the end of this function using INT3 opcode:
> 
> ...
> ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
> ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194 <ftrace_enable_sysctl+0x114>
> ffffffff81305296:       cc                      int3
> ffffffff81305297:       cc                      int3
> ffffffff81305298:       cc                      int3
> ffffffff81305299:       cc                      int3
> ffffffff8130529a:       cc                      int3
> ffffffff8130529b:       cc                      int3
> ffffffff8130529c:       cc                      int3
> ffffffff8130529d:       cc                      int3
> ffffffff8130529e:       cc                      int3
> ffffffff8130529f:       cc                      int3

So these int3 is generated by GCC for padding, right?

> Such padding didn't exist in this function in generated images for older 
> kernels. Nevertheless, such padding is not unusual and it's pretty common.
> 
> Optimizer logic fails here:
> 
> try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
> -> arch_prepare_optimized_kprobe() -> can_optimize():
> 
>     /* Decode instructions */
>     addr = paddr - offset;
>     while (addr < paddr - offset + size) { /* Decode until function end */
>         unsigned long recovered_insn;
>         if (search_exception_tables(addr))
>             /*
>              * Since some fixup code will jumps into this function,
>              * we can't optimize kprobe in this function.
>              */
>             return 0;
>         recovered_insn = recover_probed_instruction(buf, addr);
>         if (!recovered_insn)
>             return 0;
>         kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
>         insn_get_length(&insn);
>         /* Another subsystem puts a breakpoint */
>         if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
>             return 0;
>         /* Recover address */
>         insn.kaddr = (void *)addr;
>         insn.next_byte = (void *)(addr + insn.length);
>         /* Check any instructions don't jump into target */
>         if (insn_is_indirect_jump(&insn) ||
>             insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
>                      DISP32_SIZE))
>             return 0;
>         addr += insn.length;
>     }
> 
> One of the check tries to protect from the situation when another subsystem 
> puts a breakpoint there as well:
> 
>         /* Another subsystem puts a breakpoint */
>         if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
>             return 0;

Right.

> 
> However, that's not the case here. INT3_INSN_OPCODE is placed at the end of 
> the function as padding (and protect from NOP-padding problems).
> 
> I wonder, if optimizer should take this special case into account? Is it worth 
> to still optimize such functions when they are padded with INT3?

Indeed. I expected int3 is used from other subsystems (e.g. kgdb) and,
in that case the optimization can confuse them.
But if the gcc uses int3 to pad the room between functions, it should be
reconsidered. 

Thank you,

> 
> > If it is OK, we should backport those to stable tree.
> 
> Agreed.
> 
> Thanks,
> Adam
> 
> > Thank you,
> > 
> > On Wed, 9 Dec 2020 22:50:01 +0100
> > Adam Zabrocki <pi3@pi3.com.pl> wrote:
> > 
> > > Hello,
> > > 
> > > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
> > > when #DB exception was raised, entry to the NMI was not fully performed. Among
> > > others, the following logic was executed:
> > > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589
> > > 
> > >     if (!user_mode(regs)) {
> > >         rcu_nmi_enter();
> > >         preempt_disable();
> > >     }
> > > 
> > > In some older kernels function ist_enter() was called instead. Inside this
> > > function we can see the following logic:
> > > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91
> > > 
> > >     if (user_mode(regs)) {
> > >         RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > >     } else {
> > >         /*
> > >          * We might have interrupted pretty much anything.  In
> > >          * fact, if we're a machine check, we can even interrupt
> > >          * NMI processing.  We don't want in_nmi() to return true,
> > >          * but we need to notify RCU.
> > >          */
> > >         rcu_nmi_enter();
> > >     }
> > > 
> > >     preempt_disable();
> > > 
> > > As the comment says "We don't want in_nmi() to return true, but we need to
> > > notify RCU.". However, since kernel 5.8 the logic of how interrupts are handled
> > > was modified and currently we have this (function "exc_int3"):
> > > https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630
> > > 
> > >     /*
> > >      * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
> > >      * can trigger INT3, hence poke_int3_handler() must be done
> > >      * before. If the entry came from kernel mode, then use nmi_enter()
> > >      * because the INT3 could have been hit in any context including
> > >      * NMI.
> > >      */
> > >     if (user_mode(regs)) {
> > >         idtentry_enter_user(regs);
> > >         instrumentation_begin();
> > >         do_int3_user(regs);
> > >         instrumentation_end();
> > >         idtentry_exit_user(regs);
> > >     } else {
> > >         nmi_enter();
> > >         instrumentation_begin();
> > >         trace_hardirqs_off_finish();
> > >         if (!do_int3(regs))
> > >             die("int3", regs, 0);
> > >         if (regs->flags & X86_EFLAGS_IF)
> > >             trace_hardirqs_on_prepare();
> > >         instrumentation_end();
> > >         nmi_exit();
> > >     }
> > > 
> > > The root of unlucky change comes from this commit:
> > > 
> > > https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5
> > > 
> > > which later was modified by this commit:
> > > 
> > > https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2
> > > 
> > > and this is what we currently have in all kernels since 5.8. Essentially,
> > > KRETPROBES are not working since these commits. We have the following logic:
> > > 
> > > asm_exc_int3() -> exc_int3():
> > >                     |
> > >     ----------------|
> > >     |
> > >     v
> > > ...
> > > nmi_enter();
> > > ...
> > > if (!do_int3(regs))
> > >        |
> > >   -----|
> > >   |
> > >   v
> > > do_int3() -> kprobe_int3_handler():
> > >                     |
> > >     ----------------|
> > >     |
> > >     v
> > > ...
> > > if (!p->pre_handler || !p->pre_handler(p, regs))
> > >                              |
> > >     -------------------------|
> > >     |
> > >     v
> > > ...
> > > pre_handler_kretprobe():
> > > ...
> > >     if (unlikely(in_nmi())) {
> > >         rp->nmissed++;
> > >         return 0;
> > >     }
> > > 
> > > Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before
> > > invoking any registered kprobe verifies if it is not in NMI via in_nmi() call.
> > > 
> > > This bug was masked by OPTIMIZER which was turning most of the KPROBES to be
> > > FTRACE so essentially if someone registered KRETPROBE, buggy code was not
> > > invoked (FTRACE code was executed instead). However, the optimizer was changed
> > > and can't optimize as many functions anymore (probably another bug which might
> > > be worth to investigate) and this bug is more visible.
> > > 
> > > Thanks,
> > > Adam
> > > 
> > > -- 
> > > pi3 (pi3ki31ny) - pi3 (at) itsec pl
> > > http://pi3.com.pl
> > > 
> > 
> > 
> > -- 
> > Masami Hiramatsu <mhiramat@kernel.org>
> 
> -- 
> pi3 (pi3ki31ny) - pi3 (at) itsec pl
> http://pi3.com.pl
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: KRETPROBES are broken since kernel 5.8
  2020-12-10 13:09     ` Masami Hiramatsu
@ 2020-12-10 17:14       ` Adam Zabrocki
  2020-12-11  1:44         ` Masami Hiramatsu
  0 siblings, 1 reply; 6+ messages in thread
From: Adam Zabrocki @ 2020-12-10 17:14 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, Naveen N. Rao,
	Anil S Keshavamurthy, David S. Miller, Solar Designer

Hi,

> > However, there might be another issue which I wanted to brought / discuss - 
> > problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. 
> > 'ftrace_enable_sysctl' function was correctly optimized without any problems. 
> 
> Did you check it on other functions? Did you see it only on the "ftrace_enable_sysctl"?
> 

Yes, I see it in most of the functions with padding.

> > Since 5.9 it can't be optimized anynmore. I didn't see any changes in the 
> > sources regarding the optimizer, neither function itself.
> > When I looked at the generated vmlinux binary, I can see that GCC generated 
> > padding at the end of this function using INT3 opcode:
> > 
> > ...
> > ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
> > ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194 <ftrace_enable_sysctl+0x114>
> > ffffffff81305296:       cc                      int3
> > ffffffff81305297:       cc                      int3
> > ffffffff81305298:       cc                      int3
> > ffffffff81305299:       cc                      int3
> > ffffffff8130529a:       cc                      int3
> > ffffffff8130529b:       cc                      int3
> > ffffffff8130529c:       cc                      int3
> > ffffffff8130529d:       cc                      int3
> > ffffffff8130529e:       cc                      int3
> > ffffffff8130529f:       cc                      int3
> 
> So these int3 is generated by GCC for padding, right?
> 

I've just browsed a few commits and I've found that one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e

It looks like INT3 is now a default padding used by linker.

> > However, that's not the case here. INT3_INSN_OPCODE is placed at the end of 
> > the function as padding (and protect from NOP-padding problems).
> > 
> > I wonder, if optimizer should take this special case into account? Is it worth 
> > to still optimize such functions when they are padded with INT3?
> 
> Indeed. I expected int3 is used from other subsystems (e.g. kgdb) and,
> in that case the optimization can confuse them.

Right. The same can happen when text section is being actively modified. 
However, this case could be covered by running the optimizer logic under 
text_mutex.

> But if the gcc uses int3 to pad the room between functions, it should be
> reconsidered. 
> 

Looks like it's a default behavior now.

> Thank you,
>
> > If it is OK, we should backport those to stable tree.
> 
> Agreed.

It is also important to make sure that distro kernels would pick-up such 
backported fix.

Thanks,
Adam

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: KRETPROBES are broken since kernel 5.8
  2020-12-10 17:14       ` Adam Zabrocki
@ 2020-12-11  1:44         ` Masami Hiramatsu
  0 siblings, 0 replies; 6+ messages in thread
From: Masami Hiramatsu @ 2020-12-11  1:44 UTC (permalink / raw)
  To: Adam Zabrocki
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel, Naveen N. Rao,
	Anil S Keshavamurthy, David S. Miller, Solar Designer

On Thu, 10 Dec 2020 18:14:30 +0100
Adam Zabrocki <pi3@pi3.com.pl> wrote:

> Hi,
> 
> > > However, there might be another issue which I wanted to brought / discuss - 
> > > problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. 
> > > 'ftrace_enable_sysctl' function was correctly optimized without any problems. 
> > 
> > Did you check it on other functions? Did you see it only on the "ftrace_enable_sysctl"?
> > 
> 
> Yes, I see it in most of the functions with padding.

Thanks for the confirmation.

> 
> > > Since 5.9 it can't be optimized anynmore. I didn't see any changes in the 
> > > sources regarding the optimizer, neither function itself.
> > > When I looked at the generated vmlinux binary, I can see that GCC generated 
> > > padding at the end of this function using INT3 opcode:
> > > 
> > > ...
> > > ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
> > > ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194 <ftrace_enable_sysctl+0x114>
> > > ffffffff81305296:       cc                      int3
> > > ffffffff81305297:       cc                      int3
> > > ffffffff81305298:       cc                      int3
> > > ffffffff81305299:       cc                      int3
> > > ffffffff8130529a:       cc                      int3
> > > ffffffff8130529b:       cc                      int3
> > > ffffffff8130529c:       cc                      int3
> > > ffffffff8130529d:       cc                      int3
> > > ffffffff8130529e:       cc                      int3
> > > ffffffff8130529f:       cc                      int3
> > 
> > So these int3 is generated by GCC for padding, right?
> > 
> 
> I've just browsed a few commits and I've found that one:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e
> 
> It looks like INT3 is now a default padding used by linker.

Thanks for the information! OK, I will add Fixed: tag and backport it.

> 
> > > However, that's not the case here. INT3_INSN_OPCODE is placed at the end of 
> > > the function as padding (and protect from NOP-padding problems).
> > > 
> > > I wonder, if optimizer should take this special case into account? Is it worth 
> > > to still optimize such functions when they are padded with INT3?
> > 
> > Indeed. I expected int3 is used from other subsystems (e.g. kgdb) and,
> > in that case the optimization can confuse them.
> 
> Right. The same can happen when text section is being actively modified. 
> However, this case could be covered by running the optimizer logic under 
> text_mutex.

No, this check is needed because of the instruction decoding. Usually,
the int3 will be put a the first byte of the existing instruction whose
length is usually 1-6 bytes. If the instruction's opcode is overwritten
by the int3, kprobes can not get the original opcode and this means it 
can not get the original length of the instruction.
To optimize the probe, kprobes have to ensure the other jump instruction
doesn't jump into the instructions which will be overwritten by optimized
jump instruction. This is why the can_optimize() decodes all instructions
in the function (note that ksyms has no information of padding bytes, it
returns the function size with the padding bytes.)
Thus, when the kprobes detects the int3 in the function, it gives up the
decoding and optimizing.

> 
> > But if the gcc uses int3 to pad the room between functions, it should be
> > reconsidered. 
> > 
> 
> Looks like it's a default behavior now.

OK, let me fix that. If the int3 is only used for the padding between functions,
those int3 should continue to the end of the function. So kprobes can distinguish
the int3 comes from other subsystems or linker.

Thank you,

> 
> > Thank you,
> >
> > > If it is OK, we should backport those to stable tree.
> > 
> > Agreed.
> 
> It is also important to make sure that distro kernels would pick-up such 
> backported fix.
> 
> Thanks,
> Adam
> 
> -- 
> pi3 (pi3ki31ny) - pi3 (at) itsec pl
> http://pi3.com.pl
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-12-11  1:46 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-09 21:50 KRETPROBES are broken since kernel 5.8 Adam Zabrocki
2020-12-10  1:25 ` Masami Hiramatsu
2020-12-10  7:17   ` Adam Zabrocki
2020-12-10 13:09     ` Masami Hiramatsu
2020-12-10 17:14       ` Adam Zabrocki
2020-12-11  1:44         ` Masami Hiramatsu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).