* [RFC] arm64: ftrace with regs for livepatch support
@ 2015-11-20 11:47 AKASHI Takahiro
  2015-12-14  8:22 ` Li Bin
  2015-12-26  9:28 ` Li Bin
  0 siblings, 2 replies; 9+ messages in thread
From: AKASHI Takahiro @ 2015-11-20 11:47 UTC (permalink / raw)
  To: linux-arm-kernel

In this RFC, I'd like to describe and discuss some issues with adding ftrace/
livepatch support on arm64 before actually submitting patches. In fact,
porting livepatch itself is not a complicated task; the real issue is adding
"ftrace with regs" (CONFIG_DYNAMIC_FTRACE_WITH_REGS), which livepatch
heavily relies on.
(There is another discussion about "arch-independent livepatch" on LKML.)

Under "ftrace with regs", a ftrace helper function (ftrace_regs_caller)
is called with the cpu registers (struct pt_regs) at the beginning of
a function if tracing is enabled for that function. Livepatch uses this
argument to replace the PC and jump into the new (patched) function.
(Please note that this feature will also be used for ftrace-based kprobes.)

On arm64, there is no fixed template for a function prologue, and "instruction
scheduling" may interleave it with the function body. So the helper call
inserted by gcc's "-pg" option cannot (at least not reliably) see the correct
register values, because some of them may already have been overwritten
by that point.

X86, by contrast, implements "ftrace with regs" using gcc's "-mfentry" option,
which inserts "call __fentry__" as the very first instruction of a function.
As this option is arch-specific, after discussions with toolchain folks,
we are proposing a new arch-neutral option, "-fprolog-pad=N"[1].
This option inserts N nop instructions before the function prologue so that
each architecture can later replace the nops with whatever instruction
sequence it needs, when required.
(I assume that nops are very cheap in terms of performance impact.)

First, let me explain how we can implement "ftrace with regs", or more
specifically ftrace_make_call() and ftrace_make_nop(), as well as what the
inserted instruction sequences look like. Implementing ftrace_regs_caller itself
is quite straightforward, so we don't need to cover it here (at least not in this RFC).

1) instruction sequence
Unlike x86, we have to preserve the link register (x30) explicitly on arm64,
since the ftrace helper function is invoked before the function prologue. So we
need a few instructions here, not just one. Two possible ways:

  (a) stp x29, x30, [sp, #-16]!
      mov x29, sp
      bl <mcount>
      ldp x29, x30, [sp], #16
      <function prologue>
      ...

  (b) mov x9, x30
      bl <mcount>
      mov x30, x9
      <function prologue>
      ...

(a) complies with the normal calling convention.
(b) is Li Bin's idea from his old patch. While (b) saves some memory
accesses by using a scratch register (x9 in this example), we have no way
to recover the original value of that register.

       Q#1. Which approach should we take here?


2) replacing an instruction sequence
    (This issue is orthogonal to Q#1.)

Replacement can happen at any time, so we have to do it (without any locking) in
such a safe way that any task either calls the helper or doesn't call it at all,
but never executes an intermediate state.

Again here, two possible ways:

   (a) initialize the code in the shape of (A') at boot time,
             (B) -> (B') -> (A')
       then switch only between (A) and (A') at runtime
   (b) take a few steps each time. For example,
       to enable tracing,
             (B) -> (B') -> (A') -> (A)
       to disable tracing,
             (A) -> (A') -> (B') -> (B)
       Obviously, we need cache flushing/invalidation and barriers between the steps.

     (A)                                (A')
         stp x29, x30, [sp, #-16]!           b 1f
         mov x29, sp                         mov x29, sp
         bl <_mcount>                        bl <_mcount>
         ldp x29, x30, [sp], #16             ldp x29, x30, [sp], #16
                                          1:
         <function prologue>
         <function body>
         ...

     (B)                                (B')
         nop                                 b 1f
         nop                                 nop
         nop                                 nop
         nop                                 nop
                                          1:
         <function prologue>
         <function body>
         ...

(a) is much simpler, but (b) presumably has a smaller performance penalty when
tracing is disabled. I'm afraid I may be oversimplifying the issue.

        Q#2. Which one is more preferable?
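As a rough illustration of option (b)'s enable path only, here is a minimal C
sketch. It assumes the existing arm64 helpers aarch64_insn_gen_branch_imm() and
aarch64_insn_patch_text_nosync() (the latter performs the cache maintenance for
each write), while enc_stp_fp_lr(), enc_mov_fp_sp() and enc_ldp_fp_lr() are
purely hypothetical encoders for the fixed instructions:

    #include <asm/insn.h>

    /*
     * Sketch only: step-wise rewrite for option (b)'s enable path.
     * enc_stp_fp_lr()/enc_mov_fp_sp()/enc_ldp_fp_lr() are hypothetical
     * encoders; the branch/bl encodings use the real insn helpers.
     */
    static int prolog_pad_enable(u32 *slot, unsigned long mcount)
    {
            int ret;

            /* (B) -> (B'): divert execution over the pad before touching it */
            ret = aarch64_insn_patch_text_nosync(&slot[0],
                    aarch64_insn_gen_branch_imm((unsigned long)&slot[0],
                                                (unsigned long)&slot[4],
                                                AARCH64_INSN_BRANCH_NOLINK));
            if (ret)
                    return ret;

            /* (B') -> (A'): fill slots 1..3 while nothing executes them */
            ret = aarch64_insn_patch_text_nosync(&slot[1], enc_mov_fp_sp());
            if (!ret)
                    ret = aarch64_insn_patch_text_nosync(&slot[2],
                            aarch64_insn_gen_branch_imm((unsigned long)&slot[2],
                                                        mcount,
                                                        AARCH64_INSN_BRANCH_LINK));
            if (!ret)
                    ret = aarch64_insn_patch_text_nosync(&slot[3], enc_ldp_fp_lr());
            if (ret)
                    return ret;

            /* (A') -> (A): finally expose the stp, making the _mcount call live */
            return aarch64_insn_patch_text_nosync(&slot[0], enc_stp_fp_lr());
    }

Disabling would walk the same steps in reverse; since each _nosync call does its
own cache maintenance for the patched slot, a flush naturally sits between steps.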


[1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
     https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html


Thanks,
-Takahiro AKASHI


* [RFC] arm64: ftrace with regs for livepatch support
  2015-11-20 11:47 [RFC] arm64: ftrace with regs for livepatch support AKASHI Takahiro
@ 2015-12-14  8:22 ` Li Bin
  2015-12-14 15:56   ` Steven Rostedt
  2015-12-26  9:28 ` Li Bin
  1 sibling, 1 reply; 9+ messages in thread
From: Li Bin @ 2015-12-14  8:22 UTC (permalink / raw)
  To: linux-arm-kernel



on 2015/11/20 19:47, AKASHI Takahiro wrote:
> In this RFC, I'd like to describe and discuss some issues on adding ftrace/
> livepatch support on arm64 before actually submitting patches. In fact,
> porting livepatch is not a complicated task, but adding "ftrace with
> regs(CONFIG_DYNAMIC_FTRACE_WITH_REGS)" which livepatch heavily relies on
> is a matter.
> (There is another discussion about "arch-independent livepatch" in LKML.)
>
> Under "ftrace with regs", a ftrace helper function (ftrace_regs_caller)
> will be called with cpu registers (struct pt_regs_t) at the beginning of
> a function if tracing is enabled on the function. Livepatch utilizes this
> argument to replace PC and jump back into a new (patched) function.
> (Please note that this feature will also be used for ftrace-based kprobes.)
>
> On arm64, there is no template for a function prologue, and "instruction
> scheduling" may mix it with a function body. So a helper function, which
> is inserted by gcc's "-pg" option, cannot (at least potentially) recognize
> correct values of registers because some may have already been overwritten
> at that point.
>
> Instead, X86 uses gcc's "-mfentry" option, which inserts "call _mcount" as
> the first instruction of a function, to implement "ftrace with regs".
> As this option is arch-specific, after discussions with toolchain folks,
> we are proposing a new arch-neutral option, "-fprolog-pad=N"[1].
> This option inserts N nop instructions before a function prologue so that
> any architecture can utilize it to replace nops with whatever instruction
> sequence they want later on when required.
> (I assume that nop is very cheap in terms of performance impact.)
>
> First, let me explain how we can implement "ftrace with regs", or more
> specifically, ftrace_make_call() and ftrace_make_nop() as well as how
> inserted instruction sequences look like. Implementing ftrace_regs_caller
> is quite straightforward, we don't have to care (at least, in this RFC).
>
> 1) instruction sequence
> Unlike x86, we have to preserve link register(x30) explicitly on arm64 since
> a ftrace help function will be invoked before a function prologue. so we
> need a few, not one, instructions here. Two possible ways:
>
>  (a) stp x29, x30, [sp, #-16]!
>      mov x29, sp
>      bl <mcount>
>      ldp x29, x30, [sp], #16
>      <function prologue>
>      ...
>
>  (b) mov x9, x30
>      bl <mcount>
>      mov x30, x9
>      <function prologue>
>      ...
>
> (a) complies with a normal calling convention.
> (b) is Li Bin's idea in his old patch. While (b) can save some memory
> accesses by using a scratch register(x9 in this example), we have no way
> to recover an actual value for this register.
>
>       Q#1. Which approach should we take here?
>

I think a more appropriate way to implement livepatch on arm64 is to
modify the instruction directly with the help of the gcc "-fprolog-pad=N" option,
where N only needs to be 1, rather than basing it on ftrace.

func:
    nop    <--->  b <(new_func1 - func)>  <--->   b <(new_func2 - func)>       
    [prologue]

NOP and B are both instructions that are safe for "concurrent modification
and execution": they can be executed by one thread of execution while they
are being modified by another thread of execution, without requiring explicit
synchronization. A minimal sketch of such a single-instruction patch is below.
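
(A rough sketch, assuming the existing arm64 insn helpers and that the target
is within B's +/-128MB range; the klp_arch_enable_func()/klp_arch_disable_func()
names and their arguments are hypothetical, not an existing livepatch API:)

    #include <asm/insn.h>

    /* Sketch only: patch the single padded slot at the function entry. */
    static int klp_arch_enable_func(unsigned long func, unsigned long new_func)
    {
            /* encode "b <new_func - func>" (plain branch, not bl) */
            u32 branch = aarch64_insn_gen_branch_imm(func, new_func,
                                                     AARCH64_INSN_BRANCH_NOLINK);

            /* single-instruction write plus the cache maintenance */
            return aarch64_insn_patch_text_nosync((void *)func, branch);
    }

    static int klp_arch_disable_func(unsigned long func)
    {
            /* restore the padded nop */
            return aarch64_insn_patch_text_nosync((void *)func,
                                                  aarch64_insn_gen_nop());
    }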

On arm64, this method would improve performance significantly compared with the
ftrace-based method, especially for critical functions that are called frequently.

Can we modify livepatch to allow an arch-specific implementation? For example,
making klp_enable_func/klp_disable_func weak functions and allowing their
implementations to be architecture-specific and not use ftrace. I already have a
prototype patchset, have tested it, and will post it soon.

Thanks,
Li Bin

>
> 2) replacing an instruction sequence
>    (This issue is orthogonal to Q#1.)
>
> Replacing can happen anytime, so we have to do it (without any locking) in
> such a safe way that any task either calls a helper or doesn't call it, but
> never runs in any intermediate state.
>
> Again here, two possible ways:
>
>   (a) initialize the code in the shape of (A') at boot time,
>             (B) -> (B') -> (A')
>       then switching to (A) or (A')
>   (b) take a few steps each time. For example,
>       to enable tracing,
>             (B) -> (B') -> (A') -> (A)
>       to disable tracing,
>             (A) -> (A') -> (B') -> (A)
>       Obviously, we need cache flushing/invalidation and barriers between.
>
>     (A)                                (A')
>         stp x29, x30, [sp, #-16]!           b 1f
>         mov x29, sp                         mov x29, sp
>         bl <_mcount>                        bl <_mcount>
>         ldp x29, x30, [sp], #16             ld x29, x30, [sp], #16
>                                          1:
>         <function prologue>
>         <function body>
>         ...
>
>     (B)                                (B')
>         nop                                 b 1f
>         nop                                 nop
>         nop                                 nop
>         nop                                 nop
>                                          1:
>         <function prologue>
>         <function body>
>         ...
>
> (a) is much simpler, but (b) has less performance penalty(?) when tracing
> is disabled. I'm afraid that I might simplify the issue too much.
>
>        Q#2. Which one is more preferable?
>
>
> [1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
>     https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>
>
> Thanks,
> -Takahiro AKASHI
>
> .
>


* [RFC] arm64: ftrace with regs for livepatch support
  2015-12-14  8:22 ` Li Bin
@ 2015-12-14 15:56   ` Steven Rostedt
  0 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2015-12-14 15:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 14 Dec 2015 16:22:13 +0800
Li Bin <huawei.libin@huawei.com> wrote:

> I think the more appropriate way to implement the livepatch on arm64 is to
> directly modify the instruction with the help of the gcc "-fprolog-pad=N"option
> and the N only needs 1, rather than basing on ftrace.
> 
> func:
>     nop    <--->  b <(new_func1 - func)>  <--->   b <(new_func2 - func)>       
>     [prologue]
> 
> And that NOP and B are both safe instructions which called "concurrent modification
> and execution of instructions", that can be executed by one thread of execution as
> they are being modified by another thread of execution without requiring explicit
> synchronization.
> 
> On arm64, this method will improve performance significantly compared with the method
> based on ftrace, especially for the critical function being frequently called.

Yes, there's a way to do this, and I was against it because it can make
it very difficult to monitor what gets changed. ftrace just happens to
have that feature built in today. I'm working on a way to make this
a bit better for live patching while still maintaining some history of the
changes, as well as keeping the integrity of the ftrace infrastructure.

-- Steve


> 
> Can we modify the livepatch to allow the arch specific implementation? Such as that
> making the klp_enable_func/klp_disable_func as the weak function and allow their
> implementations be architecture sepcific that not use ftrace. I already have a prototype
> patchset and have test it and will post them soon.
> 


* [RFC] arm64: ftrace with regs for livepatch support
  2015-11-20 11:47 [RFC] arm64: ftrace with regs for livepatch support AKASHI Takahiro
  2015-12-14  8:22 ` Li Bin
@ 2015-12-26  9:28 ` Li Bin
  2016-01-19  8:28   ` AKASHI Takahiro
  1 sibling, 1 reply; 9+ messages in thread
From: Li Bin @ 2015-12-26  9:28 UTC (permalink / raw)
  To: linux-arm-kernel



on 2015/11/20 19:47, AKASHI Takahiro wrote:
> In this RFC, I'd like to describe and discuss some issues on adding ftrace/
> livepatch support on arm64 before actually submitting patches. In fact,
> porting livepatch is not a complicated task, but adding "ftrace with
> regs(CONFIG_DYNAMIC_FTRACE_WITH_REGS)" which livepatch heavily relies on
> is a matter.
> (There is another discussion about "arch-independent livepatch" in LKML.)
>
> Under "ftrace with regs", a ftrace helper function (ftrace_regs_caller)
> will be called with cpu registers (struct pt_regs_t) at the beginning of
> a function if tracing is enabled on the function. Livepatch utilizes this
> argument to replace PC and jump back into a new (patched) function.
> (Please note that this feature will also be used for ftrace-based kprobes.)
>
> On arm64, there is no template for a function prologue, and "instruction
> scheduling" may mix it with a function body. So a helper function, which
> is inserted by gcc's "-pg" option, cannot (at least potentially) recognize
> correct values of registers because some may have already been overwritten
> at that point.
>
> Instead, X86 uses gcc's "-mfentry" option, which inserts "call _mcount" as
> the first instruction of a function, to implement "ftrace with regs".
> As this option is arch-specific, after discussions with toolchain folks,
> we are proposing a new arch-neutral option, "-fprolog-pad=N"[1].
> This option inserts N nop instructions before a function prologue so that
> any architecture can utilize it to replace nops with whatever instruction
> sequence they want later on when required.
> (I assume that nop is very cheap in terms of performance impact.)
>
> First, let me explain how we can implement "ftrace with regs", or more
> specifically, ftrace_make_call() and ftrace_make_nop() as well as how
> inserted instruction sequences look like. Implementing ftrace_regs_caller
> is quite straightforward, we don't have to care (at least, in this RFC).
>
> 1) instruction sequence
> Unlike x86, we have to preserve link register(x30) explicitly on arm64 since
> a ftrace help function will be invoked before a function prologue. so we
> need a few, not one, instructions here. Two possible ways:
>
>  (a) stp x29, x30, [sp, #-16]!
>      mov x29, sp
>      bl <mcount>
>      ldp x29, x30, [sp], #16
>      <function prologue>
>      ...
>
>  (b) mov x9, x30
>      bl <mcount>
>      mov x30, x9
>      <function prologue>
>      ...
>
> (a) complies with a normal calling convention.
> (b) is Li Bin's idea in his old patch. While (b) can save some memory
> accesses by using a scratch register(x9 in this example), we have no way
> to recover an actual value for this register.
>
>       Q#1. Which approach should we take here?
>
>
> 2) replacing an instruction sequence
>    (This issue is orthogonal to Q#1.)
>
> Replacing can happen anytime, so we have to do it (without any locking) in
> such a safe way that any task either calls a helper or doesn't call it, but
> never runs in any intermediate state.
>
> Again here, two possible ways:
>
>   (a) initialize the code in the shape of (A') at boot time,
>             (B) -> (B') -> (A')
>       then switching to (A) or (A')
>   (b) take a few steps each time. For example,
>       to enable tracing,
>             (B) -> (B') -> (A') -> (A)
>       to disable tracing,
>             (A) -> (A') -> (B') -> (A)
>       Obviously, we need cache flushing/invalidation and barriers between.
>
>     (A)                                (A')
>         stp x29, x30, [sp, #-16]!           b 1f
>         mov x29, sp                         mov x29, sp
>         bl <_mcount>                        bl <_mcount>
>         ldp x29, x30, [sp], #16             ld x29, x30, [sp], #16
>                                          1:
>         <function prologue>
>         <function body>
>         ...
>
>     (B)                                (B')
>         nop                                 b 1f
>         nop                                 nop
>         nop                                 nop
>         nop                                 nop
>                                          1:
>         <function prologue>
>         <function body>
>         ...
>

Hi Takahiro,
This method cannot guarantee the correctness of replacing the multiple instructions
from (A') to (B') or from (B') to (A'), even under kstop_machine, especially for a
preemptible kernel or in NMI context (which will be supported on arm64 in the future).
Right?

Thanks,
Li Bin

> (a) is much simpler, but (b) has less performance penalty(?) when tracing
> is disabled. I'm afraid that I might simplify the issue too much.
>
>        Q#2. Which one is more preferable?
>
>
> [1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
>     https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>
>
> Thanks,
> -Takahiro AKASHI
>
> .
>


* [RFC] arm64: ftrace with regs for livepatch support
  2015-12-26  9:28 ` Li Bin
@ 2016-01-19  8:28   ` AKASHI Takahiro
  2016-01-20  1:25     ` Li Bin
  0 siblings, 1 reply; 9+ messages in thread
From: AKASHI Takahiro @ 2016-01-19  8:28 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Li,

Sorry for not replying sooner; I didn't notice your e-mail.

On 12/26/2015 06:28 PM, Li Bin wrote:
>
>
> on 2015/11/20 19:47, AKASHI Takahiro wrote:
>> In this RFC, I'd like to describe and discuss some issues on adding ftrace/
>> livepatch support on arm64 before actuailly submitting patches. In fact,
>> porting livepatch is not a complicated task, but adding "ftrace with
>> regs(CONFIG_DYNAMIC_FTRACE_WITH_REGS)" which livepatch heavily relies on
>> is a matter.
>> (There is another discussion about "arch-independent livepatch" in LKML.)
>>
>> Under "ftrace with regs", a ftrace helper function (ftrace_regs_caller)
>> will be called with cpu registers (struct pt_regs_t) at the beginning of
>> a function if tracing is enabled on the function. Livepatch utilizes this
>> argument to replace PC and jump back into a new (patched) function.
>> (Please note that this feature will also be used for ftrace-based kprobes.)
>>
>> On arm64, there is no template for a function prologue, and "instruction
>> scheduling" may mix it with a function body. So a helper function, which
>> is inserted by gcc's "-pg" option, cannot (at least potentially) recognize
>> correct values of registers because some may have already been overwritten
>> at that point.
>>
>> Instead, X86 uses gcc's "-mfentry" option, which inserts "call _mcount" as
>> the first instruction of a function, to implement "ftrace with regs".
>> As this option is arch-specific, after discussions with toolchain folks,
>> we are proposing a new arch-neutral option, "-fprolog-pad=N"[1].
>> This option inserts N nop instructions before a function prologue so that
>> any architecture can utilize it to replace nops with whatever instruction
>> sequence they want later on when required.
>> (I assume that nop is very cheap in terms of performance impact.)
>>
>> First, let me explain how we can implement "ftrace with regs", or more
>> specifically, ftrace_make_call() and ftrace_make_nop() as well as how
>> inserted instruction sequences look like. Implementing ftrace_regs_caller
>> is quite straightforward, we don't have to care (at least, in this RFC).
>>
>> 1) instruction sequence
>> Unlike x86, we have to preserve link register(x30) explicitly on arm64 since
>> a ftrace help function will be invoked before a function prologue. so we
>> need a few, not one, instructions here. Two possible ways:
>>
>>   (a) stp x29, x30, [sp, #-16]!
>>       mov x29, sp
>>       bl <mcount>
>>       ldp x29, x30, [sp], #16
>>       <function prologue>
>>       ...
>>
>>   (b) mov x9, x30
>>       bl <mcount>
>>       mov x30, x9
>>       <function prologue>
>>       ...
>>
>> (a) complies with a normal calling convention.
>> (b) is Li Bin's idea in his old patch. While (b) can save some memory
>> accesses by using a scratch register(x9 in this example), we have no way
>> to recover an actual value for this register.
>>
>>        Q#1. Which approach should we take here?
>>
>>
>> 2) replacing an instruction sequence
>>     (This issue is orthogonal to Q#1.)
>>
>> Replacing can happen anytime, so we have to do it (without any locking) in
>> such a safe way that any task either calls a helper or doesn't call it, but
>> never runs in any intermediate state.
>>
>> Again here, two possible ways:
>>
>>    (a) initialize the code in the shape of (A') at boot time,
>>              (B) -> (B') -> (A')
>>        then switching to (A) or (A')
>>    (b) take a few steps each time. For example,
>>        to enable tracing,
>>              (B) -> (B') -> (A') -> (A)
>>        to disable tracing,
>>              (A) -> (A') -> (B') -> (A)
>>        Obviously, we need cache flushing/invalidation and barriers between.
>>
>>      (A)                                (A')
>>          stp x29, x30, [sp, #-16]!           b 1f
>>          mov x29, sp                         mov x29, sp
>>          bl <_mcount>                        bl <_mcount>
>>          ldp x29, x30, [sp], #16             ld x29, x30, [sp], #16
>>                                           1:
>>          <function prologue>
>>          <function body>
>>          ...
>>
>>      (B)                                (B')
>>          nop                                 b 1f
>>          nop                                 nop
>>          nop                                 nop
>>          nop                                 nop
>>                                           1:
>>          <function prologue>
>>          <function body>
>>          ...
>>
>
> Hi takahiro,
> This method can not guarantee the correctness of replacing the multi instrucions
> from  (A') to (B') or from (B') to (A'), even if under kstop_machine especially for
> preemptable kernel or NMI context (which will be supported on arm64 in future).
> Right?

You seem to be right.
I thought we could use aarch64_insn_patch_text() here, but
it doesn't ensure any atomicity of the replacement.
Switching only between (A') and (A) could be used instead,
but the performance penalty would not be acceptable.

Why does your livepatch with -mfentry work?

-Takahiro AKASHI

> Thanks,
> Li Bin
>
>> (a) is much simpler, but (b) has less performance penalty(?) when tracing
>> is disabled. I'm afraid that I might simplify the issue too much.
>>
>>         Q#2. Which one is more preferable?
>>
>>
>> [1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
>>      https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>>
>>
>> Thanks,
>> -Takahiro AKASHI
>>
>> .
>>
>
>


* [RFC] arm64: ftrace with regs for livepatch support
  2016-01-19  8:28   ` AKASHI Takahiro
@ 2016-01-20  1:25     ` Li Bin
  2016-01-20  3:12       ` AKASHI Takahiro
  0 siblings, 1 reply; 9+ messages in thread
From: Li Bin @ 2016-01-20  1:25 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Takahiro,
First of all, thanks for your reply.

on 2016/1/19 16:28, AKASHI Takahiro wrote:
>>> 1) instruction sequence
>>> Unlike x86, we have to preserve link register(x30) explicitly on arm64 since
>>> a ftrace help function will be invoked before a function prologue. so we
>>> need a few, not one, instructions here. Two possible ways:
>>>
>>>   (a) stp x29, x30, [sp, #-16]!
>>>       mov x29, sp
>>>       bl <mcount>
>>>       ldp x29, x30, [sp], #16
>>>       <function prologue>
>>>       ...
>>>
>>>   (b) mov x9, x30
>>>       bl <mcount>
>>>       mov x30, x9
>>>       <function prologue>
>>>       ...
>>>
>>> (a) complies with a normal calling convention.
>>> (b) is Li Bin's idea in his old patch. While (b) can save some memory
>>> accesses by using a scratch register(x9 in this example), we have no way
>>> to recover an actual value for this register.
>>>
>>>        Q#1. Which approach should we take here?
>>>
>>>
>>> 2) replacing an instruction sequence
>>>     (This issue is orthogonal to Q#1.)
>>>
>>> Replacing can happen anytime, so we have to do it (without any locking) in
>>> such a safe way that any task either calls a helper or doesn't call it, but
>>> never runs in any intermediate state.
>>>
>>> Again here, two possible ways:
>>>
>>>    (a) initialize the code in the shape of (A') at boot time,
>>>              (B) -> (B') -> (A')
>>>        then switching to (A) or (A')
>>>    (b) take a few steps each time. For example,
>>>        to enable tracing,
>>>              (B) -> (B') -> (A') -> (A)
>>>        to disable tracing,
>>>              (A) -> (A') -> (B') -> (A)
>>>        Obviously, we need cache flushing/invalidation and barriers between.
>>>
>>>      (A)                                (A')
>>>          stp x29, x30, [sp, #-16]!           b 1f
>>>          mov x29, sp                         mov x29, sp
>>>          bl <_mcount>                        bl <_mcount>
>>>          ldp x29, x30, [sp], #16             ld x29, x30, [sp], #16
>>>                                           1:
>>>          <function prologue>
>>>          <function body>
>>>          ...
>>>
>>>      (B)                                (B')
>>>          nop                                 b 1f
>>>          nop                                 nop
>>>          nop                                 nop
>>>          nop                                 nop
>>>                                           1:
>>>          <function prologue>
>>>          <function body>
>>>          ...
>>>
>>
>> Hi takahiro,
>> This method can not guarantee the correctness of replacing the multi instrucions
>> from  (A') to (B') or from (B') to (A'), even if under kstop_machine especially for
>> preemptable kernel or NMI context (which will be supported on arm64 in future).
>> Right?
>
> You seem to be right.
> I thought that we could use aarch64_insn_patch_text() here, but
> it doesn't ensure any atomicity of replacement.
> Switching from (A') to (A) or (A) to (A') can be used instead,
> but the performance penalty will not be acceptable.
>
> Why does your livepatch with -mfentry work?
>

For my method with -mfentry, the instruction replacement for converting
nops to ftrace calls or back is as follows:

    (A)  mov x9, x30                        (A')  mov x9, x30
         nop                  <---------->        bl <__fentry__>
         mov x30, x9                              mov x30, x9
         <function prologue>                      <function prologue>

so it is compatible with the current recordmcount and ftrace logic, and the
only cost is the two extra low-cost mov instructions (a sketch of the
corresponding ftrace_make_call()/ftrace_make_nop() is below).
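
(For reference, a sketch of what ftrace_make_call()/ftrace_make_nop() might
look like under this scheme; it assumes rec->ip points at the middle nop slot
and omits the usual validation of the old instruction:)

    #include <linux/ftrace.h>
    #include <asm/insn.h>

    /* Sketch only: exactly one instruction (the middle slot) is rewritten. */
    int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
    {
            /* nop -> bl <ftrace entry>; the helper also flushes the icache */
            u32 bl = aarch64_insn_gen_branch_imm(rec->ip, addr,
                                                 AARCH64_INSN_BRANCH_LINK);

            return aarch64_insn_patch_text_nosync((void *)rec->ip, bl);
    }

    int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
                        unsigned long addr)
    {
            /* bl <ftrace entry> -> nop */
            return aarch64_insn_patch_text_nosync((void *)rec->ip,
                                                  aarch64_insn_gen_nop());
    }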

Thanks,
Li Bin

> -Takahiro AKASHI
>
>> Thanks,
>> Li Bin
>>
>>> (a) is much simpler, but (b) has less performance penalty(?) when tracing
>>> is disabled. I'm afraid that I might simplify the issue too much.
>>>
>>>         Q#2. Which one is more preferable?
>>>
>>>
>>> [1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
>>>      https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>>>
>>>
>>> Thanks,
>>> -Takahiro AKASHI
>>>
>>> .
>>>


* [RFC] arm64: ftrace with regs for livepatch support
  2016-01-20  1:25     ` Li Bin
@ 2016-01-20  3:12       ` AKASHI Takahiro
  2016-01-25  8:56         ` Li Bin
  0 siblings, 1 reply; 9+ messages in thread
From: AKASHI Takahiro @ 2016-01-20  3:12 UTC (permalink / raw)
  To: linux-arm-kernel

Li,

On 01/20/2016 10:25 AM, Li Bin wrote:
> Hi Takahiro,
> Thanks for your reply firstly.
>
> on 2016/1/19 16:28, AKASHI Takahiro wrote:
>>>> 1) instruction sequence
>>>> Unlike x86, we have to preserve link register(x30) explicitly on arm64 since
>>>> a ftrace help function will be invoked before a function prologue. so we
>>>> need a few, not one, instructions here. Two possible ways:
>>>>
>>>>    (a) stp x29, x30, [sp, #-16]!
>>>>        mov x29, sp
>>>>        bl <mcount>
>>>>        ldp x29, x30, [sp], #16
>>>>        <function prologue>
>>>>        ...
>>>>
>>>>    (b) mov x9, x30
>>>>        bl <mcount>
>>>>        mov x30, x9
>>>>        <function prologue>
>>>>        ...
>>>>
>>>> (a) complies with a normal calling convention.
>>>> (b) is Li Bin's idea in his old patch. While (b) can save some memory
>>>> accesses by using a scratch register(x9 in this example), we have no way
>>>> to recover an actual value for this register.
>>>>
>>>>         Q#1. Which approach should we take here?
>>>>
>>>>
>>>> 2) replacing an instruction sequence
>>>>      (This issue is orthogonal to Q#1.)
>>>>
>>>> Replacing can happen anytime, so we have to do it (without any locking) in
>>>> such a safe way that any task either calls a helper or doesn't call it, but
>>>> never runs in any intermediate state.
>>>>
>>>> Again here, two possible ways:
>>>>
>>>>     (a) initialize the code in the shape of (A') at boot time,
>>>>               (B) -> (B') -> (A')
>>>>         then switching to (A) or (A')
>>>>     (b) take a few steps each time. For example,
>>>>         to enable tracing,
>>>>               (B) -> (B') -> (A') -> (A)
>>>>         to disable tracing,
>>>>               (A) -> (A') -> (B') -> (A)
>>>>         Obviously, we need cache flushing/invalidation and barriers between.
>>>>
>>>>       (A)                                (A')
>>>>           stp x29, x30, [sp, #-16]!           b 1f
>>>>           mov x29, sp                         mov x29, sp
>>>>           bl <_mcount>                        bl <_mcount>
>>>>           ldp x29, x30, [sp], #16             ld x29, x30, [sp], #16
>>>>                                            1:
>>>>           <function prologue>
>>>>           <function body>
>>>>           ...
>>>>
>>>>       (B)                                (B')
>>>>           nop                                 b 1f
>>>>           nop                                 nop
>>>>           nop                                 nop
>>>>           nop                                 nop
>>>>                                            1:
>>>>           <function prologue>
>>>>           <function body>
>>>>           ...
>>>>
>>>
>>> Hi takahiro,
>>> This method can not guarantee the correctness of replacing the multi instrucions
>>> from  (A') to (B') or from (B') to (A'), even if under kstop_machine especially for
>>> preemptable kernel or NMI context (which will be supported on arm64 in future).
>>> Right?
>>
>> You seem to be right.
>> I thought that we could use aarch64_insn_patch_text() here, but
>> it doesn't ensure any atomicity of replacement.
>> Switching from (A') to (A) or (A) to (A') can be used instead,
>> but the performance penalty will not be acceptable.
>>
>> Why does your livepatch with -mfentry work?
>>
>
> For my method with -mfentry, the instruction replacement for converting
> nops to ftrace calls or back is as following,
>
>      (A)    mov x9, x30                     (A')  mov x9, x30
>              nop           <-------------->     bl <__fentry__>
>              mov x30, x9                            mov x30, x9
>              <function prologue>            <function prologue>
>
> so it is compatible with the current recordmcount and ftrace logic, and the
> only effect is that introducing two extra low cost mov instruction.

Last night I thought about this issue and reached almost the same conclusion :)
The only other way would be to forcibly suppress gcc's instruction
scheduling in the function prologue and generate an always-identical prologue,
but that is unlikely to happen.
(Requiring pt_regs for livepatch is just too much.)

Other than the performance impact (I'm still not sure about it),
we might have a problem *with -fprolog-pad=N* when the kernel sets up this
multiple-instruction sequence in each traceable function at boot time, because
we have no way to do so transparently. I will check the ftrace code.

Thanks,
-Takahiro AKASHI

> Thanks,
> Li Bin
>
>> -Takahiro AKASHI
>>
>>> Thanks,
>>> Li Bin
>>>
>>>> (a) is much simpler, but (b) has less performance penalty(?) when tracing
>>>> is disabled. I'm afraid that I might simplify the issue too much.
>>>>
>>>>          Q#2. Which one is more preferable?
>>>>
>>>>
>>>> [1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
>>>>       https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>>>>
>>>>
>>>> Thanks,
>>>> -Takahiro AKASHI
>>>>
>>>> .
>>>>
>
>


* [RFC] arm64: ftrace with regs for livepatch support
  2016-01-20  3:12       ` AKASHI Takahiro
@ 2016-01-25  8:56         ` Li Bin
  2016-01-26  8:07           ` AKASHI Takahiro
  0 siblings, 1 reply; 9+ messages in thread
From: Li Bin @ 2016-01-25  8:56 UTC (permalink / raw)
  To: linux-arm-kernel



on 2016/1/20 11:12, AKASHI Takahiro wrote:
> Li,
> 
> On 01/20/2016 10:25 AM, Li Bin wrote:
>> Hi Takahiro,
>> Thanks for your reply firstly.
>>
>> on 2016/1/19 16:28, AKASHI Takahiro wrote:
>>>>> 1) instruction sequence
>>>>> Unlike x86, we have to preserve link register(x30) explicitly on arm64 since
>>>>> a ftrace help function will be invoked before a function prologue. so we
>>>>> need a few, not one, instructions here. Two possible ways:
>>>>>
>>>>>    (a) stp x29, x30, [sp, #-16]!
>>>>>        mov x29, sp
>>>>>        bl <mcount>
>>>>>        ldp x29, x30, [sp], #16
>>>>>        <function prologue>
>>>>>        ...
>>>>>
>>>>>    (b) mov x9, x30
>>>>>        bl <mcount>
>>>>>        mov x30, x9
>>>>>        <function prologue>
>>>>>        ...
>>>>>
>>>>> (a) complies with a normal calling convention.
>>>>> (b) is Li Bin's idea in his old patch. While (b) can save some memory
>>>>> accesses by using a scratch register(x9 in this example), we have no way
>>>>> to recover an actual value for this register.
>>>>>
>>>>>         Q#1. Which approach should we take here?
>>>>>
>>>>>
>>>>> 2) replacing an instruction sequence
>>>>>      (This issue is orthogonal to Q#1.)
>>>>>
>>>>> Replacing can happen anytime, so we have to do it (without any locking) in
>>>>> such a safe way that any task either calls a helper or doesn't call it, but
>>>>> never runs in any intermediate state.
>>>>>
>>>>> Again here, two possible ways:
>>>>>
>>>>>     (a) initialize the code in the shape of (A') at boot time,
>>>>>               (B) -> (B') -> (A')
>>>>>         then switching to (A) or (A')
>>>>>     (b) take a few steps each time. For example,
>>>>>         to enable tracing,
>>>>>               (B) -> (B') -> (A') -> (A)
>>>>>         to disable tracing,
>>>>>               (A) -> (A') -> (B') -> (A)
>>>>>         Obviously, we need cache flushing/invalidation and barriers between.
>>>>>
>>>>>       (A)                                (A')
>>>>>           stp x29, x30, [sp, #-16]!           b 1f
>>>>>           mov x29, sp                         mov x29, sp
>>>>>           bl <_mcount>                        bl <_mcount>
>>>>>           ldp x29, x30, [sp], #16             ld x29, x30, [sp], #16
>>>>>                                            1:
>>>>>           <function prologue>
>>>>>           <function body>
>>>>>           ...
>>>>>
>>>>>       (B)                                (B')
>>>>>           nop                                 b 1f
>>>>>           nop                                 nop
>>>>>           nop                                 nop
>>>>>           nop                                 nop
>>>>>                                            1:
>>>>>           <function prologue>
>>>>>           <function body>
>>>>>           ...
>>>>>
>>>>
>>>> Hi takahiro,
>>>> This method can not guarantee the correctness of replacing the multi instrucions
>>>> from  (A') to (B') or from (B') to (A'), even if under kstop_machine especially for
>>>> preemptable kernel or NMI context (which will be supported on arm64 in future).
>>>> Right?
>>>
>>> You seem to be right.
>>> I thought that we could use aarch64_insn_patch_text() here, but
>>> it doesn't ensure any atomicity of replacement.
>>> Switching from (A') to (A) or (A) to (A') can be used instead,
>>> but the performance penalty will not be acceptable.
>>>
>>> Why does your livepatch with -mfentry work?
>>>
>>
>> For my method with -mfentry, the instruction replacement for converting
>> nops to ftrace calls or back is as following,
>>
>>      (A)    mov x9, x30                     (A')  mov x9, x30
>>              nop           <-------------->     bl <__fentry__>
>>              mov x30, x9                            mov x30, x9
>>              <function prologue>            <function prologue>
>>
>> so it is compatible with the current recordmcount and ftrace logic, and the
>> only effect is that introducing two extra low cost mov instruction.
> 
> Last night I thought this issue and reached almost the same conclusion :)
> The only other way would be to forcedly suppress gcc's instruction
> scheduling in a function prologue and generate an always-the-same prologue.
> But this will be unlikely.
> (requiring pt_regs for livepatch is just too much.)
> 
> Other than a performance impact (I'm still not sure about it),
> we might have a problem *with -fprolog-add=N* when the kernel sets up this
> multiple-instructions sequence in each traceable function at boot time because
> we have no way to do so transparently. I will check the ftrace code.
> 

So can we push the gcc community to consider an -mfentry feature for arm64,
following the x86/s390/mips/sh example, to support this method?
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg02250.html

Thanks,
Li Bin

> Thanks,
> -Takahiro AKASHI
> 
>> Thanks,
>> Li Bin
>>
>>> -Takahiro AKASHI
>>>
>>>> Thanks,
>>>> Li Bin
>>>>
>>>>> (a) is much simpler, but (b) has less performance penalty(?) when tracing
>>>>> is disabled. I'm afraid that I might simplify the issue too much.
>>>>>
>>>>>          Q#2. Which one is more preferable?
>>>>>
>>>>>
>>>>> [1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
>>>>>       https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -Takahiro AKASHI
>>>>>
>>>>> .
>>>>>
>>
>>
> 
> .
> 


* [RFC] arm64: ftrace with regs for livepatch support
  2016-01-25  8:56         ` Li Bin
@ 2016-01-26  8:07           ` AKASHI Takahiro
  0 siblings, 0 replies; 9+ messages in thread
From: AKASHI Takahiro @ 2016-01-26  8:07 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/25/2016 05:56 PM, Li Bin wrote:
>
>
> on 2016/1/20 11:12, AKASHI Takahiro wrote:
>> Li,
>>
>> On 01/20/2016 10:25 AM, Li Bin wrote:
>>> Hi Takahiro,
>>> Thanks for your reply firstly.
>>>
>>> on 2016/1/19 16:28, AKASHI Takahiro wrote:
>>>>>> 1) instruction sequence
>>>>>> Unlike x86, we have to preserve link register(x30) explicitly on arm64 since
>>>>>> a ftrace help function will be invoked before a function prologue. so we
>>>>>> need a few, not one, instructions here. Two possible ways:
>>>>>>
>>>>>>     (a) stp x29, x30, [sp, #-16]!
>>>>>>         mov x29, sp
>>>>>>         bl <mcount>
>>>>>>         ldp x29, x30, [sp], #16
>>>>>>         <function prologue>
>>>>>>         ...
>>>>>>
>>>>>>     (b) mov x9, x30
>>>>>>         bl <mcount>
>>>>>>         mov x30, x9
>>>>>>         <function prologue>
>>>>>>         ...
>>>>>>
>>>>>> (a) complies with a normal calling convention.
>>>>>> (b) is Li Bin's idea in his old patch. While (b) can save some memory
>>>>>> accesses by using a scratch register(x9 in this example), we have no way
>>>>>> to recover an actual value for this register.
>>>>>>
>>>>>>          Q#1. Which approach should we take here?
>>>>>>
>>>>>>
>>>>>> 2) replacing an instruction sequence
>>>>>>       (This issue is orthogonal to Q#1.)
>>>>>>
>>>>>> Replacing can happen anytime, so we have to do it (without any locking) in
>>>>>> such a safe way that any task either calls a helper or doesn't call it, but
>>>>>> never runs in any intermediate state.
>>>>>>
>>>>>> Again here, two possible ways:
>>>>>>
>>>>>>      (a) initialize the code in the shape of (A') at boot time,
>>>>>>                (B) -> (B') -> (A')
>>>>>>          then switching to (A) or (A')
>>>>>>      (b) take a few steps each time. For example,
>>>>>>          to enable tracing,
>>>>>>                (B) -> (B') -> (A') -> (A)
>>>>>>          to disable tracing,
>>>>>>                (A) -> (A') -> (B') -> (A)
>>>>>>          Obviously, we need cache flushing/invalidation and barriers between.
>>>>>>
>>>>>>        (A)                                (A')
>>>>>>            stp x29, x30, [sp, #-16]!           b 1f
>>>>>>            mov x29, sp                         mov x29, sp
>>>>>>            bl <_mcount>                        bl <_mcount>
>>>>>>            ldp x29, x30, [sp], #16             ld x29, x30, [sp], #16
>>>>>>                                             1:
>>>>>>            <function prologue>
>>>>>>            <function body>
>>>>>>            ...
>>>>>>
>>>>>>        (B)                                (B')
>>>>>>            nop                                 b 1f
>>>>>>            nop                                 nop
>>>>>>            nop                                 nop
>>>>>>            nop                                 nop
>>>>>>                                             1:
>>>>>>            <function prologue>
>>>>>>            <function body>
>>>>>>            ...
>>>>>>
>>>>>
>>>>> Hi takahiro,
>>>>> This method can not guarantee the correctness of replacing the multi instrucions
>>>>> from  (A') to (B') or from (B') to (A'), even if under kstop_machine especially for
>>>>> preemptable kernel or NMI context (which will be supported on arm64 in future).
>>>>> Right?
>>>>
>>>> You seem to be right.
>>>> I thought that we could use aarch64_insn_patch_text() here, but
>>>> it doesn't ensure any atomicity of replacement.
>>>> Switching from (A') to (A) or (A) to (A') can be used instead,
>>>> but the performance penalty will not be acceptable.
>>>>
>>>> Why does your livepatch with -mfentry work?
>>>>
>>>
>>> For my method with -mfentry, the instruction replacement for converting
>>> nops to ftrace calls or back is as following,
>>>
>>>       (A)    mov x9, x30                     (A')  mov x9, x30
>>>               nop           <-------------->     bl <__fentry__>
>>>               mov x30, x9                            mov x30, x9
>>>               <function prologue>            <function prologue>
>>>
>>> so it is compatible with the current recordmcount and ftrace logic, and the
>>> only effect is that introducing two extra low cost mov instruction.
>>
>> Last night I thought this issue and reached almost the same conclusion :)
>> The only other way would be to forcedly suppress gcc's instruction
>> scheduling in a function prologue and generate an always-the-same prologue.
>> But this will be unlikely.
>> (requiring pt_regs for livepatch is just too much.)
>>
>> Other than a performance impact (I'm still not sure about it),
>> we might have a problem *with -fprolog-add=N* when the kernel sets up this
>> multiple-instructions sequence in each traceable function at boot time because
>> we have no way to do so transparently. I will check the ftrace code.
>>
>
> So can we promote the gcc community to consider the -mfentry feature which
> follows x86/s390/mips/sh example to support this method?
> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg02250.html

I think we can also initialize ftrace properly in the case of the -fprolog-pad=N option.

First, ftrace_init() is called in start_kernel() so early that no other threads
are running yet. Then ftrace_init() calls ftrace_update_code() with local irqs disabled.
This means that ftrace_make_call/nop() can be executed in a *safe* mode.
The problem is that we have to modify not only the second NOP, but all three
NOPs, to either nop; nop; nop or mov; bl; mov at boot time.
Fortunately, the first and third instructions will never be modified again later, so
we can do as follows:
ftrace_make_call/nop(...)
{
     if (1st instruction == NOP) --- (*)
         write the 1st instruction (mov x9, x30).
         write the 3rd instruction (mov x30, x9).

     write the 2nd instruction (bl <ftrace entry> or nop).
}
(or we don't even need (*).)

It is a bit ugly, but works.
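
(In C, a sketch of that idea might look like the following; enc_mov() is a
hypothetical encoder for "mov <dst>, <src>", rec->ip is assumed to point at the
middle slot, and AARCH64_INSN_SIZE is the 4-byte instruction size:)

    #include <linux/ftrace.h>
    #include <asm/insn.h>

    /* Sketch only: slots 1 and 3 are written once and never touched again. */
    static int init_outer_slots_once(unsigned long pc)
    {
            u32 first;
            int ret;

            ret = aarch64_insn_read((void *)(pc - AARCH64_INSN_SIZE), &first);
            if (ret)
                    return ret;
            if (first != aarch64_insn_gen_nop())
                    return 0;                       /* already initialized */

            ret = aarch64_insn_patch_text_nosync((void *)(pc - AARCH64_INSN_SIZE),
                                                 enc_mov(9, 30)); /* mov x9, x30 */
            if (ret)
                    return ret;
            return aarch64_insn_patch_text_nosync((void *)(pc + AARCH64_INSN_SIZE),
                                                  enc_mov(30, 9)); /* mov x30, x9 */
    }

    int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
    {
            int ret = init_outer_slots_once(rec->ip);

            if (ret)
                    return ret;
            /* 2nd slot: nop -> bl <ftrace entry> */
            return aarch64_insn_patch_text_nosync((void *)rec->ip,
                            aarch64_insn_gen_branch_imm(rec->ip, addr,
                                                        AARCH64_INSN_BRANCH_LINK));
    }

ftrace_make_nop() would do the same one-time setup and then write the nop back
into the 2nd slot.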

So now I don't strongly advocate -fprolog-pad=N any more, but if this option can also be
used to implement livepatch (more precisely, FTRACE_WITH_REGS) on other arches or just for
other purposes, I don't want to rule this approach out yet.

Thanks,
-Takahiro AKASHI

> Thanks,
> Li Bin
>
>> Thanks,
>> -Takahiro AKASHI
>>
>>> Thanks,
>>> Li Bin
>>>
>>>> -Takahiro AKASHI
>>>>
>>>>> Thanks,
>>>>> Li Bin
>>>>>
>>>>>> (a) is much simpler, but (b) has less performance penalty(?) when tracing
>>>>>> is disabled. I'm afraid that I might simplify the issue too much.
>>>>>>
>>>>>>           Q#2. Which one is more preferable?
>>>>>>
>>>>>>
>>>>>> [1] https://gcc.gnu.org/ml/gcc/2015-05/msg00267.html, and
>>>>>>        https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -Takahiro AKASHI
>>>>>>
>>>>>> .
>>>>>>
>>>
>>>
>>
>> .
>>
>



Thread overview: 9+ messages
2015-11-20 11:47 [RFC] arm64: ftrace with regs for livepatch support AKASHI Takahiro
2015-12-14  8:22 ` Li Bin
2015-12-14 15:56   ` Steven Rostedt
2015-12-26  9:28 ` Li Bin
2016-01-19  8:28   ` AKASHI Takahiro
2016-01-20  1:25     ` Li Bin
2016-01-20  3:12       ` AKASHI Takahiro
2016-01-25  8:56         ` Li Bin
2016-01-26  8:07           ` AKASHI Takahiro
