From: Mark Rutland <mark.rutland@arm.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Wang ShaoBo <bobo.shaobowang@huawei.com>,
	cj.chengjian@huawei.com, huawei.libin@huawei.com,
	xiexiuqi@huawei.com, liwei391@huawei.com,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
	will@kernel.org, zengshun.wu@outlook.com
Subject: Re: [RFC PATCH -next v2 3/4] arm64/ftrace: support dynamically allocated trampolines
Date: Thu, 21 Apr 2022 17:27:40 +0100	[thread overview]
Message-ID: <YmGF/OpIhAF8YeVq@lakrids> (raw)
In-Reply-To: <20220421114201.21228eeb@gandalf.local.home>

On Thu, Apr 21, 2022 at 11:42:01AM -0400, Steven Rostedt wrote:
> On Thu, 21 Apr 2022 16:14:13 +0100
> Mark Rutland <mark.rutland@arm.com> wrote:
> 
> > > Let's say you have 10 ftrace_ops registered (with bpf and kprobes this can
> > > be quite common). But each of these ftrace_ops traces a function (or
> > > functions) that are not being traced by the other ftrace_ops. That is, each
> > > ftrace_ops has its own unique function(s) that they are tracing. One could
> > > be tracing schedule, the other could be tracing ksoftirqd_should_run
> > > (whatever).  
> > 
> > Ok, so that's when messing around with bpf or kprobes, and not generally
> > when using plain old ftrace functionality under /sys/kernel/tracing/
> > (unless that's concurrent with one of the former, as per your other
> > reply) ?
> 
> It's any user of the ftrace infrastructure, which includes kprobes, bpf,
> perf, function tracing, function graph tracing, and also affects instances.
> 
> > 
> > > Without this change, because the arch does not support dynamically
> > > allocated trampolines, it means that all these ftrace_ops will be
> > > registered to the same trampoline. That means, for every function that is
> > > traced, it will loop through all 10 of these ftrace_ops and check their
> > > hashes to see if their callback should be called or not.  
> > 
> > Sure; I can see how that can be quite expensive.
> > 
> > What I'm trying to figure out is who this matters to and when, since the
> > implementation is going to come with a bunch of subtle/fractal
> > complexities, and likely a substantial overhead too when enabling or
> > disabling tracing of a patch-site. I'd like to understand the trade-offs
> > better.
> > 
> > > With dynamically allocated trampolines, each ftrace_ops will have their own
> > > trampoline, and that trampoline will be called directly if the function
> > > is only being traced by the one ftrace_ops. This is much more efficient.
> > > 
> > > If a function is traced by more than one ftrace_ops, then it falls back to
> > > the loop.  
> > 
> > I see -- so the dynamic trampoline is just to get the ops? Or is that
> > doing additional things?
> 
> It's to get both the ftrace_ops (as that's one of the parameters) as well
> as to call the callback directly. Not sure if arm64 is affected by Spectre,
> but the "loop" function is filled with indirect function calls, whereas
> the dynamic trampolines call the callback directly.
> 
> Instead of:
> 
>   bl ftrace_caller
> 
> ftrace_caller:
>   [..]
>   bl ftrace_ops_list_func
>   [..]
> 
> 
> void ftrace_ops_list_func(...)
> {
> 	__do_for_each_ftrace_ops(op, ftrace_ops_list) {
> 		if (ftrace_ops_test(op, ip)) // test the hash to see if it
> 					     //	should trace this
> 					     //	function.
> 			op->func(...);
> 	}
> }
> 
> It does:
> 
>   bl dynamic_tramp
> 
> dynamic_tramp:
>   [..]
>   bl func  // call the op->func directly!
> 
> 
> Much more efficient!
> 
> 
> > 
> > There might be a middle-ground here where we patch the ftrace_ops
> > pointer into a literal pool at the patch-site, which would allow us to
> > handle this atomically, and would avoid the issues with out-of-range
> > trampolines.
> 
> Have an example of what you are suggesting?

We can make the compiler place 2 NOPs before the function entry point, and 2
NOPs after it using `-fpatchable-function-entry=4,2` (the arguments are
<total>,<before>). On arm64 all instructions are 4 bytes, so we can use the
first two NOPs as an 8-byte literal pool.
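
As a rough sketch of the build side (assuming the existing arm64 Makefile
convention, which today passes -fpatchable-function-entry=2 via
CC_FLAGS_FTRACE), that would become something like:

	# arch/arm64/Makefile (sketch; exact config guards elided)
	CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2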

Ignoring BTI for now, the compiler generates (with some magic labels added here
for demonstration):

	__before_func:
			NOP
			NOP
	func:
			NOP
			NOP
	__remainder_of_func:
			...

At ftrace_init_nop() time we patch that to:

	__before_func:
			// treat the 2 NOPs as an 8-byte literal-pool
			.quad	<default ops pointer> // see below
	func:
			MOV	X9, X30
			NOP
	__remainder_of_func:
			...
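
A rough C sketch of that init-time patching, using the existing insn
helpers; `default_ops` is a hypothetical fallback ops (sketched further
below), and I'm assuming rec->ip points at the first NOP inside func:

	/* Sketch only: initialize one callsite. */
	static int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
	{
		unsigned long pc = rec->ip;	/* first NOP inside func (assumed) */
		u64 *literal = (u64 *)(pc - 8);	/* the 2 NOPs before func */
		u32 mov;

		/*
		 * The literal is only ever read as data, never executed, and
		 * isn't live until the BL is installed, so a plain store is
		 * fine here; later retargeting relies on the store being
		 * single-copy-atomic (naturally-aligned 64-bit).
		 */
		WRITE_ONCE(*literal, (u64)&default_ops);

		/* NOP -> MOV X9, X30 */
		mov = aarch64_insn_gen_move_reg(AARCH64_INSN_REG_9,
						AARCH64_INSN_REG_30,
						AARCH64_INSN_VARIANT_64BIT);
		return aarch64_insn_patch_text_nosync((void *)pc, mov);
	}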

When enabling tracing we patch that to:

	__before_func:
			// patch this with the relevant ops pointer
			.quad	<ops pointer>
	func:
			MOV	X9, X30
			BL	<trampoline>	// common trampoline
	__remainder_of_func:
			...

The `BL <trampoline>` writes the address of __remainder_of_func to X30, so
within the trampoline we can find the ops pointer at a fixed offset from
X30. On arm64 we can load that directly with something like:

	LDR	<tmp>, [X30, # -(__remainder_of_func - __before_func)]
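
With the layout above that offset is a constant: the literal starts 16
bytes before __remainder_of_func (8 bytes of literal, plus the MOV and the
BL at 4 bytes each), so this is just:

	LDR	<tmp>, [X30, #-16]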

... then load the ops->func from that and invoke it (or pass it to a
helper which does):

	// Ignoring the function arguments for this demonstration
	LDR	<tmp2>, [<tmp>, #OPS_FUNC_OFFSET]
	BLR	<tmp2>
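
Here OPS_FUNC_OFFSET would be offsetof(struct ftrace_ops, func), e.g.
generated via asm-offsets (the constant name is made up). Putting the
pieces together, the core of the common trampoline might look like this
(save/restore of argument registers elided; X16/X17 are the usual
intra-procedure-call temporaries):

	trampoline:
			// ... save arg regs; stash X30 (callsite return) ...
			LDR	X16, [X30, #-16]		// ops pointer from the literal
			LDR	X17, [X16, #OPS_FUNC_OFFSET]	// ops->func
			BLR	X17				// invoke the callback
			// ... restore arg regs; X10 := stashed callsite
			//     return, X30 := X9 (the original LR) ...
			RET	X10				// back into func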

That avoids iterating over the list *without* requiring separate
trampolines, and allows us to patch the sequence without requiring
stop-the-world logic (since arm64 has strong requirements for patching
most instructions other than branches and nops).

We can initialize the ops pointer to a default ops that does the whole
__do_for_each_ftrace_ops() dance.
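
i.e. the literal could initially point at something like (sketch only):

	/* Fallback ops: iterate all registered ftrace_ops, as today. */
	static struct ftrace_ops default_ops = {
		.func = ftrace_ops_list_func,
	};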

To handle BTI we can have two trampolines, or we can always reserve 3 NOPs
before the function so that we can have a consistent offset regardless.
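
For the 3-NOP variant, one way the offsets could line up (a sketch; this
also assumes we can keep the literal 8-byte aligned) is to place the
literal in whichever two of the three NOPs keeps it at a fixed distance
below the BL's return address:

	// without BTI:
	__before_func:
			.quad	<ops pointer>
			NOP
	func:
			MOV	X9, X30
			BL	<trampoline>	// X30 := func + 8

	// with BTI:
	__before_func:
			NOP
			.quad	<ops pointer>
	func:
			BTI	C
			MOV	X9, X30
			BL	<trampoline>	// X30 := func + 12

In both cases the literal ends up at X30 - 20, so the trampoline can
always use `LDR <tmp>, [X30, #-20]`.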

Thanks,
Mark.

