* [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
From: Xu Kuohai @ 2022-09-13 16:27 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, bpf
Cc: Mark Rutland, Florent Revest, Catalin Marinas, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
This series adds ftrace direct call for arm64, which is required to attach
bpf trampoline to fentry.
Although there is no consensus yet on how to support ftrace direct calls on
arm64, no patch other than the one I posted in [1] has been sent out, so this
series continues the work of [1], adding long jump support. Ftrace direct
calls now work regardless of the distance between the callsite and the custom
trampoline.
[1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
v2:
- Fix compile and runtime errors caused by ftrace_rec_arch_init
v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
Xu Kuohai (4):
ftrace: Allow users to disable ftrace direct call
arm64: ftrace: Support long jump for ftrace direct call
arm64: ftrace: Add ftrace direct call support
ftrace: Fix dead loop caused by direct call in ftrace selftest
arch/arm64/Kconfig | 2 +
arch/arm64/Makefile | 4 +
arch/arm64/include/asm/ftrace.h | 35 ++++--
arch/arm64/include/asm/patching.h | 2 +
arch/arm64/include/asm/ptrace.h | 6 +-
arch/arm64/kernel/asm-offsets.c | 1 +
arch/arm64/kernel/entry-ftrace.S | 39 ++++--
arch/arm64/kernel/ftrace.c | 198 ++++++++++++++++++++++++++++--
arch/arm64/kernel/patching.c | 14 +++
arch/arm64/net/bpf_jit_comp.c | 4 +
include/linux/ftrace.h | 2 +
kernel/trace/Kconfig | 7 +-
kernel/trace/ftrace.c | 9 +-
kernel/trace/trace_selftest.c | 2 +
14 files changed, 296 insertions(+), 29 deletions(-)
--
2.30.2
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* [PATCH bpf-next v2 1/4] ftrace: Allow users to disable ftrace direct call
From: Xu Kuohai @ 2022-09-13 16:27 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, bpf
Cc: Mark Rutland, Florent Revest, Catalin Marinas, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
From: Xu Kuohai <xukuohai@huawei.com>
To support ftrace direct calls on arm64, multiple NOP instructions need
to be added to the ftrace fentry, which makes the kernel image larger.
Users who don't need direct calls should not have to pay this price, so
allow them to disable this option.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
kernel/trace/Kconfig | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 1052126bdca2..fc8a22f1a6a0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -240,9 +240,14 @@ config DYNAMIC_FTRACE_WITH_REGS
depends on HAVE_DYNAMIC_FTRACE_WITH_REGS
config DYNAMIC_FTRACE_WITH_DIRECT_CALLS
- def_bool y
+ bool "Support for calling custom trampoline from fentry directly"
+ default y
depends on DYNAMIC_FTRACE_WITH_REGS
depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+ help
+ This option enables calling a custom trampoline directly from the
+ ftrace fentry, instead of using the ftrace regs caller. This may
+ reserve more space in the fentry, making the kernel image larger.
config DYNAMIC_FTRACE_WITH_ARGS
def_bool y
--
2.30.2
* [PATCH bpf-next v2 2/4] arm64: ftrace: Support long jump for ftrace direct call
From: Xu Kuohai @ 2022-09-13 16:27 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, bpf
Cc: Mark Rutland, Florent Revest, Catalin Marinas, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
From: Xu Kuohai <xukuohai@huawei.com>
Add long jump support to fentry, so that dynamically allocated trampolines
such as bpf trampolines can be called from fentry directly, as these
trampoline addresses may be out of the range that a single BL instruction
can reach.
The scheme used here is basically the same as commit b2ad54e1533e
("bpf, arm64: Implement bpf_arch_text_poke() for arm64").
1. At compile time, we use -fpatchable-function-entry=7,5 to insert 5
NOPs before function entry and 2 NOPs after function entry:
NOP
NOP
NOP
NOP
NOP
func:
BTI C // if BTI
NOP
NOP
The reason for inserting 5 NOPs before the function entry is that
2 NOPs are patched to LDR and BR instructions, 2 NOPs are used to
store the destination jump address, and 1 NOP is used to adjust
alignment to ensure the destination jump address is stored in 8-byte
aligned memory, which is required by atomic store and load.
2. When there is no trampoline attached, the callsite is patched to:
NOP // extra NOP if func is 8-byte aligned
literal:
.quad ftrace_dummy_tramp
NOP // extra NOP if func is NOT 8-byte aligned
literal_call:
LDR X16, literal
BR X16
func:
BTI C // if BTI
MOV X9, LR
NOP
3. When a long jump trampoline is attached, the callsite is patched to:
NOP // extra NOP if func is 8-byte aligned
literal:
.quad <long-jump-trampoline>
NOP // extra NOP if func is NOT 8-byte aligned
literal_call:
LDR X16, literal
BR X16
func:
BTI C // if BTI
MOV X9, LR
BL literal_call
4. When a short jump trampoline is attached, the callsite is patched to:
NOP // extra NOP if func is 8-byte aligned
literal:
.quad ftrace_dummy_tramp
NOP // extra NOP if func is NOT 8-byte aligned
literal_call:
LDR X16, literal
BR X16
func:
BTI C // if BTI
MOV X9, LR
BL <short-jump-trampoline>
Note that there is always a valid jump address in the literal, either a
custom trampoline address or the dummy trampoline address, which ensures
that we never jump from the callsite to an unknown place.
Also note that only the patching of the callsite is guaranteed to be
atomic and safe. Whether the custom trampoline can be freed must be
checked by the trampoline user. For example, bpf uses refcounts and
task-based RCU to ensure a bpf trampoline can be freed safely.
In my environment, before this patch 2 NOPs are inserted at each function
entry and the generated vmlinux size is 463,649,280 bytes; after this
patch the vmlinux size is 465,069,368 bytes, an increase of 1,420,088
bytes, about 0.3%. In this vmlinux there are 14,376 8-byte aligned
functions and 41,847 unaligned functions. For each aligned function, one
of the five NOPs before the function entry is unnecessary, wasting
57,504 bytes.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
arch/arm64/Makefile | 4 +
arch/arm64/include/asm/ftrace.h | 27 ++--
arch/arm64/include/asm/patching.h | 2 +
arch/arm64/kernel/entry-ftrace.S | 21 +++-
arch/arm64/kernel/ftrace.c | 198 ++++++++++++++++++++++++++++--
arch/arm64/kernel/patching.c | 14 +++
arch/arm64/net/bpf_jit_comp.c | 4 +
include/linux/ftrace.h | 2 +
kernel/trace/ftrace.c | 9 +-
9 files changed, 253 insertions(+), 28 deletions(-)
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 6d9d4a58b898..e540b50db5b8 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -130,7 +130,11 @@ CHECKFLAGS += -D__aarch64__
ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_REGS),y)
KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
+ ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS),y)
+ CC_FLAGS_FTRACE := -fpatchable-function-entry=7,5
+ else
CC_FLAGS_FTRACE := -fpatchable-function-entry=2
+ endif
endif
# Default value
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index dbc45a4157fa..40e63435965b 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -56,27 +56,16 @@ extern void _mcount(unsigned long);
extern void *return_address(unsigned int);
struct dyn_arch_ftrace {
- /* No extra data needed for arm64 */
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+ unsigned long func; /* start address of function */
+#endif
};
extern unsigned long ftrace_graph_call;
extern void return_to_handler(void);
-static inline unsigned long ftrace_call_adjust(unsigned long addr)
-{
- /*
- * Adjust addr to point at the BL in the callsite.
- * See ftrace_init_nop() for the callsite sequence.
- */
- if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
- return addr + AARCH64_INSN_SIZE;
- /*
- * addr is the address of the mcount call instruction.
- * recordmcount does the necessary offset calculation.
- */
- return addr;
-}
+unsigned long ftrace_call_adjust(unsigned long addr);
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
struct dyn_ftrace;
@@ -121,6 +110,14 @@ static inline bool arch_syscall_match_sym_name(const char *sym,
*/
return !strcmp(sym + 8, name);
}
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+
+#define ftrace_dummy_tramp ftrace_dummy_tramp
+extern void ftrace_dummy_tramp(void);
+
+#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
+
#endif /* ifndef __ASSEMBLY__ */
#endif /* __ASM_FTRACE_H */
diff --git a/arch/arm64/include/asm/patching.h b/arch/arm64/include/asm/patching.h
index 6bf5adc56295..b9077205e6b2 100644
--- a/arch/arm64/include/asm/patching.h
+++ b/arch/arm64/include/asm/patching.h
@@ -10,4 +10,6 @@ int aarch64_insn_write(void *addr, u32 insn);
int aarch64_insn_patch_text_nosync(void *addr, u32 insn);
int aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt);
+void aarch64_literal64_write(void *addr, u64 data);
+
#endif /* __ASM_PATCHING_H */
diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
index bd5df50e4643..0bebe3ffdb58 100644
--- a/arch/arm64/kernel/entry-ftrace.S
+++ b/arch/arm64/kernel/entry-ftrace.S
@@ -14,14 +14,16 @@
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
/*
- * Due to -fpatchable-function-entry=2, the compiler has placed two NOPs before
- * the regular function prologue. For an enabled callsite, ftrace_init_nop() and
- * ftrace_make_call() have patched those NOPs to:
+ * Due to -fpatchable-function-entry=2 or -fpatchable-function-entry=7,5, the
+ * compiler has placed two NOPs before the regular function prologue. For an
+ * enabled callsite, ftrace_init_nop() and ftrace_make_call() have patched those
+ * NOPs to:
*
* MOV X9, LR
* BL <entry>
*
- * ... where <entry> is either ftrace_caller or ftrace_regs_caller.
+ * ... where <entry> is ftrace_caller or ftrace_regs_caller or custom
+ * trampoline.
*
* Each instrumented function follows the AAPCS, so here x0-x8 and x18-x30 are
* live (x18 holds the Shadow Call Stack pointer), and x9-x17 are safe to
@@ -327,3 +329,14 @@ SYM_CODE_START(return_to_handler)
ret
SYM_CODE_END(return_to_handler)
#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+SYM_FUNC_START(ftrace_dummy_tramp)
+#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
+ bti j /* ftrace_dummy_tramp is called via "br x10" */
+#endif
+ mov x10, x30
+ mov x30, x9
+ ret x10
+SYM_FUNC_END(ftrace_dummy_tramp)
+#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index ea5dc7c90f46..a311c19bf06a 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -77,6 +77,123 @@ static struct plt_entry *get_ftrace_plt(struct module *mod, unsigned long addr)
return NULL;
}
+enum ftrace_callsite_action {
+ FC_INIT,
+ FC_REMOVE_CALL,
+ FC_ADD_CALL,
+ FC_REPLACE_CALL,
+};
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+
+/*
+ * When func is 8-byte aligned, literal_call is located at func - 8 and literal
+ * is located at func - 16:
+ *
+ * NOP
+ * literal:
+ * .quad ftrace_dummy_tramp
+ * literal_call:
+ * LDR X16, literal
+ * BR X16
+ * func:
+ * BTI C // if BTI
+ * MOV X9, LR
+ * NOP
+ *
+ * When func is not 8-byte aligned, literal_call is located at func - 8 and
+ * literal is located at func - 20:
+ *
+ * literal:
+ * .quad ftrace_dummy_tramp
+ * NOP
+ * literal_call:
+ * LDR X16, literal
+ * BR X16
+ * func:
+ * BTI C // if BTI
+ * MOV X9, LR
+ * NOP
+ */
+
+static unsigned long ftrace_literal_call_addr(struct dyn_ftrace *rec)
+{
+ return rec->arch.func - 2 * AARCH64_INSN_SIZE;
+}
+
+static unsigned long ftrace_literal_addr(struct dyn_ftrace *rec)
+{
+ unsigned long addr = 0;
+
+ addr = ftrace_literal_call_addr(rec);
+ if (addr % sizeof(long))
+ addr -= 3 * AARCH64_INSN_SIZE;
+ else
+ addr -= 2 * AARCH64_INSN_SIZE;
+
+ return addr;
+}
+
+static void ftrace_update_literal(unsigned long literal_addr, unsigned long call_target,
+ int action)
+{
+ unsigned long dummy_tramp = (unsigned long)&ftrace_dummy_tramp;
+
+ if (action == FC_INIT || action == FC_REMOVE_CALL)
+ aarch64_literal64_write((void *)literal_addr, dummy_tramp);
+ else if (action == FC_ADD_CALL)
+ aarch64_literal64_write((void *)literal_addr, call_target);
+}
+
+static int ftrace_init_literal(struct module *mod, struct dyn_ftrace *rec)
+{
+ int ret;
+ u32 old, new;
+ unsigned long addr;
+ unsigned long pc = rec->ip - AARCH64_INSN_SIZE;
+
+ old = aarch64_insn_gen_nop();
+
+ addr = ftrace_literal_addr(rec);
+ ftrace_update_literal(addr, 0, FC_INIT);
+
+ pc = ftrace_literal_call_addr(rec);
+ new = aarch64_insn_gen_load_literal(pc, addr, AARCH64_INSN_REG_16,
+ true);
+ ret = ftrace_modify_code(pc, old, new, true);
+ if (ret)
+ return ret;
+
+ pc += AARCH64_INSN_SIZE;
+ new = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_16,
+ AARCH64_INSN_BRANCH_NOLINK);
+ return ftrace_modify_code(pc, old, new, true);
+}
+
+#else
+
+static unsigned long ftrace_literal_addr(struct dyn_ftrace *rec)
+{
+ return 0;
+}
+
+static unsigned long ftrace_literal_call_addr(struct dyn_ftrace *rec)
+{
+ return 0;
+}
+
+static void ftrace_update_literal(unsigned long literal_addr, unsigned long call_target,
+ int action)
+{
+}
+
+static int ftrace_init_literal(struct module *mod, struct dyn_ftrace *rec)
+{
+ return 0;
+}
+
+#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
+
/*
* Find the address the callsite must branch to in order to reach '*addr'.
*
@@ -88,7 +205,8 @@ static struct plt_entry *get_ftrace_plt(struct module *mod, unsigned long addr)
*/
static bool ftrace_find_callable_addr(struct dyn_ftrace *rec,
struct module *mod,
- unsigned long *addr)
+ unsigned long *addr,
+ int action)
{
unsigned long pc = rec->ip;
long offset = (long)*addr - (long)pc;
@@ -101,6 +219,15 @@ static bool ftrace_find_callable_addr(struct dyn_ftrace *rec,
if (offset >= -SZ_128M && offset < SZ_128M)
return true;
+ if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)) {
+ unsigned long literal_addr;
+
+ literal_addr = ftrace_literal_addr(rec);
+ ftrace_update_literal(literal_addr, *addr, action);
+ *addr = ftrace_literal_call_addr(rec);
+ return true;
+ }
+
/*
* When the target is outside of the range of a 'BL' instruction, we
* must use a PLT to reach it. We can only place PLTs for modules, and
@@ -145,7 +272,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
unsigned long pc = rec->ip;
u32 old, new;
- if (!ftrace_find_callable_addr(rec, NULL, &addr))
+ if (!ftrace_find_callable_addr(rec, NULL, &addr, FC_ADD_CALL))
return -EINVAL;
old = aarch64_insn_gen_nop();
@@ -161,9 +288,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
unsigned long pc = rec->ip;
u32 old, new;
- if (!ftrace_find_callable_addr(rec, NULL, &old_addr))
+ if (!ftrace_find_callable_addr(rec, NULL, &old_addr, FC_REPLACE_CALL))
return -EINVAL;
- if (!ftrace_find_callable_addr(rec, NULL, &addr))
+ if (!ftrace_find_callable_addr(rec, NULL, &addr, FC_ADD_CALL))
return -EINVAL;
old = aarch64_insn_gen_branch_imm(pc, old_addr,
@@ -188,18 +315,26 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
* | NOP | MOV X9, LR | MOV X9, LR |
* | NOP | NOP | BL <entry> |
*
- * The LR value will be recovered by ftrace_regs_entry, and restored into LR
- * before returning to the regular function prologue. When a function is not
- * being traced, the MOV is not harmful given x9 is not live per the AAPCS.
+ * The LR value will be recovered by ftrace_regs_entry or custom trampoline,
+ * and restored into LR before returning to the regular function prologue.
+ * When a function is not being traced, the MOV is not harmful given x9 is
+ * not live per the AAPCS.
*
* Note: ftrace_process_locs() has pre-adjusted rec->ip to be the address of
* the BL.
*/
int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
{
+ int ret;
unsigned long pc = rec->ip - AARCH64_INSN_SIZE;
u32 old, new;
+ if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)) {
+ ret = ftrace_init_literal(mod, rec);
+ if (ret)
+ return ret;
+ }
+
old = aarch64_insn_gen_nop();
new = aarch64_insn_gen_move_reg(AARCH64_INSN_REG_9,
AARCH64_INSN_REG_LR,
@@ -208,6 +343,45 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
}
#endif
+unsigned long ftrace_call_adjust(unsigned long addr)
+{
+ if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)) {
+ u32 insn;
+ u32 nop = aarch64_insn_gen_nop();
+
+ /* Skip the first 5 NOPS */
+ addr += 5 * AARCH64_INSN_SIZE;
+
+ if (aarch64_insn_read((void *)addr, &insn))
+ return 0;
+
+ if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) {
+ if (insn != nop) {
+ addr += AARCH64_INSN_SIZE;
+ if (aarch64_insn_read((void *)addr, &insn))
+ return 0;
+ }
+ }
+
+ if (WARN_ON_ONCE(insn != nop))
+ return 0;
+
+ return addr + AARCH64_INSN_SIZE;
+ } else if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS)) {
+ /*
+ * Adjust addr to point at the BL in the callsite.
+ * See ftrace_init_nop() for the callsite sequence.
+ */
+ return addr + AARCH64_INSN_SIZE;
+ }
+
+ /*
+ * addr is the address of the mcount call instruction.
+ * recordmcount does the necessary offset calculation.
+ */
+ return addr;
+}
+
/*
* Turn off the call to ftrace_caller() in instrumented function
*/
@@ -217,7 +391,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
unsigned long pc = rec->ip;
u32 old = 0, new;
- if (!ftrace_find_callable_addr(rec, mod, &addr))
+ if (!ftrace_find_callable_addr(rec, mod, &addr, FC_REMOVE_CALL))
return -EINVAL;
old = aarch64_insn_gen_branch_imm(pc, addr, AARCH64_INSN_BRANCH_LINK);
@@ -231,6 +405,14 @@ void arch_ftrace_update_code(int command)
command |= FTRACE_MAY_SLEEP;
ftrace_modify_all_code(command);
}
+
+void ftrace_rec_arch_init(struct dyn_ftrace *rec, unsigned long func)
+{
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+ rec->arch.func = func + 5 * AARCH64_INSN_SIZE;
+#endif
+}
+
#endif /* CONFIG_DYNAMIC_FTRACE */
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 33e0fabc0b79..3a4326c1ca80 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -83,6 +83,20 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
return ret;
}
+void __kprobes aarch64_literal64_write(void *addr, u64 data)
+{
+ u64 *waddr;
+ unsigned long flags = 0;
+
+ raw_spin_lock_irqsave(&patch_lock, flags);
+ waddr = patch_map(addr, FIX_TEXT_POKE0);
+
+ WRITE_ONCE(*waddr, data);
+
+ patch_unmap(FIX_TEXT_POKE0);
+ raw_spin_unlock_irqrestore(&patch_lock, flags);
+}
+
int __kprobes aarch64_insn_write(void *addr, u32 insn)
{
return __aarch64_insn_write(addr, cpu_to_le32(insn));
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 34d78ca16beb..e42955b78174 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -625,6 +625,9 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
return 0;
}
+#ifdef ftrace_dummy_tramp
+#define dummy_tramp ftrace_dummy_tramp
+#else
void dummy_tramp(void);
asm (
@@ -641,6 +644,7 @@ asm (
" .size dummy_tramp, .-dummy_tramp\n"
" .popsection\n"
);
+#endif
/* build a plt initialized like this:
*
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 0b61371e287b..d5a385453b17 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -566,6 +566,8 @@ struct dyn_ftrace {
struct dyn_arch_ftrace arch;
};
+void ftrace_rec_arch_init(struct dyn_ftrace *rec, unsigned long addr);
+
int ftrace_set_filter_ip(struct ftrace_ops *ops, unsigned long ip,
int remove, int reset);
int ftrace_set_filter_ips(struct ftrace_ops *ops, unsigned long *ips,
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index bc921a3f7ea8..4e5b5aa9812b 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6664,6 +6664,10 @@ static void test_is_sorted(unsigned long *start, unsigned long count)
}
#endif
+void __weak ftrace_rec_arch_init(struct dyn_ftrace *rec, unsigned long addr)
+{
+}
+
static int ftrace_process_locs(struct module *mod,
unsigned long *start,
unsigned long *end)
@@ -6726,7 +6730,9 @@ static int ftrace_process_locs(struct module *mod,
pg = start_pg;
while (p < end) {
unsigned long end_offset;
- addr = ftrace_call_adjust(*p++);
+ unsigned long nop_addr = *p++;
+
+ addr = ftrace_call_adjust(nop_addr);
/*
* Some architecture linkers will pad between
* the different mcount_loc sections of different
@@ -6746,6 +6752,7 @@ static int ftrace_process_locs(struct module *mod,
rec = &pg->records[pg->index++];
rec->ip = addr;
+ ftrace_rec_arch_init(rec, nop_addr);
}
/* We should have used all pages */
--
2.30.2
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index ea5dc7c90f46..a311c19bf06a 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -77,6 +77,123 @@ static struct plt_entry *get_ftrace_plt(struct module *mod, unsigned long addr)
return NULL;
}
+enum ftrace_callsite_action {
+ FC_INIT,
+ FC_REMOVE_CALL,
+ FC_ADD_CALL,
+ FC_REPLACE_CALL,
+};
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+
+/*
+ * When func is 8-byte aligned, literal_call is located at func - 8 and literal
+ * is located at func - 16:
+ *
+ * NOP
+ * literal:
+ * .quad ftrace_dummy_tramp
+ * literal_call:
+ * LDR X16, literal
+ * BR X16
+ * func:
+ * BTI C // if BTI
+ * MOV X9, LR
+ * NOP
+ *
+ * When func is not 8-byte aligned, literal_call is located at func - 8 and
+ * literal is located at func - 20:
+ *
+ * literal:
+ * .quad ftrace_dummy_tramp
+ * NOP
+ * literal_call:
+ * LDR X16, literal
+ * BR X16
+ * func:
+ * BTI C // if BTI
+ * MOV X9, LR
+ * NOP
+ */
+
+static unsigned long ftrace_literal_call_addr(struct dyn_ftrace *rec)
+{
+ return rec->arch.func - 2 * AARCH64_INSN_SIZE;
+}
+
+static unsigned long ftrace_literal_addr(struct dyn_ftrace *rec)
+{
+ unsigned long addr = 0;
+
+ addr = ftrace_literal_call_addr(rec);
+ if (addr % sizeof(long))
+ addr -= 3 * AARCH64_INSN_SIZE;
+ else
+ addr -= 2 * AARCH64_INSN_SIZE;
+
+ return addr;
+}
+
+static void ftrace_update_literal(unsigned long literal_addr, unsigned long call_target,
+ int action)
+{
+ unsigned long dummy_tramp = (unsigned long)&ftrace_dummy_tramp;
+
+ if (action == FC_INIT || action == FC_REMOVE_CALL)
+ aarch64_literal64_write((void *)literal_addr, dummy_tramp);
+ else if (action == FC_ADD_CALL)
+ aarch64_literal64_write((void *)literal_addr, call_target);
+}
+
+static int ftrace_init_literal(struct module *mod, struct dyn_ftrace *rec)
+{
+ int ret;
+ u32 old, new;
+ unsigned long addr;
+ unsigned long pc = rec->ip - AARCH64_INSN_SIZE;
+
+ old = aarch64_insn_gen_nop();
+
+ addr = ftrace_literal_addr(rec);
+ ftrace_update_literal(addr, 0, FC_INIT);
+
+ pc = ftrace_literal_call_addr(rec);
+ new = aarch64_insn_gen_load_literal(pc, addr, AARCH64_INSN_REG_16,
+ true);
+ ret = ftrace_modify_code(pc, old, new, true);
+ if (ret)
+ return ret;
+
+ pc += AARCH64_INSN_SIZE;
+ new = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_16,
+ AARCH64_INSN_BRANCH_NOLINK);
+ return ftrace_modify_code(pc, old, new, true);
+}
+
+#else
+
+static unsigned long ftrace_literal_addr(struct dyn_ftrace *rec)
+{
+ return 0;
+}
+
+static unsigned long ftrace_literal_call_addr(struct dyn_ftrace *rec)
+{
+ return 0;
+}
+
+static void ftrace_update_literal(unsigned long literal_addr, unsigned long call_target,
+ int action)
+{
+}
+
+static int ftrace_init_literal(struct module *mod, struct dyn_ftrace *rec)
+{
+ return 0;
+}
+
+#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
+
/*
* Find the address the callsite must branch to in order to reach '*addr'.
*
@@ -88,7 +205,8 @@ static struct plt_entry *get_ftrace_plt(struct module *mod, unsigned long addr)
*/
static bool ftrace_find_callable_addr(struct dyn_ftrace *rec,
struct module *mod,
- unsigned long *addr)
+ unsigned long *addr,
+ int action)
{
unsigned long pc = rec->ip;
long offset = (long)*addr - (long)pc;
@@ -101,6 +219,15 @@ static bool ftrace_find_callable_addr(struct dyn_ftrace *rec,
if (offset >= -SZ_128M && offset < SZ_128M)
return true;
+ if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)) {
+ unsigned long literal_addr;
+
+ literal_addr = ftrace_literal_addr(rec);
+ ftrace_update_literal(literal_addr, *addr, action);
+ *addr = ftrace_literal_call_addr(rec);
+ return true;
+ }
+
/*
* When the target is outside of the range of a 'BL' instruction, we
* must use a PLT to reach it. We can only place PLTs for modules, and
@@ -145,7 +272,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
unsigned long pc = rec->ip;
u32 old, new;
- if (!ftrace_find_callable_addr(rec, NULL, &addr))
+ if (!ftrace_find_callable_addr(rec, NULL, &addr, FC_ADD_CALL))
return -EINVAL;
old = aarch64_insn_gen_nop();
@@ -161,9 +288,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
unsigned long pc = rec->ip;
u32 old, new;
- if (!ftrace_find_callable_addr(rec, NULL, &old_addr))
+ if (!ftrace_find_callable_addr(rec, NULL, &old_addr, FC_REPLACE_CALL))
return -EINVAL;
- if (!ftrace_find_callable_addr(rec, NULL, &addr))
+ if (!ftrace_find_callable_addr(rec, NULL, &addr, FC_ADD_CALL))
return -EINVAL;
old = aarch64_insn_gen_branch_imm(pc, old_addr,
@@ -188,18 +315,26 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
* | NOP | MOV X9, LR | MOV X9, LR |
* | NOP | NOP | BL <entry> |
*
- * The LR value will be recovered by ftrace_regs_entry, and restored into LR
- * before returning to the regular function prologue. When a function is not
- * being traced, the MOV is not harmful given x9 is not live per the AAPCS.
+ * The LR value will be recovered by ftrace_regs_entry or custom trampoline,
+ * and restored into LR before returning to the regular function prologue.
+ * When a function is not being traced, the MOV is not harmful given x9 is
+ * not live per the AAPCS.
*
* Note: ftrace_process_locs() has pre-adjusted rec->ip to be the address of
* the BL.
*/
int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
{
+ int ret;
unsigned long pc = rec->ip - AARCH64_INSN_SIZE;
u32 old, new;
+ if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)) {
+ ret = ftrace_init_literal(mod, rec);
+ if (ret)
+ return ret;
+ }
+
old = aarch64_insn_gen_nop();
new = aarch64_insn_gen_move_reg(AARCH64_INSN_REG_9,
AARCH64_INSN_REG_LR,
@@ -208,6 +343,45 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
}
#endif
+unsigned long ftrace_call_adjust(unsigned long addr)
+{
+ if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)) {
+ u32 insn;
+ u32 nop = aarch64_insn_gen_nop();
+
+ /* Skip the first 5 NOPs */
+ addr += 5 * AARCH64_INSN_SIZE;
+
+ if (aarch64_insn_read((void *)addr, &insn))
+ return 0;
+
+ if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) {
+ if (insn != nop) {
+ addr += AARCH64_INSN_SIZE;
+ if (aarch64_insn_read((void *)addr, &insn))
+ return 0;
+ }
+ }
+
+ if (WARN_ON_ONCE(insn != nop))
+ return 0;
+
+ return addr + AARCH64_INSN_SIZE;
+ } else if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS)) {
+ /*
+ * Adjust addr to point at the BL in the callsite.
+ * See ftrace_init_nop() for the callsite sequence.
+ */
+ return addr + AARCH64_INSN_SIZE;
+ }
+
+ /*
+ * addr is the address of the mcount call instruction.
+ * recordmcount does the necessary offset calculation.
+ */
+ return addr;
+}
+
/*
* Turn off the call to ftrace_caller() in instrumented function
*/
@@ -217,7 +391,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
unsigned long pc = rec->ip;
u32 old = 0, new;
- if (!ftrace_find_callable_addr(rec, mod, &addr))
+ if (!ftrace_find_callable_addr(rec, mod, &addr, FC_REMOVE_CALL))
return -EINVAL;
old = aarch64_insn_gen_branch_imm(pc, addr, AARCH64_INSN_BRANCH_LINK);
@@ -231,6 +405,14 @@ void arch_ftrace_update_code(int command)
command |= FTRACE_MAY_SLEEP;
ftrace_modify_all_code(command);
}
+
+void ftrace_rec_arch_init(struct dyn_ftrace *rec, unsigned long func)
+{
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+ rec->arch.func = func + 5 * AARCH64_INSN_SIZE;
+#endif
+}
+
#endif /* CONFIG_DYNAMIC_FTRACE */
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 33e0fabc0b79..3a4326c1ca80 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -83,6 +83,20 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
return ret;
}
+void __kprobes aarch64_literal64_write(void *addr, u64 data)
+{
+ u64 *waddr;
+ unsigned long flags = 0;
+
+ raw_spin_lock_irqsave(&patch_lock, flags);
+ waddr = patch_map(addr, FIX_TEXT_POKE0);
+
+ WRITE_ONCE(*waddr, data);
+
+ patch_unmap(FIX_TEXT_POKE0);
+ raw_spin_unlock_irqrestore(&patch_lock, flags);
+}
+
int __kprobes aarch64_insn_write(void *addr, u32 insn)
{
return __aarch64_insn_write(addr, cpu_to_le32(insn));
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 34d78ca16beb..e42955b78174 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -625,6 +625,9 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
return 0;
}
+#ifdef ftrace_dummy_tramp
+#define dummy_tramp ftrace_dummy_tramp
+#else
void dummy_tramp(void);
asm (
@@ -641,6 +644,7 @@ asm (
" .size dummy_tramp, .-dummy_tramp\n"
" .popsection\n"
);
+#endif
/* build a plt initialized like this:
*
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 0b61371e287b..d5a385453b17 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -566,6 +566,8 @@ struct dyn_ftrace {
struct dyn_arch_ftrace arch;
};
+void ftrace_rec_arch_init(struct dyn_ftrace *rec, unsigned long addr);
+
int ftrace_set_filter_ip(struct ftrace_ops *ops, unsigned long ip,
int remove, int reset);
int ftrace_set_filter_ips(struct ftrace_ops *ops, unsigned long *ips,
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index bc921a3f7ea8..4e5b5aa9812b 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6664,6 +6664,10 @@ static void test_is_sorted(unsigned long *start, unsigned long count)
}
#endif
+void __weak ftrace_rec_arch_init(struct dyn_ftrace *rec, unsigned long addr)
+{
+}
+
static int ftrace_process_locs(struct module *mod,
unsigned long *start,
unsigned long *end)
@@ -6726,7 +6730,9 @@ static int ftrace_process_locs(struct module *mod,
pg = start_pg;
while (p < end) {
unsigned long end_offset;
- addr = ftrace_call_adjust(*p++);
+ unsigned long nop_addr = *p++;
+
+ addr = ftrace_call_adjust(nop_addr);
/*
* Some architecture linkers will pad between
* the different mcount_loc sections of different
@@ -6746,6 +6752,7 @@ static int ftrace_process_locs(struct module *mod,
rec = &pg->records[pg->index++];
rec->ip = addr;
+ ftrace_rec_arch_init(rec, nop_addr);
}
/* We should have used all pages */
--
2.30.2
* [PATCH bpf-next v2 3/4] arm64: ftrace: Add ftrace direct call support
2022-09-13 16:27 ` Xu Kuohai
@ 2022-09-13 16:27 ` Xu Kuohai
-1 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-09-13 16:27 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, bpf
Cc: Mark Rutland, Florent Revest, Catalin Marinas, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
From: Xu Kuohai <xukuohai@huawei.com>
Add ftrace direct support for arm64.
1. When only a custom trampoline is attached, patch the fentry callsite to
call the custom trampoline directly.
2. When the ftrace caller and a custom trampoline coexist, jump from the
fentry to the ftrace caller first, then jump to the custom trampoline when
the ftrace caller exits. As pt_regs->orig_x0 is currently unused by ftrace,
its slot is reused to pass the custom trampoline address from the ftrace
caller to the callsite.
In short, this patch does the same thing as the x86 commit 562955fe6a55
("ftrace/x86: Add register_ftrace_direct() for custom trampolines").
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
arch/arm64/Kconfig | 2 ++
arch/arm64/include/asm/ftrace.h | 12 ++++++++++++
arch/arm64/include/asm/ptrace.h | 6 +++++-
arch/arm64/kernel/asm-offsets.c | 1 +
arch/arm64/kernel/entry-ftrace.S | 18 +++++++++++++++---
5 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 571cc234d0b3..e2f6ca75b881 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -180,6 +180,8 @@ config ARM64
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
+ select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS \
+ if DYNAMIC_FTRACE_WITH_REGS
select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY \
if DYNAMIC_FTRACE_WITH_REGS
select HAVE_EFFICIENT_UNALIGNED_ACCESS
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 40e63435965b..b07a3c24f918 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -67,6 +67,18 @@ extern void return_to_handler(void);
unsigned long ftrace_call_adjust(unsigned long addr);
+#ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+static inline void arch_ftrace_set_direct_caller(struct pt_regs *regs,
+ unsigned long addr)
+{
+ /*
+ * Place custom trampoline address in regs->custom_tramp to let ftrace
+ * trampoline jump to it.
+ */
+ regs->custom_tramp = addr;
+}
+#endif /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
+
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
struct dyn_ftrace;
struct ftrace_ops;
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index 41b332c054ab..9701c38fcc5f 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -185,7 +185,11 @@ struct pt_regs {
u64 pstate;
};
};
- u64 orig_x0;
+ union {
+ u64 orig_x0;
+ /* Only used by ftrace to save custom trampoline address */
+ u64 custom_tramp;
+ };
#ifdef __AARCH64EB__
u32 unused2;
s32 syscallno;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 1197e7679882..56d4acc52a86 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -80,6 +80,7 @@ int main(void)
DEFINE(S_SDEI_TTBR1, offsetof(struct pt_regs, sdei_ttbr1));
DEFINE(S_PMR_SAVE, offsetof(struct pt_regs, pmr_save));
DEFINE(S_STACKFRAME, offsetof(struct pt_regs, stackframe));
+ DEFINE(S_CUSTOM_TRAMP, offsetof(struct pt_regs, custom_tramp));
DEFINE(PT_REGS_SIZE, sizeof(struct pt_regs));
BLANK();
#ifdef CONFIG_COMPAT
diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
index 0bebe3ffdb58..ae03df89d031 100644
--- a/arch/arm64/kernel/entry-ftrace.S
+++ b/arch/arm64/kernel/entry-ftrace.S
@@ -62,6 +62,9 @@
str x29, [sp, #S_FP]
.endif
+ /* Set custom_tramp to zero */
+ str xzr, [sp, #S_CUSTOM_TRAMP]
+
/* Save the callsite's SP and LR */
add x10, sp, #(PT_REGS_SIZE + 16)
stp x9, x10, [sp, #S_LR]
@@ -114,12 +117,21 @@ SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
/* Restore the callsite's FP, LR, PC */
ldr x29, [sp, #S_FP]
ldr x30, [sp, #S_LR]
- ldr x9, [sp, #S_PC]
-
+ ldr x10, [sp, #S_PC]
+
+ ldr x11, [sp, #S_CUSTOM_TRAMP]
+ cbz x11, 1f
+ /* Set x9 to parent ip before jump to custom trampoline */
+ mov x9, x30
+ /* Set lr to self ip */
+ ldr x30, [sp, #S_PC]
+ /* Set x10 (used for return address) to custom trampoline */
+ mov x10, x11
+1:
/* Restore the callsite's SP */
add sp, sp, #PT_REGS_SIZE + 16
- ret x9
+ ret x10
SYM_CODE_END(ftrace_common)
#else /* CONFIG_DYNAMIC_FTRACE_WITH_REGS */
--
2.30.2
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* [PATCH bpf-next v2 4/4] ftrace: Fix dead loop caused by direct call in ftrace selftest
2022-09-13 16:27 ` Xu Kuohai
@ 2022-09-13 16:27 ` Xu Kuohai
-1 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-09-13 16:27 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, bpf
Cc: Mark Rutland, Florent Revest, Catalin Marinas, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
From: Xu Kuohai <xukuohai@huawei.com>
After direct call support is enabled for arm64, the ftrace selftest enters a
dead loop:
<trace_selftest_dynamic_test_func>:
00 bti c
01 mov x9, x30                            <trace_direct_tramp>:
02 bl <trace_direct_tramp> ----------------->  ret
                                                |
                            lr/x30 is 03, return to 03
                                                |
03 mov w0, #0x0  <------------------------------|
 |                                              |
 |                  dead loop!                  |
 |                                              |
04 ret ---- lr/x30 is still 03, go back to 03 --|
The reason is that when the direct caller trace_direct_tramp() returns
to the patched function trace_selftest_dynamic_test_func(), lr is still
the address after the instrumented instruction in the patched function,
so when the patched function exits, it returns to itself!
To fix this issue, we need to restore lr before trace_direct_tramp()
exits, so use a dedicated trace_direct_tramp() for arm64.
Reported-by: Li Huafei <lihuafei1@huawei.com>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/arm64/include/asm/ftrace.h | 4 ++++
kernel/trace/trace_selftest.c | 2 ++
2 files changed, 6 insertions(+)
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index b07a3c24f918..15247f73bf54 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -128,6 +128,10 @@ static inline bool arch_syscall_match_sym_name(const char *sym,
#define ftrace_dummy_tramp ftrace_dummy_tramp
extern void ftrace_dummy_tramp(void);
+#ifdef CONFIG_FTRACE_SELFTEST
+#define trace_direct_tramp ftrace_dummy_tramp
+#endif /* CONFIG_FTRACE_SELFTEST */
+
#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
#endif /* ifndef __ASSEMBLY__ */
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index a2d301f58ced..092239bc373c 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -785,8 +785,10 @@ static struct fgraph_ops fgraph_ops __initdata = {
};
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+#ifndef trace_direct_tramp
noinline __noclone static void trace_direct_tramp(void) { }
#endif
+#endif
/*
* Pretty much the same than for the function tracer from which the selftest
--
2.30.2
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-09-13 16:27 ` Xu Kuohai
@ 2022-09-22 18:01 ` Daniel Borkmann
-1 siblings, 0 replies; 60+ messages in thread
From: Daniel Borkmann @ 2022-09-22 18:01 UTC (permalink / raw)
To: Xu Kuohai, linux-arm-kernel, linux-kernel, bpf
Cc: Mark Rutland, Florent Revest, Catalin Marinas, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On 9/13/22 6:27 PM, Xu Kuohai wrote:
> This series adds ftrace direct call for arm64, which is required to attach
> bpf trampoline to fentry.
>
> Although there is no agreement on how to support ftrace direct call on arm64,
> no patch has been posted except the one I posted in [1], so this series
> continues the work of [1] with the addition of long jump support. Now ftrace
> direct call works regardless of the distance between the callsite and custom
> trampoline.
>
> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>
> v2:
> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>
> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>
> Xu Kuohai (4):
> ftrace: Allow users to disable ftrace direct call
> arm64: ftrace: Support long jump for ftrace direct call
> arm64: ftrace: Add ftrace direct call support
> ftrace: Fix dead loop caused by direct call in ftrace selftest
Given there's just a tiny fraction touching the BPF JIT and most of the changes
are around core arm64, it probably makes sense that this series goes via
Catalin/Will through the arm64 tree instead of bpf-next, if it looks good to
them. Catalin/Will, thoughts (Ack + bpf-next could work too, but I'd presume
this just results in merge conflicts)?
> arch/arm64/Kconfig | 2 +
> arch/arm64/Makefile | 4 +
> arch/arm64/include/asm/ftrace.h | 35 ++++--
> arch/arm64/include/asm/patching.h | 2 +
> arch/arm64/include/asm/ptrace.h | 6 +-
> arch/arm64/kernel/asm-offsets.c | 1 +
> arch/arm64/kernel/entry-ftrace.S | 39 ++++--
> arch/arm64/kernel/ftrace.c | 198 ++++++++++++++++++++++++++++--
> arch/arm64/kernel/patching.c | 14 +++
> arch/arm64/net/bpf_jit_comp.c | 4 +
> include/linux/ftrace.h | 2 +
> kernel/trace/Kconfig | 7 +-
> kernel/trace/ftrace.c | 9 +-
> kernel/trace/trace_selftest.c | 2 +
> 14 files changed, 296 insertions(+), 29 deletions(-)
Thanks,
Daniel
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-09-22 18:01 ` Daniel Borkmann
@ 2022-09-26 14:40 ` Catalin Marinas
-1 siblings, 0 replies; 60+ messages in thread
From: Catalin Marinas @ 2022-09-26 14:40 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Mark Rutland,
Florent Revest, Will Deacon, Jean-Philippe Brucker,
Steven Rostedt, Ingo Molnar, Oleg Nesterov, Alexei Starovoitov,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> On 9/13/22 6:27 PM, Xu Kuohai wrote:
> > This series adds ftrace direct call for arm64, which is required to attach
> > bpf trampoline to fentry.
> >
> > Although there is no agreement on how to support ftrace direct call on arm64,
> > no patch has been posted except the one I posted in [1], so this series
> > continues the work of [1] with the addition of long jump support. Now ftrace
> > direct call works regardless of the distance between the callsite and custom
> > trampoline.
> >
> > [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> >
> > v2:
> > - Fix compile and runtime errors caused by ftrace_rec_arch_init
> >
> > v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> >
> > Xu Kuohai (4):
> > ftrace: Allow users to disable ftrace direct call
> > arm64: ftrace: Support long jump for ftrace direct call
> > arm64: ftrace: Add ftrace direct call support
> > ftrace: Fix dead loop caused by direct call in ftrace selftest
>
> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> it probably makes sense that this series goes via Catalin/Will through arm64 tree
> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> could work too, but I'd presume this just results in merge conflicts)?
I think it makes sense for the series to go via the arm64 tree but I'd
like Mark to have a look at the ftrace changes first.
Thanks.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-09-26 14:40 ` Catalin Marinas
@ 2022-09-26 17:43 ` Mark Rutland
-1 siblings, 0 replies; 60+ messages in thread
From: Mark Rutland @ 2022-09-26 17:43 UTC (permalink / raw)
To: Catalin Marinas
Cc: Daniel Borkmann, Xu Kuohai, linux-arm-kernel, linux-kernel, bpf,
Florent Revest, Will Deacon, Jean-Philippe Brucker,
Steven Rostedt, Ingo Molnar, Oleg Nesterov, Alexei Starovoitov,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> > On 9/13/22 6:27 PM, Xu Kuohai wrote:
> > > This series adds ftrace direct call for arm64, which is required to attach
> > > bpf trampoline to fentry.
> > >
> > > Although there is no agreement on how to support ftrace direct call on arm64,
> > > no patch has been posted except the one I posted in [1], so this series
> > > continues the work of [1] with the addition of long jump support. Now ftrace
> > > direct call works regardless of the distance between the callsite and custom
> > > trampoline.
> > >
> > > [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> > >
> > > v2:
> > > - Fix compile and runtime errors caused by ftrace_rec_arch_init
> > >
> > > v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> > >
> > > Xu Kuohai (4):
> > > ftrace: Allow users to disable ftrace direct call
> > > arm64: ftrace: Support long jump for ftrace direct call
> > > arm64: ftrace: Add ftrace direct call support
> > > ftrace: Fix dead loop caused by direct call in ftrace selftest
> >
> > Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> > it probably makes sense that this series goes via Catalin/Will through arm64 tree
> > instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> > could work too, but I'd presume this just results in merge conflicts)?
>
> I think it makes sense for the series to go via the arm64 tree but I'd
> like Mark to have a look at the ftrace changes first.
From a quick scan, I still don't think this is quite right, and as it stands I
believe this will break backtracing (as the instructions before the function
entry point will not be symbolized correctly, getting in the way of
RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
feedback there, as I have a mechanism in mind that was a little simpler.
I'll try to reply with some more detail tomorrow, but I don't think this is the
right approach, and as mentioned previously (and e.g. at LPC) I'd strongly
prefer to *not* implement direct calls, so that we can have more consistent
entry/exit handling.
Thanks,
Mark.
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-09-26 17:43 ` Mark Rutland
@ 2022-09-27 4:49 ` Xu Kuohai
-1 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-09-27 4:49 UTC (permalink / raw)
To: Mark Rutland, Catalin Marinas
Cc: Daniel Borkmann, Xu Kuohai, linux-arm-kernel, linux-kernel, bpf,
Florent Revest, Will Deacon, Jean-Philippe Brucker,
Steven Rostedt, Ingo Molnar, Oleg Nesterov, Alexei Starovoitov,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
On 9/27/2022 1:43 AM, Mark Rutland wrote:
> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>> bpf trampoline to fentry.
>>>>
>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>> no patch has been posted except the one I posted in [1], so this series
>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>> direct call works regardless of the distance between the callsite and custom
>>>> trampoline.
>>>>
>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>
>>>> v2:
>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>
>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>
>>>> Xu Kuohai (4):
>>>> ftrace: Allow users to disable ftrace direct call
>>>> arm64: ftrace: Support long jump for ftrace direct call
>>>> arm64: ftrace: Add ftrace direct call support
>>>> ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>
>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>> could work too, but I'd presume this just results in merge conflicts)?
>>
>> I think it makes sense for the series to go via the arm64 tree but I'd
>> like Mark to have a look at the ftrace changes first.
>
> From a quick scan, I still don't think this is quite right, and as it stands I
> believe this will break backtracing (as the instructions before the function
> entry point will not be symbolized correctly, getting in the way of
> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> feedback there, as I have a mechanism in mind that was a little simpler.
>
Thanks for the review. I have some thoughts about reliable stacktrace.
If PC is not in the range of literal_call, stacktrace works as before without
changes.
If the PC is within literal_call, for example when the task is interrupted by
an IRQ, I think there are 2 problems:
1. Caller LR is not pushed to the stack yet, so caller's address and name
will be missing from the backtrace.
2. Since PC is not in func's address range, no symbol name will be found, so
func name is also missing.
Problem 1 is not introduced by this patchset, but its probability of occurring
may be increased by it. I think this problem should be addressed by a reliable
stacktrace scheme, such as ORC on x86.
Problem 2 is indeed introduced by this patchset. I think there are at least 3
ways to deal with it:
1. Add a symbol name for literal_call.
2. Hack the backtrace routine: if no symbol name is found for a PC during
backtrace, we can check whether the PC is in literal_call, then adjust the PC
and try again.
3. Move literal_call to the func's address range, for example:
a. Compile with -fpatchable-function-entry=7
func:
BTI C
NOP
NOP
NOP
NOP
NOP
NOP
NOP
func_body:
...
b. When disabled, patch it to
func:
BTI C
B func_body
literal:
.quad dummy_tramp
literal_call:
LDR X16, literal
MOV X9, LR
BLR X16
func_body:
...
c. When enabled and target is out-of-range, patch it to
func:
BTI C
B literal_call
literal:
.quad custom_trampoline
literal_call:
LDR X16, literal
MOV X9, LR
BLR X16
func_body:
...
d. When enabled and target is in range, patch it to
func:
BTI C
B direct_call
literal:
.quad dummy_tramp
LDR X16, literal
direct_call:
MOV X9, LR
BL custom_trampoline
func_body:
...
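The choice between cases (c) and (d) above comes down to whether the
trampoline is reachable by a single BL, whose signed 26-bit word immediate
gives a range of +/-128 MiB from the callsite. A minimal userspace sketch of
that check and of the BL encoding (the names and the SZ_128M macro are local
to this example, not kernel API):

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_128M (128 * 1024 * 1024)

/* AArch64 BL encodes a signed 26-bit immediate counted in 4-byte words,
 * so a direct call can only reach +/-128 MiB from the callsite. */
static bool bl_in_range(uint64_t callsite, uint64_t target)
{
	int64_t offset = (int64_t)(target - callsite);

	return offset >= -(int64_t)SZ_128M && offset < (int64_t)SZ_128M;
}

/* Encode "BL target" for a callsite: opcode 0x94000000 with the word
 * offset in the low 26 bits. */
static uint32_t encode_bl(uint64_t callsite, uint64_t target)
{
	int64_t imm = (int64_t)(target - callsite) >> 2;

	return 0x94000000u | ((uint32_t)imm & 0x03ffffffu);
}
```

When the range check fails, case (c) falls back to loading the full 64-bit
target from the literal and branching through X16; in the kernel itself the
instruction encoding is done by the aarch64_insn_gen_*() helpers rather than
open-coded masks like these.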
> I'll try to reply with some more detail tomorrow, but I don't think this is the
> right approach, and as mentioned previously (and e.g. at LPC) I'd strongly
> prefer to *not* implement direct calls, so that we can have more consistent
> entry/exit handling.
>
> Thanks,
> Mark.
> .
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-09-27 4:49 ` Xu Kuohai
@ 2022-09-28 16:42 ` Mark Rutland
-1 siblings, 0 replies; 60+ messages in thread
From: Mark Rutland @ 2022-09-28 16:42 UTC (permalink / raw)
To: Xu Kuohai
Cc: Catalin Marinas, Daniel Borkmann, Xu Kuohai, linux-arm-kernel,
linux-kernel, bpf, Florent Revest, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
> On 9/27/2022 1:43 AM, Mark Rutland wrote:
> > On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> > > On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> > > > On 9/13/22 6:27 PM, Xu Kuohai wrote:
> > > > > This series adds ftrace direct call for arm64, which is required to attach
> > > > > bpf trampoline to fentry.
> > > > >
> > > > > Although there is no agreement on how to support ftrace direct call on arm64,
> > > > > no patch has been posted except the one I posted in [1], so this series
> > > > > continues the work of [1] with the addition of long jump support. Now ftrace
> > > > > direct call works regardless of the distance between the callsite and custom
> > > > > trampoline.
> > > > >
> > > > > [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> > > > >
> > > > > v2:
> > > > > - Fix compile and runtime errors caused by ftrace_rec_arch_init
> > > > >
> > > > > v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> > > > >
> > > > > Xu Kuohai (4):
> > > > > ftrace: Allow users to disable ftrace direct call
> > > > > arm64: ftrace: Support long jump for ftrace direct call
> > > > > arm64: ftrace: Add ftrace direct call support
> > > > > ftrace: Fix dead loop caused by direct call in ftrace selftest
> > > >
> > > > Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> > > > it probably makes sense that this series goes via Catalin/Will through arm64 tree
> > > > instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> > > > could work too, but I'd presume this just results in merge conflicts)?
> > >
> > > I think it makes sense for the series to go via the arm64 tree but I'd
> > > like Mark to have a look at the ftrace changes first.
> >
> > From a quick scan, I still don't think this is quite right, and as it stands I
> > believe this will break backtracing (as the instructions before the function
> > entry point will not be symbolized correctly, getting in the way of
> > RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> > feedback there, as I have a mechanism in mind that was a little simpler.
>
> Thanks for the review. I have some thoughts about reliable stacktrace.
>
> If PC is not in the range of literal_call, stacktrace works as before without
> changes.
>
> If PC is in the range of literal_call, for example, interrupted by an
> irq, I think there are 2 problems:
>
> 1. Caller LR is not pushed to the stack yet, so caller's address and name
> will be missing from the backtrace.
>
> 2. Since PC is not in func's address range, no symbol name will be found, so
> func name is also missing.
>
> Problem 1 is not introduced by this patchset, but the occurring probability
> may be increased by this patchset. I think this problem should be addressed by
> a reliable stacktrace scheme, such as ORC on x86.
I agree problem 1 is not introduced by this patch set; I have plans for how to
address that for reliable stacktrace based on identifying the ftrace
trampoline. This is one of the reasons I do not want direct calls, as
identifying all direct call trampolines is going to be very painful and slow,
whereas identifying a statically allocated ftrace trampoline is far simpler.
> Problem 2 is indeed introduced by this patchset. I think there are at least 3
> ways to deal with it:
What I would like to do here, as mentioned previously in other threads, is to
avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
each patch-site with a specific set of ops, and invoke that directly from the
regular ftrace trampoline.
With that, the patch site would look like:
pre_func_literal:
NOP // Patched to a pointer to
NOP // ftrace_ops
func:
< optional BTI here >
NOP // Patched to MOV X9, LR
NOP // Patched to a BL to the ftrace trampoline
... then in the ftrace trampoline we can recover the ops pointer at a negative
offset from the LR, and invoke the ops from there (passing a struct
ftrace_regs with the saved regs).
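The recovery step can be illustrated with a small userspace model. The layout
and offsets below are assumptions for illustration only, not a final ABI: with
no BTI landing pad, the BL is the second patched instruction, so the
trampoline's LR points 8 bytes past the function entry, and the 8-byte ops
literal sits immediately before the entry, i.e. 16 bytes below the LR.

```c
#include <stdint.h>
#include <string.h>

struct ftrace_ops { const char *name; };

/*
 * Toy layout of one patch-site (no BTI; offsets illustrative):
 *
 *   entry - 8:  .quad <ftrace_ops pointer>   // the two pre-func NOPs
 *   entry + 0:  MOV X9, LR
 *   entry + 4:  BL  ftrace_trampoline        // sets LR = entry + 8
 */
static struct ftrace_ops *recover_ops(uintptr_t lr)
{
	struct ftrace_ops *ops;

	/* The literal is 8 bytes before the entry, i.e. at LR - 16. */
	memcpy(&ops, (void *)(lr - 16), sizeof(ops));
	return ops;
}
```

A real implementation would also need to account for the optional BTI (which
shifts these offsets) and for patching the two literal words atomically with
respect to concurrent execution.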
That way the patch-site is less significantly affected, and there's no impact
to backtracing. That gets most of the benefit of the direct calls avoiding the
ftrace ops list traversal, without having to do anything special at all. That
should be much easier to maintain, too.
I started implementing that before LPC (and you can find some branches on my
kernel.org repo), but I haven't yet had the time to rebase those and sort out
the remaining issues:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
Note that as a prerequisite for that I also want to reduce the set of registers
we save/restore down to the set required by our calling convention, as the
existing pt_regs is both large and generally unsound (since we can not and do
not fill in many of the fields we only acquire at an exception boundary).
That'll further reduce the ftrace overhead generally, and remove the needs for
the two trampolines we currently have. I have a WIP at:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
I intend to get back to both of those shortly (along with some related bits for
kretprobes and stacktracing); I just haven't had much time recently due to
other work and illness.
> 1. Add a symbol name for literal_call.
That'll require a number of invasive changes to make RELIABLE_STACKTRACE work,
so I don't think we want to do that.
> 2. Hack the backtrace routine, if no symbol name found for a PC during backtrace,
> we can check if the PC is in literal_call, then adjust PC and try again.
The problem is that the existing symbolization code doesn't know the length of
the prior symbol, so it will find *some* symbol associated with the previous
function rather than finding no symbol.
To bodge around this we'd need to special-case each patchable-function-entry
site in symbolization, which is going to be painful and slow down unwinding
unless we try to fix this up at boot-time or compile time.
> 3. Move literal_call to the func's address range, for example:
>
> a. Compile with -fpatchable-function-entry=7
> func:
> BTI C
> NOP
> NOP
> NOP
> NOP
> NOP
> NOP
> NOP
This is a non-starter. We are not going to add 7 NOPs at the start of every
function.
Thanks,
Mark.
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
@ 2022-09-28 16:42 ` Mark Rutland
0 siblings, 0 replies; 60+ messages in thread
From: Mark Rutland @ 2022-09-28 16:42 UTC (permalink / raw)
To: Xu Kuohai
Cc: Catalin Marinas, Daniel Borkmann, Xu Kuohai, linux-arm-kernel,
linux-kernel, bpf, Florent Revest, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
> On 9/27/2022 1:43 AM, Mark Rutland wrote:
> > On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> > > On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> > > > On 9/13/22 6:27 PM, Xu Kuohai wrote:
> > > > > This series adds ftrace direct call for arm64, which is required to attach
> > > > > bpf trampoline to fentry.
> > > > >
> > > > > Although there is no agreement on how to support ftrace direct call on arm64,
> > > > > no patch has been posted except the one I posted in [1], so this series
> > > > > continues the work of [1] with the addition of long jump support. Now ftrace
> > > > > direct call works regardless of the distance between the callsite and custom
> > > > > trampoline.
> > > > >
> > > > > [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> > > > >
> > > > > v2:
> > > > > - Fix compile and runtime errors caused by ftrace_rec_arch_init
> > > > >
> > > > > v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> > > > >
> > > > > Xu Kuohai (4):
> > > > > ftrace: Allow users to disable ftrace direct call
> > > > > arm64: ftrace: Support long jump for ftrace direct call
> > > > > arm64: ftrace: Add ftrace direct call support
> > > > > ftrace: Fix dead loop caused by direct call in ftrace selftest
> > > >
> > > > Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> > > > it probably makes sense that this series goes via Catalin/Will through arm64 tree
> > > > instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> > > > could work too, but I'd presume this just results in merge conflicts)?
> > >
> > > I think it makes sense for the series to go via the arm64 tree but I'd
> > > like Mark to have a look at the ftrace changes first.
> >
> > > From a quick scan, I still don't think this is quite right, and as it stands I
> > believe this will break backtracing (as the instructions before the function
> > entry point will not be symbolized correctly, getting in the way of
> > RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> > feedback there, as I have a mechanism in mind that wa a little simpler.
>
> Thanks for the review. I have some thoughts about reliable stacktrace.
>
> If PC is not in the range of literal_call, stacktrace works as before without
> changes.
>
> If PC is in the range of literal_call, for example, interrupted by an
> irq, I think there are 2 problems:
>
> 1. Caller LR is not pushed to the stack yet, so caller's address and name
> will be missing from the backtrace.
>
> 2. Since PC is not in func's address range, no symbol name will be found, so
> func name is also missing.
>
> Problem 1 is not introduced by this patchset, but the occurring probability
> may be increased by this patchset. I think this problem should be addressed by
> a reliable stacktrace scheme, such as ORC on x86.
I agree problem 1 is not introduced by this patch set; I have plans for how to
address that for reliable stacktrace based on identifying the ftrace
trampoline. This is one of the reasons I do not want direct calls, as
identifying all direct call trampolines is going to be very painful and slow,
whereas identifying a statically allocated ftrace trampoline is far simpler.
> Problem 2 is indeed introduced by this patchset. I think there are at least 3
> ways to deal with it:
What I would like to do here, as mentioned previously in other threads, is to
avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
each patch-site with a specific set of ops, and invoke that directly from the
regular ftrace trampoline.
With that, the patch site would look like:
pre_func_literal:
NOP // Patched to a pointer to
NOP // ftrace_ops
func:
< optional BTI here >
NOP // Patched to MOV X9, LR
NOP // Patched to a BL to the ftrace trampoline
... then in the ftrace trampoline we can recover the ops pointer at a negative
offset from the LR, and invoke the ops from there (passing a
struct ftrace_regs with the saved regs).
That way the patch-site is less significantly affected, and there's no impact
to backtracing. That gets most of the benefit of the direct calls avoiding the
ftrace ops list traversal, without having to do anything special at all. That
should be much easier to maintain, too.
I started implementing that before LPC (and you can find some branches on my
kernel.org repo), but I haven't yet had the time to rebase those and sort out
the remaining issues:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
Note that as a prerequisite for that I also want to reduce the set of registers
we save/restore down to the set required by our calling convention, as the
existing pt_regs is both large and generally unsound (since we can not and do
not fill in many of the fields we only acquire at an exception boundary).
That'll further reduce the ftrace overhead generally, and remove the need for
the two trampolines we currently have. I have a WIP at:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
I intend to get back to both of those shortly (along with some related bits for
kretprobes and stacktracing); I just haven't had much time recently due to
other work and illness.
> 1. Add a symbol name for literal_call.
That'll require a number of invasive changes to make RELIABLE_STACKTRACE work,
so I don't think we want to do that.
> 2. Hack the backtrace routine, if no symbol name found for a PC during backtrace,
> we can check if the PC is in literal_call, then adjust PC and try again.
The problem is that the existing symbolization code doesn't know the length of
the prior symbol, so it will find *some* symbol associated with the previous
function rather than finding no symbol.
To bodge around this we'd need to special-case each patchable-function-entry
site in symbolization, which is going to be painful and slow down unwinding
unless we try to fix this up at boot-time or compile time.
> 3. Move literal_call to the func's address range, for example:
>
> a. Compile with -fpatchable-function-entry=7
> func:
> BTI C
> NOP
> NOP
> NOP
> NOP
> NOP
> NOP
> NOP
This is a non-starter. We are not going to add 7 NOPs at the start of every
function.
Thanks,
Mark.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-09-28 16:42 ` Mark Rutland
@ 2022-09-30 4:07 ` Xu Kuohai
-1 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-09-30 4:07 UTC (permalink / raw)
To: Mark Rutland
Cc: Catalin Marinas, Daniel Borkmann, Xu Kuohai, linux-arm-kernel,
linux-kernel, bpf, Florent Revest, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On 9/29/2022 12:42 AM, Mark Rutland wrote:
> On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
>> On 9/27/2022 1:43 AM, Mark Rutland wrote:
>>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>>>> bpf trampoline to fentry.
>>>>>>
>>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>>>> no patch has been posted except the one I posted in [1], so this series
>>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>>>> direct call works regardless of the distance between the callsite and custom
>>>>>> trampoline.
>>>>>>
>>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>>>
>>>>>> v2:
>>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>>>
>>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>>>
>>>>>> Xu Kuohai (4):
>>>>>> ftrace: Allow users to disable ftrace direct call
>>>>>> arm64: ftrace: Support long jump for ftrace direct call
>>>>>> arm64: ftrace: Add ftrace direct call support
>>>>>> ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>>>
>>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>>>> could work too, but I'd presume this just results in merge conflicts)?
>>>>
>>>> I think it makes sense for the series to go via the arm64 tree but I'd
>>>> like Mark to have a look at the ftrace changes first.
>>>
>>> From a quick scan, I still don't think this is quite right, and as it stands I
>>> believe this will break backtracing (as the instructions before the function
>>> entry point will not be symbolized correctly, getting in the way of
>>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
>>> feedback there, as I have a mechanism in mind that was a little simpler.
>>
>> Thanks for the review. I have some thoughts about reliable stacktrace.
>>
>> If PC is not in the range of literal_call, stacktrace works as before without
>> changes.
>>
>> If PC is in the range of literal_call, for example, interrupted by an
>> irq, I think there are 2 problems:
>>
>> 1. Caller LR is not pushed to the stack yet, so caller's address and name
>> will be missing from the backtrace.
>>
>> 2. Since PC is not in func's address range, no symbol name will be found, so
>> func name is also missing.
>>
>> Problem 1 is not introduced by this patchset, but the occurring probability
>> may be increased by this patchset. I think this problem should be addressed by
>> a reliable stacktrace scheme, such as ORC on x86.
>
> I agree problem 1 is not introduced by this patch set; I have plans for how to
> address that for reliable stacktrace based on identifying the ftrace
> trampoline. This is one of the reasons I do not want direct calls, as
> identifying all direct call trampolines is going to be very painful and slow,
> whereas identifying a statically allocated ftrace trampoline is far simpler.
>
>> Problem 2 is indeed introduced by this patchset. I think there are at least 3
>> ways to deal with it:
>
> What I would like to do here, as mentioned previously in other threads, is to
> avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
> each patch-site with a specific set of ops, and invoke that directly from the
> regular ftrace trampoline.
>
> With that, the patch site would look like:
>
> pre_func_literal:
> NOP // Patched to a pointer to
> NOP // ftrace_ops
> func:
> < optional BTI here >
> NOP // Patched to MOV X9, LR
> NOP // Patched to a BL to the ftrace trampoline
>
> ... then in the ftrace trampoline we can recover the ops pointer at a negative
> offset from the LR, and invoke the ops from there (passing a
> struct ftrace_regs with the saved regs).
>
> That way the patch-site is less significantly affected, and there's no impact
> to backtracing. That gets most of the benefit of the direct calls avoiding the
> ftrace ops list traversal, without having to do anything special at all. That
> should be much easier to maintain, too.
>
> I started implementing that before LPC (and you can find some branches on my
> kernel.org repo), but I haven't yet had the time to rebase those and sort out
> the remaining issues:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>
I've read this code before, but it doesn't run, and since you hadn't updated
it, I assumed you had dropped it :(
This approach seems appropriate for dynamic ftrace trampolines, but I think
there are two more issues for bpf.
1. The bpf trampoline was designed to be called directly from fentry (located in
   a kernel function or bpf prog). So to make it work as an ftrace_op, we may end
   up with two different bpf trampoline types on arm64: one for bpf progs and
   the other for ftrace.
2. Performance overhead: we always jump to a static ftrace trampoline to
   construct the execution environment for the bpf trampoline, then jump to the
   bpf trampoline to construct the execution environment for the bpf prog, then
   jump to the bpf prog, so for some small bpf progs or hot functions the
   calling overhead may be unacceptable.
> Note that as a prerequisite for that I also want to reduce the set of registers
> we save/restore down to the set required by our calling convention, as the
> existing pt_regs is both large and generally unsound (since we can not and do
> not fill in many of the fields we only acquire at an exception boundary).
> That'll further reduce the ftrace overhead generally, and remove the need for
> the two trampolines we currently have. I have a WIP at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
>
> I intend to get back to both of those shortly (along with some related bits for
> kretprobes and stacktracing); I just haven't had much time recently due to
> other work and illness.
>
Sorry to hear that, hope you get better soon.
>> 1. Add a symbol name for literal_call.
>
> That'll require a number of invasive changes to make RELIABLE_STACKTRACE work,
> so I don't think we want to do that.
>
>> 2. Hack the backtrace routine, if no symbol name found for a PC during backtrace,
>> we can check if the PC is in literal_call, then adjust PC and try again.
>
> The problem is that the existing symbolization code doesn't know the length of
> the prior symbol, so it will find *some* symbol associated with the previous
> function rather than finding no symbol.
>
> To bodge around this we'd need to special-case each patchable-function-entry
> site in symbolization, which is going to be painful and slow down unwinding
> unless we try to fix this up at boot-time or compile time.
>
>> 3. Move literal_call to the func's address range, for example:
>>
>> a. Compile with -fpatchable-function-entry=7
>> func:
>> BTI C
>> NOP
>> NOP
>> NOP
>> NOP
>> NOP
>> NOP
>> NOP
>
> This is a non-starter. We are not going to add 7 NOPs at the start of every
> function.
>
> Thanks,
> Mark.
>
> .
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-09-30 4:07 ` Xu Kuohai
@ 2022-10-04 16:06 ` Florent Revest
-1 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-04 16:06 UTC (permalink / raw)
To: Xu Kuohai
Cc: Mark Rutland, Catalin Marinas, Daniel Borkmann, Xu Kuohai,
linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On Fri, Sep 30, 2022 at 6:07 AM Xu Kuohai <xukuohai@huawei.com> wrote:
>
> On 9/29/2022 12:42 AM, Mark Rutland wrote:
> > On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
> >> On 9/27/2022 1:43 AM, Mark Rutland wrote:
> >>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> >>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> >>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
> >>>>>> This series adds ftrace direct call for arm64, which is required to attach
> >>>>>> bpf trampoline to fentry.
> >>>>>>
> >>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
> >>>>>> no patch has been posted except the one I posted in [1], so this series
Hey Xu :) Sorry I wasn't more pro-active about communicating what I
was experimenting with! A lot of conversations happened off-the-list
at LPC and LSS, so I was playing on the side with the ideas that got
suggested to me. I'm starting to have a little something to share.
Hopefully, if we work closer together now, we can get quicker results.
> >>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
> >>>>>> direct call works regardless of the distance between the callsite and custom
> >>>>>> trampoline.
> >>>>>>
> >>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> >>>>>>
> >>>>>> v2:
> >>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
> >>>>>>
> >>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> >>>>>>
> >>>>>> Xu Kuohai (4):
> >>>>>> ftrace: Allow users to disable ftrace direct call
> >>>>>> arm64: ftrace: Support long jump for ftrace direct call
> >>>>>> arm64: ftrace: Add ftrace direct call support
> >>>>>> ftrace: Fix dead loop caused by direct call in ftrace selftest
> >>>>>
> >>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> >>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
> >>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> >>>>> could work too, but I'd presume this just results in merge conflicts)?
> >>>>
> >>>> I think it makes sense for the series to go via the arm64 tree but I'd
> >>>> like Mark to have a look at the ftrace changes first.
> >>>
> >>> From a quick scan, I still don't think this is quite right, and as it stands I
> >>> believe this will break backtracing (as the instructions before the function
> >>> entry point will not be symbolized correctly, getting in the way of
> >>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> >>> feedback there, as I have a mechanism in mind that was a little simpler.
> >>
> >> Thanks for the review. I have some thoughts about reliable stacktrace.
> >>
> >> If PC is not in the range of literal_call, stacktrace works as before without
> >> changes.
> >>
> >> If PC is in the range of literal_call, for example, interrupted by an
> >> irq, I think there are 2 problems:
> >>
> >> 1. Caller LR is not pushed to the stack yet, so caller's address and name
> >> will be missing from the backtrace.
> >>
> >> 2. Since PC is not in func's address range, no symbol name will be found, so
> >> func name is also missing.
> >>
> >> Problem 1 is not introduced by this patchset, but the occurring probability
> >> may be increased by this patchset. I think this problem should be addressed by
> >> a reliable stacktrace scheme, such as ORC on x86.
> >
> > I agree problem 1 is not introduced by this patch set; I have plans for how to
> > address that for reliable stacktrace based on identifying the ftrace
> > trampoline. This is one of the reasons I do not want direct calls, as
> > identifying all direct call trampolines is going to be very painful and slow,
> > whereas identifying a statically allocated ftrace trampoline is far simpler.
> >
> >> Problem 2 is indeed introduced by this patchset. I think there are at least 3
> >> ways to deal with it:
> >
> > What I would like to do here, as mentioned previously in other threads, is to
> > avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
> > each patch-site with a specific set of ops, and invoke that directly from the
> > regular ftrace trampoline.
> >
> > With that, the patch site would look like:
> >
> > pre_func_literal:
> > NOP // Patched to a pointer to
> > NOP // ftrace_ops
> > func:
> > < optional BTI here >
> > NOP // Patched to MOV X9, LR
> > NOP // Patched to a BL to the ftrace trampoline
> >
> > ... then in the ftrace trampoline we can recover the ops pointer at a negative
> > offset from the LR, and invoke the ops from there (passing a
> > struct ftrace_regs with the saved regs).
> >
> > That way the patch-site is less significantly affected, and there's no impact
> > to backtracing. That gets most of the benefit of the direct calls avoiding the
> > ftrace ops list traversal, without having to do anything special at all. That
> > should be much easier to maintain, too.
> >
> > I started implementing that before LPC (and you can find some branches on my
> > kernel.org repo), but I haven't yet had the time to rebase those and sort out
> > the remaining issues:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
> >
>
> I've read this code before, but it doesn't run and since you haven't updated
I also tried to use this but indeed the "TODO: mess with protection to
set this" in 5437aa788d needs to be addressed before we can use it.
> it, I assumed you dropped it :(
>
> This approach seems appropriate for dynamic ftrace trampolines, but I think
> there are two more issues for bpf.
>
> 1. bpf trampoline was designed to be called directly from fentry (located in
> kernel function or bpf prog). So to make it work as ftrace_op, we may end
> up with two different bpf trampoline types on arm64, one for bpf prog and
> the other for ftrace.
>
> 2. Performance overhead, as we always jump to a static ftrace trampoline to
> construct execution environment for bpf trampoline, then jump to the bpf
> trampoline to construct execution environment for bpf prog, then jump to
> the bpf prog, so for some small bpf progs or hot functions, the calling
> overhead may be unacceptable.
From the conversations I've had at LPC, Steven, Mark, Jiri and Masami
(all in CC) would like to see an ftrace ops based solution (or rather,
something that doesn't require direct calls) for invoking BPF tracing
programs. I figured that the best way to move forward on the question
of whether the performance impact of that would be acceptable or not
is to just build it and measure it. I understand you're testing your
work on real hardware (I work on an emulator at the moment), would
you be able to compare the impact of my proof-of-concept branch with
your direct-call-based approach?
https://github.com/FlorentRevest/linux/commits/fprobe-min-args
I first tried to implement this as an ftrace op myself but realized I
was re-implementing a lot of the function graph tracer. So I then
tried to use the function graph tracer API but realized I was missing
some features which Steven had addressed in an RFC a few years back. So
I rebuilt on that until I realized Masami has been upstreaming the
fprobe and rethook APIs as spiritual successors of Steven's RFC... So
I've now rebuilt yet another proof of concept based on fprobe and
rethook.
That branch is still very much WIP and there are a few things I'd like
to address before sending even an RFC (when kretprobe is built on
rethook for example, I construct pt_regs on the stack in which I copy
the content of ftrace_regs... or program linking/unlinking is racy
right now) but I think it's good enough for performance measurements
already. (fentry_fexit and lsm tests pass)
> > Note that as a prerequisite for that I also want to reduce the set of registers
> > we save/restore down to the set required by our calling convention, as the
> > existing pt_regs is both large and generally unsound (since we can not and do
> > not fill in many of the fields we only acquire at an exception boundary).
> > That'll further reduce the ftrace overhead generally, and remove the need for
> > the two trampolines we currently have. I have a WIP at:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
Note that I integrated this work into my branch too. I extended it to
also have fprobe and rethook save and pass ftrace_regs structures to
their callbacks. Most performance improvements would come from your
arm64/ftrace/per-callsite-ops branch but we'd need to fix the above
TODO for it to work.
> > I intend to get back to both of those shortly (along with some related bits for
> > kretprobes and stacktracing); I just haven't had much time recently due to
> > other work and illness.
> >
>
> Sorry to hear that, hope you get better soon.
Oh, that sucks. Get better Mark!
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
@ 2022-10-04 16:06 ` Florent Revest
0 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-04 16:06 UTC (permalink / raw)
To: Xu Kuohai
Cc: Mark Rutland, Catalin Marinas, Daniel Borkmann, Xu Kuohai,
linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On Fri, Sep 30, 2022 at 6:07 AM Xu Kuohai <xukuohai@huawei.com> wrote:
>
> On 9/29/2022 12:42 AM, Mark Rutland wrote:
> > On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
> >> On 9/27/2022 1:43 AM, Mark Rutland wrote:
> >>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
> >>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
> >>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
> >>>>>> This series adds ftrace direct call for arm64, which is required to attach
> >>>>>> bpf trampoline to fentry.
> >>>>>>
> >>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
> >>>>>> no patch has been posted except the one I posted in [1], so this series
Hey Xu :) Sorry I wasn't more pro-active about communicating what i
was experimenting with! A lot of conversations happened off-the-list
at LPC and LSS so I was playing on the side with the ideas that got
suggested to me. I start to have a little something to share.
Hopefully if we work closer together now we can get quicker results.
> >>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
> >>>>>> direct call works regardless of the distance between the callsite and custom
> >>>>>> trampoline.
> >>>>>>
> >>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
> >>>>>>
> >>>>>> v2:
> >>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
> >>>>>>
> >>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
> >>>>>>
> >>>>>> Xu Kuohai (4):
> >>>>>> ftrace: Allow users to disable ftrace direct call
> >>>>>> arm64: ftrace: Support long jump for ftrace direct call
> >>>>>> arm64: ftrace: Add ftrace direct call support
> >>>>>> ftrace: Fix dead loop caused by direct call in ftrace selftest
> >>>>>
> >>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
> >>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
> >>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
> >>>>> could work too, but I'd presume this just results in merge conflicts)?
> >>>>
> >>>> I think it makes sense for the series to go via the arm64 tree but I'd
> >>>> like Mark to have a look at the ftrace changes first.
> >>>
> >>> From a quick scan, I still don't think this is quite right, and as it stands I
> >>> believe this will break backtracing (as the instructions before the function
> >>> entry point will not be symbolized correctly, getting in the way of
> >>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
> >>> feedback there, as I have a mechanism in mind that was a little simpler.
> >>
> >> Thanks for the review. I have some thoughts about reliable stacktrace.
> >>
> >> If PC is not in the range of literal_call, stacktrace works as before without
> >> changes.
> >>
> >> If PC is in the range of literal_call, for example, interrupted by an
> >> irq, I think there are 2 problems:
> >>
> >> 1. Caller LR is not pushed to the stack yet, so caller's address and name
> >> will be missing from the backtrace.
> >>
> >> 2. Since PC is not in func's address range, no symbol name will be found, so
> >> func name is also missing.
> >>
> >> Problem 1 is not introduced by this patchset, but the occurring probability
> >> may be increased by this patchset. I think this problem should be addressed by
> >> a reliable stacktrace scheme, such as ORC on x86.
> >
> > I agree problem 1 is not introduced by this patch set; I have plans for how to
> > address that for reliable stacktrace based on identifying the ftrace
> > trampoline. This is one of the reasons I do not want direct calls, as
> > identifying all direct call trampolines is going to be very painful and slow,
> > whereas identifying a statically allocated ftrace trampoline is far simpler.
> >
> >> Problem 2 is indeed introduced by this patchset. I think there are at least 3
> >> ways to deal with it:
> >
> > What I would like to do here, as mentioned previously in other threads, is to
> > avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
> > each patch-site with a specific set of ops, and invoke that directly from the
> > regular ftrace trampoline.
> >
> > With that, the patch site would look like:
> >
> > pre_func_literal:
> > NOP // Patched to a pointer to
> > NOP // ftrace_ops
> > func:
> > < optional BTI here >
> > NOP // Patched to MOV X9, LR
> > NOP // Patched to a BL to the ftrace trampoline
> >
> > ... then in the ftrace trampoline we can recover the ops pointer at a negative
> > offset from the LR based on the LR, and invoke the ops from there (passing a
> > struct ftrace_regs with the saved regs).
> >
> > That way the patch-site is less significantly affected, and there's no impact
> > to backtracing. That gets most of the benefit of the direct calls avoiding the
> > ftrace ops list traversal, without having to do anything special at all. That
> > should be much easier to maintain, too.
> >
> > I started implementing that before LPC (and you can find some branches on my
> > kernel.org repo), but I haven't yet had the time to rebase those and sort out
> > the remaining issues:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
> >
>
> I've read this code before, but it doesn't run and since you haven't updated
I also tried to use this but indeed the "TODO: mess with protection to
set this" in 5437aa788d needs to be addressed before we can use it.
> it, I assumed you dropped it :(
>
> This approach seems appropriate for dynamic ftrace trampolines, but I think
> there are two more issues for bpf.
>
> 1. bpf trampoline was designed to be called directly from fentry (located in
> kernel function or bpf prog). So to make it work as ftrace_op, we may end
> up with two different bpf trampoline types on arm64, one for bpf prog and
> the other for ftrace.
>
> 2. Performance overhead, as we always jump to a static ftrace trampoline to
> construct execution environment for bpf trampoline, then jump to the bpf
> trampoline to construct execution environment for bpf prog, then jump to
> the bpf prog, so for some small bpf progs or hot functions, the calling
> overhead may be unacceptable.
From the conversations I've had at LPC, Steven, Mark, Jiri and Masami
(all in CC) would like to see an ftrace ops based solution (or rather,
something that doesn't require direct calls) for invoking BPF tracing
programs. I figured that the best way to move forward on the question
of whether the performance impact of that would be acceptable or not
is to just build it and measure it. I understand you're testing your
work on real hardware (I work on an emulator at the moment), would
you be able to compare the impact of my proof of concept branch with
your direct-call-based approach?
https://github.com/FlorentRevest/linux/commits/fprobe-min-args
I first tried to implement this as an ftrace op myself but realized I
was re-implementing a lot of the function graph tracer. So I then
tried to use the function graph tracer API but realized I was missing
some features which Steven had addressed in an RFC a few years back. So
I rebuilt on that until I realized Masami has been upstreaming the
fprobe and rethook APIs as spiritual successors of Steven's RFC... So
I've now rebuilt yet another proof of concept based on fprobe and
rethook.
That branch is still very much WIP and there are a few things I'd like
to address before sending even an RFC (when kretprobe is built on
rethook for example, I construct pt_regs on the stack in which I copy
the content of ftrace_regs... or program linking/unlinking is racy
right now) but I think it's good enough for performance measurements
already. (fentry_fexit and lsm tests pass)
> > Note that as a prerequisite for that I also want to reduce the set of registers
> > we save/restore down to the set required by our calling convention, as the
> > existing pt_regs is both large and generally unsound (since we can not and do
> > not fill in many of the fields we only acquire at an exception boundary).
> > That'll further reduce the ftrace overhead generally, and remove the needs for
> > the two trampolines we currently have. I have a WIP at:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
Note that I integrated this work to my branch too. I extended it to
also have fprobe and rethook save and pass ftrace_regs structures to
their callbacks. Most performance improvements would come from your
arm64/ftrace/per-callsite-ops branch but we'd need to fix the above
TODO for it to work.
> > I intend to get back to both of those shortly (along with some related bits for
> > kretprobes and stacktracing); I just haven't had much time recently due to
> > other work and illness.
> >
>
> Sorry for that, hope you get better soon.
Oh, that sucks. Get better Mark!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-04 16:06 ` Florent Revest
@ 2022-10-05 14:54 ` Xu Kuohai
0 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-10-05 14:54 UTC (permalink / raw)
To: Florent Revest
Cc: Mark Rutland, Catalin Marinas, Daniel Borkmann, Xu Kuohai,
linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On 10/5/2022 12:06 AM, Florent Revest wrote:
> On Fri, Sep 30, 2022 at 6:07 AM Xu Kuohai <xukuohai@huawei.com> wrote:
>>
>> On 9/29/2022 12:42 AM, Mark Rutland wrote:
>>> On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
>>>> On 9/27/2022 1:43 AM, Mark Rutland wrote:
>>>>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>>>>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>>>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>>>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>>>>>> bpf trampoline to fentry.
>>>>>>>>
>>>>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>>>>>> no patch has been posted except the one I posted in [1], so this series
>
> Hey Xu :) Sorry I wasn't more proactive about communicating what I
> was experimenting with! A lot of conversations happened off-list
> at LPC and LSS, so I was playing on the side with the ideas that were
> suggested to me. I'm starting to have a little something to share.
> Hopefully, if we work closer together now, we can get quicker results.
>
>>>>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>>>>>> direct call works regardless of the distance between the callsite and custom
>>>>>>>> trampoline.
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>>>>>
>>>>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>>>>>
>>>>>>>> Xu Kuohai (4):
>>>>>>>> ftrace: Allow users to disable ftrace direct call
>>>>>>>> arm64: ftrace: Support long jump for ftrace direct call
>>>>>>>> arm64: ftrace: Add ftrace direct call support
>>>>>>>> ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>>>>>
>>>>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>>>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>>>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>>>>>> could work too, but I'd presume this just results in merge conflicts)?
>>>>>>
>>>>>> I think it makes sense for the series to go via the arm64 tree but I'd
>>>>>> like Mark to have a look at the ftrace changes first.
>>>>>
>>>>> From a quick scan, I still don't think this is quite right, and as it stands I
>>>>> believe this will break backtracing (as the instructions before the function
>>>>> entry point will not be symbolized correctly, getting in the way of
>>>>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
>>>>> feedback there, as I have a mechanism in mind that was a little simpler.
>>>>
>>>> Thanks for the review. I have some thoughts about reliable stacktrace.
>>>>
>>>> If PC is not in the range of literal_call, stacktrace works as before without
>>>> changes.
>>>>
>>>> If PC is in the range of literal_call, for example, interrupted by an
>>>> irq, I think there are 2 problems:
>>>>
>>>> 1. Caller LR is not pushed to the stack yet, so caller's address and name
>>>> will be missing from the backtrace.
>>>>
>>>> 2. Since PC is not in func's address range, no symbol name will be found, so
>>>> func name is also missing.
>>>>
>>>> Problem 1 is not introduced by this patchset, but the occurring probability
>>>> may be increased by this patchset. I think this problem should be addressed by
>>>> a reliable stacktrace scheme, such as ORC on x86.
>>>
>>> I agree problem 1 is not introduced by this patch set; I have plans for how to
>>> address that for reliable stacktrace based on identifying the ftrace
>>> trampoline. This is one of the reasons I do not want direct calls, as
>>> identifying all direct call trampolines is going to be very painful and slow,
>>> whereas identifying a statically allocated ftrace trampoline is far simpler.
>>>
>>>> Problem 2 is indeed introduced by this patchset. I think there are at least 3
>>>> ways to deal with it:
>>>
>>> What I would like to do here, as mentioned previously in other threads, is to
>>> avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
>>> each patch-site with a specific set of ops, and invoke that directly from the
>>> regular ftrace trampoline.
>>>
>>> With that, the patch site would look like:
>>>
>>> pre_func_literal:
>>> NOP // Patched to a pointer to
>>> NOP // ftrace_ops
>>> func:
>>> < optional BTI here >
>>> NOP // Patched to MOV X9, LR
>>> NOP // Patched to a BL to the ftrace trampoline
>>>
>>> ... then in the ftrace trampoline we can recover the ops pointer at a negative
>>> offset from the LR based on the LR, and invoke the ops from there (passing a
>>> struct ftrace_regs with the saved regs).
>>>
>>> That way the patch-site is less significantly affected, and there's no impact
>>> to backtracing. That gets most of the benefit of the direct calls avoiding the
>>> ftrace ops list traversal, without having to do anything special at all. That
>>> should be much easier to maintain, too.
>>>
>>> I started implementing that before LPC (and you can find some branches on my
>>> kernel.org repo), but I haven't yet had the time to rebase those and sort out
>>> the remaining issues:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>>>
>>
>> I've read this code before, but it doesn't run and since you haven't updated
>
> I also tried to use this but indeed the "TODO: mess with protection to
> set this" in 5437aa788d needs to be addressed before we can use it.
>
>> it, I assumed you dropped it :(
>>
>> This approach seems appropriate for dynamic ftrace trampolines, but I think
>> there are two more issues for bpf.
>>
>> 1. bpf trampoline was designed to be called directly from fentry (located in
>> kernel function or bpf prog). So to make it work as ftrace_op, we may end
>> up with two different bpf trampoline types on arm64, one for bpf prog and
>> the other for ftrace.
>>
>> 2. Performance overhead, as we always jump to a static ftrace trampoline to
>> construct execution environment for bpf trampoline, then jump to the bpf
>> trampoline to construct execution environment for bpf prog, then jump to
>> the bpf prog, so for some small bpf progs or hot functions, the calling
>> overhead may be unacceptable.
>
> From the conversations I've had at LPC, Steven, Mark, Jiri and Masami
> (all in CC) would like to see an ftrace ops based solution (or rather,
> something that doesn't require direct calls) for invoking BPF tracing
> programs. I figured that the best way to move forward on the question
> of whether the performance impact of that would be acceptable or not
> is to just build it and measure it. I understand you're testing your
> work on real hardware (I work on an emulator at the moment), would
> you be able to compare the impact of my proof of concept branch with
> your direct-call-based approach?
>
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>
Tested on my Raspberry Pi 4; here are the results.
1. test with dd
1.1 when no bpf prog attached to vfs_write
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.56858 s, 326 MB/s
1.2 attach bpf prog with kprobe, bpftrace -e 'kprobe:vfs_write {}'
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.33439 s, 219 MB/s
1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
2. test with bpf/bench
2.1 bench trig-base
Iter 0 ( 65.259us): hits 1.774M/s ( 1.774M/prod), drops 0.000M/s, total operations 1.774M/s
Iter 1 (-17.075us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 2 ( 0.388us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 3 ( -1.759us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 4 ( 1.980us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 5 ( -2.222us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 6 ( 0.869us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Summary: hits 1.790 ± 0.000M/s ( 1.790M/prod), drops 0.000 ± 0.000M/s, total operations 1.790 ± 0.000M/s
2.2 bench trig-kprobe
Iter 0 ( 50.703us): hits 0.765M/s ( 0.765M/prod), drops 0.000M/s, total operations 0.765M/s
Iter 1 (-15.056us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Iter 2 ( 2.981us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Iter 3 ( -3.834us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Iter 4 ( -1.964us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Iter 5 ( 0.426us): hits 0.770M/s ( 0.770M/prod), drops 0.000M/s, total operations 0.770M/s
Iter 6 ( -1.297us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Summary: hits 0.771 ± 0.000M/s ( 0.771M/prod), drops 0.000 ± 0.000M/s, total operations 0.771 ± 0.000M/s
2.3 bench trig-fentry, with direct call
Iter 0 ( 49.981us): hits 1.357M/s ( 1.357M/prod), drops 0.000M/s, total operations 1.357M/s
Iter 1 ( 2.184us): hits 1.363M/s ( 1.363M/prod), drops 0.000M/s, total operations 1.363M/s
Iter 2 (-14.167us): hits 1.358M/s ( 1.358M/prod), drops 0.000M/s, total operations 1.358M/s
Iter 3 ( -4.890us): hits 1.362M/s ( 1.362M/prod), drops 0.000M/s, total operations 1.362M/s
Iter 4 ( 5.759us): hits 1.362M/s ( 1.362M/prod), drops 0.000M/s, total operations 1.362M/s
Iter 5 ( -4.389us): hits 1.362M/s ( 1.362M/prod), drops 0.000M/s, total operations 1.362M/s
Iter 6 ( -0.594us): hits 1.364M/s ( 1.364M/prod), drops 0.000M/s, total operations 1.364M/s
Summary: hits 1.362 ± 0.002M/s ( 1.362M/prod), drops 0.000 ± 0.000M/s, total operations 1.362 ± 0.002M/s
2.4 bench trig-fentry, with indirect call
Iter 0 ( 49.148us): hits 1.014M/s ( 1.014M/prod), drops 0.000M/s, total operations 1.014M/s
Iter 1 (-13.816us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Iter 2 ( 0.648us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Iter 3 ( 3.370us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Iter 4 ( 11.388us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Iter 5 (-17.242us): hits 1.022M/s ( 1.022M/prod), drops 0.000M/s, total operations 1.022M/s
Iter 6 ( 1.815us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Summary: hits 1.021 ± 0.000M/s ( 1.021M/prod), drops 0.000 ± 0.000M/s, total operations 1.021 ± 0.000M/s
> I first tried to implement this as an ftrace op myself but realized I
> was re-implementing a lot of the function graph tracer. So I then
> tried to use the function graph tracer API but realized I was missing
> some features which Steven had addressed in an RFC a few years back. So
> I rebuilt on that until I realized Masami has been upstreaming the
> fprobe and rethook APIs as spiritual successors of Steven's RFC... So
> I've now rebuilt yet another proof of concept based on fprobe and
> rethook.
>
> That branch is still very much WIP and there are a few things I'd like
> to address before sending even an RFC (when kretprobe is built on
> rethook for example, I construct pt_regs on the stack in which I copy
> the content of ftrace_regs... or program linking/unlinking is racy
> right now) but I think it's good enough for performance measurements
> already. (fentry_fexit and lsm tests pass)
>
>>> Note that as a prerequisite for that I also want to reduce the set of registers
>>> we save/restore down to the set required by our calling convention, as the
>>> existing pt_regs is both large and generally unsound (since we can not and do
>>> not fill in many of the fields we only acquire at an exception boundary).
>>> That'll further reduce the ftrace overhead generally, and remove the needs for
>>> the two trampolines we currently have. I have a WIP at:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
>
> Note that I integrated this work to my branch too. I extended it to
> also have fprobe and rethook save and pass ftrace_regs structures to
> their callbacks. Most performance improvements would come from your
> arm64/ftrace/per-callsite-ops branch but we'd need to fix the above
> TODO for it to work.
>
>>> I intend to get back to both of those shortly (along with some related bits for
>>> kretprobes and stacktracing); I just haven't had much time recently due to
>>> other work and illness.
>>>
>>
>> Sorry for that, hope you get better soon.
>
> Oh, that sucks. Get better Mark!
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
@ 2022-10-05 14:54 ` Xu Kuohai
0 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-10-05 14:54 UTC (permalink / raw)
To: Florent Revest
Cc: Mark Rutland, Catalin Marinas, Daniel Borkmann, Xu Kuohai,
linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Steven Rostedt, Ingo Molnar,
Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On 10/5/2022 12:06 AM, Florent Revest wrote:
> On Fri, Sep 30, 2022 at 6:07 AM Xu Kuohai <xukuohai@huawei.com> wrote:
>>
>> On 9/29/2022 12:42 AM, Mark Rutland wrote:
>>> On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
>>>> On 9/27/2022 1:43 AM, Mark Rutland wrote:
>>>>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>>>>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>>>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>>>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>>>>>> bpf trampoline to fentry.
>>>>>>>>
>>>>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>>>>>> no patch has been posted except the one I posted in [1], so this series
>
> Hey Xu :) Sorry I wasn't more pro-active about communicating what i
> was experimenting with! A lot of conversations happened off-the-list
> at LPC and LSS so I was playing on the side with the ideas that got
> suggested to me. I start to have a little something to share.
> Hopefully if we work closer together now we can get quicker results.
>
>>>>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>>>>>> direct call works regardless of the distance between the callsite and custom
>>>>>>>> trampoline.
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>>>>>
>>>>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>>>>>
>>>>>>>> Xu Kuohai (4):
>>>>>>>> ftrace: Allow users to disable ftrace direct call
>>>>>>>> arm64: ftrace: Support long jump for ftrace direct call
>>>>>>>> arm64: ftrace: Add ftrace direct call support
>>>>>>>> ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>>>>>
>>>>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>>>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>>>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>>>>>> could work too, but I'd presume this just results in merge conflicts)?
>>>>>>
>>>>>> I think it makes sense for the series to go via the arm64 tree but I'd
>>>>>> like Mark to have a look at the ftrace changes first.
>>>>>
>>>>>> From a quick scan, I still don't think this is quite right, and as it stands I
>>>>> believe this will break backtracing (as the instructions before the function
>>>>> entry point will not be symbolized correctly, getting in the way of
>>>>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
>>>>> feedback there, as I have a mechanism in mind that wa a little simpler.
>>>>
>>>> Thanks for the review. I have some thoughts about reliable stacktrace.
>>>>
>>>> If PC is not in the range of literal_call, stacktrace works as before without
>>>> changes.
>>>>
>>>> If PC is in the range of literal_call, for example, interrupted by an
>>>> irq, I think there are 2 problems:
>>>>
>>>> 1. Caller LR is not pushed to the stack yet, so caller's address and name
>>>> will be missing from the backtrace.
>>>>
>>>> 2. Since PC is not in func's address range, no symbol name will be found, so
>>>> func name is also missing.
>>>>
>>>> Problem 1 is not introduced by this patchset, but the occurring probability
>>>> may be increased by this patchset. I think this problem should be addressed by
>>>> a reliable stacktrace scheme, such as ORC on x86.
>>>
>>> I agree problem 1 is not introduced by this patch set; I have plans fo how to
>>> address that for reliable stacktrace based on identifying the ftrace
>>> trampoline. This is one of the reasons I do not want direct calls, as
>>> identifying all direct call trampolines is going to be very painful and slow,
>>> whereas identifying a statically allocated ftrace trampoline is far simpler.
>>>
>>>> Problem 2 is indeed introduced by this patchset. I think there are at least 3
>>>> ways to deal with it:
>>>
>>> What I would like to do here, as mentioned previously in other threads, is to
>>> avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
>>> each patch-site with a specific set of ops, and invoke that directly from the
>>> regular ftrace trampoline.
>>>
>>> With that, the patch site would look like:
>>>
>>> pre_func_literal:
>>> NOP // Patched to a pointer to
>>> NOP // ftrace_ops
>>> func:
>>> < optional BTI here >
>>> NOP // Patched to MOV X9, LR
>>> NOP // Patched to a BL to the ftrace trampoline
>>>
>>> ... then in the ftrace trampoline we can recover the ops pointer at a negative
>>> offset from the LR based on the LR, and invoke the ops from there (passing a
>>> struct ftrace_regs with the saved regs).
>>>
>>> That way the patch-site is less significantly affected, and there's no impact
>>> to backtracing. That gets most of the benefit of the direct calls avoiding the
>>> ftrace ops list traversal, without having to do anything special at all. That
>>> should be much easier to maintain, too.
>>>
>>> I started implementing that before LPC (and you can find some branches on my
>>> kernel.org repo), but I haven't yet had the time to rebase those and sort out
>>> the remaining issues:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>>>
>>
>> I've read this code before, but it doesn't run and since you haven't updated
>
> I also tried to use this but indeed the "TODO: mess with protection to
> set this" in 5437aa788d needs to be addressed before we can use it.
>
>> it, I assumed you dropped it :(
>>
>> This approach seems appropriate for dynamic ftrace trampolines, but I think
>> there are two more issues for bpf.
>>
>> 1. bpf trampoline was designed to be called directly from fentry (located in
>> kernel function or bpf prog). So to make it work as ftrace_op, we may end
>> up with two different bpf trampoline types on arm64, one for bpf prog and
>> the other for ftrace.
>>
>> 2. Performance overhead, as we always jump to a static ftrace trampoline to
>> construct execution environment for bpf trampoline, then jump to the bpf
>> trampoline to construct execution environment for bpf prog, then jump to
>> the bpf prog, so for some small bpf progs or hot functions, the calling
>> overhead may be unacceptable.
>
>>From the conversations I've had at LPC, Steven, Mark, Jiri and Masami
> (all in CC) would like to see an ftrace ops based solution (or rather,
> something that doesn't require direct calls) for invoking BPF tracing
> programs. I figured that the best way to move forward on the question
> of whether the performance impact of that would be acceptable or not
> is to just build it and measure it. I understand you're testing your
> work on real hardware (I work on an emulator at the moment) , would
> you be able to compare the impact of my proof of concept branch with
> your direct call based approach ?
>
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>
Tested on my pi4, here is the result.
1. test with dd
1.1 when no bpf prog attached to vfs_write
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.56858 s, 326 MB/s
1.2 attach bpf prog with kprobe, bpftrace -e 'kprobe:vfs_write {}'
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.33439 s, 219 MB/s
1.3 attach bpf prog with with direct call, bpftrace -e 'kfunc:vfs_write {}'
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
1.4 attach bpf prog with with indirect call, bpftrace -e 'kfunc:vfs_write {}'
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
2. test with bpf/bench
2.1 bench trig-base
Iter 0 ( 65.259us): hits 1.774M/s ( 1.774M/prod), drops 0.000M/s, total operations 1.774M/s
Iter 1 (-17.075us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 2 ( 0.388us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 3 ( -1.759us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 4 ( 1.980us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 5 ( -2.222us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Iter 6 ( 0.869us): hits 1.790M/s ( 1.790M/prod), drops 0.000M/s, total operations 1.790M/s
Summary: hits 1.790 ± 0.000M/s ( 1.790M/prod), drops 0.000 ± 0.000M/s, total operations 1.790 ± 0.000M/s
2.2 bench trig-kprobe
Iter 0 ( 50.703us): hits 0.765M/s ( 0.765M/prod), drops 0.000M/s, total operations 0.765M/s
Iter 1 (-15.056us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Iter 2 ( 2.981us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Iter 3 ( -3.834us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Iter 4 ( -1.964us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Iter 5 ( 0.426us): hits 0.770M/s ( 0.770M/prod), drops 0.000M/s, total operations 0.770M/s
Iter 6 ( -1.297us): hits 0.771M/s ( 0.771M/prod), drops 0.000M/s, total operations 0.771M/s
Summary: hits 0.771 ± 0.000M/s ( 0.771M/prod), drops 0.000 ± 0.000M/s, total operations 0.771 ± 0.000M/s
2.3 bench trig-fentry, with direct call
Iter 0 ( 49.981us): hits 1.357M/s ( 1.357M/prod), drops 0.000M/s, total operations 1.357M/s
Iter 1 ( 2.184us): hits 1.363M/s ( 1.363M/prod), drops 0.000M/s, total operations 1.363M/s
Iter 2 (-14.167us): hits 1.358M/s ( 1.358M/prod), drops 0.000M/s, total operations 1.358M/s
Iter 3 ( -4.890us): hits 1.362M/s ( 1.362M/prod), drops 0.000M/s, total operations 1.362M/s
Iter 4 ( 5.759us): hits 1.362M/s ( 1.362M/prod), drops 0.000M/s, total operations 1.362M/s
Iter 5 ( -4.389us): hits 1.362M/s ( 1.362M/prod), drops 0.000M/s, total operations 1.362M/s
Iter 6 ( -0.594us): hits 1.364M/s ( 1.364M/prod), drops 0.000M/s, total operations 1.364M/s
Summary: hits 1.362 ± 0.002M/s ( 1.362M/prod), drops 0.000 ± 0.000M/s, total operations 1.362 ± 0.002M/s
2.4 bench trig-fentry, with indirect call
Iter 0 ( 49.148us): hits 1.014M/s ( 1.014M/prod), drops 0.000M/s, total operations 1.014M/s
Iter 1 (-13.816us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Iter 2 ( 0.648us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Iter 3 ( 3.370us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Iter 4 ( 11.388us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Iter 5 (-17.242us): hits 1.022M/s ( 1.022M/prod), drops 0.000M/s, total operations 1.022M/s
Iter 6 ( 1.815us): hits 1.021M/s ( 1.021M/prod), drops 0.000M/s, total operations 1.021M/s
Summary: hits 1.021 ± 0.000M/s ( 1.021M/prod), drops 0.000 ± 0.000M/s, total operations 1.021 ± 0.000M/s
> I first tried to implement this as an ftrace op myself but realized I
> was re-implementing a lot of the function graph tracer. So I then
> tried to use the function graph tracer API but realized I was missing
> some features which Steven had addressed in an RFC a few years back. So
> I rebuilt on that until I realized Masami has been upstreaming the
> fprobe and rethook APIs as spiritual successors of Steven's RFC... So
> I've now rebuilt yet another proof of concept based on fprobe and
> rethook.
>
> That branch is still very much WIP and there are a few things I'd like
> to address before sending even an RFC (when kretprobe is built on
> rethook for example, I construct pt_regs on the stack in which I copy
> the content of ftrace_regs... or program linking/unlinking is racy
> right now) but I think it's good enough for performance measurements
> already. (fentry_fexit and lsm tests pass)
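As a rough illustration of the pt_regs-on-stack shim described above — all struct layouts and helper names here are invented stand-ins, not the kernel's real types or API — the idea is to zero a pt_regs on the stack, copy in only the fields that ftrace_regs actually saved, and hand that to the kretprobe-style handler:

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-ins for the kernel structures: ftrace_regs only
 * saves the argument registers plus lr/sp/pc, while pt_regs has the
 * full register file. */
struct fake_ftrace_regs { unsigned long args[8]; unsigned long lr, sp, pc; };
struct fake_pt_regs { unsigned long regs[31]; unsigned long sp, pc; };

static unsigned long seen_pc;
static unsigned long seen_x0;

/* example consumer that expects a pt_regs, kretprobe-style */
static void sample_handler(struct fake_pt_regs *regs)
{
	seen_pc = regs->pc;
	seen_x0 = regs->regs[0];
}

/* Build a partial pt_regs on the stack from what ftrace saved and
 * call the pt_regs-based handler; fields ftrace never saved stay
 * zeroed. */
static void call_with_pt_regs(struct fake_ftrace_regs *fregs,
			      void (*handler)(struct fake_pt_regs *))
{
	struct fake_pt_regs regs;

	memset(&regs, 0, sizeof(regs));
	memcpy(regs.regs, fregs->args, sizeof(fregs->args)); /* x0-x7 */
	regs.regs[30] = fregs->lr;
	regs.sp = fregs->sp;
	regs.pc = fregs->pc;
	handler(&regs);
}
```

Only the argument registers, lr, sp, and pc end up meaningful; everything else in pt_regs stays zeroed, which is part of why this copy is listed as something to address before sending an RFC.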
>
>>> Note that as a prerequisite for that I also want to reduce the set of registers
>>> we save/restore down to the set required by our calling convention, as the
>>> existing pt_regs is both large and generally unsound (since we can not and do
>>> not fill in many of the fields we only acquire at an exception boundary).
>>> That'll further reduce the ftrace overhead generally, and remove the needs for
>>> the two trampolines we currently have. I have a WIP at:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
>
> Note that I integrated this work to my branch too. I extended it to
> also have fprobe and rethook save and pass ftrace_regs structures to
> their callbacks. Most performance improvements would come from your
> arm64/ftrace/per-callsite-ops branch but we'd need to fix the above
> TODO for it to work.
>
>>> I intend to get back to both of those shortly (along with some related bits for
>>> kretprobes and stacktracing); I just haven't had much time recently due to
>>> other work and illness.
>>>
>>
>> Sorry to hear that, hope you get better soon.
>
> Oh, that sucks. Get better, Mark!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-05 14:54 ` Xu Kuohai
@ 2022-10-05 15:07 ` Steven Rostedt
-1 siblings, 0 replies; 60+ messages in thread
From: Steven Rostedt @ 2022-10-05 15:07 UTC (permalink / raw)
To: Xu Kuohai
Cc: Florent Revest, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Wed, 5 Oct 2022 22:54:15 +0800
Xu Kuohai <xukuohai@huawei.com> wrote:
> 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
>
> # dd if=/dev/zero of=/dev/null count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
>
>
> 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
>
> # dd if=/dev/zero of=/dev/null count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
Can you show the implementation of the indirect call you used?
Thanks,
-- Steve
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-05 15:07 ` Steven Rostedt
@ 2022-10-05 15:10 ` Florent Revest
-1 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-05 15:10 UTC (permalink / raw)
To: Steven Rostedt
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 5 Oct 2022 22:54:15 +0800
> Xu Kuohai <xukuohai@huawei.com> wrote:
>
> > 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
> >
> > # dd if=/dev/zero of=/dev/null count=1000000
> > 1000000+0 records in
> > 1000000+0 records out
> > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> >
> >
> > 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> >
> > # dd if=/dev/zero of=/dev/null count=1000000
> > 1000000+0 records in
> > 1000000+0 records out
> > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
Thanks for the measurements Xu!
> Can you show the implementation of the indirect call you used?
Xu used my development branch here
https://github.com/FlorentRevest/linux/commits/fprobe-min-args
As it stands, the performance impact of the fprobe based
implementation would be too high for us. I wonder how much Mark's idea
here https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
would help but it doesn't work right now.
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-05 15:10 ` Florent Revest
@ 2022-10-05 15:30 ` Steven Rostedt
-1 siblings, 0 replies; 60+ messages in thread
From: Steven Rostedt @ 2022-10-05 15:30 UTC (permalink / raw)
To: Florent Revest
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Wed, 5 Oct 2022 17:10:33 +0200
Florent Revest <revest@chromium.org> wrote:
> On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 5 Oct 2022 22:54:15 +0800
> > Xu Kuohai <xukuohai@huawei.com> wrote:
> >
> > > 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > >
> > > # dd if=/dev/zero of=/dev/null count=1000000
> > > 1000000+0 records in
> > > 1000000+0 records out
> > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > >
> > >
> > > 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > >
> > > # dd if=/dev/zero of=/dev/null count=1000000
> > > 1000000+0 records in
> > > 1000000+0 records out
> > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
>
> Thanks for the measurements Xu!
>
> > Can you show the implementation of the indirect call you used?
>
> Xu used my development branch here
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args
That looks like it could be optimized quite a bit too.
Specifically this part:
static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
{
	struct bpf_fprobe_call_context *call_ctx = private;
	struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
	struct bpf_tramp_links *links = fprobe_ctx->links;
	struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
	struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
	struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
	int i, ret;

	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
	call_ctx->ip = ip;
	for (i = 0; i < fprobe_ctx->nr_args; i++)
		call_ctx->args[i] = ftrace_regs_get_argument(regs, i);

	for (i = 0; i < fentry->nr_links; i++)
		call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);

	call_ctx->args[fprobe_ctx->nr_args] = 0;
	for (i = 0; i < fmod_ret->nr_links; i++) {
		ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
				    call_ctx->args);

		if (ret) {
			ftrace_regs_set_return_value(regs, ret);
			ftrace_override_function_with_return(regs);

			bpf_fprobe_exit(fp, ip, regs, private);
			return false;
		}
	}

	return fexit->nr_links;
}
There's a lot of low-hanging fruit to speed up there. I wouldn't be too
quick to throw out this solution if it hasn't had the care that direct calls
have had to speed that up.
For example, trampolines currently only allow attaching to functions with 6
parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
zero to 6 parameters, and unroll the argument loop.
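Steven's unrolling suggestion could look something like the sketch below. The struct and function names are invented for illustration, not kernel APIs: the point is that one callback is selected per arity at attach time, so the hot path copies a fixed number of arguments with no loop and no nr_args load.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ARGS 6

/* illustrative stand-in for struct ftrace_regs: the saved arg registers */
struct fake_regs { unsigned long regs[MAX_ARGS]; };

static inline unsigned long get_arg(struct fake_regs *r, int i)
{
	return r->regs[i];
}

typedef void (*entry_fn)(struct fake_regs *regs, unsigned long *args);

/* one unrolled callback per arity; no runtime loop over nr_args */
static void entry_args0(struct fake_regs *r, unsigned long *a)
{
	(void)r; (void)a;
}

static void entry_args2(struct fake_regs *r, unsigned long *a)
{
	a[0] = get_arg(r, 0);
	a[1] = get_arg(r, 1);
}

static void entry_args6(struct fake_regs *r, unsigned long *a)
{
	a[0] = get_arg(r, 0);
	a[1] = get_arg(r, 1);
	a[2] = get_arg(r, 2);
	a[3] = get_arg(r, 3);
	a[4] = get_arg(r, 4);
	a[5] = get_arg(r, 5);
}

/* table indexed by the traced function's arity, chosen once at
 * attach time (only arities 0, 2 and 6 are filled in for brevity) */
static entry_fn entry_by_arity[MAX_ARGS + 1] = {
	entry_args0, NULL, entry_args2, NULL, NULL, NULL, entry_args6,
};
```

At attach time the prober would look up `entry_by_arity[nr_args]` once and install that pointer, so each invocation runs straight-line copies only.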
Would also be interesting to run perf to see where the overhead is. There
may be other locations to work on to make it almost as fast as direct
callers without the other baggage.
-- Steve
>
> As it stands, the performance impact of the fprobe based
> implementation would be too high for us. I wonder how much Mark's idea
> here https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
> would help but it doesn't work right now.
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-05 15:30 ` Steven Rostedt
@ 2022-10-05 22:12 ` Jiri Olsa
-1 siblings, 0 replies; 60+ messages in thread
From: Jiri Olsa @ 2022-10-05 22:12 UTC (permalink / raw)
To: Steven Rostedt, Florent Revest
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Wed, Oct 05, 2022 at 11:30:19AM -0400, Steven Rostedt wrote:
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest@chromium.org> wrote:
>
> > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Wed, 5 Oct 2022 22:54:15 +0800
> > > Xu Kuohai <xukuohai@huawei.com> wrote:
> > >
> > > > 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > > >
> > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > 1000000+0 records in
> > > > 1000000+0 records out
> > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > > >
> > > >
> > > > 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > > >
> > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > 1000000+0 records in
> > > > 1000000+0 records out
> > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
> >
> > Thanks for the measurements Xu!
> >
> > > Can you show the implementation of the indirect call you used?
> >
> > Xu used my development branch here
> > https://github.com/FlorentRevest/linux/commits/fprobe-min-args
Nice :) I guess you did not try to run it on x86; I had to make some small
changes and disable HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS to get it to compile.
>
> That looks like it could be optimized quite a bit too.
>
> Specifically this part:
>
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
> 	struct bpf_fprobe_call_context *call_ctx = private;
> 	struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> 	struct bpf_tramp_links *links = fprobe_ctx->links;
> 	struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> 	struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> 	struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> 	int i, ret;
>
> 	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> 	call_ctx->ip = ip;
> 	for (i = 0; i < fprobe_ctx->nr_args; i++)
> 		call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
>
> 	for (i = 0; i < fentry->nr_links; i++)
> 		call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
>
> 	call_ctx->args[fprobe_ctx->nr_args] = 0;
> 	for (i = 0; i < fmod_ret->nr_links; i++) {
> 		ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> 				    call_ctx->args);
>
> 		if (ret) {
> 			ftrace_regs_set_return_value(regs, ret);
> 			ftrace_override_function_with_return(regs);
>
> 			bpf_fprobe_exit(fp, ip, regs, private);
> 			return false;
> 		}
> 	}
>
> 	return fexit->nr_links;
> }
>
> There's a lot of low-hanging fruit to speed up there. I wouldn't be too
> quick to throw out this solution if it hasn't had the care that direct calls
> have had to speed that up.
>
> For example, trampolines currently only allow attaching to functions with 6
> parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
> zero to 6 parameters, and unroll the argument loop.
>
> Would also be interesting to run perf to see where the overhead is. There
> may be other locations to work on to make it almost as fast as direct
> callers without the other baggage.
I can boot the change and run tests in qemu, but for some reason it
won't boot on hw, so I have just the perf report from qemu so far.
The fprobe/rethook machinery shows up as expected.
jirka
---
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 23K of event 'cpu-clock:k'
# Event count (approx.): 5841250000
#
# Overhead Command Shared Object Symbol
# ........ ....... .............................................. ..................................................
#
18.65% bench [kernel.kallsyms] [k] syscall_enter_from_user_mode
|
---syscall_enter_from_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
13.03% bench [kernel.kallsyms] [k] seqcount_lockdep_reader_access.constprop.0
|
---seqcount_lockdep_reader_access.constprop.0
ktime_get_coarse_real_ts64
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
9.49% bench [kernel.kallsyms] [k] rethook_try_get
|
---rethook_try_get
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
8.71% bench [kernel.kallsyms] [k] rethook_recycle
|
---rethook_recycle
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
4.31% bench [kernel.kallsyms] [k] rcu_is_watching
|
---rcu_is_watching
|
|--1.49%--rethook_try_get
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
|--1.10%--do_getpgid
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
|--1.02%--__bpf_prog_exit
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--0.70%--__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.94% bench [kernel.kallsyms] [k] lock_release
|
---lock_release
|
|--1.51%--call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--1.43%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.91% bench bpf_prog_21856463590f61f1_bench_trigger_fentry [k] bpf_prog_21856463590f61f1_bench_trigger_fentry
|
---bpf_prog_21856463590f61f1_bench_trigger_fentry
|
--2.66%--call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.69% bench [kernel.kallsyms] [k] bpf_fprobe_entry
|
---bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.60% bench [kernel.kallsyms] [k] lock_acquire
|
---lock_acquire
|
|--1.34%--__bpf_prog_enter
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--1.24%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.42% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode_prepare
|
---syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.37% bench [kernel.kallsyms] [k] __audit_syscall_entry
|
---__audit_syscall_entry
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
|
--2.36%--syscall
2.35% bench [kernel.kallsyms] [k] syscall_trace_enter.constprop.0
|
---syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.12% bench [kernel.kallsyms] [k] check_preemption_disabled
|
---check_preemption_disabled
|
--1.55%--rcu_is_watching
|
--0.59%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.00% bench [kernel.kallsyms] [k] fprobe_handler
|
---fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.94% bench [kernel.kallsyms] [k] local_irq_disable_exit_to_user
|
---local_irq_disable_exit_to_user
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.84% bench [kernel.kallsyms] [k] rcu_read_lock_sched_held
|
---rcu_read_lock_sched_held
|
|--0.93%--lock_acquire
|
--0.90%--lock_release
1.71% bench [kernel.kallsyms] [k] migrate_enable
|
---migrate_enable
__bpf_prog_exit
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.66% bench [kernel.kallsyms] [k] call_bpf_prog.isra.0
|
---call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.53% bench [kernel.kallsyms] [k] __rcu_read_unlock
|
---__rcu_read_unlock
|
|--0.86%--__bpf_prog_exit
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--0.66%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.31% bench [kernel.kallsyms] [k] debug_smp_processor_id
|
---debug_smp_processor_id
|
--0.77%--rcu_is_watching
1.22% bench [kernel.kallsyms] [k] migrate_disable
|
---migrate_disable
__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.19% bench [kernel.kallsyms] [k] __bpf_prog_enter
|
---__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.84% bench [kernel.kallsyms] [k] __radix_tree_lookup
|
---__radix_tree_lookup
find_task_by_pid_ns
do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.82% bench [kernel.kallsyms] [k] do_getpgid
|
---do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.78% bench [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
|
---debug_lockdep_rcu_enabled
|
--0.63%--rcu_read_lock_sched_held
0.74% bench ftrace_trampoline [k] ftrace_trampoline
|
---ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.72% bench [kernel.kallsyms] [k] preempt_count_add
|
---preempt_count_add
0.71% bench [kernel.kallsyms] [k] ktime_get_coarse_real_ts64
|
---ktime_get_coarse_real_ts64
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.69% bench [kernel.kallsyms] [k] do_syscall_64
|
---do_syscall_64
entry_SYSCALL_64_after_hwframe
|
--0.68%--syscall
0.60% bench [kernel.kallsyms] [k] preempt_count_sub
|
---preempt_count_sub
0.59% bench [kernel.kallsyms] [k] __rcu_read_lock
|
---__rcu_read_lock
0.59% bench [kernel.kallsyms] [k] __x64_sys_getpgid
|
---__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.58% bench [kernel.kallsyms] [k] __audit_syscall_exit
|
---__audit_syscall_exit
syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.53% bench [kernel.kallsyms] [k] audit_reset_context
|
---audit_reset_context
syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.45% bench [kernel.kallsyms] [k] rcu_read_lock_held
0.36% bench [kernel.kallsyms] [k] find_task_by_vpid
0.32% bench [kernel.kallsyms] [k] __bpf_prog_exit
0.26% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode
0.20% bench [kernel.kallsyms] [k] idr_find
0.18% bench [kernel.kallsyms] [k] find_task_by_pid_ns
0.17% bench [kernel.kallsyms] [k] update_prog_stats
0.16% bench [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.14% bench [kernel.kallsyms] [k] pid_task
0.04% bench [kernel.kallsyms] [k] memchr_inv
0.04% bench [kernel.kallsyms] [k] smp_call_function_many_cond
0.03% bench [kernel.kallsyms] [k] do_user_addr_fault
0.03% bench [kernel.kallsyms] [k] kallsyms_expand_symbol.constprop.0
0.03% bench [kernel.kallsyms] [k] native_flush_tlb_global
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
@ 2022-10-05 22:12 ` Jiri Olsa
0 siblings, 0 replies; 60+ messages in thread
From: Jiri Olsa @ 2022-10-05 22:12 UTC (permalink / raw)
To: Steven Rostedt, Florent Revest
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Wed, Oct 05, 2022 at 11:30:19AM -0400, Steven Rostedt wrote:
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest@chromium.org> wrote:
>
> > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Wed, 5 Oct 2022 22:54:15 +0800
> > > Xu Kuohai <xukuohai@huawei.com> wrote:
> > >
> > > > 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > > >
> > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > 1000000+0 records in
> > > > 1000000+0 records out
> > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > > >
> > > >
> > > > 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > > >
> > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > 1000000+0 records in
> > > > 1000000+0 records out
> > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
> >
> > Thanks for the measurements Xu!
> >
> > > Can you show the implementation of the indirect call you used?
> >
> > Xu used my development branch here
> > https://github.com/FlorentRevest/linux/commits/fprobe-min-args
nice :) I guess you did not try to run it on x86, I had to add some small
changes and disable HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS to compile it
>
> That looks like it could be optimized quite a bit too.
>
> Specifically this part:
>
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
> struct bpf_fprobe_call_context *call_ctx = private;
> struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> struct bpf_tramp_links *links = fprobe_ctx->links;
> struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> int i, ret;
>
> memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> call_ctx->ip = ip;
> for (i = 0; i < fprobe_ctx->nr_args; i++)
> call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
>
> for (i = 0; i < fentry->nr_links; i++)
> call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
>
> call_ctx->args[fprobe_ctx->nr_args] = 0;
> for (i = 0; i < fmod_ret->nr_links; i++) {
> ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> call_ctx->args);
>
> if (ret) {
> ftrace_regs_set_return_value(regs, ret);
> ftrace_override_function_with_return(regs);
>
> bpf_fprobe_exit(fp, ip, regs, private);
> return false;
> }
> }
>
> return fexit->nr_links;
> }
>
> There's a lot of low hanging fruit to speed up there. I wouldn't be too
> fast to throw out this solution if it hasn't had the care that direct calls
> have had to speed that up.
>
> For example, trampolines currently only allow attaching to functions with 6
> parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
> zero to 6 parameters, and unroll the argument loop.
>
> Would also be interesting to run perf to see where the overhead is. There
> may be other locations to work on to make it almost as fast as direct
> callers without the other baggage.
I can boot the change and run tests in qemu but for some reason it
won't boot on hw, so I have just perf report from qemu so far
there's fprobe/rethook machinery showing out as expected
jirka
---
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 23K of event 'cpu-clock:k'
# Event count (approx.): 5841250000
#
# Overhead Command Shared Object Symbol
# ........ ....... .............................................. ..................................................
#
18.65% bench [kernel.kallsyms] [k] syscall_enter_from_user_mode
|
---syscall_enter_from_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
13.03% bench [kernel.kallsyms] [k] seqcount_lockdep_reader_access.constprop.0
|
---seqcount_lockdep_reader_access.constprop.0
ktime_get_coarse_real_ts64
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
9.49% bench [kernel.kallsyms] [k] rethook_try_get
|
---rethook_try_get
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
8.71% bench [kernel.kallsyms] [k] rethook_recycle
|
---rethook_recycle
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
4.31% bench [kernel.kallsyms] [k] rcu_is_watching
|
---rcu_is_watching
|
|--1.49%--rethook_try_get
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
|--1.10%--do_getpgid
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
|--1.02%--__bpf_prog_exit
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--0.70%--__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.94% bench [kernel.kallsyms] [k] lock_release
|
---lock_release
|
|--1.51%--call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--1.43%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.91% bench bpf_prog_21856463590f61f1_bench_trigger_fentry [k] bpf_prog_21856463590f61f1_bench_trigger_fentry
|
---bpf_prog_21856463590f61f1_bench_trigger_fentry
|
--2.66%--call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.69% bench [kernel.kallsyms] [k] bpf_fprobe_entry
|
---bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.60% bench [kernel.kallsyms] [k] lock_acquire
|
---lock_acquire
|
|--1.34%--__bpf_prog_enter
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--1.24%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.42% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode_prepare
|
---syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.37% bench [kernel.kallsyms] [k] __audit_syscall_entry
|
---__audit_syscall_entry
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
|
--2.36%--syscall
2.35% bench [kernel.kallsyms] [k] syscall_trace_enter.constprop.0
|
---syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.12% bench [kernel.kallsyms] [k] check_preemption_disabled
|
---check_preemption_disabled
|
--1.55%--rcu_is_watching
|
--0.59%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
2.00% bench [kernel.kallsyms] [k] fprobe_handler
|
---fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.94% bench [kernel.kallsyms] [k] local_irq_disable_exit_to_user
|
---local_irq_disable_exit_to_user
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.84% bench [kernel.kallsyms] [k] rcu_read_lock_sched_held
|
---rcu_read_lock_sched_held
|
|--0.93%--lock_acquire
|
--0.90%--lock_release
1.71% bench [kernel.kallsyms] [k] migrate_enable
|
---migrate_enable
__bpf_prog_exit
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.66% bench [kernel.kallsyms] [k] call_bpf_prog.isra.0
|
---call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.53% bench [kernel.kallsyms] [k] __rcu_read_unlock
|
---__rcu_read_unlock
|
|--0.86%--__bpf_prog_exit
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--0.66%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.31% bench [kernel.kallsyms] [k] debug_smp_processor_id
|
---debug_smp_processor_id
|
--0.77%--rcu_is_watching
1.22% bench [kernel.kallsyms] [k] migrate_disable
|
---migrate_disable
__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
1.19% bench [kernel.kallsyms] [k] __bpf_prog_enter
|
---__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.84% bench [kernel.kallsyms] [k] __radix_tree_lookup
|
---__radix_tree_lookup
find_task_by_pid_ns
do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.82% bench [kernel.kallsyms] [k] do_getpgid
|
---do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.78% bench [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
|
---debug_lockdep_rcu_enabled
|
--0.63%--rcu_read_lock_sched_held
0.74% bench ftrace_trampoline [k] ftrace_trampoline
|
---ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.72% bench [kernel.kallsyms] [k] preempt_count_add
|
---preempt_count_add
0.71% bench [kernel.kallsyms] [k] ktime_get_coarse_real_ts64
|
---ktime_get_coarse_real_ts64
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.69% bench [kernel.kallsyms] [k] do_syscall_64
|
---do_syscall_64
entry_SYSCALL_64_after_hwframe
|
--0.68%--syscall
0.60% bench [kernel.kallsyms] [k] preempt_count_sub
|
---preempt_count_sub
0.59% bench [kernel.kallsyms] [k] __rcu_read_lock
|
---__rcu_read_lock
0.59% bench [kernel.kallsyms] [k] __x64_sys_getpgid
|
---__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.58% bench [kernel.kallsyms] [k] __audit_syscall_exit
|
---__audit_syscall_exit
syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.53% bench [kernel.kallsyms] [k] audit_reset_context
|
---audit_reset_context
syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall
0.45% bench [kernel.kallsyms] [k] rcu_read_lock_held
0.36% bench [kernel.kallsyms] [k] find_task_by_vpid
0.32% bench [kernel.kallsyms] [k] __bpf_prog_exit
0.26% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode
0.20% bench [kernel.kallsyms] [k] idr_find
0.18% bench [kernel.kallsyms] [k] find_task_by_pid_ns
0.17% bench [kernel.kallsyms] [k] update_prog_stats
0.16% bench [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.14% bench [kernel.kallsyms] [k] pid_task
0.04% bench [kernel.kallsyms] [k] memchr_inv
0.04% bench [kernel.kallsyms] [k] smp_call_function_many_cond
0.03% bench [kernel.kallsyms] [k] do_user_addr_fault
0.03% bench [kernel.kallsyms] [k] kallsyms_expand_symbol.constprop.0
0.03% bench [kernel.kallsyms] [k] native_flush_tlb_global
0.03% bench [kernel.kallsyms] [k] __change_page_attr_set_clr
0.02% bench [kernel.kallsyms] [k] memcpy_erms
0.02% bench [kernel.kallsyms] [k] unwind_next_frame
0.02% bench [kernel.kallsyms] [k] copy_user_enhanced_fast_string
0.01% bench [kernel.kallsyms] [k] __orc_find
0.01% bench [kernel.kallsyms] [k] call_rcu
0.01% bench [kernel.kallsyms] [k] __alloc_pages
0.01% bench [kernel.kallsyms] [k] __purge_vmap_area_lazy
0.01% bench [kernel.kallsyms] [k] __softirqentry_text_start
0.01% bench [kernel.kallsyms] [k] __stack_depot_save
0.01% bench [kernel.kallsyms] [k] __up_read
0.01% bench [kernel.kallsyms] [k] __virt_addr_valid
0.01% bench [kernel.kallsyms] [k] clear_page_erms
0.01% bench [kernel.kallsyms] [k] deactivate_slab
0.01% bench [kernel.kallsyms] [k] do_check_common
0.01% bench [kernel.kallsyms] [k] finish_task_switch.isra.0
0.01% bench [kernel.kallsyms] [k] free_unref_page_list
0.01% bench [kernel.kallsyms] [k] ftrace_rec_iter_next
0.01% bench [kernel.kallsyms] [k] handle_mm_fault
0.01% bench [kernel.kallsyms] [k] orc_find.part.0
0.01% bench [kernel.kallsyms] [k] try_charge_memcg
0.00% bench [kernel.kallsyms] [k] ___slab_alloc
0.00% bench [kernel.kallsyms] [k] __fdget_pos
0.00% bench [kernel.kallsyms] [k] __handle_mm_fault
0.00% bench [kernel.kallsyms] [k] __is_insn_slot_addr
0.00% bench [kernel.kallsyms] [k] __kmalloc
0.00% bench [kernel.kallsyms] [k] __mod_lruvec_page_state
0.00% bench [kernel.kallsyms] [k] __mod_node_page_state
0.00% bench [kernel.kallsyms] [k] __mutex_lock
0.00% bench [kernel.kallsyms] [k] __raw_spin_lock_init
0.00% bench [kernel.kallsyms] [k] alloc_vmap_area
0.00% bench [kernel.kallsyms] [k] allocate_slab
0.00% bench [kernel.kallsyms] [k] audit_get_tty
0.00% bench [kernel.kallsyms] [k] bpf_ksym_find
0.00% bench [kernel.kallsyms] [k] btf_check_all_metas
0.00% bench [kernel.kallsyms] [k] btf_put
0.00% bench [kernel.kallsyms] [k] cmpxchg_double_slab.constprop.0.isra.0
0.00% bench [kernel.kallsyms] [k] do_fault
0.00% bench [kernel.kallsyms] [k] do_raw_spin_trylock
0.00% bench [kernel.kallsyms] [k] find_vma
0.00% bench [kernel.kallsyms] [k] fs_reclaim_release
0.00% bench [kernel.kallsyms] [k] ftrace_check_record
0.00% bench [kernel.kallsyms] [k] ftrace_replace_code
0.00% bench [kernel.kallsyms] [k] get_mem_cgroup_from_mm
0.00% bench [kernel.kallsyms] [k] get_page_from_freelist
0.00% bench [kernel.kallsyms] [k] in_gate_area_no_mm
0.00% bench [kernel.kallsyms] [k] in_task_stack
0.00% bench [kernel.kallsyms] [k] kernel_text_address
0.00% bench [kernel.kallsyms] [k] kernfs_fop_read_iter
0.00% bench [kernel.kallsyms] [k] kernfs_put_active
0.00% bench [kernel.kallsyms] [k] kfree
0.00% bench [kernel.kallsyms] [k] kmem_cache_alloc
0.00% bench [kernel.kallsyms] [k] ksys_read
0.00% bench [kernel.kallsyms] [k] lookup_address_in_pgd
0.00% bench [kernel.kallsyms] [k] mlock_page_drain_local
0.00% bench [kernel.kallsyms] [k] page_remove_rmap
0.00% bench [kernel.kallsyms] [k] post_alloc_hook
0.00% bench [kernel.kallsyms] [k] preempt_schedule_irq
0.00% bench [kernel.kallsyms] [k] queue_work_on
0.00% bench [kernel.kallsyms] [k] stack_trace_save
0.00% bench [kernel.kallsyms] [k] within_error_injection_list
#
# (Tip: To record callchains for each sample: perf record -g)
#
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-09-28 16:42 ` Mark Rutland
@ 2022-10-06 10:09 ` Xu Kuohai
0 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-10-06 10:09 UTC (permalink / raw)
To: Mark Rutland
Cc: Catalin Marinas, Daniel Borkmann, linux-arm-kernel, linux-kernel,
bpf, Florent Revest, Will Deacon, Jean-Philippe Brucker,
Steven Rostedt, Ingo Molnar, Oleg Nesterov, Alexei Starovoitov,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel, Marc Zyngier,
Guo Ren, Masami Hiramatsu
On 9/29/2022 12:42 AM, Mark Rutland wrote:
> On Tue, Sep 27, 2022 at 12:49:58PM +0800, Xu Kuohai wrote:
>> On 9/27/2022 1:43 AM, Mark Rutland wrote:
>>> On Mon, Sep 26, 2022 at 03:40:20PM +0100, Catalin Marinas wrote:
>>>> On Thu, Sep 22, 2022 at 08:01:16PM +0200, Daniel Borkmann wrote:
>>>>> On 9/13/22 6:27 PM, Xu Kuohai wrote:
>>>>>> This series adds ftrace direct call for arm64, which is required to attach
>>>>>> bpf trampoline to fentry.
>>>>>>
>>>>>> Although there is no agreement on how to support ftrace direct call on arm64,
>>>>>> no patch has been posted except the one I posted in [1], so this series
>>>>>> continues the work of [1] with the addition of long jump support. Now ftrace
>>>>>> direct call works regardless of the distance between the callsite and custom
>>>>>> trampoline.
>>>>>>
>>>>>> [1] https://lore.kernel.org/bpf/20220518131638.3401509-2-xukuohai@huawei.com/
>>>>>>
>>>>>> v2:
>>>>>> - Fix compile and runtime errors caused by ftrace_rec_arch_init
>>>>>>
>>>>>> v1: https://lore.kernel.org/bpf/20220913063146.74750-1-xukuohai@huaweicloud.com/
>>>>>>
>>>>>> Xu Kuohai (4):
>>>>>> ftrace: Allow users to disable ftrace direct call
>>>>>> arm64: ftrace: Support long jump for ftrace direct call
>>>>>> arm64: ftrace: Add ftrace direct call support
>>>>>> ftrace: Fix dead loop caused by direct call in ftrace selftest
>>>>>
>>>>> Given there's just a tiny fraction touching BPF JIT and most are around core arm64,
>>>>> it probably makes sense that this series goes via Catalin/Will through arm64 tree
>>>>> instead of bpf-next if it looks good to them. Catalin/Will, thoughts (Ack + bpf-next
>>>>> could work too, but I'd presume this just results in merge conflicts)?
>>>>
>>>> I think it makes sense for the series to go via the arm64 tree but I'd
>>>> like Mark to have a look at the ftrace changes first.
>>>
>>> From a quick scan, I still don't think this is quite right, and as it stands I
>>> believe this will break backtracing (as the instructions before the function
>>> entry point will not be symbolized correctly, getting in the way of
>>> RELIABLE_STACKTRACE). I think I was insufficiently clear with my earlier
>>> feedback there, as I have a mechanism in mind that was a little simpler.
>>
>> Thanks for the review. I have some thoughts about reliable stacktrace.
>>
>> If PC is not in the range of literal_call, stacktrace works as before without
>> changes.
>>
>> If PC is in the range of literal_call, for example, interrupted by an
>> irq, I think there are 2 problems:
>>
>> 1. Caller LR is not pushed to the stack yet, so caller's address and name
>> will be missing from the backtrace.
>>
>> 2. Since PC is not in func's address range, no symbol name will be found, so
>> func name is also missing.
>>
>> Problem 1 is not introduced by this patchset, but the occurring probability
>> may be increased by this patchset. I think this problem should be addressed by
>> a reliable stacktrace scheme, such as ORC on x86.
>
> I agree problem 1 is not introduced by this patch set; I have plans for how to
> address that for reliable stacktrace based on identifying the ftrace
> trampoline. This is one of the reasons I do not want direct calls, as
> identifying all direct call trampolines is going to be very painful and slow,
> whereas identifying a statically allocated ftrace trampoline is far simpler.
>
>> Problem 2 is indeed introduced by this patchset. I think there are at least 3
>> ways to deal with it:
>
> What I would like to do here, as mentioned previously in other threads, is to
> avoid direct calls, and implement "FTRACE_WITH_OPS", where we can associate
> each patch-site with a specific set of ops, and invoke that directly from the
> regular ftrace trampoline.
>
> With that, the patch site would look like:
>
> pre_func_literal:
> NOP // Patched to a pointer to
> NOP // ftrace_ops
> func:
> < optional BTI here >
> NOP // Patched to MOV X9, LR
> NOP // Patched to a BL to the ftrace trampoline
>
> ... then in the ftrace trampoline we can recover the ops pointer at a negative
> offset from the LR based on the LR, and invoke the ops from there (passing a
> struct ftrace_regs with the saved regs).
>
> That way the patch-site is less significantly affected, and there's no impact
> to backtracing. That gets most of the benefit of the direct calls avoiding the
> ftrace ops list traversal, without having to do anything special at all. That
> should be much easier to maintain, too.
>
> I started implementing that before LPC (and you can find some branches on my
> kernel.org repo), but I haven't yet had the time to rebase those and sort out
> the remaining issues:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>
> Note that as a prerequisite for that I also want to reduce the set of registers
> we save/restore down to the set required by our calling convention, as the
> existing pt_regs is both large and generally unsound (since we can not and do
> not fill in many of the fields we only acquire at an exception boundary).
> That'll further reduce the ftrace overhead generally, and remove the need for
> the two trampolines we currently have. I have a WIP at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/minimal-regs
>
> I intend to get back to both of those shortly (along with some related bits for
> kretprobes and stacktracing); I just haven't had much time recently due to
> other work and illness.
>
>> 1. Add a symbol name for literal_call.
>
> That'll require a number of invasive changes to make RELIABLE_STACKTRACE work,
> so I don't think we want to do that.
>
>> 2. Hack the backtrace routine, if no symbol name found for a PC during backtrace,
>> we can check if the PC is in literal_call, then adjust PC and try again.
>
> The problem is that the existing symbolization code doesn't know the length of
> the prior symbol, so it will find *some* symbol associated with the previous
> function rather than finding no symbol.
>
> To bodge around this we'd need to special-case each patchable-function-entry
> site in symbolization, which is going to be painful and slow down unwinding
> unless we try to fix this up at boot-time or compile time.
>
>> 3. Move literal_call to the func's address range, for example:
>>
>> a. Compile with -fpatchable-function-entry=7
>> func:
>> BTI C
>> NOP
>> NOP
>> NOP
>> NOP
>> NOP
>> NOP
>> NOP
>
> This is a non-starter. We are not going to add 7 NOPs at the start of every
> function.
>
Looks like we could just add 3 NOPs to function entry, like this:
1. At startup or when nothing attached, patch callsite to:
literal:
.quad dummy_tramp
func:
BTI C
MOV X9, LR
NOP
NOP
...
2. When target is in range, patch callsite to:
literal:
.quad dummy_tramp
func:
BTI C
MOV X9, LR
NOP
BL custom_trampoline
...
3. When target is out of range, patch callsite to:
literal:
.quad custom_trampoline
func:
BTI C
MOV X9, LR
LDR X16, literal
BLR X16
...
> Thanks,
> Mark.
>
> .
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-05 15:30 ` Steven Rostedt
@ 2022-10-06 10:09 ` Xu Kuohai
-1 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-10-06 10:09 UTC (permalink / raw)
To: Steven Rostedt, Florent Revest
Cc: Mark Rutland, Catalin Marinas, Daniel Borkmann, linux-arm-kernel,
linux-kernel, bpf, Will Deacon, Jean-Philippe Brucker,
Ingo Molnar, Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On 10/5/2022 11:30 PM, Steven Rostedt wrote:
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest@chromium.org> wrote:
>
>> On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>>
>>> On Wed, 5 Oct 2022 22:54:15 +0800
>>> Xu Kuohai <xukuohai@huawei.com> wrote:
>>>
>>>> 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
>>>>
>>>> # dd if=/dev/zero of=/dev/null count=1000000
>>>> 1000000+0 records in
>>>> 1000000+0 records out
>>>> 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
>>>>
>>>>
>>>> 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
>>>>
>>>> # dd if=/dev/zero of=/dev/null count=1000000
>>>> 1000000+0 records in
>>>> 1000000+0 records out
>>>> 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
>>
>> Thanks for the measurements Xu!
>>
>>> Can you show the implementation of the indirect call you used?
>>
>> Xu used my development branch here
>> https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>
> That looks like it could be optimized quite a bit too.
>
> Specifically this part:
>
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
> struct bpf_fprobe_call_context *call_ctx = private;
> struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> struct bpf_tramp_links *links = fprobe_ctx->links;
> struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> int i, ret;
>
> memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> call_ctx->ip = ip;
> for (i = 0; i < fprobe_ctx->nr_args; i++)
> call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
>
> for (i = 0; i < fentry->nr_links; i++)
> call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
>
> call_ctx->args[fprobe_ctx->nr_args] = 0;
> for (i = 0; i < fmod_ret->nr_links; i++) {
> ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> call_ctx->args);
>
> if (ret) {
> ftrace_regs_set_return_value(regs, ret);
> ftrace_override_function_with_return(regs);
>
> bpf_fprobe_exit(fp, ip, regs, private);
> return false;
> }
> }
>
> return fexit->nr_links;
> }
>
> There's a lot of low hanging fruit to speed up there. I wouldn't be too
> fast to throw out this solution if it hasn't had the care that direct calls
> have had to speed that up.
>
> For example, trampolines currently only allow attaching to functions with 6
> parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
> zero to 6 parameters, and unroll the argument loop.
>
> Would also be interesting to run perf to see where the overhead is. There
> may be other locations to work on to make it almost as fast as direct
> callers without the other baggage.
>
There is something wrong with my pi4 perf; I'll send the perf report after
I fix it.
> -- Steve
>
>>
>> As it stands, the performance impact of the fprobe based
>> implementation would be too high for us. I wonder how much Mark's idea
>> here https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>> would help but it doesn't work right now.
>
>
> .
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-05 15:30 ` Steven Rostedt
@ 2022-10-06 16:19 ` Florent Revest
-1 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-06 16:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Wed, Oct 5, 2022 at 5:30 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest@chromium.org> wrote:
>
> > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > Can you show the implementation of the indirect call you used?
> >
> > Xu used my development branch here
> > https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>
> That looks like it could be optimized quite a bit too.
>
> Specifically this part:
>
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
> struct bpf_fprobe_call_context *call_ctx = private;
> struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> struct bpf_tramp_links *links = fprobe_ctx->links;
> struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> int i, ret;
>
> memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> call_ctx->ip = ip;
> for (i = 0; i < fprobe_ctx->nr_args; i++)
> call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
>
> for (i = 0; i < fentry->nr_links; i++)
> call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
>
> call_ctx->args[fprobe_ctx->nr_args] = 0;
> for (i = 0; i < fmod_ret->nr_links; i++) {
> ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> call_ctx->args);
>
> if (ret) {
> ftrace_regs_set_return_value(regs, ret);
> ftrace_override_function_with_return(regs);
>
> bpf_fprobe_exit(fp, ip, regs, private);
> return false;
> }
> }
>
> return fexit->nr_links;
> }
>
> There's a lot of low hanging fruit to speed up there. I wouldn't be too
> fast to throw out this solution if it hasn't had the care that direct calls
> have had to speed that up.
>
> For example, trampolines currently only allow attaching to functions with 6
> parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
> zero to 6 parameters, and unroll the argument loop.
Sure, we can give this a try, I'll work on a macro that generates the
7 callbacks and we can check how much that helps. My belief right now
is that ftrace's iteration over all ops on arm64 is where we lose most
time, but now that we have numbers it's pretty easy to check the hypothesis
:)
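A minimal sketch of what such generated callbacks could look like (hypothetical types and names, assuming the program receives its argument array as in the fprobe-min-args branch): one callback per arity, so the per-argument copy loop disappears at compile time.

```c
#include <stdint.h>

/* Hypothetical program type: a BPF program handed a pointer to its
 * argument array, as in the fprobe-min-args development branch. */
typedef uint64_t (*bpf_prog_fn)(uint64_t *args);

/* One callback per arity: the argument copies are fully unrolled,
 * replacing the runtime loop over fprobe_ctx->nr_args. */
static uint64_t call_prog2(bpf_prog_fn prog, const uint64_t *regs)
{
	uint64_t args[2];

	args[0] = regs[0];
	args[1] = regs[1];
	return prog(args);
}

static uint64_t call_prog3(bpf_prog_fn prog, const uint64_t *regs)
{
	uint64_t args[3];

	args[0] = regs[0];
	args[1] = regs[1];
	args[2] = regs[2];
	return prog(args);
}

/* Demo program: sums its first two arguments. */
static uint64_t demo_prog(uint64_t *args)
{
	return args[0] + args[1];
}
```

The real version would presumably pick the right `call_progN` once at attach time, based on the traced function's BTF prototype, rather than branching per call.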
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-06 16:19 ` Florent Revest
@ 2022-10-06 16:29 ` Steven Rostedt
-1 siblings, 0 replies; 60+ messages in thread
From: Steven Rostedt @ 2022-10-06 16:29 UTC (permalink / raw)
To: Florent Revest
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Thu, 6 Oct 2022 18:19:12 +0200
Florent Revest <revest@chromium.org> wrote:
> Sure, we can give this a try, I'll work on a macro that generates the
> 7 callbacks and we can check how much that helps. My belief right now
> is that ftrace's iteration over all ops on arm64 is where we lose most
> time, but now that we have numbers it's pretty easy to check the hypothesis
> :)
Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
fixed first. I keep forgetting that arm64 doesn't have the dedicated trampolines yet.
So, let's hold off until that is complete.
-- Steve
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-05 22:12 ` Jiri Olsa
@ 2022-10-06 16:35 ` Florent Revest
-1 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-06 16:35 UTC (permalink / raw)
To: Jiri Olsa
Cc: Steven Rostedt, Xu Kuohai, Mark Rutland, Catalin Marinas,
Daniel Borkmann, Xu Kuohai, linux-arm-kernel, linux-kernel, bpf,
Will Deacon, Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Thu, Oct 6, 2022 at 12:12 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Wed, Oct 05, 2022 at 11:30:19AM -0400, Steven Rostedt wrote:
> > On Wed, 5 Oct 2022 17:10:33 +0200
> > Florent Revest <revest@chromium.org> wrote:
> >
> > > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >
> > > > On Wed, 5 Oct 2022 22:54:15 +0800
> > > > Xu Kuohai <xukuohai@huawei.com> wrote:
> > > >
> > > > > 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > > > >
> > > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > > 1000000+0 records in
> > > > > 1000000+0 records out
> > > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > > > >
> > > > >
> > > > > 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > > > >
> > > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > > 1000000+0 records in
> > > > > 1000000+0 records out
> > > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
> > >
> > > Thanks for the measurements Xu!
> > >
> > > > Can you show the implementation of the indirect call you used?
> > >
> > > Xu used my development branch here
> > > https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>
> nice :) I guess you did not try to run it on x86; I had to add some small
> changes and disable HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS to get it to compile
Indeed, I haven't tried building on x86 yet, I'll have a look at what
I broke, thanks. :)
That branch is just an outline of the idea at this point anyway. Just
enough for performance measurements, not particularly ready for
review.
> >
> > That looks like it could be optimized quite a bit too.
> >
> > Specifically this part:
> >
> > static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> > {
> > struct bpf_fprobe_call_context *call_ctx = private;
> > struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> > struct bpf_tramp_links *links = fprobe_ctx->links;
> > struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> > struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> > struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> > int i, ret;
> >
> > memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> > call_ctx->ip = ip;
> > for (i = 0; i < fprobe_ctx->nr_args; i++)
> > call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
> >
> > for (i = 0; i < fentry->nr_links; i++)
> > call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
> >
> > call_ctx->args[fprobe_ctx->nr_args] = 0;
> > for (i = 0; i < fmod_ret->nr_links; i++) {
> > ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> > call_ctx->args);
> >
> > if (ret) {
> > ftrace_regs_set_return_value(regs, ret);
> > ftrace_override_function_with_return(regs);
> >
> > bpf_fprobe_exit(fp, ip, regs, private);
> > return false;
> > }
> > }
> >
> > return fexit->nr_links;
> > }
> >
> > There's a lot of low hanging fruit to speed up there. I wouldn't be too
> > fast to throw out this solution if it hasn't had the care that direct calls
> > have had to speed that up.
> >
> > For example, trampolines currently only allow attaching to functions with 6
> > parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
> > zero to 6 parameters, and unroll the argument loop.
> >
> > Would also be interesting to run perf to see where the overhead is. There
> > may be other locations to work on to make it almost as fast as direct
> > callers without the other baggage.
>
> I can boot the change and run tests in qemu but for some reason it
> won't boot on hw, so I have just perf report from qemu so far
Oh, ok, that's interesting. The changes look pretty benign (only
fprobe and arm64-specific code), so I'm curious how that would break the
boot, uh :p
>
> there's fprobe/rethook machinery showing out as expected
>
> jirka
>
>
> ---
> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 23K of event 'cpu-clock:k'
> # Event count (approx.): 5841250000
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... .............................................. ..................................................
> #
> 18.65% bench [kernel.kallsyms] [k] syscall_enter_from_user_mode
> |
> ---syscall_enter_from_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 13.03% bench [kernel.kallsyms] [k] seqcount_lockdep_reader_access.constprop.0
> |
> ---seqcount_lockdep_reader_access.constprop.0
> ktime_get_coarse_real_ts64
> syscall_trace_enter.constprop.0
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 9.49% bench [kernel.kallsyms] [k] rethook_try_get
> |
> ---rethook_try_get
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 8.71% bench [kernel.kallsyms] [k] rethook_recycle
> |
> ---rethook_recycle
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 4.31% bench [kernel.kallsyms] [k] rcu_is_watching
> |
> ---rcu_is_watching
> |
> |--1.49%--rethook_try_get
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> |--1.10%--do_getpgid
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> |--1.02%--__bpf_prog_exit
> | call_bpf_prog.isra.0
> | bpf_fprobe_entry
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> --0.70%--__bpf_prog_enter
> call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.94% bench [kernel.kallsyms] [k] lock_release
> |
> ---lock_release
> |
> |--1.51%--call_bpf_prog.isra.0
> | bpf_fprobe_entry
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> --1.43%--do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.91% bench bpf_prog_21856463590f61f1_bench_trigger_fentry [k] bpf_prog_21856463590f61f1_bench_trigger_fentry
> |
> ---bpf_prog_21856463590f61f1_bench_trigger_fentry
> |
> --2.66%--call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.69% bench [kernel.kallsyms] [k] bpf_fprobe_entry
> |
> ---bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.60% bench [kernel.kallsyms] [k] lock_acquire
> |
> ---lock_acquire
> |
> |--1.34%--__bpf_prog_enter
> | call_bpf_prog.isra.0
> | bpf_fprobe_entry
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> --1.24%--do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.42% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode_prepare
> |
> ---syscall_exit_to_user_mode_prepare
> syscall_exit_to_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.37% bench [kernel.kallsyms] [k] __audit_syscall_entry
> |
> ---__audit_syscall_entry
> syscall_trace_enter.constprop.0
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> |
> --2.36%--syscall
>
> 2.35% bench [kernel.kallsyms] [k] syscall_trace_enter.constprop.0
> |
> ---syscall_trace_enter.constprop.0
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.12% bench [kernel.kallsyms] [k] check_preemption_disabled
> |
> ---check_preemption_disabled
> |
> --1.55%--rcu_is_watching
> |
> --0.59%--do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.00% bench [kernel.kallsyms] [k] fprobe_handler
> |
> ---fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.94% bench [kernel.kallsyms] [k] local_irq_disable_exit_to_user
> |
> ---local_irq_disable_exit_to_user
> syscall_exit_to_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.84% bench [kernel.kallsyms] [k] rcu_read_lock_sched_held
> |
> ---rcu_read_lock_sched_held
> |
> |--0.93%--lock_acquire
> |
> --0.90%--lock_release
>
> 1.71% bench [kernel.kallsyms] [k] migrate_enable
> |
> ---migrate_enable
> __bpf_prog_exit
> call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.66% bench [kernel.kallsyms] [k] call_bpf_prog.isra.0
> |
> ---call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.53% bench [kernel.kallsyms] [k] __rcu_read_unlock
> |
> ---__rcu_read_unlock
> |
> |--0.86%--__bpf_prog_exit
> | call_bpf_prog.isra.0
> | bpf_fprobe_entry
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> --0.66%--do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.31% bench [kernel.kallsyms] [k] debug_smp_processor_id
> |
> ---debug_smp_processor_id
> |
> --0.77%--rcu_is_watching
>
> 1.22% bench [kernel.kallsyms] [k] migrate_disable
> |
> ---migrate_disable
> __bpf_prog_enter
> call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.19% bench [kernel.kallsyms] [k] __bpf_prog_enter
> |
> ---__bpf_prog_enter
> call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.84% bench [kernel.kallsyms] [k] __radix_tree_lookup
> |
> ---__radix_tree_lookup
> find_task_by_pid_ns
> do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.82% bench [kernel.kallsyms] [k] do_getpgid
> |
> ---do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.78% bench [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
> |
> ---debug_lockdep_rcu_enabled
> |
> --0.63%--rcu_read_lock_sched_held
>
> 0.74% bench ftrace_trampoline [k] ftrace_trampoline
> |
> ---ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.72% bench [kernel.kallsyms] [k] preempt_count_add
> |
> ---preempt_count_add
>
> 0.71% bench [kernel.kallsyms] [k] ktime_get_coarse_real_ts64
> |
> ---ktime_get_coarse_real_ts64
> syscall_trace_enter.constprop.0
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.69% bench [kernel.kallsyms] [k] do_syscall_64
> |
> ---do_syscall_64
> entry_SYSCALL_64_after_hwframe
> |
> --0.68%--syscall
>
> 0.60% bench [kernel.kallsyms] [k] preempt_count_sub
> |
> ---preempt_count_sub
>
> 0.59% bench [kernel.kallsyms] [k] __rcu_read_lock
> |
> ---__rcu_read_lock
>
> 0.59% bench [kernel.kallsyms] [k] __x64_sys_getpgid
> |
> ---__x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.58% bench [kernel.kallsyms] [k] __audit_syscall_exit
> |
> ---__audit_syscall_exit
> syscall_exit_to_user_mode_prepare
> syscall_exit_to_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.53% bench [kernel.kallsyms] [k] audit_reset_context
> |
> ---audit_reset_context
> syscall_exit_to_user_mode_prepare
> syscall_exit_to_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.45% bench [kernel.kallsyms] [k] rcu_read_lock_held
> 0.36% bench [kernel.kallsyms] [k] find_task_by_vpid
> 0.32% bench [kernel.kallsyms] [k] __bpf_prog_exit
> 0.26% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode
> 0.20% bench [kernel.kallsyms] [k] idr_find
> 0.18% bench [kernel.kallsyms] [k] find_task_by_pid_ns
> 0.17% bench [kernel.kallsyms] [k] update_prog_stats
> 0.16% bench [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 0.14% bench [kernel.kallsyms] [k] pid_task
> 0.04% bench [kernel.kallsyms] [k] memchr_inv
> 0.04% bench [kernel.kallsyms] [k] smp_call_function_many_cond
> 0.03% bench [kernel.kallsyms] [k] do_user_addr_fault
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
@ 2022-10-06 16:35 ` Florent Revest
0 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-06 16:35 UTC (permalink / raw)
To: Jiri Olsa
Cc: Steven Rostedt, Xu Kuohai, Mark Rutland, Catalin Marinas,
Daniel Borkmann, Xu Kuohai, linux-arm-kernel, linux-kernel, bpf,
Will Deacon, Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Thu, Oct 6, 2022 at 12:12 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Wed, Oct 05, 2022 at 11:30:19AM -0400, Steven Rostedt wrote:
> > On Wed, 5 Oct 2022 17:10:33 +0200
> > Florent Revest <revest@chromium.org> wrote:
> >
> > > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >
> > > > On Wed, 5 Oct 2022 22:54:15 +0800
> > > > Xu Kuohai <xukuohai@huawei.com> wrote:
> > > >
> > > > > 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > > > >
> > > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > > 1000000+0 records in
> > > > > 1000000+0 records out
> > > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > > > >
> > > > >
> > > > > 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > > > >
> > > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > > 1000000+0 records in
> > > > > 1000000+0 records out
> > > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
> > >
> > > Thanks for the measurements Xu!
> > >
> > > > Can you show the implementation of the indirect call you used?
> > >
> > > Xu used my development branch here
> > > https://github.com/FlorentRevest/linux/commits/fprobe-min-args
>
> nice :) I guess you did not try to run it on x86, I had to add some small
> changes and disable HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS to compile it
Indeed, I haven't tried building on x86 yet, I'll have a look at what
I broke, thanks. :)
That branch is just an outline of the idea at this point anyway: just
enough for performance measurements, not particularly ready for
review.
> >
> > That looks like it could be optimized quite a bit too.
> >
> > Specifically this part:
> >
> > static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> > {
> > struct bpf_fprobe_call_context *call_ctx = private;
> > struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> > struct bpf_tramp_links *links = fprobe_ctx->links;
> > struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> > struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> > struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> > int i, ret;
> >
> > memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> > call_ctx->ip = ip;
> > for (i = 0; i < fprobe_ctx->nr_args; i++)
> > call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
> >
> > for (i = 0; i < fentry->nr_links; i++)
> > call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
> >
> > call_ctx->args[fprobe_ctx->nr_args] = 0;
> > for (i = 0; i < fmod_ret->nr_links; i++) {
> > ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> > call_ctx->args);
> >
> > if (ret) {
> > ftrace_regs_set_return_value(regs, ret);
> > ftrace_override_function_with_return(regs);
> >
> > bpf_fprobe_exit(fp, ip, regs, private);
> > return false;
> > }
> > }
> >
> > return fexit->nr_links;
> > }
> >
> > There's a lot of low-hanging fruit to speed up there. I wouldn't be too
> > quick to throw out this solution when it hasn't yet had the care that
> > direct calls have had to speed things up.
> >
> > For example, trampolines currently only allow attaching to functions with 6
> > parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
> > zero to 6 parameters, and unroll the argument loop.
> >
> > Would also be interesting to run perf to see where the overhead is. There
> > may be other locations to work on to make it almost as fast as direct
> > callers without the other baggage.
>
> I can boot the change and run tests in qemu but for some reason it
> won't boot on hw, so I have just perf report from qemu so far
Oh, ok, that's interesting. The changes look pretty benign (only
fprobe and arm64 specific code), so I'm curious how that would break
the boot huh :p
>
> there's fprobe/rethook machinery showing out as expected
>
> jirka
>
>
> ---
> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 23K of event 'cpu-clock:k'
> # Event count (approx.): 5841250000
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... .............................................. ..................................................
> #
> 18.65% bench [kernel.kallsyms] [k] syscall_enter_from_user_mode
> |
> ---syscall_enter_from_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 13.03% bench [kernel.kallsyms] [k] seqcount_lockdep_reader_access.constprop.0
> |
> ---seqcount_lockdep_reader_access.constprop.0
> ktime_get_coarse_real_ts64
> syscall_trace_enter.constprop.0
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 9.49% bench [kernel.kallsyms] [k] rethook_try_get
> |
> ---rethook_try_get
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 8.71% bench [kernel.kallsyms] [k] rethook_recycle
> |
> ---rethook_recycle
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 4.31% bench [kernel.kallsyms] [k] rcu_is_watching
> |
> ---rcu_is_watching
> |
> |--1.49%--rethook_try_get
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> |--1.10%--do_getpgid
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> |--1.02%--__bpf_prog_exit
> | call_bpf_prog.isra.0
> | bpf_fprobe_entry
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> --0.70%--__bpf_prog_enter
> call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.94% bench [kernel.kallsyms] [k] lock_release
> |
> ---lock_release
> |
> |--1.51%--call_bpf_prog.isra.0
> | bpf_fprobe_entry
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> --1.43%--do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.91% bench bpf_prog_21856463590f61f1_bench_trigger_fentry [k] bpf_prog_21856463590f61f1_bench_trigger_fentry
> |
> ---bpf_prog_21856463590f61f1_bench_trigger_fentry
> |
> --2.66%--call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.69% bench [kernel.kallsyms] [k] bpf_fprobe_entry
> |
> ---bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.60% bench [kernel.kallsyms] [k] lock_acquire
> |
> ---lock_acquire
> |
> |--1.34%--__bpf_prog_enter
> | call_bpf_prog.isra.0
> | bpf_fprobe_entry
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> --1.24%--do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.42% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode_prepare
> |
> ---syscall_exit_to_user_mode_prepare
> syscall_exit_to_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.37% bench [kernel.kallsyms] [k] __audit_syscall_entry
> |
> ---__audit_syscall_entry
> syscall_trace_enter.constprop.0
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> |
> --2.36%--syscall
>
> 2.35% bench [kernel.kallsyms] [k] syscall_trace_enter.constprop.0
> |
> ---syscall_trace_enter.constprop.0
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.12% bench [kernel.kallsyms] [k] check_preemption_disabled
> |
> ---check_preemption_disabled
> |
> --1.55%--rcu_is_watching
> |
> --0.59%--do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 2.00% bench [kernel.kallsyms] [k] fprobe_handler
> |
> ---fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.94% bench [kernel.kallsyms] [k] local_irq_disable_exit_to_user
> |
> ---local_irq_disable_exit_to_user
> syscall_exit_to_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.84% bench [kernel.kallsyms] [k] rcu_read_lock_sched_held
> |
> ---rcu_read_lock_sched_held
> |
> |--0.93%--lock_acquire
> |
> --0.90%--lock_release
>
> 1.71% bench [kernel.kallsyms] [k] migrate_enable
> |
> ---migrate_enable
> __bpf_prog_exit
> call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.66% bench [kernel.kallsyms] [k] call_bpf_prog.isra.0
> |
> ---call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.53% bench [kernel.kallsyms] [k] __rcu_read_unlock
> |
> ---__rcu_read_unlock
> |
> |--0.86%--__bpf_prog_exit
> | call_bpf_prog.isra.0
> | bpf_fprobe_entry
> | fprobe_handler
> | ftrace_trampoline
> | __x64_sys_getpgid
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | syscall
> |
> --0.66%--do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.31% bench [kernel.kallsyms] [k] debug_smp_processor_id
> |
> ---debug_smp_processor_id
> |
> --0.77%--rcu_is_watching
>
> 1.22% bench [kernel.kallsyms] [k] migrate_disable
> |
> ---migrate_disable
> __bpf_prog_enter
> call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 1.19% bench [kernel.kallsyms] [k] __bpf_prog_enter
> |
> ---__bpf_prog_enter
> call_bpf_prog.isra.0
> bpf_fprobe_entry
> fprobe_handler
> ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.84% bench [kernel.kallsyms] [k] __radix_tree_lookup
> |
> ---__radix_tree_lookup
> find_task_by_pid_ns
> do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.82% bench [kernel.kallsyms] [k] do_getpgid
> |
> ---do_getpgid
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.78% bench [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
> |
> ---debug_lockdep_rcu_enabled
> |
> --0.63%--rcu_read_lock_sched_held
>
> 0.74% bench ftrace_trampoline [k] ftrace_trampoline
> |
> ---ftrace_trampoline
> __x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.72% bench [kernel.kallsyms] [k] preempt_count_add
> |
> ---preempt_count_add
>
> 0.71% bench [kernel.kallsyms] [k] ktime_get_coarse_real_ts64
> |
> ---ktime_get_coarse_real_ts64
> syscall_trace_enter.constprop.0
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.69% bench [kernel.kallsyms] [k] do_syscall_64
> |
> ---do_syscall_64
> entry_SYSCALL_64_after_hwframe
> |
> --0.68%--syscall
>
> 0.60% bench [kernel.kallsyms] [k] preempt_count_sub
> |
> ---preempt_count_sub
>
> 0.59% bench [kernel.kallsyms] [k] __rcu_read_lock
> |
> ---__rcu_read_lock
>
> 0.59% bench [kernel.kallsyms] [k] __x64_sys_getpgid
> |
> ---__x64_sys_getpgid
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.58% bench [kernel.kallsyms] [k] __audit_syscall_exit
> |
> ---__audit_syscall_exit
> syscall_exit_to_user_mode_prepare
> syscall_exit_to_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.53% bench [kernel.kallsyms] [k] audit_reset_context
> |
> ---audit_reset_context
> syscall_exit_to_user_mode_prepare
> syscall_exit_to_user_mode
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> syscall
>
> 0.45% bench [kernel.kallsyms] [k] rcu_read_lock_held
> 0.36% bench [kernel.kallsyms] [k] find_task_by_vpid
> 0.32% bench [kernel.kallsyms] [k] __bpf_prog_exit
> 0.26% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode
> 0.20% bench [kernel.kallsyms] [k] idr_find
> 0.18% bench [kernel.kallsyms] [k] find_task_by_pid_ns
> 0.17% bench [kernel.kallsyms] [k] update_prog_stats
> 0.16% bench [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 0.14% bench [kernel.kallsyms] [k] pid_task
> 0.04% bench [kernel.kallsyms] [k] memchr_inv
> 0.04% bench [kernel.kallsyms] [k] smp_call_function_many_cond
> 0.03% bench [kernel.kallsyms] [k] do_user_addr_fault
> 0.03% bench [kernel.kallsyms] [k] kallsyms_expand_symbol.constprop.0
> 0.03% bench [kernel.kallsyms] [k] native_flush_tlb_global
> 0.03% bench [kernel.kallsyms] [k] __change_page_attr_set_clr
> 0.02% bench [kernel.kallsyms] [k] memcpy_erms
> 0.02% bench [kernel.kallsyms] [k] unwind_next_frame
> 0.02% bench [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> 0.01% bench [kernel.kallsyms] [k] __orc_find
> 0.01% bench [kernel.kallsyms] [k] call_rcu
> 0.01% bench [kernel.kallsyms] [k] __alloc_pages
> 0.01% bench [kernel.kallsyms] [k] __purge_vmap_area_lazy
> 0.01% bench [kernel.kallsyms] [k] __softirqentry_text_start
> 0.01% bench [kernel.kallsyms] [k] __stack_depot_save
> 0.01% bench [kernel.kallsyms] [k] __up_read
> 0.01% bench [kernel.kallsyms] [k] __virt_addr_valid
> 0.01% bench [kernel.kallsyms] [k] clear_page_erms
> 0.01% bench [kernel.kallsyms] [k] deactivate_slab
> 0.01% bench [kernel.kallsyms] [k] do_check_common
> 0.01% bench [kernel.kallsyms] [k] finish_task_switch.isra.0
> 0.01% bench [kernel.kallsyms] [k] free_unref_page_list
> 0.01% bench [kernel.kallsyms] [k] ftrace_rec_iter_next
> 0.01% bench [kernel.kallsyms] [k] handle_mm_fault
> 0.01% bench [kernel.kallsyms] [k] orc_find.part.0
> 0.01% bench [kernel.kallsyms] [k] try_charge_memcg
> 0.00% bench [kernel.kallsyms] [k] ___slab_alloc
> 0.00% bench [kernel.kallsyms] [k] __fdget_pos
> 0.00% bench [kernel.kallsyms] [k] __handle_mm_fault
> 0.00% bench [kernel.kallsyms] [k] __is_insn_slot_addr
> 0.00% bench [kernel.kallsyms] [k] __kmalloc
> 0.00% bench [kernel.kallsyms] [k] __mod_lruvec_page_state
> 0.00% bench [kernel.kallsyms] [k] __mod_node_page_state
> 0.00% bench [kernel.kallsyms] [k] __mutex_lock
> 0.00% bench [kernel.kallsyms] [k] __raw_spin_lock_init
> 0.00% bench [kernel.kallsyms] [k] alloc_vmap_area
> 0.00% bench [kernel.kallsyms] [k] allocate_slab
> 0.00% bench [kernel.kallsyms] [k] audit_get_tty
> 0.00% bench [kernel.kallsyms] [k] bpf_ksym_find
> 0.00% bench [kernel.kallsyms] [k] btf_check_all_metas
> 0.00% bench [kernel.kallsyms] [k] btf_put
> 0.00% bench [kernel.kallsyms] [k] cmpxchg_double_slab.constprop.0.isra.0
> 0.00% bench [kernel.kallsyms] [k] do_fault
> 0.00% bench [kernel.kallsyms] [k] do_raw_spin_trylock
> 0.00% bench [kernel.kallsyms] [k] find_vma
> 0.00% bench [kernel.kallsyms] [k] fs_reclaim_release
> 0.00% bench [kernel.kallsyms] [k] ftrace_check_record
> 0.00% bench [kernel.kallsyms] [k] ftrace_replace_code
> 0.00% bench [kernel.kallsyms] [k] get_mem_cgroup_from_mm
> 0.00% bench [kernel.kallsyms] [k] get_page_from_freelist
> 0.00% bench [kernel.kallsyms] [k] in_gate_area_no_mm
> 0.00% bench [kernel.kallsyms] [k] in_task_stack
> 0.00% bench [kernel.kallsyms] [k] kernel_text_address
> 0.00% bench [kernel.kallsyms] [k] kernfs_fop_read_iter
> 0.00% bench [kernel.kallsyms] [k] kernfs_put_active
> 0.00% bench [kernel.kallsyms] [k] kfree
> 0.00% bench [kernel.kallsyms] [k] kmem_cache_alloc
> 0.00% bench [kernel.kallsyms] [k] ksys_read
> 0.00% bench [kernel.kallsyms] [k] lookup_address_in_pgd
> 0.00% bench [kernel.kallsyms] [k] mlock_page_drain_local
> 0.00% bench [kernel.kallsyms] [k] page_remove_rmap
> 0.00% bench [kernel.kallsyms] [k] post_alloc_hook
> 0.00% bench [kernel.kallsyms] [k] preempt_schedule_irq
> 0.00% bench [kernel.kallsyms] [k] queue_work_on
> 0.00% bench [kernel.kallsyms] [k] stack_trace_save
> 0.00% bench [kernel.kallsyms] [k] within_error_injection_list
>
>
> #
> # (Tip: To record callchains for each sample: perf record -g)
> #
>
Thanks for the measurements Jiri! :) At this point, my hypothesis is
that the biggest part of the performance hit comes from arm64 specific
code in ftrace so I would rather wait to see what Xu finds out on his
pi4. Also, I found an arm64 board today so I should soon be able to
make measurements there too.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-06 16:29 ` Steven Rostedt
@ 2022-10-07 10:13 ` Xu Kuohai
0 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-10-07 10:13 UTC (permalink / raw)
To: Steven Rostedt, Florent Revest
Cc: Mark Rutland, Catalin Marinas, Daniel Borkmann, linux-arm-kernel,
linux-kernel, bpf, Will Deacon, Jean-Philippe Brucker,
Ingo Molnar, Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On 10/7/2022 12:29 AM, Steven Rostedt wrote:
> On Thu, 6 Oct 2022 18:19:12 +0200
> Florent Revest <revest@chromium.org> wrote:
>
>> Sure, we can give this a try, I'll work on a macro that generates the
>> 7 callbacks and we can check how much that helps. My belief right now
>> is that ftrace's iteration over all ops on arm64 is where we lose most
>> time, but now that we have numbers it's pretty easy to check the
>> hypothesis :)
>
> Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
> fixed first. I forget that arm64 doesn't have the dedicated trampolines yet.
>
> So, let's hold off until that is complete.
>
> -- Steve
>
> .
Here is the perf data I captured.
1. perf report
99.94% 0.00% ld-linux-aarch6 bench [.] trigger_producer
|
---trigger_producer
|
|--98.04%--syscall
| |
| --81.35%--el0t_64_sync
| el0t_64_sync_handler
| el0_svc
| do_el0_svc
| |
| |--80.75%--el0_svc_common.constprop.0
| | |
| | |--49.70%--invoke_syscall
| | | |
| | | --46.66%--__arm64_sys_getpgid
| | | |
| | | |--40.73%--ftrace_call
| | | | |
| | | | |--38.71%--ftrace_ops_list_func
| | | | | |
| | | | | |--25.06%--fprobe_handler
| | | | | | |
| | | | | | |--13.20%--bpf_fprobe_entry
| | | | | | | |
| | | | | | | --11.47%--call_bpf_prog.isra.0
| | | | | | | |
| | | | | | | |--4.08%--__bpf_prog_exit
| | | | | | | | |
| | | | | | | | --0.87%--migrate_enable
| | | | | | | |
| | | | | | | |--2.46%--__bpf_prog_enter
| | | | | | | |
| | | | | | | --2.18%--bpf_prog_21856463590f61f1_bench_trigger_fentry
| | | | | | |
| | | | | | |--8.68%--rethook_trampoline_handler
| | | | | | |
| | | | | | --1.59%--rethook_try_get
| | | | | | |
| | | | | | --0.58%--rcu_is_watching
| | | | | |
| | | | | |--6.65%--rethook_trampoline_handler
| | | | | |
| | | | | --0.77%--rethook_recycle
| | | | |
| | | | --1.74%--hash_contains_ip.isra.0
| | | |
| | | --3.62%--find_task_by_vpid
| | | |
| | | --2.75%--idr_find
| | | |
| | | --2.17%--__radix_tree_lookup
| | |
| | --1.30%--ftrace_caller
| |
| --0.60%--invoke_syscall
|
|--0.88%--0xffffb2807594
|
--0.87%--syscall@plt
2. perf annotate
2.1 ftrace_caller
: 39 SYM_CODE_START(ftrace_caller)
: 40 bti c
0.00 : ffff80000802e0c4: bti c
:
: 39 /* Save original SP */
: 40 mov x10, sp
0.00 : ffff80000802e0c8: mov x10, sp
:
: 42 /* Make room for pt_regs, plus two frame records */
: 43 sub sp, sp, #(FREGS_SIZE + 32)
0.00 : ffff80000802e0cc: sub sp, sp, #0x90
:
: 45 /* Save function arguments */
: 46 stp x0, x1, [sp, #FREGS_X0]
0.00 : ffff80000802e0d0: stp x0, x1, [sp]
: 45 stp x2, x3, [sp, #FREGS_X2]
0.00 : ffff80000802e0d4: stp x2, x3, [sp, #16]
: 46 stp x4, x5, [sp, #FREGS_X4]
16.67 : ffff80000802e0d8: stp x4, x5, [sp, #32] // entry-ftrace.S:46
: 47 stp x6, x7, [sp, #FREGS_X6]
8.33 : ffff80000802e0dc: stp x6, x7, [sp, #48] // entry-ftrace.S:47
: 48 str x8, [sp, #FREGS_X8]
0.00 : ffff80000802e0e0: str x8, [sp, #64]
:
: 52 /* Save the callsite's FP, LR, SP */
: 53 str x29, [sp, #FREGS_FP]
8.33 : ffff80000802e0e4: str x29, [sp, #80] // entry-ftrace.S:51
: 52 str x9, [sp, #FREGS_LR]
8.33 : ffff80000802e0e8: str x9, [sp, #88] // entry-ftrace.S:52
: 53 str x10, [sp, #FREGS_SP]
0.00 : ffff80000802e0ec: str x10, [sp, #96]
:
: 57 /* Save the PC after the ftrace callsite */
: 58 str x30, [sp, #FREGS_PC]
16.67 : ffff80000802e0f0: str x30, [sp, #104] // entry-ftrace.S:56
:
: 60 /* Create a frame record for the callsite above the ftrace regs */
: 61 stp x29, x9, [sp, #FREGS_SIZE + 16]
16.67 : ffff80000802e0f4: stp x29, x9, [sp, #128] // entry-ftrace.S:59
: 60 add x29, sp, #FREGS_SIZE + 16
0.00 : ffff80000802e0f8: add x29, sp, #0x80
:
: 64 /* Create our frame record above the ftrace regs */
: 65 stp x29, x30, [sp, #FREGS_SIZE]
16.67 : ffff80000802e0fc: stp x29, x30, [sp, #112] // entry-ftrace.S:63
: 64 add x29, sp, #FREGS_SIZE
0.00 : ffff80000802e100: add x29, sp, #0x70
:
: 67 sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
0.00 : ffff80000802e104: sub x0, x30, #0x4
: 67 mov x1, x9 // parent_ip (callsite's LR)
0.00 : ffff80000802e108: mov x1, x9
: 68 ldr_l x2, function_trace_op // op
0.00 : ffff80000802e10c: adrp x2, ffff800009638000 <folio_wait_table+0x14c0>
0.00 : ffff80000802e110: ldr x2, [x2, #3320]
: 69 mov x3, sp // regs
0.00 : ffff80000802e114: mov x3, sp
:
: 72 ffff80000802e118 <ftrace_call>:
:
: 73 SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
: 74 bl ftrace_stub
0.00 : ffff80000802e118: bl ffff80000802e144 <ftrace_stub>
: 80 * At the callsite x0-x8 and x19-x30 were live. Any C code will have preserved
: 81 * x19-x29 per the AAPCS, and we created frame records upon entry, so we need
: 82 * to restore x0-x8, x29, and x30.
: 83 */
: 84 /* Restore function arguments */
: 85 ldp x0, x1, [sp, #FREGS_X0]
8.33 : ffff80000802e11c: ldp x0, x1, [sp] // entry-ftrace.S:80
: 81 ldp x2, x3, [sp, #FREGS_X2]
0.00 : ffff80000802e120: ldp x2, x3, [sp, #16]
: 82 ldp x4, x5, [sp, #FREGS_X4]
0.00 : ffff80000802e124: ldp x4, x5, [sp, #32]
: 83 ldp x6, x7, [sp, #FREGS_X6]
0.00 : ffff80000802e128: ldp x6, x7, [sp, #48]
: 84 ldr x8, [sp, #FREGS_X8]
0.00 : ffff80000802e12c: ldr x8, [sp, #64]
:
: 88 /* Restore the callsite's FP, LR, PC */
: 89 ldr x29, [sp, #FREGS_FP]
0.00 : ffff80000802e130: ldr x29, [sp, #80]
: 88 ldr x30, [sp, #FREGS_LR]
0.00 : ffff80000802e134: ldr x30, [sp, #88]
: 89 ldr x9, [sp, #FREGS_PC]
0.00 : ffff80000802e138: ldr x9, [sp, #104]
:
: 93 /* Restore the callsite's SP */
: 94 add sp, sp, #FREGS_SIZE + 32
0.00 : ffff80000802e13c: add sp, sp, #0x90
:
: 95 ret x9
0.00 : ffff80000802e140: ret x9
2.2 arch_ftrace_ops_list_func
: 7554 void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
: 7555 struct ftrace_ops *op, struct ftrace_regs *fregs)
: 7556 {
0.00 : ffff80000815bdf0: paciasp
4.65 : ffff80000815bdf4: stp x29, x30, [sp, #-144]! // ftrace.c:7551
0.00 : ffff80000815bdf8: mrs x2, sp_el0
0.00 : ffff80000815bdfc: mov x29, sp
2.32 : ffff80000815be00: stp x19, x20, [sp, #16]
0.00 : ffff80000815be04: mov x20, x1
: 7563 trace_test_and_set_recursion():
: 147 int start)
: 148 {
: 149 unsigned int val = READ_ONCE(current->trace_recursion);
: 150 int bit;
:
: 152 bit = trace_get_context_bit() + start;
0.00 : ffff80000815be08: mov w5, #0x8 // #8
: 154 arch_ftrace_ops_list_func():
0.00 : ffff80000815be0c: stp x21, x22, [sp, #32]
0.00 : ffff80000815be10: mov x21, x3
2.32 : ffff80000815be14: stp x23, x24, [sp, #48]
0.00 : ffff80000815be18: mov x23, x0
0.00 : ffff80000815be1c: ldr x4, [x2, #1168]
2.32 : ffff80000815be20: str x4, [sp, #136]
0.00 : ffff80000815be24: mov x4, #0x0 // #0
: 7558 trace_test_and_set_recursion():
: 148 if (unlikely(val & (1 << bit))) {
0.00 : ffff80000815be28: mov w2, #0x1 // #1
: 150 get_current():
: 19 */
: 20 static __always_inline struct task_struct *get_current(void)
: 21 {
: 22 unsigned long sp_el0;
:
: 24 asm ("mrs %0, sp_el0" : "=r" (sp_el0));
0.00 : ffff80000815be2c: mrs x4, sp_el0
: 26 trace_test_and_set_recursion():
: 144 unsigned int val = READ_ONCE(current->trace_recursion);
0.00 : ffff80000815be30: ldr x7, [x4, #2520]
: 146 preempt_count():
: 13 #define PREEMPT_NEED_RESCHED BIT(32)
: 14 #define PREEMPT_ENABLED (PREEMPT_NEED_RESCHED)
:
: 16 static inline int preempt_count(void)
: 17 {
: 18 return READ_ONCE(current_thread_info()->preempt.count);
0.00 : ffff80000815be34: ldr w6, [x4, #8]
: 20 interrupt_context_level():
: 94 static __always_inline unsigned char interrupt_context_level(void)
: 95 {
: 96 unsigned long pc = preempt_count();
: 97 unsigned char level = 0;
:
: 99 level += !!(pc & (NMI_MASK));
0.00 : ffff80000815be38: tst w6, #0xf00000
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
: 97 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff80000815be3c: and w1, w6, #0xffff00
: 94 level += !!(pc & (NMI_MASK));
0.00 : ffff80000815be40: cset w4, ne // ne = any
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff80000815be44: and w1, w1, #0xffff01ff
: 95 level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
0.00 : ffff80000815be48: tst w6, #0xff0000
0.00 : ffff80000815be4c: cinc w4, w4, ne // ne = any
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff80000815be50: cmp w1, #0x0
: 98 trace_get_context_bit():
: 121 return TRACE_CTX_NORMAL - bit;
0.00 : ffff80000815be54: cinc w4, w4, ne // ne = any
: 123 trace_test_and_set_recursion():
: 147 bit = trace_get_context_bit() + start;
0.00 : ffff80000815be58: sub w5, w5, w4
: 148 if (unlikely(val & (1 << bit))) {
0.00 : ffff80000815be5c: lsl w2, w2, w5
0.00 : ffff80000815be60: tst w2, w7
0.00 : ffff80000815be64: b.ne ffff80000815bf84 <arch_ftrace_ops_list_func+0x194> // b.any
: 152 trace_clear_recursion():
: 180 */
: 181 static __always_inline void trace_clear_recursion(int bit)
: 182 {
: 183 preempt_enable_notrace();
: 184 barrier();
: 185 trace_recursion_clear(bit);
4.65 : ffff80000815be68: mvn w22, w2 // trace_recursion.h:180
0.00 : ffff80000815be6c: str x25, [sp, #64]
0.00 : ffff80000815be70: sxtw x22, w22
: 189 trace_test_and_set_recursion():
: 165 current->trace_recursion = val;
0.00 : ffff80000815be74: orr w2, w2, w7
: 167 get_current():
0.00 : ffff80000815be78: mrs x4, sp_el0
: 20 trace_test_and_set_recursion():
2.32 : ffff80000815be7c: str x2, [x4, #2520] // trace_recursion.h:165
: 166 __preempt_count_add():
: 47 return !current_thread_info()->preempt.need_resched;
: 48 }
:
: 50 static inline void __preempt_count_add(int val)
: 51 {
: 52 u32 pc = READ_ONCE(current_thread_info()->preempt.count);
0.00 : ffff80000815be80: ldr w1, [x4, #8]
: 48 pc += val;
0.00 : ffff80000815be84: add w1, w1, #0x1
: 49 WRITE_ONCE(current_thread_info()->preempt.count, pc);
2.32 : ffff80000815be88: str w1, [x4, #8] // preempt.h:49
: 51 __ftrace_ops_list_func():
: 7506 do_for_each_ftrace_op(op, ftrace_ops_list) {
0.00 : ffff80000815be8c: adrp x0, ffff800009638000 <folio_wait_table+0x14c0>
0.00 : ffff80000815be90: add x25, x0, #0xc28
: 7527 } while_for_each_ftrace_op(op);
0.00 : ffff80000815be94: add x24, x25, #0x8
: 7506 do_for_each_ftrace_op(op, ftrace_ops_list) {
0.00 : ffff80000815be98: ldr x19, [x0, #3112]
: 7508 if (op->flags & FTRACE_OPS_FL_STUB)
4.72 : ffff80000815be9c: ldr x0, [x19, #16] // ftrace.c:7508
0.00 : ffff80000815bea0: tbnz w0, #5, ffff80000815bef8 <arch_ftrace_ops_list_func+0x108>
: 7519 if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
2.32 : ffff80000815bea4: tbnz w0, #14, ffff80000815bf74 <arch_ftrace_ops_list_func+0x184> // ftrace.c:7519
: 7521 ftrace_ops_test():
: 1486 rcu_assign_pointer(hash.filter_hash, ops->func_hash->filter_hash);
2.32 : ffff80000815bea8: ldr x0, [x19, #88] // ftrace.c:1486
0.00 : ffff80000815beac: add x1, sp, #0x60
0.00 : ffff80000815beb0: ldr x0, [x0, #8]
0.00 : ffff80000815beb4: stlr x0, [x1]
: 1487 rcu_assign_pointer(hash.notrace_hash, ops->func_hash->notrace_hash);
0.00 : ffff80000815beb8: ldr x0, [x19, #88]
0.00 : ffff80000815bebc: add x1, sp, #0x58
0.00 : ffff80000815bec0: ldr x0, [x0]
2.32 : ffff80000815bec4: stlr x0, [x1] // ftrace.c:1487
: 1489 if (hash_contains_ip(ip, &hash))
44.15 : ffff80000815bec8: ldp x1, x2, [sp, #88] // ftrace.c:1489
0.00 : ffff80000815becc: mov x0, x23
0.00 : ffff80000815bed0: bl ffff80000815b530 <hash_contains_ip.isra.0>
0.00 : ffff80000815bed4: tst w0, #0xff
0.00 : ffff80000815bed8: b.eq ffff80000815bef8 <arch_ftrace_ops_list_func+0x108> // b.none
: 1495 __ftrace_ops_list_func():
: 7521 if (FTRACE_WARN_ON(!op->func)) {
0.00 : ffff80000815bedc: ldr x4, [x19]
0.00 : ffff80000815bee0: cbz x4, ffff80000815bfa0 <arch_ftrace_ops_list_func+0x1b0>
: 7525 op->func(ip, parent_ip, op, fregs);
0.00 : ffff80000815bee4: mov x3, x21
0.00 : ffff80000815bee8: mov x2, x19
0.00 : ffff80000815beec: mov x1, x20
0.00 : ffff80000815bef0: mov x0, x23
0.00 : ffff80000815bef4: blr x4
: 7527 } while_for_each_ftrace_op(op);
0.00 : ffff80000815bef8: ldr x19, [x19, #8]
0.00 : ffff80000815befc: cmp x19, #0x0
0.00 : ffff80000815bf00: ccmp x19, x24, #0x4, ne // ne = any
0.00 : ffff80000815bf04: b.ne ffff80000815be9c <arch_ftrace_ops_list_func+0xac> // b.any
: 7532 get_current():
0.00 : ffff80000815bf08: mrs x1, sp_el0
: 20 __preempt_count_dec_and_test():
: 62 }
:
: 64 static inline bool __preempt_count_dec_and_test(void)
: 65 {
: 66 struct thread_info *ti = current_thread_info();
: 67 u64 pc = READ_ONCE(ti->preempt_count);
0.00 : ffff80000815bf0c: ldr x0, [x1, #8]
:
: 66 /* Update only the count field, leaving need_resched unchanged */
: 67 WRITE_ONCE(ti->preempt.count, --pc);
0.00 : ffff80000815bf10: sub x0, x0, #0x1
0.00 : ffff80000815bf14: str w0, [x1, #8]
: 74 * need of a reschedule. Otherwise, we need to reload the
: 75 * preempt_count in case the need_resched flag was cleared by an
: 76 * interrupt occurring between the non-atomic READ_ONCE/WRITE_ONCE
: 77 * pair.
: 78 */
: 79 return !pc || !READ_ONCE(ti->preempt_count);
0.00 : ffff80000815bf18: cbnz x0, ffff80000815bf64 <arch_ftrace_ops_list_func+0x174>
: 81 trace_clear_recursion():
: 178 preempt_enable_notrace();
0.00 : ffff80000815bf1c: bl ffff800008ae88d0 <preempt_schedule_notrace>
: 180 get_current():
2.32 : ffff80000815bf20: mrs x1, sp_el0 // current.h:19
: 20 trace_clear_recursion():
: 180 trace_recursion_clear(bit);
0.00 : ffff80000815bf24: ldr x0, [x1, #2520]
0.00 : ffff80000815bf28: and x0, x0, x22
2.32 : ffff80000815bf2c: str x0, [x1, #2520] // trace_recursion.h:180
: 184 arch_ftrace_ops_list_func():
: 7553 __ftrace_ops_list_func(ip, parent_ip, NULL, fregs);
: 7554 }
0.00 : ffff80000815bf30: ldr x25, [sp, #64]
0.00 : ffff80000815bf34: mrs x0, sp_el0
2.32 : ffff80000815bf38: ldr x2, [sp, #136] // ftrace.c:7553
0.00 : ffff80000815bf3c: ldr x1, [x0, #1168]
0.00 : ffff80000815bf40: subs x2, x2, x1
0.00 : ffff80000815bf44: mov x1, #0x0 // #0
0.00 : ffff80000815bf48: b.ne ffff80000815bf98 <arch_ftrace_ops_list_func+0x1a8> // b.any
2.32 : ffff80000815bf4c: ldp x19, x20, [sp, #16]
0.00 : ffff80000815bf50: ldp x21, x22, [sp, #32]
2.32 : ffff80000815bf54: ldp x23, x24, [sp, #48]
0.00 : ffff80000815bf58: ldp x29, x30, [sp], #144
0.00 : ffff80000815bf5c: autiasp
0.00 : ffff80000815bf60: ret
: 7568 __preempt_count_dec_and_test():
11.62 : ffff80000815bf64: ldr x0, [x1, #8] // preempt.h:74
0.00 : ffff80000815bf68: cbnz x0, ffff80000815bf20 <arch_ftrace_ops_list_func+0x130>
: 76 trace_clear_recursion():
: 178 preempt_enable_notrace();
0.00 : ffff80000815bf6c: bl ffff800008ae88d0 <preempt_schedule_notrace>
0.00 : ffff80000815bf70: b ffff80000815bf20 <arch_ftrace_ops_list_func+0x130>
: 181 __ftrace_ops_list_func():
: 7519 if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
0.00 : ffff80000815bf74: bl ffff8000080e5770 <rcu_is_watching>
0.00 : ffff80000815bf78: tst w0, #0xff
0.00 : ffff80000815bf7c: b.ne ffff80000815bea8 <arch_ftrace_ops_list_func+0xb8> // b.any
0.00 : ffff80000815bf80: b ffff80000815bef8 <arch_ftrace_ops_list_func+0x108>
: 7524 trace_test_and_set_recursion():
: 158 if (val & (1 << bit)) {
0.00 : ffff80000815bf84: tbnz w7, #9, ffff80000815bf34 <arch_ftrace_ops_list_func+0x144>
0.00 : ffff80000815bf88: mov x22, #0xfffffffffffffdff // #-513
0.00 : ffff80000815bf8c: mov w2, #0x200 // #512
0.00 : ffff80000815bf90: str x25, [sp, #64]
0.00 : ffff80000815bf94: b ffff80000815be74 <arch_ftrace_ops_list_func+0x84>
0.00 : ffff80000815bf98: str x25, [sp, #64]
: 165 arch_ftrace_ops_list_func():
: 7553 }
0.00 : ffff80000815bf9c: bl ffff800008ae5de0 <__stack_chk_fail>
: 7555 __ftrace_ops_list_func():
: 7521 if (FTRACE_WARN_ON(!op->func)) {
0.00 : ffff80000815bfa0: brk #0x800
: 7523 ftrace_kill():
: 8040 */
: 8041 void ftrace_kill(void)
: 8042 {
: 8043 ftrace_disabled = 1;
: 8044 ftrace_enabled = 0;
: 8045 ftrace_trace_function = ftrace_stub;
0.00 : ffff80000815bfa4: adrp x3, ffff80000802e000 <arch_ftrace_update_code+0x10>
0.00 : ffff80000815bfa8: add x3, x3, #0x144
: 8038 ftrace_disabled = 1;
0.00 : ffff80000815bfac: mov w4, #0x1 // #1
: 8040 __ftrace_ops_list_func():
: 7522 pr_warn("op=%p %pS\n", op, op);
0.00 : ffff80000815bfb0: mov x2, x19
0.00 : ffff80000815bfb4: mov x1, x19
0.00 : ffff80000815bfb8: adrp x0, ffff800008d80000 <kallsyms_token_index+0x17f60>
0.00 : ffff80000815bfbc: add x0, x0, #0x678
: 7527 ftrace_kill():
: 8040 ftrace_trace_function = ftrace_stub;
0.00 : ffff80000815bfc0: str x3, [x25, #192]
: 8039 ftrace_enabled = 0;
0.00 : ffff80000815bfc4: stp w4, wzr, [x25, #200]
: 8041 __ftrace_ops_list_func():
: 7522 pr_warn("op=%p %pS\n", op, op);
0.00 : ffff80000815bfc8: bl ffff800008ad5220 <_printk>
: 7523 goto out;
0.00 : ffff80000815bfcc: b ffff80000815bf08 <arch_ftrace_ops_list_func+0x118>
2.3 fprobe_handler
: 28 static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
: 29 struct ftrace_ops *ops, struct ftrace_regs *fregs)
: 30 {
0.00 : ffff8000081a2020: paciasp
0.00 : ffff8000081a2024: stp x29, x30, [sp, #-64]!
0.00 : ffff8000081a2028: mov x29, sp
0.00 : ffff8000081a202c: stp x19, x20, [sp, #16]
0.00 : ffff8000081a2030: mov x19, x2
0.00 : ffff8000081a2034: stp x21, x22, [sp, #32]
0.00 : ffff8000081a2038: mov x22, x3
0.00 : ffff8000081a203c: str x23, [sp, #48]
0.00 : ffff8000081a2040: mov x23, x0
: 40 fprobe_disabled():
: 49 */
: 50 #define FPROBE_FL_KPROBE_SHARED 2
:
: 52 static inline bool fprobe_disabled(struct fprobe *fp)
: 53 {
: 54 return (fp) ? fp->flags & FPROBE_FL_DISABLED : false;
0.00 : ffff8000081a2044: cbz x2, ffff8000081a2050 <fprobe_handler+0x30>
20.00 : ffff8000081a2048: ldr w0, [x2, #192] // fprobe.h:49
0.00 : ffff8000081a204c: tbnz w0, #0, ffff8000081a2128 <fprobe_handler+0x108>
: 58 get_current():
: 19 */
: 20 static __always_inline struct task_struct *get_current(void)
: 21 {
: 22 unsigned long sp_el0;
:
: 24 asm ("mrs %0, sp_el0" : "=r" (sp_el0));
0.00 : ffff8000081a2050: mrs x0, sp_el0
: 26 trace_test_and_set_recursion():
: 144 * Preemption is promised to be disabled when return bit >= 0.
: 145 */
: 146 static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
: 147 int start)
: 148 {
: 149 unsigned int val = READ_ONCE(current->trace_recursion);
10.00 : ffff8000081a2054: ldr x9, [x0, #2520] // trace_recursion.h:144
: 151 trace_get_context_bit():
: 121 return TRACE_CTX_NORMAL - bit;
0.00 : ffff8000081a2058: mov w6, #0x3 // #3
: 123 preempt_count():
: 13 #define PREEMPT_NEED_RESCHED BIT(32)
: 14 #define PREEMPT_ENABLED (PREEMPT_NEED_RESCHED)
:
: 16 static inline int preempt_count(void)
: 17 {
: 18 return READ_ONCE(current_thread_info()->preempt.count);
0.00 : ffff8000081a205c: ldr w8, [x0, #8]
: 20 trace_test_and_set_recursion():
: 148 int bit;
:
: 150 bit = trace_get_context_bit() + start;
: 151 if (unlikely(val & (1 << bit))) {
0.00 : ffff8000081a2060: mov w4, #0x1 // #1
: 153 interrupt_context_level():
: 94 static __always_inline unsigned char interrupt_context_level(void)
: 95 {
: 96 unsigned long pc = preempt_count();
: 97 unsigned char level = 0;
:
: 99 level += !!(pc & (NMI_MASK));
0.00 : ffff8000081a2064: tst w8, #0xf00000
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
: 97 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff8000081a2068: and w7, w8, #0xffff00
: 94 level += !!(pc & (NMI_MASK));
0.00 : ffff8000081a206c: cset w5, ne // ne = any
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff8000081a2070: and w7, w7, #0xffff01ff
: 95 level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
0.00 : ffff8000081a2074: tst w8, #0xff0000
0.00 : ffff8000081a2078: cinc w5, w5, ne // ne = any
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff8000081a207c: cmp w7, #0x0
: 98 trace_get_context_bit():
: 121 return TRACE_CTX_NORMAL - bit;
0.00 : ffff8000081a2080: cinc w5, w5, ne // ne = any
0.00 : ffff8000081a2084: sub w5, w6, w5
: 124 trace_test_and_set_recursion():
: 148 if (unlikely(val & (1 << bit))) {
0.00 : ffff8000081a2088: lsl w4, w4, w5
: 150 trace_clear_recursion():
: 180 */
: 181 static __always_inline void trace_clear_recursion(int bit)
: 182 {
: 183 preempt_enable_notrace();
: 184 barrier();
: 185 trace_recursion_clear(bit);
10.00 : ffff8000081a208c: mvn w20, w4 // trace_recursion.h:180
0.00 : ffff8000081a2090: sxtw x20, w20
: 188 trace_test_and_set_recursion():
: 148 if (unlikely(val & (1 << bit))) {
0.00 : ffff8000081a2094: tst w4, w9
0.00 : ffff8000081a2098: b.ne ffff8000081a2194 <fprobe_handler+0x174> // b.any
: 165 current->trace_recursion = val;
0.00 : ffff8000081a209c: orr w4, w4, w9
: 167 get_current():
0.00 : ffff8000081a20a0: mrs x5, sp_el0
: 20 trace_test_and_set_recursion():
0.00 : ffff8000081a20a4: str x4, [x5, #2520]
: 166 __preempt_count_add():
: 47 return !current_thread_info()->preempt.need_resched;
: 48 }
:
: 50 static inline void __preempt_count_add(int val)
: 51 {
: 52 u32 pc = READ_ONCE(current_thread_info()->preempt.count);
0.00 : ffff8000081a20a8: ldr w4, [x5, #8]
: 48 pc += val;
0.00 : ffff8000081a20ac: add w4, w4, #0x1
: 49 WRITE_ONCE(current_thread_info()->preempt.count, pc);
0.00 : ffff8000081a20b0: str w4, [x5, #8]
: 51 fprobe_handler():
: 43 if (bit < 0) {
: 44 fp->nmissed++;
: 45 return;
: 46 }
:
: 48 if (fp->exit_handler) {
0.00 : ffff8000081a20b4: ldr x0, [x19, #224]
0.00 : ffff8000081a20b8: cbz x0, ffff8000081a2140 <fprobe_handler+0x120>
: 44 rh = rethook_try_get(fp->rethook);
10.00 : ffff8000081a20bc: ldr x0, [x19, #200] // fprobe.c:44
0.00 : ffff8000081a20c0: bl ffff8000081a2a54 <rethook_try_get>
0.00 : ffff8000081a20c4: mov x21, x0
: 45 if (!rh) {
0.00 : ffff8000081a20c8: cbz x0, ffff8000081a21a4 <fprobe_handler+0x184>
: 50 fp->nmissed++;
: 51 goto out;
: 52 }
: 53 fpr = container_of(rh, struct fprobe_rethook_node, node);
: 54 fpr->entry_ip = ip;
0.00 : ffff8000081a20cc: str x23, [x0, #48]
: 54 private = fpr->private;
: 55 }
:
: 57 if (fp->entry_handler)
0.00 : ffff8000081a20d0: ldr x4, [x19, #216]
0.00 : ffff8000081a20d4: cbz x4, ffff8000081a2180 <fprobe_handler+0x160>
: 55 should_rethook = fp->entry_handler(fp, ip, fregs, fpr->private);
0.00 : ffff8000081a20d8: mov x1, x23
0.00 : ffff8000081a20dc: mov x0, x19
0.00 : ffff8000081a20e0: add x3, x21, #0x38
0.00 : ffff8000081a20e4: mov x2, x22
0.00 : ffff8000081a20e8: blr x4
:
: 59 if (rh) {
: 60 if (should_rethook)
0.00 : ffff8000081a20ec: tst w0, #0xff
0.00 : ffff8000081a20f0: b.ne ffff8000081a2180 <fprobe_handler+0x160> // b.any
: 61 rethook_hook(rh, fregs, true);
: 62 else
: 63 rethook_recycle(rh);
0.00 : ffff8000081a20f4: mov x0, x21
0.00 : ffff8000081a20f8: bl ffff8000081a2bf0 <rethook_recycle>
: 66 get_current():
0.00 : ffff8000081a20fc: mrs x1, sp_el0
: 20 __preempt_count_dec_and_test():
: 62 }
:
: 64 static inline bool __preempt_count_dec_and_test(void)
: 65 {
: 66 struct thread_info *ti = current_thread_info();
: 67 u64 pc = READ_ONCE(ti->preempt_count);
0.00 : ffff8000081a2100: ldr x0, [x1, #8]
:
: 66 /* Update only the count field, leaving need_resched unchanged */
: 67 WRITE_ONCE(ti->preempt.count, --pc);
0.00 : ffff8000081a2104: sub x0, x0, #0x1
0.00 : ffff8000081a2108: str w0, [x1, #8]
: 74 * need of a reschedule. Otherwise, we need to reload the
: 75 * preempt_count in case the need_resched flag was cleared by an
: 76 * interrupt occurring between the non-atomic READ_ONCE/WRITE_ONCE
: 77 * pair.
: 78 */
: 79 return !pc || !READ_ONCE(ti->preempt_count);
0.00 : ffff8000081a210c: cbnz x0, ffff8000081a2170 <fprobe_handler+0x150>
: 81 trace_clear_recursion():
: 178 preempt_enable_notrace();
0.00 : ffff8000081a2110: bl ffff800008ae88d0 <preempt_schedule_notrace>
0.00 : ffff8000081a2114: nop
: 181 get_current():
10.00 : ffff8000081a2118: mrs x1, sp_el0 // current.h:19
: 20 trace_clear_recursion():
: 180 trace_recursion_clear(bit);
0.00 : ffff8000081a211c: ldr x0, [x1, #2520]
0.00 : ffff8000081a2120: and x0, x0, x20
10.00 : ffff8000081a2124: str x0, [x1, #2520] // trace_recursion.h:180
: 184 fprobe_handler():
: 66 }
:
: 68 out:
: 69 ftrace_test_recursion_unlock(bit);
: 70 }
0.00 : ffff8000081a2128: ldp x19, x20, [sp, #16]
0.00 : ffff8000081a212c: ldp x21, x22, [sp, #32]
0.00 : ffff8000081a2130: ldr x23, [sp, #48]
20.00 : ffff8000081a2134: ldp x29, x30, [sp], #64 // fprobe.c:66
0.00 : ffff8000081a2138: autiasp
10.00 : ffff8000081a213c: ret
: 54 if (fp->entry_handler)
0.00 : ffff8000081a2140: ldr x4, [x19, #216]
0.00 : ffff8000081a2144: cbz x4, ffff8000081a215c <fprobe_handler+0x13c>
: 55 should_rethook = fp->entry_handler(fp, ip, fregs, fpr->private);
0.00 : ffff8000081a2148: mov x2, x22
0.00 : ffff8000081a214c: mov x1, x23
0.00 : ffff8000081a2150: mov x0, x19
0.00 : ffff8000081a2154: mov x3, #0x38 // #56
0.00 : ffff8000081a2158: blr x4
: 61 get_current():
0.00 : ffff8000081a215c: mrs x1, sp_el0
: 20 __preempt_count_dec_and_test():
: 62 u64 pc = READ_ONCE(ti->preempt_count);
0.00 : ffff8000081a2160: ldr x0, [x1, #8]
: 65 WRITE_ONCE(ti->preempt.count, --pc);
0.00 : ffff8000081a2164: sub x0, x0, #0x1
0.00 : ffff8000081a2168: str w0, [x1, #8]
: 74 return !pc || !READ_ONCE(ti->preempt_count);
0.00 : ffff8000081a216c: cbz x0, ffff8000081a2110 <fprobe_handler+0xf0>
0.00 : ffff8000081a2170: ldr x0, [x1, #8]
0.00 : ffff8000081a2174: cbnz x0, ffff8000081a2118 <fprobe_handler+0xf8>
: 78 trace_clear_recursion():
: 178 preempt_enable_notrace();
0.00 : ffff8000081a2178: bl ffff800008ae88d0 <preempt_schedule_notrace>
0.00 : ffff8000081a217c: b ffff8000081a2118 <fprobe_handler+0xf8>
: 181 fprobe_handler():
: 59 rethook_hook(rh, fregs, true);
0.00 : ffff8000081a2180: mov x1, x22
0.00 : ffff8000081a2184: mov x0, x21
0.00 : ffff8000081a2188: mov w2, #0x1 // #1
0.00 : ffff8000081a218c: bl ffff8000081a27d0 <rethook_hook>
0.00 : ffff8000081a2190: b ffff8000081a215c <fprobe_handler+0x13c>
: 65 trace_test_and_set_recursion():
: 158 if (val & (1 << bit)) {
0.00 : ffff8000081a2194: tbnz w9, #4, ffff8000081a21b4 <fprobe_handler+0x194>
0.00 : ffff8000081a2198: mov x20, #0xffffffffffffffef // #-17
0.00 : ffff8000081a219c: mov w4, #0x10 // #16
0.00 : ffff8000081a21a0: b ffff8000081a209c <fprobe_handler+0x7c>
: 163 fprobe_handler():
: 46 fp->nmissed++;
0.00 : ffff8000081a21a4: ldr x0, [x19, #184]
0.00 : ffff8000081a21a8: add x0, x0, #0x1
0.00 : ffff8000081a21ac: str x0, [x19, #184]
: 47 goto out;
0.00 : ffff8000081a21b0: b ffff8000081a215c <fprobe_handler+0x13c>
: 39 fp->nmissed++;
0.00 : ffff8000081a21b4: ldr x0, [x19, #184]
0.00 : ffff8000081a21b8: add x0, x0, #0x1
0.00 : ffff8000081a21bc: str x0, [x19, #184]
: 40 return;
0.00 : ffff8000081a21c0: b ffff8000081a2128 <fprobe_handler+0x108>
2.4 bpf_fprobe_entry
: 5 ffff8000081e19f0 <bpf_fprobe_entry>:
: 6 bpf_fprobe_entry():
: 1057 flags = u64_stats_update_begin_irqsave(&stats->syncp);
: 1058 u64_stats_inc(&stats->cnt);
: 1059 u64_stats_add(&stats->nsecs, sched_clock() - start);
: 1060 u64_stats_update_end_irqrestore(&stats->syncp, flags);
: 1061 }
: 1062 }
0.00 : ffff8000081e19f0: bti c
0.00 : ffff8000081e19f4: nop
0.00 : ffff8000081e19f8: nop
: 165 {
0.00 : ffff8000081e19fc: paciasp
0.00 : ffff8000081e1a00: stp x29, x30, [sp, #-80]!
0.00 : ffff8000081e1a04: mov w4, #0x0 // #0
0.00 : ffff8000081e1a08: mov x29, sp
0.00 : ffff8000081e1a0c: stp x19, x20, [sp, #16]
0.00 : ffff8000081e1a10: mov x19, x3
0.00 : ffff8000081e1a14: stp x21, x22, [sp, #32]
0.00 : ffff8000081e1a18: mov x22, x0
0.00 : ffff8000081e1a1c: mov x21, x2
0.00 : ffff8000081e1a20: stp x23, x24, [sp, #48]
0.00 : ffff8000081e1a24: str x25, [sp, #64]
: 167 struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
0.00 : ffff8000081e1a28: ldr x24, [x0, #24]
: 168 struct bpf_tramp_links *links = fprobe_ctx->links;
0.00 : ffff8000081e1a2c: ldr x23, [x24]
: 174 memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
0.00 : ffff8000081e1a30: stp xzr, xzr, [x3]
: 175 call_ctx->ip = ip;
0.00 : ffff8000081e1a34: str x1, [x3, #16]
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a38: ldr w0, [x24, #8]
0.00 : ffff8000081e1a3c: cmp w0, #0x0
0.00 : ffff8000081e1a40: b.gt ffff8000081e1a64 <bpf_fprobe_entry+0x74>
0.00 : ffff8000081e1a44: b ffff8000081e1a90 <bpf_fprobe_entry+0xa0>
: 177 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
0.00 : ffff8000081e1a48: ldr x0, [x21, x1, lsl #3]
0.00 : ffff8000081e1a4c: add x1, x19, x1, lsl #3
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a50: add w4, w4, #0x1
: 177 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
16.67 : ffff8000081e1a54: str x0, [x1, #24] // trampoline.c:177
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a58: ldr w0, [x24, #8]
0.00 : ffff8000081e1a5c: cmp w0, w4
0.00 : ffff8000081e1a60: b.le ffff8000081e1a90 <bpf_fprobe_entry+0xa0>
: 177 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
8.33 : ffff8000081e1a64: sxtw x1, w4
0.00 : ffff8000081e1a68: mov x0, #0x0 // #0
0.00 : ffff8000081e1a6c: cmp w4, #0x7
0.00 : ffff8000081e1a70: b.le ffff8000081e1a48 <bpf_fprobe_entry+0x58>
0.00 : ffff8000081e1a74: sxtw x1, w4
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a78: add w4, w4, #0x1
: 177 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
0.00 : ffff8000081e1a7c: add x1, x19, x1, lsl #3
0.00 : ffff8000081e1a80: str x0, [x1, #24]
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a84: ldr w0, [x24, #8]
0.00 : ffff8000081e1a88: cmp w0, w4
0.00 : ffff8000081e1a8c: b.gt ffff8000081e1a64 <bpf_fprobe_entry+0x74>
: 179 for (i = 0; i < fentry->nr_links; i++)
0.00 : ffff8000081e1a90: ldr w1, [x23, #304]
: 185 call_ctx->args);
0.00 : ffff8000081e1a94: add x25, x19, #0x18
0.00 : ffff8000081e1a98: mov x20, #0x0 // #0
: 179 for (i = 0; i < fentry->nr_links; i++)
0.00 : ffff8000081e1a9c: cmp w1, #0x0
0.00 : ffff8000081e1aa0: b.le ffff8000081e1ad4 <bpf_fprobe_entry+0xe4>
0.00 : ffff8000081e1aa4: nop
: 180 call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
0.00 : ffff8000081e1aa8: ldr x1, [x23, x20, lsl #3]
0.00 : ffff8000081e1aac: mov x3, x25
0.00 : ffff8000081e1ab0: mov x2, x19
: 179 for (i = 0; i < fentry->nr_links; i++)
0.00 : ffff8000081e1ab4: add x20, x20, #0x1
: 180 call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
16.67 : ffff8000081e1ab8: ldr x0, [x1, #24] // trampoline.c:180
0.00 : ffff8000081e1abc: ldr x1, [x1, #80]
0.00 : ffff8000081e1ac0: bl ffff8000081e1800 <call_bpf_prog.isra.0>
: 179 for (i = 0; i < fentry->nr_links; i++)
0.00 : ffff8000081e1ac4: ldr w0, [x23, #304]
0.00 : ffff8000081e1ac8: cmp w0, w20
0.00 : ffff8000081e1acc: b.gt ffff8000081e1aa8 <bpf_fprobe_entry+0xb8>
0.00 : ffff8000081e1ad0: ldr w0, [x24, #8]
: 182 call_ctx->args[fprobe_ctx->nr_args] = 0;
0.00 : ffff8000081e1ad4: add x0, x19, w0, sxtw #3
: 183 for (i = 0; i < fmod_ret->nr_links; i++) {
0.00 : ffff8000081e1ad8: add x25, x23, #0x270
: 185 call_ctx->args);
0.00 : ffff8000081e1adc: add x24, x19, #0x18
0.00 : ffff8000081e1ae0: mov x20, #0x0 // #0
: 182 call_ctx->args[fprobe_ctx->nr_args] = 0;
25.00 : ffff8000081e1ae4: str xzr, [x0, #24] // trampoline.c:182
: 183 for (i = 0; i < fmod_ret->nr_links; i++) {
0.00 : ffff8000081e1ae8: ldr w0, [x25, #304]
0.00 : ffff8000081e1aec: cmp w0, #0x0
0.00 : ffff8000081e1af0: b.gt ffff8000081e1b04 <bpf_fprobe_entry+0x114>
16.67 : ffff8000081e1af4: b ffff8000081e1ba8 <bpf_fprobe_entry+0x1b8> // trampoline.c:183
0.00 : ffff8000081e1af8: ldr w0, [x25, #304]
0.00 : ffff8000081e1afc: cmp w0, w20
0.00 : ffff8000081e1b00: b.le ffff8000081e1ba8 <bpf_fprobe_entry+0x1b8>
: 184 ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
0.00 : ffff8000081e1b04: ldr x1, [x25, x20, lsl #3]
0.00 : ffff8000081e1b08: mov x3, x24
0.00 : ffff8000081e1b0c: mov x2, x19
: 183 for (i = 0; i < fmod_ret->nr_links; i++) {
0.00 : ffff8000081e1b10: add x20, x20, #0x1
: 184 ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
0.00 : ffff8000081e1b14: ldr x0, [x1, #24]
0.00 : ffff8000081e1b18: ldr x1, [x1, #80]
0.00 : ffff8000081e1b1c: bl ffff8000081e1800 <call_bpf_prog.isra.0>
: 187 if (ret) {
0.00 : ffff8000081e1b20: cbz w0, ffff8000081e1af8 <bpf_fprobe_entry+0x108>
: 189 ftrace_override_function_with_return(regs);
0.00 : ffff8000081e1b24: ldr x2, [x21, #88]
: 188 ftrace_regs_set_return_value(regs, ret);
0.00 : ffff8000081e1b28: sxtw x1, w0
0.00 : ffff8000081e1b2c: str x1, [x21]
: 191 bpf_fprobe_exit():
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b30: mov x20, #0x0 // #0
: 162 bpf_fprobe_entry():
: 189 ftrace_override_function_with_return(regs);
0.00 : ffff8000081e1b34: str x2, [x21, #104]
: 191 bpf_fprobe_exit():
: 153 struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
0.00 : ffff8000081e1b38: ldr x2, [x22, #24]
: 158 call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
0.00 : ffff8000081e1b3c: ldrsw x0, [x2, #8]
: 154 struct bpf_tramp_links *links = fprobe_ctx->links;
0.00 : ffff8000081e1b40: ldr x21, [x2]
: 158 call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
0.00 : ffff8000081e1b44: add x0, x19, x0, lsl #3
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b48: add x21, x21, #0x138
: 158 call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
0.00 : ffff8000081e1b4c: str x1, [x0, #24]
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b50: ldr w0, [x21, #304]
0.00 : ffff8000081e1b54: cmp w0, #0x0
0.00 : ffff8000081e1b58: b.le ffff8000081e1b88 <bpf_fprobe_entry+0x198>
0.00 : ffff8000081e1b5c: nop
: 161 call_bpf_prog(fexit->links[i], &call_ctx->ctx, call_ctx->args);
0.00 : ffff8000081e1b60: ldr x1, [x21, x20, lsl #3]
0.00 : ffff8000081e1b64: mov x3, x24
0.00 : ffff8000081e1b68: mov x2, x19
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b6c: add x20, x20, #0x1
: 161 call_bpf_prog(fexit->links[i], &call_ctx->ctx, call_ctx->args);
0.00 : ffff8000081e1b70: ldr x0, [x1, #24]
0.00 : ffff8000081e1b74: ldr x1, [x1, #80]
0.00 : ffff8000081e1b78: bl ffff8000081e1800 <call_bpf_prog.isra.0>
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b7c: ldr w0, [x21, #304]
0.00 : ffff8000081e1b80: cmp w0, w20
0.00 : ffff8000081e1b84: b.gt ffff8000081e1b60 <bpf_fprobe_entry+0x170>
: 164 bpf_fprobe_entry():
: 192 return false;
0.00 : ffff8000081e1b88: mov w0, #0x0 // #0
: 197 }
0.00 : ffff8000081e1b8c: ldp x19, x20, [sp, #16]
0.00 : ffff8000081e1b90: ldp x21, x22, [sp, #32]
0.00 : ffff8000081e1b94: ldp x23, x24, [sp, #48]
0.00 : ffff8000081e1b98: ldr x25, [sp, #64]
0.00 : ffff8000081e1b9c: ldp x29, x30, [sp], #80
0.00 : ffff8000081e1ba0: autiasp
0.00 : ffff8000081e1ba4: ret
: 196 return fexit->nr_links;
0.00 : ffff8000081e1ba8: ldr w0, [x23, #616]
: 197 }
0.00 : ffff8000081e1bac: ldp x19, x20, [sp, #16]
: 196 return fexit->nr_links;
0.00 : ffff8000081e1bb0: cmp w0, #0x0
0.00 : ffff8000081e1bb4: cset w0, ne // ne = any
: 197 }
0.00 : ffff8000081e1bb8: ldp x21, x22, [sp, #32]
0.00 : ffff8000081e1bbc: ldp x23, x24, [sp, #48]
0.00 : ffff8000081e1bc0: ldr x25, [sp, #64]
0.00 : ffff8000081e1bc4: ldp x29, x30, [sp], #80
0.00 : ffff8000081e1bc8: autiasp
16.67 : ffff8000081e1bcc: ret // trampoline.c:197
2.5 call_bpf_prog
: 5 ffff8000081e1800 <call_bpf_prog.isra.0>:
: 6 call_bpf_prog.isra.0():
:
: 200 if (oldp)
: 201 *oldp = old;
:
: 203 if (unlikely(!old))
: 204 refcount_warn_saturate(r, REFCOUNT_ADD_UAF);
13.33 : ffff8000081e1800: nop // refcount.h:199
0.00 : ffff8000081e1804: nop
: 207 call_bpf_prog():
:
: 108 mutex_unlock(&tr->mutex);
: 109 return ret;
: 110 }
: 111 #else
: 112 static unsigned int call_bpf_prog(struct bpf_tramp_link *l,
0.00 : ffff8000081e1808: paciasp
0.00 : ffff8000081e180c: stp x29, x30, [sp, #-64]!
0.00 : ffff8000081e1810: mov x29, sp
0.00 : ffff8000081e1814: stp x19, x20, [sp, #16]
0.00 : ffff8000081e1818: mov x19, x0
0.00 : ffff8000081e181c: mov x20, x2
0.00 : ffff8000081e1820: stp x21, x22, [sp, #32]
6.67 : ffff8000081e1824: stp x23, x24, [sp, #48] // trampoline.c:107
0.00 : ffff8000081e1828: mov x24, x3
: 118 struct bpf_tramp_run_ctx *run_ctx) = __bpf_prog_exit;
: 119 struct bpf_prog *p = l->link.prog;
: 120 unsigned int ret;
: 121 u64 start_time;
:
: 123 if (p->aux->sleepable) {
60.00 : ffff8000081e182c: ldr x0, [x0, #56] // trampoline.c:118
13.33 : ffff8000081e1830: ldrb w0, [x0, #140]
0.00 : ffff8000081e1834: cbnz w0, ffff8000081e1858 <call_bpf_prog.isra.0+0x58>
: 121 enter = __bpf_prog_enter_sleepable;
: 122 exit = __bpf_prog_exit_sleepable;
: 123 } else if (p->expected_attach_type == BPF_LSM_CGROUP) {
0.00 : ffff8000081e1838: ldr w0, [x19, #8]
0.00 : ffff8000081e183c: cmp w0, #0x2b
0.00 : ffff8000081e1840: b.eq ffff8000081e18c4 <call_bpf_prog.isra.0+0xc4> // b.none
: 112 void (*exit)(struct bpf_prog *prog, u64 start,
0.00 : ffff8000081e1844: adrp x22, ffff8000081e1000 <print_bpf_insn+0x580>
: 110 u64 (*enter)(struct bpf_prog *prog,
0.00 : ffff8000081e1848: adrp x2, ffff8000081e1000 <print_bpf_insn+0x580>
: 112 void (*exit)(struct bpf_prog *prog, u64 start,
0.00 : ffff8000081e184c: add x22, x22, #0xbd0
: 110 u64 (*enter)(struct bpf_prog *prog,
0.00 : ffff8000081e1850: add x2, x2, #0xd20
0.00 : ffff8000081e1854: b ffff8000081e1868 <call_bpf_prog.isra.0+0x68>
: 120 exit = __bpf_prog_exit_sleepable;
0.00 : ffff8000081e1858: adrp x22, ffff8000081e1000 <print_bpf_insn+0x580>
: 119 enter = __bpf_prog_enter_sleepable;
0.00 : ffff8000081e185c: adrp x2, ffff8000081e1000 <print_bpf_insn+0x580>
: 120 exit = __bpf_prog_exit_sleepable;
0.00 : ffff8000081e1860: add x22, x22, #0xc60
: 119 enter = __bpf_prog_enter_sleepable;
0.00 : ffff8000081e1864: add x2, x2, #0xe10
: 126 enter = __bpf_prog_enter_lsm_cgroup;
: 127 exit = __bpf_prog_exit_lsm_cgroup;
: 128 }
:
: 130 ctx->bpf_cookie = l->cookie;
0.00 : ffff8000081e1868: str x1, [x20]
:
: 129 start_time = enter(p, ctx);
0.00 : ffff8000081e186c: mov x0, x19
0.00 : ffff8000081e1870: mov x1, x20
: 130 if (!start_time)
: 131 return 0;
0.00 : ffff8000081e1874: mov w23, #0x0 // #0
: 128 start_time = enter(p, ctx);
0.00 : ffff8000081e1878: blr x2
0.00 : ffff8000081e187c: mov x21, x0
: 129 if (!start_time)
0.00 : ffff8000081e1880: cbz x0, ffff8000081e18a8 <call_bpf_prog.isra.0+0xa8>
:
: 133 ret = p->bpf_func(args, p->insnsi);
0.00 : ffff8000081e1884: ldr x2, [x19, #48]
0.00 : ffff8000081e1888: add x1, x19, #0x48
0.00 : ffff8000081e188c: mov x0, x24
0.00 : ffff8000081e1890: blr x2
0.00 : ffff8000081e1894: mov w23, w0
:
: 135 exit(p, start_time, ctx);
0.00 : ffff8000081e1898: mov x2, x20
0.00 : ffff8000081e189c: mov x1, x21
0.00 : ffff8000081e18a0: mov x0, x19
0.00 : ffff8000081e18a4: blr x22
:
: 138 return ret;
: 139 }
0.00 : ffff8000081e18a8: mov w0, w23
0.00 : ffff8000081e18ac: ldp x19, x20, [sp, #16]
0.00 : ffff8000081e18b0: ldp x21, x22, [sp, #32]
0.00 : ffff8000081e18b4: ldp x23, x24, [sp, #48]
6.67 : ffff8000081e18b8: ldp x29, x30, [sp], #64 // trampoline.c:137
0.00 : ffff8000081e18bc: autiasp
0.00 : ffff8000081e18c0: ret
: 123 exit = __bpf_prog_exit_lsm_cgroup;
0.00 : ffff8000081e18c4: adrp x22, ffff8000081e1000 <print_bpf_insn+0x580>
: 122 enter = __bpf_prog_enter_lsm_cgroup;
0.00 : ffff8000081e18c8: adrp x2, ffff8000081e1000 <print_bpf_insn+0x580>
: 123 exit = __bpf_prog_exit_lsm_cgroup;
0.00 : ffff8000081e18cc: add x22, x22, #0x200
: 122 enter = __bpf_prog_enter_lsm_cgroup;
0.00 : ffff8000081e18d0: add x2, x2, #0x1c0
0.00 : ffff8000081e18d4: b ffff8000081e1868 <call_bpf_prog.isra.0+0x68>
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
@ 2022-10-07 10:13 ` Xu Kuohai
0 siblings, 0 replies; 60+ messages in thread
From: Xu Kuohai @ 2022-10-07 10:13 UTC (permalink / raw)
To: Steven Rostedt, Florent Revest
Cc: Mark Rutland, Catalin Marinas, Daniel Borkmann, linux-arm-kernel,
linux-kernel, bpf, Will Deacon, Jean-Philippe Brucker,
Ingo Molnar, Oleg Nesterov, Alexei Starovoitov, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Zi Shen Lim,
Pasha Tatashin, Ard Biesheuvel, Marc Zyngier, Guo Ren,
Masami Hiramatsu
On 10/7/2022 12:29 AM, Steven Rostedt wrote:
> On Thu, 6 Oct 2022 18:19:12 +0200
> Florent Revest <revest@chromium.org> wrote:
>
>> Sure, we can give this a try, I'll work on a macro that generates the
>> 7 callbacks and we can check how much that helps. My belief right now
>> is that ftrace's iteration over all ops on arm64 is where we lose most
>> time but now that we have numbers it's pretty easy to check hypothesis
>> :)
>
> Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
> fixed first. I forget that arm64 doesn't have the dedicated trampolines yet.
>
> So, let's hold off until that is complete.
>
> -- Steve
>
Here is the perf data I captured.
1. perf report
99.94% 0.00% ld-linux-aarch6 bench [.] trigger_producer
|
---trigger_producer
|
|--98.04%--syscall
| |
| --81.35%--el0t_64_sync
| el0t_64_sync_handler
| el0_svc
| do_el0_svc
| |
| |--80.75%--el0_svc_common.constprop.0
| | |
| | |--49.70%--invoke_syscall
| | | |
| | | --46.66%--__arm64_sys_getpgid
| | | |
| | | |--40.73%--ftrace_call
| | | | |
| | | | |--38.71%--ftrace_ops_list_func
| | | | | |
| | | | | |--25.06%--fprobe_handler
| | | | | | |
| | | | | | |--13.20%--bpf_fprobe_entry
| | | | | | | |
| | | | | | | --11.47%--call_bpf_prog.isra.0
| | | | | | | |
| | | | | | | |--4.08%--__bpf_prog_exit
| | | | | | | | |
| | | | | | | | --0.87%--migrate_enable
| | | | | | | |
| | | | | | | |--2.46%--__bpf_prog_enter
| | | | | | | |
| | | | | | | --2.18%--bpf_prog_21856463590f61f1_bench_trigger_fentry
| | | | | | |
| | | | | | |--8.68%--rethook_trampoline_handler
| | | | | | |
| | | | | | --1.59%--rethook_try_get
| | | | | | |
| | | | | | --0.58%--rcu_is_watching
| | | | | |
| | | | | |--6.65%--rethook_trampoline_handler
| | | | | |
| | | | | --0.77%--rethook_recycle
| | | | |
| | | | --1.74%--hash_contains_ip.isra.0
| | | |
| | | --3.62%--find_task_by_vpid
| | | |
| | | --2.75%--idr_find
| | | |
| | | --2.17%--__radix_tree_lookup
| | |
| | --1.30%--ftrace_caller
| |
| --0.60%--invoke_syscall
|
|--0.88%--0xffffb2807594
|
--0.87%--syscall@plt
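A report of this shape is typically produced with perf record/report/annotate; the exact capture commands are not stated in the thread, so the invocation below is an assumption (benchmark name included). The helper function is just a quick way to pull "percent symbol" pairs out of a saved call-graph fragment and rank them:

```shell
# Assumed capture (sketch, not from the thread):
#   perf record -g -a -- ./bench trig-fentry
#   perf report --stdio > report.txt     # section 1 above
#   perf annotate --stdio <symbol>       # section 2 above

# Rank the hottest symbols from a perf-report call-graph fragment,
# e.g. "|--38.71%--ftrace_ops_list_func" -> "38.71 ftrace_ops_list_func".
rank_overhead() {
	grep -oE '[0-9]+\.[0-9]+%--[A-Za-z_][A-Za-z0-9_.]*' |
		tr -d '%' | sed 's/--/ /' | sort -rn
}
```

For example, piping the fragment above through rank_overhead puts ftrace_ops_list_func and fprobe_handler at the top, matching the conclusion that the ops-list iteration dominates.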
2. perf annotate
2.1 ftrace_caller
: 39 SYM_CODE_START(ftrace_caller)
: 40 bti c
0.00 : ffff80000802e0c4: bti c
:
: 39 /* Save original SP */
: 40 mov x10, sp
0.00 : ffff80000802e0c8: mov x10, sp
:
: 42 /* Make room for pt_regs, plus two frame records */
: 43 sub sp, sp, #(FREGS_SIZE + 32)
0.00 : ffff80000802e0cc: sub sp, sp, #0x90
:
: 45 /* Save function arguments */
: 46 stp x0, x1, [sp, #FREGS_X0]
0.00 : ffff80000802e0d0: stp x0, x1, [sp]
: 45 stp x2, x3, [sp, #FREGS_X2]
0.00 : ffff80000802e0d4: stp x2, x3, [sp, #16]
: 46 stp x4, x5, [sp, #FREGS_X4]
16.67 : ffff80000802e0d8: stp x4, x5, [sp, #32] // entry-ftrace.S:46
: 47 stp x6, x7, [sp, #FREGS_X6]
8.33 : ffff80000802e0dc: stp x6, x7, [sp, #48] // entry-ftrace.S:47
: 48 str x8, [sp, #FREGS_X8]
0.00 : ffff80000802e0e0: str x8, [sp, #64]
:
: 52 /* Save the callsite's FP, LR, SP */
: 53 str x29, [sp, #FREGS_FP]
8.33 : ffff80000802e0e4: str x29, [sp, #80] // entry-ftrace.S:51
: 52 str x9, [sp, #FREGS_LR]
8.33 : ffff80000802e0e8: str x9, [sp, #88] // entry-ftrace.S:52
: 53 str x10, [sp, #FREGS_SP]
0.00 : ffff80000802e0ec: str x10, [sp, #96]
:
: 57 /* Save the PC after the ftrace callsite */
: 58 str x30, [sp, #FREGS_PC]
16.67 : ffff80000802e0f0: str x30, [sp, #104] // entry-ftrace.S:56
:
: 60 /* Create a frame record for the callsite above the ftrace regs */
: 61 stp x29, x9, [sp, #FREGS_SIZE + 16]
16.67 : ffff80000802e0f4: stp x29, x9, [sp, #128] // entry-ftrace.S:59
: 60 add x29, sp, #FREGS_SIZE + 16
0.00 : ffff80000802e0f8: add x29, sp, #0x80
:
: 64 /* Create our frame record above the ftrace regs */
: 65 stp x29, x30, [sp, #FREGS_SIZE]
16.67 : ffff80000802e0fc: stp x29, x30, [sp, #112] // entry-ftrace.S:63
: 64 add x29, sp, #FREGS_SIZE
0.00 : ffff80000802e100: add x29, sp, #0x70
:
: 67 sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
0.00 : ffff80000802e104: sub x0, x30, #0x4
: 67 mov x1, x9 // parent_ip (callsite's LR)
0.00 : ffff80000802e108: mov x1, x9
: 68 ldr_l x2, function_trace_op // op
0.00 : ffff80000802e10c: adrp x2, ffff800009638000 <folio_wait_table+0x14c0>
0.00 : ffff80000802e110: ldr x2, [x2, #3320]
: 69 mov x3, sp // regs
0.00 : ffff80000802e114: mov x3, sp
:
: 72 ffff80000802e118 <ftrace_call>:
:
: 73 SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
: 74 bl ftrace_stub
0.00 : ffff80000802e118: bl ffff80000802e144 <ftrace_stub>
: 80 * At the callsite x0-x8 and x19-x30 were live. Any C code will have preserved
: 81 * x19-x29 per the AAPCS, and we created frame records upon entry, so we need
: 82 * to restore x0-x8, x29, and x30.
: 83 */
: 84 /* Restore function arguments */
: 85 ldp x0, x1, [sp, #FREGS_X0]
8.33 : ffff80000802e11c: ldp x0, x1, [sp] // entry-ftrace.S:80
: 81 ldp x2, x3, [sp, #FREGS_X2]
0.00 : ffff80000802e120: ldp x2, x3, [sp, #16]
: 82 ldp x4, x5, [sp, #FREGS_X4]
0.00 : ffff80000802e124: ldp x4, x5, [sp, #32]
: 83 ldp x6, x7, [sp, #FREGS_X6]
0.00 : ffff80000802e128: ldp x6, x7, [sp, #48]
: 84 ldr x8, [sp, #FREGS_X8]
0.00 : ffff80000802e12c: ldr x8, [sp, #64]
:
: 88 /* Restore the callsite's FP, LR, PC */
: 89 ldr x29, [sp, #FREGS_FP]
0.00 : ffff80000802e130: ldr x29, [sp, #80]
: 88 ldr x30, [sp, #FREGS_LR]
0.00 : ffff80000802e134: ldr x30, [sp, #88]
: 89 ldr x9, [sp, #FREGS_PC]
0.00 : ffff80000802e138: ldr x9, [sp, #104]
:
: 93 /* Restore the callsite's SP */
: 94 add sp, sp, #FREGS_SIZE + 32
0.00 : ffff80000802e13c: add sp, sp, #0x90
:
: 95 ret x9
0.00 : ffff80000802e140: ret x9
2.2 arch_ftrace_ops_list_func
: 7554 void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
: 7555 struct ftrace_ops *op, struct ftrace_regs *fregs)
: 7556 {
0.00 : ffff80000815bdf0: paciasp
4.65 : ffff80000815bdf4: stp x29, x30, [sp, #-144]! // ftrace.c:7551
0.00 : ffff80000815bdf8: mrs x2, sp_el0
0.00 : ffff80000815bdfc: mov x29, sp
2.32 : ffff80000815be00: stp x19, x20, [sp, #16]
0.00 : ffff80000815be04: mov x20, x1
: 7563 trace_test_and_set_recursion():
: 147 int start)
: 148 {
: 149 unsigned int val = READ_ONCE(current->trace_recursion);
: 150 int bit;
:
: 152 bit = trace_get_context_bit() + start;
0.00 : ffff80000815be08: mov w5, #0x8 // #8
: 154 arch_ftrace_ops_list_func():
0.00 : ffff80000815be0c: stp x21, x22, [sp, #32]
0.00 : ffff80000815be10: mov x21, x3
2.32 : ffff80000815be14: stp x23, x24, [sp, #48]
0.00 : ffff80000815be18: mov x23, x0
0.00 : ffff80000815be1c: ldr x4, [x2, #1168]
2.32 : ffff80000815be20: str x4, [sp, #136]
0.00 : ffff80000815be24: mov x4, #0x0 // #0
: 7558 trace_test_and_set_recursion():
: 148 if (unlikely(val & (1 << bit))) {
0.00 : ffff80000815be28: mov w2, #0x1 // #1
: 150 get_current():
: 19 */
: 20 static __always_inline struct task_struct *get_current(void)
: 21 {
: 22 unsigned long sp_el0;
:
: 24 asm ("mrs %0, sp_el0" : "=r" (sp_el0));
0.00 : ffff80000815be2c: mrs x4, sp_el0
: 26 trace_test_and_set_recursion():
: 144 unsigned int val = READ_ONCE(current->trace_recursion);
0.00 : ffff80000815be30: ldr x7, [x4, #2520]
: 146 preempt_count():
: 13 #define PREEMPT_NEED_RESCHED BIT(32)
: 14 #define PREEMPT_ENABLED (PREEMPT_NEED_RESCHED)
:
: 16 static inline int preempt_count(void)
: 17 {
: 18 return READ_ONCE(current_thread_info()->preempt.count);
0.00 : ffff80000815be34: ldr w6, [x4, #8]
: 20 interrupt_context_level():
: 94 static __always_inline unsigned char interrupt_context_level(void)
: 95 {
: 96 unsigned long pc = preempt_count();
: 97 unsigned char level = 0;
:
: 99 level += !!(pc & (NMI_MASK));
0.00 : ffff80000815be38: tst w6, #0xf00000
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
: 97 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff80000815be3c: and w1, w6, #0xffff00
: 94 level += !!(pc & (NMI_MASK));
0.00 : ffff80000815be40: cset w4, ne // ne = any
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff80000815be44: and w1, w1, #0xffff01ff
: 95 level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
0.00 : ffff80000815be48: tst w6, #0xff0000
0.00 : ffff80000815be4c: cinc w4, w4, ne // ne = any
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff80000815be50: cmp w1, #0x0
: 98 trace_get_context_bit():
: 121 return TRACE_CTX_NORMAL - bit;
0.00 : ffff80000815be54: cinc w4, w4, ne // ne = any
: 123 trace_test_and_set_recursion():
: 147 bit = trace_get_context_bit() + start;
0.00 : ffff80000815be58: sub w5, w5, w4
: 148 if (unlikely(val & (1 << bit))) {
0.00 : ffff80000815be5c: lsl w2, w2, w5
0.00 : ffff80000815be60: tst w2, w7
0.00 : ffff80000815be64: b.ne ffff80000815bf84 <arch_ftrace_ops_list_func+0x194> // b.any
: 152 trace_clear_recursion():
: 180 */
: 181 static __always_inline void trace_clear_recursion(int bit)
: 182 {
: 183 preempt_enable_notrace();
: 184 barrier();
: 185 trace_recursion_clear(bit);
4.65 : ffff80000815be68: mvn w22, w2 // trace_recursion.h:180
0.00 : ffff80000815be6c: str x25, [sp, #64]
0.00 : ffff80000815be70: sxtw x22, w22
: 189 trace_test_and_set_recursion():
: 165 current->trace_recursion = val;
0.00 : ffff80000815be74: orr w2, w2, w7
: 167 get_current():
0.00 : ffff80000815be78: mrs x4, sp_el0
: 20 trace_test_and_set_recursion():
2.32 : ffff80000815be7c: str x2, [x4, #2520] // trace_recursion.h:165
: 166 __preempt_count_add():
: 47 return !current_thread_info()->preempt.need_resched;
: 48 }
:
: 50 static inline void __preempt_count_add(int val)
: 51 {
: 52 u32 pc = READ_ONCE(current_thread_info()->preempt.count);
0.00 : ffff80000815be80: ldr w1, [x4, #8]
: 48 pc += val;
0.00 : ffff80000815be84: add w1, w1, #0x1
: 49 WRITE_ONCE(current_thread_info()->preempt.count, pc);
2.32 : ffff80000815be88: str w1, [x4, #8] // preempt.h:49
: 51 __ftrace_ops_list_func():
: 7506 do_for_each_ftrace_op(op, ftrace_ops_list) {
0.00 : ffff80000815be8c: adrp x0, ffff800009638000 <folio_wait_table+0x14c0>
0.00 : ffff80000815be90: add x25, x0, #0xc28
: 7527 } while_for_each_ftrace_op(op);
0.00 : ffff80000815be94: add x24, x25, #0x8
: 7506 do_for_each_ftrace_op(op, ftrace_ops_list) {
0.00 : ffff80000815be98: ldr x19, [x0, #3112]
: 7508 if (op->flags & FTRACE_OPS_FL_STUB)
4.72 : ffff80000815be9c: ldr x0, [x19, #16] // ftrace.c:7508
0.00 : ffff80000815bea0: tbnz w0, #5, ffff80000815bef8 <arch_ftrace_ops_list_func+0x108>
: 7519 if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
2.32 : ffff80000815bea4: tbnz w0, #14, ffff80000815bf74 <arch_ftrace_ops_list_func+0x184> // ftrace.c:7519
: 7521 ftrace_ops_test():
: 1486 rcu_assign_pointer(hash.filter_hash, ops->func_hash->filter_hash);
2.32 : ffff80000815bea8: ldr x0, [x19, #88] // ftrace.c:1486
0.00 : ffff80000815beac: add x1, sp, #0x60
0.00 : ffff80000815beb0: ldr x0, [x0, #8]
0.00 : ffff80000815beb4: stlr x0, [x1]
: 1487 rcu_assign_pointer(hash.notrace_hash, ops->func_hash->notrace_hash);
0.00 : ffff80000815beb8: ldr x0, [x19, #88]
0.00 : ffff80000815bebc: add x1, sp, #0x58
0.00 : ffff80000815bec0: ldr x0, [x0]
2.32 : ffff80000815bec4: stlr x0, [x1] // ftrace.c:1487
: 1489 if (hash_contains_ip(ip, &hash))
44.15 : ffff80000815bec8: ldp x1, x2, [sp, #88] // ftrace.c:1489
0.00 : ffff80000815becc: mov x0, x23
0.00 : ffff80000815bed0: bl ffff80000815b530 <hash_contains_ip.isra.0>
0.00 : ffff80000815bed4: tst w0, #0xff
0.00 : ffff80000815bed8: b.eq ffff80000815bef8 <arch_ftrace_ops_list_func+0x108> // b.none
: 1495 __ftrace_ops_list_func():
: 7521 if (FTRACE_WARN_ON(!op->func)) {
0.00 : ffff80000815bedc: ldr x4, [x19]
0.00 : ffff80000815bee0: cbz x4, ffff80000815bfa0 <arch_ftrace_ops_list_func+0x1b0>
: 7525 op->func(ip, parent_ip, op, fregs);
0.00 : ffff80000815bee4: mov x3, x21
0.00 : ffff80000815bee8: mov x2, x19
0.00 : ffff80000815beec: mov x1, x20
0.00 : ffff80000815bef0: mov x0, x23
0.00 : ffff80000815bef4: blr x4
: 7527 } while_for_each_ftrace_op(op);
0.00 : ffff80000815bef8: ldr x19, [x19, #8]
0.00 : ffff80000815befc: cmp x19, #0x0
0.00 : ffff80000815bf00: ccmp x19, x24, #0x4, ne // ne = any
0.00 : ffff80000815bf04: b.ne ffff80000815be9c <arch_ftrace_ops_list_func+0xac> // b.any
: 7532 get_current():
0.00 : ffff80000815bf08: mrs x1, sp_el0
: 20 __preempt_count_dec_and_test():
: 62 }
:
: 64 static inline bool __preempt_count_dec_and_test(void)
: 65 {
: 66 struct thread_info *ti = current_thread_info();
: 67 u64 pc = READ_ONCE(ti->preempt_count);
0.00 : ffff80000815bf0c: ldr x0, [x1, #8]
:
: 66 /* Update only the count field, leaving need_resched unchanged */
: 67 WRITE_ONCE(ti->preempt.count, --pc);
0.00 : ffff80000815bf10: sub x0, x0, #0x1
0.00 : ffff80000815bf14: str w0, [x1, #8]
: 74 * need of a reschedule. Otherwise, we need to reload the
: 75 * preempt_count in case the need_resched flag was cleared by an
: 76 * interrupt occurring between the non-atomic READ_ONCE/WRITE_ONCE
: 77 * pair.
: 78 */
: 79 return !pc || !READ_ONCE(ti->preempt_count);
0.00 : ffff80000815bf18: cbnz x0, ffff80000815bf64 <arch_ftrace_ops_list_func+0x174>
: 81 trace_clear_recursion():
: 178 preempt_enable_notrace();
0.00 : ffff80000815bf1c: bl ffff800008ae88d0 <preempt_schedule_notrace>
: 180 get_current():
2.32 : ffff80000815bf20: mrs x1, sp_el0 // current.h:19
: 20 trace_clear_recursion():
: 180 trace_recursion_clear(bit);
0.00 : ffff80000815bf24: ldr x0, [x1, #2520]
0.00 : ffff80000815bf28: and x0, x0, x22
2.32 : ffff80000815bf2c: str x0, [x1, #2520] // trace_recursion.h:180
: 184 arch_ftrace_ops_list_func():
: 7553 __ftrace_ops_list_func(ip, parent_ip, NULL, fregs);
: 7554 }
0.00 : ffff80000815bf30: ldr x25, [sp, #64]
0.00 : ffff80000815bf34: mrs x0, sp_el0
2.32 : ffff80000815bf38: ldr x2, [sp, #136] // ftrace.c:7553
0.00 : ffff80000815bf3c: ldr x1, [x0, #1168]
0.00 : ffff80000815bf40: subs x2, x2, x1
0.00 : ffff80000815bf44: mov x1, #0x0 // #0
0.00 : ffff80000815bf48: b.ne ffff80000815bf98 <arch_ftrace_ops_list_func+0x1a8> // b.any
2.32 : ffff80000815bf4c: ldp x19, x20, [sp, #16]
0.00 : ffff80000815bf50: ldp x21, x22, [sp, #32]
2.32 : ffff80000815bf54: ldp x23, x24, [sp, #48]
0.00 : ffff80000815bf58: ldp x29, x30, [sp], #144
0.00 : ffff80000815bf5c: autiasp
0.00 : ffff80000815bf60: ret
: 7568 __preempt_count_dec_and_test():
11.62 : ffff80000815bf64: ldr x0, [x1, #8] // preempt.h:74
0.00 : ffff80000815bf68: cbnz x0, ffff80000815bf20 <arch_ftrace_ops_list_func+0x130>
: 76 trace_clear_recursion():
: 178 preempt_enable_notrace();
0.00 : ffff80000815bf6c: bl ffff800008ae88d0 <preempt_schedule_notrace>
0.00 : ffff80000815bf70: b ffff80000815bf20 <arch_ftrace_ops_list_func+0x130>
: 181 __ftrace_ops_list_func():
: 7519 if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
0.00 : ffff80000815bf74: bl ffff8000080e5770 <rcu_is_watching>
0.00 : ffff80000815bf78: tst w0, #0xff
0.00 : ffff80000815bf7c: b.ne ffff80000815bea8 <arch_ftrace_ops_list_func+0xb8> // b.any
0.00 : ffff80000815bf80: b ffff80000815bef8 <arch_ftrace_ops_list_func+0x108>
: 7524 trace_test_and_set_recursion():
: 158 if (val & (1 << bit)) {
0.00 : ffff80000815bf84: tbnz w7, #9, ffff80000815bf34 <arch_ftrace_ops_list_func+0x144>
0.00 : ffff80000815bf88: mov x22, #0xfffffffffffffdff // #-513
0.00 : ffff80000815bf8c: mov w2, #0x200 // #512
0.00 : ffff80000815bf90: str x25, [sp, #64]
0.00 : ffff80000815bf94: b ffff80000815be74 <arch_ftrace_ops_list_func+0x84>
0.00 : ffff80000815bf98: str x25, [sp, #64]
: 165 arch_ftrace_ops_list_func():
: 7553 }
0.00 : ffff80000815bf9c: bl ffff800008ae5de0 <__stack_chk_fail>
: 7555 __ftrace_ops_list_func():
: 7521 if (FTRACE_WARN_ON(!op->func)) {
0.00 : ffff80000815bfa0: brk #0x800
: 7523 ftrace_kill():
: 8040 */
: 8041 void ftrace_kill(void)
: 8042 {
: 8043 ftrace_disabled = 1;
: 8044 ftrace_enabled = 0;
: 8045 ftrace_trace_function = ftrace_stub;
0.00 : ffff80000815bfa4: adrp x3, ffff80000802e000 <arch_ftrace_update_code+0x10>
0.00 : ffff80000815bfa8: add x3, x3, #0x144
: 8038 ftrace_disabled = 1;
0.00 : ffff80000815bfac: mov w4, #0x1 // #1
: 8040 __ftrace_ops_list_func():
: 7522 pr_warn("op=%p %pS\n", op, op);
0.00 : ffff80000815bfb0: mov x2, x19
0.00 : ffff80000815bfb4: mov x1, x19
0.00 : ffff80000815bfb8: adrp x0, ffff800008d80000 <kallsyms_token_index+0x17f60>
0.00 : ffff80000815bfbc: add x0, x0, #0x678
: 7527 ftrace_kill():
: 8040 ftrace_trace_function = ftrace_stub;
0.00 : ffff80000815bfc0: str x3, [x25, #192]
: 8039 ftrace_enabled = 0;
0.00 : ffff80000815bfc4: stp w4, wzr, [x25, #200]
: 8041 __ftrace_ops_list_func():
: 7522 pr_warn("op=%p %pS\n", op, op);
0.00 : ffff80000815bfc8: bl ffff800008ad5220 <_printk>
: 7523 goto out;
0.00 : ffff80000815bfcc: b ffff80000815bf08 <arch_ftrace_ops_list_func+0x118>
2.3 fprobe_handler
: 28 static void fprobe_handler(unsigned long ip, unsigned long parent_ip,
: 29 struct ftrace_ops *ops, struct ftrace_regs *fregs)
: 30 {
0.00 : ffff8000081a2020: paciasp
0.00 : ffff8000081a2024: stp x29, x30, [sp, #-64]!
0.00 : ffff8000081a2028: mov x29, sp
0.00 : ffff8000081a202c: stp x19, x20, [sp, #16]
0.00 : ffff8000081a2030: mov x19, x2
0.00 : ffff8000081a2034: stp x21, x22, [sp, #32]
0.00 : ffff8000081a2038: mov x22, x3
0.00 : ffff8000081a203c: str x23, [sp, #48]
0.00 : ffff8000081a2040: mov x23, x0
: 40 fprobe_disabled():
: 49 */
: 50 #define FPROBE_FL_KPROBE_SHARED 2
:
: 52 static inline bool fprobe_disabled(struct fprobe *fp)
: 53 {
: 54 return (fp) ? fp->flags & FPROBE_FL_DISABLED : false;
0.00 : ffff8000081a2044: cbz x2, ffff8000081a2050 <fprobe_handler+0x30>
20.00 : ffff8000081a2048: ldr w0, [x2, #192] // fprobe.h:49
0.00 : ffff8000081a204c: tbnz w0, #0, ffff8000081a2128 <fprobe_handler+0x108>
: 58 get_current():
: 19 */
: 20 static __always_inline struct task_struct *get_current(void)
: 21 {
: 22 unsigned long sp_el0;
:
: 24 asm ("mrs %0, sp_el0" : "=r" (sp_el0));
0.00 : ffff8000081a2050: mrs x0, sp_el0
: 26 trace_test_and_set_recursion():
: 144 * Preemption is promised to be disabled when return bit >= 0.
: 145 */
: 146 static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip,
: 147 int start)
: 148 {
: 149 unsigned int val = READ_ONCE(current->trace_recursion);
10.00 : ffff8000081a2054: ldr x9, [x0, #2520] // trace_recursion.h:144
: 151 trace_get_context_bit():
: 121 return TRACE_CTX_NORMAL - bit;
0.00 : ffff8000081a2058: mov w6, #0x3 // #3
: 123 preempt_count():
: 13 #define PREEMPT_NEED_RESCHED BIT(32)
: 14 #define PREEMPT_ENABLED (PREEMPT_NEED_RESCHED)
:
: 16 static inline int preempt_count(void)
: 17 {
: 18 return READ_ONCE(current_thread_info()->preempt.count);
0.00 : ffff8000081a205c: ldr w8, [x0, #8]
: 20 trace_test_and_set_recursion():
: 148 int bit;
:
: 150 bit = trace_get_context_bit() + start;
: 151 if (unlikely(val & (1 << bit))) {
0.00 : ffff8000081a2060: mov w4, #0x1 // #1
: 153 interrupt_context_level():
: 94 static __always_inline unsigned char interrupt_context_level(void)
: 95 {
: 96 unsigned long pc = preempt_count();
: 97 unsigned char level = 0;
:
: 99 level += !!(pc & (NMI_MASK));
0.00 : ffff8000081a2064: tst w8, #0xf00000
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
: 97 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff8000081a2068: and w7, w8, #0xffff00
: 94 level += !!(pc & (NMI_MASK));
0.00 : ffff8000081a206c: cset w5, ne // ne = any
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff8000081a2070: and w7, w7, #0xffff01ff
: 95 level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
0.00 : ffff8000081a2074: tst w8, #0xff0000
0.00 : ffff8000081a2078: cinc w5, w5, ne // ne = any
: 96 level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
0.00 : ffff8000081a207c: cmp w7, #0x0
: 98 trace_get_context_bit():
: 121 return TRACE_CTX_NORMAL - bit;
0.00 : ffff8000081a2080: cinc w5, w5, ne // ne = any
0.00 : ffff8000081a2084: sub w5, w6, w5
: 124 trace_test_and_set_recursion():
: 148 if (unlikely(val & (1 << bit))) {
0.00 : ffff8000081a2088: lsl w4, w4, w5
: 150 trace_clear_recursion():
: 180 */
: 181 static __always_inline void trace_clear_recursion(int bit)
: 182 {
: 183 preempt_enable_notrace();
: 184 barrier();
: 185 trace_recursion_clear(bit);
10.00 : ffff8000081a208c: mvn w20, w4 // trace_recursion.h:180
0.00 : ffff8000081a2090: sxtw x20, w20
: 188 trace_test_and_set_recursion():
: 148 if (unlikely(val & (1 << bit))) {
0.00 : ffff8000081a2094: tst w4, w9
0.00 : ffff8000081a2098: b.ne ffff8000081a2194 <fprobe_handler+0x174> // b.any
: 165 current->trace_recursion = val;
0.00 : ffff8000081a209c: orr w4, w4, w9
: 167 get_current():
0.00 : ffff8000081a20a0: mrs x5, sp_el0
: 20 trace_test_and_set_recursion():
0.00 : ffff8000081a20a4: str x4, [x5, #2520]
: 166 __preempt_count_add():
: 47 return !current_thread_info()->preempt.need_resched;
: 48 }
:
: 50 static inline void __preempt_count_add(int val)
: 51 {
: 52 u32 pc = READ_ONCE(current_thread_info()->preempt.count);
0.00 : ffff8000081a20a8: ldr w4, [x5, #8]
: 48 pc += val;
0.00 : ffff8000081a20ac: add w4, w4, #0x1
: 49 WRITE_ONCE(current_thread_info()->preempt.count, pc);
0.00 : ffff8000081a20b0: str w4, [x5, #8]
: 51 fprobe_handler():
: 43 if (bit < 0) {
: 44 fp->nmissed++;
: 45 return;
: 46 }
:
: 48 if (fp->exit_handler) {
0.00 : ffff8000081a20b4: ldr x0, [x19, #224]
0.00 : ffff8000081a20b8: cbz x0, ffff8000081a2140 <fprobe_handler+0x120>
: 44 rh = rethook_try_get(fp->rethook);
10.00 : ffff8000081a20bc: ldr x0, [x19, #200] // fprobe.c:44
0.00 : ffff8000081a20c0: bl ffff8000081a2a54 <rethook_try_get>
0.00 : ffff8000081a20c4: mov x21, x0
: 45 if (!rh) {
0.00 : ffff8000081a20c8: cbz x0, ffff8000081a21a4 <fprobe_handler+0x184>
: 50 fp->nmissed++;
: 51 goto out;
: 52 }
: 53 fpr = container_of(rh, struct fprobe_rethook_node, node);
: 54 fpr->entry_ip = ip;
0.00 : ffff8000081a20cc: str x23, [x0, #48]
: 54 private = fpr->private;
: 55 }
:
: 57 if (fp->entry_handler)
0.00 : ffff8000081a20d0: ldr x4, [x19, #216]
0.00 : ffff8000081a20d4: cbz x4, ffff8000081a2180 <fprobe_handler+0x160>
: 55 should_rethook = fp->entry_handler(fp, ip, fregs, fpr->private);
0.00 : ffff8000081a20d8: mov x1, x23
0.00 : ffff8000081a20dc: mov x0, x19
0.00 : ffff8000081a20e0: add x3, x21, #0x38
0.00 : ffff8000081a20e4: mov x2, x22
0.00 : ffff8000081a20e8: blr x4
:
: 59 if (rh) {
: 60 if (should_rethook)
0.00 : ffff8000081a20ec: tst w0, #0xff
0.00 : ffff8000081a20f0: b.ne ffff8000081a2180 <fprobe_handler+0x160> // b.any
: 61 rethook_hook(rh, fregs, true);
: 62 else
: 63 rethook_recycle(rh);
0.00 : ffff8000081a20f4: mov x0, x21
0.00 : ffff8000081a20f8: bl ffff8000081a2bf0 <rethook_recycle>
: 66 get_current():
0.00 : ffff8000081a20fc: mrs x1, sp_el0
: 20 __preempt_count_dec_and_test():
: 62 }
:
: 64 static inline bool __preempt_count_dec_and_test(void)
: 65 {
: 66 struct thread_info *ti = current_thread_info();
: 67 u64 pc = READ_ONCE(ti->preempt_count);
0.00 : ffff8000081a2100: ldr x0, [x1, #8]
:
: 66 /* Update only the count field, leaving need_resched unchanged */
: 67 WRITE_ONCE(ti->preempt.count, --pc);
0.00 : ffff8000081a2104: sub x0, x0, #0x1
0.00 : ffff8000081a2108: str w0, [x1, #8]
: 74 * need of a reschedule. Otherwise, we need to reload the
: 75 * preempt_count in case the need_resched flag was cleared by an
: 76 * interrupt occurring between the non-atomic READ_ONCE/WRITE_ONCE
: 77 * pair.
: 78 */
: 79 return !pc || !READ_ONCE(ti->preempt_count);
0.00 : ffff8000081a210c: cbnz x0, ffff8000081a2170 <fprobe_handler+0x150>
: 81 trace_clear_recursion():
: 178 preempt_enable_notrace();
0.00 : ffff8000081a2110: bl ffff800008ae88d0 <preempt_schedule_notrace>
0.00 : ffff8000081a2114: nop
: 181 get_current():
10.00 : ffff8000081a2118: mrs x1, sp_el0 // current.h:19
: 20 trace_clear_recursion():
: 180 trace_recursion_clear(bit);
0.00 : ffff8000081a211c: ldr x0, [x1, #2520]
0.00 : ffff8000081a2120: and x0, x0, x20
10.00 : ffff8000081a2124: str x0, [x1, #2520] // trace_recursion.h:180
: 184 fprobe_handler():
: 66 }
:
: 68 out:
: 69 ftrace_test_recursion_unlock(bit);
: 70 }
0.00 : ffff8000081a2128: ldp x19, x20, [sp, #16]
0.00 : ffff8000081a212c: ldp x21, x22, [sp, #32]
0.00 : ffff8000081a2130: ldr x23, [sp, #48]
20.00 : ffff8000081a2134: ldp x29, x30, [sp], #64 // fprobe.c:66
0.00 : ffff8000081a2138: autiasp
10.00 : ffff8000081a213c: ret
: 54 if (fp->entry_handler)
0.00 : ffff8000081a2140: ldr x4, [x19, #216]
0.00 : ffff8000081a2144: cbz x4, ffff8000081a215c <fprobe_handler+0x13c>
: 55 should_rethook = fp->entry_handler(fp, ip, fregs, fpr->private);
0.00 : ffff8000081a2148: mov x2, x22
0.00 : ffff8000081a214c: mov x1, x23
0.00 : ffff8000081a2150: mov x0, x19
0.00 : ffff8000081a2154: mov x3, #0x38 // #56
0.00 : ffff8000081a2158: blr x4
: 61 get_current():
0.00 : ffff8000081a215c: mrs x1, sp_el0
: 20 __preempt_count_dec_and_test():
: 62 u64 pc = READ_ONCE(ti->preempt_count);
0.00 : ffff8000081a2160: ldr x0, [x1, #8]
: 65 WRITE_ONCE(ti->preempt.count, --pc);
0.00 : ffff8000081a2164: sub x0, x0, #0x1
0.00 : ffff8000081a2168: str w0, [x1, #8]
: 74 return !pc || !READ_ONCE(ti->preempt_count);
0.00 : ffff8000081a216c: cbz x0, ffff8000081a2110 <fprobe_handler+0xf0>
0.00 : ffff8000081a2170: ldr x0, [x1, #8]
0.00 : ffff8000081a2174: cbnz x0, ffff8000081a2118 <fprobe_handler+0xf8>
: 78 trace_clear_recursion():
: 178 preempt_enable_notrace();
0.00 : ffff8000081a2178: bl ffff800008ae88d0 <preempt_schedule_notrace>
0.00 : ffff8000081a217c: b ffff8000081a2118 <fprobe_handler+0xf8>
: 181 fprobe_handler():
: 59 rethook_hook(rh, fregs, true);
0.00 : ffff8000081a2180: mov x1, x22
0.00 : ffff8000081a2184: mov x0, x21
0.00 : ffff8000081a2188: mov w2, #0x1 // #1
0.00 : ffff8000081a218c: bl ffff8000081a27d0 <rethook_hook>
0.00 : ffff8000081a2190: b ffff8000081a215c <fprobe_handler+0x13c>
: 65 trace_test_and_set_recursion():
: 158 if (val & (1 << bit)) {
0.00 : ffff8000081a2194: tbnz w9, #4, ffff8000081a21b4 <fprobe_handler+0x194>
0.00 : ffff8000081a2198: mov x20, #0xffffffffffffffef // #-17
0.00 : ffff8000081a219c: mov w4, #0x10 // #16
0.00 : ffff8000081a21a0: b ffff8000081a209c <fprobe_handler+0x7c>
: 163 fprobe_handler():
: 46 fp->nmissed++;
0.00 : ffff8000081a21a4: ldr x0, [x19, #184]
0.00 : ffff8000081a21a8: add x0, x0, #0x1
0.00 : ffff8000081a21ac: str x0, [x19, #184]
: 47 goto out;
0.00 : ffff8000081a21b0: b ffff8000081a215c <fprobe_handler+0x13c>
: 39 fp->nmissed++;
0.00 : ffff8000081a21b4: ldr x0, [x19, #184]
0.00 : ffff8000081a21b8: add x0, x0, #0x1
0.00 : ffff8000081a21bc: str x0, [x19, #184]
: 40 return;
0.00 : ffff8000081a21c0: b ffff8000081a2128 <fprobe_handler+0x108>
2.4 bpf_fprobe_entry
: 5 ffff8000081e19f0 <bpf_fprobe_entry>:
: 6 bpf_fprobe_entry():
: 1057 flags = u64_stats_update_begin_irqsave(&stats->syncp);
: 1058 u64_stats_inc(&stats->cnt);
: 1059 u64_stats_add(&stats->nsecs, sched_clock() - start);
: 1060 u64_stats_update_end_irqrestore(&stats->syncp, flags);
: 1061 }
: 1062 }
0.00 : ffff8000081e19f0: bti c
0.00 : ffff8000081e19f4: nop
0.00 : ffff8000081e19f8: nop
: 165 {
0.00 : ffff8000081e19fc: paciasp
0.00 : ffff8000081e1a00: stp x29, x30, [sp, #-80]!
0.00 : ffff8000081e1a04: mov w4, #0x0 // #0
0.00 : ffff8000081e1a08: mov x29, sp
0.00 : ffff8000081e1a0c: stp x19, x20, [sp, #16]
0.00 : ffff8000081e1a10: mov x19, x3
0.00 : ffff8000081e1a14: stp x21, x22, [sp, #32]
0.00 : ffff8000081e1a18: mov x22, x0
0.00 : ffff8000081e1a1c: mov x21, x2
0.00 : ffff8000081e1a20: stp x23, x24, [sp, #48]
0.00 : ffff8000081e1a24: str x25, [sp, #64]
: 167 struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
0.00 : ffff8000081e1a28: ldr x24, [x0, #24]
: 168 struct bpf_tramp_links *links = fprobe_ctx->links;
0.00 : ffff8000081e1a2c: ldr x23, [x24]
: 174 memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
0.00 : ffff8000081e1a30: stp xzr, xzr, [x3]
: 175 call_ctx->ip = ip;
0.00 : ffff8000081e1a34: str x1, [x3, #16]
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a38: ldr w0, [x24, #8]
0.00 : ffff8000081e1a3c: cmp w0, #0x0
0.00 : ffff8000081e1a40: b.gt ffff8000081e1a64 <bpf_fprobe_entry+0x74>
0.00 : ffff8000081e1a44: b ffff8000081e1a90 <bpf_fprobe_entry+0xa0>
: 177 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
0.00 : ffff8000081e1a48: ldr x0, [x21, x1, lsl #3]
0.00 : ffff8000081e1a4c: add x1, x19, x1, lsl #3
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a50: add w4, w4, #0x1
: 177 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
16.67 : ffff8000081e1a54: str x0, [x1, #24] // trampoline.c:177
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a58: ldr w0, [x24, #8]
0.00 : ffff8000081e1a5c: cmp w0, w4
0.00 : ffff8000081e1a60: b.le ffff8000081e1a90 <bpf_fprobe_entry+0xa0>
: 177 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
8.33 : ffff8000081e1a64: sxtw x1, w4
0.00 : ffff8000081e1a68: mov x0, #0x0 // #0
0.00 : ffff8000081e1a6c: cmp w4, #0x7
0.00 : ffff8000081e1a70: b.le ffff8000081e1a48 <bpf_fprobe_entry+0x58>
0.00 : ffff8000081e1a74: sxtw x1, w4
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a78: add w4, w4, #0x1
: 177 call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
0.00 : ffff8000081e1a7c: add x1, x19, x1, lsl #3
0.00 : ffff8000081e1a80: str x0, [x1, #24]
: 176 for (i = 0; i < fprobe_ctx->nr_args; i++)
0.00 : ffff8000081e1a84: ldr w0, [x24, #8]
0.00 : ffff8000081e1a88: cmp w0, w4
0.00 : ffff8000081e1a8c: b.gt ffff8000081e1a64 <bpf_fprobe_entry+0x74>
: 179 for (i = 0; i < fentry->nr_links; i++)
0.00 : ffff8000081e1a90: ldr w1, [x23, #304]
: 185 call_ctx->args);
0.00 : ffff8000081e1a94: add x25, x19, #0x18
0.00 : ffff8000081e1a98: mov x20, #0x0 // #0
: 179 for (i = 0; i < fentry->nr_links; i++)
0.00 : ffff8000081e1a9c: cmp w1, #0x0
0.00 : ffff8000081e1aa0: b.le ffff8000081e1ad4 <bpf_fprobe_entry+0xe4>
0.00 : ffff8000081e1aa4: nop
: 180 call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
0.00 : ffff8000081e1aa8: ldr x1, [x23, x20, lsl #3]
0.00 : ffff8000081e1aac: mov x3, x25
0.00 : ffff8000081e1ab0: mov x2, x19
: 179 for (i = 0; i < fentry->nr_links; i++)
0.00 : ffff8000081e1ab4: add x20, x20, #0x1
: 180 call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
16.67 : ffff8000081e1ab8: ldr x0, [x1, #24] // trampoline.c:180
0.00 : ffff8000081e1abc: ldr x1, [x1, #80]
0.00 : ffff8000081e1ac0: bl ffff8000081e1800 <call_bpf_prog.isra.0>
: 179 for (i = 0; i < fentry->nr_links; i++)
0.00 : ffff8000081e1ac4: ldr w0, [x23, #304]
0.00 : ffff8000081e1ac8: cmp w0, w20
0.00 : ffff8000081e1acc: b.gt ffff8000081e1aa8 <bpf_fprobe_entry+0xb8>
0.00 : ffff8000081e1ad0: ldr w0, [x24, #8]
: 182 call_ctx->args[fprobe_ctx->nr_args] = 0;
0.00 : ffff8000081e1ad4: add x0, x19, w0, sxtw #3
: 183 for (i = 0; i < fmod_ret->nr_links; i++) {
0.00 : ffff8000081e1ad8: add x25, x23, #0x270
: 185 call_ctx->args);
0.00 : ffff8000081e1adc: add x24, x19, #0x18
0.00 : ffff8000081e1ae0: mov x20, #0x0 // #0
: 182 call_ctx->args[fprobe_ctx->nr_args] = 0;
25.00 : ffff8000081e1ae4: str xzr, [x0, #24] // trampoline.c:182
: 183 for (i = 0; i < fmod_ret->nr_links; i++) {
0.00 : ffff8000081e1ae8: ldr w0, [x25, #304]
0.00 : ffff8000081e1aec: cmp w0, #0x0
0.00 : ffff8000081e1af0: b.gt ffff8000081e1b04 <bpf_fprobe_entry+0x114>
16.67 : ffff8000081e1af4: b ffff8000081e1ba8 <bpf_fprobe_entry+0x1b8> // trampoline.c:183
0.00 : ffff8000081e1af8: ldr w0, [x25, #304]
0.00 : ffff8000081e1afc: cmp w0, w20
0.00 : ffff8000081e1b00: b.le ffff8000081e1ba8 <bpf_fprobe_entry+0x1b8>
: 184 ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
0.00 : ffff8000081e1b04: ldr x1, [x25, x20, lsl #3]
0.00 : ffff8000081e1b08: mov x3, x24
0.00 : ffff8000081e1b0c: mov x2, x19
: 183 for (i = 0; i < fmod_ret->nr_links; i++) {
0.00 : ffff8000081e1b10: add x20, x20, #0x1
: 184 ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
0.00 : ffff8000081e1b14: ldr x0, [x1, #24]
0.00 : ffff8000081e1b18: ldr x1, [x1, #80]
0.00 : ffff8000081e1b1c: bl ffff8000081e1800 <call_bpf_prog.isra.0>
: 187 if (ret) {
0.00 : ffff8000081e1b20: cbz w0, ffff8000081e1af8 <bpf_fprobe_entry+0x108>
: 189 ftrace_override_function_with_return(regs);
0.00 : ffff8000081e1b24: ldr x2, [x21, #88]
: 188 ftrace_regs_set_return_value(regs, ret);
0.00 : ffff8000081e1b28: sxtw x1, w0
0.00 : ffff8000081e1b2c: str x1, [x21]
: 191 bpf_fprobe_exit():
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b30: mov x20, #0x0 // #0
: 162 bpf_fprobe_entry():
: 189 ftrace_override_function_with_return(regs);
0.00 : ffff8000081e1b34: str x2, [x21, #104]
: 191 bpf_fprobe_exit():
: 153 struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
0.00 : ffff8000081e1b38: ldr x2, [x22, #24]
: 158 call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
0.00 : ffff8000081e1b3c: ldrsw x0, [x2, #8]
: 154 struct bpf_tramp_links *links = fprobe_ctx->links;
0.00 : ffff8000081e1b40: ldr x21, [x2]
: 158 call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
0.00 : ffff8000081e1b44: add x0, x19, x0, lsl #3
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b48: add x21, x21, #0x138
: 158 call_ctx->args[fprobe_ctx->nr_args] = ftrace_regs_return_value(regs);
0.00 : ffff8000081e1b4c: str x1, [x0, #24]
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b50: ldr w0, [x21, #304]
0.00 : ffff8000081e1b54: cmp w0, #0x0
0.00 : ffff8000081e1b58: b.le ffff8000081e1b88 <bpf_fprobe_entry+0x198>
0.00 : ffff8000081e1b5c: nop
: 161 call_bpf_prog(fexit->links[i], &call_ctx->ctx, call_ctx->args);
0.00 : ffff8000081e1b60: ldr x1, [x21, x20, lsl #3]
0.00 : ffff8000081e1b64: mov x3, x24
0.00 : ffff8000081e1b68: mov x2, x19
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b6c: add x20, x20, #0x1
: 161 call_bpf_prog(fexit->links[i], &call_ctx->ctx, call_ctx->args);
0.00 : ffff8000081e1b70: ldr x0, [x1, #24]
0.00 : ffff8000081e1b74: ldr x1, [x1, #80]
0.00 : ffff8000081e1b78: bl ffff8000081e1800 <call_bpf_prog.isra.0>
: 160 for (i = 0; i < fexit->nr_links; i++)
0.00 : ffff8000081e1b7c: ldr w0, [x21, #304]
0.00 : ffff8000081e1b80: cmp w0, w20
0.00 : ffff8000081e1b84: b.gt ffff8000081e1b60 <bpf_fprobe_entry+0x170>
: 164 bpf_fprobe_entry():
: 192 return false;
0.00 : ffff8000081e1b88: mov w0, #0x0 // #0
: 197 }
0.00 : ffff8000081e1b8c: ldp x19, x20, [sp, #16]
0.00 : ffff8000081e1b90: ldp x21, x22, [sp, #32]
0.00 : ffff8000081e1b94: ldp x23, x24, [sp, #48]
0.00 : ffff8000081e1b98: ldr x25, [sp, #64]
0.00 : ffff8000081e1b9c: ldp x29, x30, [sp], #80
0.00 : ffff8000081e1ba0: autiasp
0.00 : ffff8000081e1ba4: ret
: 196 return fexit->nr_links;
0.00 : ffff8000081e1ba8: ldr w0, [x23, #616]
: 197 }
0.00 : ffff8000081e1bac: ldp x19, x20, [sp, #16]
: 196 return fexit->nr_links;
0.00 : ffff8000081e1bb0: cmp w0, #0x0
0.00 : ffff8000081e1bb4: cset w0, ne // ne = any
: 197 }
0.00 : ffff8000081e1bb8: ldp x21, x22, [sp, #32]
0.00 : ffff8000081e1bbc: ldp x23, x24, [sp, #48]
0.00 : ffff8000081e1bc0: ldr x25, [sp, #64]
0.00 : ffff8000081e1bc4: ldp x29, x30, [sp], #80
0.00 : ffff8000081e1bc8: autiasp
16.67 : ffff8000081e1bcc: ret // trampoline.c:197
2.5 call_bpf_prog
: 5 ffff8000081e1800 <call_bpf_prog.isra.0>:
: 6 call_bpf_prog.isra.0():
:
: 200 if (oldp)
: 201 *oldp = old;
:
: 203 if (unlikely(!old))
: 204 refcount_warn_saturate(r, REFCOUNT_ADD_UAF);
13.33 : ffff8000081e1800: nop // refcount.h:199
0.00 : ffff8000081e1804: nop
: 207 call_bpf_prog():
:
: 108 mutex_unlock(&tr->mutex);
: 109 return ret;
: 110 }
: 111 #else
: 112 static unsigned int call_bpf_prog(struct bpf_tramp_link *l,
0.00 : ffff8000081e1808: paciasp
0.00 : ffff8000081e180c: stp x29, x30, [sp, #-64]!
0.00 : ffff8000081e1810: mov x29, sp
0.00 : ffff8000081e1814: stp x19, x20, [sp, #16]
0.00 : ffff8000081e1818: mov x19, x0
0.00 : ffff8000081e181c: mov x20, x2
0.00 : ffff8000081e1820: stp x21, x22, [sp, #32]
6.67 : ffff8000081e1824: stp x23, x24, [sp, #48] // trampoline.c:107
0.00 : ffff8000081e1828: mov x24, x3
: 118 struct bpf_tramp_run_ctx *run_ctx) = __bpf_prog_exit;
: 119 struct bpf_prog *p = l->link.prog;
: 120 unsigned int ret;
: 121 u64 start_time;
:
: 123 if (p->aux->sleepable) {
60.00 : ffff8000081e182c: ldr x0, [x0, #56] // trampoline.c:118
13.33 : ffff8000081e1830: ldrb w0, [x0, #140]
0.00 : ffff8000081e1834: cbnz w0, ffff8000081e1858 <call_bpf_prog.isra.0+0x58>
: 121 enter = __bpf_prog_enter_sleepable;
: 122 exit = __bpf_prog_exit_sleepable;
: 123 } else if (p->expected_attach_type == BPF_LSM_CGROUP) {
0.00 : ffff8000081e1838: ldr w0, [x19, #8]
0.00 : ffff8000081e183c: cmp w0, #0x2b
0.00 : ffff8000081e1840: b.eq ffff8000081e18c4 <call_bpf_prog.isra.0+0xc4> // b.none
: 112 void (*exit)(struct bpf_prog *prog, u64 start,
0.00 : ffff8000081e1844: adrp x22, ffff8000081e1000 <print_bpf_insn+0x580>
: 110 u64 (*enter)(struct bpf_prog *prog,
0.00 : ffff8000081e1848: adrp x2, ffff8000081e1000 <print_bpf_insn+0x580>
: 112 void (*exit)(struct bpf_prog *prog, u64 start,
0.00 : ffff8000081e184c: add x22, x22, #0xbd0
: 110 u64 (*enter)(struct bpf_prog *prog,
0.00 : ffff8000081e1850: add x2, x2, #0xd20
0.00 : ffff8000081e1854: b ffff8000081e1868 <call_bpf_prog.isra.0+0x68>
: 120 exit = __bpf_prog_exit_sleepable;
0.00 : ffff8000081e1858: adrp x22, ffff8000081e1000 <print_bpf_insn+0x580>
: 119 enter = __bpf_prog_enter_sleepable;
0.00 : ffff8000081e185c: adrp x2, ffff8000081e1000 <print_bpf_insn+0x580>
: 120 exit = __bpf_prog_exit_sleepable;
0.00 : ffff8000081e1860: add x22, x22, #0xc60
: 119 enter = __bpf_prog_enter_sleepable;
0.00 : ffff8000081e1864: add x2, x2, #0xe10
: 126 enter = __bpf_prog_enter_lsm_cgroup;
: 127 exit = __bpf_prog_exit_lsm_cgroup;
: 128 }
:
: 130 ctx->bpf_cookie = l->cookie;
0.00 : ffff8000081e1868: str x1, [x20]
:
: 129 start_time = enter(p, ctx);
0.00 : ffff8000081e186c: mov x0, x19
0.00 : ffff8000081e1870: mov x1, x20
: 130 if (!start_time)
: 131 return 0;
0.00 : ffff8000081e1874: mov w23, #0x0 // #0
: 128 start_time = enter(p, ctx);
0.00 : ffff8000081e1878: blr x2
0.00 : ffff8000081e187c: mov x21, x0
: 129 if (!start_time)
0.00 : ffff8000081e1880: cbz x0, ffff8000081e18a8 <call_bpf_prog.isra.0+0xa8>
:
: 133 ret = p->bpf_func(args, p->insnsi);
0.00 : ffff8000081e1884: ldr x2, [x19, #48]
0.00 : ffff8000081e1888: add x1, x19, #0x48
0.00 : ffff8000081e188c: mov x0, x24
0.00 : ffff8000081e1890: blr x2
0.00 : ffff8000081e1894: mov w23, w0
:
: 135 exit(p, start_time, ctx);
0.00 : ffff8000081e1898: mov x2, x20
0.00 : ffff8000081e189c: mov x1, x21
0.00 : ffff8000081e18a0: mov x0, x19
0.00 : ffff8000081e18a4: blr x22
:
: 138 return ret;
: 139 }
0.00 : ffff8000081e18a8: mov w0, w23
0.00 : ffff8000081e18ac: ldp x19, x20, [sp, #16]
0.00 : ffff8000081e18b0: ldp x21, x22, [sp, #32]
0.00 : ffff8000081e18b4: ldp x23, x24, [sp, #48]
6.67 : ffff8000081e18b8: ldp x29, x30, [sp], #64 // trampoline.c:137
0.00 : ffff8000081e18bc: autiasp
0.00 : ffff8000081e18c0: ret
: 123 exit = __bpf_prog_exit_lsm_cgroup;
0.00 : ffff8000081e18c4: adrp x22, ffff8000081e1000 <print_bpf_insn+0x580>
: 122 enter = __bpf_prog_enter_lsm_cgroup;
0.00 : ffff8000081e18c8: adrp x2, ffff8000081e1000 <print_bpf_insn+0x580>
: 123 exit = __bpf_prog_exit_lsm_cgroup;
0.00 : ffff8000081e18cc: add x22, x22, #0x200
: 122 enter = __bpf_prog_enter_lsm_cgroup;
0.00 : ffff8000081e18d0: add x2, x2, #0x1c0
0.00 : ffff8000081e18d4: b ffff8000081e1868 <call_bpf_prog.isra.0+0x68>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-06 16:29 ` Steven Rostedt
@ 2022-10-17 17:55 ` Florent Revest
-1 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-17 17:55 UTC (permalink / raw)
To: Steven Rostedt
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Thu, Oct 6, 2022 at 6:29 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 6 Oct 2022 18:19:12 +0200
> Florent Revest <revest@chromium.org> wrote:
>
> > Sure, we can give this a try, I'll work on a macro that generates the
> > 7 callbacks and we can check how much that helps. My belief right now
> > is that ftrace's iteration over all ops on arm64 is where we lose most
> > time but now that we have numbers it's pretty easy to check hypothesis
> > :)
>
> Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
> fixed first. I forget that arm64 doesn't have the dedicated trampolines yet.
>
> So, let's hold off until that is complete.
>
> -- Steve
Mark finished an implementation of his per-callsite-ops and min-args
branches (meaning that we can now skip the expensive ftrace's saving
of all registers and iteration over all ops if only one is attached)
- https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
And Masami wrote similar patches to what I had originally done to
fprobe in my branch:
- https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
So I could rebase my previous "bpf on fprobe" branch on top of these:
(as before, it's just good enough for benchmarking and to give a
general sense of the idea, not for a thorough code review):
- https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
And I could run the benchmarks against my rpi4. I have different
baseline numbers than Xu, so I ran everything again and tried to keep
the format the same. "indirect call" refers to the branch I just
linked and "direct call" refers to the series this is a reply to
(Xu's work).
1. test with dd
1.1 when no bpf prog attached to vfs_write
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 3.94315 s, 130 MB/s
1.2 attach bpf prog with kprobe, bpftrace -e kprobe:vfs_write {}
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 5.80493 s, 88.2 MB/s
1.3 attach bpf prog with direct call, bpftrace -e kfunc:vfs_write {}
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 4.18579 s, 122 MB/s
1.4 attach bpf prog with indirect call, bpftrace -e kfunc:vfs_write {}
# dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 4.92616 s, 104 MB/s
2. test with bpf/bench
2.1 bench trig-base
Iter 0 ( 86.518us): hits 0.700M/s ( 0.700M/prod), drops
0.000M/s, total operations 0.700M/s
Iter 1 (-26.352us): hits 0.701M/s ( 0.701M/prod), drops
0.000M/s, total operations 0.701M/s
Iter 2 ( 1.092us): hits 0.701M/s ( 0.701M/prod), drops
0.000M/s, total operations 0.701M/s
Iter 3 ( -1.890us): hits 0.701M/s ( 0.701M/prod), drops
0.000M/s, total operations 0.701M/s
Iter 4 ( -2.315us): hits 0.701M/s ( 0.701M/prod), drops
0.000M/s, total operations 0.701M/s
Iter 5 ( 4.184us): hits 0.701M/s ( 0.701M/prod), drops
0.000M/s, total operations 0.701M/s
Iter 6 ( -3.241us): hits 0.701M/s ( 0.701M/prod), drops
0.000M/s, total operations 0.701M/s
Summary: hits 0.701 ± 0.000M/s ( 0.701M/prod), drops 0.000 ±
0.000M/s, total operations 0.701 ± 0.000M/s
2.2 bench trig-kprobe
Iter 0 ( 96.833us): hits 0.290M/s ( 0.290M/prod), drops
0.000M/s, total operations 0.290M/s
Iter 1 (-20.834us): hits 0.291M/s ( 0.291M/prod), drops
0.000M/s, total operations 0.291M/s
Iter 2 ( -2.426us): hits 0.291M/s ( 0.291M/prod), drops
0.000M/s, total operations 0.291M/s
Iter 3 ( 22.332us): hits 0.292M/s ( 0.292M/prod), drops
0.000M/s, total operations 0.292M/s
Iter 4 (-18.204us): hits 0.292M/s ( 0.292M/prod), drops
0.000M/s, total operations 0.292M/s
Iter 5 ( 5.370us): hits 0.292M/s ( 0.292M/prod), drops
0.000M/s, total operations 0.292M/s
Iter 6 ( -7.853us): hits 0.290M/s ( 0.290M/prod), drops
0.000M/s, total operations 0.290M/s
Summary: hits 0.291 ± 0.001M/s ( 0.291M/prod), drops 0.000 ±
0.000M/s, total operations 0.291 ± 0.001M/s
2.3 bench trig-fentry, with direct call
Iter 0 ( 86.481us): hits 0.530M/s ( 0.530M/prod), drops
0.000M/s, total operations 0.530M/s
Iter 1 (-12.593us): hits 0.536M/s ( 0.536M/prod), drops
0.000M/s, total operations 0.536M/s
Iter 2 ( -5.760us): hits 0.532M/s ( 0.532M/prod), drops
0.000M/s, total operations 0.532M/s
Iter 3 ( 1.629us): hits 0.532M/s ( 0.532M/prod), drops
0.000M/s, total operations 0.532M/s
Iter 4 ( -1.945us): hits 0.533M/s ( 0.533M/prod), drops
0.000M/s, total operations 0.533M/s
Iter 5 ( -1.297us): hits 0.532M/s ( 0.532M/prod), drops
0.000M/s, total operations 0.532M/s
Iter 6 ( 0.444us): hits 0.535M/s ( 0.535M/prod), drops
0.000M/s, total operations 0.535M/s
Summary: hits 0.533 ± 0.002M/s ( 0.533M/prod), drops 0.000 ±
0.000M/s, total operations 0.533 ± 0.002M/s
2.4 bench trig-fentry, with indirect call
Iter 0 ( 84.463us): hits 0.404M/s ( 0.404M/prod), drops
0.000M/s, total operations 0.404M/s
Iter 1 (-16.260us): hits 0.405M/s ( 0.405M/prod), drops
0.000M/s, total operations 0.405M/s
Iter 2 ( -1.038us): hits 0.405M/s ( 0.405M/prod), drops
0.000M/s, total operations 0.405M/s
Iter 3 ( -3.797us): hits 0.405M/s ( 0.405M/prod), drops
0.000M/s, total operations 0.405M/s
Iter 4 ( -0.537us): hits 0.402M/s ( 0.402M/prod), drops
0.000M/s, total operations 0.402M/s
Iter 5 ( 3.536us): hits 0.403M/s ( 0.403M/prod), drops
0.000M/s, total operations 0.403M/s
Iter 6 ( 12.203us): hits 0.404M/s ( 0.404M/prod), drops
0.000M/s, total operations 0.404M/s
Summary: hits 0.404 ± 0.001M/s ( 0.404M/prod), drops 0.000 ±
0.000M/s, total operations 0.404 ± 0.001M/s
3. perf report of bench trig-fentry
3.1 with direct call
98.67%  0.27%  bench  bench  [.] trigger_producer
        |
        --98.40%--trigger_producer
                  |
                  |--96.63%--syscall
                  |          |
                  |          --71.90%--el0t_64_sync
                  |                    el0t_64_sync_handler
                  |                    el0_svc
                  |                    do_el0_svc
                  |                    |
                  |                    |--70.94%--el0_svc_common
                  |                    |          |
                  |                    |          |--29.55%--invoke_syscall
                  |                    |          |          |
                  |                    |          |          |--26.23%--__arm64_sys_getpgid
                  |                    |          |          |          |
                  |                    |          |          |          |--18.88%--bpf_trampoline_6442462665_0
                  |                    |          |          |          |          |
                  |                    |          |          |          |          |--6.85%--__bpf_prog_enter
                  |                    |          |          |          |          |          |
                  |                    |          |          |          |          |          --2.68%--migrate_disable
                  |                    |          |          |          |          |
                  |                    |          |          |          |          |--5.28%--__bpf_prog_exit
                  |                    |          |          |          |          |          |
                  |                    |          |          |          |          |          --1.29%--migrate_enable
                  |                    |          |          |          |          |
                  |                    |          |          |          |          |--3.96%--bpf_prog_21856463590f61f1_bench_trigger_fentry
                  |                    |          |          |          |          |
                  |                    |          |          |          |          --0.61%--__rcu_read_lock
                  |                    |          |          |          |
                  |                    |          |          |          --4.42%--find_task_by_vpid
                  |                    |          |          |                    |
                  |                    |          |          |                    |--2.53%--radix_tree_lookup
                  |                    |          |          |                    |
                  |                    |          |          |                    --0.61%--idr_find
                  |                    |          |          |
                  |                    |          |          --0.81%--pid_vnr
                  |                    |          |
                  |                    |          --0.53%--__arm64_sys_getpgid
                  |                    |
                  |                    --0.95%--invoke_syscall
                  |
                  --0.99%--syscall@plt
3.2 with indirect call
98.68%  0.20%  bench  bench  [.] trigger_producer
        |
        --98.48%--trigger_producer
                  |
                  --97.47%--syscall
                            |
                            --76.11%--el0t_64_sync
                                      el0t_64_sync_handler
                                      el0_svc
                                      do_el0_svc
                                      |
                                      |--75.52%--el0_svc_common
                                      |          |
                                      |          |--46.35%--invoke_syscall
                                      |          |          |
                                      |          |          --44.06%--__arm64_sys_getpgid
                                      |          |                    |
                                      |          |                    |--35.40%--ftrace_caller
                                      |          |                    |          |
                                      |          |                    |          --34.04%--fprobe_handler
                                      |          |                    |                    |
                                      |          |                    |                    |--15.61%--bpf_fprobe_entry
                                      |          |                    |                    |          |
                                      |          |                    |                    |          |--3.79%--__bpf_prog_enter
                                      |          |                    |                    |          |          |
                                      |          |                    |                    |          |          --0.80%--migrate_disable
                                      |          |                    |                    |          |
                                      |          |                    |                    |          |--3.74%--__bpf_prog_exit
                                      |          |                    |                    |          |          |
                                      |          |                    |                    |          |          --0.77%--migrate_enable
                                      |          |                    |                    |          |
                                      |          |                    |                    |          --2.65%--bpf_prog_21856463590f61f1_bench_trigger_fentry
                                      |          |                    |                    |
                                      |          |                    |                    |--12.65%--rethook_trampoline_handler
                                      |          |                    |                    |
                                      |          |                    |                    |--1.70%--rethook_try_get
                                      |          |                    |                    |          |
                                      |          |                    |                    |          --1.48%--rcu_is_watching
                                      |          |                    |                    |
                                      |          |                    |                    |--1.46%--freelist_try_get
                                      |          |                    |                    |
                                      |          |                    |                    --0.65%--rethook_recycle
                                      |          |                    |
                                      |          |                    --6.36%--find_task_by_vpid
                                      |          |                              |
                                      |          |                              |--3.64%--radix_tree_lookup
                                      |          |                              |
                                      |          |                              --1.74%--idr_find
                                      |          |
                                      |          --1.05%--ftrace_caller
                                      |
                                      --0.59%--invoke_syscall
This looks slightly better than before but it is actually still a
pretty significant performance hit compared to direct calls.
Note that I can't really make sense of the perf report with indirect
calls. It always reports 12% of the time spent in
rethook_trampoline_handler, but I verified with both a WARN in that
function and a breakpoint with a debugger that this function does *not*
get called when running this "bench trig-fentry" benchmark. It also
wouldn't make sense for fprobe_handler to call it, so I'm quite
confused why perf would report this call and so much time spent
there. Anyone know what I could be missing here?
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-17 17:55 ` Florent Revest
@ 2022-10-17 18:49 ` Steven Rostedt
-1 siblings, 0 replies; 60+ messages in thread
From: Steven Rostedt @ 2022-10-17 18:49 UTC (permalink / raw)
To: Florent Revest
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
On Mon, 17 Oct 2022 19:55:06 +0200
Florent Revest <revest@chromium.org> wrote:
> Note that I can't really make sense of the perf report with indirect
> calls. it always reports it spent 12% of the time in
> rethook_trampoline_handler but I verified with both a WARN in that
> function and a breakpoint with a debugger, this function does *not*
> get called when running this "bench trig-fentry" benchmark. Also it
> wouldn't make sense for fprobe_handler to call it so I'm quite
> confused why perf would report this call and such a long time spent
> there. Anyone know what I could be missing here ?
The trace shows __bpf_prog_exit, which I'm guessing is tracing the end of
the function. Right?
In which case I believe it must call rethook_trampoline_handler:
-> fprobe_handler() /* Which could use some "unlikely()" to move disabled
paths out of the hot path */
/* And also calls rethook_try_get() which does a cmpxchg! */
-> rethook_hook()
-> arch_rethook_prepare()
Sets regs->lr = arch_rethook_trampoline
On return of the function, it jumps to arch_rethook_trampoline()
-> arch_rethook_trampoline()
-> arch_rethook_trampoline_callback()
-> rethook_trampoline_handler()
So I do not know how it wouldn't trigger the WARNING or breakpoint if you
added it there.
-- Steve
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-17 18:49 ` Steven Rostedt
@ 2022-10-17 19:10 ` Florent Revest
-1 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-17 19:10 UTC (permalink / raw)
To: Steven Rostedt
Cc: Xu Kuohai, Mark Rutland, Catalin Marinas, Daniel Borkmann,
Xu Kuohai, linux-arm-kernel, linux-kernel, bpf, Will Deacon,
Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
Uhuh, apologies for my perf report formatting! I'll try to figure it
out for next time, meanwhile you can find it better formatted here
https://paste.debian.net/1257405/
On Mon, Oct 17, 2022 at 8:49 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 17 Oct 2022 19:55:06 +0200
> Florent Revest <revest@chromium.org> wrote:
>
> > Note that I can't really make sense of the perf report with indirect
> > calls. it always reports it spent 12% of the time in
> > rethook_trampoline_handler but I verified with both a WARN in that
> > function and a breakpoint with a debugger, this function does *not*
> > get called when running this "bench trig-fentry" benchmark. Also it
> > wouldn't make sense for fprobe_handler to call it so I'm quite
> > confused why perf would report this call and such a long time spent
> > there. Anyone know what I could be missing here ?
>
> The trace shows __bpf_prog_exit, which I'm guessing is tracing the end of
> the function. Right?
Actually no, this function is called to end the context of a BPF
program execution. Here it is called at the end of the fentry program
(so still before the traced function). I hope the pastebin helps
clarify this!
> In which case I believe it must call rethook_trampoline_handler:
>
> -> fprobe_handler() /* Which could use some "unlikely()" to move disabled
> paths out of the hot path */
>
> /* And also calls rethook_try_get () which does a cmpxchg! */
>
> -> ret_hook()
> -> arch_rethook_prepare()
> Sets regs->lr = arch_rethook_trampoline
>
> On return of the function, it jumps to arch_rethook_trampoline()
>
> -> arch_rethook_trampoline()
> -> arch_rethook_trampoline_callback()
> -> rethook_trampoline_handler()
This is indeed what happens when an fexit program is also attached.
But when running "bench trig-fentry", only an fentry program is
attached so bpf_fprobe_entry returns a non-zero value and fprobe
doesn't call rethook_hook.
Also, in this situation arch_rethook_trampoline is called by the
traced function's return but in the perf report, iiuc, it shows up as
being called from fprobe_handler and that should never happen. I
wonder if this is some sort of stack unwinding artifact during the
perf record?
> So I do not know how it wouldn't trigger the WARNING or breakpoint if you
> added it there.
By the way, the WARNING does trigger if I also attach an fexit program
(then rethook_hook is called). But I made sure we skip the whole
rethook logic if no fexit program is attached so bench trig-fentry
should not go through rethook_trampoline_handler.
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-17 17:55 ` Florent Revest
@ 2022-10-21 11:31 ` Masami Hiramatsu
-1 siblings, 0 replies; 60+ messages in thread
From: Masami Hiramatsu @ 2022-10-21 11:31 UTC (permalink / raw)
To: Florent Revest
Cc: Steven Rostedt, Xu Kuohai, Mark Rutland, Catalin Marinas,
Daniel Borkmann, Xu Kuohai, linux-arm-kernel, linux-kernel, bpf,
Will Deacon, Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren, Masami Hiramatsu
Hi Florent,
On Mon, 17 Oct 2022 19:55:06 +0200
Florent Revest <revest@chromium.org> wrote:
> On Thu, Oct 6, 2022 at 6:29 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 6 Oct 2022 18:19:12 +0200
> > Florent Revest <revest@chromium.org> wrote:
> >
> > > Sure, we can give this a try, I'll work on a macro that generates the
> > > 7 callbacks and we can check how much that helps. My belief right now
> > > is that ftrace's iteration over all ops on arm64 is where we lose most
> > > time but now that we have numbers it's pretty easy to check hypothesis
> > > :)
> >
> > Ah, I forgot that's what Mark's code is doing. But yes, that needs to be
> > fixed first. I forget that arm64 doesn't have the dedicated trampolines yet.
> >
> > So, let's hold off until that is complete.
> >
> > -- Steve
>
> Mark finished an implementation of his per-callsite-ops and min-args
> branches (meaning that we can now skip ftrace's expensive saving
> of all registers and iteration over all ops if only one is attached)
> - https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
>
> And Masami wrote similar patches to what I had originally done to
> fprobe in my branch:
> - https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
>
> So I could rebase my previous "bpf on fprobe" branch on top of these:
> (as before, it's just good enough for benchmarking and to give a
> general sense of the idea, not for a thorough code review):
> - https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
>
> And I could run the benchmarks against my rpi4. I have different
> baseline numbers than Xu, so I ran everything again and tried to keep the
> format the same. "indirect call" refers to the branch I just linked and
> "direct call" refers to the series this is a reply to (Xu's work).
Thanks for sharing the measurement results. Yes, the fprobe/rethook
implementation is just a port of the kretprobes implementation, so
it may not be well optimized.
BTW, I remember Wuqiang's patch for kretprobes:
https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/#u
That patch is aimed at fixing scalability, but it may also improve the
performance a bit. It is not hard to port to a recent kernel.
Can you try it too?
Anyway, eventually, I would like to remove the current kretprobe-based
implementation and unify the fexit hook with the function-graph
tracer. That should give much better performance.
Thank you,
>
> 1. test with dd
>
> 1.1 when no bpf prog attached to vfs_write
>
> # dd if=/dev/zero of=/dev/null count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 512000000 bytes (512 MB, 488 MiB) copied, 3.94315 s, 130 MB/s
>
>
> 1.2 attach bpf prog with kprobe, bpftrace -e kprobe:vfs_write {}
>
> # dd if=/dev/zero of=/dev/null count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 512000000 bytes (512 MB, 488 MiB) copied, 5.80493 s, 88.2 MB/s
>
>
> 1.3 attach bpf prog with direct call, bpftrace -e kfunc:vfs_write {}
>
> # dd if=/dev/zero of=/dev/null count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 512000000 bytes (512 MB, 488 MiB) copied, 4.18579 s, 122 MB/s
>
>
> 1.4 attach bpf prog with indirect call, bpftrace -e kfunc:vfs_write {}
>
> # dd if=/dev/zero of=/dev/null count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 512000000 bytes (512 MB, 488 MiB) copied, 4.92616 s, 104 MB/s
>
>
> 2. test with bpf/bench
>
> 2.1 bench trig-base
> Iter 0 ( 86.518us): hits 0.700M/s ( 0.700M/prod), drops
> 0.000M/s, total operations 0.700M/s
> Iter 1 (-26.352us): hits 0.701M/s ( 0.701M/prod), drops
> 0.000M/s, total operations 0.701M/s
> Iter 2 ( 1.092us): hits 0.701M/s ( 0.701M/prod), drops
> 0.000M/s, total operations 0.701M/s
> Iter 3 ( -1.890us): hits 0.701M/s ( 0.701M/prod), drops
> 0.000M/s, total operations 0.701M/s
> Iter 4 ( -2.315us): hits 0.701M/s ( 0.701M/prod), drops
> 0.000M/s, total operations 0.701M/s
> Iter 5 ( 4.184us): hits 0.701M/s ( 0.701M/prod), drops
> 0.000M/s, total operations 0.701M/s
> Iter 6 ( -3.241us): hits 0.701M/s ( 0.701M/prod), drops
> 0.000M/s, total operations 0.701M/s
> Summary: hits 0.701 ± 0.000M/s ( 0.701M/prod), drops 0.000 ±
> 0.000M/s, total operations 0.701 ± 0.000M/s
>
> 2.2 bench trig-kprobe
> Iter 0 ( 96.833us): hits 0.290M/s ( 0.290M/prod), drops
> 0.000M/s, total operations 0.290M/s
> Iter 1 (-20.834us): hits 0.291M/s ( 0.291M/prod), drops
> 0.000M/s, total operations 0.291M/s
> Iter 2 ( -2.426us): hits 0.291M/s ( 0.291M/prod), drops
> 0.000M/s, total operations 0.291M/s
> Iter 3 ( 22.332us): hits 0.292M/s ( 0.292M/prod), drops
> 0.000M/s, total operations 0.292M/s
> Iter 4 (-18.204us): hits 0.292M/s ( 0.292M/prod), drops
> 0.000M/s, total operations 0.292M/s
> Iter 5 ( 5.370us): hits 0.292M/s ( 0.292M/prod), drops
> 0.000M/s, total operations 0.292M/s
> Iter 6 ( -7.853us): hits 0.290M/s ( 0.290M/prod), drops
> 0.000M/s, total operations 0.290M/s
> Summary: hits 0.291 ± 0.001M/s ( 0.291M/prod), drops 0.000 ±
> 0.000M/s, total operations 0.291 ± 0.001M/s
>
> 2.3 bench trig-fentry, with direct call
> Iter 0 ( 86.481us): hits 0.530M/s ( 0.530M/prod), drops
> 0.000M/s, total operations 0.530M/s
> Iter 1 (-12.593us): hits 0.536M/s ( 0.536M/prod), drops
> 0.000M/s, total operations 0.536M/s
> Iter 2 ( -5.760us): hits 0.532M/s ( 0.532M/prod), drops
> 0.000M/s, total operations 0.532M/s
> Iter 3 ( 1.629us): hits 0.532M/s ( 0.532M/prod), drops
> 0.000M/s, total operations 0.532M/s
> Iter 4 ( -1.945us): hits 0.533M/s ( 0.533M/prod), drops
> 0.000M/s, total operations 0.533M/s
> Iter 5 ( -1.297us): hits 0.532M/s ( 0.532M/prod), drops
> 0.000M/s, total operations 0.532M/s
> Iter 6 ( 0.444us): hits 0.535M/s ( 0.535M/prod), drops
> 0.000M/s, total operations 0.535M/s
> Summary: hits 0.533 ± 0.002M/s ( 0.533M/prod), drops 0.000 ±
> 0.000M/s, total operations 0.533 ± 0.002M/s
>
> 2.4 bench trig-fentry, with indirect call
> Iter 0 ( 84.463us): hits 0.404M/s ( 0.404M/prod), drops
> 0.000M/s, total operations 0.404M/s
> Iter 1 (-16.260us): hits 0.405M/s ( 0.405M/prod), drops
> 0.000M/s, total operations 0.405M/s
> Iter 2 ( -1.038us): hits 0.405M/s ( 0.405M/prod), drops
> 0.000M/s, total operations 0.405M/s
> Iter 3 ( -3.797us): hits 0.405M/s ( 0.405M/prod), drops
> 0.000M/s, total operations 0.405M/s
> Iter 4 ( -0.537us): hits 0.402M/s ( 0.402M/prod), drops
> 0.000M/s, total operations 0.402M/s
> Iter 5 ( 3.536us): hits 0.403M/s ( 0.403M/prod), drops
> 0.000M/s, total operations 0.403M/s
> Iter 6 ( 12.203us): hits 0.404M/s ( 0.404M/prod), drops
> 0.000M/s, total operations 0.404M/s
> Summary: hits 0.404 ± 0.001M/s ( 0.404M/prod), drops 0.000 ±
> 0.000M/s, total operations 0.404 ± 0.001M/s
>
>
> 3. perf report of bench trig-fentry
>
> 3.1 with direct call
>
>  98.67%  0.27%  bench  bench  [.] trigger_producer
>          |
>          --98.40%--trigger_producer
>             |
>             |--96.63%--syscall
>             |   |
>             |    --71.90%--el0t_64_sync
>             |       el0t_64_sync_handler
>             |       el0_svc
>             |       do_el0_svc
>             |       |
>             |       |--70.94%--el0_svc_common
>             |       |   |
>             |       |   |--29.55%--invoke_syscall
>             |       |   |   |
>             |       |   |    --26.23%--__arm64_sys_getpgid
>             |       |   |       |
>             |       |   |       |--18.88%--bpf_trampoline_6442462665_0
>             |       |   |       |   |
>             |       |   |       |   |--6.85%--__bpf_prog_enter
>             |       |   |       |   |   |
>             |       |   |       |   |    --2.68%--migrate_disable
>             |       |   |       |   |
>             |       |   |       |   |--5.28%--__bpf_prog_exit
>             |       |   |       |   |   |
>             |       |   |       |   |    --1.29%--migrate_enable
>             |       |   |       |   |
>             |       |   |       |   |--3.96%--bpf_prog_21856463590f61f1_bench_trigger_fentry
>             |       |   |       |   |
>             |       |   |       |    --0.61%--__rcu_read_lock
>             |       |   |       |
>             |       |   |       |--4.42%--find_task_by_vpid
>             |       |   |       |   |
>             |       |   |       |   |--2.53%--radix_tree_lookup
>             |       |   |       |   |
>             |       |   |       |    --0.61%--idr_find
>             |       |   |       |
>             |       |   |        --0.81%--pid_vnr
>             |       |   |
>             |       |    --0.53%--__arm64_sys_getpgid
>             |       |
>             |        --0.95%--invoke_syscall
>             |
>              --0.99%--syscall@plt
>
>
> 3.2 with indirect call
>
>  98.68%  0.20%  bench  bench  [.] trigger_producer
>          |
>          --98.48%--trigger_producer
>             |
>              --97.47%--syscall
>                 |
>                  --76.11%--el0t_64_sync
>                     el0t_64_sync_handler
>                     el0_svc
>                     do_el0_svc
>                     |
>                     |--75.52%--el0_svc_common
>                     |   |
>                     |    --46.35%--invoke_syscall
>                     |       |
>                     |        --44.06%--__arm64_sys_getpgid
>                     |           |
>                     |           |--35.40%--ftrace_caller
>                     |           |   |
>                     |           |    --34.04%--fprobe_handler
>                     |           |       |
>                     |           |       |--15.61%--bpf_fprobe_entry
>                     |           |       |   |
>                     |           |       |   |--3.79%--__bpf_prog_enter
>                     |           |       |   |   |
>                     |           |       |   |    --0.80%--migrate_disable
>                     |           |       |   |
>                     |           |       |   |--3.74%--__bpf_prog_exit
>                     |           |       |   |   |
>                     |           |       |   |    --0.77%--migrate_enable
>                     |           |       |   |
>                     |           |       |    --2.65%--bpf_prog_21856463590f61f1_bench_trigger_fentry
>                     |           |       |
>                     |           |       |--12.65%--rethook_trampoline_handler
>                     |           |       |
>                     |           |       |--1.70%--rethook_try_get
>                     |           |       |   |
>                     |           |       |    --1.48%--rcu_is_watching
>                     |           |       |
>                     |           |       |--1.46%--freelist_try_get
>                     |           |       |
>                     |           |        --0.65%--rethook_recycle
>                     |           |
>                     |            --6.36%--find_task_by_vpid
>                     |               |
>                     |               |--3.64%--radix_tree_lookup
>                     |               |
>                     |                --1.74%--idr_find
>                     |                   |
>                     |                    --1.05%--ftrace_caller
>                     |
>                      --0.59%--invoke_syscall
>
> This looks slightly better than before but it is actually still a
> pretty significant performance hit compared to direct calls.
>
> Note that I can't really make sense of the perf report with indirect
> calls. It always reports that 12% of the time is spent in
> rethook_trampoline_handler, but I verified with both a WARN in that
> function and a breakpoint with a debugger that this function does *not*
> get called when running this "bench trig-fentry" benchmark. Also, it
> wouldn't make sense for fprobe_handler to call it, so I'm quite
> confused why perf would report this call and such a long time spent
> there. Anyone know what I could be missing here?
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-21 11:31 ` Masami Hiramatsu
@ 2022-10-21 16:49 ` Florent Revest
-1 siblings, 0 replies; 60+ messages in thread
From: Florent Revest @ 2022-10-21 16:49 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Steven Rostedt, Xu Kuohai, Mark Rutland, Catalin Marinas,
Daniel Borkmann, Xu Kuohai, linux-arm-kernel, linux-kernel, bpf,
Will Deacon, Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren
On Fri, Oct 21, 2022 at 1:32 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> On Mon, 17 Oct 2022 19:55:06 +0200
> Florent Revest <revest@chromium.org> wrote:
> > Mark finished an implementation of his per-callsite-ops and min-args
> > branches (meaning that we can now skip the expensive ftrace's saving
> > of all registers and iteration over all ops if only one is attached)
> > - https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
> >
> > And Masami wrote similar patches to what I had originally done to
> > fprobe in my branch:
> > - https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
> >
> > So I could rebase my previous "bpf on fprobe" branch on top of these:
> > (as before, it's just good enough for benchmarking and to give a
> > general sense of the idea, not for a thorough code review):
> > - https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
> >
> > And I could run the benchmarks against my rpi4. I have different
> > baseline numbers than Xu, so I ran everything again and tried to keep the
> > format the same. "indirect call" refers to the branch I just linked and
> > "direct call" refers to the series this is a reply to (Xu's work).
>
> Thanks for sharing the measurement results. Yes, the fprobe/rethook
> implementation is just a port of the kretprobes implementation, so
> it may not be well optimized.
>
> BTW, I remember Wuqiang's patch for kretprobes.
>
> https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/#u
Oh that's a great idea, thanks for pointing it out Masami!
> That patch is aimed at fixing scalability, but it may also improve the
> performance a bit. It is not hard to port to a recent kernel.
> Can you try it too?
I rebased it on my branch:
https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
And I took measurements again. Unfortunately, it looks like this does not help :/
New benchmark results: https://paste.debian.net/1257856/
New perf report: https://paste.debian.net/1257859/
The fprobe-based approach is still significantly slower than the
direct call approach.
> Anyway, eventually, I would like to remove the current kretprobe-based
> implementation and unify the fexit hook with the function-graph
> tracer. That should give much better performance.
That makes sense. :) How do you imagine the unified solution?
Would both the fgraph and fprobe APIs keep existing, but under the hood
one would be implemented on top of the other? (Or would one be gone?) Would
we replace the rethook freelist with the function graph's per-task
shadow stacks? (Or the other way around?)
> > Note that I can't really make sense of the perf report with indirect
> > calls. It always reports that 12% of the time is spent in
> > rethook_trampoline_handler, but I verified with both a WARN in that
> > function and a breakpoint with a debugger that this function does *not*
> > get called when running this "bench trig-fentry" benchmark. Also, it
> > wouldn't make sense for fprobe_handler to call it, so I'm quite
> > confused why perf would report this call and such a long time spent
> > there. Anyone know what I could be missing here?
I made slight progress on this. If I put the vmlinux file in the cwd
where I run perf report, the reports no longer contain references to
rethook_trampoline_handler. Instead, they have a few
0xffff800008xxxxxx addresses under fprobe_handler (like in the
pastebin I just linked).
It's still pretty weird because that range is the vmalloc area on
arm64 and I don't understand why anything under fprobe_handler would
execute there. However, I'm also definitely sure that these 12% are
actually spent getting buffers from the rethook memory pool, because if
I replace the rethook_try_get and rethook_recycle calls with the usage of
a dummy static bss buffer (for the sake of benchmarking the
"theoretical best case scenario"), these weird perf report traces are
gone and the 12% are saved: https://paste.debian.net/1257862/
This is why I would be interested in seeing rethook's memory pool
reimplemented on top of something like
https://lwn.net/Articles/788923/. If we get closer to the performance
of the theoretical best case scenario where getting a blob of
memory is ~free (and I think it could be the case with a per-task
shadow stack like fgraph's), then a bpf-on-fprobe implementation would
start to approach the performance of a direct-called trampoline on
arm64: https://paste.debian.net/1257863/
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-21 16:49 ` Florent Revest
@ 2022-10-24 13:00 ` Masami Hiramatsu
-1 siblings, 0 replies; 60+ messages in thread
From: Masami Hiramatsu @ 2022-10-24 13:00 UTC (permalink / raw)
To: Florent Revest
Cc: Steven Rostedt, Xu Kuohai, Mark Rutland, Catalin Marinas,
Daniel Borkmann, Xu Kuohai, linux-arm-kernel, linux-kernel, bpf,
Will Deacon, Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren
On Fri, 21 Oct 2022 18:49:38 +0200
Florent Revest <revest@chromium.org> wrote:
> On Fri, Oct 21, 2022 at 1:32 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > On Mon, 17 Oct 2022 19:55:06 +0200
> > Florent Revest <revest@chromium.org> wrote:
> > > Mark finished an implementation of his per-callsite-ops and min-args
> > > branches (meaning that we can now skip the expensive ftrace's saving
> > > of all registers and iteration over all ops if only one is attached)
> > > - https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
> > >
> > > And Masami wrote similar patches to what I had originally done to
> > > fprobe in my branch:
> > > - https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
> > >
> > > So I could rebase my previous "bpf on fprobe" branch on top of these:
> > > (as before, it's just good enough for benchmarking and to give a
> > > general sense of the idea, not for a thorough code review):
> > > - https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
> > >
> > > And I could run the benchmarks against my rpi4. I have different
> > > baseline numbers than Xu so I ran everything again and tried to keep the
> > > format the same. "indirect call" refers to my branch I just linked and
> > > "direct call" refers to the series this is a reply to (Xu's work)
> >
> > Thanks for sharing the measurement results. Yes, the fprobe/rethook
> > implementation is just a port of the kretprobes implementation, so
> > it may not be well optimized.
> >
> > BTW, I remember Wuqiang's patch for kretprobes.
> >
> > https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/#u
>
> Oh that's a great idea, thanks for pointing it out Masami!
>
> > This is for fixing scalability, but it may be possible to improve
> > the performance a bit. It is not hard to port it to a recent kernel.
> > Can you try it too?
>
> I rebased it on my branch
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
>
> And I got measurements again. Unfortunately it looks like this does not help :/
>
> New benchmark results: https://paste.debian.net/1257856/
> New perf report: https://paste.debian.net/1257859/
Hmm, OK. So that patch only helps scalability.
>
> The fprobe based approach is still significantly slower than the
> direct call approach.
>
> > Anyway, eventually, I would like to remove the current kretprobe
> > based implementation and unify fexit hook with function-graph
> > tracer. It should give better performance.
>
> That makes sense. :) How do you imagine the unified solution?
> Would both the fgraph and fprobe APIs keep existing but under the hood
> one would be implemented on the other? (or would one be gone?) Would
> we replace the rethook freelist with the function graph's per-task
> shadow stacks? (or the other way around?)
Yes, that's right. As long as we use a global object pool, there will
be a performance bottleneck in picking up an object and returning it to
the pool. A per-CPU pool may give better performance but is more
complicated because the pools must be balanced. A per-task shadow stack
will solve both problems. So I plan to expand the fgraph API and use it
in fprobe instead of rethook. (I planned to re-implement rethook, but I
realized that it has more issues than I thought.)
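Masami's point about pool contention versus a per-task shadow stack can be illustrated with a small userspace sketch (plain C with pthreads; all names are hypothetical and this is not kernel code, just the shape of the two strategies under discussion):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct frame { unsigned long ret_addr; };

/* Global object pool (rethook today): every get/put from every CPU
 * serializes on one shared lock -- the bottleneck Masami describes. */
static struct frame global_pool[64];
static int global_top = 64;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static struct frame *pool_get(void)
{
	struct frame *f = NULL;

	pthread_mutex_lock(&pool_lock);
	if (global_top > 0)
		f = &global_pool[--global_top];
	pthread_mutex_unlock(&pool_lock);
	return f;
}

static void pool_put(struct frame *f)
{
	pthread_mutex_lock(&pool_lock);
	global_pool[global_top++] = *f;
	pthread_mutex_unlock(&pool_lock);
}

/* Per-task shadow stack (fgraph-style): the stack belongs to one task,
 * so pushing a return frame is just an index bump, no locking at all. */
struct shadow_stack {
	struct frame frames[64];
	int top;
};

static struct frame *shadow_push(struct shadow_stack *ss)
{
	return ss->top < 64 ? &ss->frames[ss->top++] : NULL;
}

static void shadow_pop(struct shadow_stack *ss)
{
	if (ss->top > 0)
		ss->top--;
}
```

With the global pool, every probe hit pays for the lock even when uncontended; with the per-task stack, allocation touches memory no other task can see, which is why it approaches the "static buffer" best case.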
> > > Note that I can't really make sense of the perf report with indirect
> > > calls. It always reports it spent 12% of the time in
> > > rethook_trampoline_handler but I verified with both a WARN in that
> > > function and a breakpoint with a debugger, this function does *not*
> > > get called when running this "bench trig-fentry" benchmark. Also it
> > > wouldn't make sense for fprobe_handler to call it so I'm quite
> > > confused why perf would report this call and such a long time spent
> > > there. Anyone know what I could be missing here?
>
> I made slight progress on this. If I put the vmlinux file in the cwd
> where I run perf report, the reports no longer contain references to
> rethook_trampoline_handler. Instead, they have a few
> 0xffff800008xxxxxx addresses under fprobe_handler. (like in the
> pastebin I just linked)
>
> It's still pretty weird because that range is the vmalloc area on
> arm64 and I don't understand why anything under fprobe_handler would
> execute there. However, I'm also definitely sure that these 12% are
> actually spent getting buffers from the rethook memory pool because if
> I replace rethook_try_get and rethook_recycle calls with the usage of
> a dummy static bss buffer (for the sake of benchmarking the
> "theoretical best case scenario") these weird perf report traces are
> gone and the 12% are saved. https://paste.debian.net/1257862/
Yeah, I understand that. Rethook (and kretprobes) is not designed
for such a heavy workload.
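The "dummy static bss buffer" trick Florent describes above can be sketched like this (illustrative C only; the names are invented stand-ins for the rethook_try_get/rethook_recycle call sites the real benchmark patched):

```c
#include <assert.h>

struct frame { unsigned long ret_addr; };

/* One static buffer living in .bss: handing it out makes "getting a
 * blob of memory" effectively free, which gives the theoretical
 * best-case baseline for the benchmark. */
static struct frame dummy_frame;

static struct frame *frame_get(void)
{
	return &dummy_frame;	/* no pool walk, no locking, no refcount */
}

static void frame_recycle(struct frame *f)
{
	(void)f;		/* nothing to return to any pool */
}
```

This is of course only valid for benchmarking: a single shared buffer is incorrect under concurrency or recursion, which is exactly why the measured 12% is the price of doing the bookkeeping correctly.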
> This is why I would be interested in seeing rethook's memory pool
> reimplemented on top of something like
> https://lwn.net/Articles/788923/ If we get closer to the performance
> of the theoretical best case scenario where getting a blob of
> memory is ~free (and I think it could be the case with a per-task
> shadow stack like fgraph's), then a bpf-on-fprobe implementation would
> start to approach the performance of a direct-called trampoline on
> arm64: https://paste.debian.net/1257863/
OK, I think we are on the same page and same direction.
Thank you,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64
2022-10-21 16:49 ` Florent Revest
@ 2022-11-10 4:58 ` wuqiang
-1 siblings, 0 replies; 60+ messages in thread
From: wuqiang @ 2022-11-10 4:58 UTC (permalink / raw)
To: Florent Revest, Masami Hiramatsu
Cc: Steven Rostedt, Xu Kuohai, Mark Rutland, Catalin Marinas,
Daniel Borkmann, Xu Kuohai, linux-arm-kernel, linux-kernel, bpf,
Will Deacon, Jean-Philippe Brucker, Ingo Molnar, Oleg Nesterov,
Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Zi Shen Lim, Pasha Tatashin, Ard Biesheuvel,
Marc Zyngier, Guo Ren
On 2022/10/22 00:49, Florent Revest wrote:
> On Fri, Oct 21, 2022 at 1:32 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>> On Mon, 17 Oct 2022 19:55:06 +0200
>> Florent Revest <revest@chromium.org> wrote:
>>> Mark finished an implementation of his per-callsite-ops and min-args
>>> branches (meaning that we can now skip the expensive ftrace's saving
>>> of all registers and iteration over all ops if only one is attached)
>>> - https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64-ftrace-call-ops-20221017
>>>
>>> And Masami wrote similar patches to what I had originally done to
>>> fprobe in my branch:
>>> - https://github.com/mhiramat/linux/commits/kprobes/fprobe-update
>>>
>>> So I could rebase my previous "bpf on fprobe" branch on top of these:
>>> (as before, it's just good enough for benchmarking and to give a
>>> general sense of the idea, not for a thorough code review):
>>> - https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
>>>
>>> And I could run the benchmarks against my rpi4. I have different
>>> baseline numbers than Xu so I ran everything again and tried to keep the
>>> format the same. "indirect call" refers to my branch I just linked and
>>> "direct call" refers to the series this is a reply to (Xu's work)
>>
>> Thanks for sharing the measurement results. Yes, the fprobe/rethook
>> implementation is just a port of the kretprobes implementation, so
>> it may not be well optimized.
>>
>> BTW, I remember Wuqiang's patch for kretprobes.
>>
>> https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/#u
>
> Oh that's a great idea, thanks for pointing it out Masami!
>
>> This is for fixing scalability, but it may be possible to improve
>> the performance a bit. It is not hard to port it to a recent kernel.
>> Can you try it too?
>
> I rebased it on my branch
> https://github.com/FlorentRevest/linux/commits/fprobe-min-args-3
>
> And I got measurements again. Unfortunately it looks like this does not help :/
>
> New benchmark results: https://paste.debian.net/1257856/
> New perf report: https://paste.debian.net/1257859/
>
> The fprobe based approach is still significantly slower than the
> direct call approach.
FYI, a new version was released, based on a ring array, which brings a 6.96%
increase in throughput in the 1-thread case on ARM64:
https://lore.kernel.org/all/20221108071443.258794-1-wuqiang.matt@bytedance.com/
Could you share more details of the test? I'll give it a try.
>> Anyway, eventually, I would like to remove the current kretprobe
>> based implementation and unify fexit hook with function-graph
>> tracer. It should give better performance.
>
> That makes sense. :) How do you imagine the unified solution?
> Would both the fgraph and fprobe APIs keep existing but under the hood
> one would be implemented on the other? (or would one be gone?) Would
> we replace the rethook freelist with the function graph's per-task
> shadow stacks? (or the other way around?)
How about a private pool designated for the local CPU? If the fprobed
routine returns on the same CPU, object allocation and reclaim can take
a quick path, which should bring the same performance as a shadow stack.
Otherwise the return of the object takes a slow path (as slow as the
current freelist or objpool).
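The quick/slow path split proposed above could look roughly like this userspace sketch (hypothetical C, not the objpool patch itself; a kernel version would use per-CPU data and lock-free lists rather than a mutex):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_SIZE 16

struct obj {
	int owner_cpu;		/* CPU whose private pool owns this object */
	struct obj *next;	/* link for the shared slow-path list */
};

struct cpu_pool {
	struct obj slots[POOL_SIZE];
	struct obj *private_free;	/* fast path: owner CPU only, no lock */
	struct obj *shared_free;	/* slow path: returns from other CPUs */
	pthread_mutex_t shared_lock;
};

static void cpu_pool_init(struct cpu_pool *p, int cpu)
{
	p->private_free = NULL;
	p->shared_free = NULL;
	pthread_mutex_init(&p->shared_lock, NULL);
	for (int i = 0; i < POOL_SIZE; i++) {
		p->slots[i].owner_cpu = cpu;
		p->slots[i].next = p->private_free;
		p->private_free = &p->slots[i];
	}
}

static struct obj *cpu_pool_get(struct cpu_pool *p)
{
	struct obj *o = p->private_free;

	if (o) {			/* quick path: no locking */
		p->private_free = o->next;
		return o;
	}
	/* refill from the shared list (slow path) */
	pthread_mutex_lock(&p->shared_lock);
	o = p->shared_free;
	if (o)
		p->shared_free = o->next;
	pthread_mutex_unlock(&p->shared_lock);
	return o;
}

static void cpu_pool_put(struct cpu_pool *pools, struct obj *o, int cur_cpu)
{
	struct cpu_pool *home = &pools[o->owner_cpu];

	if (o->owner_cpu == cur_cpu) {	/* quick path: same CPU as alloc */
		o->next = home->private_free;
		home->private_free = o;
		return;
	}
	/* slow path: the task migrated, push onto the owner's shared list */
	pthread_mutex_lock(&home->shared_lock);
	o->next = home->shared_free;
	home->shared_free = o;
	pthread_mutex_unlock(&home->shared_lock);
}
```

The design bet is that most fprobed functions return on the CPU they entered on, so the locked path is rarely taken; when the task migrates, correctness falls back to the shared list.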
>>> Note that I can't really make sense of the perf report with indirect
>>> calls. It always reports it spent 12% of the time in
>>> rethook_trampoline_handler but I verified with both a WARN in that
>>> function and a breakpoint with a debugger, this function does *not*
>>> get called when running this "bench trig-fentry" benchmark. Also it
>>> wouldn't make sense for fprobe_handler to call it so I'm quite
>>> confused why perf would report this call and such a long time spent
>>> there. Anyone know what I could be missing here?
>
> I made slight progress on this. If I put the vmlinux file in the cwd
> where I run perf report, the reports no longer contain references to
> rethook_trampoline_handler. Instead, they have a few
> 0xffff800008xxxxxx addresses under fprobe_handler. (like in the
> pastebin I just linked)
>
> It's still pretty weird because that range is the vmalloc area on
> arm64 and I don't understand why anything under fprobe_handler would
> execute there. However, I'm also definitely sure that these 12% are
> actually spent getting buffers from the rethook memory pool because if
> I replace rethook_try_get and rethook_recycle calls with the usage of
> a dummy static bss buffer (for the sake of benchmarking the
> "theoretical best case scenario") these weird perf report traces are
> gone and the 12% are saved. https://paste.debian.net/1257862/
>
> This is why I would be interested in seeing rethook's memory pool
> reimplemented on top of something like
> https://lwn.net/Articles/788923/ If we get closer to the performance
> of the theoretical best case scenario where getting a blob of
> memory is ~free (and I think it could be the case with a per-task
> shadow stack like fgraph's), then a bpf-on-fprobe implementation would
> start to approach the performance of a direct-called trampoline on
> arm64: https://paste.debian.net/1257863/
^ permalink raw reply [flat|nested] 60+ messages in thread
end of thread, other threads:[~2022-11-10 5:00 UTC | newest]
Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-13 16:27 [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64 Xu Kuohai
2022-09-13 16:27 ` [PATCH bpf-next v2 1/4] ftrace: Allow users to disable ftrace direct call Xu Kuohai
2022-09-13 16:27 ` [PATCH bpf-next v2 2/4] arm64: ftrace: Support long jump for " Xu Kuohai
2022-09-13 16:27 ` [PATCH bpf-next v2 3/4] arm64: ftrace: Add ftrace direct call support Xu Kuohai
2022-09-13 16:27 ` [PATCH bpf-next v2 4/4] ftrace: Fix dead loop caused by direct call in ftrace selftest Xu Kuohai
2022-09-22 18:01 ` [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64 Daniel Borkmann
2022-09-26 14:40 ` Catalin Marinas
2022-09-26 17:43 ` Mark Rutland
2022-09-27 4:49 ` Xu Kuohai
2022-09-28 16:42 ` Mark Rutland
2022-09-30 4:07 ` Xu Kuohai
2022-10-04 16:06 ` Florent Revest
2022-10-05 14:54 ` Xu Kuohai
2022-10-05 15:07 ` Steven Rostedt
2022-10-05 15:10 ` Florent Revest
2022-10-05 15:30 ` Steven Rostedt
2022-10-05 22:12 ` Jiri Olsa
2022-10-06 16:35 ` Florent Revest
2022-10-06 10:09 ` Xu Kuohai
2022-10-06 16:19 ` Florent Revest
2022-10-06 16:29 ` Steven Rostedt
2022-10-07 10:13 ` Xu Kuohai
2022-10-17 17:55 ` Florent Revest
2022-10-17 18:49 ` Steven Rostedt
2022-10-17 19:10 ` Florent Revest
2022-10-21 11:31 ` Masami Hiramatsu
2022-10-21 16:49 ` Florent Revest
2022-10-24 13:00 ` Masami Hiramatsu
2022-11-10 4:58 ` wuqiang
2022-10-06 10:09 ` Xu Kuohai