* [PATCH bpf-next 0/3] Inline two LBR-related helpers
@ 2024-03-21 18:04 Andrii Nakryiko
  2024-03-21 18:04 ` [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-21 18:04 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: peterz, song, Andrii Nakryiko

Implement inlining of the bpf_get_branch_snapshot() BPF helper using a generic
BPF assembly approach.

Also inline the bpf_get_smp_processor_id() BPF helper, but using
architecture-specific assembly code in the x86-64 JIT compiler, given that
getting the CPU ID is highly architecture-specific.

These two helpers are on the critical direct path to grabbing LBR records from
a BPF program, and inlining them helps save 3 LBR records in
PERF_SAMPLE_BRANCH_ANY mode.

Just to give some visual idea of the effect of these changes (and of the
inlining of kprobe_multi_link_prog_run() posted as a separate patch), here is
retsnoop's LBR output (with the --lbr=any flag). I only show the "wasted"
records that are needed to go from the moment some event happened (a kernel
function return in this case) to triggering the BPF program that captures LBR
*as the very first thing* (after getting the CPU ID to fetch a temporary
buffer).
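
To make the setup concrete, here is a minimal sketch of the kind of BPF
program whose "wasted" records are counted below. This is not retsnoop's
actual source (the attach point, map shape, and sizes are made up for
illustration); only the helper calls and their ordering reflect the pattern
being measured.

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define MAX_LBR 32

  struct lbr_buf { struct perf_branch_entry entries[MAX_LBR]; };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 256); /* assumed upper bound on CPU count */
          __type(key, u32);
          __type(value, struct lbr_buf);
  } lbr_scratch SEC(".maps");

  SEC("fexit/__x64_sys_bpf")
  int BPF_PROG(fexit1)
  {
          u32 cpu = bpf_get_smp_processor_id();
          struct lbr_buf *buf = bpf_map_lookup_elem(&lbr_scratch, &cpu);

          if (!buf)
                  return 0;
          /* capture LBR as the very first thing; every instruction executed
           * before this call shows up as a "wasted" record
           */
          bpf_get_branch_snapshot(buf->entries, sizeof(buf->entries), 0);
          /* ... the rest of the program can take its time ... */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";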

There are still ways to reduce the number of "wasted" records further; this is
a problem that requires many small and rather independent steps.

fentry mode
===========

BEFORE
------
  [#10] __sys_bpf+0x270                          ->  __x64_sys_bpf+0x18
  [#09] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508684+0x7f
  [#08] bpf_trampoline_6442508684+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#07] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#06] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#05] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508684+0xa1
  [#04] bpf_trampoline_6442508684+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
  [#03] bpf_prog_dc54a596b39d4177_fexit1+0x32    ->  bpf_get_smp_processor_id+0x0
  [#02] bpf_get_smp_processor_id+0xe             ->  bpf_prog_dc54a596b39d4177_fexit1+0x37
  [#01] bpf_prog_dc54a596b39d4177_fexit1+0xe0    ->  bpf_get_branch_snapshot+0x0
  [#00] bpf_get_branch_snapshot+0x13             ->  intel_pmu_snapshot_branch_stack+0x0

AFTER
-----
  [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
  [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
  [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0

multi-kprobe mode
=================

BEFORE
------
  [#14] __sys_bpf+0x270                          ->  arch_rethook_trampoline+0x0
  [#13] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#12] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#11] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#10] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#09] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#08] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#07] kprobe_multi_link_exit_handler+0x4       ->  kprobe_multi_link_prog_run+0x0
  [#06] kprobe_multi_link_prog_run+0x2d          ->  migrate_disable+0x0
  [#05] migrate_disable+0x37                     ->  kprobe_multi_link_prog_run+0x32
  [#04] kprobe_multi_link_prog_run+0x58          ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x0
  [#03] bpf_prog_2b455b4f8a8d48c5_kexit+0x32     ->  bpf_get_smp_processor_id+0x0
  [#02] bpf_get_smp_processor_id+0xe             ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x37
  [#01] bpf_prog_2b455b4f8a8d48c5_kexit+0x82     ->  bpf_get_branch_snapshot+0x0
  [#00] bpf_get_branch_snapshot+0x13             ->  intel_pmu_snapshot_branch_stack+0x0

AFTER
-----
  [#10] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#09] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#08] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#07] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#06] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#05] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#04] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#03] kprobe_multi_link_exit_handler+0x31      ->  migrate_disable+0x0
  [#02] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36
  [#01] kprobe_multi_link_exit_handler+0x5c      ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x0
  [#00] bpf_prog_2b455b4f8a8d48c5_kexit+0xa3     ->  intel_pmu_snapshot_branch_stack+0x0


For default --lbr mode (PERF_SAMPLE_BRANCH_ANY_RETURN), interestingly enough,
multi-kprobe is *less* wasteful (by one function call):

fentry mode
===========

BEFORE
------
  [#04] __sys_bpf+0x270                          ->  __x64_sys_bpf+0x18
  [#03] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508684+0x7f
  [#02] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#01] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508684+0xa1
  [#00] bpf_get_smp_processor_id+0xe             ->  bpf_prog_dc54a596b39d4177_fexit1+0x37

AFTER
-----
  [#03] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#02] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#01] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#00] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1

multi-kprobe mode
=================

BEFORE
------
  [#03] __sys_bpf+0x270                          ->  arch_rethook_trampoline+0x0
  [#02] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#01] migrate_disable+0x37                     ->  kprobe_multi_link_prog_run+0x32
  [#00] bpf_get_smp_processor_id+0xe             ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x37

AFTER
-----
  [#02] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#01] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#00] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36

Andrii Nakryiko (3):
  bpf: make bpf_get_branch_snapshot() architecture-agnostic
  bpf: inline bpf_get_branch_snapshot() helper
  bpf,x86: inline bpf_get_smp_processor_id() on x86-64

 arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
 kernel/bpf/verifier.c       | 37 +++++++++++++++++++++++++++++++++++++
 kernel/trace/bpf_trace.c    |  4 ----
 3 files changed, 62 insertions(+), 5 deletions(-)

-- 
2.43.0



* [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic
  2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko
@ 2024-03-21 18:04 ` Andrii Nakryiko
  2024-03-21 21:08   ` Jiri Olsa
  2024-03-21 18:05 ` [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-21 18:04 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: peterz, song, Andrii Nakryiko

perf_snapshot_branch_stack is set up in an architecture-agnostic way, so
there is no reason for the BPF subsystem to keep track of which
architectures do or don't support LBR. E.g., it looks like ARM64 might soon
get support for BRBE ([0]), which (with proper integration) should be
usable through this BPF helper.

The perf_snapshot_branch_stack static call points to
__static_call_return0() by default, which just returns zero, which in turn
leads to -ENOENT, as expected. So there is no need to guard anything here.

  [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/
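
To spell out the -ENOENT path on an architecture without LBR support, here
is a sketch of the effective behavior (a simplification, not a literal copy
of the helper code below):

  /* with no arch-specific implementation registered, the static call
   * resolves to __static_call_return0(), so the snapshot "captures"
   * zero entries
   */
  entry_cnt = static_call(perf_snapshot_branch_stack)(buf, entry_cnt); /* == 0 */
  if (!entry_cnt)
          return -ENOENT; /* same result the removed #ifndef guard produced */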

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/trace/bpf_trace.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 434e3ece6688..6d000332b17b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1182,9 +1182,6 @@ static const struct bpf_func_proto bpf_get_attach_cookie_proto_tracing = {
 
 BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
 {
-#ifndef CONFIG_X86
-	return -ENOENT;
-#else
 	static const u32 br_entry_size = sizeof(struct perf_branch_entry);
 	u32 entry_cnt = size / br_entry_size;
 
@@ -1197,7 +1194,6 @@ BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
 		return -ENOENT;
 
 	return entry_cnt * br_entry_size;
-#endif
 }
 
 static const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
-- 
2.43.0



* [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper
  2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko
  2024-03-21 18:04 ` [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
@ 2024-03-21 18:05 ` Andrii Nakryiko
  2024-03-21 21:08   ` Jiri Olsa
  2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko
  2024-03-21 23:46 ` [PATCH bpf-next 0/3] Inline two LBR-related helpers Alexei Starovoitov
  3 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-21 18:05 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: peterz, song, Andrii Nakryiko

Inline the bpf_get_branch_snapshot() helper using architecture-agnostic
inline BPF code which calls directly into the underlying callback of the
perf_snapshot_branch_stack static call. This callback is set up early
during kernel initialization and is never updated or reset, so it's OK
to fetch the actual implementation using static_call_query() and call
directly into it.
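
In C terms, the emitted BPF sequence below roughly computes the following
(just a sketch; snapshot_cb is a stand-in name for whatever callback
static_call_query() returns on a given system):

  if (flags)
          return -EINVAL;
  cnt = size / sizeof(struct perf_branch_entry);
  cnt = snapshot_cb(buf, cnt);    /* direct call, no static call trampoline */
  if (cnt == 0)
          return -ENOENT;
  return cnt * sizeof(struct perf_branch_entry);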

This change eliminates a full function call and saves one LBR entry
in PERF_SAMPLE_BRANCH_ANY LBR mode.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/verifier.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index de7813947981..4fb6c468e199 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20130,6 +20130,43 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
+		/* Implement bpf_get_branch_snapshot inline. */
+		if (prog->jit_requested && BITS_PER_LONG == 64 &&
+		    insn->imm == BPF_FUNC_get_branch_snapshot) {
+			/* We are dealing with the following func protos:
+			 * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
+			 * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
+			 */
+			const u32 br_entry_size = sizeof(struct perf_branch_entry);
+
+			/* if (unlikely(flags)) return -EINVAL */
+			insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 5);
+			/* transform size (bytes) into entry_cnt */
+			insn_buf[1] = BPF_ALU32_IMM(BPF_DIV, BPF_REG_2, br_entry_size);
+			/* call perf_snapshot_branch_stack implementation */
+			insn_buf[2] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
+			/* if (entry_cnt == 0) return -ENOENT */
+			insn_buf[3] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4);
+			/* return entry_cnt * sizeof(struct perf_branch_entry) */
+			insn_buf[4] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size);
+			insn_buf[5] = BPF_JMP_A(3);
+			/* return -EINVAL; */
+			insn_buf[6] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL);
+			insn_buf[7] = BPF_JMP_A(1);
+			/* return -ENOENT; */
+			insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT);
+			cnt = 9;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			continue;
+		}
+
 		/* Implement bpf_kptr_xchg inline */
 		if (prog->jit_requested && BITS_PER_LONG == 64 &&
 		    insn->imm == BPF_FUNC_kptr_xchg &&
-- 
2.43.0



* [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko
  2024-03-21 18:04 ` [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
  2024-03-21 18:05 ` [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
@ 2024-03-21 18:05 ` Andrii Nakryiko
  2024-03-21 21:08   ` Jiri Olsa
  2024-03-21 23:49   ` Alexei Starovoitov
  2024-03-21 23:46 ` [PATCH bpf-next 0/3] Inline two LBR-related helpers Alexei Starovoitov
  3 siblings, 2 replies; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-21 18:05 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: peterz, song, Andrii Nakryiko

Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's
gs segment-based addressing.

Just to be on the safe side, both RIP-relative addressing is implemented
(providing a shorter instruction, but limiting the offset to signed 32 bits)
and the more universal absolute memory offset addressing is used as
a fallback in the (unlikely) scenario that the given offset doesn't fit
into s32. The latter is 5 bytes longer, and it seems compilers prefer
RIP-relative instructions when compiling kernel code.

Both instructions were tested and confirmed using gdb. We also already
have a BPF selftest (raw_tp_test_run) that validates the correctness of
bpf_get_smp_processor_id() while running the target BPF program on each
online CPU.

Here's a disassembly of bpf_get_smp_processor_id() helper:

$ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id'
Dump of assembler code for function bpf_get_smp_processor_id:
   0xffffffff810fa890 <+0>:     0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
   0xffffffff810fa895 <+5>:     65 8b 05 70 62 f3 7e    mov    eax,DWORD PTR gs:[rip+0x7ef36270]        # 0x30b0c <pcpu_hot+12>
   0xffffffff810fa89c <+12>:    48 98                   cdqe
   0xffffffff810fa89e <+14>:    c3                      ret
End of assembler dump.

And here's a GDB disassembly dump of a piece of BPF program calling
bpf_get_smp_processor_id().

  $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit'
  000000000002d840 A pcpu_hot
  ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit      [bpf]

Then attaching GDB to the running kernel in QEMU and breaking inside BPF
program:

(gdb) b *0xffffffffa000f8e2
Breakpoint 1 at 0xffffffffa000f8e2

When RIP-relative instruction is used:

  0xffffffffa000f8e2      mov    %gs:0x6001df63(%rip),%eax        # 0x2d84c <pcpu_hot+12>
  0xffffffffa000f8e9      cltq

You can see that the final address resolves to <pcpu_hot+12>, as expected.

When absolute addressing is used:

  0xffffffffa000f8e2      movabs %gs:0x2d84c,%eax
  0xffffffffa000f8ed      cltq

And here 0x2d84c matches the pcpu_hot address from kallsyms (0x2d840)
plus the 12-byte (0xc) offset of the cpu_number field.

This inlining eliminates an entire function call for this (rather trivial in
terms of instructions executed) helper, saving a bit of performance, but, more
importantly, saving LBR records (1 in PERF_SAMPLE_BRANCH_ANY_RETURN mode, and
2 in PERF_SAMPLE_BRANCH_ANY mode), which is what motivated this work in the
first place.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4900b1ee019f..5b7fdc24b5b8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	*pprog = prog;
 }
 
+/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */
+extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64);
+
 static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
 {
 	u8 *prog = *pprog;
@@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
 		pr_err("Target call %p is out of range\n", func);
 		return -ERANGE;
 	}
-	EMIT1_off32(opcode, offset);
+
+	/* inline bpf_get_smp_processor_id() to avoid calls */
+	if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) {
+		/* 7 to account for the mov instruction itself,
+		 * as rip value *after* mov instruction is used
+		 */
+		offset = (void *)&pcpu_hot.cpu_number - ip - 7;
+		if (is_simm32(offset)) {
+			/* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */
+			EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset);
+		} else {
+			/* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */
+			offset = (s64)(void *)&pcpu_hot.cpu_number;
+			EMIT2(0x65, 0xa1);
+			EMIT((u32)offset, 4);
+			EMIT((u64)offset >> 32, 4);
+		}
+		EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */
+	} else {
+		EMIT1_off32(opcode, offset);
+	}
+
 	*pprog = prog;
 	return 0;
 }
-- 
2.43.0



* Re: [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic
  2024-03-21 18:04 ` [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
@ 2024-03-21 21:08   ` Jiri Olsa
  0 siblings, 0 replies; 23+ messages in thread
From: Jiri Olsa @ 2024-03-21 21:08 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, peterz, song

On Thu, Mar 21, 2024 at 11:04:59AM -0700, Andrii Nakryiko wrote:
> perf_snapshot_branch_stack is set up in an architecture-agnostic way, so
> there is no reason for BPF subsystem to keep track of which
> architectures do support LBR or not. E.g., it looks like ARM64 might soon
> get support for BRBE ([0]), which (with proper integration) should be
> possible to utilize using this BPF helper.
> 
> perf_snapshot_branch_stack static call will point to
> __static_call_return0() by default, which just returns zero, which will
> lead to -ENOENT, as expected. So no need to guard anything here.
> 
>   [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Acked-by: Jiri Olsa <jolsa@kernel.org>

jirka

> ---
>  kernel/trace/bpf_trace.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 434e3ece6688..6d000332b17b 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1182,9 +1182,6 @@ static const struct bpf_func_proto bpf_get_attach_cookie_proto_tracing = {
>  
>  BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
>  {
> -#ifndef CONFIG_X86
> -	return -ENOENT;
> -#else
>  	static const u32 br_entry_size = sizeof(struct perf_branch_entry);
>  	u32 entry_cnt = size / br_entry_size;
>  
> @@ -1197,7 +1194,6 @@ BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
>  		return -ENOENT;
>  
>  	return entry_cnt * br_entry_size;
> -#endif
>  }
>  
>  static const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
> -- 
> 2.43.0
> 
> 


* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko
@ 2024-03-21 21:08   ` Jiri Olsa
  2024-03-21 21:09     ` Andrii Nakryiko
  2024-03-21 23:49   ` Alexei Starovoitov
  1 sibling, 1 reply; 23+ messages in thread
From: Jiri Olsa @ 2024-03-21 21:08 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, peterz, song

On Thu, Mar 21, 2024 at 11:05:01AM -0700, Andrii Nakryiko wrote:
> Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's
> gs segment-based addressing.
> 
> Just to be on the safer side both rip-relative addressing is implemented
> (providing a shorter instruction, but limiting offset to signed 32 bits)
> and more universal absolute memory offset addressing is used as
> a fallback in (unlikely) scenario that given offset doesn't fit int s32.
> The latter is 5 bytes longer, and it seems compilers prefer rip-relative
> instructions when compiling kernel code.
> 
> Both instructions were tested and confirmed using gdb. We also already
> have a BPF selftest (raw_tp_test_run) that validates correctness of
> bpf_get_smp_processor_id(), while running target BPF program on each
> online CPU.
> 
> Here's a disassembly of bpf_get_smp_processor_id() helper:
> 
> $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id'
> Dump of assembler code for function bpf_get_smp_processor_id:
>    0xffffffff810fa890 <+0>:     0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
>    0xffffffff810fa895 <+5>:     65 8b 05 70 62 f3 7e    mov    eax,DWORD PTR gs:[rip+0x7ef36270]        # 0x30b0c <pcpu_hot+12>
>    0xffffffff810fa89c <+12>:    48 98                   cdqe
>    0xffffffff810fa89e <+14>:    c3                      ret
> End of assembler dump.
> 
> And here's a GDB disassembly dump of a piece of BPF program calling
> bpf_get_smp_processor_id().
> 
>   $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit'
>   000000000002d840 A pcpu_hot
>   ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit      [bpf]
> 
> Then attaching GDB to the running kernel in QEMU and breaking inside BPF
> program:
> 
> (gdb) b *0xffffffffa000f8e2
> Breakpoint 1 at 0xffffffffa000f8e2
> 
> When RIP-relative instruction is used:
> 
>   0xffffffffa000f8e2      mov    %gs:0x6001df63(%rip),%eax        # 0x2d84c <pcpu_hot+12>
>   0xffffffffa000f8e9      cltq
> 
> You can see that final address is resolved to <pcpu_hot+12> as expected.
> 
> When absolute addressing is used:
> 
>   0xffffffffa000f8e2      movabs %gs:0x2d84c,%eax
>   0xffffffffa000f8ed      cltq
> 
> And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840),
> plus 12 (0xc) bytes offset of cpu_number field.
> 
> This inlining eliminates entire function call for this (rather trivial in terms
> of instructions executed) helper, saving a bit of performance, but foremost
> saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for
> PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first
> place.

this should also 'fix' the k[ret]probe-multi-fast benchmark issue right?

https://lore.kernel.org/bpf/20240315051813.1320559-2-andrii@kernel.org/

jirka

> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 4900b1ee019f..5b7fdc24b5b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	*pprog = prog;
>  }
>  
> +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */
> +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64);
> +
>  static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
>  {
>  	u8 *prog = *pprog;
> @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
>  		pr_err("Target call %p is out of range\n", func);
>  		return -ERANGE;
>  	}
> -	EMIT1_off32(opcode, offset);
> +
> +	/* inline bpf_get_smp_processor_id() to avoid calls */
> +	if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) {
> +		/* 7 to account for the mov instruction itself,
> +		 * as rip value *after* mov instruction is used
> +		 */
> +		offset = (void *)&pcpu_hot.cpu_number - ip - 7;
> +		if (is_simm32(offset)) {
> +			/* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */
> +			EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset);
> +		} else {
> +			/* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */
> +			offset = (s64)(void *)&pcpu_hot.cpu_number;
> +			EMIT2(0x65, 0xa1);
> +			EMIT((u32)offset, 4);
> +			EMIT((u64)offset >> 32, 4);
> +		}
> +		EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */
> +	} else {
> +		EMIT1_off32(opcode, offset);
> +	}
> +
>  	*pprog = prog;
>  	return 0;
>  }
> -- 
> 2.43.0
> 
> 


* Re: [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper
  2024-03-21 18:05 ` [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
@ 2024-03-21 21:08   ` Jiri Olsa
  2024-03-21 21:27     ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Jiri Olsa @ 2024-03-21 21:08 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, peterz, song

On Thu, Mar 21, 2024 at 11:05:00AM -0700, Andrii Nakryiko wrote:
> Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> inline BPF code which calls directly into underlying callback of
> perf_snapshot_branch_stack static call. This callback is set early
> during kernel initialization and is never updated or reset, so it's ok
> to fetch actual implementation using static_call_query() and call
> directly into it.
> 
> This change eliminates a full function call and saves one LBR entry
> in PERF_SAMPLE_BRANCH_ANY LBR mode.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/bpf/verifier.c | 37 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index de7813947981..4fb6c468e199 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20130,6 +20130,43 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>  			goto next_insn;
>  		}
>  
> +		/* Implement bpf_get_branch_snapshot inline. */
> +		if (prog->jit_requested && BITS_PER_LONG == 64 &&
> +		    insn->imm == BPF_FUNC_get_branch_snapshot) {
> +			/* We are dealing with the following func protos:
> +			 * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
> +			 * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
> +			 */
> +			const u32 br_entry_size = sizeof(struct perf_branch_entry);
> +
> +			/* if (unlikely(flags)) return -EINVAL */
> +			insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 5);

nit, you moved the flags check on top, which I think makes sense, and
we should do it in bpf_get_branch_snapshot as well to keep them the same

jirka

> +			/* transform size (bytes) into entry_cnt */
> +			insn_buf[1] = BPF_ALU32_IMM(BPF_DIV, BPF_REG_2, br_entry_size);
> +			/* call perf_snapshot_branch_stack implementation */
> +			insn_buf[2] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
> +			/* if (entry_cnt == 0) return -ENOENT */
> +			insn_buf[3] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4);
> +			/* return entry_cnt * sizeof(struct perf_branch_entry) */
> +			insn_buf[4] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size);
> +			insn_buf[5] = BPF_JMP_A(3);
> +			/* return -EINVAL; */
> +			insn_buf[6] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL);
> +			insn_buf[7] = BPF_JMP_A(1);
> +			/* return -ENOENT; */
> +			insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT);
> +			cnt = 9;
> +
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta    += cnt - 1;
> +			env->prog = prog = new_prog;
> +			insn      = new_prog->insnsi + i + delta;
> +			continue;
> +		}
> +
>  		/* Implement bpf_kptr_xchg inline */
>  		if (prog->jit_requested && BITS_PER_LONG == 64 &&
>  		    insn->imm == BPF_FUNC_kptr_xchg &&
> -- 
> 2.43.0
> 
> 


* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-21 21:08   ` Jiri Olsa
@ 2024-03-21 21:09     ` Andrii Nakryiko
  2024-03-21 22:57       ` Jiri Olsa
  0 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-21 21:09 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, peterz, song

On Thu, Mar 21, 2024 at 2:08 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 11:05:01AM -0700, Andrii Nakryiko wrote:
> > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's
> > gs segment-based addressing.
> >
> > Just to be on the safer side both rip-relative addressing is implemented
> > (providing a shorter instruction, but limiting offset to signed 32 bits)
> > and more universal absolute memory offset addressing is used as
> > a fallback in (unlikely) scenario that given offset doesn't fit int s32.
> > The latter is 5 bytes longer, and it seems compilers prefer rip-relative
> > instructions when compiling kernel code.
> >
> > Both instructions were tested and confirmed using gdb. We also already
> > have a BPF selftest (raw_tp_test_run) that validates correctness of
> > bpf_get_smp_processor_id(), while running target BPF program on each
> > online CPU.
> >
> > Here's a disassembly of bpf_get_smp_processor_id() helper:
> >
> > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id'
> > Dump of assembler code for function bpf_get_smp_processor_id:
> >    0xffffffff810fa890 <+0>:     0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> >    0xffffffff810fa895 <+5>:     65 8b 05 70 62 f3 7e    mov    eax,DWORD PTR gs:[rip+0x7ef36270]        # 0x30b0c <pcpu_hot+12>
> >    0xffffffff810fa89c <+12>:    48 98                   cdqe
> >    0xffffffff810fa89e <+14>:    c3                      ret
> > End of assembler dump.
> >
> > And here's a GDB disassembly dump of a piece of BPF program calling
> > bpf_get_smp_processor_id().
> >
> >   $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit'
> >   000000000002d840 A pcpu_hot
> >   ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit      [bpf]
> >
> > Then attaching GDB to the running kernel in QEMU and breaking inside BPF
> > program:
> >
> > (gdb) b *0xffffffffa000f8e2
> > Breakpoint 1 at 0xffffffffa000f8e2
> >
> > When RIP-relative instruction is used:
> >
> >   0xffffffffa000f8e2      mov    %gs:0x6001df63(%rip),%eax        # 0x2d84c <pcpu_hot+12>
> >   0xffffffffa000f8e9      cltq
> >
> > You can see that final address is resolved to <pcpu_hot+12> as expected.
> >
> > When absolute addressing is used:
> >
> >   0xffffffffa000f8e2      movabs %gs:0x2d84c,%eax
> >   0xffffffffa000f8ed      cltq
> >
> > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840),
> > plus 12 (0xc) bytes offset of cpu_number field.
> >
> > This inlining eliminates entire function call for this (rather trivial in terms
> > of instructions executed) helper, saving a bit of performance, but foremost
> > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for
> > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first
> > place.
>
> this should also 'fix' the k[ret]probe-multi-fast benchmark issue right?

I already fixed it locally by switching to bpf_get_numa_node_id(), but
this change would generally make my original approach not work because
bpf_get_smp_processor_id() isn't actually called at runtime on x86-64
:)

>
> https://lore.kernel.org/bpf/20240315051813.1320559-2-andrii@kernel.org/
>
> jirka
>
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index 4900b1ee019f..5b7fdc24b5b8 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >       *pprog = prog;
> >  }
> >
> > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */
> > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64);
> > +
> >  static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> >  {
> >       u8 *prog = *pprog;
> > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> >               pr_err("Target call %p is out of range\n", func);
> >               return -ERANGE;
> >       }
> > -     EMIT1_off32(opcode, offset);
> > +
> > +     /* inline bpf_get_smp_processor_id() to avoid calls */
> > +     if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) {
> > +             /* 7 to account for the mov instruction itself,
> > +              * as rip value *after* mov instruction is used
> > +              */
> > +             offset = (void *)&pcpu_hot.cpu_number - ip - 7;
> > +             if (is_simm32(offset)) {
> > +                     /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */
> > +                     EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset);
> > +             } else {
> > +                     /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */
> > +                     offset = (s64)(void *)&pcpu_hot.cpu_number;
> > +                     EMIT2(0x65, 0xa1);
> > +                     EMIT((u32)offset, 4);
> > +                     EMIT((u64)offset >> 32, 4);
> > +             }
> > +             EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */
> > +     } else {
> > +             EMIT1_off32(opcode, offset);
> > +     }
> > +
> >       *pprog = prog;
> >       return 0;
> >  }
> > --
> > 2.43.0
> >
> >


* Re: [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper
  2024-03-21 21:08   ` Jiri Olsa
@ 2024-03-21 21:27     ` Andrii Nakryiko
  0 siblings, 0 replies; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-21 21:27 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, peterz, song

On Thu, Mar 21, 2024 at 2:08 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 11:05:00AM -0700, Andrii Nakryiko wrote:
> > Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> > inline BPF code which calls directly into underlying callback of
> > perf_snapshot_branch_stack static call. This callback is set early
> > during kernel initialization and is never updated or reset, so it's ok
> > to fetch actual implementation using static_call_query() and call
> > directly into it.
> >
> > This change eliminates a full function call and saves one LBR entry
> > in PERF_SAMPLE_BRANCH_ANY LBR mode.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/bpf/verifier.c | 37 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 37 insertions(+)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index de7813947981..4fb6c468e199 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -20130,6 +20130,43 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >                       goto next_insn;
> >               }
> >
> > +             /* Implement bpf_get_branch_snapshot inline. */
> > +             if (prog->jit_requested && BITS_PER_LONG == 64 &&
> > +                 insn->imm == BPF_FUNC_get_branch_snapshot) {
> > +                     /* We are dealing with the following func protos:
> > +                      * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
> > +                      * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
> > +                      */
> > +                     const u32 br_entry_size = sizeof(struct perf_branch_entry);
> > +
> > +                     /* if (unlikely(flags)) return -EINVAL */
> > +                     insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 5);
>
> nit, you moved the flags check on top, which I think makes sense and
> we should do it in bpf_get_branch_snapshot as well to keep it same
>

here I could control that the branch won't be taken in the common case,
so if we are to do that in the BPF helper itself, we should use unlikely()
and check that the compiler actually honored it. I can add that in the
next revision.
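
Roughly something like this for the helper itself (just a sketch of the
reordering, not something included in this series):

  BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
  {
          static const u32 br_entry_size = sizeof(struct perf_branch_entry);
          u32 entry_cnt = size / br_entry_size;

          /* check flags before doing the (relatively costly) snapshot */
          if (unlikely(flags))
                  return -EINVAL;

          entry_cnt = static_call(perf_snapshot_branch_stack)(buf, entry_cnt);
          if (!entry_cnt)
                  return -ENOENT;

          return entry_cnt * br_entry_size;
  }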

> jirka
>
> > +                     /* transform size (bytes) into entry_cnt */
> > +                     insn_buf[1] = BPF_ALU32_IMM(BPF_DIV, BPF_REG_2, br_entry_size);
> > +                     /* call perf_snapshot_branch_stack implementation */
> > +                     insn_buf[2] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
> > +                     /* if (entry_cnt == 0) return -ENOENT */
> > +                     insn_buf[3] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4);
> > +                     /* return entry_cnt * sizeof(struct perf_branch_entry) */
> > +                     insn_buf[4] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size);
> > +                     insn_buf[5] = BPF_JMP_A(3);
> > +                     /* return -EINVAL; */
> > +                     insn_buf[6] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL);
> > +                     insn_buf[7] = BPF_JMP_A(1);
> > +                     /* return -ENOENT; */
> > +                     insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT);
> > +                     cnt = 9;
> > +
> > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > +                     if (!new_prog)
> > +                             return -ENOMEM;
> > +
> > +                     delta    += cnt - 1;
> > +                     env->prog = prog = new_prog;
> > +                     insn      = new_prog->insnsi + i + delta;
> > +                     continue;
> > +             }
> > +
> >               /* Implement bpf_kptr_xchg inline */
> >               if (prog->jit_requested && BITS_PER_LONG == 64 &&
> >                   insn->imm == BPF_FUNC_kptr_xchg &&
> > --
> > 2.43.0
> >
> >


* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-21 21:09     ` Andrii Nakryiko
@ 2024-03-21 22:57       ` Jiri Olsa
  2024-03-21 23:38         ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Jiri Olsa @ 2024-03-21 22:57 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Andrii Nakryiko, bpf, ast, daniel, martin.lau, peterz, song

On Thu, Mar 21, 2024 at 02:09:41PM -0700, Andrii Nakryiko wrote:
> On Thu, Mar 21, 2024 at 2:08 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 11:05:01AM -0700, Andrii Nakryiko wrote:
> > > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's
> > > gs segment-based addressing.
> > >
> > > Just to be on the safer side both rip-relative addressing is implemented
> > > (providing a shorter instruction, but limiting offset to signed 32 bits)
> > > and more universal absolute memory offset addressing is used as
> > > a fallback in (unlikely) scenario that given offset doesn't fit int s32.
> > > The latter is 5 bytes longer, and it seems compilers prefer rip-relative
> > > instructions when compiling kernel code.
> > >
> > > Both instructions were tested and confirmed using gdb. We also already
> > > have a BPF selftest (raw_tp_test_run) that validates correctness of
> > > bpf_get_smp_processor_id(), while running target BPF program on each
> > > online CPU.
> > >
> > > Here's a disassembly of bpf_get_smp_processor_id() helper:
> > >
> > > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id'
> > > Dump of assembler code for function bpf_get_smp_processor_id:
> > >    0xffffffff810fa890 <+0>:     0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> > >    0xffffffff810fa895 <+5>:     65 8b 05 70 62 f3 7e    mov    eax,DWORD PTR gs:[rip+0x7ef36270]        # 0x30b0c <pcpu_hot+12>
> > >    0xffffffff810fa89c <+12>:    48 98                   cdqe
> > >    0xffffffff810fa89e <+14>:    c3                      ret
> > > End of assembler dump.
> > >
> > > And here's a GDB disassembly dump of a piece of BPF program calling
> > > bpf_get_smp_processor_id().
> > >
> > >   $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit'
> > >   000000000002d840 A pcpu_hot
> > >   ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit      [bpf]
> > >
> > > Then attaching GDB to the running kernel in QEMU and breaking inside BPF
> > > program:
> > >
> > > (gdb) b *0xffffffffa000f8e2
> > > Breakpoint 1 at 0xffffffffa000f8e2
> > >
> > > When RIP-relative instruction is used:
> > >
> > >   0xffffffffa000f8e2      mov    %gs:0x6001df63(%rip),%eax        # 0x2d84c <pcpu_hot+12>
> > >   0xffffffffa000f8e9      cltq
> > >
> > > You can see that final address is resolved to <pcpu_hot+12> as expected.
> > >
> > > When absolute addressing is used:
> > >
> > >   0xffffffffa000f8e2      movabs %gs:0x2d84c,%eax
> > >   0xffffffffa000f8ed      cltq
> > >
> > > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840),
> > > plus 12 (0xc) bytes offset of cpu_number field.
> > >
> > > This inlining eliminates entire function call for this (rather trivial in terms
> > > of instructions executed) helper, saving a bit of performance, but foremost
> > > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for
> > > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first
> > > place.
> >
> > this should also 'fix' the k[ret]probe-multi-fast benchmark issue right?
> 
> I already fixed it locally by switching to bpf_get_numa_node_id(), but
> this change would generally make my original approach not work because
> bpf_get_smp_processor_id() isn't actually called at runtime on x86-64
> :)

hm, but the reason was that the program attached to bpf_get_smp_processor_id
called the bpf_get_smp_processor_id helper:
  https://lore.kernel.org/bpf/CAEf4BzayNECKkmc4=XfLW5fzsPozMnjqOEmGO+r2UmEQXt1XyA@mail.gmail.com/

inlining of bpf_get_smp_processor_id helper call would prevent that, no?

jirka


> 
> >
> > https://lore.kernel.org/bpf/20240315051813.1320559-2-andrii@kernel.org/
> >
> > jirka
> >
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
> > >  1 file changed, 25 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index 4900b1ee019f..5b7fdc24b5b8 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> > >       *pprog = prog;
> > >  }
> > >
> > > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */
> > > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64);
> > > +
> > >  static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> > >  {
> > >       u8 *prog = *pprog;
> > > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> > >               pr_err("Target call %p is out of range\n", func);
> > >               return -ERANGE;
> > >       }
> > > -     EMIT1_off32(opcode, offset);
> > > +
> > > +     /* inline bpf_get_smp_processor_id() to avoid calls */
> > > +     if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) {
> > > +             /* 7 to account for the mov instruction itself,
> > > +              * as rip value *after* mov instruction is used
> > > +              */
> > > +             offset = (void *)&pcpu_hot.cpu_number - ip - 7;
> > > +             if (is_simm32(offset)) {
> > > +                     /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */
> > > +                     EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset);
> > > +             } else {
> > > +                     /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */
> > > +                     offset = (s64)(void *)&pcpu_hot.cpu_number;
> > > +                     EMIT2(0x65, 0xa1);
> > > +                     EMIT((u32)offset, 4);
> > > +                     EMIT((u64)offset >> 32, 4);
> > > +             }
> > > +             EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */
> > > +     } else {
> > > +             EMIT1_off32(opcode, offset);
> > > +     }
> > > +
> > >       *pprog = prog;
> > >       return 0;
> > >  }
> > > --
> > > 2.43.0
> > >
> > >


* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-21 22:57       ` Jiri Olsa
@ 2024-03-21 23:38         ` Andrii Nakryiko
  0 siblings, 0 replies; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-21 23:38 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, peterz, song

On Thu, Mar 21, 2024 at 3:57 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 02:09:41PM -0700, Andrii Nakryiko wrote:
> > On Thu, Mar 21, 2024 at 2:08 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 11:05:01AM -0700, Andrii Nakryiko wrote:
> > > > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's
> > > > gs segment-based addressing.
> > > >
> > > > Just to be on the safer side both rip-relative addressing is implemented
> > > > (providing a shorter instruction, but limiting offset to signed 32 bits)
> > > > and more universal absolute memory offset addressing is used as
> > > > a fallback in (unlikely) scenario that given offset doesn't fit int s32.
> > > > The latter is 5 bytes longer, and it seems compilers prefer rip-relative
> > > > instructions when compiling kernel code.
> > > >
> > > > Both instructions were tested and confirmed using gdb. We also already
> > > > have a BPF selftest (raw_tp_test_run) that validates correctness of
> > > > bpf_get_smp_processor_id(), while running target BPF program on each
> > > > online CPU.
> > > >
> > > > Here's a disassembly of bpf_get_smp_processor_id() helper:
> > > >
> > > > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id'
> > > > Dump of assembler code for function bpf_get_smp_processor_id:
> > > >    0xffffffff810fa890 <+0>:     0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> > > >    0xffffffff810fa895 <+5>:     65 8b 05 70 62 f3 7e    mov    eax,DWORD PTR gs:[rip+0x7ef36270]        # 0x30b0c <pcpu_hot+12>
> > > >    0xffffffff810fa89c <+12>:    48 98                   cdqe
> > > >    0xffffffff810fa89e <+14>:    c3                      ret
> > > > End of assembler dump.
> > > >
> > > > And here's a GDB disassembly dump of a piece of BPF program calling
> > > > bpf_get_smp_processor_id().
> > > >
> > > >   $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit'
> > > >   000000000002d840 A pcpu_hot
> > > >   ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit      [bpf]
> > > >
> > > > Then attaching GDB to the running kernel in QEMU and breaking inside BPF
> > > > program:
> > > >
> > > > (gdb) b *0xffffffffa000f8e2
> > > > Breakpoint 1 at 0xffffffffa000f8e2
> > > >
> > > > When RIP-relative instruction is used:
> > > >
> > > >   0xffffffffa000f8e2      mov    %gs:0x6001df63(%rip),%eax        # 0x2d84c <pcpu_hot+12>
> > > >   0xffffffffa000f8e9      cltq
> > > >
> > > > You can see that final address is resolved to <pcpu_hot+12> as expected.
> > > >
> > > > When absolute addressing is used:
> > > >
> > > >   0xffffffffa000f8e2      movabs %gs:0x2d84c,%eax
> > > >   0xffffffffa000f8ed      cltq
> > > >
> > > > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840),
> > > > plus 12 (0xc) bytes offset of cpu_number field.
> > > >
> > > > This inlining eliminates entire function call for this (rather trivial in terms
> > > > of instructions executed) helper, saving a bit of performance, but foremost
> > > > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for
> > > > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first
> > > > place.
> > >
> > > this should also 'fix' the k[ret]probe-multi-fast benchmark issue right?
> >
> > I already fixed it locally by switching to bpf_get_numa_node_id(), but
> > this change would generally make my original approach not work because
> > bpf_get_smp_processor_id() isn't actually called at runtime on x86-64
> > :)
>
> hm, but the reason was that program attached to bpf_get_smp_processor_id
> called bpf_get_smp_processor_id helper:
>   https://lore.kernel.org/bpf/CAEf4BzayNECKkmc4=XfLW5fzsPozMnjqOEmGO+r2UmEQXt1XyA@mail.gmail.com/
>
> inlining of bpf_get_smp_processor_id helper call would prevent that, no?

Yes, I'm agreeing with you. I'm just saying that in v2 I'm attaching
to bpf_get_numa_node_id() instead, so all this is irrelevant.

>
> jirka
>
>
> >
> > >
> > > https://lore.kernel.org/bpf/20240315051813.1320559-2-andrii@kernel.org/
> > >
> > > jirka
> > >
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
> > > >  1 file changed, 25 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > > index 4900b1ee019f..5b7fdc24b5b8 100644
> > > > --- a/arch/x86/net/bpf_jit_comp.c
> > > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> > > >       *pprog = prog;
> > > >  }
> > > >
> > > > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */
> > > > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64);
> > > > +
> > > >  static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> > > >  {
> > > >       u8 *prog = *pprog;
> > > > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> > > >               pr_err("Target call %p is out of range\n", func);
> > > >               return -ERANGE;
> > > >       }
> > > > -     EMIT1_off32(opcode, offset);
> > > > +
> > > > +     /* inline bpf_get_smp_processor_id() to avoid calls */
> > > > +     if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) {
> > > > +             /* 7 to account for the mov instruction itself,
> > > > +              * as rip value *after* mov instruction is used
> > > > +              */
> > > > +             offset = (void *)&pcpu_hot.cpu_number - ip - 7;
> > > > +             if (is_simm32(offset)) {
> > > > +                     /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */
> > > > +                     EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset);
> > > > +             } else {
> > > > +                     /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */
> > > > +                     offset = (s64)(void *)&pcpu_hot.cpu_number;
> > > > +                     EMIT2(0x65, 0xa1);
> > > > +                     EMIT((u32)offset, 4);
> > > > +                     EMIT((u64)offset >> 32, 4);
> > > > +             }
> > > > +             EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */
> > > > +     } else {
> > > > +             EMIT1_off32(opcode, offset);
> > > > +     }
> > > > +
> > > >       *pprog = prog;
> > > >       return 0;
> > > >  }
> > > > --
> > > > 2.43.0
> > > >
> > > >


* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers
  2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko
@ 2024-03-21 23:46 ` Alexei Starovoitov
  2024-03-22 16:45   ` Andrii Nakryiko
  3 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2024-03-21 23:46 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Peter Zijlstra, Song Liu

On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>
>
> There are still ways to reduce number of "wasted" records further, this is
> a problem that requires many small and rather independent steps.

I feel this is the wrong path to follow.
I think it would be better to introduce a flag for kprobe/fentry
to do perf_snapshot_branch_stack() as early as possible
and then the bpf prog can copy these 16 or 32 8-byte entries at
its leisure.
Hacking all over the kernel and requiring the bpf prog to call
bpf_get_branch_snapshot() in the first few instructions
looks like self-inflicted pain.


* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko
  2024-03-21 21:08   ` Jiri Olsa
@ 2024-03-21 23:49   ` Alexei Starovoitov
  2024-03-22 16:45     ` Andrii Nakryiko
  1 sibling, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2024-03-21 23:49 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Peter Zijlstra, Song Liu

On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's
> gs segment-based addressing.
>
> Just to be on the safer side both rip-relative addressing is implemented
> (providing a shorter instruction, but limiting offset to signed 32 bits)
> and more universal absolute memory offset addressing is used as
> a fallback in (unlikely) scenario that given offset doesn't fit int s32.
> The latter is 5 bytes longer, and it seems compilers prefer rip-relative
> instructions when compiling kernel code.
>
> Both instructions were tested and confirmed using gdb. We also already
> have a BPF selftest (raw_tp_test_run) that validates correctness of
> bpf_get_smp_processor_id(), while running target BPF program on each
> online CPU.
>
> Here's a disassembly of bpf_get_smp_processor_id() helper:
>
> $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id'
> Dump of assembler code for function bpf_get_smp_processor_id:
>    0xffffffff810fa890 <+0>:     0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
>    0xffffffff810fa895 <+5>:     65 8b 05 70 62 f3 7e    mov    eax,DWORD PTR gs:[rip+0x7ef36270]        # 0x30b0c <pcpu_hot+12>
>    0xffffffff810fa89c <+12>:    48 98                   cdqe
>    0xffffffff810fa89e <+14>:    c3                      ret
> End of assembler dump.
>
> And here's a GDB disassembly dump of a piece of BPF program calling
> bpf_get_smp_processor_id().
>
>   $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit'
>   000000000002d840 A pcpu_hot
>   ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit      [bpf]
>
> Then attaching GDB to the running kernel in QEMU and breaking inside BPF
> program:
>
> (gdb) b *0xffffffffa000f8e2
> Breakpoint 1 at 0xffffffffa000f8e2
>
> When RIP-relative instruction is used:
>
>   0xffffffffa000f8e2      mov    %gs:0x6001df63(%rip),%eax        # 0x2d84c <pcpu_hot+12>
>   0xffffffffa000f8e9      cltq
>
> You can see that final address is resolved to <pcpu_hot+12> as expected.
>
> When absolute addressing is used:
>
>   0xffffffffa000f8e2      movabs %gs:0x2d84c,%eax
>   0xffffffffa000f8ed      cltq
>
> And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840),
> plus 12 (0xc) bytes offset of cpu_number field.
>
> This inlining eliminates entire function call for this (rather trivial in terms
> of instructions executed) helper, saving a bit of performance, but foremost
> saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for
> PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first
> place.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 4900b1ee019f..5b7fdc24b5b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>         *pprog = prog;
>  }
>
> +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */
> +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64);
> +
>  static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
>  {
>         u8 *prog = *pprog;
> @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
>                 pr_err("Target call %p is out of range\n", func);
>                 return -ERANGE;
>         }
> -       EMIT1_off32(opcode, offset);
> +
> +       /* inline bpf_get_smp_processor_id() to avoid calls */
> +       if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) {
> +               /* 7 to account for the mov instruction itself,
> +                * as rip value *after* mov instruction is used
> +                */
> +               offset = (void *)&pcpu_hot.cpu_number - ip - 7;
> +               if (is_simm32(offset)) {
> +                       /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */
> +                       EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset);
> +               } else {
> +                       /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */
> +                       offset = (s64)(void *)&pcpu_hot.cpu_number;
> +                       EMIT2(0x65, 0xa1);
> +                       EMIT((u32)offset, 4);
> +                       EMIT((u64)offset >> 32, 4);
> +               }
> +               EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */

Please introduce new pseudo insn that can access per-cpu vars
in a generic way instead of hacking a specific case.
Then we can use it in map_gen_lookup in percpu array and hash maps
improving their performance and in lots of other places.

pw-bot: cr

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers
  2024-03-21 23:46 ` [PATCH bpf-next 0/3] Inline two LBR-related helpers Alexei Starovoitov
@ 2024-03-22 16:45   ` Andrii Nakryiko
  2024-03-25  2:05     ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-22 16:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> >
> > There are still ways to reduce number of "wasted" records further, this is
> > a problem that requires many small and rather independent steps.
>
> I feel this is a wrong path to follow.
> I think it would be better to introduce a flag for kprobe/fentry
> to do perf_snapshot_branch_stack() as early as possible
> and then bpf prog can copy these 16 or 32 8-byte entries at
> its leisure.

This is basically how Song started when he was adding this feature a
few years ago. And I feel like we discussed this and decided that it
would be cleaner to let the BPF program decide when (and whether) to
get LBR, based on conditions. It still feels like a right tradeoff.

Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and
that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this
doesn't have to happen so fast. BPF program can evaluate some
conditions and grab LBR optionally, saving the overhead.
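
Just to make the "conditionally grab LBR" pattern concrete, here is a rough
sketch of what such a BPF program can do today (attach point, map and
program names below are made up for illustration, not taken from retsnoop):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define MAX_LBR_ENTRIES 32

  struct lbr_buf {
          struct perf_branch_entry entries[MAX_LBR_ENTRIES];
  };

  /* per-CPU scratch space for captured LBR records */
  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, u32);
          __type(value, struct lbr_buf);
  } lbr_scratch SEC(".maps");

  SEC("kretprobe/__sys_bpf")
  int BPF_KRETPROBE(grab_lbr_on_error, long ret)
  {
          struct lbr_buf *buf;
          u32 zero = 0;
          long sz;

          if (ret >= 0)
                  return 0; /* not interesting, skip LBR capture entirely */

          buf = bpf_map_lookup_elem(&lbr_scratch, &zero);
          if (!buf)
                  return 0;

          /* once we decided we need LBR, grab it as the very first thing */
          sz = bpf_get_branch_snapshot(buf->entries, sizeof(buf->entries), 0);
          if (sz <= 0)
                  return 0;

          /* ... post-process or send the records to user space ... */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";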

With prog flag saying "kernel should capture LBR ASAP", we:
  a) lose this flexibility to decide whether and when to grab LBR;
  b) pay overhead regardless if LBR is ever actually used for any
given prog invocation;
  c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU
buffer for something that is pretty niche (though hugely valuable
when needed, of course);
  d) each program type that supports bpf_get_branch_snapshot() helper
needs to implement this logic in their corresponding
`bpf_prog_run_xxx()` running helpers, which is more than a few places.

Now, let's see how much we can also realistically save with this approach.

For fentry, we do save a few (2) entries, indeed. With changes in this
patch we are at:

  [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
  [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
  [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0

With flag and kernel support, we'll be at something like

  [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#02] __bpf_prog_enter_recur+0x43              ->  intel_pmu_snapshot_branch_stack+0x0

So we get 2 extra LBRs at the expense of all those downsides I mentioned above.

But for kretprobe-multi it's even worse (just 1). With changes in this
patch set, we are at:

  [#10] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#09] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#08] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#07] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#06] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#05] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#04] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#03] kprobe_multi_link_exit_handler+0x31      ->  migrate_disable+0x0
  [#02] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36
  [#01] kprobe_multi_link_exit_handler+0x5c      ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x0
  [#00] bpf_prog_2b455b4f8a8d48c5_kexit+0xa3     ->  intel_pmu_snapshot_branch_stack+0x0

With custom flag support:

  [#10] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#09] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#08] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#07] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#06] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#05] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#04] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#03] kprobe_multi_link_exit_handler+0x31      ->  migrate_disable+0x0
  [#02] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36
  [#01] kprobe_multi_link_exit_handler+0x5c      ->  intel_pmu_snapshot_branch_stack+0x0

We save just 1 extra LBR record.

For PERF_SAMPLE_BRANCH_ANY_RETURN mode there will be no savings
at all. Is it really worth it?

Any other improvements (like somehow flattening the rethook call path)
will benefit both approaches equally.

> Hacking all over the kernel and requiring bpf prog to call
> bpf_get_branch_snapshot() in the first few instructions
> looks like self inflicted pain.

While inlining bpf_get_branch_snapshot() does benefit only LBR use
case, it's a rather typical BPF helper inlining procedure we do for a
lot of helpers, so it's not exactly a hack or anything, just an
optimization.

But inlining bpf_get_smp_processor_id() goes way beyond LBR: it's a
pretty frequently used helper for implementing various
BPF-program-specific per-CPU usages (recursion protection, temporary
storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU with global
variables, which is already a bit faster and now will be even
faster). And the implementation is well-contained.
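
To illustrate what I mean (a rough sketch, not from any of these patches;
the attach point and array size are made up), this is the kind of per-CPU
recursion protection pattern that gets cheaper with the inlined helper:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define MAX_CPUS 256 /* illustrative upper bound */

  /* plain global (.bss) array instead of a BPF_MAP_TYPE_PERCPU_ARRAY map */
  static int busy[MAX_CPUS];

  SEC("kprobe/__sys_bpf") /* attach point is just an example */
  int BPF_KPROBE(protected_prog)
  {
          u32 cpu = bpf_get_smp_processor_id();

          if (cpu >= MAX_CPUS)
                  return 0;
          if (busy[cpu])
                  return 0; /* re-entered on this CPU, bail out */
          busy[cpu] = 1;

          /* ... main program logic ... */

          busy[cpu] = 0;
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";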

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-21 23:49   ` Alexei Starovoitov
@ 2024-03-22 16:45     ` Andrii Nakryiko
  2024-03-25  3:28       ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-22 16:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Thu, Mar 21, 2024 at 4:49 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's
> > gs segment-based addressing.
> >
> > Just to be on the safer side both rip-relative addressing is implemented
> > (providing a shorter instruction, but limiting offset to signed 32 bits)
> > and more universal absolute memory offset addressing is used as
> > a fallback in the (unlikely) scenario that the given offset doesn't fit in s32.
> > The latter is 5 bytes longer, and it seems compilers prefer rip-relative
> > instructions when compiling kernel code.
> >
> > Both instructions were tested and confirmed using gdb. We also already
> > have a BPF selftest (raw_tp_test_run) that validates correctness of
> > bpf_get_smp_processor_id(), while running target BPF program on each
> > online CPU.
> >
> > Here's a disassembly of bpf_get_smp_processor_id() helper:
> >
> > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id'
> > Dump of assembler code for function bpf_get_smp_processor_id:
> >    0xffffffff810fa890 <+0>:     0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> >    0xffffffff810fa895 <+5>:     65 8b 05 70 62 f3 7e    mov    eax,DWORD PTR gs:[rip+0x7ef36270]        # 0x30b0c <pcpu_hot+12>
> >    0xffffffff810fa89c <+12>:    48 98                   cdqe
> >    0xffffffff810fa89e <+14>:    c3                      ret
> > End of assembler dump.
> >
> > And here's a GDB disassembly dump of a piece of BPF program calling
> > bpf_get_smp_processor_id().
> >
> >   $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit'
> >   000000000002d840 A pcpu_hot
> >   ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit      [bpf]
> >
> > Then attaching GDB to the running kernel in QEMU and breaking inside BPF
> > program:
> >
> > (gdb) b *0xffffffffa000f8e2
> > Breakpoint 1 at 0xffffffffa000f8e2
> >
> > When RIP-relative instruction is used:
> >
> >   0xffffffffa000f8e2      mov    %gs:0x6001df63(%rip),%eax        # 0x2d84c <pcpu_hot+12>
> >   0xffffffffa000f8e9      cltq
> >
> > You can see that final address is resolved to <pcpu_hot+12> as expected.
> >
> > When absolute addressing is used:
> >
> >   0xffffffffa000f8e2      movabs %gs:0x2d84c,%eax
> >   0xffffffffa000f8ed      cltq
> >
> > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840),
> > plus 12 (0xc) bytes offset of cpu_number field.
> >
> > This inlining eliminates entire function call for this (rather trivial in terms
> > of instructions executed) helper, saving a bit of performance, but foremost
> > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for
> > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first
> > place.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index 4900b1ee019f..5b7fdc24b5b8 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >         *pprog = prog;
> >  }
> >
> > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */
> > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64);
> > +
> >  static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> >  {
> >         u8 *prog = *pprog;
> > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
> >                 pr_err("Target call %p is out of range\n", func);
> >                 return -ERANGE;
> >         }
> > -       EMIT1_off32(opcode, offset);
> > +
> > +       /* inline bpf_get_smp_processor_id() to avoid calls */
> > +       if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) {
> > +               /* 7 to account for the mov instruction itself,
> > +                * as rip value *after* mov instruction is used
> > +                */
> > +               offset = (void *)&pcpu_hot.cpu_number - ip - 7;
> > +               if (is_simm32(offset)) {
> > +                       /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */
> > +                       EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset);
> > +               } else {
> > +                       /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */
> > +                       offset = (s64)(void *)&pcpu_hot.cpu_number;
> > +                       EMIT2(0x65, 0xa1);
> > +                       EMIT((u32)offset, 4);
> > +                       EMIT((u64)offset >> 32, 4);
> > +               }
> > +               EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */
>
> Please introduce new pseudo insn that can access per-cpu vars
> in a generic way instead of hacking a specific case.

Sure, but do they have to be mutually exclusive? One doesn't prevent
the other. Having bpf_get_smp_processor_id() inlined benefits tons of
existing BPF applications transparently, which sounds like a win to
me.

Designing and adding new instruction also sounds fine, but it's a
separate feature and is more involved and will require careful
considerations. E.g., we'll need to think through whether all JITs
will be able to implement it in native code (i.e., whether they will
have enough free registers to implement this efficiently; x86-64 is
lucky to not need any extra registers, I believe ARM64 needs one extra
register, though; other architectures I have no idea). Safety
considerations as well (do we accept any random offset, or only ones
coming from per-CPU BTF variables, etc). I don't know if there are any
differences, but per-CPU access for module per-CPU variables is
something to look into (I have no clue).

And even once we have this instruction and corresponding compiler
support, it will still take a while until applications can assume its
availability, so that adds to logistics.

Also, even with this instruction, getting CPU ID efficiently is still
important for cases when a BPF application needs its own per-CPU
"storage" solution. I'm not sure it's a good experience to require
every user to figure out the pcpu_hot.cpu_number thing by themselves
through a few layers of kernel macros.

As an interim compromise and solution, would you like me to implement
a similar inlining for bpf_this_cpu_ptr() as well? It's just as
trivial to do as for bpf_get_smp_processor_id(), and would benefit all
existing users of bpf_this_cpu_ptr() as well, while relying on
existing BTF information and surrounding infrastructure?
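
For reference, the existing bpf_this_cpu_ptr() usage that would get faster
looks roughly like this (a sketch modeled on the per-CPU ksym pattern from
selftests, assuming the runqueues per-CPU variable is visible in vmlinux BTF):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* per-CPU kernel variable exposed as a typed ksym via BTF */
  extern const struct rq runqueues __ksym;

  SEC("raw_tp/sys_enter")
  int read_this_cpu_rq(void *ctx)
  {
          struct rq *rq = (struct rq *)bpf_this_cpu_ptr(&runqueues);

          /* with inlining, the helper call above would become a couple of
           * gs-prefixed instructions instead of an actual function call
           */
          bpf_printk("cpu %d", rq->cpu);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";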

What's the worst outcome? If the kernel changes how CPU number is
defined (not pcpu_hot.cpu_number) and we can't easily adapt JIT logic,
we can just stop doing inlining and we'll lose a bit of performance
and function call avoidance. Bad, but not API breaking or anything
like that. And we will detect when things change, we have a test that
checks this logic for each CPU, making sure we get the right one.



> Then we can use it in map_gen_lookup in percpu array and hash maps
> improving their performance and in lots of other places.
>
> pw-bot: cr

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers
  2024-03-22 16:45   ` Andrii Nakryiko
@ 2024-03-25  2:05     ` Alexei Starovoitov
  2024-03-25 17:20       ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2024-03-25  2:05 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > >
> > > There are still ways to reduce number of "wasted" records further, this is
> > > a problem that requires many small and rather independent steps.
> >
> > I feel this is a wrong path to follow.
> > I think it would be better to introduce a flag for kprobe/fentry
> > to do perf_snapshot_branch_stack() as early as possible
> > and then bpf prog can copy these 16 or 32 8-byte entries at
> its leisure.
>
> This is basically how Song started when he was adding this feature a
> few years ago. And I feel like we discussed this and decided that it
> would be cleaner to let the BPF program decide when (and whether) to
> get LBR, based on conditions. It still feels like a right tradeoff.

Right we discussed it back then and at that time it was about
collecting stacks.
What's different now is you want to collect all types of branches
in retsnoop including plain 'jmp pc+10' and conditional jmps.
This is not something that C code can control.
always_inline in C and inline by the verifier reduce call frames,
but they may have both positive and negative effect when
all branches are collected.
Hence __always_inline in kprobe_multi_link_prog_run()
is a leap of faith with assumptions that compiler won't
add jmps before calling into prog,
but lots of different compiler flags add instrumentation:
kasan, stack protector, security mitigation that count call depth, etc.


> Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and
> that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this
> doesn't have to happen so fast. BPF program can evaluate some
> conditions and grab LBR optionally, saving the overhead.
>
> With prog flag saying "kernel should capture LBR ASAP", we:

I was suggesting to use per attachment flag.
And kprobe is a lost cause.
I would do it for fentry only where
we can generate 'save LBR' call first thing in the bpf trampoline.

>   a) lose this flexibility to decide whether and when to grab LBR;
>   b) pay overhead regardless if LBR is ever actually used for any
> given prog invocation;

when retsnoop attaches a prog the prog gotta call that 'save LBR'
as soon as possible without any branches.
So per-attach flag is not really a downside.

>   c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU
> buffers for something that is pretty niche (though hugely valuable
> when needed, of course);

I wouldn't worry about such a tiny buffer.

>   d) each program type that supports bpf_get_branch_snapshot() helper
> needs to implement this logic in their corresponding
> `bpf_prog_run_xxx()` running helpers, which is more than a few places.

I think new kfunc that copies from the buffer will do.
Nothing needs to change.
Maybe bpf_get_branch_snapshot() can be made smart too,
but that is optional.

> Now, let's see how much we can also realistically save with this approach.
>
> For fentry, we do save a few (2) entries, indeed. With changes in this
> patch we are at:
>
>   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
>   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
>   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
>   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
>   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
>   [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
>   [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
>   [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0
>
> With flag and kernel support, we'll be at something like
>
>   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
>   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
>   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
>   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
>   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
>   [#02] __bpf_prog_enter_recur+0x43              ->  intel_pmu_snapshot_branch_stack+0x0

with the flag, migrate_disable and prog_enter* will be gone.
It will be only bpf_trampoline_ and intel_pmu_snapshot_branch_stack.
If we try hard we can inline
wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
Then it will be bpf_trampoline_ only and
that's as minimal as it can get. One entry.
Hacking all over the kernel with inline won't get anywhere close.

> For PERF_SAMPLE_BRANCH_ANY_RETURN return mode there will be no savings
> at all. Is it really worth it?

any_return is ok today.
Especially with fentry.

> While inlining bpf_get_branch_snapshot() does benefit only LBR use
> case, it's a rather typical BPF helper inlining procedure we do for a
> lot of helpers, so it's not exactly a hack or anything, just an
> optimization.

Inlining bpf_get_branch_snapshot() may be ok,
but the way you're doing it is not a clear win.
Replacing size / 24 (that the compiler optimizes into a mul)
with an actual divide and adding an extra mul will be slower.
div+mul vs mul is quite a difference.
How noticeable is that is questionable,
but from an inlining perspective it doesn't feel right to do
"inline to avoid extra frame" instead of
"inline to improve performance".

> But inlining bpf_get_smp_processor_id() goes way beyond LBR, it's a
> pretty frequently used helper used to implement various
> BPF-program-specific per-CPU usages (recursion protection, temporary
> storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU with global
> variables, which is already a bit faster approach, and now will be
> even faster). And the implementation is well-contained.

Agree that inlining bpf_get_smp_processor_id() is a good thing,
but please do it cleanly so that per-cpu accessors can be
reused in other places.
I'll reply with details in the other thread.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-22 16:45     ` Andrii Nakryiko
@ 2024-03-25  3:28       ` Alexei Starovoitov
  2024-03-25 17:01         ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2024-03-25  3:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> Designing and adding new instruction also sounds fine, but it's a
> separate feature and is more involved and will require careful
> considerations. E.g., we'll need to think through whether all JITs
> will be able to implement it in native code (i.e., whether they will
> have enough free registers to implement this efficiently; x86-64 is
> lucky to not need any extra registers, I believe ARM64 needs one extra
> register, though; other architectures I have no idea). Safety
> considerations as well (do we accept any random offset, or only ones
> coming from per-CPU BTF variables, etc). I don't know if there are any
> differences, but per-CPU access for module per-CPU variables is
> something to look into (I have no clue).

I didn't mean to add an insn that is exposed all the way to
bpf prog and requires verification, addr checks, etc.
But a pseudo insn that the verifier can emit when it inlines
various helpers.
It's ok to add them iteratively and revise as we go, since it
won't be an api.
Like per_cpu_ptr() equivalent insn doesn't need to be added right away.
this_cpu_ptr(offset) will be good enough to start.
JIT-ing that single insn will be trivial and
you implemented it already:
     EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); else ...

The verifier will emit BPF_THIS_CPU_PTR pseudo insn with
pcpu_hot.cpu_number offset when it inlines bpf_get_smp_processor_id().

At the same time it can inline bpf_this_cpu_ptr() with the same insn.

And we can finally implement map_ops->map_gen_lookup for
per-cpu array and hash map using the same pseudo insn.
Those lookups are often used in critical path and
the inlining will give a noticeable performance boost.
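
Roughly (just a sketch; BPF_THIS_CPU_PTR() below is a made-up macro for the
proposed pseudo insn, not an existing one), the bpf_get_smp_processor_id()
fixup in the verifier could look like:

  if (insn->imm == BPF_FUNC_get_smp_processor_id) {
          /* per-cpu symbol addresses are small offsets into the per-cpu
           * area, so the address fits into a 32-bit immediate
           */
          insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0,
                                      (u32)(unsigned long)&pcpu_hot.cpu_number);
          /* R0 = this_cpu_ptr(R0), the new pseudo insn */
          insn_buf[1] = BPF_THIS_CPU_PTR(BPF_REG_0);
          /* R0 = pcpu_hot.cpu_number for this CPU */
          insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
          cnt = 3;

          new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
          if (!new_prog)
                  return -ENOMEM;

          delta    += cnt - 1;
          env->prog = prog = new_prog;
          insn      = new_prog->insnsi + i + delta;
          continue;
  }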

> As an interim compromise and solution, would you like me to implement
> a similar inlining for bpf_this_cpu_ptr() as well? It's just as
> trivial to do as for bpf_get_smp_processor_id(), and would benefit all
> existing users of bpf_this_cpu_ptr() as well, while relying on
> existing BTF information and surrounding infrastructure?

yes via pseudo insn.

after per-cpu maps, various bpf_*redirect*() helpers
can be inlined and XDP folks will enjoy extra performance,
and likely other cases where we used different solutions
instead of emitting per-cpu accesses in the verifier.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64
  2024-03-25  3:28       ` Alexei Starovoitov
@ 2024-03-25 17:01         ` Andrii Nakryiko
  0 siblings, 0 replies; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-25 17:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Sun, Mar 24, 2024 at 8:28 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > Designing and adding new instruction also sounds fine, but it's a
> > separate feature and is more involved and will require careful
> > considerations. E.g., we'll need to think through whether all JITs
> > will be able to implement it in native code (i.e., whether they will
> > have enough free registers to implement this efficiently; x86-64 is
> > lucky to not need any extra registers, I believe ARM64 needs one extra
> > register, though; other architectures I have no idea). Safety
> > considerations as well (do we accept any random offset, or only ones
> > coming from per-CPU BTF variables, etc). I don't know if there are any
> > differences, but per-CPU access for module per-CPU variables is
> > something to look into (I have no clue).
>
> I didn't mean to add an insn that is exposed all the way to
> bpf prog and requires verification, addr checks, etc.
> But a pseudo insn that the verifier can emit when it inlines
> various helpers.
> It's ok to add them iteratively and revise as we go, since it
> won't be an api.

Ah, ok, so an instruction that JIT will know about, but the verifier
will reject if provided by the user? Ok, I can do that.

> Like per_cpu_ptr() equivalent insn doesn't need to be added right away.
> this_cpu_ptr(offset) will be good enough to start.
> JIT-ing that single insn will be trivial and
> you implemented it already:
>      EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); else ...
>
> The verifier will emit BPF_THIS_CPU_PTR pseudo insn with
> pcpu_hot.cpu_number offset when it inlines bpf_get_smp_processor_id().
>
> At the same time it can inline bpf_this_cpu_ptr() with the same insn.
>

sounds good, we'll need to guard all that with an arch-specific check of
whether the JIT supports this instruction, but that's ok (eventually we
should be able to make this insn support mandatory and clean up those
checks)
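
Something like a weak capability helper in the core that the x86-64 JIT
overrides once it handles the pseudo insn would do (the name below is made up):

  /* kernel/bpf/core.c: default for JITs that don't know the pseudo insn */
  bool __weak bpf_jit_supports_percpu_insn(void)
  {
          return false;
  }

  /* arch/x86/net/bpf_jit_comp.c: x86-64 JIT opts in */
  bool bpf_jit_supports_percpu_insn(void)
  {
          return true;
  }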

> And we can finally implement map_ops->map_gen_lookup for
> per-cpu array and hash map using the same pseudo insn.
> Those lookups are often used in critical path and
> the inlining will give a noticeable performance boost.

yep, I'll look into that as well

>
> > As an interim compromise and solution, would you like me to implement
> > a similar inlining for bpf_this_cpu_ptr() as well? It's just as
> > trivial to do as for bpf_get_smp_processor_id(), and would benefit all
> > existing users of bpf_this_cpu_ptr() as well, while relying on
> > existing BTF information and surrounding infrastructure?
>
> yes via pseudo insn.
>
> after per-cpu maps, various bpf_*redirect*() helpers
> can be inlined and XDP folks will enjoy extra performance,
> and likely other cases where we used different solutions
> instead of emitting per-cpu accesses in the verifier.

that part I'll leave up to XDP folks :)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers
  2024-03-25  2:05     ` Alexei Starovoitov
@ 2024-03-25 17:20       ` Andrii Nakryiko
  2024-03-26  3:13         ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-25 17:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > >
> > > > There are still ways to reduce number of "wasted" records further, this is
> > > > a problem that requires many small and rather independent steps.
> > >
> > > I feel this is a wrong path to follow.
> > > I think it would be better to introduce a flag for kprobe/fentry
> > > to do perf_snapshot_branch_stack() as early as possible
> > > and then bpf prog can copy these 16 or 32 8-byte entries at
> > its leisure.
> >
> > This is basically how Song started when he was adding this feature a
> > few years ago. And I feel like we discussed this and decided that it
> > would be cleaner to let the BPF program decide when (and whether) to
> > get LBR, based on conditions. It still feels like a right tradeoff.
>
> Right we discussed it back then and at that time it was about
> collecting stacks.
> What's different now is you want to collect all types of branches

I was using --lbr=any from the get go (and it actually was a
motivation for the entire feature, because we were lacking visibility
inside some large function with lots of conditions), but yes, we had
to live with a bunch of entries wasted, which on Intel CPUs with 32
entries was tolerable, but on AMD now is useless (we get only 1-2
useful entries right now).

> in retsnoop including plain 'jmp pc+10' and conditional jmps.
> This is not something that C code can control.
> always_inline in C and inline by the verifier reduce call frames,
> but they may have both positive and negative effect when
> all branches are collected.
> Hence __always_inline in kprobe_multi_link_prog_run()
> is a leap of faith with assumptions that compiler won't
> add jmps before calling into prog,
> but lots of different compiler flags add instrumentation:
> kasan, stack protector, security mitigation that count call depth, etc.
>

I understand that, but at least for now in practice it does help. I
have some more changes in fprobe/ftrace space which reduce waste of
LBR entries some more (and would be beneficial regardless of this
custom flag support we are discussing), and there is some really good
news with aggressive inlining. a) I get only 4 entries wasted for
multi-kprobe (7 for fentry, still not bad, but this one is harder to
optimize) and b) I get +25% speed up for multi-kprobes, which seems
like a nice side benefit.

So I agree that none of this is any guarantee, but it also is not some
binding UAPI, so seems worth doing. And as I pointed above, I don't
think I see any regression in performance, rather the opposite.

>
> > Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and
> > that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this
> > doesn't have to happen so fast. BPF program can evaluate some
> > conditions and grab LBR optionally, saving the overhead.
> >
> > With prog flag saying "kernel should capture LBR ASAP", we:
>
> I was suggesting to use per attachment flag.
> And kprobe is a lost cause.
> I would do it for fentry only where
> we can generate 'save LBR' call first thing in the bpf trampoline.
>

See above, I get down to just 4 unavoidable LBR entries wasted with
multi-kprobe, all without a custom flag anywhere.

I just really don't think it's worth it to complicate the trampoline just
for this, we'll save at most 1-2 LBR entries, inlining
bpf_get_branch_snapshot() gets all basically the same benefit, but
across all supported program types.


> >   a) lose this flexibility to decide whether and when to grab LBR;
> >   b) pay overhead regardless if LBR is ever actually used for any
> > given prog invocation;
>
> when retsnoop attaches a prog the prog gotta call that 'save LBR'
> as soon as possible without any branches.
> So per-attach flag is not really a downside.

for retsnoop, yes, but only if this is supported in multi-kprobe,
which is the main mode. But see above, I just don't think we have to
do this to get almost all the benefit. I just need to inline
bpf_get_branch_snapshot().

>
> >   c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU
> > buffers for something that is pretty niche (though hugely valuable
> > when needed, of course);
>
> I wouldn't worry about such a tiny buffer.
>
> >   d) each program type that supports bpf_get_branch_snapshot() helper
> > needs to implement this logic in their corresponding
> > `bpf_prog_run_xxx()` running helpers, which is more than a few places.
>
> I think new kfunc that copies from the buffer will do.
> Nothing needs to change.
> Maybe bpf_get_branch_snapshot() can be made smart too,
> but that is optional.
>

I meant that fentry would need to implement this LBR capture in BPF
trampoline, multi-kprobe in its kprobe_multi_link_prog_run, kprobe in
still another helper. And so on, we have many targeted "runner"
helpers for specific program types.

And just implementing this for fentry/fexit is not very useful.

> > Now, let's see how much we can also realistically save with this approach.
> >
> > For fentry, we do save a few (2) entries, indeed. With changes in this
> > patch we are at:
> >
> >   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
> >   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
> >   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
> >   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
> >   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
> >   [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
> >   [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
> >   [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0
> >
> > With flag and kernel support, we'll be at something like
> >
> >   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
> >   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
> >   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
> >   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
> >   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
> >   [#02] __bpf_prog_enter_recur+0x43              ->  intel_pmu_snapshot_branch_stack+0x0
>
> with the flag, migrate_disable and prog_enter* will be gone.

I don't think we can get rid of migrate_disable, we need to make sure
we are freezing LBR on the CPU on which BPF program will run. So it's
either preempt_disable or migrate_disable.

Yes, __bpf_prog_enter_recur() won't be there if we code-generate code
for BPF trampoline (though, ugh, who wants more code generation than
necessary, but that's an aside). But then see above, migrate_disable
will have to be called before __bpf_prog_enter_recur(), which is just
more opaque code generation than necessary.

> It will be only bpf_trampoline_ and intel_pmu_snapshot_branch_stack.

Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is
implemented, we call into original function and it returns into
trampoline. So this seems unavoidable as well without completely
changing how trampoline works for fexit. Multi-kprobe actually,
conveniently, avoids this problem.

> If we try hard we can inline
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> Then it will be bpf_trampoline_ only and
> that's as minimal as it can get. One entry.
> Hacking all over the kernel with inline won't get anywhere close.

It's not like I'm doing something technically wrong, just enabling
more inlining. And it's not really all over the kernel, few targeted
places that deal with LBR and BPF programs running.

>
> > For PERF_SAMPLE_BRANCH_ANY_RETURN return mode there will be no savings
> > at all. Is it really worth it?
>
> any_return is ok today.
> Especially with fentry.

Yes, even today we are at 2-3 entries, I'm not too worried about this
in general.

>
> > While inlining bpf_get_branch_snapshot() does benefit only LBR use
> > case, it's a rather typical BPF helper inlining procedure we do for a
> > lot of helpers, so it's not exactly a hack or anything, just an
> > optimization.
>
> Inlining bpf_get_branch_snapshot() may be ok,
> but the way you're doing is not a clear win.
> Replacing size / 24 (that compiler optimize into mul)
> with actual divide and adding extra mul will be slower.
> div+mul vs mul is quite a difference.
> How noticeable is that is questionable,
> but from inlining perspective it doesn't feel right to do
> "inline to avoid extra frame" instead of
> "inline to improve performance".

Yes, I saw this division-through-multiplication trick. I can
replicate that as well, but it will be pretty hard to understand, so I
thought it was probably not worth it. Note that
bpf_get_branch_snapshot() is not some sort of performance-critical
helper; if you are calling it on some super frequent kprobe/fentry,
you are paying a big price just for copying 700+ bytes (and
probably a bunch of other stuff).

So I think it's a wrong tradeoff to optimize for performance.
bpf_get_branch_snapshot() is about information (most complete LBR),
and that's why the inlining. I know it's a bit unconventional compared
to other inlining cases, but it's still valid objection, no?

>
> > But inlining bpf_get_smp_processor_id() goes way beyond LBR, it's a
> > pretty frequently used helper used to implement various
> > BPF-program-specific per-CPU usages (recursion protection, temporary
> > storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU with global
> > variables, which is already a bit faster approach, and now will be
> > even faster). And the implementation is well-contained.
>
> Agree that inlining bpf_get_smp_processor_id() is a good thing,
> but please do it cleanly so that per-cpu accessors can be
> reused in other places.
> I'll reply with details in the other thread.

Agreed, internal special instruction makes sense, replied on that patch as well.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers
  2024-03-25 17:20       ` Andrii Nakryiko
@ 2024-03-26  3:13         ` Alexei Starovoitov
  2024-03-26 16:50           ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2024-03-26  3:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > > > >
> > > > > There are still ways to reduce number of "wasted" records further, this is
> > > > > a problem that requires many small and rather independent steps.
> > > >
> > > > I feel this is a wrong path to follow.
> > > > I think it would be better to introduce a flag for kprobe/fentry
> > > > to do perf_snapshot_branch_stack() as early as possible
> > > > and then bpf prog can copy these 16 or 32 8-byte entries at
> > > its leisure.
> > >
> > > This is basically how Song started when he was adding this feature a
> > > few years ago. And I feel like we discussed this and decided that it
> > > would be cleaner to let the BPF program decide when (and whether) to
> > > get LBR, based on conditions. It still feels like a right tradeoff.
> >
> > Right we discussed it back then and at that time it was about
> > collecting stacks.
> > What's different now is you want to collect all types of branches
>
> I was using --lbr=any

and that's probably something to fix.
I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
To figure out the path to return in the code
ARCH_LBR_JCC      |
ARCH_LBR_REL_CALL |
ARCH_LBR_IND_CALL |
ARCH_LBR_RETURN

might be good enough and there won't be a need to do
inlining in odd places just to avoid tail jmp.

> do this to get almost all the benefit. I just need to inline
> bpf_get_branch_snapshot().

If that is the only one that needs inlining then fine,
but I really don't like to always_inline
kprobe_multi_link_prog_run().
A day goes by and somebody will send a patch
to save 500 bytes of kernel .text by removing always_inline.
The argument that it's there to help a user space tool that
wants to do lbr=all instead of excluding rel_jmp
won't look good.

>
> I don't think we can get rid of migrate_disable, we need to make sure
> we are freezing LBR on the CPU on which BPF program will run. So it's
> either preempt_disable or migrate_disable.

we cannot extend preempt disable across the prog
and migrate_disable won't really help, since
there could be another prog on the same cpu
doing the same "save lbr" action in a different hook
that will trash per-cpu scratch space.

But we don't need to either migrate_disable or preempt_disable.
We can have a 32*24 byte buffer per attach point.
In case of fentry it can be in bpf_trampoline or in bpf_link
(I didn't analyze pros/cons too far) and
fentry will only do the single "call intel_pmu_snapshot_branch_stack"
with that address.
That's a trivial addition to arch_prepare_bpf_trampoline.

Then bpf prog will take entries from link, since it has access to it.

Same thing for kprobes. As soon as it's triggered it will
call intel_pmu_snapshot_branch_stack.
Should be simple to add.
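
Very roughly (a sketch only, struct and function names are made up), the
per-attach capture could be as small as:

  #include <linux/perf_event.h>
  #include <linux/static_call.h>

  #define LBR_BUF_ENTRIES 32

  /* buffer owned by the attachment (link), not per-cpu */
  struct lbr_attach_buf {
          struct perf_branch_entry entries[LBR_BUF_ENTRIES]; /* 32 * 24 bytes */
          unsigned int cnt;
  };

  /* called first thing from the trampoline / kprobe handler for links
   * that asked for LBR capture, before any other branches are taken
   */
  static void snapshot_lbr_for_link(struct lbr_attach_buf *buf)
  {
          buf->cnt = static_call(perf_snapshot_branch_stack)(buf->entries,
                                                             LBR_BUF_ENTRIES);
  }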

Recursion can overwrite that per-attach buffer, but
lbr is screwed anyway if we recursed. So not a concern.

> Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is
> implemented, we call into original function and it returns into
> trampoline. So this seems unavoidable as well without completely
> changing how trampoline works for fexit. Multi-kprobe actually,
> conveniently, avoids this problem.

Definitely do not want to redesign that to help retsnoop save an lbr entry.

> > Inlining bpf_get_branch_snapshot() may be ok,
> > but the way you're doing is not a clear win.
> > Replacing size / 24 (that compiler optimize into mul)
> > with actual divide and adding extra mul will be slower.
> > div+mul vs mul is quite a difference.
> > > How noticeable is that is questionable,
> > but from inlining perspective it doesn't feel right to do
> > "inline to avoid extra frame" instead of
> > "inline to improve performance".
>
> Yes, I saw this division-through-multiplication division. I can
> replicate that as well, but it will be pretty hard to understand, so I
> thought it is probably not worth it. Note that
> bpf_get_branch_snapshot() is not some sort of performance-critical
> helper, if you are calling it on some super frequent kprobe/fentry,
> you are paying a lot of price just for copying 700+ bytes (and
> probably a bunch of other stuff).

div is the slowest instruction.
On skylake it takes 57 uops and 40-90 cycles while mul takes 3 cycles.
An L1 cache hit is 1-2 cycles.
So it might be faster to copy 768 bytes than do a single div.

I still think that adding call intel_pmu_snapshot_branch_stack
to fentry and kprobes is a simpler and cleaner solution that eliminates
all guess work of compiler inlining and optimizations.

We can potentially optimize it further.
Since arch_prepare_bpf_trampoline() is arch specific,
for x86 we can inline:
        local_irq_save(flags);
        __intel_pmu_disable_all(false);
        __intel_pmu_lbr_disable();
into generated trampoline (since above is just 5-6 instructions)
and call into __intel_pmu_snapshot_branch_stack.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers
  2024-03-26  3:13         ` Alexei Starovoitov
@ 2024-03-26 16:50           ` Andrii Nakryiko
  2024-03-27 21:59             ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-26 16:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Mon, Mar 25, 2024 at 8:13 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > >
> > > > > >
> > > > > > There are still ways to reduce number of "wasted" records further, this is
> > > > > > a problem that requires many small and rather independent steps.
> > > > >
> > > > > I feel this is a wrong path to follow.
> > > > > I think it would be better to introduce a flag for kprobe/fentry
> > > > > to do perf_snapshot_branch_stack() as early as possible
> > > > > and then bpf prog can copy these 16 or 32 8-byte entries at
> > > > its leisure.
> > > >
> > > > This is basically how Song started when he was adding this feature a
> > > > few years ago. And I feel like we discussed this and decided that it
> > > > would be cleaner to let the BPF program decide when (and whether) to
> > > > get LBR, based on conditions. It still feels like a right tradeoff.
> > >
> > > Right we discussed it back then and at that time it was about
> > > collecting stacks.
> > > What's different now is you want to collect all types of branches
> >
> > I was using --lbr=any
>
> and that's probably something to fix.

Fix in the sense of adjusting or adding another generic
PERF_SAMPLE_BRANCH_xxx value? Or do you mean stop using --lbr=any mode?

> I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> To figure out the path to return in the code
> ARCH_LBR_JCC      |
> ARCH_LBR_REL_CALL |
> ARCH_LBR_IND_CALL |
> ARCH_LBR_RETURN
>
> might be good enough and there won't be a need to do
> inlining in odd places just to avoid tail jmp.

retsnoop supports all modes perf exposes generically (see [0]), I
believe I tried all of them and keep gravitating back to --lbr=any as
most useful, unfortunately.

But it's ok, let's put this particular __always_inline on pause for
now, it's one LBR record more, not the end of the world.

  [0] https://github.com/anakryiko/retsnoop/blob/master/src/retsnoop.c#L269-L280

>
> > do this to get almost all the benefit. I just need to inline
> > bpf_get_branch_snapshot().
>
> If that is the only one that needs inlining then fine,

yes, it's one of the most important ones, I'll take it :)

> but I really don't like to always_inline
> kprobe_multi_link_prog_run().
> A day goes by and somebody will send a patch
> to save 500 bytes of kernel .text by removing always_inline.
> The argument that it's there to help a user space tool that
> wants to do lbr=all instead of excluding rel_jmp
> won't look good.
>

There is a lot of __always_inline in rethook/ftrace code, as well as
some in BPF code. I don't remember people trying to roll this
back, so this seems a bit overdramatic. But ok, if you think it will
be problematic to reject such hypothetical patches, let's put
kprobe_multi_link_prog_run inlining aside for now.

> >
> > I don't think we can get rid of migrate_disable, we need to make sure
> > we are freezing LBR on the CPU on which BPF program will run. So it's
> > either preempt_disable or migrate_disable.
>
> we cannot extend preempt disable across the prog
> and migrate_disable won't really help, since
> there could be another prog on the same cpu
> doing the same "save lbr" action in a different hook
> that will trash per-cpu scratch space.
>
> But we don't need to either migrate_disable or preempt_disable.
> We can have a 32*24 byte buffer per attach point.

I'm missing how we can get away from having a per-CPU buffer. LBRs on
different CPU cores are completely independent and one BPF prog
attachment can be simultaneously running on many CPUs.

Or do you mean per-CPU allocation for each attach point?

We can do LBR capture before migrate_disable calls and still have
correct data most of the time (hopefully), though, so yeah, it can be
another improvement (but with inlining of those two BPF helpers I'm
not sure we have to do this just yet).

> In case of fentry it can be in bpf_trampoline or in bpf_link
> (I didn't analyze pros/cons too far) and
> fentry will only do the single "call intel_pmu_snapshot_branch_stack"
> with that address.
> That's a trivial addition to arch_prepare_bpf_trampoline.
>
> Then bpf prog will take entries from link, since it has access to it.
>
> Same thing for kprobes. As soon as it's triggered it will
> call intel_pmu_snapshot_branch_stack.
> Should be simple to add.
>
> Recursion can overwrite that per-attach buffer, but
> lbr is screwed anyway if we recursed. So not a concern.
>
> > Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is
> > implemented, we call into original function and it returns into
> > trampoline. So this seems unavoidable as well without completely
> > changing how trampoline works for fexit. Multi-kprobe actually,
> > conveniently, avoids this problem.
>
> Definitely do not want to redesign that to help retsnoop save an lbr entry.

Yep.

>
> > > Inlining bpf_get_branch_snapshot() may be ok,
> > > but the way you're doing is not a clear win.
> > > Replacing size / 24 (that compiler optimize into mul)
> > > with actual divide and adding extra mul will be slower.
> > > div+mul vs mul is quite a difference.
> > > How noticeable is that is questionable,
> > > but from inlining perspective it doesn't feel right to do
> > > "inline to avoid extra frame" instead of
> > > "inline to improve performance".
> >
> > Yes, I saw this division-through-multiplication division. I can
> > replicate that as well, but it will be pretty hard to understand, so I
> > thought it is probably not worth it. Note that
> > bpf_get_branch_snapshot() is not some sort of performance-critical
> > helper, if you are calling it on some super frequent kprobe/fentry,
> > you are paying a lot of price just for copying 700+ bytes (and
> > probably a bunch of other stuff).
>
> div is the slowest instruction.
> On skylake it takes 57 uops and 40-90 cycles while mul takes 3 cycles.
> L1 cache is 1-2.
> So it might be faster to copy 768 bytes than do a single div.

Ok, I can add the div-through-multiplication code to keep it 1:1 w/
compiled helper code, no problem.
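
For the record, the reciprocal sequence is easy enough to mirror. Here is a
standalone sketch of the kind of code the compiler generates for size / 24
(the exact constants the verifier emits should of course be double-checked
against the actual compiled helper):

  #include <assert.h>
  #include <stdint.h>

  /* n / 24 == (n / 3) / 8, and for any u32 n:
   * n / 3 == (n * 0xAAAAAAAB) >> 33, since 0xAAAAAAAB == (2^33 + 1) / 3
   */
  static inline uint32_t div_by_24(uint32_t n)
  {
          return (uint32_t)(((uint64_t)n * 0xAAAAAAABULL) >> 33) >> 3;
  }

  int main(void)
  {
          uint32_t n = 0;

          do {
                  assert(div_by_24(n) == n / 24);
          } while (n++ != UINT32_MAX); /* exhaustive check over all u32 values */

          return 0;
  }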

>
> I still think that adding call intel_pmu_snapshot_branch_stack
> to fentry and kprobes is a simpler and cleaner solution that eliminates
> all guess work of compiler inlining and optimizations.
>
> We can potentially optimize it further.
> Since arch_prepare_bpf_trampoline() is arch specific,
> for x86 we can inline:
>         local_irq_save(flags);
>         __intel_pmu_disable_all(false);
>         __intel_pmu_lbr_disable();
> into generated trampoline (since above is just 5-6 instructions)
> and call into __intel_pmu_snapshot_branch_stack.

Let's keep it as plan B, given this gets into gnarly internals of
Intel-specific x86-64 code.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers
  2024-03-26 16:50           ` Andrii Nakryiko
@ 2024-03-27 21:59             ` Alexei Starovoitov
  2024-03-28 22:53               ` Andrii Nakryiko
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2024-03-27 21:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Tue, Mar 26, 2024 at 9:50 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> Fix in the sense to adjust or add another generic
> PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode?
>
> > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> > To figure out the path to return in the code
> > ARCH_LBR_JCC      |
> > ARCH_LBR_REL_CALL |
> > ARCH_LBR_IND_CALL |
> > ARCH_LBR_RETURN
> >
> > might be good enough and there won't be a need to do
> > inlining in odd places just to avoid tail jmp.
>
> retsnoop supports all modes perf exposes generically (see [0]), I
> believe I tried all of them and keep gravitating back to --lbr=any as
> most useful, unfortunately.

I mean to use PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_COND
which I suspect will exclude ARCH_LBR_REL_JMP
and will avoid counting normal goto-s.
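
Roughly, the perf side of that (just a sketch, not actual retsnoop code; the
sampling period is arbitrary since the event only has to keep LBR armed):

  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int open_lbr_event(int cpu)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
          attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
                                    PERF_SAMPLE_BRANCH_ANY_CALL |
                                    PERF_SAMPLE_BRANCH_COND;
          attr.sample_period = 1000000;
          attr.exclude_user = 1;

          /* one event per CPU: pid == -1, no event group, no flags */
          return syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
  }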

> I'm missing how we can get away from having a per-CPU buffer. LBRs on
> different CPU cores are completely independent and one BPF prog
> attachment can be simultaneously running on many CPUs.
>
> Or do you mean per-CPU allocation for each attach point?
>
> We can do LBR capture before migrate_disable calls and still have
> correct data most of the time (hopefully), though, so yeah, it can be
> another improvement (but with inlining of those two BPF helpers I'm
> not sure we have to do this just yet).

I meant 32*24 buffer per attachment.
Doing per-attachment per-cpu might not scale?
It's certainly cleaner with per-cpu though.
With a single per-attach buffer the assumption was that the different
cpus will likely take the same path towards that kprobe.
retsnoop doesn't care which cpu it collected that stack trace from.
It cares about the code path and it will be there.
We can make the whole thing configurable.
bpf_link_attach would specify a buffer or per-cpu or
a buffer right in the bpf map array that should be used
to store the lbr trace.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers
  2024-03-27 21:59             ` Alexei Starovoitov
@ 2024-03-28 22:53               ` Andrii Nakryiko
  0 siblings, 0 replies; 23+ messages in thread
From: Andrii Nakryiko @ 2024-03-28 22:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Peter Zijlstra, Song Liu

On Wed, Mar 27, 2024 at 2:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 26, 2024 at 9:50 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > Fix in the sense to adjust or add another generic
> > PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode?
> >
> > > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> > > To figure out the path to return in the code
> > > ARCH_LBR_JCC      |
> > > ARCH_LBR_REL_CALL |
> > > ARCH_LBR_IND_CALL |
> > > ARCH_LBR_RETURN
> > >
> > > might be good enough and there won't be a need to do
> > > inlining in odd places just to avoid tail jmp.
> >
> > retsnoop supports all modes perf exposes generically (see [0]), I
> > believe I tried all of them and keep gravitating back to --lbr=any as
> > most useful, unfortunately.
>
> I mean to use PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_COND
> which I suspect will exclude ARCH_LBR_REL_JMP
> and will avoid counting normal goto-s.

This would be equivalent to passing `--lbr=any_call --lbr=cond` to
retsnoop. And yes, you are right that it doesn't record unconditional
jumps, saving a few more LBR frames. The problem is that it makes it
super hard to follow what's going on without disassembling all the
relevant functions and following the assembly *very carefully* to
understand the flow of logic. This is normally completely unnecessary
with --lbr=any (unless DWARF line info is screwed up, of course):
--lbr=any makes it possible to follow C statement-level code flow based
on file:line information alone.

So, in summary, yes, `--lbr=any_call --lbr=cond` is a good last-resort
option if a few more LBR records are needed. But it doesn't feel like
something I'd recommend typical users start with.

>
> > I'm missing how we can get away from having a per-CPU buffer. LBRs on
> > different CPU cores are completely independent and one BPF prog
> > attachment can be simultaneously running on many CPUs.
> >
> > Or do you mean per-CPU allocation for each attach point?
> >
> > We can do LBR capture before migrate_disable calls and still have
> > correct data most of the time (hopefully), though, so yeah, it can be
> > another improvement (but with inlining of those two BPF helpers I'm
> > not sure we have to do this just yet).
>
> I meant a 32*24-byte buffer (32 LBR entries of 24 bytes each) per attachment.
> Doing per-attachment per-cpu might not scale?
> It's certainly cleaner with per-cpu though.
> With a single per-attach buffer the assumption was that the different
> cpus will likely take the same path towards that kprobe.

This is a rather bold assumption. It might be true in a lot of cases,
but certainly not always. Think about tracing __sys_bpf for errors.
There could be many simultaneous bpf() syscalls in the system, some
returning an expected -ENOENT during map element iteration (so not
really interesting), while others result in confusing failures (and
are thus "interesting"). It's even worse with some filesystem-related
functions.

My point is that in retsnoop I don't want to rely on these
assumptions. I won't stop anyone trying to add this functionality in
the kernel, but I don't see retsnoop relying on something like that.

> retsnoop doesn't care which cpu it collected that stack trace from.
> It cares about the code path and it will be there.

It does: retsnoop has a bunch of filters to narrow down specific
conditions, where process information is taken into account, among
other things. And these filtering capabilities will only grow over
time.

> We can make the whole thing configurable.
> bpf_link_attach would specify either a single shared buffer, a per-cpu
> buffer, or a buffer right in a bpf array map that should be used to
> store the LBR trace.

I'm pretty happy with the existing functionality and don't really need
new APIs. I'll be posting a few more patches improving performance (not
just LBR usage) over the next few days. With those I get close enough:
only 4 wasted LBR records with --lbr=any.

Thanks for the brainstorming. The internal per-CPU instruction
implementation turned out pretty well; I'll be sending a patch set
soon.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-03-28 22:54 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko
2024-03-21 18:04 ` [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
2024-03-21 21:08   ` Jiri Olsa
2024-03-21 18:05 ` [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
2024-03-21 21:08   ` Jiri Olsa
2024-03-21 21:27     ` Andrii Nakryiko
2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko
2024-03-21 21:08   ` Jiri Olsa
2024-03-21 21:09     ` Andrii Nakryiko
2024-03-21 22:57       ` Jiri Olsa
2024-03-21 23:38         ` Andrii Nakryiko
2024-03-21 23:49   ` Alexei Starovoitov
2024-03-22 16:45     ` Andrii Nakryiko
2024-03-25  3:28       ` Alexei Starovoitov
2024-03-25 17:01         ` Andrii Nakryiko
2024-03-21 23:46 ` [PATCH bpf-next 0/3] Inline two LBR-related helpers Alexei Starovoitov
2024-03-22 16:45   ` Andrii Nakryiko
2024-03-25  2:05     ` Alexei Starovoitov
2024-03-25 17:20       ` Andrii Nakryiko
2024-03-26  3:13         ` Alexei Starovoitov
2024-03-26 16:50           ` Andrii Nakryiko
2024-03-27 21:59             ` Alexei Starovoitov
2024-03-28 22:53               ` Andrii Nakryiko
