* [PATCH bpf-next 0/4] bpf, x64: Fix tailcall hierarchy
@ 2024-01-04 14:22 Leon Hwang
  2024-01-04 14:22 ` [PATCH bpf-next 1/4] bpf, x64: Use emit_nops() to replace memcpy()'ing x86_nops[5] Leon Hwang
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Leon Hwang @ 2024-01-04 14:22 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm, kernel-patches-bot

This patchset fixes a tailcall hierarchy issue.

The issue was confirmed in the discussion of "bpf, x64: Fix tailcall infinite
loop"[0].

This patchset resolves the issue on x86 only.

Hopefully, the issue on arm64 and s390x will be resolved soon.

The second patch carries a long commit message that describes in detail how
the issue happens and how this patchset resolves it.

RFC v2 -> v1:
  * address comments from Maciej:
    * Replace all memcpy(prog, x86_nops[5], X86_PATCH_SIZE) with
        emit_nops(&prog, X86_PATCH_SIZE)

RFC v1 -> RFC v2:
  * address comments from Stanislav:
    * Separate moving emit_nops() as first patch.

Links:
[0] https://lore.kernel.org/bpf/6203dd01-789d-f02c-5293-def4c1b18aef@gmail.com/

Leon Hwang (4):
  bpf, x64: Use emit_nops() to replace memcpy()'ing x86_nops[5]
  bpf, x64: Fix tailcall hierarchy
  bpf, x64: Rename RESTORE_TAIL_CALL_CNT() to LOAD_TAIL_CALL_CNT_PTR()
  selftests/bpf: Add testcases for tailcall hierarchy fixing

 arch/x86/net/bpf_jit_comp.c                   | 108 ++---
 .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  34 ++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  55 +++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  46 ++
 5 files changed, 609 insertions(+), 52 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c


base-commit: 0c14840ae36f8170f06c2fa768203ef5a8e389e1
-- 
2.42.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 1/4] bpf, x64: Use emit_nops() to replace memcpy()'ing x86_nops[5]
  2024-01-04 14:22 [PATCH bpf-next 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2024-01-04 14:22 ` Leon Hwang
  2024-01-04 14:22 ` [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Leon Hwang @ 2024-01-04 14:22 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm, kernel-patches-bot

To let the next commit reuse emit_nops(), move emit_nops() ahead of
emit_prologue().

While at it, replace all memcpy(prog, x86_nops[5], X86_PATCH_SIZE) calls with
emit_nops(&prog, X86_PATCH_SIZE).

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 47 +++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index bdacbb84456d9..fe30b9ebb8de4 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -307,6 +307,25 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
 	*pprog = prog;
 }
 
+static void emit_nops(u8 **pprog, int len)
+{
+	u8 *prog = *pprog;
+	int i, noplen;
+
+	while (len > 0) {
+		noplen = len;
+
+		if (noplen > ASM_NOP_MAX)
+			noplen = ASM_NOP_MAX;
+
+		for (i = 0; i < noplen; i++)
+			EMIT1(x86_nops[noplen][i]);
+		len -= noplen;
+	}
+
+	*pprog = prog;
+}
+
 /*
  * Emit the various CFI preambles, see asm/cfi.h and the comments about FineIBT
  * in arch/x86/kernel/alternative.c
@@ -385,8 +404,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	/* BPF trampoline can be made to work without these nops,
 	 * but let's waste 5 bytes for now and optimize later
 	 */
-	memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
-	prog += X86_PATCH_SIZE;
+	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
 		if (tail_call_reachable && !is_subprog)
 			/* When it's the entry of the whole tailcall context,
@@ -692,8 +710,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
 
-	memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
-	prog += X86_PATCH_SIZE;
+	emit_nops(&prog, X86_PATCH_SIZE);
 
 	/* out: */
 	ctx->tail_call_direct_label = prog - start;
@@ -1055,25 +1072,6 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
 	}
 }
 
-static void emit_nops(u8 **pprog, int len)
-{
-	u8 *prog = *pprog;
-	int i, noplen;
-
-	while (len > 0) {
-		noplen = len;
-
-		if (noplen > ASM_NOP_MAX)
-			noplen = ASM_NOP_MAX;
-
-		for (i = 0; i < noplen; i++)
-			EMIT1(x86_nops[noplen][i]);
-		len -= noplen;
-	}
-
-	*pprog = prog;
-}
-
 /* emit the 3-byte VEX prefix
  *
  * r: same as rex.r, extra bit for ModRM reg field
@@ -2700,8 +2698,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		/* remember return value in a stack for bpf prog to access */
 		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
 		im->ip_after_call = image + (prog - (u8 *)rw_image);
-		memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
-		prog += X86_PATCH_SIZE;
+		emit_nops(&prog, X86_PATCH_SIZE);
 	}
 
 	if (fmod_ret->nr_links) {
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-04 14:22 [PATCH bpf-next 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
  2024-01-04 14:22 ` [PATCH bpf-next 1/4] bpf, x64: Use emit_nops() to replace memcpy()'ing x86_nops[5] Leon Hwang
@ 2024-01-04 14:22 ` Leon Hwang
  2024-01-05  4:15   ` Alexei Starovoitov
  2024-01-04 14:22 ` [PATCH bpf-next 3/4] bpf, x64: Rename RESTORE_TAIL_CALL_CNT() to LOAD_TAIL_CALL_CNT_PTR() Leon Hwang
  2024-01-04 14:22 ` [PATCH bpf-next 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
  3 siblings, 1 reply; 27+ messages in thread
From: Leon Hwang @ 2024-01-04 14:22 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm, kernel-patches-bot

Since commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT"), tailcalls on x64 have worked better than before.

Since commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
for x64 JIT"), tailcalls are able to run in BPF subprograms on x64.

But what happens if:

1. More than one subprogram is called in a bpf program.
2. The tailcalls in those subprograms call the bpf program itself.

Because tail_call_cnt is not back-propagated, a tailcall hierarchy comes up,
and the MAX_TAIL_CALL_CNT limit does not work for this case.

Let's take a look at an example:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_legacy.h"

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	volatile int ret = 1;

	count++;
	subprog_tail(skb); /* subprog call1 */
	subprog_tail(skb); /* subprog call2 */

	return ret;
}

char __license[] SEC("license") = "GPL";

The entry bpf prog is populated into the 0th slot of jmp_table. Then, what
happens when the entry bpf prog runs? The CPU will stall because of too many
tailcalls, as in this CI run:

https://github.com/kernel-patches/bpf/pull/5807/checks

In these CI results, test_progs failed to run on aarch64 and s390x because of
"rcu: INFO: rcu_sched self-detected stall on CPU".

So, if the CPU did not stall because of too many tailcalls, how many tailcalls
would there be in this case? And why does the MAX_TAIL_CALL_CNT limit not work
here?

Let's step through the execution.

The very first time subprog_tail() is called, it tailcalls the entry bpf prog.
Then, subprog_tail() is called a second time at the position subprog call1,
and it tailcalls the entry bpf prog again.

And so on, again and again. The first time the MAX_TAIL_CALL_CNT limit kicks
in, subprog_tail() has been called 34 times at the position subprog call1, and
at that point the tail_call_cnt in subprog_tail() is 33.

Next, the 34th subprog_tail() returns to entry() because of the
MAX_TAIL_CALL_CNT limit.

In the 34th entry(), after the 34th subprog_tail() at the position subprog
call1 finishes and before the 1st subprog_tail() at the position subprog call2
is called, what's the value of tail_call_cnt in entry()? It's 33.

As we know, tail_call_cnt is pushed onto the stack of entry(), and propagates
to subprog_tail() via %rax from the stack.

Then, when subprog_tail() at the position subprog call2 is called for the
first time, tail_call_cnt 33 propagates to subprog_tail() via %rax. And the
tailcall in subprog_tail() is aborted as well, because
tail_call_cnt >= MAX_TAIL_CALL_CNT.

Then, subprog_tail() at the position subprog call2 ends, and the 34th entry()
ends. It returns to the 33rd subprog_tail() called from the position subprog
call1. But wait, at this time, what's the value of tail_call_cnt on the stack
of subprog_tail()? It's 33.

Then, in the 33rd entry(), after the 33rd subprog_tail() at the position
subprog call1 finishes and before the 2nd subprog_tail() at the position
subprog call2 is called, what's the value of tail_call_cnt in the current
entry()? It's *32*. Why not 33?

Before stepping into subprog_tail() at the position subprog call2 in the 33rd
entry(), as if pausing a time machine, let's have a look at the stack memory:

  |  STACK  |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  +---------+ RSP  <-- current rsp
  |   rip   | STACK of 34th entry()
  |   rbp   | reuse the STACK of 33rd subprog_tail() at the position
  |   ret   |                                        subprog call1
  |   tcc   | its value is 33
  +---------+ rsp
  |   rip   | STACK of 1st subprog_tail() at the position subprog call2
  |   rbp   |
  |   tcc   | its value is 33
  +---------+ rsp

Why not 33? Because tail_call_cnt does not back-propagate from subprog_tail()
to entry().

Then, stepping into subprog_tail() at the position subprog call2 in the 33rd
entry():

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 32; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Then, while pausing after tailcalling in 2nd subprog_tail() at the position
subprog call2:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 33; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Note what happens to tail_call_cnt:
	/*
	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
	 *	goto out;
	 */
The check against MAX_TAIL_CALL_CNT happens first, and only then is
tail_call_cnt incremented.

So, the current tailcall is allowed to run.

Then, entry() is tailcalled. And the stack memory status is:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 35th entry(); reuse STACK of subprog_tail() at
  |   tcc   | its value is 33                   the position subprog call2
  +---------+ RSP  <-- current rsp

So, the tailcalls in the 35th entry() will be aborted.

And, ..., again and again.  :(

By now, I hope the reason why the MAX_TAIL_CALL_CNT limit does not work for
this case is clear.

So, how many tailcalls are there in this case if the CPU does not stall?

Viewed top-down, it looks like a hierarchy, layer upon layer.

It is a layered hierarchy with 2+4+8+...+2**33 tailcalls. As a result, if the
CPU did not stall, there would be 2**34 - 2 = 17,179,869,182 tailcalls. That's
what stalls the CPU (a rough check of this sum is sketched below).

What if there are N subprog_tail() calls in entry()? If the CPU did not stall
because of too many tailcalls, there would be roughly N**34 tailcalls.
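
Here is that check as a tiny user-space C program (plain C, not kernel code;
it just adds up the layers of the 2-subprog-call hierarchy):

#include <stdio.h>

int main(void)
{
	unsigned long long total = 0;
	int level;

	/* 2 + 4 + 8 + ... + 2^33 tailcalls across the 33 allowed levels */
	for (level = 1; level <= 33; level++)
		total += 1ULL << level;

	printf("%llu\n", total);	/* prints 17179869182 == 2^34 - 2 */
	return 0;
}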

Now that we understand the issue, how does this patch resolve it?

I hope you have the patience to read the following details, because the code
is hard to understand on its own.

As we know, in a tailcall context, tail_call_cnt propagates between BPF
subprograms and trampolines via the stack and the %rax register.

How about propagating a pointer to tail_call_cnt instead of tail_call_cnt
itself?

When the tail_call_cnt pointer is propagated via the stack and %rax,
tail_call_cnt works like a global variable within the current tailcall
context. Then the MAX_TAIL_CALL_CNT limit is able to cover all tailcalls in
the current tailcall context, as the sketch below illustrates.
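
Here is a minimal user-space model of the difference (plain C, not the JIT
code; LIMIT stands in for MAX_TAIL_CALL_CNT so the by-value run finishes
quickly):

#include <stdio.h>

#define LIMIT 5	/* stand-in for MAX_TAIL_CALL_CNT, to keep the run short */

static unsigned long long runs_by_value, runs_by_ptr;

static void entry_by_value(unsigned int tcc);

static void subprog_by_value(unsigned int tcc)
{
	if (tcc >= LIMIT)		/* check first, ... */
		return;
	entry_by_value(tcc + 1);	/* ... then "tailcall" with a private copy */
}

static void entry_by_value(unsigned int tcc)
{
	runs_by_value++;
	subprog_by_value(tcc);		/* subprog call1 */
	subprog_by_value(tcc);		/* subprog call2 */
}

static void entry_by_ptr(unsigned int *tcc_ptr);

static void subprog_by_ptr(unsigned int *tcc_ptr)
{
	if (*tcc_ptr >= LIMIT)		/* every level checks the same counter */
		return;
	(*tcc_ptr)++;
	entry_by_ptr(tcc_ptr);
}

static void entry_by_ptr(unsigned int *tcc_ptr)
{
	runs_by_ptr++;
	subprog_by_ptr(tcc_ptr);	/* subprog call1 */
	subprog_by_ptr(tcc_ptr);	/* subprog call2 */
}

int main(void)
{
	unsigned int tcc = 0;

	entry_by_value(0);
	entry_by_ptr(&tcc);
	/* by value: 2^(LIMIT+1) - 1 runs; with LIMIT 33 that is 2^34 - 1 */
	printf("by value: entry() runs %llu times\n", runs_by_value);
	/* by pointer: LIMIT + 1 runs; with LIMIT 33 that is 34 */
	printf("by ptr:   entry() runs %llu times\n", runs_by_ptr);
	return 0;
}

With the pointer, the whole hierarchy shares one counter, which matches the
34 entry() runs expected by the selftest added later in this series.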

But where is tail_call_cnt stored?

It is stored on the stack of the entry bpf prog's caller, like:

    |  STACK  |
    |         |
    |   rip   |
 +->|   tcc   |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |         |
 |  |         |
 |  |         |
 +--| tcc_ptr |
    |   rbx   |
    +---------+ RSP

Note: tcc is tail_call_cnt, tcc_ptr is tail_call_cnt pointer.

So, how does tail_call_cnt end up on the stack of the entry bpf prog's
caller?

In the prologue of the entry bpf prog, before pushing %rbp, it initialises
tail_call_cnt with "xor eax, eax" and then pushes it onto the stack with
"push rax". Then it makes %rax the pointer to tail_call_cnt with "mov rax,
rsp". Next, it calls the main part of the entry bpf prog with "call 2". (This
is the crucial point.) The "call" pushes %rip onto the stack. At the end of
the entry bpf prog's runtime, that %rip is popped from the stack, landing on
the "pop rcx" which discards tail_call_cnt, and then the final "ret" returns
to the caller. The "pop rcx" and "ret" are the 2 bytes skipped by "call 2".

Using a "call" here may seem invasive, but it is the key to this patch. With
this "call", tail_call_cnt can be stored on the stack of the entry bpf prog's
caller instead of on the stack of the entry bpf prog itself. As a result,
tail_call_cnt is effectively protected by the "call".

Meanwhile, tcc_ptr does not need to be popped from the stack in the epilogue
of a bpf prog, following the approach of commit d207929d97ea028f ("bpf, x64:
Drop "pop %rcx" instruction on BPF JIT epilogue").

When a tailcall happens, the tail_call_cnt pointer is loaded from the stack
into %rax with "mov rax, qword ptr [rbp - tcc_ptr_off]", tail_call_cnt is
compared with MAX_TAIL_CALL_CNT with "cmp dword ptr [rax], MAX_TAIL_CALL_CNT",
and then tail_call_cnt is incremented with "add dword ptr [rax], 1". Finally,
the "pop rax" now pops the tail_call_cnt pointer from the stack into %rax.

Next, let's step through the execution.

When the prologue of entry() has run, the stack of entry() should look like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 0
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP  <-- current rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
    |   rbx   | saved regs
    +---------+ RSP  <-- current rsp

Then, when subprog_tail() is called for the very first time, the stack
should look like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 0
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP  <-- current rbp
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP  <-- current rsp

Then, when subprog_tail() tailcalls entry():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 1
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP  <-- current rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
    +---------+ RSP  <-- current rsp

Then, when entry() calls subprog_tail():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 1
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP  <-- current rbp
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP  <-- current rsp

Then, when subprog_tail() tailcalls entry():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 2
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP  <-- current rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
    +---------+ RSP  <-- current rsp

And so on, again and again. The first time the MAX_TAIL_CALL_CNT limit kicks
in, subprog_tail() has been called 34 times at the position subprog call1. At
this time, the stack should look like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 33
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |    *    |
 |  |    *    |
 |  |    *    |
 |  +---------+ RBP  <-- current rbp
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP  <-- current rsp

At this time, all further tailcalls will be aborted because tail_call_cnt has
reached 33, its MAX_TAIL_CALL_CNT limit.

This is how this patch works.

If you have read this far, great. I hope the explanation above makes the
following code clear.

Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index fe30b9ebb8de4..67fa337fc2e0c 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -259,7 +259,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
 
 static void push_r12(u8 **pprog)
 {
@@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	 */
 	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
-		if (tail_call_reachable && !is_subprog)
+		if (tail_call_reachable && !is_subprog) {
 			/* When it's the entry of the whole tailcall context,
 			 * zeroing rax means initialising tail_call_cnt.
 			 */
-			EMIT2(0x31, 0xC0); /* xor eax, eax */
-		else
-			/* Keep the same instruction layout. */
-			EMIT2(0x66, 0x90); /* nop2 */
+			EMIT2(0x31, 0xC0);       /* xor eax, eax */
+			EMIT1(0x50);             /* push rax */
+			/* Make rax as ptr that points to tail_call_cnt. */
+			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
+			EMIT1_off32(0xE8, 2);    /* call main prog */
+			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
+			EMIT1(0xC3);             /* ret */
+		} else {
+			/* Keep the same instruction size. */
+			emit_nops(&prog, 13);
+		}
 	}
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
@@ -439,6 +446,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
 	if (tail_call_reachable)
+		/* Here, rax is tail_call_cnt_ptr. */
 		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }
@@ -594,7 +602,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 					u32 stack_depth, u8 *ip,
 					struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
+	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
 
@@ -619,13 +627,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
+	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
+	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
 
 	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                   /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
 
 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
@@ -647,6 +654,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}
 
+	/* pop tail_call_cnt_ptr */
 	EMIT1(0x58);                              /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
@@ -675,7 +683,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 				      bool *callee_regs_used, u32 stack_depth,
 				      struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
+	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
 
@@ -683,13 +691,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
+	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
+	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
 
 	offset = ctx->tail_call_direct_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                       /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
 
 	poke->tailcall_bypass = ip + (prog - start);
 	poke->adj_off = X86_TAIL_CALL_OFFSET;
@@ -706,6 +713,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}
 
+	/* pop tail_call_cnt_ptr */
 	EMIT1(0x58);                                  /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 3/4] bpf, x64: Rename RESTORE_TAIL_CALL_CNT() to LOAD_TAIL_CALL_CNT_PTR()
  2024-01-04 14:22 [PATCH bpf-next 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
  2024-01-04 14:22 ` [PATCH bpf-next 1/4] bpf, x64: Use emit_nops() to replace memcpy()'ing x86_nops[5] Leon Hwang
  2024-01-04 14:22 ` [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2024-01-04 14:22 ` Leon Hwang
  2024-01-04 14:22 ` [PATCH bpf-next 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
  3 siblings, 0 replies; 27+ messages in thread
From: Leon Hwang @ 2024-01-04 14:22 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm, kernel-patches-bot

With the previous commit, %rax is used to propagate the tail_call_cnt pointer
instead of tail_call_cnt itself. So LOAD_TAIL_CALL_CNT_PTR() is a more
accurate name.

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 67fa337fc2e0c..4065bdcc5b2a4 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1142,7 +1142,7 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
 /* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
-#define RESTORE_TAIL_CALL_CNT(stack)				\
+#define LOAD_TAIL_CALL_CNT_PTR(stack)				\
 	EMIT3_off32(0x48, 0x8B, 0x85, -round_up(stack, 8) - 8)
 
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
@@ -1762,7 +1762,7 @@ st:			if (is_imm8(insn->off))
 
 			func = (u8 *) __bpf_call_base + imm32;
 			if (tail_call_reachable) {
-				RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
+				LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
 				if (!imm32)
 					return -EINVAL;
 				offs = 7 + x86_call_depth_emit_accounting(&prog, func);
@@ -2558,7 +2558,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 	 *                     [ ...        ]
 	 *                     [ stack_arg2 ]
 	 * RBP - arg_stack_off [ stack_arg1 ]
-	 * RSP                 [ tail_call_cnt ] BPF_TRAMP_F_TAIL_CALL_CTX
+	 * RSP                 [ tail_call_cnt_ptr ] BPF_TRAMP_F_TAIL_CALL_CTX
 	 */
 
 	/* room for return value of orig_call or fentry prog */
@@ -2686,12 +2686,11 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		restore_regs(m, &prog, regs_off);
 		save_args(m, &prog, arg_stack_off, true);
 
-		if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
-			/* Before calling the original function, restore the
-			 * tail_call_cnt from stack to rax.
+		if (flags & BPF_TRAMP_F_TAIL_CALL_CTX)
+			/* Before calling the original function, load the
+			 * tail_call_cnt_ptr to rax.
 			 */
-			RESTORE_TAIL_CALL_CNT(stack_size);
-		}
+			LOAD_TAIL_CALL_CNT_PTR(stack_size);
 
 		if (flags & BPF_TRAMP_F_ORIG_STACK) {
 			emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, 8);
@@ -2749,10 +2748,10 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 			goto cleanup;
 		}
 	} else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
-		/* Before running the original function, restore the
-		 * tail_call_cnt from stack to rax.
+		/* Before running the original function, load the
+		 * tail_call_cnt_ptr to rax.
 		 */
-		RESTORE_TAIL_CALL_CNT(stack_size);
+		LOAD_TAIL_CALL_CNT_PTR(stack_size);
 	}
 
 	/* restore return value of orig_call or fentry prog back into RAX */
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing
  2024-01-04 14:22 [PATCH bpf-next 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
                   ` (2 preceding siblings ...)
  2024-01-04 14:22 ` [PATCH bpf-next 3/4] bpf, x64: Rename RESTORE_TAIL_CALL_CNT() to LOAD_TAIL_CALL_CNT_PTR() Leon Hwang
@ 2024-01-04 14:22 ` Leon Hwang
  3 siblings, 0 replies; 27+ messages in thread
From: Leon Hwang @ 2024-01-04 14:22 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm, kernel-patches-bot

Add some test cases to confirm the tailcall hierarchy issue has been fixed.

tools/testing/selftests/bpf/test_progs -t tailcalls
305/18  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
305/19  tailcalls/tailcall_bpf2bpf_hierarchy_fentry:OK
305/20  tailcalls/tailcall_bpf2bpf_hierarchy_fexit:OK
305/21  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_fexit:OK
305/22  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
305/23  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
305     tailcalls:OK
Summary: 1/23 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  34 ++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  55 +++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  46 ++
 4 files changed, 553 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tailcalls.c b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
index 59993fc9c0d7e..e796328faf2e6 100644
--- a/tools/testing/selftests/bpf/prog_tests/tailcalls.c
+++ b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
@@ -1187,6 +1187,412 @@ static void test_tailcall_poke(void)
 	tailcall_poke__destroy(call);
 }
 
+static void test_tailcall_hierarchy_count(const char *which, bool test_fentry,
+					  bool test_fexit)
+{
+	int err, map_fd, prog_fd, main_data_fd, fentry_data_fd, fexit_data_fd, i, val;
+	struct bpf_object *obj = NULL, *fentry_obj = NULL, *fexit_obj = NULL;
+	struct bpf_link *fentry_link = NULL, *fexit_link = NULL;
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load(which, BPF_PROG_TYPE_SCHED_CLS, &obj,
+				 &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "prog_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "map_fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	if (test_fentry) {
+		fentry_obj = bpf_object__open_file("tailcall_bpf2bpf_fentry.bpf.o",
+						   NULL);
+		if (!ASSERT_OK_PTR(fentry_obj, "open fentry_obj file"))
+			goto out;
+
+		prog = bpf_object__find_program_by_name(fentry_obj, "fentry");
+		if (!ASSERT_OK_PTR(prog, "find fentry prog"))
+			goto out;
+
+		err = bpf_program__set_attach_target(prog, prog_fd,
+						     "subprog_tail");
+		if (!ASSERT_OK(err, "set_attach_target subprog_tail"))
+			goto out;
+
+		err = bpf_object__load(fentry_obj);
+		if (!ASSERT_OK(err, "load fentry_obj"))
+			goto out;
+
+		fentry_link = bpf_program__attach_trace(prog);
+		if (!ASSERT_OK_PTR(fentry_link, "attach_trace"))
+			goto out;
+	}
+
+	if (test_fexit) {
+		fexit_obj = bpf_object__open_file("tailcall_bpf2bpf_fexit.bpf.o",
+						  NULL);
+		if (!ASSERT_OK_PTR(fexit_obj, "open fexit_obj file"))
+			goto out;
+
+		prog = bpf_object__find_program_by_name(fexit_obj, "fexit");
+		if (!ASSERT_OK_PTR(prog, "find fexit prog"))
+			goto out;
+
+		err = bpf_program__set_attach_target(prog, prog_fd,
+						     "subprog_tail");
+		if (!ASSERT_OK(err, "set_attach_target subprog_tail"))
+			goto out;
+
+		err = bpf_object__load(fexit_obj);
+		if (!ASSERT_OK(err, "load fexit_obj"))
+			goto out;
+
+		fexit_link = bpf_program__attach_trace(prog);
+		if (!ASSERT_OK_PTR(fexit_link, "attach_trace"))
+			goto out;
+	}
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find data_map"))
+		goto out;
+
+	main_data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(main_data_fd, 0, "main_data_fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(main_data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 34, "tailcall count");
+
+	if (test_fentry) {
+		data_map = bpf_object__find_map_by_name(fentry_obj, ".bss");
+		if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+				  "find tailcall_bpf2bpf_fentry.bss map"))
+			goto out;
+
+		fentry_data_fd = bpf_map__fd(data_map);
+		if (!ASSERT_GE(fentry_data_fd, 0,
+				  "find tailcall_bpf2bpf_fentry.bss map fd"))
+			goto out;
+
+		i = 0;
+		err = bpf_map_lookup_elem(fentry_data_fd, &i, &val);
+		ASSERT_OK(err, "fentry count");
+		ASSERT_EQ(val, 68, "fentry count");
+	}
+
+	if (test_fexit) {
+		data_map = bpf_object__find_map_by_name(fexit_obj, ".bss");
+		if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+				  "find tailcall_bpf2bpf_fexit.bss map"))
+			goto out;
+
+		fexit_data_fd = bpf_map__fd(data_map);
+		if (!ASSERT_GE(fexit_data_fd, 0,
+				  "find tailcall_bpf2bpf_fexit.bss map fd"))
+			goto out;
+
+		i = 0;
+		err = bpf_map_lookup_elem(fexit_data_fd, &i, &val);
+		ASSERT_OK(err, "fexit count");
+		ASSERT_EQ(val, 68, "fexit count");
+	}
+
+	i = 0;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (!ASSERT_OK(err, "delete_elem from jmp_table"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	i = 0;
+	err = bpf_map_lookup_elem(main_data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 35, "tailcall count");
+
+	if (test_fentry) {
+		i = 0;
+		err = bpf_map_lookup_elem(fentry_data_fd, &i, &val);
+		ASSERT_OK(err, "fentry count");
+		ASSERT_EQ(val, 70, "fentry count");
+	}
+
+	if (test_fexit) {
+		i = 0;
+		err = bpf_map_lookup_elem(fexit_data_fd, &i, &val);
+		ASSERT_OK(err, "fexit count");
+		ASSERT_EQ(val, 70, "fexit count");
+	}
+
+out:
+	bpf_link__destroy(fentry_link);
+	bpf_link__destroy(fexit_link);
+	bpf_object__close(fentry_obj);
+	bpf_object__close(fexit_obj);
+	bpf_object__close(obj);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_1 checks that the count value of the tail
+ * call limit enforcement matches with expectations when tailcalls are preceded
+ * with two bpf2bpf calls.
+ *
+ *         subprog --tailcall-> entry
+ * entry <
+ *         subprog --tailcall-> entry
+ */
+static void test_tailcall_bpf2bpf_hierarchy_1(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      false, false);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fentry checks that the count value of the
+ * tail call limit enforcement matches with expectations when tailcalls are
+ * preceded with two bpf2bpf calls, and the two subprogs are traced by fentry.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fentry(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      true, false);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fexit checks that the count value of the tail
+ * call limit enforcement matches with expectations when tailcalls are preceded
+ * with two bpf2bpf calls, and the two subprogs are traced by fexit.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fexit(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      false, true);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fentry_fexit checks that the count value of
+ * the tail call limit enforcement matches with expectations when tailcalls are
+ * preceded with two bpf2bpf calls, and the two subprogs are traced by both
+ * fentry and fexit.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fentry_fexit(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      true, true);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_2 checks that the count value of the tail
+ * call limit enforcement matches with expectations:
+ *
+ *         subprog_tail0 --tailcall-> classifier_0 -> subprog_tail0
+ * entry <
+ *         subprog_tail1 --tailcall-> classifier_1 -> subprog_tail1
+ */
+static void test_tailcall_bpf2bpf_hierarchy_2(void)
+{
+	int err, map_fd, prog_fd, data_fd, main_fd, i, val[2];
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_object *obj = NULL;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load("tailcall_bpf2bpf_hierarchy2.bpf.o",
+				 BPF_PROG_TYPE_SCHED_CLS,
+				 &obj, &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(main_fd, 0, "main_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table map fd"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_0");
+	if (!ASSERT_OK_PTR(prog, "find classifier_0 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_0 prog fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_1");
+	if (!ASSERT_OK_PTR(prog, "find classifier_1 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_1 prog fd"))
+		goto out;
+
+	i = 1;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(main_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find .bss map"))
+		goto out;
+
+	data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(data_fd, 0, "find .bss map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall counts");
+	ASSERT_EQ(val[0], 33, "tailcall count0");
+	ASSERT_EQ(val[1], 0, "tailcall count1");
+
+out:
+	bpf_object__close(obj);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_3 checks that the count value of the tail
+ * call limit enforcement matches with expectations:
+ *
+ *                                   subprog with jmp_table0 to classifier_0
+ * entry --tailcall-> classifier_0 <
+ *                                   subprog with jmp_table1 to classifier_0
+ */
+static void test_tailcall_bpf2bpf_hierarchy_3(void)
+{
+	int err, map_fd, prog_fd, data_fd, main_fd, i, val;
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_object *obj = NULL;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load("tailcall_bpf2bpf_hierarchy3.bpf.o",
+				 BPF_PROG_TYPE_SCHED_CLS,
+				 &obj, &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(main_fd, 0, "main_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table0");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table0 map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table0 map fd"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_0");
+	if (!ASSERT_OK_PTR(prog, "find classifier_0 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_0 prog fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table0"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table1");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table1 map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table1 map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table1"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(main_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find .bss map"))
+		goto out;
+
+	data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(data_fd, 0, "find .bss map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 33, "tailcall count");
+
+out:
+	bpf_object__close(obj);
+}
+
 void test_tailcalls(void)
 {
 	if (test__start_subtest("tailcall_1"))
@@ -1223,4 +1629,16 @@ void test_tailcalls(void)
 		test_tailcall_bpf2bpf_fentry_entry();
 	if (test__start_subtest("tailcall_poke"))
 		test_tailcall_poke();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_1"))
+		test_tailcall_bpf2bpf_hierarchy_1();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fentry"))
+		test_tailcall_bpf2bpf_hierarchy_fentry();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fexit"))
+		test_tailcall_bpf2bpf_hierarchy_fexit();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fentry_fexit"))
+		test_tailcall_bpf2bpf_hierarchy_fentry_fexit();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_2"))
+		test_tailcall_bpf2bpf_hierarchy_2();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_3"))
+		test_tailcall_bpf2bpf_hierarchy_3();
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
new file mode 100644
index 0000000000000..0bfbb7c9637b7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+int count = 0;
+
+static __noinline
+int subprog_tail(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	volatile int ret = 1;
+
+	count++;
+	subprog_tail(skb);
+	subprog_tail(skb);
+
+	return ret;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
new file mode 100644
index 0000000000000..b84541546082e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 2);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+int count0 = 0;
+int count1 = 0;
+
+static __noinline
+int subprog_tail0(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int classifier_0(struct __sk_buff *skb)
+{
+	count0++;
+	subprog_tail0(skb);
+	return 0;
+}
+
+static __noinline
+int subprog_tail1(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 1);
+	return 0;
+}
+
+SEC("tc")
+int classifier_1(struct __sk_buff *skb)
+{
+	count1++;
+	subprog_tail1(skb);
+	return 0;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	subprog_tail0(skb);
+	subprog_tail1(skb);
+
+	return 1;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
new file mode 100644
index 0000000000000..6398a1d277fc7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table0 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table1 SEC(".maps");
+
+int count = 0;
+
+static __noinline
+int subprog_tail(struct __sk_buff *skb, void *jmp_table)
+{
+	bpf_tail_call_static(skb, jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int classifier_0(struct __sk_buff *skb)
+{
+	count++;
+	subprog_tail(skb, &jmp_table0);
+	subprog_tail(skb, &jmp_table1);
+	return 1;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table0, 0);
+
+	return 0;
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-04 14:22 ` [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2024-01-05  4:15   ` Alexei Starovoitov
  2024-01-05  6:15     ` Leon Hwang
                       ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Alexei Starovoitov @ 2024-01-05  4:15 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot

On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index fe30b9ebb8de4..67fa337fc2e0c 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -259,7 +259,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE         5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
>
>  static void push_r12(u8 **pprog)
>  {
> @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>          */
>         emit_nops(&prog, X86_PATCH_SIZE);
>         if (!ebpf_from_cbpf) {
> -               if (tail_call_reachable && !is_subprog)
> +               if (tail_call_reachable && !is_subprog) {
>                         /* When it's the entry of the whole tailcall context,
>                          * zeroing rax means initialising tail_call_cnt.
>                          */
> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
> -               else
> -                       /* Keep the same instruction layout. */
> -                       EMIT2(0x66, 0x90); /* nop2 */
> +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
> +                       EMIT1(0x50);             /* push rax */
> +                       /* Make rax as ptr that points to tail_call_cnt. */
> +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> +                       EMIT1_off32(0xE8, 2);    /* call main prog */
> +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> +                       EMIT1(0xC3);             /* ret */
> +               } else {
> +                       /* Keep the same instruction size. */
> +                       emit_nops(&prog, 13);
> +               }

I'm afraid the extra call breaks stack unwinding and many other things.
The proper frame needs to be setup (push rbp; etc)
and 'leave' + emit_return() is used.
Plain 'ret' is not ok.
x86_call_depth_emit_accounting() needs to be used too.
That will make X86_TAIL_CALL_OFFSET adjustment very complicated.
Also the fix doesn't address the stack size issue.
We shouldn't allow all the extra frames at run-time.

The tail_cnt_ptr approach is interesting but too heavy,
since arm64, s390 and other JITs would need to repeat it with equally
complicated calculations in TAIL_CALL_OFFSET.

The fix should really be thought through for all JITs. Not just x86.

I'm thinking whether we should do the following instead:

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 0bdbbbeab155..0b45571559be 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -910,7 +910,7 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
        if (IS_ERR(prog))
                return prog;

-       if (!bpf_prog_map_compatible(map, prog)) {
+       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
                bpf_prog_put(prog);
                return ERR_PTR(-EINVAL);
        }

This will stop stack growth, but it will break a few existing tests.
I feel it's a price worth paying.

John, Daniel,

do you see anything breaking on cilium side if we disallow
progs with subprogs to be inserted in prog_array ?

Other alternatives?

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05  4:15   ` Alexei Starovoitov
@ 2024-01-05  6:15     ` Leon Hwang
  2024-01-05 17:43       ` Alexei Starovoitov
  2024-01-05 10:33     ` Leon Hwang
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Leon Hwang @ 2024-01-05  6:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 5/1/24 12:15, Alexei Starovoitov wrote:
> On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index fe30b9ebb8de4..67fa337fc2e0c 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -259,7 +259,7 @@ struct jit_context {
>>  /* Number of bytes emit_patch() needs to generate instructions */
>>  #define X86_PATCH_SIZE         5
>>  /* Number of bytes that will be skipped on tailcall */
>> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
>> +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
>>
>>  static void push_r12(u8 **pprog)
>>  {
>> @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>          */
>>         emit_nops(&prog, X86_PATCH_SIZE);
>>         if (!ebpf_from_cbpf) {
>> -               if (tail_call_reachable && !is_subprog)
>> +               if (tail_call_reachable && !is_subprog) {
>>                         /* When it's the entry of the whole tailcall context,
>>                          * zeroing rax means initialising tail_call_cnt.
>>                          */
>> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
>> -               else
>> -                       /* Keep the same instruction layout. */
>> -                       EMIT2(0x66, 0x90); /* nop2 */
>> +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
>> +                       EMIT1(0x50);             /* push rax */
>> +                       /* Make rax as ptr that points to tail_call_cnt. */
>> +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
>> +                       EMIT1_off32(0xE8, 2);    /* call main prog */
>> +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
>> +                       EMIT1(0xC3);             /* ret */
>> +               } else {
>> +                       /* Keep the same instruction size. */
>> +                       emit_nops(&prog, 13);
>> +               }
> 
> I'm afraid the extra call breaks stack unwinding and many other things.

I was worried about that, but I'm not sure how it breaks stack unwinding.

However, I've tried another approach that avoids the extra call:

* [RFC PATCH bpf-next 1/3] bpf, x64: Fix tailcall hierarchy
  https://lore.kernel.org/bpf/20231005145814.83122-2-hffilwlqm@gmail.com/

It also propagates tail_call_cnt_ptr, but it is more complicated:

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 8c10d9abc..001c5e4b7 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -313,24 +332,15 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 			  bool tail_call_reachable, bool is_subprog,
 			  bool is_exception_cb)
 {
+	int tcc_ptr_off = round_up(stack_depth, 8) + 8;
+	int tcc_off = tcc_ptr_off + 8;
 	u8 *prog = *pprog;
 
 	/* BPF trampoline can be made to work without these nops,
 	 * but let's waste 5 bytes for now and optimize later
 	 */
 	EMIT_ENDBR();
-	memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
-	prog += X86_PATCH_SIZE;
-	if (!ebpf_from_cbpf) {
-		if (tail_call_reachable && !is_subprog)
-			/* When it's the entry of the whole tailcall context,
-			 * zeroing rax means initialising tail_call_cnt.
-			 */
-			EMIT2(0x31, 0xC0); /* xor eax, eax */
-		else
-			/* Keep the same instruction layout. */
-			EMIT2(0x66, 0x90); /* nop2 */
-	}
+	emit_nops(&prog, X86_PATCH_SIZE);
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
 		EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
@@ -347,15 +357,52 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 		EMIT1(0x55);             /* push rbp */
 		EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
 	}
+	if (!ebpf_from_cbpf) {
+		if (tail_call_reachable && !is_subprog) {
+			/* Make rax as ptr that points to tail_call_cnt. */
+			EMIT3(0x48, 0x89, 0xE8);          /* mov rax, rbp */
+			EMIT2_off32(0x48, 0x2D, tcc_off); /* sub rax, tcc_off */
+			/* When it's the entry of the whole tail call context,
+			 * storing 0 means initialising tail_call_cnt.
+			 */
+			EMIT2_off32(0xC7, 0x00, 0);       /* mov dword ptr [rax], 0 */
+		} else {
+			/* Keep the same instruction layout. */
+			emit_nops(&prog, 3);
+			emit_nops(&prog, 6);
+			emit_nops(&prog, 6);
+		}
+	}
 
 	/* X86_TAIL_CALL_OFFSET is here */
 	EMIT_ENDBR();
 
+	if (tail_call_reachable) {
+		/* Here, rax is tail_call_cnt_ptr. */
+		if (!is_subprog) {
+			/* Because pushing tail_call_cnt_ptr may cover tail_call_cnt,
+			 * it's required to store tail_call_cnt before storing
+			 * tail_call_cnt_ptr.
+			 */
+			EMIT1(0x50);                       /* push rax */
+			EMIT2(0x8B, 0x00);                 /* mov eax, dword ptr [rax] */
+			EMIT2_off32(0x89, 0x85, -tcc_off); /* mov dword ptr [rbp - tcc_off], eax */
+			EMIT1(0x58);                       /* pop rax */
+			/* mov qword ptr [rbp - tcc_ptr_off], rax */
+			EMIT3_off32(0x48, 0x89, 0x85, -tcc_ptr_off);
+		} else {
+			/* As for subprog, tail_call_cnt is meaningless. Storing
+			 * tail_call_cnt_ptr is enough.
+			 */
+			/* mov qword ptr [rbp - tcc_ptr_off], rax */
+			EMIT3_off32(0x48, 0x89, 0x85, -tcc_ptr_off);
+		}
+		/* Reserve 16 bytes for tail_call_cnt_ptr and tail_call_cnt. */
+		stack_depth += 16;
+	}
 	/* sub rsp, rounded_stack_depth */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	if (tail_call_reachable)
-		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }

How about this approach?

Thanks,
Leon

> The proper frame needs to be setup (push rbp; etc)
> and 'leave' + emit_return() is used.
> Plain 'ret' is not ok.
> x86_call_depth_emit_accounting() needs to be used too.
> That will make X86_TAIL_CALL_OFFSET adjustment very complicated.
> Also the fix doesn't address the stack size issue.
> We shouldn't allow all the extra frames at run-time.
> 
> The tail_cnt_ptr approach is interesting but too heavy,
> since arm64, s390 and other JITs would need to repeat it with equally
> complicated calculations in TAIL_CALL_OFFSET.
> 
> The fix should really be thought through for all JITs. Not just x86.
> 
> I'm thinking whether we should do the following instead:
> 
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 0bdbbbeab155..0b45571559be 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -910,7 +910,7 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
>         if (IS_ERR(prog))
>                 return prog;
> 
> -       if (!bpf_prog_map_compatible(map, prog)) {
> +       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
>                 bpf_prog_put(prog);
>                 return ERR_PTR(-EINVAL);
>         }
> 
> This will stop stack growth, but it will break a few existing tests.
> I feel it's a price worth paying.
> 
> John, Daniel,
> 
> do you see anything breaking on cilium side if we disallow
> progs with subprogs to be inserted in prog_array ?
> 
> Other alternatives?

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05  4:15   ` Alexei Starovoitov
  2024-01-05  6:15     ` Leon Hwang
@ 2024-01-05 10:33     ` Leon Hwang
  2024-01-05 17:47       ` Alexei Starovoitov
  2024-01-05 12:40     ` Jiri Olsa
  2024-02-14  5:47     ` Leon Hwang
  3 siblings, 1 reply; 27+ messages in thread
From: Leon Hwang @ 2024-01-05 10:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 5/1/24 12:15, Alexei Starovoitov wrote:
> On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index fe30b9ebb8de4..67fa337fc2e0c 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -259,7 +259,7 @@ struct jit_context {
>>  /* Number of bytes emit_patch() needs to generate instructions */
>>  #define X86_PATCH_SIZE         5
>>  /* Number of bytes that will be skipped on tailcall */
>> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
>> +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
>>
>>  static void push_r12(u8 **pprog)
>>  {
>> @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>          */
>>         emit_nops(&prog, X86_PATCH_SIZE);
>>         if (!ebpf_from_cbpf) {
>> -               if (tail_call_reachable && !is_subprog)
>> +               if (tail_call_reachable && !is_subprog) {
>>                         /* When it's the entry of the whole tailcall context,
>>                          * zeroing rax means initialising tail_call_cnt.
>>                          */
>> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
>> -               else
>> -                       /* Keep the same instruction layout. */
>> -                       EMIT2(0x66, 0x90); /* nop2 */
>> +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
>> +                       EMIT1(0x50);             /* push rax */
>> +                       /* Make rax as ptr that points to tail_call_cnt. */
>> +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
>> +                       EMIT1_off32(0xE8, 2);    /* call main prog */
>> +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
>> +                       EMIT1(0xC3);             /* ret */
>> +               } else {
>> +                       /* Keep the same instruction size. */
>> +                       emit_nops(&prog, 13);
>> +               }
> 
> I'm afraid the extra call breaks stack unwinding and many other things.
> The proper frame needs to be setup (push rbp; etc)
> and 'leave' + emit_return() is used.
> Plain 'ret' is not ok.
> x86_call_depth_emit_accounting() needs to be used too.
> That will make X86_TAIL_CALL_OFFSET adjustment very complicated.
> Also the fix doesn't address the stack size issue.
> We shouldn't allow all the extra frames at run-time.
> 
> The tail_cnt_ptr approach is interesting but too heavy,
> since arm64, s390 and other JITs would need to repeat it with equally
> complicated calculations in TAIL_CALL_OFFSET.
> 
> The fix should really be thought through for all JITs. Not just x86.
> 
> I'm thinking whether we should do the following instead:
> 
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 0bdbbbeab155..0b45571559be 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -910,7 +910,7 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
>         if (IS_ERR(prog))
>                 return prog;
> 
> -       if (!bpf_prog_map_compatible(map, prog)) {
> +       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
>                 bpf_prog_put(prog);
>                 return ERR_PTR(-EINVAL);
>         }
> 
> This will stop stack growth, but it will break a few existing tests.
> I feel it's a price worth paying.

I don't think this can avoid this issue completely.

For example:

#include "vmlinux.h"

#include "bpf_helpers.h"

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} prog_array SEC(".maps");


static __noinline int
subprog(struct __sk_buff *skb)
{
    volatile int retval = 0;

    bpf_tail_call(skb, &prog_array, 0);

    return retval;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
    const int N = 10000;

    for (int i = 0; i < N; i++)
        subprog(skb);

    return 0;
}

char _license[] SEC("license") = "GPL";

Then, objdump its asm:

Disassembly of section .text:

0000000000000000 <subprog>:
; {
       0:       b7 02 00 00 00 00 00 00 r2 = 0x0
;     volatile int retval = 0;
       1:       63 2a fc ff 00 00 00 00 *(u32 *)(r10 - 0x4) = r2
;     bpf_tail_call(skb, &prog_array, 0);
       2:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
       4:       b7 03 00 00 00 00 00 00 r3 = 0x0
       5:       85 00 00 00 0c 00 00 00 call 0xc
;     return retval;
       6:       61 a1 fc ff 00 00 00 00 r1 = *(u32 *)(r10 - 0x4)
       7:       95 00 00 00 00 00 00 00 exit

Disassembly of section tc:

0000000000000000 <entry>:
; {
       0:       bf 16 00 00 00 00 00 00 r6 = r1
       1:       b7 07 00 00 10 27 00 00 r7 = 0x2710

0000000000000010 <LBB0_1>:
;         subprog(skb);
       2:       bf 61 00 00 00 00 00 00 r1 = r6
       3:       85 10 00 00 ff ff ff ff call -0x1
;     for (int i = 0; i < N; i++)
       4:       07 07 00 00 ff ff ff ff r7 += -0x1
       5:       bf 71 00 00 00 00 00 00 r1 = r7
       6:       67 01 00 00 20 00 00 00 r1 <<= 0x20
       7:       77 01 00 00 20 00 00 00 r1 >>= 0x20
       8:       15 01 01 00 00 00 00 00 if r1 == 0x0 goto +0x1 <LBB0_2>
       9:       05 00 f8 ff 00 00 00 00 goto -0x8 <LBB0_1>

0000000000000050 <LBB0_2>:
;     return 0;
      10:       b7 00 00 00 00 00 00 00 r0 = 0x0
      11:       95 00 00 00 00 00 00 00 exit

As a result, the bpf prog in prog_array can be tail-called N times, even
though that prog itself has no subprogs.
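
For concreteness, here's a minimal sketch of a tail-callee that could sit in
slot 0 of prog_array above (the program name is made up): it has no subprogs,
so prog->aux->func_cnt == 0 and the proposed prog_fd_array_get_ptr() check
would still accept it, yet it gets tail-called N times by the loop in entry():

#include "vmlinux.h"

#include "bpf_helpers.h"

SEC("tc")
int classifier_0(struct __sk_buff *skb)
{
    /* no bpf-to-bpf calls here, so prog->aux->func_cnt == 0 */
    return 0;
}

char _license[] SEC("license") = "GPL";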

Thanks,
Leon

> 
> John, Daniel,
> 
> do you see anything breaking on cilium side if we disallow
> progs with subprogs to be inserted in prog_array ?
> 
> Other alternatives?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05  4:15   ` Alexei Starovoitov
  2024-01-05  6:15     ` Leon Hwang
  2024-01-05 10:33     ` Leon Hwang
@ 2024-01-05 12:40     ` Jiri Olsa
  2024-01-06  0:18       ` John Fastabend
  2024-02-14  5:47     ` Leon Hwang
  3 siblings, 1 reply; 27+ messages in thread
From: Jiri Olsa @ 2024-01-05 12:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Fijalkowski, Maciej, Jakub Sitnicki,
	Ilya Leoshkevich, Hengqi Chen, kernel-patches-bot

On Thu, Jan 04, 2024 at 08:15:36PM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index fe30b9ebb8de4..67fa337fc2e0c 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -259,7 +259,7 @@ struct jit_context {
> >  /* Number of bytes emit_patch() needs to generate instructions */
> >  #define X86_PATCH_SIZE         5
> >  /* Number of bytes that will be skipped on tailcall */
> > -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
> > +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
> >
> >  static void push_r12(u8 **pprog)
> >  {
> > @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >          */
> >         emit_nops(&prog, X86_PATCH_SIZE);
> >         if (!ebpf_from_cbpf) {
> > -               if (tail_call_reachable && !is_subprog)
> > +               if (tail_call_reachable && !is_subprog) {
> >                         /* When it's the entry of the whole tailcall context,
> >                          * zeroing rax means initialising tail_call_cnt.
> >                          */
> > -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
> > -               else
> > -                       /* Keep the same instruction layout. */
> > -                       EMIT2(0x66, 0x90); /* nop2 */
> > +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
> > +                       EMIT1(0x50);             /* push rax */
> > +                       /* Make rax as ptr that points to tail_call_cnt. */
> > +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> > +                       EMIT1_off32(0xE8, 2);    /* call main prog */
> > +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> > +                       EMIT1(0xC3);             /* ret */
> > +               } else {
> > +                       /* Keep the same instruction size. */
> > +                       emit_nops(&prog, 13);
> > +               }
> 
> I'm afraid the extra call breaks stack unwinding and many other things.
> The proper frame needs to be setup (push rbp; etc)
> and 'leave' + emit_return() is used.
> Plain 'ret' is not ok.
> x86_call_depth_emit_accounting() needs to be used too.
> That will make X86_TAIL_CALL_OFFSET adjustment very complicated.
> Also the fix doesn't address the stack size issue.
> We shouldn't allow all the extra frames at run-time.
> 
> The tail_cnt_ptr approach is interesting but too heavy,
> since arm64, s390 and other JITs would need to repeat it with equally
> complicated calculations in TAIL_CALL_OFFSET.
> 
> The fix should really be thought through for all JITs. Not just x86.
> 
> I'm thinking whether we should do the following instead:
> 
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 0bdbbbeab155..0b45571559be 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -910,7 +910,7 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
>         if (IS_ERR(prog))
>                 return prog;
> 
> -       if (!bpf_prog_map_compatible(map, prog)) {
> +       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
>                 bpf_prog_put(prog);
>                 return ERR_PTR(-EINVAL);
>         }
> 
> This will stop stack growth, but it will break a few existing tests.
> I feel it's a price worth paying.
> 
> John, Daniel,
> 
> do you see anything breaking on cilium side if we disallow
> progs with subprogs to be inserted in prog_array ?

FWIW tetragon should be ok with this.. we use few subprograms in
hubble, but most of them are not called from tail called programs

jirka

> 
> Other alternatives?
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05  6:15     ` Leon Hwang
@ 2024-01-05 17:43       ` Alexei Starovoitov
  2024-01-06  2:38         ` Leon Hwang
  0 siblings, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2024-01-05 17:43 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot

On Thu, Jan 4, 2024 at 10:16 PM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
>
> On 5/1/24 12:15, Alexei Starovoitov wrote:
> > On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >>
> >>
> >> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> >> index fe30b9ebb8de4..67fa337fc2e0c 100644
> >> --- a/arch/x86/net/bpf_jit_comp.c
> >> +++ b/arch/x86/net/bpf_jit_comp.c
> >> @@ -259,7 +259,7 @@ struct jit_context {
> >>  /* Number of bytes emit_patch() needs to generate instructions */
> >>  #define X86_PATCH_SIZE         5
> >>  /* Number of bytes that will be skipped on tailcall */
> >> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
> >> +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
> >>
> >>  static void push_r12(u8 **pprog)
> >>  {
> >> @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >>          */
> >>         emit_nops(&prog, X86_PATCH_SIZE);
> >>         if (!ebpf_from_cbpf) {
> >> -               if (tail_call_reachable && !is_subprog)
> >> +               if (tail_call_reachable && !is_subprog) {
> >>                         /* When it's the entry of the whole tailcall context,
> >>                          * zeroing rax means initialising tail_call_cnt.
> >>                          */
> >> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
> >> -               else
> >> -                       /* Keep the same instruction layout. */
> >> -                       EMIT2(0x66, 0x90); /* nop2 */
> >> +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
> >> +                       EMIT1(0x50);             /* push rax */
> >> +                       /* Make rax as ptr that points to tail_call_cnt. */
> >> +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> >> +                       EMIT1_off32(0xE8, 2);    /* call main prog */
> >> +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> >> +                       EMIT1(0xC3);             /* ret */
> >> +               } else {
> >> +                       /* Keep the same instruction size. */
> >> +                       emit_nops(&prog, 13);
> >> +               }
> >
> > I'm afraid the extra call breaks stack unwinding and many other things.
>
> I was worried about it. But I'm not sure how it breaks stack unwinding.
>
> However, without the extra call, I've tried another approach:
>
> * [RFC PATCH bpf-next 1/3] bpf, x64: Fix tailcall hierarchy
>   https://lore.kernel.org/bpf/20231005145814.83122-2-hffilwlqm@gmail.com/
>
> It's to propagate tail_call_cnt_ptr, too. But more complicated:
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 8c10d9abc..001c5e4b7 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -313,24 +332,15 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>                           bool tail_call_reachable, bool is_subprog,
>                           bool is_exception_cb)
>  {
> +       int tcc_ptr_off = round_up(stack_depth, 8) + 8;
> +       int tcc_off = tcc_ptr_off + 8;
>         u8 *prog = *pprog;
>
>         /* BPF trampoline can be made to work without these nops,
>          * but let's waste 5 bytes for now and optimize later
>          */
>         EMIT_ENDBR();
> -       memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
> -       prog += X86_PATCH_SIZE;
> -       if (!ebpf_from_cbpf) {
> -               if (tail_call_reachable && !is_subprog)
> -                       /* When it's the entry of the whole tailcall context,
> -                        * zeroing rax means initialising tail_call_cnt.
> -                        */
> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
> -               else
> -                       /* Keep the same instruction layout. */
> -                       EMIT2(0x66, 0x90); /* nop2 */
> -       }
> +       emit_nops(&prog, X86_PATCH_SIZE);
>         /* Exception callback receives FP as third parameter */
>         if (is_exception_cb) {
>                 EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
> @@ -347,15 +357,52 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>                 EMIT1(0x55);             /* push rbp */
>                 EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
>         }
> +       if (!ebpf_from_cbpf) {
> +               if (tail_call_reachable && !is_subprog) {
> +                       /* Make rax as ptr that points to tail_call_cnt. */
> +                       EMIT3(0x48, 0x89, 0xE8);          /* mov rax, rbp */
> +                       EMIT2_off32(0x48, 0x2D, tcc_off); /* sub rax, tcc_off */
> +                       /* When it's the entry of the whole tail call context,
> +                        * storing 0 means initialising tail_call_cnt.
> +                        */
> +                       EMIT2_off32(0xC7, 0x00, 0);       /* mov dword ptr [rax], 0 */
> +               } else {
> +                       /* Keep the same instruction layout. */
> +                       emit_nops(&prog, 3);
> +                       emit_nops(&prog, 6);
> +                       emit_nops(&prog, 6);

Extra 15 nops in the prologue of every bpf program (tailcall or not)
is too high a price to pay.

Think of a simple fix on the verifier side, or a simple approach
that all JITs can easily do.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05 10:33     ` Leon Hwang
@ 2024-01-05 17:47       ` Alexei Starovoitov
  2024-01-06  2:33         ` Leon Hwang
  0 siblings, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2024-01-05 17:47 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot

On Fri, Jan 5, 2024 at 2:34 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
>
> On 5/1/24 12:15, Alexei Starovoitov wrote:
> > On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >>
> >>
> >> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> >> index fe30b9ebb8de4..67fa337fc2e0c 100644
> >> --- a/arch/x86/net/bpf_jit_comp.c
> >> +++ b/arch/x86/net/bpf_jit_comp.c
> >> @@ -259,7 +259,7 @@ struct jit_context {
> >>  /* Number of bytes emit_patch() needs to generate instructions */
> >>  #define X86_PATCH_SIZE         5
> >>  /* Number of bytes that will be skipped on tailcall */
> >> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
> >> +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
> >>
> >>  static void push_r12(u8 **pprog)
> >>  {
> >> @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >>          */
> >>         emit_nops(&prog, X86_PATCH_SIZE);
> >>         if (!ebpf_from_cbpf) {
> >> -               if (tail_call_reachable && !is_subprog)
> >> +               if (tail_call_reachable && !is_subprog) {
> >>                         /* When it's the entry of the whole tailcall context,
> >>                          * zeroing rax means initialising tail_call_cnt.
> >>                          */
> >> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
> >> -               else
> >> -                       /* Keep the same instruction layout. */
> >> -                       EMIT2(0x66, 0x90); /* nop2 */
> >> +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
> >> +                       EMIT1(0x50);             /* push rax */
> >> +                       /* Make rax as ptr that points to tail_call_cnt. */
> >> +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> >> +                       EMIT1_off32(0xE8, 2);    /* call main prog */
> >> +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> >> +                       EMIT1(0xC3);             /* ret */
> >> +               } else {
> >> +                       /* Keep the same instruction size. */
> >> +                       emit_nops(&prog, 13);
> >> +               }
> >
> > I'm afraid the extra call breaks stack unwinding and many other things.
> > The proper frame needs to be setup (push rbp; etc)
> > and 'leave' + emit_return() is used.
> > Plain 'ret' is not ok.
> > x86_call_depth_emit_accounting() needs to be used too.
> > That will make X86_TAIL_CALL_OFFSET adjustment very complicated.
> > Also the fix doesn't address the stack size issue.
> > We shouldn't allow all the extra frames at run-time.
> >
> > The tail_cnt_ptr approach is interesting but too heavy,
> > since arm64, s390 and other JITs would need to repeat it with equally
> > complicated calculations in TAIL_CALL_OFFSET.
> >
> > The fix should really be thought through for all JITs. Not just x86.
> >
> > I'm thinking whether we should do the following instead:
> >
> > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > index 0bdbbbeab155..0b45571559be 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -910,7 +910,7 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
> >         if (IS_ERR(prog))
> >                 return prog;
> >
> > -       if (!bpf_prog_map_compatible(map, prog)) {
> > +       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
> >                 bpf_prog_put(prog);
> >                 return ERR_PTR(-EINVAL);
> >         }
> >
> > This will stop stack growth, but it will break a few existing tests.
> > I feel it's a price worth paying.
>
> I don't think this can avoid this issue completely.
>
> For example:
>
> #include "vmlinux.h"
>
> #include "bpf_helpers.h"
>
> struct {
>     __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>     __uint(max_entries, 1);
>     __uint(key_size, sizeof(__u32));
>     __uint(value_size, sizeof(__u32));
> } prog_array SEC(".maps");
>
>
> static __noinline int
> subprog(struct __sk_buff *skb)
> {
>     volatile int retval = 0;
>
>     bpf_tail_call(skb, &prog_array, 0);
>
>     return retval;
> }
>
> SEC("tc")
> int entry(struct __sk_buff *skb)
> {
>     const int N = 10000;
>
>     for (int i = 0; i < N; i++)
>         subprog(skb);
>
>     return 0;
> }
>
> char _license[] SEC("license") = "GPL";
>
> Then, objdump its asm:
>
> Disassembly of section .text:
>
> 0000000000000000 <subprog>:
> ; {
>        0:       b7 02 00 00 00 00 00 00 r2 = 0x0
> ;     volatile int retval = 0;
>        1:       63 2a fc ff 00 00 00 00 *(u32 *)(r10 - 0x4) = r2
> ;     bpf_tail_call(skb, &prog_array, 0);
>        2:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
>        4:       b7 03 00 00 00 00 00 00 r3 = 0x0
>        5:       85 00 00 00 0c 00 00 00 call 0xc
> ;     return retval;
>        6:       61 a1 fc ff 00 00 00 00 r1 = *(u32 *)(r10 - 0x4)
>        7:       95 00 00 00 00 00 00 00 exit
>
> Disassembly of section tc:
>
> 0000000000000000 <entry>:
> ; {
>        0:       bf 16 00 00 00 00 00 00 r6 = r1
>        1:       b7 07 00 00 10 27 00 00 r7 = 0x2710
>
> 0000000000000010 <LBB0_1>:
> ;         subprog(skb);
>        2:       bf 61 00 00 00 00 00 00 r1 = r6
>        3:       85 10 00 00 ff ff ff ff call -0x1
> ;     for (int i = 0; i < N; i++)
>        4:       07 07 00 00 ff ff ff ff r7 += -0x1
>        5:       bf 71 00 00 00 00 00 00 r1 = r7
>        6:       67 01 00 00 20 00 00 00 r1 <<= 0x20
>        7:       77 01 00 00 20 00 00 00 r1 >>= 0x20
>        8:       15 01 01 00 00 00 00 00 if r1 == 0x0 goto +0x1 <LBB0_2>
>        9:       05 00 f8 ff 00 00 00 00 goto -0x8 <LBB0_1>
>
> 0000000000000050 <LBB0_2>:
> ;     return 0;
>       10:       b7 00 00 00 00 00 00 00 r0 = 0x0
>       11:       95 00 00 00 00 00 00 00 exit
>
> As a result, the bpf prog in prog_array can be tail-called N times, even
> though that prog itself has no subprogs.

You mean that total execution time is N*N ?
and tailcall is a way to increase loop count?
We allow BPF_MAX_LOOPS = 8 * 1024 * 1024 in bpf_loop,
so many calls to subprog(skb); is not an issue
as long as they don't stall cpu and don't increase stack size.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05 12:40     ` Jiri Olsa
@ 2024-01-06  0:18       ` John Fastabend
  2024-01-06  3:46         ` Alexei Starovoitov
  0 siblings, 1 reply; 27+ messages in thread
From: John Fastabend @ 2024-01-06  0:18 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov
  Cc: Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Fijalkowski, Maciej, Jakub Sitnicki,
	Ilya Leoshkevich, Hengqi Chen, kernel-patches-bot

Jiri Olsa wrote:
> On Thu, Jan 04, 2024 at 08:15:36PM -0800, Alexei Starovoitov wrote:
> > On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> > >
> > >
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index fe30b9ebb8de4..67fa337fc2e0c 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -259,7 +259,7 @@ struct jit_context {
> > >  /* Number of bytes emit_patch() needs to generate instructions */
> > >  #define X86_PATCH_SIZE         5
> > >  /* Number of bytes that will be skipped on tailcall */
> > > -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
> > > +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
> > >
> > >  static void push_r12(u8 **pprog)
> > >  {
> > > @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> > >          */
> > >         emit_nops(&prog, X86_PATCH_SIZE);
> > >         if (!ebpf_from_cbpf) {
> > > -               if (tail_call_reachable && !is_subprog)
> > > +               if (tail_call_reachable && !is_subprog) {
> > >                         /* When it's the entry of the whole tailcall context,
> > >                          * zeroing rax means initialising tail_call_cnt.
> > >                          */
> > > -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
> > > -               else
> > > -                       /* Keep the same instruction layout. */
> > > -                       EMIT2(0x66, 0x90); /* nop2 */
> > > +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
> > > +                       EMIT1(0x50);             /* push rax */
> > > +                       /* Make rax as ptr that points to tail_call_cnt. */
> > > +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> > > +                       EMIT1_off32(0xE8, 2);    /* call main prog */
> > > +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> > > +                       EMIT1(0xC3);             /* ret */
> > > +               } else {
> > > +                       /* Keep the same instruction size. */
> > > +                       emit_nops(&prog, 13);
> > > +               }
> > 
> > I'm afraid the extra call breaks stack unwinding and many other things.
> > The proper frame needs to be setup (push rbp; etc)
> > and 'leave' + emit_return() is used.
> > Plain 'ret' is not ok.
> > x86_call_depth_emit_accounting() needs to be used too.
> > That will make X86_TAIL_CALL_OFFSET adjustment very complicated.
> > Also the fix doesn't address the stack size issue.
> > We shouldn't allow all the extra frames at run-time.
> > 
> > The tail_cnt_ptr approach is interesting but too heavy,
> > since arm64, s390 and other JITs would need to repeat it with equally
> > complicated calculations in TAIL_CALL_OFFSET.
> > 
> > The fix should really be thought through for all JITs. Not just x86.
> > 
> > I'm thinking whether we should do the following instead:
> > 
> > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > index 0bdbbbeab155..0b45571559be 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -910,7 +910,7 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
> >         if (IS_ERR(prog))
> >                 return prog;
> > 
> > -       if (!bpf_prog_map_compatible(map, prog)) {
> > +       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
> >                 bpf_prog_put(prog);
> >                 return ERR_PTR(-EINVAL);
> >         }
> > 
> > This will stop stack growth, but it will break a few existing tests.
> > I feel it's a price worth paying.
> > 
> > John, Daniel,
> > 
> > do you see anything breaking on cilium side if we disallow
> > progs with subprogs to be inserted in prog_array ?
> 
> FWIW tetragon should be ok with this.. we use few subprograms in
> hubble, but most of them are not called from tail called programs

We actually do this in some of the l7 parsers where we try to use
subprogs as much as possible but still use prog_array for calls
that might end up being recursive.

I'll think about it Monday; my first thought is that some refactoring and
using bpf_loop or other helpers could resolve this.

If it's just bpf-next and not backported onto stable, we can probably
figure something out.
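
A rough sketch of that kind of refactor (the parser name, context layout and
SEC("tc") attach point are all made up here, just to keep the sketch
self-contained), replacing the "tail-call myself per record" pattern with
bpf_loop():

#include "vmlinux.h"

#include "bpf_helpers.h"

struct parse_ctx {
    struct __sk_buff *skb;
    __u32 off;
    int done;
};

static int parse_one_record(__u32 index, void *data)
{
    struct parse_ctx *pctx = data;

    /* the real parser would consume one record at pctx->off here,
     * advance pctx->off and set pctx->done once the payload is exhausted
     */
    if (pctx->done)
        return 1;    /* non-zero return stops bpf_loop() */
    return 0;
}

SEC("tc")
int l7_parser(struct __sk_buff *skb)
{
    struct parse_ctx pctx = { .skb = skb, .off = 0, .done = 0 };

    /* instead of bpf_tail_call(skb, &jmp_table, SELF_IDX) per record */
    bpf_loop(1024, parse_one_record, &pctx, 0);

    return 0;
}

char _license[] SEC("license") = "GPL";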

> 
> jirka
> 
> > 
> > Other alternatives?
> > 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05 17:47       ` Alexei Starovoitov
@ 2024-01-06  2:33         ` Leon Hwang
  2024-01-06  3:34           ` Alexei Starovoitov
  0 siblings, 1 reply; 27+ messages in thread
From: Leon Hwang @ 2024-01-06  2:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 2024/1/6 01:47, Alexei Starovoitov wrote:
> On Fri, Jan 5, 2024 at 2:34 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>>
>> On 5/1/24 12:15, Alexei Starovoitov wrote:
>>> On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>>>
>>>>
>>>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>>>> index fe30b9ebb8de4..67fa337fc2e0c 100644
>>>> --- a/arch/x86/net/bpf_jit_comp.c
>>>> +++ b/arch/x86/net/bpf_jit_comp.c
>>>> @@ -259,7 +259,7 @@ struct jit_context {
>>>>  /* Number of bytes emit_patch() needs to generate instructions */
>>>>  #define X86_PATCH_SIZE         5
>>>>  /* Number of bytes that will be skipped on tailcall */
>>>> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
>>>> +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
>>>>
>>>>  static void push_r12(u8 **pprog)
>>>>  {
>>>> @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>>>          */
>>>>         emit_nops(&prog, X86_PATCH_SIZE);
>>>>         if (!ebpf_from_cbpf) {
>>>> -               if (tail_call_reachable && !is_subprog)
>>>> +               if (tail_call_reachable && !is_subprog) {
>>>>                         /* When it's the entry of the whole tailcall context,
>>>>                          * zeroing rax means initialising tail_call_cnt.
>>>>                          */
>>>> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
>>>> -               else
>>>> -                       /* Keep the same instruction layout. */
>>>> -                       EMIT2(0x66, 0x90); /* nop2 */
>>>> +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
>>>> +                       EMIT1(0x50);             /* push rax */
>>>> +                       /* Make rax as ptr that points to tail_call_cnt. */
>>>> +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
>>>> +                       EMIT1_off32(0xE8, 2);    /* call main prog */
>>>> +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
>>>> +                       EMIT1(0xC3);             /* ret */
>>>> +               } else {
>>>> +                       /* Keep the same instruction size. */
>>>> +                       emit_nops(&prog, 13);
>>>> +               }
>>>
>>> I'm afraid the extra call breaks stack unwinding and many other things.
>>> The proper frame needs to be setup (push rbp; etc)
>>> and 'leave' + emit_return() is used.
>>> Plain 'ret' is not ok.
>>> x86_call_depth_emit_accounting() needs to be used too.
>>> That will make X86_TAIL_CALL_OFFSET adjustment very complicated.
>>> Also the fix doesn't address the stack size issue.
>>> We shouldn't allow all the extra frames at run-time.
>>>
>>> The tail_cnt_ptr approach is interesting but too heavy,
>>> since arm64, s390 and other JITs would need to repeat it with equally
>>> complicated calculations in TAIL_CALL_OFFSET.
>>>
>>> The fix should really be thought through for all JITs. Not just x86.
>>>
>>> I'm thinking whether we should do the following instead:
>>>
>>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
>>> index 0bdbbbeab155..0b45571559be 100644
>>> --- a/kernel/bpf/arraymap.c
>>> +++ b/kernel/bpf/arraymap.c
>>> @@ -910,7 +910,7 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
>>>         if (IS_ERR(prog))
>>>                 return prog;
>>>
>>> -       if (!bpf_prog_map_compatible(map, prog)) {
>>> +       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
>>>                 bpf_prog_put(prog);
>>>                 return ERR_PTR(-EINVAL);
>>>         }
>>>
>>> This will stop stack growth, but it will break a few existing tests.
>>> I feel it's a price worth paying.
>>
>> I don't think this can avoid this issue completely.
>>
>> For example:
>>
>> #include "vmlinux.h"
>>
>> #include "bpf_helpers.h"
>>
>> struct {
>>     __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>     __uint(max_entries, 1);
>>     __uint(key_size, sizeof(__u32));
>>     __uint(value_size, sizeof(__u32));
>> } prog_array SEC(".maps");
>>
>>
>> static __noinline int
>> subprog(struct __sk_buff *skb)
>> {
>>     volatile int retval = 0;
>>
>>     bpf_tail_call(skb, &prog_array, 0);
>>
>>     return retval;
>> }
>>
>> SEC("tc")
>> int entry(struct __sk_buff *skb)
>> {
>>     const int N = 10000;
>>
>>     for (int i = 0; i < N; i++)
>>         subprog(skb);
>>
>>     return 0;
>> }
>>
>> char _license[] SEC("license") = "GPL";
>>
>> Then, objdump its asm:
>>
>> Disassembly of section .text:
>>
>> 0000000000000000 <subprog>:
>> ; {
>>        0:       b7 02 00 00 00 00 00 00 r2 = 0x0
>> ;     volatile int retval = 0;
>>        1:       63 2a fc ff 00 00 00 00 *(u32 *)(r10 - 0x4) = r2
>> ;     bpf_tail_call(skb, &prog_array, 0);
>>        2:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
>>        4:       b7 03 00 00 00 00 00 00 r3 = 0x0
>>        5:       85 00 00 00 0c 00 00 00 call 0xc
>> ;     return retval;
>>        6:       61 a1 fc ff 00 00 00 00 r1 = *(u32 *)(r10 - 0x4)
>>        7:       95 00 00 00 00 00 00 00 exit
>>
>> Disassembly of section tc:
>>
>> 0000000000000000 <entry>:
>> ; {
>>        0:       bf 16 00 00 00 00 00 00 r6 = r1
>>        1:       b7 07 00 00 10 27 00 00 r7 = 0x2710
>>
>> 0000000000000010 <LBB0_1>:
>> ;         subprog(skb);
>>        2:       bf 61 00 00 00 00 00 00 r1 = r6
>>        3:       85 10 00 00 ff ff ff ff call -0x1
>> ;     for (int i = 0; i < N; i++)
>>        4:       07 07 00 00 ff ff ff ff r7 += -0x1
>>        5:       bf 71 00 00 00 00 00 00 r1 = r7
>>        6:       67 01 00 00 20 00 00 00 r1 <<= 0x20
>>        7:       77 01 00 00 20 00 00 00 r1 >>= 0x20
>>        8:       15 01 01 00 00 00 00 00 if r1 == 0x0 goto +0x1 <LBB0_2>
>>        9:       05 00 f8 ff 00 00 00 00 goto -0x8 <LBB0_1>
>>
>> 0000000000000050 <LBB0_2>:
>> ;     return 0;
>>       10:       b7 00 00 00 00 00 00 00 r0 = 0x0
>>       11:       95 00 00 00 00 00 00 00 exit
>>
>> As a result, the bpf prog in prog_array can be tail-called N times, even
>> though that prog itself has no subprogs.
> 
> You mean that total execution time is N*N ?

No, it's N. There are N tail calls, because entry() calls subprog() N times.

> and tailcall is a way to increase loop count?

Yes, this is one way. And the MAX_TAIL_CALL_CNT limit does not help in this case.

> We allow BPF_MAX_LOOPS = 8 * 1024 * 1024 in bpf_loop,
> so many calls to subprog(skb); is not an issue
> as long as they don't stall cpu and don't increase stack size.

What if there are BPF_MAX_LOOPS calls to subprog(skb) and there are also
BPF_MAX_LOOPS loops in the tail-callee bpf prog?

Thanks,
Leon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05 17:43       ` Alexei Starovoitov
@ 2024-01-06  2:38         ` Leon Hwang
  0 siblings, 0 replies; 27+ messages in thread
From: Leon Hwang @ 2024-01-06  2:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 2024/1/6 01:43, Alexei Starovoitov wrote:
> On Thu, Jan 4, 2024 at 10:16 PM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>>
>> On 5/1/24 12:15, Alexei Starovoitov wrote:
>>> On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>>>
>>>>
>>>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>>>> index fe30b9ebb8de4..67fa337fc2e0c 100644
>>>> --- a/arch/x86/net/bpf_jit_comp.c
>>>> +++ b/arch/x86/net/bpf_jit_comp.c
>>>> @@ -259,7 +259,7 @@ struct jit_context {
>>>>  /* Number of bytes emit_patch() needs to generate instructions */
>>>>  #define X86_PATCH_SIZE         5
>>>>  /* Number of bytes that will be skipped on tailcall */
>>>> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
>>>> +#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
>>>>
>>>>  static void push_r12(u8 **pprog)
>>>>  {
>>>> @@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>>>          */
>>>>         emit_nops(&prog, X86_PATCH_SIZE);
>>>>         if (!ebpf_from_cbpf) {
>>>> -               if (tail_call_reachable && !is_subprog)
>>>> +               if (tail_call_reachable && !is_subprog) {
>>>>                         /* When it's the entry of the whole tailcall context,
>>>>                          * zeroing rax means initialising tail_call_cnt.
>>>>                          */
>>>> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
>>>> -               else
>>>> -                       /* Keep the same instruction layout. */
>>>> -                       EMIT2(0x66, 0x90); /* nop2 */
>>>> +                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
>>>> +                       EMIT1(0x50);             /* push rax */
>>>> +                       /* Make rax as ptr that points to tail_call_cnt. */
>>>> +                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
>>>> +                       EMIT1_off32(0xE8, 2);    /* call main prog */
>>>> +                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
>>>> +                       EMIT1(0xC3);             /* ret */
>>>> +               } else {
>>>> +                       /* Keep the same instruction size. */
>>>> +                       emit_nops(&prog, 13);
>>>> +               }
>>>
>>> I'm afraid the extra call breaks stack unwinding and many other things.
>>
>> I was worried about it. But I'm not sure how it breaks stack unwinding.
>>
>> However, without the extra call, I've tried another approach:
>>
>> * [RFC PATCH bpf-next 1/3] bpf, x64: Fix tailcall hierarchy
>>   https://lore.kernel.org/bpf/20231005145814.83122-2-hffilwlqm@gmail.com/
>>
>> It's to propagate tail_call_cnt_ptr, too. But more complicated:
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 8c10d9abc..001c5e4b7 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -313,24 +332,15 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>                           bool tail_call_reachable, bool is_subprog,
>>                           bool is_exception_cb)
>>  {
>> +       int tcc_ptr_off = round_up(stack_depth, 8) + 8;
>> +       int tcc_off = tcc_ptr_off + 8;
>>         u8 *prog = *pprog;
>>
>>         /* BPF trampoline can be made to work without these nops,
>>          * but let's waste 5 bytes for now and optimize later
>>          */
>>         EMIT_ENDBR();
>> -       memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
>> -       prog += X86_PATCH_SIZE;
>> -       if (!ebpf_from_cbpf) {
>> -               if (tail_call_reachable && !is_subprog)
>> -                       /* When it's the entry of the whole tailcall context,
>> -                        * zeroing rax means initialising tail_call_cnt.
>> -                        */
>> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
>> -               else
>> -                       /* Keep the same instruction layout. */
>> -                       EMIT2(0x66, 0x90); /* nop2 */
>> -       }
>> +       emit_nops(&prog, X86_PATCH_SIZE);
>>         /* Exception callback receives FP as third parameter */
>>         if (is_exception_cb) {
>>                 EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
>> @@ -347,15 +357,52 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>                 EMIT1(0x55);             /* push rbp */
>>                 EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
>>         }
>> +       if (!ebpf_from_cbpf) {
>> +               if (tail_call_reachable && !is_subprog) {
>> +                       /* Make rax as ptr that points to tail_call_cnt. */
>> +                       EMIT3(0x48, 0x89, 0xE8);          /* mov rax, rbp */
>> +                       EMIT2_off32(0x48, 0x2D, tcc_off); /* sub rax, tcc_off */
>> +                       /* When it's the entry of the whole tail call context,
>> +                        * storing 0 means initialising tail_call_cnt.
>> +                        */
>> +                       EMIT2_off32(0xC7, 0x00, 0);       /* mov dword ptr [rax], 0 */
>> +               } else {
>> +                       /* Keep the same instruction layout. */
>> +                       emit_nops(&prog, 3);
>> +                       emit_nops(&prog, 6);
>> +                       emit_nops(&prog, 6);
> 
> Extra 15 nops in the prologue of every bpf program (tailcall or not)
> is too high a price to pay.
> 
> Think of a simple fix on the verifier side, or a simple approach
> that all JITs can easily do.

It's not easy, but I'll give it a hard try.

Thanks,
Leon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-06  2:33         ` Leon Hwang
@ 2024-01-06  3:34           ` Alexei Starovoitov
  0 siblings, 0 replies; 27+ messages in thread
From: Alexei Starovoitov @ 2024-01-06  3:34 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot

On Fri, Jan 5, 2024 at 6:33 PM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
> > We allow BPF_MAX_LOOPS = 8 * 1024 * 1024 in bpf_loop,
> > so many calls to subprog(skb); is not an issue
> > as long as they don't stall cpu and don't increase stack size.
>
> What if there are BPF_MAX_LOOPS calls to subprog(skb) and there are also
> BPF_MAX_LOOPS loops in the tail-callee bpf prog?

It's fine. Every bpf_loop is capped individually.
We're working on generic "cancelling" of bpf progs
when they consume too much cpu time.
There is no way to do run-time counters.

PS pls trim your replies.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-06  0:18       ` John Fastabend
@ 2024-01-06  3:46         ` Alexei Starovoitov
  0 siblings, 0 replies; 27+ messages in thread
From: Alexei Starovoitov @ 2024-01-06  3:46 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jiri Olsa, Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Fijalkowski, Maciej, Jakub Sitnicki,
	Ilya Leoshkevich, Hengqi Chen, kernel-patches-bot

On Fri, Jan 5, 2024 at 4:18 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> > > -       if (!bpf_prog_map_compatible(map, prog)) {
> > > +       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
> > >                 bpf_prog_put(prog);
> > >                 return ERR_PTR(-EINVAL);
> > >         }
> > >
> > > This will stop stack growth, but it will break a few existing tests.
> > > I feel it's a price worth paying.
> > >
> > > John, Daniel,
> > >
> > > do you see anything breaking on cilium side if we disallow
> > > progs with subprogs to be inserted in prog_array ?
> >
> > FWIW tetragon should be ok with this.. we use few subprograms in
> > hubble, but most of them are not called from tail called programs
>
> We actually do this in some of the l7 parsers where we try to use
> subprogs as much as possible but still use prog_array for calls
> that might end up being recursive.

So you do tail_call into a prog that has subprogs. Ok.
Any pointers to the code? (Just for my own education)

Anyway, we need to come up with something better.

I've been trying to play with a few ideas on how to propagate %rax
back from the subprog into the caller, since the frame layout is known,
but I'm struggling to express it in x86 asm.
Roughly:
RESTORE_TAIL_CALL_CNT
emit_call
copy from the already-dead stack area back into this frame's tail_call_cnt.
Since IRQs on x86 use a different stack, it should be ok?

If I'm wrong about the stack usage and the dead stack can be scratched,
then emit_return() can store tcc into %rdx and after emit_call we
take it from there.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-01-05  4:15   ` Alexei Starovoitov
                       ` (2 preceding siblings ...)
  2024-01-05 12:40     ` Jiri Olsa
@ 2024-02-14  5:47     ` Leon Hwang
  2024-02-14 11:25       ` Maciej Fijalkowski
  2024-02-14 23:16       ` Alexei Starovoitov
  3 siblings, 2 replies; 27+ messages in thread
From: Leon Hwang @ 2024-02-14  5:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 2024/1/5 12:15, Alexei Starovoitov wrote:
> On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
> 
> Other alternatives?

I've finished the POC of an alternative, which passed all tailcall
selftests, including these tailcall hierarchy ones.

In this alternative, I use a new bpf_prog_run_ctx to wrap the original
ctx and the tcc_ptr, then get the tcc_ptr and recover the original ctx
in JIT.

Then, to avoid breaking the tailcall runtime on other arches, I add an
arch-related check, bpf_jit_supports_tail_call_cnt_ptr(), to determine
whether to use bpf_prog_run_ctx.

Here's the diff:

 diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4065bdcc5b2a4..56cea2676863e 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -259,7 +259,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(16 + ENDBR_INSN_SIZE)

 static void push_r12(u8 **pprog)
 {
@@ -407,21 +407,19 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
 		if (tail_call_reachable && !is_subprog) {
-			/* When it's the entry of the whole tailcall context,
-			 * zeroing rax means initialising tail_call_cnt.
-			 */
-			EMIT2(0x31, 0xC0);       /* xor eax, eax */
-			EMIT1(0x50);             /* push rax */
-			/* Make rax as ptr that points to tail_call_cnt. */
-			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
-			EMIT1_off32(0xE8, 2);    /* call main prog */
-			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
-			EMIT1(0xC3);             /* ret */
+			/* Make rax as tcc_ptr. */
+			EMIT4(0x48, 0x8B, 0x47, 0x08); /* mov rax, qword ptr [rdi + 8] */
 		} else {
-			/* Keep the same instruction size. */
-			emit_nops(&prog, 13);
+			/* Keep the same instruction layout. */
+			emit_nops(&prog, 4);
 		}
 	}
+	if (!is_subprog)
+		/* Recover the original ctx. */
+		EMIT3(0x48, 0x8B, 0x3F); /* mov rdi, qword ptr [rdi] */
+	else
+		/* Keep the same instruction layout. */
+		emit_nops(&prog, 3);
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
 		EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
@@ -3152,6 +3150,12 @@ bool bpf_jit_supports_subprog_tailcalls(void)
 	return true;
 }

+/* Indicate the JIT backend supports tail call count pointer in tailcall context. */
+bool bpf_jit_supports_tail_call_cnt_ptr(void)
+{
+	return true;
+}
+
 void bpf_jit_free(struct bpf_prog *prog)
 {
 	if (prog->jited) {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7671530d6e4e0..fea4326c27d31 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1919,6 +1919,11 @@ int bpf_prog_array_copy(struct bpf_prog_array *old_array,
 			u64 bpf_cookie,
 			struct bpf_prog_array **new_array);

+struct bpf_prog_run_ctx {
+	const void *ctx;
+	u32 *tail_call_cnt;
+};
+
 struct bpf_run_ctx {};

 struct bpf_cg_run_ctx {
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 68fb6c8142fec..c1c035c44b4ab 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -629,6 +629,10 @@ typedef unsigned int (*bpf_dispatcher_fn)(const void *ctx,
 					  unsigned int (*bpf_func)(const void *,
 								   const struct bpf_insn *));

+static __always_inline u32 __bpf_prog_run_dfunc(const struct bpf_prog *prog,
+						const void *ctx,
+						bpf_dispatcher_fn dfunc);
+
 static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
 					  const void *ctx,
 					  bpf_dispatcher_fn dfunc)
@@ -641,14 +645,14 @@ static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
 		u64 start = sched_clock();
 		unsigned long flags;

-		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
+		ret = __bpf_prog_run_dfunc(prog, ctx, dfunc);
 		stats = this_cpu_ptr(prog->stats);
 		flags = u64_stats_update_begin_irqsave(&stats->syncp);
 		u64_stats_inc(&stats->cnt);
 		u64_stats_add(&stats->nsecs, sched_clock() - start);
 		u64_stats_update_end_irqrestore(&stats->syncp, flags);
 	} else {
-		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
+		ret = __bpf_prog_run_dfunc(prog, ctx, dfunc);
 	}
 	return ret;
 }
@@ -952,12 +956,31 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
 bool bpf_jit_needs_zext(void);
 bool bpf_jit_supports_subprog_tailcalls(void);
+bool bpf_jit_supports_tail_call_cnt_ptr(void);
 bool bpf_jit_supports_kfunc_call(void);
 bool bpf_jit_supports_far_kfunc_call(void);
 bool bpf_jit_supports_exceptions(void);
 void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
 bool bpf_helper_changes_pkt_data(void *func);

+static __always_inline u32 __bpf_prog_run_dfunc(const struct bpf_prog *prog,
+						const void *ctx,
+						bpf_dispatcher_fn dfunc)
+{
+	struct bpf_prog_run_ctx run_ctx = {};
+	u32 ret, tcc = 0;
+
+	run_ctx.ctx = ctx;
+	run_ctx.tail_call_cnt = &tcc;
+
+	if (bpf_jit_supports_tail_call_cnt_ptr() && prog->jited)
+		ret = dfunc(&run_ctx, prog->insnsi, prog->bpf_func);
+	else
+		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
+
+	return ret;
+}
+
 static inline bool bpf_dump_raw_ok(const struct cred *cred)
 {
 	/* Reconstruction of call-sites is dependent on kallsyms,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ea6843be2616c..80b20e99456f0 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2915,6 +2915,15 @@ bool __weak bpf_jit_supports_subprog_tailcalls(void)
 	return false;
 }

+/* Return TRUE if the JIT backend supports tail call count pointer in tailcall
+ * context.
+ */
+bool __weak bpf_jit_supports_tail_call_cnt_ptr(void)
+{
+	return false;
+}
+EXPORT_SYMBOL(bpf_jit_supports_tail_call_cnt_ptr);
+
 bool __weak bpf_jit_supports_kfunc_call(void)
 {
 	return false;

Why use EXPORT_SYMBOL here?

It's to avoid these build errors:

ERROR: modpost: "bpf_jit_supports_tail_call_cnt_ptr" [net/sched/act_bpf.ko] undefined!
ERROR: modpost: "bpf_jit_supports_tail_call_cnt_ptr" [net/sched/cls_bpf.ko] undefined!
ERROR: modpost: "bpf_jit_supports_tail_call_cnt_ptr" [net/netfilter/xt_bpf.ko] undefined!
ERROR: modpost: "bpf_jit_supports_tail_call_cnt_ptr" [net/ipv6/ipv6.ko] undefined!

I'm not familiar with this kind of build error. Is it OK to use EXPORT_SYMBOL
here?

Thanks,
Leon


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-14  5:47     ` Leon Hwang
@ 2024-02-14 11:25       ` Maciej Fijalkowski
  2024-02-14 16:31         ` Leon Hwang
  2024-02-14 23:16       ` Alexei Starovoitov
  1 sibling, 1 reply; 27+ messages in thread
From: Maciej Fijalkowski @ 2024-02-14 11:25 UTC (permalink / raw)
  To: Leon Hwang
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Jakub Sitnicki, Ilya Leoshkevich, Hengqi Chen,
	kernel-patches-bot

On Wed, Feb 14, 2024 at 01:47:45PM +0800, Leon Hwang wrote:
> 
> 
> On 2024/1/5 12:15, Alexei Starovoitov wrote:
> > On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >>
> >>
> > 
> > Other alternatives?
> 
> I've finished the POC of an alternative, which passed all tailcall
> selftests, including these tailcall hierarchy ones.
> 
> In this alternative, I use a new bpf_prog_run_ctx to wrap the original
> ctx and the tcc_ptr, then get the tcc_ptr and recover the original ctx
> in JIT.
> 
> Then, to avoid breaking the tailcall runtime on other arches, I add an
> arch-related check, bpf_jit_supports_tail_call_cnt_ptr(), to determine
> whether to use bpf_prog_run_ctx.
> 
> Here's the diff:

This is a diff against your previously proposed solution; it would be good to
see how it currently looks put together (this diff on top of your patch), which
would save us some effort digging up the patch and applying the diff.

> 
>  diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 4065bdcc5b2a4..56cea2676863e 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -259,7 +259,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE		5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET	(16 + ENDBR_INSN_SIZE)
> 
>  static void push_r12(u8 **pprog)
>  {
> @@ -407,21 +407,19 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	emit_nops(&prog, X86_PATCH_SIZE);
>  	if (!ebpf_from_cbpf) {
>  		if (tail_call_reachable && !is_subprog) {
> -			/* When it's the entry of the whole tailcall context,
> -			 * zeroing rax means initialising tail_call_cnt.
> -			 */
> -			EMIT2(0x31, 0xC0);       /* xor eax, eax */
> -			EMIT1(0x50);             /* push rax */
> -			/* Make rax as ptr that points to tail_call_cnt. */
> -			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> -			EMIT1_off32(0xE8, 2);    /* call main prog */
> -			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> -			EMIT1(0xC3);             /* ret */
> +			/* Make rax as tcc_ptr. */
> +			EMIT4(0x48, 0x8B, 0x47, 0x08); /* mov rax, qword ptr [rdi + 8] */
>  		} else {
> -			/* Keep the same instruction size. */
> -			emit_nops(&prog, 13);
> +			/* Keep the same instruction layout. */
> +			emit_nops(&prog, 4);
>  		}
>  	}
> +	if (!is_subprog)
> +		/* Recover the original ctx. */
> +		EMIT3(0x48, 0x8B, 0x3F); /* mov rdi, qword ptr [rdi] */
> +	else
> +		/* Keep the same instruction layout. */
> +		emit_nops(&prog, 3);
>  	/* Exception callback receives FP as third parameter */
>  	if (is_exception_cb) {
>  		EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
> @@ -3152,6 +3150,12 @@ bool bpf_jit_supports_subprog_tailcalls(void)
>  	return true;
>  }
> 
> +/* Indicate the JIT backend supports tail call count pointer in tailcall context. */
> +bool bpf_jit_supports_tail_call_cnt_ptr(void)
> +{
> +	return true;
> +}
> +
>  void bpf_jit_free(struct bpf_prog *prog)
>  {
>  	if (prog->jited) {
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 7671530d6e4e0..fea4326c27d31 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1919,6 +1919,11 @@ int bpf_prog_array_copy(struct bpf_prog_array
> *old_array,
>  			u64 bpf_cookie,
>  			struct bpf_prog_array **new_array);
> 
> +struct bpf_prog_run_ctx {
> +	const void *ctx;
> +	u32 *tail_call_cnt;
> +};
> +
>  struct bpf_run_ctx {};
> 
>  struct bpf_cg_run_ctx {
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 68fb6c8142fec..c1c035c44b4ab 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -629,6 +629,10 @@ typedef unsigned int (*bpf_dispatcher_fn)(const
> void *ctx,
>  					  unsigned int (*bpf_func)(const void *,
>  								   const struct bpf_insn *));
> 
> +static __always_inline u32 __bpf_prog_run_dfunc(const struct bpf_prog
> *prog,
> +						const void *ctx,
> +						bpf_dispatcher_fn dfunc);
> +
>  static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
>  					  const void *ctx,
>  					  bpf_dispatcher_fn dfunc)
> @@ -641,14 +645,14 @@ static __always_inline u32 __bpf_prog_run(const
> struct bpf_prog *prog,
>  		u64 start = sched_clock();
>  		unsigned long flags;
> 
> -		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
> +		ret = __bpf_prog_run_dfunc(prog, ctx, dfunc);
>  		stats = this_cpu_ptr(prog->stats);
>  		flags = u64_stats_update_begin_irqsave(&stats->syncp);
>  		u64_stats_inc(&stats->cnt);
>  		u64_stats_add(&stats->nsecs, sched_clock() - start);
>  		u64_stats_update_end_irqrestore(&stats->syncp, flags);
>  	} else {
> -		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
> +		ret = __bpf_prog_run_dfunc(prog, ctx, dfunc);
>  	}
>  	return ret;
>  }
> @@ -952,12 +956,31 @@ struct bpf_prog *bpf_int_jit_compile(struct
> bpf_prog *prog);
>  void bpf_jit_compile(struct bpf_prog *prog);
>  bool bpf_jit_needs_zext(void);
>  bool bpf_jit_supports_subprog_tailcalls(void);
> +bool bpf_jit_supports_tail_call_cnt_ptr(void);
>  bool bpf_jit_supports_kfunc_call(void);
>  bool bpf_jit_supports_far_kfunc_call(void);
>  bool bpf_jit_supports_exceptions(void);
>  void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64
> sp, u64 bp), void *cookie);
>  bool bpf_helper_changes_pkt_data(void *func);
> 
> +static __always_inline u32 __bpf_prog_run_dfunc(const struct bpf_prog
> *prog,
> +						const void *ctx,
> +						bpf_dispatcher_fn dfunc)
> +{
> +	struct bpf_prog_run_ctx run_ctx = {};
> +	u32 ret, tcc = 0;
> +
> +	run_ctx.ctx = ctx;
> +	run_ctx.tail_call_cnt = &tcc;
> +
> +	if (bpf_jit_supports_tail_call_cnt_ptr() && prog->jited)
> +		ret = dfunc(&run_ctx, prog->insnsi, prog->bpf_func);
> +	else
> +		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
> +
> +	return ret;
> +}
> +
>  static inline bool bpf_dump_raw_ok(const struct cred *cred)
>  {
>  	/* Reconstruction of call-sites is dependent on kallsyms,
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index ea6843be2616c..80b20e99456f0 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2915,6 +2915,15 @@ bool __weak bpf_jit_supports_subprog_tailcalls(void)
>  	return false;
>  }
> 
> +/* Return TRUE if the JIT backend supports tail call count pointer in tailcall
> + * context.
> + */
> +bool __weak bpf_jit_supports_tail_call_cnt_ptr(void)
> +{
> +	return false;
> +}
> +EXPORT_SYMBOL(bpf_jit_supports_tail_call_cnt_ptr);
> +
>  bool __weak bpf_jit_supports_kfunc_call(void)
>  {
>  	return false;
> 
> Why use EXPORT_SYMBOL here?
> 
> It's to avoid the following build errors:
> 
> ERROR: modpost: "bpf_jit_supports_tail_call_cnt_ptr"
> [net/sched/act_bpf.ko] undefined!
> ERROR: modpost: "bpf_jit_supports_tail_call_cnt_ptr"
> [net/sched/cls_bpf.ko] undefined!
> ERROR: modpost: "bpf_jit_supports_tail_call_cnt_ptr"
> [net/netfilter/xt_bpf.ko] undefined!
> ERROR: modpost: "bpf_jit_supports_tail_call_cnt_ptr" [net/ipv6/ipv6.ko]
> undefined!
> 
> I'm not familiar with this kind of build error. Is it OK to use EXPORT_SYMBOL
> here?
> 
> Thanks,
> Leon
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-14 11:25       ` Maciej Fijalkowski
@ 2024-02-14 16:31         ` Leon Hwang
  0 siblings, 0 replies; 27+ messages in thread
From: Leon Hwang @ 2024-02-14 16:31 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Jakub Sitnicki, Ilya Leoshkevich, Hengqi Chen,
	kernel-patches-bot



On 2024/2/14 19:25, Maciej Fijalkowski wrote:
> On Wed, Feb 14, 2024 at 01:47:45PM +0800, Leon Hwang wrote:
>>
>>
>> On 2024/1/5 12:15, Alexei Starovoitov wrote:
>>> On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>>>
>>>>
>>>
>>> Other alternatives?
>>
>> I've finished the POC of an alternative, which passed all tailcall
>> selftests including these tailcall hierarchy ones.
>>
>> In this alternative, I use a new bpf_prog_run_ctx to wrap the original
>> ctx and the tcc_ptr, then get the tcc_ptr and recover the original ctx
>> in JIT.
>>
>> Then, to avoid breaking runtime with tailcall on other arch, I add an
>> arch-related check bpf_jit_supports_tail_call_cnt_ptr() to determine
>> whether to use bpf_prog_run_ctx.
>>
>> Here's the diff:
> 
> This is a diff against your previously proposed solution; it would be good to see

The previously proposed solution turned out to be buggy once I analysed it more deeply.

> how it currently looks when put together (this diff on top of your
> patch), which would save us the effort of digging up the patch and applying the diff.
> 

But we cannot apply this patch together with this diff, because it breaks
tailcalls at runtime for bpf progs whose runtime entry is not
__bpf_prog_run(), e.g. trampoline-based fentry/fexit/fmod_ret progs.
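
To make the breakage concrete, here is a minimal sketch of the two entry
paths (the struct and function names below are simplified stand-ins for
illustration, not the actual kernel definitions):

#include <stdint.h>

struct bpf_insn;

/* Simplified stand-in for struct bpf_prog. */
struct bpf_prog {
	unsigned int (*bpf_func)(const void *ctx, const struct bpf_insn *insnsi);
	const struct bpf_insn *insnsi;
};

/* Stand-in for the proposed wrapper that carries tcc_ptr. */
struct bpf_prog_run_ctx {
	const void *ctx;
	uint32_t *tail_call_cnt;
};

/* Path 1: dispatch through __bpf_prog_run()-style helpers.  The wrapper
 * can be built right before the call, so the JITed prologue can load
 * tcc_ptr out of it and then recover the original ctx.
 */
static unsigned int run_via_dispatcher(const struct bpf_prog *prog, const void *ctx)
{
	uint32_t tcc = 0;
	struct bpf_prog_run_ctx run_ctx = { .ctx = ctx, .tail_call_cnt = &tcc };

	return prog->bpf_func(&run_ctx, prog->insnsi);
}

/* Path 2: trampoline-based fentry/fexit/fmod_ret.  The trampoline calls
 * prog->bpf_func with the raw ctx directly; nothing ever builds the
 * wrapper, so the new prologue would dereference a bogus tcc_ptr.
 */
static unsigned int run_via_trampoline(const struct bpf_prog *prog, const void *ctx)
{
	return prog->bpf_func(ctx, prog->insnsi);
}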

So, are there any other bpf prog types whose runtime entry is neither
__bpf_prog_run() nor a trampoline?

I'd like to fix them in PATCH v2.

Thanks,
Leon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-14  5:47     ` Leon Hwang
  2024-02-14 11:25       ` Maciej Fijalkowski
@ 2024-02-14 23:16       ` Alexei Starovoitov
  2024-02-15 13:16         ` Leon Hwang
  1 sibling, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2024-02-14 23:16 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot

On Tue, Feb 13, 2024 at 9:47 PM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
>
> On 2024/1/5 12:15, Alexei Starovoitov wrote:
> > On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >>
> >>
> >
> > Other alternatives?
>
> I've finished the POC of an alternative, which passed all tailcall
> selftests including these tailcall hierarchy ones.
>
> In this alternative, I use a new bpf_prog_run_ctx to wrap the original
> ctx and the tcc_ptr, then get the tcc_ptr and recover the original ctx
> in JIT.
>
> Then, to avoid breaking runtime with tailcall on other arch, I add an
> arch-related check bpf_jit_supports_tail_call_cnt_ptr() to determine
> whether to use bpf_prog_run_ctx.
>
> Here's the diff:
>
>  diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 4065bdcc5b2a4..56cea2676863e 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -259,7 +259,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE         5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET   (16 + ENDBR_INSN_SIZE)
>
>  static void push_r12(u8 **pprog)
>  {
> @@ -407,21 +407,19 @@ static void emit_prologue(u8 **pprog, u32
> stack_depth, bool ebpf_from_cbpf,
>         emit_nops(&prog, X86_PATCH_SIZE);
>         if (!ebpf_from_cbpf) {
>                 if (tail_call_reachable && !is_subprog) {
> -                       /* When it's the entry of the whole tailcall context,
> -                        * zeroing rax means initialising tail_call_cnt.
> -                        */
> -                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
> -                       EMIT1(0x50);             /* push rax */
> -                       /* Make rax as ptr that points to tail_call_cnt. */
> -                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> -                       EMIT1_off32(0xE8, 2);    /* call main prog */
> -                       EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> -                       EMIT1(0xC3);             /* ret */
> +                       /* Make rax as tcc_ptr. */
> +                       EMIT4(0x48, 0x8B, 0x47, 0x08); /* mov rax, qword ptr [rdi + 8] */
>                 } else {
> -                       /* Keep the same instruction size. */
> -                       emit_nops(&prog, 13);
> +                       /* Keep the same instruction layout. */
> +                       emit_nops(&prog, 4);
>                 }
>         }
> +       if (!is_subprog)
> +               /* Recover the original ctx. */
> +               EMIT3(0x48, 0x8B, 0x3F); /* mov rdi, qword ptr [rdi] */
> +       else
> +               /* Keep the same instruction layout. */
> +               emit_nops(&prog, 3);
>         /* Exception callback receives FP as third parameter */
>         if (is_exception_cb) {
>                 EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
> @@ -3152,6 +3150,12 @@ bool bpf_jit_supports_subprog_tailcalls(void)
>         return true;
>  }
>
> +/* Indicate the JIT backend supports tail call count pointer in tailcall context. */
> +bool bpf_jit_supports_tail_call_cnt_ptr(void)
> +{
> +       return true;
> +}
> +
>  void bpf_jit_free(struct bpf_prog *prog)
>  {
>         if (prog->jited) {
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 7671530d6e4e0..fea4326c27d31 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1919,6 +1919,11 @@ int bpf_prog_array_copy(struct bpf_prog_array
> *old_array,
>                         u64 bpf_cookie,
>                         struct bpf_prog_array **new_array);
>
> +struct bpf_prog_run_ctx {
> +       const void *ctx;
> +       u32 *tail_call_cnt;
> +};
> +
>  struct bpf_run_ctx {};
>
>  struct bpf_cg_run_ctx {
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 68fb6c8142fec..c1c035c44b4ab 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -629,6 +629,10 @@ typedef unsigned int (*bpf_dispatcher_fn)(const
> void *ctx,
>                                           unsigned int (*bpf_func)(const void *,
>                                                                    const struct bpf_insn *));
>
> +static __always_inline u32 __bpf_prog_run_dfunc(const struct bpf_prog
> *prog,
> +                                               const void *ctx,
> +                                               bpf_dispatcher_fn dfunc);
> +
>  static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
>                                           const void *ctx,
>                                           bpf_dispatcher_fn dfunc)
> @@ -641,14 +645,14 @@ static __always_inline u32 __bpf_prog_run(const
> struct bpf_prog *prog,
>                 u64 start = sched_clock();
>                 unsigned long flags;
>
> -               ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
> +               ret = __bpf_prog_run_dfunc(prog, ctx, dfunc);
>                 stats = this_cpu_ptr(prog->stats);
>                 flags = u64_stats_update_begin_irqsave(&stats->syncp);
>                 u64_stats_inc(&stats->cnt);
>                 u64_stats_add(&stats->nsecs, sched_clock() - start);
>                 u64_stats_update_end_irqrestore(&stats->syncp, flags);
>         } else {
> -               ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
> +               ret = __bpf_prog_run_dfunc(prog, ctx, dfunc);
>         }
>         return ret;
>  }
> @@ -952,12 +956,31 @@ struct bpf_prog *bpf_int_jit_compile(struct
> bpf_prog *prog);
>  void bpf_jit_compile(struct bpf_prog *prog);
>  bool bpf_jit_needs_zext(void);
>  bool bpf_jit_supports_subprog_tailcalls(void);
> +bool bpf_jit_supports_tail_call_cnt_ptr(void);
>  bool bpf_jit_supports_kfunc_call(void);
>  bool bpf_jit_supports_far_kfunc_call(void);
>  bool bpf_jit_supports_exceptions(void);
>  void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64
> sp, u64 bp), void *cookie);
>  bool bpf_helper_changes_pkt_data(void *func);
>
> +static __always_inline u32 __bpf_prog_run_dfunc(const struct bpf_prog
> *prog,
> +                                               const void *ctx,
> +                                               bpf_dispatcher_fn dfunc)
> +{
> +       struct bpf_prog_run_ctx run_ctx = {};
> +       u32 ret, tcc = 0;
> +
> +       run_ctx.ctx = ctx;
> +       run_ctx.tail_call_cnt = &tcc;
> +
> +       if (bpf_jit_supports_tail_call_cnt_ptr() && prog->jited)
> +               ret = dfunc(&run_ctx, prog->insnsi, prog->bpf_func);
> +       else
> +               ret = dfunc(ctx, prog->insnsi, prog->bpf_func);

This is no good either.
We cannot introduce two extra run-time checks before calling every bpf prog.
The solution must be overhead free for common cases.

Can we switch to a percpu tail_call_cnt instead of the on-stack and %rax tricks?

If that won't work, then we'd have to disable tail_calls from subprogs
in the verifier.
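
For illustration, here is a rough userspace-style sketch of the percpu idea
(illustrative names and a fake CPU index, not the eventual kernel
implementation): one counter per CPU, reset when the entry prog starts,
bumped and bounds-checked on every tail call, so nothing extra has to be
threaded through registers or the progs' stack frames.

#include <stdio.h>

#define MAX_TAIL_CALL_CNT	33
#define NR_CPUS			8

/* One counter per CPU; in the kernel this would be DEFINE_PER_CPU(u32, ...). */
static unsigned int tail_call_cnt[NR_CPUS];

/* What the JITed prologue of the entry prog would do: reset this CPU's counter. */
static void tail_call_cnt_prepare(int cpu)
{
	tail_call_cnt[cpu] = 0;
}

/* What each bpf_tail_call() site would do: bump the counter and bail out
 * once the limit is reached.
 */
static int tail_call_allowed(int cpu)
{
	return tail_call_cnt[cpu]++ < MAX_TAIL_CALL_CNT;
}

int main(void)
{
	int cpu = 0, jumps = 0;

	tail_call_cnt_prepare(cpu);
	while (tail_call_allowed(cpu))
		jumps++;

	printf("performed %d tail calls\n", jumps);	/* prints 33 */
	return 0;
}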

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-14 23:16       ` Alexei Starovoitov
@ 2024-02-15 13:16         ` Leon Hwang
  2024-02-16  2:18           ` Alexei Starovoitov
  0 siblings, 1 reply; 27+ messages in thread
From: Leon Hwang @ 2024-02-15 13:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 2024/2/15 07:16, Alexei Starovoitov wrote:
> On Tue, Feb 13, 2024 at 9:47 PM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>>
>> On 2024/1/5 12:15, Alexei Starovoitov wrote:
>>> On Thu, Jan 4, 2024 at 6:23 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>>>
>>>>
>>>
>>> Other alternatives?
>>
>> I've finished the POC of an alternative, which passed all tailcall
>> selftests including these tailcall hierarchy ones.
>>
>> In this alternative, I use a new bpf_prog_run_ctx to wrap the original
>> ctx and the tcc_ptr, then get the tcc_ptr and recover the original ctx
>> in JIT.
>>
>> Then, to avoid breaking runtime with tailcall on other arch, I add an
>> arch-related check bpf_jit_supports_tail_call_cnt_ptr() to determine
>> whether to use bpf_prog_run_ctx.
>>

[SNIP]

>> +
>> +       if (bpf_jit_supports_tail_call_cnt_ptr() && prog->jited)
>> +               ret = dfunc(&run_ctx, prog->insnsi, prog->bpf_func);
>> +       else
>> +               ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
> 
> This is no good either.
> We cannot introduce two extra run-time checks before calling every bpf prog.
> The solution must be overhead free for common cases.
> 
> Can we switch to a percpu tail_call_cnt instead of the on-stack and %rax tricks?
> 

Good idea to use percpu tail_call_cnt.

I did another POC to use percpu tail_call_cnt, which passed all tailcall
selftests too.

In this POC, in order to prepare tcc_ptr in the prologue of the x86 JIT,
the prologue calls bpf_tail_call_cnt_prepare() to get the pointer to the
percpu tail_call_cnt, and stores that pointer in %rax at the same time.

Here's the diff:

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4065bdcc5b2a4..fc1df6a7d87c9 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -241,6 +241,8 @@ int bpf_arch_text_invalidate(void *dst, size_t len)
 }

 struct jit_context {
+	int prologue_tail_call_offset;
+
 	int cleanup_addr; /* Epilogue code offset */

 	/*
@@ -250,6 +252,8 @@ struct jit_context {
 	 */
 	int tail_call_direct_label;
 	int tail_call_indirect_label;
+
+	bool tail_call_reachable;
 };

 /* Maximum number of bytes emitted while JITing one eBPF insn */
@@ -259,7 +263,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(14 + ENDBR_INSN_SIZE)

 static void push_r12(u8 **pprog)
 {
@@ -389,6 +393,19 @@ static void emit_cfi(u8 **pprog, u32 hash)
 	*pprog = prog;
 }

+DEFINE_PER_CPU(u32, bpf_tail_call_cnt);
+
+__attribute__((used))
+static u32 *bpf_tail_call_cnt_prepare(void)
+{
+	u32 *tcc_ptr = this_cpu_ptr(&bpf_tail_call_cnt);
+
+	/* Initialise tail_call_cnt. */
+	*tcc_ptr = 0;
+
+	return tcc_ptr;
+}
+
 /*
  * Emit x86-64 prologue code for BPF program.
  * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
@@ -396,7 +413,7 @@ static void emit_cfi(u8 **pprog, u32 hash)
  */
 static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 			  bool tail_call_reachable, bool is_subprog,
-			  bool is_exception_cb)
+			  bool is_exception_cb, struct jit_context *ctx)
 {
 	u8 *prog = *pprog;

@@ -406,21 +423,15 @@ static void emit_prologue(u8 **pprog, u32
stack_depth, bool ebpf_from_cbpf,
 	 */
 	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
-		if (tail_call_reachable && !is_subprog) {
-			/* When it's the entry of the whole tailcall context,
-			 * zeroing rax means initialising tail_call_cnt.
-			 */
-			EMIT2(0x31, 0xC0);       /* xor eax, eax */
-			EMIT1(0x50);             /* push rax */
-			/* Make rax as ptr that points to tail_call_cnt. */
-			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
-			EMIT1_off32(0xE8, 2);    /* call main prog */
-			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
-			EMIT1(0xC3);             /* ret */
-		} else {
-			/* Keep the same instruction size. */
-			emit_nops(&prog, 13);
-		}
+		/* These 5-bytes nops is prepared to emit_call() to call
+		 * bpf_tail_call_cnt_prepare later.
+		 *
+		 * After calling bpf_tail_call_cnt_prepare, %rax will be
+		 * the tail_call_cnt pointer that points to an initialised
+		 * PER-CPU tail_call_cnt.
+		 */
+		ctx->prologue_tail_call_offset = prog - *pprog;
+		emit_nops(&prog, X86_PATCH_SIZE);
 	}
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
@@ -583,6 +594,17 @@ static void emit_return(u8 **pprog, u8 *ip)
 	*pprog = prog;
 }

+static void bpf_tail_call_prologue_fixup(u8 *image, struct bpf_prog *prog,
+					 struct jit_context *ctx)
+{
+	bool ebpf_from_cbpf = bpf_prog_was_classic(prog);
+	u8 *ip = image + ctx->prologue_tail_call_offset;
+
+	if (!ebpf_from_cbpf && ctx->tail_call_reachable && !bpf_is_subprog(prog))
+		__bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL,
+				     bpf_tail_call_cnt_prepare);
+}
+
 /*
  * Generate the following code:
  *
@@ -1165,10 +1187,12 @@ static int do_jit(struct bpf_prog *bpf_prog, int
*addrs, u8 *image, u8 *rw_image

 	/* tail call's presence in current prog implies it is reachable */
 	tail_call_reachable |= tail_call_seen;
+	ctx->tail_call_reachable = tail_call_reachable;

 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
 		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
-		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
+		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb,
+		      ctx);
 	/* Exception callback will clobber callee regs for its own use, and
 	 * restore the original callee regs from main prog's stack frame.
 	 */
@@ -3097,6 +3121,7 @@ struct bpf_prog *bpf_int_jit_compile(struct
bpf_prog *prog)
 			}

 			bpf_tail_call_direct_fixup(prog);
+			bpf_tail_call_prologue_fixup(image, prog, &ctx);
 		} else {
 			jit_data->addrs = addrs;
 			jit_data->ctx = ctx;

Thanks,
Leon

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-15 13:16         ` Leon Hwang
@ 2024-02-16  2:18           ` Alexei Starovoitov
  2024-02-17 13:43             ` Leon Hwang
  0 siblings, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2024-02-16  2:18 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot

On Thu, Feb 15, 2024 at 5:16 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
> Here's the diff:

Please always send a diff against bpf-next.
No one remembers your prior patch from months ago.

> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 4065bdcc5b2a4..fc1df6a7d87c9 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -241,6 +241,8 @@ int bpf_arch_text_invalidate(void *dst, size_t len)
>  }
>
>  struct jit_context {
> +       int prologue_tail_call_offset;
> +
>         int cleanup_addr; /* Epilogue code offset */
>
>         /*
> @@ -250,6 +252,8 @@ struct jit_context {
>          */
>         int tail_call_direct_label;
>         int tail_call_indirect_label;
> +
> +       bool tail_call_reachable;
>  };
>
>  /* Maximum number of bytes emitted while JITing one eBPF insn */
> @@ -259,7 +263,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE         5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET   (14 + ENDBR_INSN_SIZE)
>
>  static void push_r12(u8 **pprog)
>  {
> @@ -389,6 +393,19 @@ static void emit_cfi(u8 **pprog, u32 hash)
>         *pprog = prog;
>  }
>
> +DEFINE_PER_CPU(u32, bpf_tail_call_cnt);
> +
> +__attribute__((used))
> +static u32 *bpf_tail_call_cnt_prepare(void)
> +{
> +       u32 *tcc_ptr = this_cpu_ptr(&bpf_tail_call_cnt);
> +
> +       /* Initialise tail_call_cnt. */
> +       *tcc_ptr = 0;
> +
> +       return tcc_ptr;
> +}

This might need to be in asm to make sure no callee saved registers
are touched.

In general that's better, but it feels we can do better
and avoid passing rax around.
Just access bpf_tail_call_cnt directly from emit_bpf_tail_call.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-16  2:18           ` Alexei Starovoitov
@ 2024-02-17 13:43             ` Leon Hwang
  2024-02-20  5:13               ` Leon Hwang
  2024-02-20 17:33               ` Alexei Starovoitov
  0 siblings, 2 replies; 27+ messages in thread
From: Leon Hwang @ 2024-02-17 13:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 2024/2/16 10:18, Alexei Starovoitov wrote:
> On Thu, Feb 15, 2024 at 5:16 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>> Here's the diff:
> 
> Please always send a diff against bpf-next.
> No one remembers your prior patch from months ago.

Got it. Thanks for the guidance.
>>
>> +DEFINE_PER_CPU(u32, bpf_tail_call_cnt);
>> +
>> +__attribute__((used))
>> +static u32 *bpf_tail_call_cnt_prepare(void)
>> +{
>> +       u32 *tcc_ptr = this_cpu_ptr(&bpf_tail_call_cnt);
>> +
>> +       /* Initialise tail_call_cnt. */
>> +       *tcc_ptr = 0;
>> +
>> +       return tcc_ptr;
>> +}
> 
> This might need to be in asm to make sure no callee saved registers
> are touched.
> 
> In general that's better, but it feels we can do better
> and avoid passing rax around.
> Just access bpf_tail_call_cnt directly from emit_bpf_tail_call.
Yes, we can do better to avoid passing rax around:

1. At prologue, initialise percpu tail_call_cnt.
2. When tailcall, fetch and increment percpu tail_call_cnt.

As a result, we can remove pushing/popping rax everywhere.

Finally, here's the diff against latest bpf-next with asm to handle
percpu tail_call_cnt:

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 67315505da32e..6f34636fc31d7 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -18,6 +18,7 @@
 #include <asm/text-patching.h>
 #include <asm/unwind.h>
 #include <asm/cfi.h>
+#include <asm/percpu.h>

 static bool all_callee_regs_used[4] = {true, true, true, true};

@@ -259,7 +260,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(14 + ENDBR_INSN_SIZE)

 static void push_r12(u8 **pprog)
 {
@@ -389,68 +390,6 @@ static void emit_cfi(u8 **pprog, u32 hash)
 	*pprog = prog;
 }

-/*
- * Emit x86-64 prologue code for BPF program.
- * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
- * while jumping to another program
- */
-static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
-			  bool tail_call_reachable, bool is_subprog,
-			  bool is_exception_cb)
-{
-	u8 *prog = *pprog;
-
-	emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
-	/* BPF trampoline can be made to work without these nops,
-	 * but let's waste 5 bytes for now and optimize later
-	 */
-	emit_nops(&prog, X86_PATCH_SIZE);
-	if (!ebpf_from_cbpf) {
-		if (tail_call_reachable && !is_subprog) {
-			/* When it's the entry of the whole tailcall context,
-			 * zeroing rax means initialising tail_call_cnt.
-			 */
-			EMIT2(0x31, 0xC0);       /* xor eax, eax */
-			EMIT1(0x50);             /* push rax */
-			/* Make rax as ptr that points to tail_call_cnt. */
-			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
-			EMIT1_off32(0xE8, 2);    /* call main prog */
-			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
-			EMIT1(0xC3);             /* ret */
-		} else {
-			/* Keep the same instruction size. */
-			emit_nops(&prog, 13);
-		}
-	}
-	/* Exception callback receives FP as third parameter */
-	if (is_exception_cb) {
-		EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
-		EMIT3(0x48, 0x89, 0xD5); /* mov rbp, rdx */
-		/* The main frame must have exception_boundary as true, so we
-		 * first restore those callee-saved regs from stack, before
-		 * reusing the stack frame.
-		 */
-		pop_callee_regs(&prog, all_callee_regs_used);
-		pop_r12(&prog);
-		/* Reset the stack frame. */
-		EMIT3(0x48, 0x89, 0xEC); /* mov rsp, rbp */
-	} else {
-		EMIT1(0x55);             /* push rbp */
-		EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
-	}
-
-	/* X86_TAIL_CALL_OFFSET is here */
-	EMIT_ENDBR();
-
-	/* sub rsp, rounded_stack_depth */
-	if (stack_depth)
-		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	if (tail_call_reachable)
-		/* Here, rax is tail_call_cnt_ptr. */
-		EMIT1(0x50);         /* push rax */
-	*pprog = prog;
-}
-
 static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode)
 {
 	u8 *prog = *pprog;
@@ -544,6 +483,105 @@ int bpf_arch_text_poke(void *ip, enum
bpf_text_poke_type t,
 	return __bpf_arch_text_poke(ip, t, old_addr, new_addr);
 }

+DEFINE_PER_CPU(u32, bpf_tail_call_cnt);
+
+__attribute__((used))
+static void bpf_tail_call_cnt_prepare(void)
+{
+	/* The following asm equals to
+	 *
+	 * u32 *tcc_ptr = this_cpu_ptr(&bpf_tail_call_cnt);
+	 *
+	 * *tcc_ptr = 0;
+	 *
+	 * Make sure this asm use %rax only.
+	 */
+
+	asm volatile (
+	     "addq " __percpu_arg(0) ", %1\n\t"
+	     "movl $0, (%%rax)\n\t"
+	     :
+	     : "m" (this_cpu_off), "r" (&bpf_tail_call_cnt)
+	);
+}
+
+__attribute__((used))
+static u32 bpf_tail_call_cnt_fetch_and_inc(void)
+{
+	u32 tail_call_cnt;
+
+	/* The following asm equals to
+	 *
+	 * u32 *tcc_ptr = this_cpu_ptr(&bpf_tail_call_cnt);
+	 *
+	 * (*tcc_ptr)++;
+	 * tail_call_cnt = *tcc_ptr;
+	 * tail_call_cnt--;
+	 *
+	 * Make sure this asm use %rax only.
+	 */
+
+	asm volatile (
+	     "addq " __percpu_arg(1) ", %2\n\t"
+	     "incl (%%rax)\n\t"
+	     "movl (%%rax), %0\n\t"
+	     "decl %0\n\t"
+	     : "=r" (tail_call_cnt)
+	     : "m" (this_cpu_off), "r" (&bpf_tail_call_cnt)
+	);
+
+	return tail_call_cnt;
+}
+
+/*
+ * Emit x86-64 prologue code for BPF program.
+ * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
+ * while jumping to another program
+ */
+static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
+			  bool tail_call_reachable, bool is_subprog,
+			  bool is_exception_cb, u8 *ip)
+{
+	u8 *prog = *pprog, *start = *pprog;
+
+	emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
+	/* BPF trampoline can be made to work without these nops,
+	 * but let's waste 5 bytes for now and optimize later
+	 */
+	emit_nops(&prog, X86_PATCH_SIZE);
+	if (!ebpf_from_cbpf) {
+		if (tail_call_reachable && !is_subprog)
+			emit_call(&prog, bpf_tail_call_cnt_prepare,
+				  ip + (prog - start));
+		else
+			emit_nops(&prog, X86_PATCH_SIZE);
+	}
+	/* Exception callback receives FP as third parameter */
+	if (is_exception_cb) {
+		EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
+		EMIT3(0x48, 0x89, 0xD5); /* mov rbp, rdx */
+		/* The main frame must have exception_boundary as true, so we
+		 * first restore those callee-saved regs from stack, before
+		 * reusing the stack frame.
+		 */
+		pop_callee_regs(&prog, all_callee_regs_used);
+		pop_r12(&prog);
+		/* Reset the stack frame. */
+		EMIT3(0x48, 0x89, 0xEC); /* mov rsp, rbp */
+	} else {
+		EMIT1(0x55);             /* push rbp */
+		EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
+	}
+
+	/* X86_TAIL_CALL_OFFSET is here */
+	EMIT_ENDBR();
+
+	/* sub rsp, rounded_stack_depth */
+	if (stack_depth)
+		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
+	*pprog = prog;
+}
+
 #define EMIT_LFENCE()	EMIT3(0x0F, 0xAE, 0xE8)

 static void emit_indirect_jump(u8 **pprog, int reg, u8 *ip)
@@ -602,7 +640,6 @@ static void emit_bpf_tail_call_indirect(struct
bpf_prog *bpf_prog,
 					u32 stack_depth, u8 *ip,
 					struct jit_context *ctx)
 {
-	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;

@@ -623,16 +660,14 @@ static void emit_bpf_tail_call_indirect(struct
bpf_prog *bpf_prog,
 	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
 	EMIT2(X86_JBE, offset);                   /* jbe out */

-	/*
-	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
-	 *	goto out;
+	/* if (bpf_tail_call_cnt_fetch_and_inc() >= MAX_TAIL_CALL_CNT)
+	 * 	goto out;
 	 */
-	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
-	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
+	emit_call(&prog, bpf_tail_call_cnt_fetch_and_inc, ip + (prog - start));
+	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */

 	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                   /* jae out */
-	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */

 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
@@ -654,8 +689,6 @@ static void emit_bpf_tail_call_indirect(struct
bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}

-	/* pop tail_call_cnt_ptr */
-	EMIT1(0x58);                              /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
 			    round_up(stack_depth, 8));
@@ -683,20 +716,17 @@ static void emit_bpf_tail_call_direct(struct
bpf_prog *bpf_prog,
 				      bool *callee_regs_used, u32 stack_depth,
 				      struct jit_context *ctx)
 {
-	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;

-	/*
-	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
-	 *	goto out;
+	/* if (bpf_tail_call_cnt_fetch_and_inc() >= MAX_TAIL_CALL_CNT)
+	 * 	goto out;
 	 */
-	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
-	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
+	emit_call(&prog, bpf_tail_call_cnt_fetch_and_inc, ip);
+	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */

 	offset = ctx->tail_call_direct_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                       /* jae out */
-	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */

 	poke->tailcall_bypass = ip + (prog - start);
 	poke->adj_off = X86_TAIL_CALL_OFFSET;
@@ -713,8 +743,6 @@ static void emit_bpf_tail_call_direct(struct
bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}

-	/* pop tail_call_cnt_ptr */
-	EMIT1(0x58);                                  /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));

@@ -1141,10 +1169,6 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg,
u8 src_reg, bool is64, u8 op)

 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))

-/* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
-#define LOAD_TAIL_CALL_CNT_PTR(stack)				\
-	EMIT3_off32(0x48, 0x8B, 0x85, -round_up(stack, 8) - 8)
-
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8
*rw_image,
 		  int oldproglen, struct jit_context *ctx, bool jmp_padding)
 {
@@ -1168,7 +1192,8 @@ static int do_jit(struct bpf_prog *bpf_prog, int
*addrs, u8 *image, u8 *rw_image

 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
 		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
-		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
+		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb,
+		      image);
 	/* Exception callback will clobber callee regs for its own use, and
 	 * restore the original callee regs from main prog's stack frame.
 	 */
@@ -1760,17 +1785,12 @@ st:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_CALL: {
 			int offs;

+			if (!imm32)
+				return -EINVAL;
+
 			func = (u8 *) __bpf_call_base + imm32;
-			if (tail_call_reachable) {
-				LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
-				if (!imm32)
-					return -EINVAL;
-				offs = 7 + x86_call_depth_emit_accounting(&prog, func);
-			} else {
-				if (!imm32)
-					return -EINVAL;
-				offs = x86_call_depth_emit_accounting(&prog, func);
-			}
+			offs = x86_call_depth_emit_accounting(&prog, func);
+
 			if (emit_call(&prog, func, image + addrs[i - 1] + offs))
 				return -EINVAL;
 			break;
@@ -2558,7 +2578,6 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im
 	 *                     [ ...        ]
 	 *                     [ stack_arg2 ]
 	 * RBP - arg_stack_off [ stack_arg1 ]
-	 * RSP                 [ tail_call_cnt_ptr ] BPF_TRAMP_F_TAIL_CALL_CTX
 	 */

 	/* room for return value of orig_call or fentry prog */
@@ -2630,8 +2649,6 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im
 		/* sub rsp, stack_size */
 		EMIT4(0x48, 0x83, 0xEC, stack_size);
 	}
-	if (flags & BPF_TRAMP_F_TAIL_CALL_CTX)
-		EMIT1(0x50);		/* push rax */
 	/* mov QWORD PTR [rbp - rbx_off], rbx */
 	emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_6, -rbx_off);

@@ -2686,15 +2703,9 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im
 		restore_regs(m, &prog, regs_off);
 		save_args(m, &prog, arg_stack_off, true);

-		if (flags & BPF_TRAMP_F_TAIL_CALL_CTX)
-			/* Before calling the original function, load the
-			 * tail_call_cnt_ptr to rax.
-			 */
-			LOAD_TAIL_CALL_CNT_PTR(stack_size);
-
 		if (flags & BPF_TRAMP_F_ORIG_STACK) {
-			emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, 8);
-			EMIT2(0xff, 0xd3); /* call *rbx */
+			emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, 8);
+			EMIT2(0xff, 0xd0); /* call *rax */
 		} else {
 			/* call original function */
 			if (emit_rsb_call(&prog, orig_call, image + (prog - (u8 *)rw_image))) {
@@ -2747,11 +2758,6 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im
 			ret = -EINVAL;
 			goto cleanup;
 		}
-	} else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
-		/* Before running the original function, load the
-		 * tail_call_cnt_ptr to rax.
-		 */
-		LOAD_TAIL_CALL_CNT_PTR(stack_size);
 	}

 	/* restore return value of orig_call or fentry prog back into RAX */


Thanks,
Leon

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-17 13:43             ` Leon Hwang
@ 2024-02-20  5:13               ` Leon Hwang
  2024-02-20 17:34                 ` Alexei Starovoitov
  2024-02-20 17:33               ` Alexei Starovoitov
  1 sibling, 1 reply; 27+ messages in thread
From: Leon Hwang @ 2024-02-20  5:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 2024/2/17 21:43, Leon Hwang wrote:
> 
> 
> On 2024/2/16 10:18, Alexei Starovoitov wrote:

[SNIP]

>>
>> In general that's better, but it feels we can do better
>> and avoid passing rax around.
>> Just access bpf_tail_call_cnt directly from emit_bpf_tail_call.
> Yes, we can do better to avoid passing rax around:
> 
> 1. At prologue, initialise percpu tail_call_cnt.
> 2. When tailcall, fetch and increment percpu tail_call_cnt.
> 
> As a result, we can remove pushing/popping rax everywhere.
> 
> Finally, here's the diff against latest bpf-next with asm to handle
> percpu tail_call_cnt:
> 


Hi Alexei,

Should I send PATCH v2?

May I add "Suggested-by: Alexei Starovoitov <ast@kernel.org>" at PATCH
v2? Because the key idea, percpu tail_call_cnt, is suggested by you.

Thanks,
Leon

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-17 13:43             ` Leon Hwang
  2024-02-20  5:13               ` Leon Hwang
@ 2024-02-20 17:33               ` Alexei Starovoitov
  2024-02-21 14:42                 ` Leon Hwang
  1 sibling, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2024-02-20 17:33 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot

On Sat, Feb 17, 2024 at 5:43 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
> Finally, here's the diff against latest bpf-next with asm to handle
> percpu tail_call_cnt:

It is not against bpf-next.

>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)

There is no such thing in bpf-next.

Please make a proper patch post following the rules in
Documentation/bpf/bpf_devel_QA.rst

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-20  5:13               ` Leon Hwang
@ 2024-02-20 17:34                 ` Alexei Starovoitov
  0 siblings, 0 replies; 27+ messages in thread
From: Alexei Starovoitov @ 2024-02-20 17:34 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot

On Mon, Feb 19, 2024 at 9:14 PM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
> May I add "Suggested-by: Alexei Starovoitov <ast@kernel.org>" at PATCH
> v2? Because the key idea, percpu tail_call_cnt, is suggested by you.

No need for such attribution.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy
  2024-02-20 17:33               ` Alexei Starovoitov
@ 2024-02-21 14:42                 ` Leon Hwang
  0 siblings, 0 replies; 27+ messages in thread
From: Leon Hwang @ 2024-02-21 14:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen, kernel-patches-bot



On 2024/2/21 01:33, Alexei Starovoitov wrote:
> On Sat, Feb 17, 2024 at 5:43 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>> Finally, here's the diff against latest bpf-next with asm to handle
>> percpu tail_call_cnt:
> 
> It is not against bpf-next.
> 
>>  /* Number of bytes that will be skipped on tailcall */
>> -#define X86_TAIL_CALL_OFFSET   (22 + ENDBR_INSN_SIZE)
> 
> There is no such thing in bpf-next.
> 
> Please make a proper patch post following the rules in
> Documentation/bpf/bpf_devel_QA.rst

Sorry for my misunderstanding. I will send PATCH v2 instead, which will
truly be against bpf-next.

I'll read the doc again to do better in the future.

Thanks,
Leon

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-02-21 14:42 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-04 14:22 [PATCH bpf-next 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
2024-01-04 14:22 ` [PATCH bpf-next 1/4] bpf, x64: Use emit_nops() to replace memcpy()'ing x86_nops[5] Leon Hwang
2024-01-04 14:22 ` [PATCH bpf-next 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
2024-01-05  4:15   ` Alexei Starovoitov
2024-01-05  6:15     ` Leon Hwang
2024-01-05 17:43       ` Alexei Starovoitov
2024-01-06  2:38         ` Leon Hwang
2024-01-05 10:33     ` Leon Hwang
2024-01-05 17:47       ` Alexei Starovoitov
2024-01-06  2:33         ` Leon Hwang
2024-01-06  3:34           ` Alexei Starovoitov
2024-01-05 12:40     ` Jiri Olsa
2024-01-06  0:18       ` John Fastabend
2024-01-06  3:46         ` Alexei Starovoitov
2024-02-14  5:47     ` Leon Hwang
2024-02-14 11:25       ` Maciej Fijalkowski
2024-02-14 16:31         ` Leon Hwang
2024-02-14 23:16       ` Alexei Starovoitov
2024-02-15 13:16         ` Leon Hwang
2024-02-16  2:18           ` Alexei Starovoitov
2024-02-17 13:43             ` Leon Hwang
2024-02-20  5:13               ` Leon Hwang
2024-02-20 17:34                 ` Alexei Starovoitov
2024-02-20 17:33               ` Alexei Starovoitov
2024-02-21 14:42                 ` Leon Hwang
2024-01-04 14:22 ` [PATCH bpf-next 3/4] bpf, x64: Rename RESTORE_TAIL_CALL_CNT() to LOAD_TAIL_CALL_CNT_PTR() Leon Hwang
2024-01-04 14:22 ` [PATCH bpf-next 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
