* [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy
@ 2023-10-11 15:27 Leon Hwang
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 1/4] bpf, x64: Emit nops for X86_PATCH Leon Hwang
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Leon Hwang @ 2023-10-11 15:27 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm

This patchset fixes a tailcall hierarchy issue with a better solution than v1[0].

The v1 solution stores tail_call_cnt on the stack of the bpf prog:

    |  STACK  |
    +---------+ RBP
    |         |
    |         |
    |         |
 +--| tcc_ptr |
 +->|   tcc   |
    |   rbx   |
    +---------+ RSP

The v2 solution stores tail_call_cnt on the stack of the bpf prog's caller:

    |  STACK  |
    |         |
    |   rip   |
 +->|   tcc   |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |         |
 |  |         |
 |  |         |
 +--| tcc_ptr |
    |   rbx   |
    +---------+ RSP

With this change, fewer instructions are required to resolve the issue.

For more details on the fix, please read the following patches.
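
As a rough preview, the v2 entry-prog prologue emitted in patch 2 behaves
like the following C-level pseudocode (run_main_prog() and the variable
names are illustrative only, not actual JIT symbols):

    u32 tail_call_cnt = 0;            /* xor eax, eax; push rax        */
    u32 *tcc_ptr = &tail_call_cnt;    /* mov rax, rsp                  */
    run_main_prog(tcc_ptr);           /* call main prog, rax = tcc_ptr */
    /* pop rcx; ret: discard tail_call_cnt, return to the caller */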

The issue was confirmed in the discussions of "bpf, x64: Fix tailcall infinite
loop"[1].

Currently, this issue is only resolved on x86. The arm64, s390x and loongarch
counterparts still need to be fixed, so the CI pipeline fails for this series.

Hopefully, this issue will be resolved on s390x and arm64 soon.

v1 -> v2:
  * Address comments from Stanislav:
    * Split the emit_nops() move out into the first patch.

Links:
[0] https://lore.kernel.org/bpf/20231005145814.83122-1-hffilwlqm@gmail.com/
[1] https://lore.kernel.org/bpf/6203dd01-789d-f02c-5293-def4c1b18aef@gmail.com/

Leon Hwang (4):
  bpf, x64: Emit nops for X86_PATCH
  bpf, x64: Fix tailcall hierarchy
  bpf, x64: Load tail_call_cnt pointer
  selftests/bpf: Add testcases for tailcall hierarchy fixing

 arch/x86/net/bpf_jit_comp.c                   |  99 +++--
 .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  34 ++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  55 +++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  46 ++
 5 files changed, 606 insertions(+), 46 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c


base-commit: 644b54d80d572438a815c05b1bab2b7871e1e5a1
-- 
2.41.0



* [RFC PATCH bpf-next v2 1/4] bpf, x64: Emit nops for X86_PATCH
  2023-10-11 15:27 [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2023-10-11 15:27 ` Leon Hwang
  2023-12-05 13:08   ` Maciej Fijalkowski
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Leon Hwang @ 2023-10-11 15:27 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm

Move emit_nops() before emit_prologue() so that the next commit can reuse it.

While at it, replace memcpy(prog, x86_nops[5], X86_PATCH_SIZE) with
emit_nops(&prog, X86_PATCH_SIZE).

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 41 ++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 8c10d9abc2394..c2a0465d37da4 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -304,6 +304,25 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
 	*pprog = prog;
 }
 
+static void emit_nops(u8 **pprog, int len)
+{
+	u8 *prog = *pprog;
+	int i, noplen;
+
+	while (len > 0) {
+		noplen = len;
+
+		if (noplen > ASM_NOP_MAX)
+			noplen = ASM_NOP_MAX;
+
+		for (i = 0; i < noplen; i++)
+			EMIT1(x86_nops[noplen][i]);
+		len -= noplen;
+	}
+
+	*pprog = prog;
+}
+
 /*
  * Emit x86-64 prologue code for BPF program.
  * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
@@ -319,8 +338,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	 * but let's waste 5 bytes for now and optimize later
 	 */
 	EMIT_ENDBR();
-	memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
-	prog += X86_PATCH_SIZE;
+	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
 		if (tail_call_reachable && !is_subprog)
 			/* When it's the entry of the whole tailcall context,
@@ -989,25 +1007,6 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
 	}
 }
 
-static void emit_nops(u8 **pprog, int len)
-{
-	u8 *prog = *pprog;
-	int i, noplen;
-
-	while (len > 0) {
-		noplen = len;
-
-		if (noplen > ASM_NOP_MAX)
-			noplen = ASM_NOP_MAX;
-
-		for (i = 0; i < noplen; i++)
-			EMIT1(x86_nops[noplen][i]);
-		len -= noplen;
-	}
-
-	*pprog = prog;
-}
-
 /* emit the 3-byte VEX prefix
  *
  * r: same as rex.r, extra bit for ModRM reg field
-- 
2.41.0



* [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy
  2023-10-11 15:27 [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 1/4] bpf, x64: Emit nops for X86_PATCH Leon Hwang
@ 2023-10-11 15:27 ` Leon Hwang
  2023-12-05 23:03   ` Maciej Fijalkowski
  2023-12-21 12:02   ` Maciej Fijalkowski
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 3/4] bpf, x64: Load tail_call_cnt pointer Leon Hwang
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Leon Hwang @ 2023-10-11 15:27 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm

Since commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT"), tailcalls on x64 work better than before.

Since commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
for x64 JIT"), tailcalls are able to run in BPF subprograms on x64.

But what about the following case?

1. More than one subprog is called in a bpf program.
2. The tailcalls in those subprogs call back into the bpf program itself.

Because of the missing tail_call_cnt back-propagation, a tailcall hierarchy
comes up, and the MAX_TAIL_CALL_CNT limit does not work for this case.
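
For illustration, a minimal BPF C sketch of such a hierarchy (it mirrors the
tailcall_bpf2bpf_hierarchy1.c selftest added later in this series; userspace
is assumed to update jmp_table[0] to point back to entry):

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	/* Tail call back into the entry prog via jmp_table[0]. */
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	subprog_tail(skb);
	subprog_tail(skb);
	return 1;
}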

As we know, in the tail call context, tail_call_cnt is propagated between
BPF subprograms via the stack and the rax register. So, propagating a
tail_call_cnt pointer via the stack and the rax register instead makes
tail_call_cnt behave like a single shared variable, which makes the
MAX_TAIL_CALL_CNT limit work for tailcall hierarchy cases.

Before jumping to another bpf prog, load tail_call_cnt through the pointer,
compare it with MAX_TAIL_CALL_CNT, and finally increment tail_call_cnt
through the pointer.
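
In C-like pseudocode, the check emitted at each tail-call site then becomes
roughly the following (a sketch of the emitted logic, not the literal JIT
output; tcc_ptr_off names the stack slot that holds the pointer):

	tcc_ptr = *(rbp - tcc_ptr_off);		/* mov rax, qword ptr [rbp - tcc_ptr_off] */
	if (*tcc_ptr >= MAX_TAIL_CALL_CNT)	/* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
		goto out;			/* jae out */
	*tcc_ptr += 1;				/* add dword ptr [rax], 1 */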

But where is tail_call_cnt stored?

It is stored on the stack of the bpf prog's caller, like:

    |  STACK  |
    |         |
    |   rip   |
 +->|   tcc   |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |         |
 |  |         |
 |  |         |
 +--| tcc_ptr |
    |   rbx   |
    +---------+ RSP

And tcc_ptr does not need to be popped from the stack in the epilogue of the
bpf prog, following the approach of commit d207929d97ea028f ("bpf, x64: Drop
"pop %rcx" instruction on BPF JIT epilogue").

Why not back-propagate tail_call_cnt?

Because back-propagating it is fragile. It cannot work well with the
following case.

int prog1();
int prog2();

prog1 is the tail caller, and prog2 is the tail callee. If we back-propagate
tail_call_cnt in the epilogue of prog2, can prog2 also run standalone at the
same time? The answer is NO. Otherwise, a register would be polluted, which
could crash the kernel.

Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index c2a0465d37da4..36631129cc800 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -256,7 +256,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
 
 static void push_r12(u8 **pprog)
 {
@@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	EMIT_ENDBR();
 	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
-		if (tail_call_reachable && !is_subprog)
+		if (tail_call_reachable && !is_subprog) {
 			/* When it's the entry of the whole tailcall context,
 			 * zeroing rax means initialising tail_call_cnt.
 			 */
-			EMIT2(0x31, 0xC0); /* xor eax, eax */
-		else
-			/* Keep the same instruction layout. */
-			EMIT2(0x66, 0x90); /* nop2 */
+			EMIT2(0x31, 0xC0);       /* xor eax, eax */
+			EMIT1(0x50);             /* push rax */
+			/* Make rax as ptr that points to tail_call_cnt. */
+			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
+			EMIT1_off32(0xE8, 2);    /* call main prog */
+			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
+			EMIT1(0xC3);             /* ret */
+		} else {
+			/* Keep the same instruction size. */
+			emit_nops(&prog, 13);
+		}
 	}
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
@@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
 	if (tail_call_reachable)
+		/* Here, rax is tail_call_cnt_ptr. */
 		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }
@@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 					u32 stack_depth, u8 *ip,
 					struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
+	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
 
@@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
+	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
+	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
 
 	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                   /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
 
 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
@@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}
 
+	/* pop tail_call_cnt_ptr */
 	EMIT1(0x58);                              /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
@@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 				      bool *callee_regs_used, u32 stack_depth,
 				      struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
+	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
 
@@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
+	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
+	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
 
 	offset = ctx->tail_call_direct_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                       /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
 
 	poke->tailcall_bypass = ip + (prog - start);
 	poke->adj_off = X86_TAIL_CALL_OFFSET;
@@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}
 
+	/* pop tail_call_cnt_ptr */
 	EMIT1(0x58);                                  /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
-- 
2.41.0



* [RFC PATCH bpf-next v2 3/4] bpf, x64: Load tail_call_cnt pointer
  2023-10-11 15:27 [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 1/4] bpf, x64: Emit nops for X86_PATCH Leon Hwang
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2023-10-11 15:27 ` Leon Hwang
  2023-12-11 18:03   ` Maciej Fijalkowski
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
  2023-11-16  8:33 ` [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
  4 siblings, 1 reply; 21+ messages in thread
From: Leon Hwang @ 2023-10-11 15:27 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm

Rename RESTORE_TAIL_CALL_CNT() to LOAD_TAIL_CALL_CNT_PTR().

Since the previous commit, rax is used to propagate the tail_call_cnt pointer
instead of tail_call_cnt itself, so LOAD_TAIL_CALL_CNT_PTR() is the more
accurate name.

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 36631129cc800..73da9a2125589 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1077,7 +1077,7 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
 /* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
-#define RESTORE_TAIL_CALL_CNT(stack)				\
+#define LOAD_TAIL_CALL_CNT_PTR(stack)				\
 	EMIT3_off32(0x48, 0x8B, 0x85, -round_up(stack, 8) - 8)
 
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
@@ -1697,7 +1697,7 @@ st:			if (is_imm8(insn->off))
 
 			func = (u8 *) __bpf_call_base + imm32;
 			if (tail_call_reachable) {
-				RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
+				LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
 				if (!imm32)
 					return -EINVAL;
 				offs = 7 + x86_call_depth_emit_accounting(&prog, func);
@@ -2479,7 +2479,7 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 	 *                     [ ...        ]
 	 *                     [ stack_arg2 ]
 	 * RBP - arg_stack_off [ stack_arg1 ]
-	 * RSP                 [ tail_call_cnt ] BPF_TRAMP_F_TAIL_CALL_CTX
+	 * RSP                 [ tail_call_cnt_ptr ] BPF_TRAMP_F_TAIL_CALL_CTX
 	 */
 
 	/* room for return value of orig_call or fentry prog */
@@ -2599,10 +2599,10 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 		save_args(m, &prog, arg_stack_off, true);
 
 		if (flags & BPF_TRAMP_F_TAIL_CALL_CTX)
-			/* Before calling the original function, restore the
-			 * tail_call_cnt from stack to rax.
+			/* Before calling the original function, load the
+			 * tail_call_cnt_ptr to rax.
 			 */
-			RESTORE_TAIL_CALL_CNT(stack_size);
+			LOAD_TAIL_CALL_CNT_PTR(stack_size);
 
 		if (flags & BPF_TRAMP_F_ORIG_STACK) {
 			emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, 8);
@@ -2658,10 +2658,10 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 			goto cleanup;
 		}
 	} else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX)
-		/* Before running the original function, restore the
-		 * tail_call_cnt from stack to rax.
+		/* Before running the original function, load the
+		 * tail_call_cnt_ptr to rax.
 		 */
-		RESTORE_TAIL_CALL_CNT(stack_size);
+		LOAD_TAIL_CALL_CNT_PTR(stack_size);
 
 	/* restore return value of orig_call or fentry prog back into RAX */
 	if (save_ret)
-- 
2.41.0



* [RFC PATCH bpf-next v2 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing
  2023-10-11 15:27 [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
                   ` (2 preceding siblings ...)
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 3/4] bpf, x64: Load tail_call_cnt pointer Leon Hwang
@ 2023-10-11 15:27 ` Leon Hwang
  2023-12-11 18:05   ` Maciej Fijalkowski
  2023-11-16  8:33 ` [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
  4 siblings, 1 reply; 21+ messages in thread
From: Leon Hwang @ 2023-10-11 15:27 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen,
	hffilwlqm

Add some test cases to confirm that the tailcall hierarchy issue has been fixed.

tools/testing/selftests/bpf/test_progs -t tailcalls
235/17  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
235/18  tailcalls/tailcall_bpf2bpf_hierarchy_fentry:OK
235/19  tailcalls/tailcall_bpf2bpf_hierarchy_fexit:OK
235/20  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_fexit:OK
235/21  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
235/22  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
235     tailcalls:OK
Summary: 1/22 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  34 ++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  55 +++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  46 ++
 4 files changed, 553 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tailcalls.c b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
index fc6b2954e8f50..49d9b7ed41950 100644
--- a/tools/testing/selftests/bpf/prog_tests/tailcalls.c
+++ b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
@@ -1105,6 +1105,412 @@ static void test_tailcall_bpf2bpf_fentry_entry(void)
 	bpf_object__close(tgt_obj);
 }
 
+static void test_tailcall_hierarchy_count(const char *which, bool test_fentry,
+					  bool test_fexit)
+{
+	int err, map_fd, prog_fd, main_data_fd, fentry_data_fd, fexit_data_fd, i, val;
+	struct bpf_object *obj = NULL, *fentry_obj = NULL, *fexit_obj = NULL;
+	struct bpf_link *fentry_link = NULL, *fexit_link = NULL;
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load(which, BPF_PROG_TYPE_SCHED_CLS, &obj,
+				 &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "prog_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "map_fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	if (test_fentry) {
+		fentry_obj = bpf_object__open_file("tailcall_bpf2bpf_fentry.bpf.o",
+						   NULL);
+		if (!ASSERT_OK_PTR(fentry_obj, "open fentry_obj file"))
+			goto out;
+
+		prog = bpf_object__find_program_by_name(fentry_obj, "fentry");
+		if (!ASSERT_OK_PTR(prog, "find fentry prog"))
+			goto out;
+
+		err = bpf_program__set_attach_target(prog, prog_fd,
+						     "subprog_tail");
+		if (!ASSERT_OK(err, "set_attach_target subprog_tail"))
+			goto out;
+
+		err = bpf_object__load(fentry_obj);
+		if (!ASSERT_OK(err, "load fentry_obj"))
+			goto out;
+
+		fentry_link = bpf_program__attach_trace(prog);
+		if (!ASSERT_OK_PTR(fentry_link, "attach_trace"))
+			goto out;
+	}
+
+	if (test_fexit) {
+		fexit_obj = bpf_object__open_file("tailcall_bpf2bpf_fexit.bpf.o",
+						  NULL);
+		if (!ASSERT_OK_PTR(fexit_obj, "open fexit_obj file"))
+			goto out;
+
+		prog = bpf_object__find_program_by_name(fexit_obj, "fexit");
+		if (!ASSERT_OK_PTR(prog, "find fexit prog"))
+			goto out;
+
+		err = bpf_program__set_attach_target(prog, prog_fd,
+						     "subprog_tail");
+		if (!ASSERT_OK(err, "set_attach_target subprog_tail"))
+			goto out;
+
+		err = bpf_object__load(fexit_obj);
+		if (!ASSERT_OK(err, "load fexit_obj"))
+			goto out;
+
+		fexit_link = bpf_program__attach_trace(prog);
+		if (!ASSERT_OK_PTR(fexit_link, "attach_trace"))
+			goto out;
+	}
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find data_map"))
+		goto out;
+
+	main_data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(main_data_fd, 0, "main_data_fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(main_data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 34, "tailcall count");
+
+	if (test_fentry) {
+		data_map = bpf_object__find_map_by_name(fentry_obj, ".bss");
+		if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+				  "find tailcall_bpf2bpf_fentry.bss map"))
+			goto out;
+
+		fentry_data_fd = bpf_map__fd(data_map);
+		if (!ASSERT_GE(fentry_data_fd, 0,
+				  "find tailcall_bpf2bpf_fentry.bss map fd"))
+			goto out;
+
+		i = 0;
+		err = bpf_map_lookup_elem(fentry_data_fd, &i, &val);
+		ASSERT_OK(err, "fentry count");
+		ASSERT_EQ(val, 68, "fentry count");
+	}
+
+	if (test_fexit) {
+		data_map = bpf_object__find_map_by_name(fexit_obj, ".bss");
+		if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+				  "find tailcall_bpf2bpf_fexit.bss map"))
+			goto out;
+
+		fexit_data_fd = bpf_map__fd(data_map);
+		if (!ASSERT_GE(fexit_data_fd, 0,
+				  "find tailcall_bpf2bpf_fexit.bss map fd"))
+			goto out;
+
+		i = 0;
+		err = bpf_map_lookup_elem(fexit_data_fd, &i, &val);
+		ASSERT_OK(err, "fexit count");
+		ASSERT_EQ(val, 68, "fexit count");
+	}
+
+	i = 0;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (!ASSERT_OK(err, "delete_elem from jmp_table"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	i = 0;
+	err = bpf_map_lookup_elem(main_data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 35, "tailcall count");
+
+	if (test_fentry) {
+		i = 0;
+		err = bpf_map_lookup_elem(fentry_data_fd, &i, &val);
+		ASSERT_OK(err, "fentry count");
+		ASSERT_EQ(val, 70, "fentry count");
+	}
+
+	if (test_fexit) {
+		i = 0;
+		err = bpf_map_lookup_elem(fexit_data_fd, &i, &val);
+		ASSERT_OK(err, "fexit count");
+		ASSERT_EQ(val, 70, "fexit count");
+	}
+
+out:
+	bpf_link__destroy(fentry_link);
+	bpf_link__destroy(fexit_link);
+	bpf_object__close(fentry_obj);
+	bpf_object__close(fexit_obj);
+	bpf_object__close(obj);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_1 checks that the count value of the tail
+ * call limit enforcement matches with expectations when tailcalls are preceded
+ * with two bpf2bpf calls.
+ *
+ *         subprog --tailcall-> entry
+ * entry <
+ *         subprog --tailcall-> entry
+ */
+static void test_tailcall_bpf2bpf_hierarchy_1(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      false, false);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fentry checks that the count value of the
+ * tail call limit enforcement matches with expectations when tailcalls are
+ * preceded with two bpf2bpf calls, and the two subprogs are traced by fentry.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fentry(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      true, false);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fexit checks that the count value of the tail
+ * call limit enforcement matches with expectations when tailcalls are preceded
+ * with two bpf2bpf calls, and the two subprogs are traced by fexit.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fexit(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      false, true);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fentry_fexit checks that the count value of
+ * the tail call limit enforcement matches with expectations when tailcalls are
+ * preceded with two bpf2bpf calls, and the two subprogs are traced by both
+ * fentry and fexit.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fentry_fexit(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      true, true);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_2 checks that the count value of the tail
+ * call limit enforcement matches with expectations:
+ *
+ *         subprog_tail0 --tailcall-> classifier_0 -> subprog_tail0
+ * entry <
+ *         subprog_tail1 --tailcall-> classifier_1 -> subprog_tail1
+ */
+static void test_tailcall_bpf2bpf_hierarchy_2(void)
+{
+	int err, map_fd, prog_fd, data_fd, main_fd, i, val[2];
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_object *obj = NULL;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load("tailcall_bpf2bpf_hierarchy2.bpf.o",
+				 BPF_PROG_TYPE_SCHED_CLS,
+				 &obj, &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(main_fd, 0, "main_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table map fd"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_0");
+	if (!ASSERT_OK_PTR(prog, "find classifier_0 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_0 prog fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_1");
+	if (!ASSERT_OK_PTR(prog, "find classifier_1 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_1 prog fd"))
+		goto out;
+
+	i = 1;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(main_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find .bss map"))
+		goto out;
+
+	data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(data_fd, 0, "find .bss map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall counts");
+	ASSERT_EQ(val[0], 33, "tailcall count0");
+	ASSERT_EQ(val[1], 0, "tailcall count1");
+
+out:
+	bpf_object__close(obj);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_3 checks that the count value of the tail
+ * call limit enforcement matches with expectations:
+ *
+ *                                   subprog with jmp_table0 to classifier_0
+ * entry --tailcall-> classifier_0 <
+ *                                   subprog with jmp_table1 to classifier_0
+ */
+static void test_tailcall_bpf2bpf_hierarchy_3(void)
+{
+	int err, map_fd, prog_fd, data_fd, main_fd, i, val;
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_object *obj = NULL;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load("tailcall_bpf2bpf_hierarchy3.bpf.o",
+				 BPF_PROG_TYPE_SCHED_CLS,
+				 &obj, &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(main_fd, 0, "main_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table0");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table0 map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table0 map fd"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_0");
+	if (!ASSERT_OK_PTR(prog, "find classifier_0 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_0 prog fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table0"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table1");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table1 map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table1 map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table1"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(main_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find .bss map"))
+		goto out;
+
+	data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(data_fd, 0, "find .bss map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 33, "tailcall count");
+
+out:
+	bpf_object__close(obj);
+}
+
 void test_tailcalls(void)
 {
 	if (test__start_subtest("tailcall_1"))
@@ -1139,4 +1545,16 @@ void test_tailcalls(void)
 		test_tailcall_bpf2bpf_fentry_fexit();
 	if (test__start_subtest("tailcall_bpf2bpf_fentry_entry"))
 		test_tailcall_bpf2bpf_fentry_entry();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_1"))
+		test_tailcall_bpf2bpf_hierarchy_1();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fentry"))
+		test_tailcall_bpf2bpf_hierarchy_fentry();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fexit"))
+		test_tailcall_bpf2bpf_hierarchy_fexit();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fentry_fexit"))
+		test_tailcall_bpf2bpf_hierarchy_fentry_fexit();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_2"))
+		test_tailcall_bpf2bpf_hierarchy_2();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_3"))
+		test_tailcall_bpf2bpf_hierarchy_3();
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
new file mode 100644
index 0000000000000..0bfbb7c9637b7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+int count = 0;
+
+static __noinline
+int subprog_tail(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	volatile int ret = 1;
+
+	count++;
+	subprog_tail(skb);
+	subprog_tail(skb);
+
+	return ret;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
new file mode 100644
index 0000000000000..b84541546082e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 2);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+int count0 = 0;
+int count1 = 0;
+
+static __noinline
+int subprog_tail0(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int classifier_0(struct __sk_buff *skb)
+{
+	count0++;
+	subprog_tail0(skb);
+	return 0;
+}
+
+static __noinline
+int subprog_tail1(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 1);
+	return 0;
+}
+
+SEC("tc")
+int classifier_1(struct __sk_buff *skb)
+{
+	count1++;
+	subprog_tail1(skb);
+	return 0;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	subprog_tail0(skb);
+	subprog_tail1(skb);
+
+	return 1;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
new file mode 100644
index 0000000000000..6398a1d277fc7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table0 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table1 SEC(".maps");
+
+int count = 0;
+
+static __noinline
+int subprog_tail(struct __sk_buff *skb, void *jmp_table)
+{
+	bpf_tail_call_static(skb, jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int classifier_0(struct __sk_buff *skb)
+{
+	count++;
+	subprog_tail(skb, &jmp_table0);
+	subprog_tail(skb, &jmp_table1);
+	return 1;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table0, 0);
+
+	return 0;
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.41.0



* Re: [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy
  2023-10-11 15:27 [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
                   ` (3 preceding siblings ...)
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
@ 2023-11-16  8:33 ` Leon Hwang
  2023-11-17 21:40   ` Alexei Starovoitov
  4 siblings, 1 reply; 21+ messages in thread
From: Leon Hwang @ 2023-11-16  8:33 UTC (permalink / raw)
  To: bpf; +Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, iii, hengqi.chen

PING

On 11/10/23 23:27, Leon Hwang wrote:
> This patchset fixes a tailcall hierarchy issue with a better solution than v1[0].
> 
> v1 solution stores tail_call_cnt on the stack of bpf prog:
> 
>     |  STACK  |
>     +---------+ RBP
>     |         |
>     |         |
>     |         |
>  +--| tcc_ptr |
>  +->|   tcc   |
>     |   rbx   |
>     +---------+ RSP
> 
> v2 solution stores tail_call_cnt on the stack of bpf prog's caller:
> 
>     |  STACK  |
>     |         |
>     |   rip   |
>  +->|   tcc   |
>  |  |   rip   |
>  |  |   rbp   |
>  |  +---------+ RBP
>  |  |         |
>  |  |         |
>  |  |         |
>  +--| tcc_ptr |
>     |   rbx   |
>     +---------+ RSP
> 
> With this change, it requires less instructions to resolve this issue.
> 
> For more resolving details, please read the following patches.
> 
> The issue is confirmed in the discussions of "bpf, x64: Fix tailcall infinite
> loop"[1].
> 
> Currently, I only resolve this issue on x86. The ones on arm64, s390x and
> loongarch are waiting to be resolved. So, the ci pipeline fails to run for this
> issue fixing.
> 
> Hopefully, this issue on s390x and arm64 will be resolved soon.
> 
> v1 -> v2:
>   * address comments from Stanislav
>     * Separate moving emit_nops() as first patch.
> 
> Links:
> [0] https://lore.kernel.org/bpf/20231005145814.83122-1-hffilwlqm@gmail.com/
> [1] https://lore.kernel.org/bpf/6203dd01-789d-f02c-5293-def4c1b18aef@gmail.com/
> 
> Leon Hwang (4):
>   bpf, x64: Emit nops for X86_PATCH
>   bpf, x64: Fix tailcall hierarchy
>   bpf, x64: Load tail_call_cnt pointer
>   selftests/bpf: Add testcases for tailcall hierarchy fixing
> 
>  arch/x86/net/bpf_jit_comp.c                   |  99 +++--
>  .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++
>  .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  34 ++
>  .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  55 +++
>  .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  46 ++
>  5 files changed, 606 insertions(+), 46 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
> 
> 
> base-commit: 644b54d80d572438a815c05b1bab2b7871e1e5a1


* Re: [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy
  2023-11-16  8:33 ` [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2023-11-17 21:40   ` Alexei Starovoitov
  2023-11-20 12:41     ` Maciej Fijalkowski
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2023-11-17 21:40 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Ilya Leoshkevich,
	Hengqi Chen

On Thu, Nov 16, 2023 at 12:33 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
> PING

Sorry for the delay. I didn't have a chance to think it through.
I hope experts in the community can take a look soon.


* Re: [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy
  2023-11-17 21:40   ` Alexei Starovoitov
@ 2023-11-20 12:41     ` Maciej Fijalkowski
  2023-12-05  3:09       ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2023-11-20 12:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Jakub Sitnicki, Ilya Leoshkevich, Hengqi Chen

On Fri, Nov 17, 2023 at 01:40:41PM -0800, Alexei Starovoitov wrote:
> On Thu, Nov 16, 2023 at 12:33 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >
> > PING
> 
> Sorry for the delay. I didn't have a chance to think it through.
> I hope experts in the community can take a look soon.
> 

I'll take a look this week.


* Re: [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy
  2023-11-20 12:41     ` Maciej Fijalkowski
@ 2023-12-05  3:09       ` Alexei Starovoitov
  0 siblings, 0 replies; 21+ messages in thread
From: Alexei Starovoitov @ 2023-12-05  3:09 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Jakub Sitnicki, Ilya Leoshkevich, Hengqi Chen

On Mon, Nov 20, 2023 at 4:41 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Fri, Nov 17, 2023 at 01:40:41PM -0800, Alexei Starovoitov wrote:
> > On Thu, Nov 16, 2023 at 12:33 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> > >
> > > PING
> >
> > Sorry for the delay. I didn't have a chance to think it through.
> > I hope experts in the community can take a look soon.
> >
>
> I'll take a look this week.

Maciej,

It's been awhile.
Please review asap.


* Re: [RFC PATCH bpf-next v2 1/4] bpf, x64: Emit nops for X86_PATCH
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 1/4] bpf, x64: Emit nops for X86_PATCH Leon Hwang
@ 2023-12-05 13:08   ` Maciej Fijalkowski
  0 siblings, 0 replies; 21+ messages in thread
From: Maciej Fijalkowski @ 2023-12-05 13:08 UTC (permalink / raw)
  To: Leon Hwang; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen

On Wed, Oct 11, 2023 at 11:27:22PM +0800, Leon Hwang wrote:
> For next commit to reuse emit_nops(), move emit_nops() before
> emit_prologue().
> 
> By the way, change memcpy(prog, x86_nops[5], X86_PATCH_SIZE) to
> emit_nops(&prog, X86_PATCH_SIZE).

I find the subject of the commit a bit bogus. Could you change it to
something like:

use emit_nops() to produce nop5 instead of memcpy'ing x86_nops[5]

I also feel that you should be consistent and address the other spots that
are the same as the one you are touching in emit_prologue() - there are two
more from what I see.

> 
> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 41 ++++++++++++++++++-------------------
>  1 file changed, 20 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 8c10d9abc2394..c2a0465d37da4 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -304,6 +304,25 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
>  	*pprog = prog;
>  }
>  
> +static void emit_nops(u8 **pprog, int len)
> +{
> +	u8 *prog = *pprog;
> +	int i, noplen;
> +
> +	while (len > 0) {
> +		noplen = len;
> +
> +		if (noplen > ASM_NOP_MAX)
> +			noplen = ASM_NOP_MAX;
> +
> +		for (i = 0; i < noplen; i++)
> +			EMIT1(x86_nops[noplen][i]);
> +		len -= noplen;
> +	}
> +
> +	*pprog = prog;
> +}
> +
>  /*
>   * Emit x86-64 prologue code for BPF program.
>   * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
> @@ -319,8 +338,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	 * but let's waste 5 bytes for now and optimize later
>  	 */
>  	EMIT_ENDBR();
> -	memcpy(prog, x86_nops[5], X86_PATCH_SIZE);
> -	prog += X86_PATCH_SIZE;
> +	emit_nops(&prog, X86_PATCH_SIZE);
>  	if (!ebpf_from_cbpf) {
>  		if (tail_call_reachable && !is_subprog)
>  			/* When it's the entry of the whole tailcall context,
> @@ -989,25 +1007,6 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
>  	}
>  }
>  
> -static void emit_nops(u8 **pprog, int len)
> -{
> -	u8 *prog = *pprog;
> -	int i, noplen;
> -
> -	while (len > 0) {
> -		noplen = len;
> -
> -		if (noplen > ASM_NOP_MAX)
> -			noplen = ASM_NOP_MAX;
> -
> -		for (i = 0; i < noplen; i++)
> -			EMIT1(x86_nops[noplen][i]);
> -		len -= noplen;
> -	}
> -
> -	*pprog = prog;
> -}
> -
>  /* emit the 3-byte VEX prefix
>   *
>   * r: same as rex.r, extra bit for ModRM reg field
> -- 
> 2.41.0
> 


* Re: [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2023-12-05 23:03   ` Maciej Fijalkowski
  2023-12-06  6:51     ` Leon Hwang
  2023-12-21 12:02   ` Maciej Fijalkowski
  1 sibling, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2023-12-05 23:03 UTC (permalink / raw)
  To: Leon Hwang; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen

On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
> handling in JIT"), the tailcall on x64 works better than before.
> 
> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
> 
> How about:
> 
> 1. More than 1 subprograms are called in a bpf program.
> 2. The tailcalls in the subprograms call the bpf program.
> 
> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
> 
> As we know, in tail call context, the tail_call_cnt propagates by stack
> and rax register between BPF subprograms. So, propagating tail_call_cnt
> pointer by stack and rax register makes tail_call_cnt as like a global
> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
> hierarchy cases.
> 
> Before jumping to other bpf prog, load tail_call_cnt from the pointer
> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
> tail_call_cnt by its pointer.
> 
> But, where does tail_call_cnt store?
> 
> It stores on the stack of bpf prog's caller, like
> 
>     |  STACK  |
>     |         |
>     |   rip   |
>  +->|   tcc   |
>  |  |   rip   |
>  |  |   rbp   |
>  |  +---------+ RBP
>  |  |         |
>  |  |         |
>  |  |         |
>  +--| tcc_ptr |
>     |   rbx   |
>     +---------+ RSP
> 
> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
> instruction on BPF JIT epilogue").
> 
> Why not back-propagate tail_call_cnt?
> 
> It's because it's vulnerable to back-propagate it. It's unable to work
> well with the following case.
> 
> int prog1();
> int prog2();
> 
> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
> same time? The answer is NO. Otherwise, there will be a register to be
> polluted, which will make kernel crash.

Sorry, but I keep on reading this explanation and I'm lost as to what is
being fixed here.

You want to limit the total number of tail calls that the entry prog can do
to MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
rusty, therefore my view might be distorted :) To me, MAX_TAIL_CALL_CNT is
there to protect us from overflowing the kernel stack and from endless loops.
As long as a single call chain doesn't go over 8kB, the program is fine. The
verifier has a limit of 256 subprogs from what I see.

Can you elaborate a bit more about the kernel crash you mention in the
last paragraph?

I also realized that the verifier assumes MAX_TAIL_CALL_CNT is 32, which has
since changed to 33 - should we adjust the max allowed stack depth of
subprogs? I believe this was brought up at LPC?

> 
> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
>  1 file changed, 24 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index c2a0465d37da4..36631129cc800 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -256,7 +256,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE		5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
>  
>  static void push_r12(u8 **pprog)
>  {
> @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	EMIT_ENDBR();
>  	emit_nops(&prog, X86_PATCH_SIZE);
>  	if (!ebpf_from_cbpf) {
> -		if (tail_call_reachable && !is_subprog)
> +		if (tail_call_reachable && !is_subprog) {
>  			/* When it's the entry of the whole tailcall context,
>  			 * zeroing rax means initialising tail_call_cnt.
>  			 */
> -			EMIT2(0x31, 0xC0); /* xor eax, eax */
> -		else
> -			/* Keep the same instruction layout. */
> -			EMIT2(0x66, 0x90); /* nop2 */
> +			EMIT2(0x31, 0xC0);       /* xor eax, eax */
> +			EMIT1(0x50);             /* push rax */
> +			/* Make rax as ptr that points to tail_call_cnt. */
> +			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> +			EMIT1_off32(0xE8, 2);    /* call main prog */
> +			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> +			EMIT1(0xC3);             /* ret */
> +		} else {
> +			/* Keep the same instruction size. */
> +			emit_nops(&prog, 13);
> +		}
>  	}
>  	/* Exception callback receives FP as third parameter */
>  	if (is_exception_cb) {
> @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>  	if (tail_call_reachable)
> +		/* Here, rax is tail_call_cnt_ptr. */
>  		EMIT1(0x50);         /* push rax */
>  	*pprog = prog;
>  }
> @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  					u32 stack_depth, u8 *ip,
>  					struct jit_context *ctx)
>  {
> -	int tcc_off = -4 - round_up(stack_depth, 8);
> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
>  
> @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>  
>  	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>  	EMIT2(X86_JAE, offset);                   /* jae out */
> -	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
> +	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
>  
>  	/* prog = array->ptrs[index]; */
>  	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  		pop_callee_regs(&prog, callee_regs_used);
>  	}
>  
> +	/* pop tail_call_cnt_ptr */
>  	EMIT1(0x58);                              /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
> @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  				      bool *callee_regs_used, u32 stack_depth,
>  				      struct jit_context *ctx)
>  {
> -	int tcc_off = -4 - round_up(stack_depth, 8);
> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
>  
> @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>  
>  	offset = ctx->tail_call_direct_label - (prog + 2 - start);
>  	EMIT2(X86_JAE, offset);                       /* jae out */
> -	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
> +	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
>  
>  	poke->tailcall_bypass = ip + (prog - start);
>  	poke->adj_off = X86_TAIL_CALL_OFFSET;
> @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  		pop_callee_regs(&prog, callee_regs_used);
>  	}
>  
> +	/* pop tail_call_cnt_ptr */
>  	EMIT1(0x58);                                  /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
> -- 
> 2.41.0
> 
> 


* Re: [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy
  2023-12-05 23:03   ` Maciej Fijalkowski
@ 2023-12-06  6:51     ` Leon Hwang
  2023-12-11 18:02       ` Maciej Fijalkowski
  0 siblings, 1 reply; 21+ messages in thread
From: Leon Hwang @ 2023-12-06  6:51 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen



On 6/12/23 07:03, Maciej Fijalkowski wrote:
> On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
>> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
>> handling in JIT"), the tailcall on x64 works better than before.
>>
>> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
>> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
>>
>> How about:
>>
>> 1. More than 1 subprograms are called in a bpf program.
>> 2. The tailcalls in the subprograms call the bpf program.
>>
>> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
>> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
>>
>> As we know, in tail call context, the tail_call_cnt propagates by stack
>> and rax register between BPF subprograms. So, propagating tail_call_cnt
>> pointer by stack and rax register makes tail_call_cnt as like a global
>> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
>> hierarchy cases.
>>
>> Before jumping to other bpf prog, load tail_call_cnt from the pointer
>> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
>> tail_call_cnt by its pointer.
>>
>> But, where does tail_call_cnt store?
>>
>> It stores on the stack of bpf prog's caller, like
>>
>>     |  STACK  |
>>     |         |
>>     |   rip   |
>>  +->|   tcc   |
>>  |  |   rip   |
>>  |  |   rbp   |
>>  |  +---------+ RBP
>>  |  |         |
>>  |  |         |
>>  |  |         |
>>  +--| tcc_ptr |
>>     |   rbx   |
>>     +---------+ RSP
>>
>> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
>> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
>> instruction on BPF JIT epilogue").
>>
>> Why not back-propagate tail_call_cnt?
>>
>> It's because it's vulnerable to back-propagate it. It's unable to work
>> well with the following case.
>>
>> int prog1();
>> int prog2();
>>
>> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
>> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
>> same time? The answer is NO. Otherwise, there will be a register to be
>> polluted, which will make kernel crash.
> 
> Sorry but I keep on reading this explanation and I'm lost what is being
> fixed here.
> 
> You want to limit the total amount of tail calls that entry prog can do to
> MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
> rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is
> to protect us from overflowing kernel stack and endless loops. As long a
> single call chain doesn't go over 8kB program is fine. Verifier has a
> limit of 256 subprogs from what I see.

I am trying to resolve cases like the following here.

          +--------- tailcall --+
          |                     |
          V       --> subprog1 -+
 entry bpf prog <
          A       --> subprog2 -+
          |                     |
          +--------- tailcall --+

Without this fix, the CPU will stall because of too many tailcalls.

> 
> Can you elaborate a bit more about the kernel crash you mention in the
> last paragraph?

We have progs prog1, prog2, prog3 and prog4, and the following scenarios:

case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3
case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3
case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3
                         --> subprog4 --tailcall-> prog4

How does prog2 back-propagate tail_call_cnt to prog1?

Possible way 1:
When prog2 and prog3 are added to the PROG_ARRAY, poke their epilogues to
back-propagate tail_call_cnt via the RCX register. It seems OK because kprobes
do not touch the value in the RCX register, as in case2.

Possible way 2:
Can we back-propagate tail_call_cnt via the RCX register by checking
tail_call_cnt != 0 at the epilogue when the current prog has a tailcall?
No. As for case1, prog2 would consume the value in the RCX register, which is
not tail_call_cnt, because prog3 has no tailcall and won't populate the RCX
register with tail_call_cnt.

However, I don't like the back-propagating approach. So I "burned" my brain to
figure out pointer-propagating approaches.

RFC PATCH v1 way:
Propagate tail_call_cnt and its pointer together. Then, the pointer is used to
check against MAX_TAIL_CALL_CNT and to increment tail_call_cnt.

    |  STACK  |
    +---------+ RBP
    |         |
    |         |
    |         |
 +--| tcc_ptr |
 +->|   tcc   |
    |   rbx   |
    +---------+ RSP

RFC PATCH v2 way (current patchset):
Propagate the tail_call_cnt pointer only. Then, the pointer is used to check
against MAX_TAIL_CALL_CNT and to increment tail_call_cnt.

    |  STACK  |
    |         |
    |   rip   |
 +->|   tcc   |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |         |
 |  |         |
 |  |         |
 +--| tcc_ptr |
    |   rbx   |
    +---------+ RSP

> 
> I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has
> changed in the meantime to 33 and we should adjust the max allowed stack
> depth of subprogs? I believe this was brought up at LPC?

There's the following code snippet in verifier.c:

	/* protect against potential stack overflow that might happen when
	 * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
	 * depth for such case down to 256 so that the worst case scenario
	 * would result in 8k stack size (32 which is tailcall limit * 256 =
	 * 8k).

But, MAX_TAIL_CALL_CNT is 33.

This was not brought up at LPC 2022 or 2023. I don't know whether it was
brought up at previous LPCs.

Thanks,
Leon

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy
  2023-12-06  6:51     ` Leon Hwang
@ 2023-12-11 18:02       ` Maciej Fijalkowski
  2023-12-13  2:48         ` Leon Hwang
  0 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2023-12-11 18:02 UTC (permalink / raw)
  To: Leon Hwang; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen

On Wed, Dec 06, 2023 at 02:51:28PM +0800, Leon Hwang wrote:
> 
> 
> On 6/12/23 07:03, Maciej Fijalkowski wrote:
> > On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
> >> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
> >> handling in JIT"), the tailcall on x64 works better than before.
> >>
> >> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
> >> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
> >>
> >> How about:
> >>
> >> 1. More than 1 subprograms are called in a bpf program.
> >> 2. The tailcalls in the subprograms call the bpf program.
> >>
> >> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
> >> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
> >>
> >> As we know, in tail call context, the tail_call_cnt propagates by stack
> >> and rax register between BPF subprograms. So, propagating tail_call_cnt
> >> pointer by stack and rax register makes tail_call_cnt as like a global
> >> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
> >> hierarchy cases.
> >>
> >> Before jumping to other bpf prog, load tail_call_cnt from the pointer
> >> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
> >> tail_call_cnt by its pointer.
> >>
> >> But, where does tail_call_cnt store?
> >>
> >> It stores on the stack of bpf prog's caller, like
> >>
> >>     |  STACK  |
> >>     |         |
> >>     |   rip   |
> >>  +->|   tcc   |
> >>  |  |   rip   |
> >>  |  |   rbp   |
> >>  |  +---------+ RBP
> >>  |  |         |
> >>  |  |         |
> >>  |  |         |
> >>  +--| tcc_ptr |
> >>     |   rbx   |
> >>     +---------+ RSP
> >>
> >> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
> >> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
> >> instruction on BPF JIT epilogue").
> >>
> >> Why not back-propagate tail_call_cnt?
> >>
> >> It's because it's vulnerable to back-propagate it. It's unable to work
> >> well with the following case.
> >>
> >> int prog1();
> >> int prog2();
> >>
> >> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
> >> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
> >> same time? The answer is NO. Otherwise, there will be a register to be
> >> polluted, which will make kernel crash.
> > 
> > Sorry but I keep on reading this explanation and I'm lost what is being
> > fixed here> 
> > You want to limit the total amount of tail calls that entry prog can do to
> > MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
> > rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is
> > to protect us from overflowing kernel stack and endless loops. As long a
> > single call chain doesn't go over 8kB program is fine. Verifier has a
> > limit of 256 subprogs from what I see.
> 
> I try to resolve the following-like cases here.
> 
>           +--------- tailcall --+
>           |                     |
>           V       --> subprog1 -+
>  entry bpf prog <
>           A       --> subprog2 -+
>           |                     |
>           +--------- tailcall --+
> 
> Without this fixing, the CPU will be stalled because of too many tailcalls.

You still didn't explain why the CPU might stall :/ if you stumble upon any
kernel splats it's a good habit to include them. So, for future readers of
this thread, I reduced one of your examples down to:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_legacy.h"

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 1);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
        bpf_tail_call_static(skb, &jmp_table, 0);
        return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
        subprog_tail(skb);
        return 0;
}

char __license[] SEC("license") = "GPL";

and indeed the CPU got stuck:

[10623.892978] RIP: 0010:bpf_prog_6abcef0aca85f2fd_subprog_tail+0x49/0x59
[10623.899606] Code: 39 56 24 76 30 8b 85 fc ff ff ff 83 f8 21 73 25 83 c0 01 89 85 fc ff ff ff 48 8b 8c d6 10 01 00 00 48 85 c9 74 0f 41 5d 5b 58 <48> 8b 49 30 48 83 c1 0b ff e1 cc 41 5d 5b c9 c3 cc cc cc cc cc cc
[10623.918646] RSP: 0018:ffffc900202b78e0 EFLAGS: 00000286
[10623.923952] RAX: 0000002100000000 RBX: ffff8881088aa100 RCX: ffffc9000d319000
[10623.931194] RDX: 0000000000000000 RSI: ffff88811e6d2e00 RDI: ffff8881088aa100
[10623.938439] RBP: ffffc900202b78e0 R08: ffffc900202b7dd4 R09: 0000000000000000
[10623.945686] R10: 736391e000000000 R11: 0000000000000000 R12: 0000000000000001
[10623.952927] R13: ffffc9000d38d048 R14: ffffc900202b7dd4 R15: ffffffff81ae756f
[10623.960174]  ? bpf_test_run+0xef/0x320
[10623.963988]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10623.969826]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10623.975662]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10623.981500]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10623.987334]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10623.993174]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10623.999016]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.004854]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.010688]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.016529]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.022371]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.028209]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.034043]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.043211]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.052344]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.061453]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.070404]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.079200]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.087841]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.096333]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.104680]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.112886]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.121014]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.129048]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.136949]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.144716]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.152479]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.160209]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.167905]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.175565]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.183173]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.190738]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.198262]  bpf_test_run+0x186/0x320
[10624.203677]  ? kmem_cache_alloc+0x14d/0x2c0
[10624.209615]  bpf_prog_test_run_skb+0x35c/0x710
[10624.215810]  __sys_bpf+0xf3a/0x2d80
[10624.221068]  ? do_mmap+0x409/0x5e0
[10624.226217]  __x64_sys_bpf+0x1a/0x30
[10624.231693]  do_syscall_64+0x2e/0xa0
[10624.236988]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[10624.243550] RIP: 0033:0x7fd76171e69d
[10624.248576] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 63 a7 0f 00 f7 d8 64 89 01 48
[10624.270613] RSP: 002b:00007ffe694f8d68 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[10624.279894] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd76171e69d
[10624.288769] RDX: 0000000000000050 RSI: 00007ffe694f8db0 RDI: 000000000000000a
[10624.297634] RBP: 00007ffe694f8d80 R08: 00007ffe694f8d37 R09: 00007ffe694f8db0
[10624.306455] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe694f9348
[10624.315260] R13: 000055bb8423f6d9 R14: 000055bb8496ac90 R15: 00007fd761925040

Looking at assembly code:

entry:
ffffffffc0012c60 <load3+0x12c60>:
ffffffffc0012c60:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0012c65:       31 c0                   xor    %eax,%eax
ffffffffc0012c67:       55                      push   %rbp
ffffffffc0012c68:       48 89 e5                mov    %rsp,%rbp
ffffffffc0012c6b:       50                      push   %rax
ffffffffc0012c6c:       48 8b 85 f8 ff ff ff    mov    -0x8(%rbp),%rax
ffffffffc0012c73:       e8 30 00 00 00          call   0xffffffffc0012ca8
ffffffffc0012c78:       31 c0                   xor    %eax,%eax
ffffffffc0012c7a:       c9                      leave
ffffffffc0012c7b:       c3                      ret

subprog_tail:
ffffffffc0012ca8 <load3+0x12ca8>:
ffffffffc0012ca8:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0012cad:       66 90                   xchg   %ax,%ax
ffffffffc0012caf:       55                      push   %rbp
ffffffffc0012cb0:       48 89 e5                mov    %rsp,%rbp
ffffffffc0012cb3:       50                      push   %rax
ffffffffc0012cb4:       53                      push   %rbx
ffffffffc0012cb5:       41 55                   push   %r13
ffffffffc0012cb7:       48 89 fb                mov    %rdi,%rbx
ffffffffc0012cba:       49 bd 00 7c 73 0c 81    movabs $0xffff88810c737c00,%r13
ffffffffc0012cc1:       88 ff ff
ffffffffc0012cc4:       48 89 df                mov    %rbx,%rdi
ffffffffc0012cc7:       4c 89 ee                mov    %r13,%rsi
ffffffffc0012cca:       31 d2                   xor    %edx,%edx
ffffffffc0012ccc:       8b 85 fc ff ff ff       mov    -0x4(%rbp),%eax
ffffffffc0012cd2:       83 f8 21                cmp    $0x21,%eax
ffffffffc0012cd5:       73 17                   jae    0xffffffffc0012cee
ffffffffc0012cd7:       83 c0 01                add    $0x1,%eax
ffffffffc0012cda:       89 85 fc ff ff ff       mov    %eax,-0x4(%rbp)
ffffffffc0012ce0:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0012ce5:       41 5d                   pop    %r13
ffffffffc0012ce7:       5b                      pop    %rbx
ffffffffc0012ce8:       58                      pop    %rax
ffffffffc0012ce9:       e9 7d ff ff ff          jmp    0xffffffffc0012c6b
ffffffffc0012cee:       41 5d                   pop    %r13
ffffffffc0012cf0:       5b                      pop    %rbx
ffffffffc0012cf1:       c9                      leave
ffffffffc0012cf2:       c3                      ret

I am trying to understand what we are writing to %rax when entry is the target
of a tailcall. rbp comes from subprog_tail and we popped all of the previously
pushed regs. So what exactly is causing this hang?

(I thought I'd rather respond just to let you know I'm on it)

> 
> > 
> > Can you elaborate a bit more about the kernel crash you mention in the
> > last paragraph?
> 
> We have progs, prog1, prog2, prog3 and prog4, and the scenario:
> 
> case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3
> case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3
> case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3
>                          --> subprog4 --tailcall-> prog4
> 
> How does prog2 back-propagate tail_call_cnt to prog1?
> 
> Possible way 1:
> When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to
> back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do
> not handle the value in RCX register, like case2.
> 
> Possible way 2:
> Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0
> at epilogue when current prog has tailcall?
> No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt,
> because prog3 has no tailcall and won't populate RCX register with tail_call_cnt.

I don't get it. Let's start with a small example and do the baby steps.

> 
> However, I don't like the back-propagating way. Then, I "burn" my brain to figure
> out pointer propagating ways.

I know that code can damage a programmer's brain; that's why we need to
strive for elaborate and easy-to-understand commit messages so that
later on we will be able to pick up what is going on here.

> 
> RFC PATCH v1 way:
> Propagate tail_call_cnt and its pointer together. Then, the pointer is used to
> check MAX_TAIL_CALL_CNT and increment tail_call_cnt.
> 
>     |  STACK  |
>     +---------+ RBP
>     |         |
>     |         |
>     |         |
>  +--| tcc_ptr |
>  +->|   tcc   |
>     |   rbx   |
>     +---------+ RSP
> 
> RFC PATCH v2 way (current patchset):
> Propagate tail_call_cnt pointer only. Then, the pointer is used to check
> MAX_TAIL_CALL_CNT and increment tail_call_cnt.
> 
>     |  STACK  |
>     |         |
>     |   rip   |
>  +->|   tcc   |
>  |  |   rip   |
>  |  |   rbp   |
>  |  +---------+ RBP
>  |  |         |
>  |  |         |
>  |  |         |
>  +--| tcc_ptr |
>     |   rbx   |
>     +---------+ RSP
> 
> > 
> > I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has
> > changed in the meantime to 33 and we should adjust the max allowed stack
> > depth of subprogs? I believe this was brought up at LPC?
> 
> There's following code snippet in verifier.c:
> 
> 	/* protect against potential stack overflow that might happen when
> 	 * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
> 	 * depth for such case down to 256 so that the worst case scenario
> 	 * would result in 8k stack size (32 which is tailcall limit * 256 =
> 	 * 8k).
> 
> But, MAX_TAIL_CALL_CNT is 33.
> 
> This was not brought up at LPC 2022&2023. I don't know whether this was
> brought up at previous LPCs.

Was referring to this:
https://lpc.events/event/17/contributions/1595/attachments/1230/2506/LPC2023_slides.pdf

but let us focus on the issue you're bringing up here in this patch. I brought
this up thinking the JIT code was fine; now it's clear that things go south
when the entry prog is a tailcall target, which we didn't test.

> 
> Thanks,
> Leon
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 3/4] bpf, x64: Load tail_call_cnt pointer
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 3/4] bpf, x64: Load tail_call_cnt pointer Leon Hwang
@ 2023-12-11 18:03   ` Maciej Fijalkowski
  2023-12-13  2:49     ` Leon Hwang
  0 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2023-12-11 18:03 UTC (permalink / raw)
  To: Leon Hwang; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen

On Wed, Oct 11, 2023 at 11:27:24PM +0800, Leon Hwang wrote:
> Rename RESTORE_TAIL_CALL_CNT() to LOAD_TAIL_CALL_CNT_PTR().
> 
> With previous commit, rax is used to propagate tail_call_cnt pointer
> instead of tail_call_cnt. So, LOAD_TAIL_CALL_CNT_PTR() is better.
> 
> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>

IMHO this is out of the scope for this set. We are going to target the bpf
tree and this patch can be sent to bpf-next in the future.

> ---
>  arch/x86/net/bpf_jit_comp.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
@ 2023-12-11 18:05   ` Maciej Fijalkowski
  2023-12-13  3:09     ` Leon Hwang
  0 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2023-12-11 18:05 UTC (permalink / raw)
  To: Leon Hwang; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen

On Wed, Oct 11, 2023 at 11:27:25PM +0800, Leon Hwang wrote:
> Add some test cases to confirm the tailcall hierarchy issue has been fixed.
> 
> tools/testing/selftests/bpf/test_progs -t tailcalls
> 235/17  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
> 235/18  tailcalls/tailcall_bpf2bpf_hierarchy_fentry:OK
> 235/19  tailcalls/tailcall_bpf2bpf_hierarchy_fexit:OK
> 235/20  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_fexit:OK
> 235/21  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
> 235/22  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
> 235     tailcalls:OK
> Summary: 1/22 PASSED, 0 SKIPPED, 0 FAILED
> 
> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> ---
>  .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++

Generally it feels like a lot of duplicated code from your previous
fentry/fexit fixes, but I didn't look closely at whether it would be possible
to pull out something in common.

>  .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  34 ++
>  .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  55 +++
>  .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  46 ++
>  4 files changed, 553 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy
  2023-12-11 18:02       ` Maciej Fijalkowski
@ 2023-12-13  2:48         ` Leon Hwang
  0 siblings, 0 replies; 21+ messages in thread
From: Leon Hwang @ 2023-12-13  2:48 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen



On 12/12/23 02:02, Maciej Fijalkowski wrote:
> On Wed, Dec 06, 2023 at 02:51:28PM +0800, Leon Hwang wrote:
>>
>>
>> On 6/12/23 07:03, Maciej Fijalkowski wrote:
>>> On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
>>>> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
>>>> handling in JIT"), the tailcall on x64 works better than before.
>>>>
>>>> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
>>>> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
>>>>
>>>> How about:
>>>>
>>>> 1. More than 1 subprograms are called in a bpf program.
>>>> 2. The tailcalls in the subprograms call the bpf program.
>>>>
>>>> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
>>>> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
>>>>
>>>> As we know, in tail call context, the tail_call_cnt propagates by stack
>>>> and rax register between BPF subprograms. So, propagating tail_call_cnt
>>>> pointer by stack and rax register makes tail_call_cnt as like a global
>>>> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
>>>> hierarchy cases.
>>>>
>>>> Before jumping to other bpf prog, load tail_call_cnt from the pointer
>>>> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
>>>> tail_call_cnt by its pointer.
>>>>
>>>> But, where does tail_call_cnt store?
>>>>
>>>> It stores on the stack of bpf prog's caller, like
>>>>
>>>>     |  STACK  |
>>>>     |         |
>>>>     |   rip   |
>>>>  +->|   tcc   |
>>>>  |  |   rip   |
>>>>  |  |   rbp   |
>>>>  |  +---------+ RBP
>>>>  |  |         |
>>>>  |  |         |
>>>>  |  |         |
>>>>  +--| tcc_ptr |
>>>>     |   rbx   |
>>>>     +---------+ RSP
>>>>
>>>> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
>>>> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
>>>> instruction on BPF JIT epilogue").
>>>>
>>>> Why not back-propagate tail_call_cnt?
>>>>
>>>> It's because it's vulnerable to back-propagate it. It's unable to work
>>>> well with the following case.
>>>>
>>>> int prog1();
>>>> int prog2();
>>>>
>>>> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
>>>> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
>>>> same time? The answer is NO. Otherwise, there will be a register to be
>>>> polluted, which will make kernel crash.
>>>
>>> Sorry but I keep on reading this explanation and I'm lost what is being
>>> fixed here> 
>>> You want to limit the total amount of tail calls that entry prog can do to
>>> MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
>>> rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is
>>> to protect us from overflowing kernel stack and endless loops. As long a
>>> single call chain doesn't go over 8kB program is fine. Verifier has a
>>> limit of 256 subprogs from what I see.
>>
>> I try to resolve the following-like cases here.
>>
>>           +--------- tailcall --+
>>           |                     |
>>           V       --> subprog1 -+
>>  entry bpf prog <
>>           A       --> subprog2 -+
>>           |                     |
>>           +--------- tailcall --+
>>
>> Without this fixing, the CPU will be stalled because of too many tailcalls.
> 
> You still didn't explain why CPU might stall :/ if you stumble upon any
> kernel splats it's good habit to include them. So, for future readers of
> this thread, I reduced one of your examples down to:
> 
Thanks for your help.

Here is a GitHub CI history of this patchset:

https://github.com/kernel-patches/bpf/pull/5807/checks

In these CI results, test_progs failed to run on aarch64 and s390x
because of "rcu: INFO: rcu_sched self-detected stall on CPU".

But why does the CPU stall?

It's because too many tailcalls run in the bpf prog. How many tailcalls? And
why does the MAX_TAIL_CALL_CNT limit not work for these cases?

Let's have a look at a simple case:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_legacy.h"

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	volatile int ret = 1;

	count++;
	subprog_tail(skb); /* subprog call1 */
	subprog_tail(skb); /* subprog call2 */

	return ret;
}

char __license[] SEC("license") = "GPL";

The entry bpf prog is installed in the 0th slot of jmp_table. Then, what
happens when the entry bpf prog runs?

The very first time subprog_tail() is called, it tailcalls the entry bpf prog.
Then, subprog_tail() is called a second time at the position of subprog call1,
and it tailcalls the entry bpf prog again.

Then, again and again. The first time the MAX_TAIL_CALL_CNT limit kicks in,
subprog_tail() has been called 34 times at the position of subprog call1. And
at this time, tail_call_cnt is 33 in subprog_tail().

Next, the 34th subprog_tail() returns to entry() because of the
MAX_TAIL_CALL_CNT limit.

In the 34th entry(), after the 34th subprog_tail() at the position of subprog
call1 finishes and before the 1st subprog_tail() at the position of subprog
call2 is called, what's the value of tail_call_cnt in entry()? It's 33.

As we know, tail_call_cnt is pushed onto the stack of entry(), and propagates
to subprog_tail() via %rax and the stack.

Then, when subprog_tail() at the position of subprog call2 is called for the
first time, tail_call_cnt 33 propagates to subprog_tail() via %rax. And the
tailcall in subprog_tail() is aborted because of
tail_call_cnt >= MAX_TAIL_CALL_CNT, too.

Then, subprog_tail() at the position of subprog call2 ends, and the 34th
entry() ends. And it returns to the 33rd subprog_tail() called from the
position of subprog call1. But wait, at this time, what's the value of
tail_call_cnt on the stack of that subprog_tail()? It's 33.

Then, in the 33rd entry(), after the 33rd subprog_tail() at the position of
subprog call1 finishes and before the 2nd subprog_tail() at the position of
subprog call2 is called, what's the value of tail_call_cnt
in the current entry()? It's **32**. Why not 33?

Before stepping into subprog_tail() at the position of subprog call2 in the
33rd entry(), as if freezing time, let's have a look at the stack memory:

  |  STACK  |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  +---------+ RSP  <-- current rsp
  |   rip   | STACK of 34th entry()
  |   rbp   | reuse the STACK of 33rd subprog_tail() at the position
  |   ret   |                                        subprog call1
  |   tcc   | its value is 33
  +---------+ rsp
  |   rip   | STACK of 1st subprog_tail() at the position subprog call2
  |   rbp   |
  |   tcc   | its value is 33
  +---------+ rsp

Why not 33? It's because tail_call_cnt is not back-propagated from
subprog_tail() to entry().

Then, when stepping into subprog_tail() at the position of subprog call2 in the
33rd entry():

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 32; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Then, pausing right after the tailcall in the 2nd subprog_tail() at the
position of subprog call2:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 33; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Note: what happens to tail_call_cnt:
	/*
	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
	 *	goto out;
	 */
It checks tail_call_cnt >= MAX_TAIL_CALL_CNT first and then increments tail_call_cnt.

So, the current tailcall is allowed to run.

Then, entry() is tailcalled. And the stack memory status is:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 35th entry(); reuse STACK of subprog_tail() at the
  |   tcc   | its value is 33                   the position subprog call2
  +---------+ RSP  <-- current rsp

So, the tailcalls in the 35th entry() will be aborted.

And, ..., again and again.  :(

I hope you now understand why the MAX_TAIL_CALL_CNT limit does not work for
these cases.

And how many tailcalls are there if the CPU does not stall?

Let's count it. 1+1+1+...+1=?

Let's build a math model to calculate the total tailcall count. Sorry, I'm
poor at math.

How about a top-down view? Doesn't it look like a hierarchy, layer by layer?

I think it is a layered hierarchy model with 2+4+8+...+2**33 tailcalls. As a
result, if the CPU did not stall, there would be 2**34 - 2 = 17,179,869,182
tailcalls. That's the guy that stalls the CPU.
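
Just as a sanity check on that sum, here is a tiny standalone userspace C
snippet (illustrative only, not part of the patch or the selftests) that adds
up the 2-subprog hierarchy layer by layer:

#include <stdio.h>

int main(void)
{
	unsigned long long total = 0, layer = 2;	/* 2 tailcalls in the 1st layer */
	int i;

	/* 33 layers: 2 + 4 + 8 + ... + 2^33 = 2^34 - 2 */
	for (i = 0; i < 33; i++) {
		total += layer;
		layer *= 2;
	}
	printf("%llu\n", total);	/* prints 17179869182 */
	return 0;
}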

What if there are N subprog_tail() calls in entry()? If the CPU did not stall
because of too many tailcalls, there would be almost N**34 tailcalls.

> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
> #include "bpf_legacy.h"
> 
> struct {
>         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>         __uint(max_entries, 1);
>         __uint(key_size, sizeof(__u32));
>         __uint(value_size, sizeof(__u32));
> } jmp_table SEC(".maps");
> 
> static __noinline
> int subprog_tail(struct __sk_buff *skb)
> {
>         bpf_tail_call_static(skb, &jmp_table, 0);
>         return 0;
> }
> 
> SEC("tc")
> int entry(struct __sk_buff *skb)
> {
>         subprog_tail(skb);
>         return 0;
> }
> 
> char __license[] SEC("license") = "GPL";
> 
> and indeed CPU got stuck:
> 
> [10623.892978] RIP: 0010:bpf_prog_6abcef0aca85f2fd_subprog_tail+0x49/0x59
> [10623.899606] Code: 39 56 24 76 30 8b 85 fc ff ff ff 83 f8 21 73 25 83 c0 01 89 85 fc ff ff ff 48 8b 8c d6 10 01 00 00 48 85 c9 74 0f 41 5d 5b 58 <48> 8b 49 30 48 83 c1 0b ff e1 cc 41 5d 5b c9 c3 cc cc cc cc cc cc
> [10623.918646] RSP: 0018:ffffc900202b78e0 EFLAGS: 00000286
> [10623.923952] RAX: 0000002100000000 RBX: ffff8881088aa100 RCX: ffffc9000d319000
> [10623.931194] RDX: 0000000000000000 RSI: ffff88811e6d2e00 RDI: ffff8881088aa100
> [10623.938439] RBP: ffffc900202b78e0 R08: ffffc900202b7dd4 R09: 0000000000000000
> [10623.945686] R10: 736391e000000000 R11: 0000000000000000 R12: 0000000000000001
> [10623.952927] R13: ffffc9000d38d048 R14: ffffc900202b7dd4 R15: ffffffff81ae756f
> [10623.960174]  ? bpf_test_run+0xef/0x320
> [10623.963988]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.969826]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.975662]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.981500]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10623.987334]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10623.993174]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.999016]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.004854]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.010688]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.016529]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.022371]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.028209]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.034043]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.043211]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.052344]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.061453]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.070404]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.079200]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.087841]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.096333]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.104680]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.112886]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.121014]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.129048]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.136949]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.144716]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.152479]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.160209]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.167905]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.175565]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.183173]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.190738]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.198262]  bpf_test_run+0x186/0x320
> [10624.203677]  ? kmem_cache_alloc+0x14d/0x2c0
> [10624.209615]  bpf_prog_test_run_skb+0x35c/0x710
> [10624.215810]  __sys_bpf+0xf3a/0x2d80
> [10624.221068]  ? do_mmap+0x409/0x5e0
> [10624.226217]  __x64_sys_bpf+0x1a/0x30
> [10624.231693]  do_syscall_64+0x2e/0xa0
> [10624.236988]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [10624.243550] RIP: 0033:0x7fd76171e69d
> [10624.248576] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 63 a7 0f 00 f7 d8 64 89 01 48
> [10624.270613] RSP: 002b:00007ffe694f8d68 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
> [10624.279894] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd76171e69d
> [10624.288769] RDX: 0000000000000050 RSI: 00007ffe694f8db0 RDI: 000000000000000a
> [10624.297634] RBP: 00007ffe694f8d80 R08: 00007ffe694f8d37 R09: 00007ffe694f8db0
> [10624.306455] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe694f9348
> [10624.315260] R13: 000055bb8423f6d9 R14: 000055bb8496ac90 R15: 00007fd761925040
> 
> Looking at assembly code:
> 
> entry:
> ffffffffc0012c60 <load3+0x12c60>:
> ffffffffc0012c60:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0012c65:       31 c0                   xor    %eax,%eax
> ffffffffc0012c67:       55                      push   %rbp
> ffffffffc0012c68:       48 89 e5                mov    %rsp,%rbp
> ffffffffc0012c6b:       50                      push   %rax
> ffffffffc0012c6c:       48 8b 85 f8 ff ff ff    mov    -0x8(%rbp),%rax
> ffffffffc0012c73:       e8 30 00 00 00          call   0xffffffffc0012ca8
> ffffffffc0012c78:       31 c0                   xor    %eax,%eax
> ffffffffc0012c7a:       c9                      leave
> ffffffffc0012c7b:       c3                      ret
> 
> subprog_tail:
> ffffffffc0012ca8 <load3+0x12ca8>:
> ffffffffc0012ca8:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0012cad:       66 90                   xchg   %ax,%ax
> ffffffffc0012caf:       55                      push   %rbp
> ffffffffc0012cb0:       48 89 e5                mov    %rsp,%rbp
> ffffffffc0012cb3:       50                      push   %rax
> ffffffffc0012cb4:       53                      push   %rbx
> ffffffffc0012cb5:       41 55                   push   %r13
> ffffffffc0012cb7:       48 89 fb                mov    %rdi,%rbx
> ffffffffc0012cba:       49 bd 00 7c 73 0c 81    movabs $0xffff88810c737c00,%r13
> ffffffffc0012cc1:       88 ff ff
> ffffffffc0012cc4:       48 89 df                mov    %rbx,%rdi
> ffffffffc0012cc7:       4c 89 ee                mov    %r13,%rsi
> ffffffffc0012cca:       31 d2                   xor    %edx,%edx
> ffffffffc0012ccc:       8b 85 fc ff ff ff       mov    -0x4(%rbp),%eax
> ffffffffc0012cd2:       83 f8 21                cmp    $0x21,%eax
> ffffffffc0012cd5:       73 17                   jae    0xffffffffc0012cee
> ffffffffc0012cd7:       83 c0 01                add    $0x1,%eax
> ffffffffc0012cda:       89 85 fc ff ff ff       mov    %eax,-0x4(%rbp)
> ffffffffc0012ce0:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0012ce5:       41 5d                   pop    %r13
> ffffffffc0012ce7:       5b                      pop    %rbx
> ffffffffc0012ce8:       58                      pop    %rax
> ffffffffc0012ce9:       e9 7d ff ff ff          jmp    0xffffffffc0012c6b
> ffffffffc0012cee:       41 5d                   pop    %r13
> ffffffffc0012cf0:       5b                      pop    %rbx
> ffffffffc0012cf1:       c9                      leave
> ffffffffc0012cf2:       c3                      ret
> 
> I am trying to understand what are writing to %rax when entry is target of
> tailcall. rbp comes from subprog_tail and we popped all of the previously
> pushed regs. So what is exactly causing this hang?
> 

Oh, it's weird that your example makes the CPU get stuck. I don't know why, either.

> (I thought I'd rather respond just to let you know I'm on it)
> 

Thanks for the reminder.

>>
>>>
>>> Can you elaborate a bit more about the kernel crash you mention in the
>>> last paragraph?
>>
>> We have progs, prog1, prog2, prog3 and prog4, and the scenario:
>>
>> case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3
>> case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3
>> case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3
>>                          --> subprog4 --tailcall-> prog4
>>
>> How does prog2 back-propagate tail_call_cnt to prog1?
>>
>> Possible way 1:
>> When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to
>> back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do
>> not handle the value in RCX register, like case2.
>>
>> Possible way 2:
>> Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0
>> at epilogue when current prog has tailcall?
>> No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt,
>> because prog3 has no tailcall and won't populate RCX register with tail_call_cnt.
> 
> I don't get it. Let's start with small example and do the baby steps.
> 

I lost the details of the back-propagating solutions, which I had tried hard to
convince myself were not as good as I expected.

Sorry about that.

>>
>> However, I don't like the back-propagating way. Then, I "burn" my brain to figure
>> out pointer propagating ways.
> 
> I know that code can damage programmer's brain, that's why we need to
> strive for elaborative and easy to understand commit messages so that
> later on we will be able to pick up what is going on here.
> 

Yeah, it's not easy to understand the tailcall code on an arch like x86.

I'm trying hard to explain this patch.

>>
>> RFC PATCH v1 way:
>> Propagate tail_call_cnt and its pointer together. Then, the pointer is used to
>> check MAX_TAIL_CALL_CNT and increment tail_call_cnt.
>>
>>     |  STACK  |
>>     +---------+ RBP
>>     |         |
>>     |         |
>>     |         |
>>  +--| tcc_ptr |
>>  +->|   tcc   |
>>     |   rbx   |
>>     +---------+ RSP
>>
>> RFC PATCH v2 way (current patchset):
>> Propagate tail_call_cnt pointer only. Then, the pointer is used to check
>> MAX_TAIL_CALL_CNT and increment tail_call_cnt.
>>
>>     |  STACK  |
>>     |         |
>>     |   rip   |
>>  +->|   tcc   |
>>  |  |   rip   |
>>  |  |   rbp   |
>>  |  +---------+ RBP
>>  |  |         |
>>  |  |         |
>>  |  |         |
>>  +--| tcc_ptr |
>>     |   rbx   |
>>     +---------+ RSP
>>

It's time to describe the details of this patch, and to take a look at how it
handles the simple case above.

This patch keeps tail_call_cnt on the stack of the entry bpf prog's caller, and
then propagates the tail_call_cnt pointer the same way the previous
tail_call_cnt was propagated.

So, in the prologue of the entry bpf prog, before pushing %rbp, it has to
initialise tail_call_cnt and push it to the stack using %rax. Then, %rax is
made the pointer that points to tail_call_cnt, by "mov rax, rsp". Then, the
main part of the entry bpf prog is called by "call 2". **This is the
exceptional point.** With this "call", %rip is pushed to the stack. And at the
end of the entry bpf prog's runtime, this %rip is popped from the stack; then,
tail_call_cnt is popped from the stack too; and finally it "ret"s again. The
pop of tail_call_cnt and that "ret" are the 2 bytes skipped by "call 2".

So, when "call 2", %rax is the tail_call_cnt pointer. And then,
tail_call_cnt pointer will be pushed to stack. This is the way to push %rax
when entry bpf prog is tailcalled, too.

So, when a subprog is called, the tail_call_cnt pointer is loaded from the
stack into %rax.

And when a tailcall happens, the tail_call_cnt pointer is loaded from the stack
into %rax by "mov rax, qword ptr [rbp - tcc_ptr_off]", tail_call_cnt is
compared with MAX_TAIL_CALL_CNT by "cmp dword ptr [rax], MAX_TAIL_CALL_CNT",
and then tail_call_cnt is incremented by "add dword ptr [rax], 1". Finally, the
"pop %rax" pops the tail_call_cnt pointer from the stack into %rax.
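
As a rough C analogy of what the JITed code does with the pointer (illustrative
only; entry()/subprog_tail() mirror the bpf progs above, MAX_TAIL_CALL_CNT is
33 as in the kernel, and the real mechanism is the emitted assembly, not this
C):

#include <stdio.h>

#define MAX_TAIL_CALL_CNT 33

static void entry(unsigned int *tcc_ptr);

/* stands in for the tailcall site in subprog_tail() */
static void subprog_tail(unsigned int *tcc_ptr)
{
	if (*tcc_ptr >= MAX_TAIL_CALL_CNT)	/* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
		return;				/* jae out */
	(*tcc_ptr)++;				/* add dword ptr [rax], 1 */
	entry(tcc_ptr);				/* jump to the tailcall target */
}

static void entry(unsigned int *tcc_ptr)
{
	subprog_tail(tcc_ptr);	/* subprog call1 */
	subprog_tail(tcc_ptr);	/* subprog call2 */
}

int main(void)
{
	unsigned int tail_call_cnt = 0;	/* "xor eax, eax; push rax" in the prologue */

	entry(&tail_call_cnt);		/* "mov rax, rsp; call ..." */
	printf("%u\n", tail_call_cnt);	/* prints 33: the limit is now global */
	return 0;
}

Since every frame shares the one counter on the caller's stack, the total
number of tailcalls is capped at 33 no matter how the hierarchy branches.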

Once the prologue of entry() has run, the stack of entry() should look like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 0
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
    |   rbx   | saved regs
    +---------+ RSP

Then, when subprog_tail() is called for the very first time, its stack should
look like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 0
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP

Then, when subprog_tail() tailcalls entry():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 1
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
    +---------+ RSP

Then, when entry() calls subprog_tail():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 1
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP

Then, when subprog_tail() tailcalls entry():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 2
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
    +---------+ RSP

Then, again and again. The first time the MAX_TAIL_CALL_CNT limit kicks in,
subprog_tail() has been called 34 times at the position of subprog call1. And
at this time, the stack should look like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 33
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |    *    |
 |  |    *    |
 |  |    *    |
 |  +---------+ RBP
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP

From this point on, further tailcalls will be aborted because tail_call_cnt has
reached 33, the MAX_TAIL_CALL_CNT limit.

That's how this patch works.

It's really nice if you've made it this far. I hope the replies above give you
a clear picture of this patch.

>>>
>>> I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has
>>> changed in the meantime to 33 and we should adjust the max allowed stack
>>> depth of subprogs? I believe this was brought up at LPC?
>>
>> There's following code snippet in verifier.c:
>>
>> 	/* protect against potential stack overflow that might happen when
>> 	 * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
>> 	 * depth for such case down to 256 so that the worst case scenario
>> 	 * would result in 8k stack size (32 which is tailcall limit * 256 =
>> 	 * 8k).
>>
>> But, MAX_TAIL_CALL_CNT is 33.
>>
>> This was not brought up at LPC 2022&2023. I don't know whether this was
>> brought up at previous LPCs.
> 
> Was referring to this:
> https://lpc.events/event/17/contributions/1595/attachments/1230/2506/LPC2023_slides.pdf
> 

That's interesting. I tried to overflow the stack of one bpf prog, whose max
stack size is limited to 8k, with fexit. I failed because the thread's stack is
not that small. But, exceptionally, I was able to compose a tailcall infinite
loop with fexit, which broke the tail_call_cnt propagating chain because it did
not pass through the fexit's trampoline.

Thanks,
Leon

> but let us focus on issue you're bringing up here in this patch. I brought
> this up thinking JIT code was fine, now it's clear that things go south
> when entry prog is tailcall target, which we didn't test.
> 
>>
>> Thanks,
>> Leon
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 3/4] bpf, x64: Load tail_call_cnt pointer
  2023-12-11 18:03   ` Maciej Fijalkowski
@ 2023-12-13  2:49     ` Leon Hwang
  0 siblings, 0 replies; 21+ messages in thread
From: Leon Hwang @ 2023-12-13  2:49 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen



On 12/12/23 02:03, Maciej Fijalkowski wrote:
> On Wed, Oct 11, 2023 at 11:27:24PM +0800, Leon Hwang wrote:
>> Rename RESTORE_TAIL_CALL_CNT() to LOAD_TAIL_CALL_CNT_PTR().
>>
>> With previous commit, rax is used to propagate tail_call_cnt pointer
>> instead of tail_call_cnt. So, LOAD_TAIL_CALL_CNT_PTR() is better.
>>
>> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> 
> IMHO this is out of the scope for this set. We are going to target the bpf
> tree and this patch can be send to bpf-next in the future.
> 

LGTM.

Thanks,
Leon

>> ---
>>  arch/x86/net/bpf_jit_comp.c | 18 +++++++++---------
>>  1 file changed, 9 insertions(+), 9 deletions(-)
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing
  2023-12-11 18:05   ` Maciej Fijalkowski
@ 2023-12-13  3:09     ` Leon Hwang
  0 siblings, 0 replies; 21+ messages in thread
From: Leon Hwang @ 2023-12-13  3:09 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen



On 12/12/23 02:05, Maciej Fijalkowski wrote:
> On Wed, Oct 11, 2023 at 11:27:25PM +0800, Leon Hwang wrote:
>> Add some test cases to confirm the tailcall hierarchy issue has been fixed.
>>
>> tools/testing/selftests/bpf/test_progs -t tailcalls
>> 235/17  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
>> 235/18  tailcalls/tailcall_bpf2bpf_hierarchy_fentry:OK
>> 235/19  tailcalls/tailcall_bpf2bpf_hierarchy_fexit:OK
>> 235/20  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_fexit:OK
>> 235/21  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
>> 235/22  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
>> 235     tailcalls:OK
>> Summary: 1/22 PASSED, 0 SKIPPED, 0 FAILED
>>
>> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
>> ---
>>  .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++
> 
> Generally it feels like a lot of duplicated code from your previous
> fentry/fexit fixes, but I didn't look closely if it would be possible to
> pull out something in common.
> 

test_tailcall_hierarchy_count() and test_tailcall_count() are similar. They can
be combined into test_tailcall_count() with more arguments.

Thanks,
Leon

>>  .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  34 ++
>>  .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  55 +++
>>  .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  46 ++
>>  4 files changed, 553 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy
  2023-10-11 15:27 ` [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
  2023-12-05 23:03   ` Maciej Fijalkowski
@ 2023-12-21 12:02   ` Maciej Fijalkowski
  2023-12-21 14:56     ` Leon Hwang
  1 sibling, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2023-12-21 12:02 UTC (permalink / raw)
  To: Leon Hwang; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen

On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
> handling in JIT"), the tailcall on x64 works better than before.
> 
> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
> 
> How about:
> 
> 1. More than 1 subprograms are called in a bpf program.
> 2. The tailcalls in the subprograms call the bpf program.
> 
> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
> 
> As we know, in tail call context, the tail_call_cnt propagates by stack
> and rax register between BPF subprograms. So, propagating tail_call_cnt
> pointer by stack and rax register makes tail_call_cnt as like a global
> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
> hierarchy cases.
> 
> Before jumping to other bpf prog, load tail_call_cnt from the pointer
> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
> tail_call_cnt by its pointer.
> 
> But, where does tail_call_cnt store?
> 
> It stores on the stack of bpf prog's caller, like
> 
>     |  STACK  |
>     |         |
>     |   rip   |
>  +->|   tcc   |
>  |  |   rip   |
>  |  |   rbp   |
>  |  +---------+ RBP
>  |  |         |
>  |  |         |
>  |  |         |
>  +--| tcc_ptr |
>     |   rbx   |
>     +---------+ RSP
> 
> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
> instruction on BPF JIT epilogue").
> 
> Why not back-propagate tail_call_cnt?
> 
> It's because it's vulnerable to back-propagate it. It's unable to work
> well with the following case.
> 
> int prog1();
> int prog2();
> 
> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
> same time? The answer is NO. Otherwise, there will be a register to be
> polluted, which will make kernel crash.
> 
> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
>  1 file changed, 24 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index c2a0465d37da4..36631129cc800 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -256,7 +256,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE		5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
>  
>  static void push_r12(u8 **pprog)
>  {
> @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	EMIT_ENDBR();
>  	emit_nops(&prog, X86_PATCH_SIZE);
>  	if (!ebpf_from_cbpf) {
> -		if (tail_call_reachable && !is_subprog)
> +		if (tail_call_reachable && !is_subprog) {
>  			/* When it's the entry of the whole tailcall context,
>  			 * zeroing rax means initialising tail_call_cnt.
>  			 */
> -			EMIT2(0x31, 0xC0); /* xor eax, eax */
> -		else
> -			/* Keep the same instruction layout. */
> -			EMIT2(0x66, 0x90); /* nop2 */
> +			EMIT2(0x31, 0xC0);       /* xor eax, eax */
> +			EMIT1(0x50);             /* push rax */
> +			/* Make rax as ptr that points to tail_call_cnt. */
> +			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> +			EMIT1_off32(0xE8, 2);    /* call main prog */
> +			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> +			EMIT1(0xC3);             /* ret */
> +		} else {
> +			/* Keep the same instruction size. */
> +			emit_nops(&prog, 13);
> +		}

At first sight it seemed too invasive to me, but after trying out a few other
approaches, in the end it is elegant.

I wanted to avoid the somewhat puzzling call insn in the prologue with the
following prologue layout (this is based on the entry prog from
tailcall_bpf2bpf3.c, which was the first one to break):

ffffffffc0012cb4:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0012cb9:       55                      push   %rbp
ffffffffc0012cba:       48 89 e5                mov    %rsp,%rbp
ffffffffc0012cbd:       48 83 ec 10             sub    $0x10,%rsp
ffffffffc0012cc1:       48 89 65 f8             mov    %rsp,-0x8(%rbp)
ffffffffc0012cc5:       48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
ffffffffc0012ccc:       00
ffffffffc0012ccd:       48 8b 45 f8             mov    -0x8(%rbp),%rax
ffffffffc0012cd1:       50                      push   %rax
ffffffffc0012cd2:       48 81 ec 80 00 00 00    sub    $0x80,%rsp

So we would have hidden 16 bytes on the stack at the *beginning* of the entry
stack frame. The first thing right after rbp would be the tcc pointer, so
referring to it wouldn't require us to take the stack depth into account. But
then if we follow with the rest of the insns:

ffffffffc0012cd9:       31 f6                   xor    %esi,%esi
ffffffffc0012cdb:       48 89 75 f8             mov    %rsi,-0x8(%rbp)  // BUG, overwrite of tcc ptr
ffffffffc0012cdf:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
ffffffffc0012ce3:       48 89 75 e8             mov    %rsi,-0x18(%rbp)
ffffffffc0012ce7:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
ffffffffc0012ceb:       48 89 75 d8             mov    %rsi,-0x28(%rbp)
ffffffffc0012cef:       48 89 75 d0             mov    %rsi,-0x30(%rbp)
ffffffffc0012cf3:       48 89 75 c8             mov    %rsi,-0x38(%rbp)
ffffffffc0012cf7:       48 89 75 c0             mov    %rsi,-0x40(%rbp)
ffffffffc0012cfb:       48 89 75 b8             mov    %rsi,-0x48(%rbp)
ffffffffc0012cff:       48 89 75 b0             mov    %rsi,-0x50(%rbp)
ffffffffc0012d03:       48 89 75 a8             mov    %rsi,-0x58(%rbp)
ffffffffc0012d07:       48 89 75 a0             mov    %rsi,-0x60(%rbp)
ffffffffc0012d0b:       48 89 75 98             mov    %rsi,-0x68(%rbp)
ffffffffc0012d0f:       48 89 75 90             mov    %rsi,-0x70(%rbp)
ffffffffc0012d13:       48 89 75 88             mov    %rsi,-0x78(%rbp)
ffffffffc0012d17:       48 89 75 80             mov    %rsi,-0x80(%rbp)
ffffffffc0012d1b:       48 0f b6 75 ff          movzbq -0x1(%rbp),%rsi
ffffffffc0012d20:       40 88 75 ff             mov    %sil,-0x1(%rbp)
ffffffffc0012d24:       48 8b 85 f8 ff ff ff    mov    -0x8(%rbp),%rax
ffffffffc0012d2b:       e8 30 00 00 00          call   0xffffffffc0012d60
ffffffffc0012d30:       c9                      leave
ffffffffc0012d31:       c3                      ret

So even though it would seem more obvious, when looking at the prologue, what
the intent behind it is, this would require us to patch the instructions that
make use of R10/the stack, which in the end would be way more invasive.

After all, for x86 JIT code:
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

but it is a must to have a better commit message here.

Thanks!

>  	}
>  	/* Exception callback receives FP as third parameter */
>  	if (is_exception_cb) {
> @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>  	if (tail_call_reachable)
> +		/* Here, rax is tail_call_cnt_ptr. */
>  		EMIT1(0x50);         /* push rax */
>  	*pprog = prog;
>  }
> @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  					u32 stack_depth, u8 *ip,
>  					struct jit_context *ctx)
>  {
> -	int tcc_off = -4 - round_up(stack_depth, 8);
> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
>  
> @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>  
>  	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>  	EMIT2(X86_JAE, offset);                   /* jae out */
> -	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
> +	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
>  
>  	/* prog = array->ptrs[index]; */
>  	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  		pop_callee_regs(&prog, callee_regs_used);
>  	}
>  
> +	/* pop tail_call_cnt_ptr */
>  	EMIT1(0x58);                              /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
> @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  				      bool *callee_regs_used, u32 stack_depth,
>  				      struct jit_context *ctx)
>  {
> -	int tcc_off = -4 - round_up(stack_depth, 8);
> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
>  
> @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>  
>  	offset = ctx->tail_call_direct_label - (prog + 2 - start);
>  	EMIT2(X86_JAE, offset);                       /* jae out */
> -	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
> +	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
>  
>  	poke->tailcall_bypass = ip + (prog - start);
>  	poke->adj_off = X86_TAIL_CALL_OFFSET;
> @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  		pop_callee_regs(&prog, callee_regs_used);
>  	}
>  
> +	/* pop tail_call_cnt_ptr */
>  	EMIT1(0x58);                                  /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
> -- 
> 2.41.0
> 
> 
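
To restate what the reworked prologue achieves, here is a rough C model of
the mechanism, not the JIT output itself; entry_thunk() and prog_body() are
made-up names standing in for the patched prologue and the rest of the prog:

    #define MAX_TAIL_CALL_CNT 33            /* as in the kernel */

    static int prog_body(void *ctx, unsigned int *tcc_ptr);

    /* stands in for the patched prologue of the entry prog */
    static int entry_thunk(void *ctx)
    {
            unsigned int tail_call_cnt = 0; /* lives in the caller's frame */

            return prog_body(ctx, &tail_call_cnt);
    }

    static int prog_body(void *ctx, unsigned int *tcc_ptr)
    {
            /* ... prog instructions; tcc_ptr travels like rax does ... */

            /* every tail-call site in the chain shares one counter */
            if (*tcc_ptr >= MAX_TAIL_CALL_CNT)
                    return 0;               /* bypass the tail call */
            (*tcc_ptr)++;
            /* ... jump to the target prog, forwarding tcc_ptr ... */
            return 0;
    }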

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy
  2023-12-21 12:02   ` Maciej Fijalkowski
@ 2023-12-21 14:56     ` Leon Hwang
  2024-01-04  6:23       ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Leon Hwang @ 2023-12-21 14:56 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf, ast, daniel, andrii, jakub, iii, hengqi.chen



On 2023/12/21 20:02, Maciej Fijalkowski wrote:
> On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
>> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
>> handling in JIT"), the tailcall on x64 works better than before.
>>
>> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
>> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
>>
>> How about:
>>
>> 1. More than 1 subprogram is called in a bpf program.
>> 2. The tailcalls in the subprograms call the bpf program.
>>
>> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
>> comes up, and the MAX_TAIL_CALL_CNT limit does not work for this case.
>>
>> As we know, in a tail call context, tail_call_cnt propagates between BPF
>> subprograms via the stack and the rax register. So, propagating a
>> tail_call_cnt pointer via the stack and the rax register instead makes
>> tail_call_cnt behave like a global variable, so that the MAX_TAIL_CALL_CNT
>> limit works for tailcall hierarchy cases.
>>
>> Before jumping to another bpf prog, load tail_call_cnt through the pointer
>> and compare it with MAX_TAIL_CALL_CNT. Finally, increment tail_call_cnt
>> through its pointer.
>>
>> But where is tail_call_cnt stored?
>>
>> It is stored on the stack of the bpf prog's caller, like
>>
>>     |  STACK  |
>>     |         |
>>     |   rip   |
>>  +->|   tcc   |
>>  |  |   rip   |
>>  |  |   rbp   |
>>  |  +---------+ RBP
>>  |  |         |
>>  |  |         |
>>  |  |         |
>>  +--| tcc_ptr |
>>     |   rbx   |
>>     +---------+ RSP
>>
>> And tcc_ptr does not need to be popped from the stack in the epilogue of
>> the bpf prog, following the approach of commit d207929d97ea028f ("bpf, x64:
>> Drop "pop %rcx" instruction on BPF JIT epilogue").
>>
>> Why not back-propagate tail_call_cnt?
>>
>> Because back-propagating it is fragile: it is unable to work well with the
>> following case.
>>
>> int prog1();
>> int prog2();
>>
>> prog1 is the tail caller, and prog2 is the tail callee. If we back-propagate
>> tail_call_cnt in the epilogue of prog2, can prog2 also run standalone at the
>> same time? The answer is NO; otherwise, a register would be polluted, which
>> would crash the kernel.
>>
>> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
>> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
>> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
>> ---
>>  arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
>>  1 file changed, 24 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index c2a0465d37da4..36631129cc800 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -256,7 +256,7 @@ struct jit_context {
>>  /* Number of bytes emit_patch() needs to generate instructions */
>>  #define X86_PATCH_SIZE		5
>>  /* Number of bytes that will be skipped on tailcall */
>> -#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
>> +#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
>>  
>>  static void push_r12(u8 **pprog)
>>  {
>> @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>  	EMIT_ENDBR();
>>  	emit_nops(&prog, X86_PATCH_SIZE);
>>  	if (!ebpf_from_cbpf) {
>> -		if (tail_call_reachable && !is_subprog)
>> +		if (tail_call_reachable && !is_subprog) {
>>  			/* When it's the entry of the whole tailcall context,
>>  			 * zeroing rax means initialising tail_call_cnt.
>>  			 */
>> -			EMIT2(0x31, 0xC0); /* xor eax, eax */
>> -		else
>> -			/* Keep the same instruction layout. */
>> -			EMIT2(0x66, 0x90); /* nop2 */
>> +			EMIT2(0x31, 0xC0);       /* xor eax, eax */
>> +			EMIT1(0x50);             /* push rax */
>> +			/* Make rax as ptr that points to tail_call_cnt. */
>> +			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
>> +			EMIT1_off32(0xE8, 2);    /* call main prog */
>> +			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
>> +			EMIT1(0xC3);             /* ret */
>> +		} else {
>> +			/* Keep the same instruction size. */
>> +			emit_nops(&prog, 13);
>> +		}
> 
> At first sight it seemed too invasive to me, but after trying out a few
> other approaches, in the end it is elegant.
> 
> I wanted to avoid the somewhat puzzling call insn in the prologue by using
> the following prologue layout (this is based on the entry prog from
> tailcall_bpf2bpf3.c, which was the first one to break):
> 
> ffffffffc0012cb4:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0012cb9:       55                      push   %rbp
> ffffffffc0012cba:       48 89 e5                mov    %rsp,%rbp
> ffffffffc0012cbd:       48 83 ec 10             sub    $0x10,%rsp
> ffffffffc0012cc1:       48 89 65 f8             mov    %rsp,-0x8(%rbp)
> ffffffffc0012cc5:       48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
> ffffffffc0012ccc:       00
> ffffffffc0012ccd:       48 8b 45 f8             mov    -0x8(%rbp),%rax
> ffffffffc0012cd1:       50                      push   %rax
> ffffffffc0012cd2:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
> 
> So we would have hidden 16 bytes on the stack at the *beginning* of the
> entry stack frame. The first thing right after rbp would be the tcc pointer,
> so referring to it wouldn't require us to take the stack depth into account.
> But then if we follow with the rest of the insns:
> 
> ffffffffc0012cd9:       31 f6                   xor    %esi,%esi
> ffffffffc0012cdb:       48 89 75 f8             mov    %rsi,-0x8(%rbp)  // BUG, overwrite of tcc ptr
> ffffffffc0012cdf:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
> ffffffffc0012ce3:       48 89 75 e8             mov    %rsi,-0x18(%rbp)
> ffffffffc0012ce7:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
> ffffffffc0012ceb:       48 89 75 d8             mov    %rsi,-0x28(%rbp)
> ffffffffc0012cef:       48 89 75 d0             mov    %rsi,-0x30(%rbp)
> ffffffffc0012cf3:       48 89 75 c8             mov    %rsi,-0x38(%rbp)
> ffffffffc0012cf7:       48 89 75 c0             mov    %rsi,-0x40(%rbp)
> ffffffffc0012cfb:       48 89 75 b8             mov    %rsi,-0x48(%rbp)
> ffffffffc0012cff:       48 89 75 b0             mov    %rsi,-0x50(%rbp)
> ffffffffc0012d03:       48 89 75 a8             mov    %rsi,-0x58(%rbp)
> ffffffffc0012d07:       48 89 75 a0             mov    %rsi,-0x60(%rbp)
> ffffffffc0012d0b:       48 89 75 98             mov    %rsi,-0x68(%rbp)
> ffffffffc0012d0f:       48 89 75 90             mov    %rsi,-0x70(%rbp)
> ffffffffc0012d13:       48 89 75 88             mov    %rsi,-0x78(%rbp)
> ffffffffc0012d17:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> ffffffffc0012d1b:       48 0f b6 75 ff          movzbq -0x1(%rbp),%rsi
> ffffffffc0012d20:       40 88 75 ff             mov    %sil,-0x1(%rbp)
> ffffffffc0012d24:       48 8b 85 f8 ff ff ff    mov    -0x8(%rbp),%rax
> ffffffffc0012d2b:       e8 30 00 00 00          call   0xffffffffc0012d60
> ffffffffc0012d30:       c9                      leave
> ffffffffc0012d31:       c3                      ret
> 
> So even though the intent behind it would seem more obvious when looking at
> the prologue, this would require us to patch the instructions that make use
> of R10/stack, which in the end would be way more invasive.
> 
> After all, for x86 JIT code:
> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Thanks for your review.

> 
> but it is a must to have a better commit message here.
> 

I'll write a better commit message here.

Thanks,
Leon

> Thanks!
> 
>>  	}
>>  	/* Exception callback receives FP as third parameter */
>>  	if (is_exception_cb) {
>> @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>  	if (stack_depth)
>>  		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>>  	if (tail_call_reachable)
>> +		/* Here, rax is tail_call_cnt_ptr. */
>>  		EMIT1(0x50);         /* push rax */
>>  	*pprog = prog;
>>  }
>> @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>  					u32 stack_depth, u8 *ip,
>>  					struct jit_context *ctx)
>>  {
>> -	int tcc_off = -4 - round_up(stack_depth, 8);
>> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>>  	u8 *prog = *pprog, *start = *pprog;
>>  	int offset;
>>  
>> @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>>  	 *	goto out;
>>  	 */
>> -	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
>> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
>> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
>> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>>  
>>  	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>>  	EMIT2(X86_JAE, offset);                   /* jae out */
>> -	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
>> -	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
>> +	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
>>  
>>  	/* prog = array->ptrs[index]; */
>>  	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
>> @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>  		pop_callee_regs(&prog, callee_regs_used);
>>  	}
>>  
>> +	/* pop tail_call_cnt_ptr */
>>  	EMIT1(0x58);                              /* pop rax */
>>  	if (stack_depth)
>>  		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
>> @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>>  				      bool *callee_regs_used, u32 stack_depth,
>>  				      struct jit_context *ctx)
>>  {
>> -	int tcc_off = -4 - round_up(stack_depth, 8);
>> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>>  	u8 *prog = *pprog, *start = *pprog;
>>  	int offset;
>>  
>> @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>>  	 *	goto out;
>>  	 */
>> -	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
>> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
>> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
>> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>>  
>>  	offset = ctx->tail_call_direct_label - (prog + 2 - start);
>>  	EMIT2(X86_JAE, offset);                       /* jae out */
>> -	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
>> -	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
>> +	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
>>  
>>  	poke->tailcall_bypass = ip + (prog - start);
>>  	poke->adj_off = X86_TAIL_CALL_OFFSET;
>> @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>>  		pop_callee_regs(&prog, callee_regs_used);
>>  	}
>>  
>> +	/* pop tail_call_cnt_ptr */
>>  	EMIT1(0x58);                                  /* pop rax */
>>  	if (stack_depth)
>>  		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
>> -- 
>> 2.41.0
>>
>>
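
On the "why not back-propagate tail_call_cnt" question in the commit message
above, a minimal C sketch of the asymmetry (hypothetical names, not kernel
code): the same prog can be entered as a tail callee, where the caller
provides counter storage, or standalone, where nothing does, so an epilogue
that writes the counter back cannot be correct for both entry paths.

    /* entered as a tail callee: the caller handed over real storage */
    static int prog2_as_callee(unsigned int *tcc)
    {
            *tcc += 1;              /* writing the count back is fine here */
            return 0;
    }

    /* entered standalone: no caller set anything up */
    static int prog2_standalone(void)
    {
            unsigned int *tcc = 0;  /* whatever happened to be in rax */

            /* the very same epilogue store would now go through junk:
             *      *tcc += 1;
             * i.e. a polluted register, and the kernel crashes
             */
            (void)tcc;
            return 0;
    }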

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy
  2023-12-21 14:56     ` Leon Hwang
@ 2024-01-04  6:23       ` Alexei Starovoitov
  0 siblings, 0 replies; 21+ messages in thread
From: Alexei Starovoitov @ 2024-01-04  6:23 UTC (permalink / raw)
  To: Leon Hwang
  Cc: Maciej Fijalkowski, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Jakub Sitnicki, Ilya Leoshkevich, Hengqi Chen

On Thu, Dec 21, 2023 at 6:56 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
> >
> > but it is a must to have a better commit message here.
> >
>
> I'll write a better commit message here.

Please respin with updated commit messages and drop RFC tag.
Keep Reviewed-by that you received already.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-01-04  6:23 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-11 15:27 [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
2023-10-11 15:27 ` [RFC PATCH bpf-next v2 1/4] bpf, x64: Emit nops for X86_PATCH Leon Hwang
2023-12-05 13:08   ` Maciej Fijalkowski
2023-10-11 15:27 ` [RFC PATCH bpf-next v2 2/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
2023-12-05 23:03   ` Maciej Fijalkowski
2023-12-06  6:51     ` Leon Hwang
2023-12-11 18:02       ` Maciej Fijalkowski
2023-12-13  2:48         ` Leon Hwang
2023-12-21 12:02   ` Maciej Fijalkowski
2023-12-21 14:56     ` Leon Hwang
2024-01-04  6:23       ` Alexei Starovoitov
2023-10-11 15:27 ` [RFC PATCH bpf-next v2 3/4] bpf, x64: Load tail_call_cnt pointer Leon Hwang
2023-12-11 18:03   ` Maciej Fijalkowski
2023-12-13  2:49     ` Leon Hwang
2023-10-11 15:27 ` [RFC PATCH bpf-next v2 4/4] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
2023-12-11 18:05   ` Maciej Fijalkowski
2023-12-13  3:09     ` Leon Hwang
2023-11-16  8:33 ` [RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy Leon Hwang
2023-11-17 21:40   ` Alexei Starovoitov
2023-11-20 12:41     ` Maciej Fijalkowski
2023-12-05  3:09       ` Alexei Starovoitov
