* [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
@ 2020-05-11 14:39 Maciej Fijalkowski
  2020-05-11 14:39 ` [RFC PATCH bpf-next 1/1] " Maciej Fijalkowski
  2020-05-11 20:05 ` [RFC PATCH bpf-next 0/1] " Daniel Borkmann
  0 siblings, 2 replies; 8+ messages in thread
From: Maciej Fijalkowski @ 2020-05-11 14:39 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Hi!

Today, the BPF x86-64 JIT preserves all of the callee-saved registers
for each BPF program being JITed, even when none of the R6-R9 registers
are used by the BPF program. Furthermore, the tail call counter is
always pushed to/popped from the stack, even when the BPF program being
JITed makes no use of tail calls. An optimization can be introduced that
detects the usage of R6-R9 and, based on that, pushes/pops to/from the
stack only what is needed. The same goes for the tail call counter.
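
A rough sketch of the idea, in the JIT's EMIT style (the actual
implementation is in the patch that follows; callee_regs_used[] and
tail_call_seen stand for the result of a one-pass scan over the
program's instructions):

        if (tail_call_seen)
                EMIT2(0x6a, 0x00);      /* push 0, zero-init tail_call_cnt */
        if (callee_regs_used[0])
                EMIT1(0x53);            /* push rbx (BPF R6) */
        if (callee_regs_used[1])
                EMIT2(0x41, 0x55);      /* push r13 (BPF R7) */
        if (callee_regs_used[2])
                EMIT2(0x41, 0x56);      /* push r14 (BPF R8) */
        if (callee_regs_used[3])
                EMIT2(0x41, 0x57);      /* push r15 (BPF R9) */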

Results look promising for such an instruction reduction. Below are the
numbers for the xdp1 sample on an FVL 40G NIC receiving traffic from pktgen:

* With optimization: 22.3 Mpps
* Without:           19.0 Mpps

So it's around a 15% performance improvement. Note that xdp1 uses
neither the callee-saved registers nor tail calls, hence the speed-up.

There is one detail that needs to be handled though.

Currently, the x86-64 JIT tail call implementation skips the prologue of
the target BPF program, which has a constant size. With the mentioned
optimization implemented, each BPF program that might be inserted into
the prog array map, and therefore be the target of a tail call, could
have a different prologue size.

Let's look at a pseudo-code example:

func1:
pro
code
epi

func2:
pro
code'
epi

func3:
pro
code''
epi

Today, pro and epi are always the same (9 and 7 instructions,
respectively). So a tail call from func1 to func2 is just:

jump func2 + sizeof pro in bytes (PROLOGUE_SIZE)

With the optimization:

func1:
pro
code
epi

func2:
pro'
code'
epi'

func3:
pro''
code''
epi''

To keep tail calls working with the mentioned optimization in place, the
x86-64 JIT should emit, before the actual jump, pop instructions for the
registers that were pushed in the prologue. The jump offset should then
skip only the instructions that handle rbp/rsp, not the whole prologue.

A tail call within func1 would then need to be:
epi -> pop what pro pushed, but no leave/ret instructions
jump func2 + 16 // first push insn of pro'; if there is no push, this
                // would be a direct jump to code'

The magic value of 16 is the byte count of the instructions that are
skipped:
0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
55                      push   %rbp
48 89 e5                mov    %rsp,%rbp
48 81 ec 08 00 00 00    sub    $0x8,%rsp

This scheme would in many cases add *more* instructions per tail call.
If none of the callee-saved registers are used, though, there would be
no overhead with such an optimization in place.
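
For reference, the tail call site itself would then emit roughly the
following (EMIT style; this matches the indirect tail call path of the
attached patch, where rax already holds the target prog pointer):

        pop_callee_regs(&prog, callee_regs_used);  /* undo this prog's pushes */
        EMIT4(0x48, 0x8B, 0x40,                    /* mov rax, [rax + offsetof(bpf_func)] */
              offsetof(struct bpf_prog, bpf_func));
        EMIT4(0x48, 0x83, 0xC0, 16);               /* add rax, 16: skip nop5 + rbp/rsp setup */
        RETPOLINE_RAX_BPF_JIT();                   /* jmp *rax */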

I'm not sure how to properly measure the impact on BPF programs that
utilize tail calls. Any suggestions?

Daniel, Alexei, what is your view on this?

For implementation details, see the commit message of the included patch.

Thank you,
Maciej


Maciej Fijalkowski (1):
  bpf, x64: optimize JIT prologue/epilogue generation

 arch/x86/net/bpf_jit_comp.c | 190 ++++++++++++++++++++++++++++--------
 1 file changed, 148 insertions(+), 42 deletions(-)

-- 
2.20.1



* [RFC PATCH bpf-next 1/1] bpf, x64: optimize JIT prologue/epilogue generation
  2020-05-11 14:39 [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation Maciej Fijalkowski
@ 2020-05-11 14:39 ` Maciej Fijalkowski
  2020-05-11 20:05 ` [RFC PATCH bpf-next 0/1] " Daniel Borkmann
  1 sibling, 0 replies; 8+ messages in thread
From: Maciej Fijalkowski @ 2020-05-11 14:39 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Optimize the prologue and epilogue JIT sections in two ways. First,
teach the x86-64 JIT to detect whether R6-R9 are unused by the program
being JITed and, if so, don't emit the push/pop insn pair for each such
callee-saved register. Second, don't initialize the tail call counter on
the stack when there is no tail call usage in the picture.

While these changes are straightforward, taking care of the tail call
code is more cumbersome.

With the described optimization in place, it is not guaranteed that the
callee-saved registers would be properly restored after a tail call, as
the target program might simply not use the R6-R9 registers and
therefore will not pop them in its epilogue section.

Precede the actual tail call with pops of the callee-saved registers.
The tail call will skip the instructions that handle rbp/rsp, but will
execute the target's pushes of callee-saved registers, if any.

For the direct tail call case, there is one situation that needs special
treatment: the direct jump can be a NOP, so in order to proceed with the
normal flow, emit instructions that push R6-R9 back onto the stack
(since they have been popped before the tail call).

In order to still be able to refer to the tail call counter on the
stack, flip its placement so that it comes first after the subtracted
rsp, followed by the callee-saved registers, if any. Modify the
emit_bpf_tail_call_[in]direct routines to reflect that change, e.g. the
instructions that pick the tail call counter from the stack, bump it by
1 and update the value on the stack now need to use a -516 stack offset.
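
To make the new placement easier to follow, here is a rough sketch of
the resulting stack layout after the prologue (derived from the code
below; the 512-byte clamp done in fixup_bpf_calls still applies in this
version):

        /*
         *   rbp                     saved rbp of the caller
         *   [rbp - 512, rbp)        BPF program stack (MAX_BPF_STACK)
         *   [rbp - 520, rbp - 512)  tail_call_cnt slot, pushed as a qword
         *                           and accessed as a dword at rbp - 516
         *   below that              pushed rbx/r13/r14/r15, only those used
         */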

Since the stack layout has changed, tail call counter handling can no
longer rely on popping it into rbx, as was done for the constant
prologue case, where rbx was later overwritten with the actual value of
rbx that had been pushed to the stack. Therefore, use one of the
registers that are considered volatile/caller-saved (rcx) and pop the
value of the tail call counter into it in the epilogue.

Drop the BUILD_BUG_ON checks in emit_prologue and in
emit_bpf_tail_call_indirect, where the instruction layout is no longer
constant.

For regression checks, 'tailcalls' kselftest was executed:
$ sudo ./test_progs -t tailcalls
#64/1 tailcall_1:OK
#64/2 tailcall_2:OK
#64/3 tailcall_3:OK
#64/4 tailcall_4:OK
#64/5 tailcall_5:OK
#64 tailcalls:OK
Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED

Tail call related cases from test_verifier kselftest are also working
fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
work properly as well.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 190 ++++++++++++++++++++++++++++--------
 1 file changed, 148 insertions(+), 42 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 5ea7c2cf7ab4..f208bcad856c 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -209,13 +209,44 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 
-#define PROLOGUE_SIZE		25
+static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (callee_regs_used[0])
+		EMIT1(0x53);         /* push rbx */
+	if (callee_regs_used[1])
+		EMIT2(0x41, 0x55);   /* push r13 */
+	if (callee_regs_used[2])
+		EMIT2(0x41, 0x56);   /* push r14 */
+	if (callee_regs_used[3])
+		EMIT2(0x41, 0x57);   /* push r15 */
+	*pprog = prog;
+}
+
+static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (callee_regs_used[3])
+		EMIT2(0x41, 0x5F);   /* pop r15 */
+	if (callee_regs_used[2])
+		EMIT2(0x41, 0x5E);   /* pop r14 */
+	if (callee_regs_used[1])
+		EMIT2(0x41, 0x5D);   /* pop r13 */
+	if (callee_regs_used[0])
+		EMIT1(0x5B);         /* pop rbx */
+	*pprog = prog;
+}
 
 /*
- * Emit x86-64 prologue code for BPF program and check its size.
+ * Emit x86-64 prologue code for BPF program.
  * bpf_tail_call helper will skip it while jumping into another program
  */
-static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
+static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
+			  bool tail_call)
 {
 	u8 *prog = *pprog;
 	int cnt = X86_PATCH_SIZE;
@@ -229,15 +260,8 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
 	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
 	/* sub rsp, rounded_stack_depth */
 	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	EMIT1(0x53);             /* push rbx */
-	EMIT2(0x41, 0x55);       /* push r13 */
-	EMIT2(0x41, 0x56);       /* push r14 */
-	EMIT2(0x41, 0x57);       /* push r15 */
-	if (!ebpf_from_cbpf) {
-		/* zero init tail_call_cnt */
+	if (!ebpf_from_cbpf && tail_call)
 		EMIT2(0x6a, 0x00);
-		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
-	}
 	*pprog = prog;
 }
 
@@ -338,12 +362,38 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
  *   goto *(prog->bpf_func + prologue_size);
  * out:
  */
-static void emit_bpf_tail_call_indirect(u8 **pprog)
+static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used)
 {
 	u8 *prog = *pprog;
-	int label1, label2, label3;
+	int off1 = 41;
+	int off2 = 30;
+	int off3 = 8;
 	int cnt = 0;
 
+	/* count the additional bytes used for pushing callee regs to stack
+	 * that need to be taken into account for each of the offsets that
+	 * are used for bailing out of the tail call
+	 */
+	if (callee_regs_used[3]) {
+		off1 += 2;
+		off2 += 2;
+		off3 += 2;
+	}
+	if (callee_regs_used[2]) {
+		off1 += 2;
+		off2 += 2;
+		off3 += 2;
+	}
+	if (callee_regs_used[1]) {
+		off1 += 2;
+		off2 += 2;
+		off3 += 2;
+	}
+	if (callee_regs_used[0]) {
+		off1 += 1;
+		off2 += 1;
+		off3 += 1;
+	}
 	/*
 	 * rdi - pointer to ctx
 	 * rsi - pointer to bpf_array
@@ -357,21 +407,19 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
 	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
 	      offsetof(struct bpf_array, map.max_entries));
-#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
+#define OFFSET1 (off1 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
-	label1 = cnt;
 
 	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
+	EMIT2_off32(0x8B, 0x85, -4 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 516] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
-#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
+#define OFFSET2 (off2 + RETPOLINE_RAX_BPF_JIT_SIZE)
 	EMIT2(X86_JA, OFFSET2);                   /* ja out */
-	label2 = cnt;
 	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
+	EMIT2_off32(0x89, 0x85, -4 - MAX_BPF_STACK); /* mov dword ptr [rbp - 516], eax */
 
 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
@@ -382,14 +430,17 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	 *	goto out;
 	 */
 	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
-#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
+#define OFFSET3 (off3 + RETPOLINE_RAX_BPF_JIT_SIZE)
 	EMIT2(X86_JE, OFFSET3);                   /* je out */
-	label3 = cnt;
+
+	*pprog = prog;
+	pop_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
 
 	/* goto *(prog->bpf_func + prologue_size); */
 	EMIT4(0x48, 0x8B, 0x40,                   /* mov rax, qword ptr [rax + 32] */
 	      offsetof(struct bpf_prog, bpf_func));
-	EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
+	EMIT4(0x48, 0x83, 0xC0, 16);   /* add rax, prologue_size */
 
 	/*
 	 * Wow we're ready to jump into next BPF program
@@ -399,33 +450,64 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	RETPOLINE_RAX_BPF_JIT();
 
 	/* out: */
-	BUILD_BUG_ON(cnt - label1 != OFFSET1);
-	BUILD_BUG_ON(cnt - label2 != OFFSET2);
-	BUILD_BUG_ON(cnt - label3 != OFFSET3);
 	*pprog = prog;
 }
 
 static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
-				      u8 **pprog, int addr, u8 *image)
+				      u8 **pprog, int addr, u8 *image,
+				      bool *callee_regs_used)
 {
 	u8 *prog = *pprog;
 	int cnt = 0;
+	int off1 = 14;
+	int poke_off = 0;
 
+	/* count the additional bytes used for pushing callee regs to stack
+	 * that need to be taken into account for offset that is used for
+	 * bailing out of the tail call and the poke->ip
+	 */
+	if (callee_regs_used[3]) {
+		off1 += 4;
+		poke_off += 2;
+	}
+	if (callee_regs_used[2]) {
+		off1 += 4;
+		poke_off += 2;
+	}
+	if (callee_regs_used[1]) {
+		off1 += 4;
+		poke_off += 2;
+	}
+	if (callee_regs_used[0]) {
+		off1 += 2;
+		poke_off += 1;
+	}
 	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
+	EMIT2_off32(0x8B, 0x85, -4 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 516] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
-	EMIT2(X86_JA, 14);                            /* ja out */
+	EMIT2(X86_JA, off1);                            /* ja out */
 	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
+	EMIT2_off32(0x89, 0x85, -4 - MAX_BPF_STACK); /* mov dword ptr [rbp - 516], eax */
+
+	poke->ip = image + (addr - X86_PATCH_SIZE - poke_off);
+	poke->adj_off = 16;
 
-	poke->ip = image + (addr - X86_PATCH_SIZE);
-	poke->adj_off = PROLOGUE_SIZE;
+	*pprog = prog;
+	pop_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
 
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
 	prog += X86_PATCH_SIZE;
+
+	/* in case of the target program being a NOP, restore the callee
+	 * registers on stack
+	 */
+	*pprog = prog;
+	push_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
 	/* out: */
 
 	*pprog = prog;
@@ -640,19 +722,44 @@ static bool ex_handler_bpf(const struct exception_table_entry *x,
 	return true;
 }
 
+static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
+			     bool *regs_used, bool *tail_call_seen)
+{
+	int i;
+
+	for (i = 1; i <= insn_cnt; i++, insn++) {
+		if (insn->code == (BPF_JMP | BPF_TAIL_CALL))
+			*tail_call_seen = true;
+		if (insn->dst_reg == BPF_REG_6 || insn->src_reg == BPF_REG_6)
+			regs_used[0] = true;
+		if (insn->dst_reg == BPF_REG_7 || insn->src_reg == BPF_REG_7)
+			regs_used[1] = true;
+		if (insn->dst_reg == BPF_REG_8 || insn->src_reg == BPF_REG_8)
+			regs_used[2] = true;
+		if (insn->dst_reg == BPF_REG_9 || insn->src_reg == BPF_REG_9)
+			regs_used[3] = true;
+	}
+}
+
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 		  int oldproglen, struct jit_context *ctx)
 {
 	struct bpf_insn *insn = bpf_prog->insnsi;
+	bool callee_regs_used[4] = {};
 	int insn_cnt = bpf_prog->len;
+	bool tail_call_seen = false;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
 	int i, cnt = 0, excnt = 0;
 	int proglen = 0;
 	u8 *prog = temp;
 
+	detect_reg_usage(insn, insn_cnt, callee_regs_used,
+			 &tail_call_seen);
+
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
-		      bpf_prog_was_classic(bpf_prog));
+		      bpf_prog_was_classic(bpf_prog), tail_call_seen);
+	push_callee_regs(&prog, callee_regs_used);
 	addrs[0] = prog - temp;
 
 	for (i = 1; i <= insn_cnt; i++, insn++) {
@@ -1097,9 +1204,11 @@ xadd:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_TAIL_CALL:
 			if (imm32)
 				emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1],
-							  &prog, addrs[i], image);
+							  &prog, addrs[i], image,
+							  callee_regs_used);
 			else
-				emit_bpf_tail_call_indirect(&prog);
+				emit_bpf_tail_call_indirect(&prog,
+							    callee_regs_used);
 			break;
 
 			/* cond jump */
@@ -1282,14 +1391,11 @@ xadd:			if (is_imm8(insn->off))
 			seen_exit = true;
 			/* Update cleanup_addr */
 			ctx->cleanup_addr = proglen;
-			if (!bpf_prog_was_classic(bpf_prog))
-				EMIT1(0x5B); /* get rid of tail_call_cnt */
-			EMIT2(0x41, 0x5F);   /* pop r15 */
-			EMIT2(0x41, 0x5E);   /* pop r14 */
-			EMIT2(0x41, 0x5D);   /* pop r13 */
-			EMIT1(0x5B);         /* pop rbx */
-			EMIT1(0xC9);         /* leave */
-			EMIT1(0xC3);         /* ret */
+			pop_callee_regs(&prog, callee_regs_used);
+			if (!bpf_prog_was_classic(bpf_prog) && tail_call_seen)
+				EMIT1(0x59);         /* pop rcx, get rid of tail_call_cnt */
+			EMIT1(0xC9);                 /* leave */
+			EMIT1(0xC3);                 /* ret */
 			break;
 
 		default:
-- 
2.20.1



* Re: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
  2020-05-11 14:39 [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation Maciej Fijalkowski
  2020-05-11 14:39 ` [RFC PATCH bpf-next 1/1] " Maciej Fijalkowski
@ 2020-05-11 20:05 ` Daniel Borkmann
  2020-05-12  0:01   ` Alexei Starovoitov
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Borkmann @ 2020-05-11 20:05 UTC (permalink / raw)
  To: Maciej Fijalkowski, ast
  Cc: bpf, netdev, bjorn.topel, magnus.karlsson, lmb, john.fastabend

Hey Maciej,

On 5/11/20 4:39 PM, Maciej Fijalkowski wrote:
> Hi!
> 
> Today, BPF x86-64 JIT is preserving all of the callee-saved registers
> for each BPF program being JITed, even when none of the R6-R9 registers
> are used by the BPF program. Furthermore the tail call counter is always
> pushed/popped to/from the stack even when there is no tail call usage in
> BPF program being JITed. Optimization can be introduced that would
> detect the usage of R6-R9 and based on that push/pop to/from the stack
> only what is needed. Same goes for tail call counter.
> 
> Results look promising for such instruction reduction. Below are the
> numbers for xdp1 sample on FVL 40G NIC receiving traffic from pktgen:
> 
> * With optimization: 22.3 Mpps
> * Without:           19.0 mpps
> 
> So it's around 15% of performance improvement. Note that xdp1 is not
> using any of callee saved registers, nor the tail call, hence such
> speed-up.
> 
> There is one detail that needs to be handled though.
> 
> Currently, x86-64 JIT tail call implementation is skipping the prologue
> of target BPF program that has constant size. With the mentioned
> optimization implemented, each particular BPF program that might be
> inserted onto the prog array map and therefore be the target of tail
> call, could have various prologue size.
> 
> Let's have some pseudo-code example:
> 
> func1:
> pro
> code
> epi
> 
> func2:
> pro
> code'
> epi
> 
> func3:
> pro
> code''
> epi
> 
> Today, pro and epi are always the same (9/7) instructions. So a tail
> call from func1 to func2 is just a:
> 
> jump func2 + sizeof pro in bytes (PROLOGUE_SIZE)
> 
> With the optimization:
> 
> func1:
> pro
> code
> epi
> 
> func2:
> pro'
> code'
> epi'
> 
> func3:
> pro''
> code''
> epi''
> 
> For making the tail calls up and running with the mentioned optimization
> in place, x86-64 JIT should emit the pop registers instructions
> that were pushed on prologue before the actual jump. Jump offset should
> skip the instructions that are handling rbp/rsp, not the whole prologue.
> 
> A tail call within func1 would then need to be:
> epi -> pop what pro pushed, but no leave/ret instructions
> jump func2 + 16 // first push insn of pro'; if no push, then this would
>                  // a direct jump to code'
> 
> Magic value of 16 comes from count of bytes that represent instructions
> that are skipped:
> 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> 55                      push   %rbp
> 48 89 e5                mov    %rsp,%rbp
> 48 81 ec 08 00 00 00    sub    $0x8,%rsp
> 
> which would in many cases add *more* instructions for tailcalls. If none
> of callee-saved registers are used, then there would be no overhead with
> such optimization in place.
> 
> I'm not sure how to measure properly the impact on the BPF programs that
> are utilizing tail calls. Any suggestions?

Right, so far the numbers above (no callee-saved registers, no tail calls)
are really the best case scenario. I think programs not using callee-saved
registers are probably very limited in what they do, and tail calls are often
used as well (although such programs are good enough for AF_XDP, for example).
So I wonder how far we would regress with callee-saved registers and tail
calls. For Cilium right now you can roughly assume a worst case tail call
depth of ~6 with static jumps (that we patch to jmp/nop). Only in one case do
we have a tail call map index that is non-static. In terms of registers,
assume all of them are used one way or another. If you could check the impact
in such a setting, that would be great.

> Daniel, Alexei, what is your view on this?

I think performance-wise this would be both a pro and a con depending on how
tail calls are used. One upside however, and I think you didn't mention it
here, would be that we don't need to clamp the used stack space to 512, so we
could actually track how much stack is used (or if any is used at all) and
adapt it between tail calls? Depending on the numbers, if we'd go that route,
it should rather be generalized and tracked via the verifier so all JITs can
behave the same (and these workarounds in the verifier lifted). But then a 15%
performance improvement as you state above is a lot, and we would probably
regress at least as much in your benchmark. I wonder whether there should be
a knob for it, though it's mainly an implementation detail.

> For implementation details, see commit message of included patch.
> 
> Thank you,
> Maciej
> 
> 
> Maciej Fijalkowski (1):
>    bpf, x64: optimize JIT prologue/epilogue generation
> 
>   arch/x86/net/bpf_jit_comp.c | 190 ++++++++++++++++++++++++++++--------
>   1 file changed, 148 insertions(+), 42 deletions(-)
> 



* Re: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
  2020-05-11 20:05 ` [RFC PATCH bpf-next 0/1] " Daniel Borkmann
@ 2020-05-12  0:01   ` Alexei Starovoitov
  2020-05-13 11:58     ` Maciej Fijalkowski
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2020-05-12  0:01 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Maciej Fijalkowski, ast, bpf, netdev, bjorn.topel,
	magnus.karlsson, lmb, john.fastabend

On Mon, May 11, 2020 at 10:05:25PM +0200, Daniel Borkmann wrote:
> Hey Maciej,
> 
> On 5/11/20 4:39 PM, Maciej Fijalkowski wrote:
> > Hi!
> > 
> > Today, BPF x86-64 JIT is preserving all of the callee-saved registers
> > for each BPF program being JITed, even when none of the R6-R9 registers
> > are used by the BPF program. Furthermore the tail call counter is always
> > pushed/popped to/from the stack even when there is no tail call usage in
> > BPF program being JITed. Optimization can be introduced that would
> > detect the usage of R6-R9 and based on that push/pop to/from the stack
> > only what is needed. Same goes for tail call counter.
> > 
> > Results look promising for such instruction reduction. Below are the
> > numbers for xdp1 sample on FVL 40G NIC receiving traffic from pktgen:
> > 
> > * With optimization: 22.3 Mpps
> > * Without:           19.0 mpps
> > 
> > So it's around 15% of performance improvement. Note that xdp1 is not
> > using any of callee saved registers, nor the tail call, hence such
> > speed-up.
> > 
> > There is one detail that needs to be handled though.
> > 
> > Currently, x86-64 JIT tail call implementation is skipping the prologue
> > of target BPF program that has constant size. With the mentioned
> > optimization implemented, each particular BPF program that might be
> > inserted onto the prog array map and therefore be the target of tail
> > call, could have various prologue size.
> > 
> > Let's have some pseudo-code example:
> > 
> > func1:
> > pro
> > code
> > epi
> > 
> > func2:
> > pro
> > code'
> > epi
> > 
> > func3:
> > pro
> > code''
> > epi
> > 
> > Today, pro and epi are always the same (9/7) instructions. So a tail
> > call from func1 to func2 is just a:
> > 
> > jump func2 + sizeof pro in bytes (PROLOGUE_SIZE)
> > 
> > With the optimization:
> > 
> > func1:
> > pro
> > code
> > epi
> > 
> > func2:
> > pro'
> > code'
> > epi'
> > 
> > func3:
> > pro''
> > code''
> > epi''
> > 
> > For making the tail calls up and running with the mentioned optimization
> > in place, x86-64 JIT should emit the pop registers instructions
> > that were pushed on prologue before the actual jump. Jump offset should
> > skip the instructions that are handling rbp/rsp, not the whole prologue.
> > 
> > A tail call within func1 would then need to be:
> > epi -> pop what pro pushed, but no leave/ret instructions
> > jump func2 + 16 // first push insn of pro'; if no push, then this would
> >                  // a direct jump to code'
> > 
> > Magic value of 16 comes from count of bytes that represent instructions
> > that are skipped:
> > 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > 55                      push   %rbp
> > 48 89 e5                mov    %rsp,%rbp
> > 48 81 ec 08 00 00 00    sub    $0x8,%rsp
> > 
> > which would in many cases add *more* instructions for tailcalls. If none
> > of callee-saved registers are used, then there would be no overhead with
> > such optimization in place.
> > 
> > I'm not sure how to measure properly the impact on the BPF programs that
> > are utilizing tail calls. Any suggestions?
> 
> Right, so far the numbers above (no callee saved registers, no tail calls)
> are really the best case scenario. I think programs not using callee saved
> registers are probably very limited in what they do, and tail calls are often
> used as well (although good enough for AF_XDP, for example). So I wonder how
> far we would regress with callee saved registers and tail calls. For Cilium
> right now you can roughly assume a worst case tail call depth of ~6 with static
> jumps (that we patch to jmp/nop). Only in one case we have a tail call map
> index that is non-static. In terms of registers, assume all of them are used
> one way or another. If you could check the impact in such setting, that would
> be great.
> 
> > Daniel, Alexei, what is your view on this?
> 
> I think performance wise this would be both pro and con depending how tail
> calls are used. One upside however, and I think you didn't mention it here
> would be that we don't need to clamp used stack space to 512, so we could
> actually track how much stack is used (or if any is used at all) and adapt
> it between tail calls? Depending on the numbers, if we'd go that route, it
> should rather be generalized and tracked via verifier so all JITs can behave
> the same (and these workarounds in verifier lifted). But then 15% performance
> improvement as you state above is a lot, probably we might regress at least
> as much as well in your benchmark. I wonder whether there should be a knob
> for it, though it's mainly implementation detail..

I was thinking about a knob too, but users are rarely going to touch it,
so if we go for opt-in it will mostly be unused except by a few folks.
So I think it's better to go with this approach unconditionally,
but first I'd like to see performance numbers for how it regresses
the common case. AF_XDP's empty prog that does 'return XDP_PASS' is
a rare case and imo not worth optimizing for, but I see a lot of value
if this approach allows lifting the tail_call vs bpf2bpf restriction.
It looks to me that bpf2bpf will be able to work: prog_A will call into
prog_B, and if that prog does any kind of tail_call, that tail_call will
eventually finish and the execution will return to prog_A as normal.
So I'd like to ask for two things:
1. perf numbers for something like progs/bpf_flow.c before and after
2. removal of the bpf2bpf vs tail_call restriction and new selftests to
prove that it's working; that will include dropping the 512 stack clamp.


* Re: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
  2020-05-12  0:01   ` Alexei Starovoitov
@ 2020-05-13 11:58     ` Maciej Fijalkowski
  2020-05-17  4:32       ` getting bpf_tail_call to work with bpf function calls. Was: " Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Maciej Fijalkowski @ 2020-05-13 11:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson,
	lmb, john.fastabend

On Mon, May 11, 2020 at 05:01:53PM -0700, Alexei Starovoitov wrote:
> On Mon, May 11, 2020 at 10:05:25PM +0200, Daniel Borkmann wrote:
> > Hey Maciej,

Sorry for the delay.
Combining two answers here (Alexei/Daniel). I appreciate your input.

> > 
> > On 5/11/20 4:39 PM, Maciej Fijalkowski wrote:
> > > Hi!
> > > 
> > > Today, BPF x86-64 JIT is preserving all of the callee-saved registers
> > > for each BPF program being JITed, even when none of the R6-R9 registers
> > > are used by the BPF program. Furthermore the tail call counter is always
> > > pushed/popped to/from the stack even when there is no tail call usage in
> > > BPF program being JITed. Optimization can be introduced that would
> > > detect the usage of R6-R9 and based on that push/pop to/from the stack
> > > only what is needed. Same goes for tail call counter.
> > > 
> > > Results look promising for such instruction reduction. Below are the
> > > numbers for xdp1 sample on FVL 40G NIC receiving traffic from pktgen:
> > > 
> > > * With optimization: 22.3 Mpps
> > > * Without:           19.0 mpps
> > > 
> > > So it's around 15% of performance improvement. Note that xdp1 is not
> > > using any of callee saved registers, nor the tail call, hence such
> > > speed-up.
> > > 
> > > There is one detail that needs to be handled though.
> > > 
> > > Currently, x86-64 JIT tail call implementation is skipping the prologue
> > > of target BPF program that has constant size. With the mentioned
> > > optimization implemented, each particular BPF program that might be
> > > inserted onto the prog array map and therefore be the target of tail
> > > call, could have various prologue size.
> > > 
> > > Let's have some pseudo-code example:
> > > 
> > > func1:
> > > pro
> > > code
> > > epi
> > > 
> > > func2:
> > > pro
> > > code'
> > > epi
> > > 
> > > func3:
> > > pro
> > > code''
> > > epi
> > > 
> > > Today, pro and epi are always the same (9/7) instructions. So a tail
> > > call from func1 to func2 is just a:
> > > 
> > > jump func2 + sizeof pro in bytes (PROLOGUE_SIZE)
> > > 
> > > With the optimization:
> > > 
> > > func1:
> > > pro
> > > code
> > > epi
> > > 
> > > func2:
> > > pro'
> > > code'
> > > epi'
> > > 
> > > func3:
> > > pro''
> > > code''
> > > epi''
> > > 
> > > For making the tail calls up and running with the mentioned optimization
> > > in place, x86-64 JIT should emit the pop registers instructions
> > > that were pushed on prologue before the actual jump. Jump offset should
> > > skip the instructions that are handling rbp/rsp, not the whole prologue.
> > > 
> > > A tail call within func1 would then need to be:
> > > epi -> pop what pro pushed, but no leave/ret instructions
> > > jump func2 + 16 // first push insn of pro'; if no push, then this would
> > >                  // a direct jump to code'
> > > 
> > > Magic value of 16 comes from count of bytes that represent instructions
> > > that are skipped:
> > > 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > > 55                      push   %rbp
> > > 48 89 e5                mov    %rsp,%rbp
> > > 48 81 ec 08 00 00 00    sub    $0x8,%rsp
> > > 
> > > which would in many cases add *more* instructions for tailcalls. If none
> > > of callee-saved registers are used, then there would be no overhead with
> > > such optimization in place.
> > > 
> > > I'm not sure how to measure properly the impact on the BPF programs that
> > > are utilizing tail calls. Any suggestions?
> > 
> > Right, so far the numbers above (no callee saved registers, no tail calls)
> > are really the best case scenario. I think programs not using callee saved
> > registers are probably very limited in what they do, and tail calls are often
> > used as well (although good enough for AF_XDP, for example). So I wonder how
> > far we would regress with callee saved registers and tail calls. For Cilium
> > right now you can roughly assume a worst case tail call depth of ~6 with static
> > jumps (that we patch to jmp/nop). Only in one case we have a tail call map
> > index that is non-static. In terms of registers, assume all of them are used
> > one way or another. If you could check the impact in such setting, that would
> > be great.

Alexei suggested progs/bpf_flow.c - is that a good start for you? It won't
reproduce the Cilium environment, but I suppose it would give us a taste
of how much regression we might introduce with this approach for tail
calls.

> > 
> > > Daniel, Alexei, what is your view on this?
> > 
> > I think performance wise this would be both pro and con depending how tail
> > calls are used. One upside however, and I think you didn't mention it here
> > would be that we don't need to clamp used stack space to 512, so we could
> > actually track how much stack is used (or if any is used at all) and adapt
> > it between tail calls? Depending on the numbers, if we'd go that route, it

Hm, I didn't mention that because I don't destroy the stack frame of the
tail call 'caller' program (I mean that if A tail calls to B, then A is
the 'caller'). So, sort of, the old x86-64 JIT behavior is kept - if B
returns, it will go over the leave/ret pair so that the stack frame
(that was created by program A) is destroyed and we get back to the
address that was pushed to the stack by the caller of the main BPF
program (A).

Program A, before the actual tail call, will only pop the callee-saved
registers, and B will not manipulate rbp/rsp in its prologue.

Such an approach allows us to refer to the tail call counter at a
constant offset, with the little hack that the tail call counter is the
first thing pushed to the stack, followed by the callee-saved registers
(if any), so throughout the tail calls its placement on the stack is
left untouched.

So to me, if we would like to get rid of maxing out the stack space,
then we would have to do some dancing to preserve the tail call counter
- keep it in some unused register? Or the epilogue would pop it from the
stack into some register and the target program's prologue would push it
to the stack from that register (I am probably making this up). And
rbp/rsp would need to be created/destroyed during the program-to-program
transition that happens via tail call. That would also mean more
instructions.

BTW, was maxing out the stack space done because of the way the tail
call counter is handled in the x86-64 JIT?

I might be wrong though! :) or missing something obvious. Did you have a
chance to look at the actual patch that went along with this cover letter?

> > should rather be generalized and tracked via verifier so all JITs can behave
> > the same (and these workarounds in verifier lifted). But then 15% performance
> > improvement as you state above is a lot, probably we might regress at least
> > as much as well in your benchmark. I wonder whether there should be a knob
> > for it, though it's mainly implementation detail..
> 
> I was thinking about knob too, but users are rarely going to touch it,
> so if we go for opt-in it will mostly be unused except by few folks.
> So I think it's better to go with this approach unconditionally,
> but first I'd like to see the performance numbers in how it regresses
> the common case. AF_XDP's empty prog that does 'return XDP_PASS' is
> a rare case and imo not worth optimizing for, but I see a lot of value
> if this approach allows to lift tail_call vs bpf2bpf restriction.
> It looks to me that bpf2bpf will be able to work. prog_A will call into prog_B
> and if that prog does any kind of tail_call that tail_call will
> eventually finish and the execution will return to prog_A as normal.

If I am not missing anything, then this would involve the rbp/rsp
handling I mentioned earlier? If we agree on some way of handling the
tail call counter, then I can try this out (no 512 stack clamp and
handling the stack frame throughout tail calls).

> So I'd like to ask for two things:
> 1. perf numbers for something like progs/bpf_flow.c before and after
> 2. removal of bpf2bpf vs tail_call restriction and new selftests to prove
> that it's working. that will include droping 512 stack clamp.


* getting bpf_tail_call to work with bpf function calls. Was: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
  2020-05-13 11:58     ` Maciej Fijalkowski
@ 2020-05-17  4:32       ` Alexei Starovoitov
  2020-05-18 18:44         ` Maciej Fijalkowski
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2020-05-17  4:32 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson,
	lmb, john.fastabend

On Wed, May 13, 2020 at 01:58:55PM +0200, Maciej Fijalkowski wrote:
> 
> So to me, if we would like to get rid of maxing out stack space, then we
> would have to do some dancing for preserving the tail call counter - keep
> it in some unused register? Or epilogue would pop it from stack to some
> register and target program's prologue would push it to stack from that
> register (I am making this up probably). And rbp/rsp would need to be
> created/destroyed during the program-to-program transition that happens
> via tailcall. That would mean also more instructions.

How about the following:
The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15
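
For concreteness, a rough sketch of how the JIT could emit such a
prologue, in the existing EMIT style (purely illustrative, not an actual
patch; callee_regs_used[] comes from a scan of the program's
instructions):

        memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE); /* nop5 */
        prog += X86_PATCH_SIZE;
        EMIT2(0x31, 0xC0);       /* xor eax, eax (only if prog uses bpf_tail_call) */
        EMIT1(0x55);             /* push rbp */
        EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
        /* sub rsp, rounded_stack_depth */
        EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
        EMIT1(0x50);             /* push rax, i.e. the inherited tail_call_cnt */
        push_callee_regs(&prog, callee_regs_used);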

Then bpf_tail_call will pop the variable number of rbx,..
and a final 'pop rax',
then 'add rsp, size_of_current_stack_frame',
and jmp to the next function, skipping over 'nop5; xor eax,eax; push rbp; mov rbp, rsp'

This way the new function will set its own stack size and will init the
tail call counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there.
That's the only downside I see.
Any other ideas?


* Re: getting bpf_tail_call to work with bpf function calls. Was: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
  2020-05-17  4:32       ` getting bpf_tail_call to work with bpf function calls. Was: " Alexei Starovoitov
@ 2020-05-18 18:44         ` Maciej Fijalkowski
  2020-05-21  4:05           ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Maciej Fijalkowski @ 2020-05-18 18:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson,
	lmb, john.fastabend

On Sat, May 16, 2020 at 09:32:27PM -0700, Alexei Starovoitov wrote:
> On Wed, May 13, 2020 at 01:58:55PM +0200, Maciej Fijalkowski wrote:
> > 
> > So to me, if we would like to get rid of maxing out stack space, then we
> > would have to do some dancing for preserving the tail call counter - keep
> > it in some unused register? Or epilogue would pop it from stack to some
> > register and target program's prologue would push it to stack from that
> > register (I am making this up probably). And rbp/rsp would need to be
> > created/destroyed during the program-to-program transition that happens
> > via tailcall. That would mean also more instructions.
> 
> How about the following:
> The prologue will look like:
> nop5
> xor eax,eax  // two new bytes if bpf_tail_call() is used in this function
> push rbp
> mov rbp, rsp
> sub rsp, rounded_stack_depth
> push rax // zero init tail_call counter
> variable number of push rbx,r13,r14,r15
> 
> Then bpf_tail_call will pop variable number rbx,..
> and final 'pop rax'
> Then 'add rsp, size_of_current_stack_frame'
> jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp'
> 
> This way new function will set its own stack size and will init tail call
> counter with whatever value the parent had.
> 
> If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> Instead it would need to have 'nop2' in there.
> That's the only downside I see.
> Any other ideas?

Not really - I had a thought with Bjorn about using the one callee-saved
register that is still unused by the x64 JIT (%r12), and I was also
thinking about some freaky usage of an SSE register as a general purpose
one. However, your idea is pretty neat - I already gave it a shot, and
with a single tweak I managed to get it working, i.e. the selftests are
fine as well as two samples that utilize tail calls. Note also that I
got rid of the stack clamp being done in fixup_bpf_calls.

About the tweak:
- RETPOLINE_RAX_BPF_JIT, used for indirect tail calls, needed to become a
  RETPOLINE_RCX_BPF_JIT, so that we preserve the content of %rax across
  jumps between programs via tail calls. I looked up the GCC commit that
  Daniel quoted on the patch that implements RETPOLINE_RAX_BPF_JIT, and
  it said that for the register holding the address of the function we
  will be jumping to, we are free to use most of the GP registers. I
  picked %rcx.

I was also thinking about a minor optimization where we would replace
the add/sub %rsp, $off32 with a nop7 if the stack depth is 0.
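
Something along these lines is what I have in mind (illustrative only; a
disp32 nop has the same 7-byte length as the 'sub rsp, imm32' encoding):

        if (!stack_depth)
                EMIT3_off32(0x0F, 0x1F, 0x80, 0); /* nopl 0x0(%rax), 7 bytes */
        else
                /* sub rsp, rounded_stack_depth */
                EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));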

About the way forward - I reached out to Bjorn to cooperate on providing
a benchmark for measuring the impact of the new tail call handling, as
well as providing proof, in the form of selftests, that bpf2bpf works
together with tail calls.

As for the benchmark, we think that having tests for the best and worst
cases would tell us what is going on. So:
- have a main program that does not use any of the callee-saved registers
  and tail calls into another program that also does not use any of R6-R9;
- have the same flow, but with both programs using R6, R7, R8 and R9; the
  main program needs to use them because we will be popping these
  registers before the tail call and the target program will be doing the
  pushes.

Daniel, John, is there some Cilium benchmark that we could incorporate?
I don't think we'd be able to come up with a program that would mimic
what you previously described, e.g. 6 static jumps where every program
utilizes every callee-saved register. Any help/pointers on how we should
approach it would be very much appreciated.

Does that sound like a plan, overall?

Thank you,
Maciej


* Re: getting bpf_tail_call to work with bpf function calls. Was: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
  2020-05-18 18:44         ` Maciej Fijalkowski
@ 2020-05-21  4:05           ` Alexei Starovoitov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2020-05-21  4:05 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson,
	lmb, john.fastabend

On Mon, May 18, 2020 at 08:44:58PM +0200, Maciej Fijalkowski wrote:
> On Sat, May 16, 2020 at 09:32:27PM -0700, Alexei Starovoitov wrote:
> > On Wed, May 13, 2020 at 01:58:55PM +0200, Maciej Fijalkowski wrote:
> > > 
> > > So to me, if we would like to get rid of maxing out stack space, then we
> > > would have to do some dancing for preserving the tail call counter - keep
> > > it in some unused register? Or epilogue would pop it from stack to some
> > > register and target program's prologue would push it to stack from that
> > > register (I am making this up probably). And rbp/rsp would need to be
> > > created/destroyed during the program-to-program transition that happens
> > > via tailcall. That would mean also more instructions.
> > 
> > How about the following:
> > The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> > 
> > Then bpf_tail_call will pop variable number rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp'
> > 
> > This way new function will set its own stack size and will init tail call
> > counter with whatever value the parent had.
> > 
> > If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> > Instead it would need to have 'nop2' in there.
> > That's the only downside I see.
> > Any other ideas?
> 
> Not really - had a thought with Bjorn about using one callee-saved
> register that is yet unused by x64 JIT (%r12) and i was also thinking

People keep trying to use r12 for all sorts of things, but I'd like
to keep it reserved. I hope we can add R11 to the BPF ISA one day.

> about some freaky usage of SSE register as a general purpose one. However,
> your idea is pretty neat - I gave it already a shot and with a single
> tweak I managed to got it working, e.g. selftests are fine as well as two
> samples that utilize tail calls. Note also that I got rid of the stack
> clamp being done in fixup_bpf_calls.
> 
> About a tweak:
> - RETPOLINE_RAX_BPF_JIT used for indirect tail calls needed to become a
>   RETPOLINE_RCX_BPF_JIT, so that we preserve the content of %rax across
>   jumping between programs via tail calls. I looked up GCC commit that
>   Daniel quoted on a patch that implements RETPOLINE_RAX_BPF_JIT and it
>   said that for register that is holding the address of function that we
>   will be jumping onto, we are free to use most of GP registers. I picked
>   %rcx.

Good catch. Indeed. We have to use rcx for retpoline.
rdi/rsi/rdx are used to pass args and bpf_tail_call() doesn't have a
4th argument.
r8 could have been used, but it would take more bytes to encode,
so imo rcx is the only choice.

> I was also thinking about a minor optimization where we would replace the
> add/sub %rsp, $off32 with a nop7 if stack depth is 0.

Why are leaf functions special?
I've been thinking about it as well, but trampoline fentry/fexit can
be attached to bpf progs too and it would unnecessarily complicate
calling the original.
So I've discarded the nop7 idea.

Instead I was thinking of adding a useless two-byte prefix to
either 'push rbp' or 'mov rbp, rsp'.
Like 0x66, so from the cpu uops point of view it will stay a single uop
to execute in the ooo pipeline, whereas nop2 is a separate uop.
But it's not clear whether decoder performance will be better with a
separate nop2, or whether 0x66 will add unnecessary stress on it, e.g.
by pushing it into a 'complex' opcode that can be handled by only one
exclusive decoder vs a 'simple' opcode that multiple decoders process in
parallel. Some microbenchmarking is needed, though I'm not sure that the
time spent on such micro performance analysis is worth it :)
So imo nop2 is good enough.

> About a way forward - I reached out to Bjorn to co-operate on providing
> the benchmark for measuring the impact of new tail call handling as well
> as providing a proof in a form of selftests that bpf2bpf is working
> together with tail calls.
> 
> About a benchmark, we think that having tests for best and worst cases
> would tell us what is going on. So:
> - have a main program that is not using any of callee registers that will
>   be tailcalling onto another program that is also not using any of R6-R9.

I don't think such a micro benchmark will be realistic. You can probably
craft one in assembler, but when there is a function call and 'ctx' is
passed into that function, llvm will use R6 at least.

> - have the same flow but both programs will be using R6, R7, R8, R9; main
>   program needs to use them because we will be popping these registers
>   before the tail call and target program will be doing pushes.

I would create a benchmark out of progs/tailcall[1-5].c and progs/bpf_flow.c
I think it will be good enough to assess performance before/after.

> Daniel, John, is there some Cilium benchmark that we could incorporate? I
> don't think we be able to come up with a program that would mimic what you
> have previously described, e.g. 6 static jumps where every program would
> be utilizing every callee-saved register. Any help/pointers on how should
> we approach it would be very appreciated.

progs/bpf_flow.c is pretty much such an example, or am I missing something?
Every tail_call jump there uses r6 and r7 at least.
But the bodies of the functions are not empty, so the more registers
that are used, the more work the function is doing, and the less
noticeable the overhead of the new tail_call will be.


