From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: ast@kernel.org, bpf@vger.kernel.org, netdev@vger.kernel.org,
	bjorn.topel@intel.com, magnus.karlsson@intel.com
Subject: Re: [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
Date: Fri, 17 Jul 2020 12:52:55 +0200	[thread overview]
Message-ID: <20200717105255.GA11239@ranger.igk.intel.com> (raw)
In-Reply-To: <932141f5-7abb-1c01-111d-a64baf187a40@iogearbox.net>

On Fri, Jul 17, 2020 at 01:06:07AM +0200, Daniel Borkmann wrote:
> On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> > This commit serves two purposes:
> > 1) it optimizes BPF prologue/epilogue generation
> > 2) it makes it possible to have tailcalls within BPF subprograms
> > 
> > Both points are related to each other, since without 1), 2) could not
> > be achieved.
> > 
> > In [1], Alexei says:
> > "The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this
> >               // function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> > 
> > Then bpf_tail_call will pop variable number rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
> > rbp, rsp'
> > 
> > This way new function will set its own stack size and will init tail
> > call
> > counter with whatever value the parent had.
> > 
> > If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> > Instead it would need to have 'nop2' in there."
> > 
> > Implement that suggestion.
> > 
> > Since the stack layout has changed, tail call counter handling can no
> > longer rely on popping it to rbx, as was done for the constant
> > prologue case, where rbx was later overwritten with the actual value
> > of rbx pushed to the stack. Therefore, let's use one of the registers
> > (%rcx) that is considered volatile/caller-saved and pop the tail call
> > counter into it in the epilogue.
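
For reference, a minimal sketch of what this means for the emitted
epilogue (pop_callee_regs() is an assumed helper name for the variable
rbx/r13-r15 pops; not the exact byte stream of this patch):

	pop_callee_regs(&prog, callee_regs_used); /* assumed helper: variable pops */
	EMIT1(0x59);             /* pop rcx - scratch the tail_call_cnt slot */
	EMIT1(0xC9);             /* leave */
	EMIT1(0xC3);             /* ret */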
> > 
> > Drop the BUILD_BUG_ON in emit_prologue and in
> > emit_bpf_tail_call_indirect, where the instruction layout is not
> > constant anymore.
> > 
> > Introduce a new poke target, 'tailcall_bypass', to the poke
> > descriptor; it is dedicated to skipping the register pops and stack
> > unwind that are generated right before the actual jump to the target
> > program. Also reflect the actual purpose of poke->ip and rename it to
> > poke->tailcall_target so that it will not be confused with the poke
> > target being introduced here.
> > For the case when the target program is not present, the BPF program
> > will skip the pop instructions and the nop5 dedicated for jmpq $target.
> > An example of such a state, when only R6 of the callee-saved registers
> > is used by the program:
> > 
> > ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
> > ffffffffc0513aa6:       5b                      pop    %rbx
> > ffffffffc0513aa7:       58                      pop    %rax
> > ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> > ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
> > 
> > When the target program is inserted, the jump that was there to skip
> > the pops/nop5 becomes a nop5 itself, so the CPU will fall through the
> > pops and do the actual tailcall.
> > 
> > One might ask why the pushes cannot simply be placed after the nop5.
> > Consider the following example snippet:
> > 
> > ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
> > (...)
> > ffffffffc0370332:       5b                      pop    %rbx
> > ffffffffc0370333:       58                      pop    %rax
> > ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> > ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
> > ffffffffc0370347:       50                      push   %rax
> > ffffffffc0370348:       53                      push   %rbx
> > ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
> > ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
> > 
> > There is a bpf2bpf call (at ffffffffc037034c) right after the
> > tailcall, and the jump target is not present. ctx is in the %rbx
> > register and the BPF subprogram that we call into at ffffffffc037034c
> > relies on it, e.g. it will pick up ctx from there. Such a code layout
> > is therefore broken, as we would overwrite the contents of %rbx with
> > the value that was pushed in the prologue. That is the reason for the
> > 'bypass' approach.
> > 
> > Special care needs to be taken during the install/update/remove of
> > the tailcall target. In the case when the target program is not
> > present, the CPU must not execute the pop instructions that precede
> > the tailcall.
> > 
> > To address that, the following states can be defined:
> > A nop, unwind, nop
> > B nop, unwind, tail
> > C skip, unwind, nop
> > D skip, unwind, tail
> > 
> > A is forbidden (it leads to incorrectness). The state transitions between
> > tailcall install/update/remove will work as follows:
> > 
> > First install tail call f: C->D->B(f)
> >   * poke the tailcall, after that get rid of the skip
> > Update tail call f to f': B(f)->B(f')
> >   * poke the tailcall (poke->tailcall_target) and do NOT touch the
> >     poke->tailcall_bypass
> > Remove tail call: B(f')->C(f')
> >   * poke->tailcall_bypass is poked back to a jump, then we wait for an
> >     RCU grace period so that other programs will finish their execution,
> >     and after that we are safe to remove the poke->tailcall_target
> > Install new tail call (f''): C(f')->D(f'')->B(f'').
> >   * same as first step
> > 
> > This way the CPU can never be exposed to the "unwind, tail" state.
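
Expressed against bpf_arch_text_poke(), the first install (C->D->B(f))
boils down to roughly the following sketch (the full handling is in
prog_array_map_poke_run() further below):

	/* C -> D: patch in the target while the bypass jump is still live */
	ret = bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
				 NULL, new_addr);
	BUG_ON(ret < 0 && ret != -EINVAL);
	/* D -> B(f): only now turn the bypass jump into a nop5 */
	ret = bpf_arch_text_poke(poke->tailcall_bypass, BPF_MOD_JUMP,
				 bypass_addr, NULL);
	BUG_ON(ret < 0 && ret != -EINVAL);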
> > 
> > For regression checks, 'tailcalls' kselftest was executed:
> > $ sudo ./test_progs -t tailcalls
> >   #64/1 tailcall_1:OK
> >   #64/2 tailcall_2:OK
> >   #64/3 tailcall_3:OK
> >   #64/4 tailcall_4:OK
> >   #64/5 tailcall_5:OK
> >   #64 tailcalls:OK
> > Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > Tail call related cases from the test_verifier kselftest are also
> > working fine. Sample BPF programs that utilize tail calls (sockex3,
> > tracex5) work properly as well.
> > 
> > [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
> > 
> > Suggested-by: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> Overall approach looks reasonable to me. The patch here could still be cleaned up a
> bit further, still very rough. Just minor comments below:

Thank you for spotting all of the issues. I will provide a v2 on Monday,
as I will be out of reach from today until Sunday.

> 
> > ---
> >   arch/x86/net/bpf_jit_comp.c | 241 +++++++++++++++++++++++++++---------
> >   include/linux/bpf.h         |   8 +-
> >   kernel/bpf/arraymap.c       |  61 +++++++--
> >   kernel/bpf/core.c           |   3 +-
> >   4 files changed, 239 insertions(+), 74 deletions(-)
> > 
> [...]
> >   /*
> > - * Emit x86-64 prologue code for BPF program and check its size.
> > + * Emit x86-64 prologue code for BPF program.
> >    * bpf_tail_call helper will skip it while jumping into another program
> >    */
> > -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> > +static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> > +			  bool tail_call)
> >   {
> >   	u8 *prog = *pprog;
> >   	int cnt = X86_PATCH_SIZE;
> > @@ -238,19 +269,16 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> >   	 */
> >   	memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
> >   	prog += cnt;
> > +	if (!ebpf_from_cbpf && tail_call)
> > +		EMIT2(0x31, 0xC0);       /* xor eax, eax */
> > +	else
> > +		EMIT2(0x66, 0x90);       /* nop2 */
> 
> nit: Why does the ebpf_from_cbpf need the extra nop?

Good catch, it doesn't need it :)
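
So for v2 the hunk above can presumably be reduced to something like
(a sketch; cBPF-converted programs would then emit neither instruction):

	if (!ebpf_from_cbpf) {
		if (tail_call)
			EMIT2(0x31, 0xC0); /* xor eax, eax */
		else
			EMIT2(0x66, 0x90); /* nop2 */
	}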

> 
> >   	EMIT1(0x55);             /* push rbp */
> >   	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
> >   	/* sub rsp, rounded_stack_depth */
> >   	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
> > -	EMIT1(0x53);             /* push rbx */
> > -	EMIT2(0x41, 0x55);       /* push r13 */
> > -	EMIT2(0x41, 0x56);       /* push r14 */
> > -	EMIT2(0x41, 0x57);       /* push r15 */
> > -	if (!ebpf_from_cbpf) {
> > -		/* zero init tail_call_cnt */
> > -		EMIT2(0x6a, 0x00);
> > -		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> > -	}
> > +	if (!ebpf_from_cbpf && tail_call)
> > +		EMIT1(0x50);         /* push rax */
> >   	*pprog = prog;
> >   }
> [...]
> > -static void emit_bpf_tail_call_indirect(u8 **pprog)
> > +static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> > +					u32 stack_depth)
> >   {
> >   	u8 *prog = *pprog;
> > -	int label1, label2, label3;
> > +	int pop_bytes = 0;
> > +	int off1 = 49;
> > +	int off2 = 38;
> > +	int off3 = 16;
> >   	int cnt = 0;
> > +	/* count the additional bytes used for popping callee regs from stack
> > +	 * that need to be taken into account for each of the offsets that
> > +	 * are used for bailing out of the tail call
> > +	 */
> > +	pop_bytes = get_pop_bytes(callee_regs_used);
> > +	off1 += pop_bytes;
> > +	off2 += pop_bytes;
> > +	off3 += pop_bytes;
> > +
> >   	/*
> >   	 * rdi - pointer to ctx
> >   	 * rsi - pointer to bpf_array
> > @@ -370,72 +427,108 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
> >   	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
> >   	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
> >   	      offsetof(struct bpf_array, map.max_entries));
> > -#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
> > +#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
> 
> The whole rename belongs into the first patch to avoid breaking bisectability
> as mentioned.

Ack.

> 
> >   	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
> > -	label1 = cnt;
> >   	/*
> >   	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
> >   	 *	goto out;
> >   	 */
> > -	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
> > +	EMIT2_off32(0x8B, 0x85                    /* mov eax, dword ptr [rbp - (4 + sd)] */,
> > +		    -4 - round_up(stack_depth, 8));
> >   	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> > -#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
> > +#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
> >   	EMIT2(X86_JA, OFFSET2);                   /* ja out */
> > -	label2 = cnt;
> >   	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> > -	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
> > +	EMIT2_off32(0x89, 0x85,                   /* mov dword ptr [rbp - (4 + sd)], eax */
> > +		    -4 - round_up(stack_depth, 8));
> 
> nit: should probably sit in a var

Sure.
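
E.g. something along the lines of (sketch):

	int tcc_off = -4 - round_up(stack_depth, 8);

	EMIT2_off32(0x8B, 0x85, tcc_off);     /* mov eax, dword ptr [rbp + tcc_off] */
	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */
	EMIT2(X86_JA, OFFSET2);               /* ja out */
	EMIT3(0x83, 0xC0, 0x01);              /* add eax, 1 */
	EMIT2_off32(0x89, 0x85, tcc_off);     /* mov dword ptr [rbp + tcc_off], eax */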

> 
> >   	/* prog = array->ptrs[index]; */
> > -	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
> > +	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,        /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> >   		    offsetof(struct bpf_array, ptrs));
> >   	/*
> >   	 * if (prog == NULL)
> >   	 *	goto out;
> >   	 */
> > -	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
> > -#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
> > -	EMIT2(X86_JE, OFFSET3);                   /* je out */
> > -	label3 = cnt;
> > +	EMIT3(0x48, 0x85, 0xC9);                   /* test rcx,rcx */
> > +#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE)
> > +	EMIT2(X86_JE, OFFSET3);                    /* je out */
> [...]
> 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index c67c88ad35f8..38897b9c7d61 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -651,14 +651,15 @@ enum bpf_jit_poke_reason {
> >   /* Descriptor of pokes pointing /into/ the JITed image. */
> >   struct bpf_jit_poke_descriptor {
> > -	void *ip;
> > +	void *tailcall_target;
> > +	void *tailcall_bypass;
> >   	union {
> >   		struct {
> >   			struct bpf_map *map;
> >   			u32 key;
> >   		} tail_call;
> >   	};
> > -	bool ip_stable;
> > +	bool tailcall_target_stable;
> 
> Probably makes sense to split off the pure rename into a separate patch to
> reduce this one slightly.

I was thinking of that as well. I will pull out as you're suggesting.

> 
> >   	u8 adj_off;
> >   	u16 reason;
> >   };
> > @@ -1775,6 +1776,9 @@ enum bpf_text_poke_type {
> >   	BPF_MOD_JUMP,
> >   };
> > +/* Number of bytes emit_patch() needs to generate instructions */
> > +#define X86_PATCH_SIZE		5
> 
> nit: this is arch specific, so should not be exposed in here, neither in
> arraymap.c below

Okay, so I think that we should add another member to the poke
descriptor that will hold specifically the bypass address, so that we
wouldn't have to calculate it here. I think this extension should go
into this patch, whereas the renaming should go into a separate one.
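
Roughly like this (a sketch; 'bypass_addr' is just a suggested name and
the JIT would fill it in while finalizing the image):

	struct bpf_jit_poke_descriptor {
		void *tailcall_target;
		void *tailcall_bypass;
		void *bypass_addr;	/* tailcall_target + arch patch size */
		/* ... */
	};

prog_array_map_poke_run() could then use poke->bypass_addr directly and
X86_PATCH_SIZE would stay private to the x86 JIT.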

> 
> >   int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
> >   		       void *addr1, void *addr2);
> > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > index c66e8273fccd..d15729a3f46c 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >   				    struct bpf_prog *old,
> >   				    struct bpf_prog *new)
> >   {
> > +	u8 *bypass_addr, *old_addr, *new_addr;
> >   	struct prog_poke_elem *elem;
> >   	struct bpf_array_aux *aux;
> > @@ -770,12 +771,13 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >   			 *    there could be danger of use after free otherwise.
> >   			 * 2) Initially when we start tracking aux, the program
> >   			 *    is not JITed yet and also does not have a kallsyms
> > -			 *    entry. We skip these as poke->ip_stable is not
> > -			 *    active yet. The JIT will do the final fixup before
> > -			 *    setting it stable. The various poke->ip_stable are
> > -			 *    successively activated, so tail call updates can
> > -			 *    arrive from here while JIT is still finishing its
> > -			 *    final fixup for non-activated poke entries.
> > +			 *    entry. We skip these as poke->tailcall_target_stable
> > +			 *    is not active yet. The JIT will do the final fixup
> > +			 *    before setting it stable. The various
> > +			 *    poke->tailcall_target_stable are successively activated,
> > +			 *    so tail call updates can arrive from here while JIT
> > +			 *    is still finishing its final fixup for non-activated
> > +			 *    poke entries.
> >   			 * 3) On program teardown, the program's kallsym entry gets
> >   			 *    removed out of RCU callback, but we can only untrack
> >   			 *    from sleepable context, therefore bpf_arch_text_poke()
> > @@ -792,20 +794,53 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >   			 * 5) Any other error happening below from bpf_arch_text_poke()
> >   			 *    is a unexpected bug.
> >   			 */
> > -			if (!READ_ONCE(poke->ip_stable))
> > +			if (!READ_ONCE(poke->tailcall_target_stable))
> >   				continue;
> >   			if (poke->reason != BPF_POKE_REASON_TAIL_CALL)
> >   				continue;
> >   			if (poke->tail_call.map != map ||
> >   			    poke->tail_call.key != key)
> >   				continue;
> > +			/* protect against un-updated poke descriptors since
> > +			 * we could fill them from subprog and the same desc
> > +			 * is present on main's program poke tab
> > +			 */
> > +			if (!poke->tailcall_bypass || !poke->tailcall_target)
> > +				continue;
> 
> Can't we avoid copying these descriptors over to the subprog in the first place?

I think we can, but could we consider that as a follow-up?

> 
> > +			if (!old && !new)
> > +				continue;
> 
> Could we avoid this above but instead signal via bpf_arch_text_poke() that nothing
> had to be patched? Reason is that bpf_arch_text_poke() will still do the sanity
> check to make sure reality meets expectation wrt current insns (which is also
> why I didn't add this skip). In that case we could then just avoid the expensive
> synchronize_rcu().

I was even thinking of having such a check before walking through the
poke descriptors, so that's the opposite of what you suggest.

If you insist, I can play with this a bit on Monday, but I recall that
it was the only thing that was stopping Alexei's pseudo-code from being
fully functional (the nop->nop update).

> 
> > -			ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP,
> > -						 old ? (u8 *)old->bpf_func +
> > -						 poke->adj_off : NULL,
> > -						 new ? (u8 *)new->bpf_func +
> > -						 poke->adj_off : NULL);
> > -			BUG_ON(ret < 0 && ret != -EINVAL);
> > +			bypass_addr = (u8 *)poke->tailcall_target + X86_PATCH_SIZE;
> > +			old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
> > +			new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
> > +
> > +			if (new) {
> > +				ret = bpf_arch_text_poke(poke->tailcall_target,
> > +							 BPF_MOD_JUMP,
> > +							 old_addr, new_addr);
> > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > +				if (!old) {
> > +					ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > +								 BPF_MOD_JUMP,
> > +								 bypass_addr, NULL);
> > +					BUG_ON(ret < 0 && ret != -EINVAL);
> > +				}
> > +			} else {
> > +				ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > +							 BPF_MOD_JUMP,
> > +							 NULL, bypass_addr);
> > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > +				/* let other CPUs finish the execution of program
> > +				 * so that it will not be possible to expose them
> > +				 * to invalid nop, stack unwind, nop state
> > +				 */
> > +				synchronize_rcu();
> 
> Very heavyweight that we need to potentially call this /multiple/ times for just a
> /single/ map update under poke mutex even ... but agree it's needed here to avoid
> racing. :(
> 
> > +				ret = bpf_arch_text_poke(poke->tailcall_target,
> > +							 BPF_MOD_JUMP,
> > +							 old_addr, NULL);
> > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > +			}
> >   		}
> >   	}
> >   }

