From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: "Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	bpf <bpf@vger.kernel.org>,
	"Network Development" <netdev@vger.kernel.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>
Subject: Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
Date: Fri, 10 Jul 2020 20:20:29 -0700	[thread overview]
Message-ID: <CAADnVQJhhQnjQdrQgMCsx2EDDwELkCvY7Zpfdi_SJUmH6VzZYw@mail.gmail.com> (raw)
In-Reply-To: <20200710235632.lhn6edwf4a2l3kiz@ast-mbp.dhcp.thefacebook.com>

On Fri, Jul 10, 2020 at 4:56 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 02, 2020 at 03:49:29PM +0200, Maciej Fijalkowski wrote:
> > This commit serves two purposes:
> > 1) it optimizes BPF prologue/epilogue generation
> > 2) it makes it possible to have tailcalls within BPF subprograms
> >
> > Both points are related to each other since without 1), 2) could not be
> > achieved.
> >
> > In [1], Alexei says:
> > "The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this
> >              // function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> >
> > Then bpf_tail_call will pop a variable number of rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
> > rbp, rsp'
> >
> > This way the new function will set its own stack size and will init the
> > tail call counter with whatever value the parent had.
> >
> > If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> > Instead it would need to have 'nop2' in there."
> >
> > Implement that suggestion.
> >
> > Since the layout of the stack is changed, tail call counter handling can
> > no longer rely on popping it to %rbx, as was done for the constant
> > prologue case, and later overwriting %rbx with the actual value of %rbx
> > pushed to the stack. Therefore, let's use one of the registers that is
> > considered volatile/caller-saved (%rcx) and pop the value of the tail
> > call counter into it in the epilogue.
> >
> > Drop the BUILD_BUG_ON in emit_prologue and in
> > emit_bpf_tail_call_indirect, where the instruction layout is no longer
> > constant.
> >
> > Introduce a new poke target, 'ip_aux', in the poke descriptor that is dedicated
>
> imo the ip_aux approach has too much x86-specific code in kernel/bpf/arraymap.c.
> E.g. NOP_ATOMIC5 is x86 only and will break the build on all other archs.
>
> But I'm not sure ip_aux is really necessary.
> It's nice to optimize the case when the tail_call target is NULL, but the
> redundant unwind + nop5 + push_regs_again makes for a much simpler design
> without worrying about state transitions.
>
> So I don't think optimizing the case of target==NULL is really worth the complexity.
>
> > for skipping the register pops and stack unwind that are generated right
> > before the actual jump to the target program.
> > In the case when the target program is not present, the BPF program will
> > skip the pop instructions and the nop5 dedicated for 'jmpq $target'. An
> > example of such a state, when only R6 of the callee-saved registers is
> > used by the program:
> >
> > ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
> > ffffffffc0513aa6:       5b                      pop    %rbx
> > ffffffffc0513aa7:       58                      pop    %rax
> > ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> > ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
>
> so this last rbx->rdi insn is not part of bpf_tail_call insn, right?
> That is just 'R1 = R6;' bpf insn jited.
>
> >
> > When the target program is inserted, the jump that was there to skip the
> > pops/nop5 will become a nop5, so the CPU will go through the pops and do
> > the actual tailcall.
> >
> > One might ask: why can't there simply be pushes after the nop5?
>
> exactly... and...
>
> > In the following example snippet:
> >
> > ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
> > (...)
> > ffffffffc0370332:       5b                      pop    %rbx
> > ffffffffc0370333:       58                      pop    %rax
> > ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> > ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
> > ffffffffc0370347:       50                      push   %rax
> > ffffffffc0370348:       53                      push   %rbx
> > ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
> > ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
> >
> > There is a bpf2bpf call right after the tailcall and the jump target is
> > not present. ctx is in %rbx and the BPF subprogram that we will call into
> > at ffffffffc037034c relies on it, e.g. it will pick up ctx from there.
> > Such a code layout is therefore broken, as we would overwrite the
> > contents of %rbx with the value that was pushed in the prologue.
>
> I don't understand above explanation.
> Are you saying 'callq  0xffffffffc0372548' above is a call to bpf subprogram?
> The 'mov %rbx,%rdi' was 'R1 = R6' before JIT.
> The code is storing ctx into R1 to pass into bpf subprogram.
> It's not part of proposed emit_bpf_tail_call_direct() handling.
> It's part of BPF program.
> I don't see what breaks.
>
> The new emit_bpf_tail_call_indirect() looks correct to me.
>
> But emit_bpf_tail_call_direct() doesn't need
> + emit_jump(&prog, (u8 *)poke->ip + X86_PATCH_SIZE, poke->ip_aux);
> and messy poke->ip_aux.
>
> It can do:
> pop_callee_regs()
> memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
> push_callee_regs()
>
> When the target is NULL the pop/push pairs are overall a nop.
> They don't affect correctness.
> When prog_array_map_poke_run() is called it will replace a nop5
> with a jump. So still all good.
>
> Yes, there will be a tiny overhead when the tail_call target is NULL,
> but x86 will execute a pop/push pair in _one_ cpu cycle.
> As far as I recall x86 hardware has special logic to recognize
> such push/pop sequences so they are really fast.
>
> What am I missing?

Of course you are right.
pop+nop+push is incorrect.

How about the following instead:
- during JIT:
emit_jump(to_skip_below)  <- poke->tailcall_bypass
pop_callee_regs
emit_jump(to_tailcall_target) <- poke->tailcall_target

- Transition from one target to another:
text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
if (new_jmp != NULL)
  text_poke(poke->tailcall_bypass, MOD jmp into nop);
else
  text_poke(poke->tailcall_bypass, MOD nop into jmp);
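
Roughly, in C, the arraymap side could then look like this (just a sketch,
not the actual patch; the tailcall_target/tailcall_bypass/bypass_addr field
names are made up here, and it reuses the existing bpf_arch_text_poke(),
where a NULL old/new address stands for "nop"):

#include <linux/bpf.h>
#include <linux/filter.h>

/* Sketch only. */
static void tailcall_poke(struct bpf_jit_poke_descriptor *poke,
			  struct bpf_prog *old_prog, struct bpf_prog *new_prog)
{
	void *old_ip = old_prog ? (void *)old_prog->bpf_func : NULL;
	void *new_ip = new_prog ? (void *)new_prog->bpf_func : NULL;

	if (new_ip) {
		/* NULL -> prog or prog -> prog: retarget the tailcall jump
		 * first, then (for NULL -> prog) turn the bypass jump into
		 * a nop so the pops and the jump are actually reached.
		 */
		bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
				   old_ip, new_ip);
		if (!old_ip)
			bpf_arch_text_poke(poke->tailcall_bypass, BPF_MOD_JUMP,
					   poke->bypass_addr, NULL);
	} else if (old_ip) {
		/* prog -> NULL: re-arm the bypass jump first, wait for CPUs
		 * that are already past it, then nop out the tailcall jump.
		 */
		bpf_arch_text_poke(poke->tailcall_bypass, BPF_MOD_JUMP,
				   NULL, poke->bypass_addr);
		synchronize_rcu();
		bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
				   old_ip, NULL);
	}
}

Ordered like that, at least one of the two jumps is live at any point in
time, so nobody can fall through the pops with a stale or missing target.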

In other words, let's keep jmp as always valid, so the race
you've described in the cover letter won't ever happen.

> kernel/bpf/arraymap.c will stay arch-independent too.

Thoughts?

Thread overview: 15+ messages
2020-07-02 13:49 [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
2020-07-02 13:49 ` [RFC PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
2020-07-02 13:49 ` [RFC PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
2020-07-02 13:49 ` [RFC PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
2020-07-02 13:49 ` [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
2020-07-10 23:56   ` Alexei Starovoitov
2020-07-11  3:20     ` Alexei Starovoitov [this message]
2020-07-11  3:25       ` Alexei Starovoitov
2020-07-14  1:00         ` Maciej Fijalkowski
2020-07-14  3:36           ` Alexei Starovoitov
2020-07-14 20:50             ` Maciej Fijalkowski
2020-07-14 22:34               ` Alexei Starovoitov
2020-07-02 13:49 ` [RFC PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
2020-07-11  0:10 ` [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Alexei Starovoitov
2020-07-14  0:22   ` Maciej Fijalkowski
