bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	ast@kernel.org, bpf@vger.kernel.org, netdev@vger.kernel.org,
	bjorn.topel@intel.com, magnus.karlsson@intel.com,
	lmb@cloudflare.com, john.fastabend@gmail.com
Subject: Re: getting bpf_tail_call to work with bpf function calls. Was: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
Date: Mon, 18 May 2020 20:44:58 +0200	[thread overview]
Message-ID: <20200518184458.GC6472@ranger.igk.intel.com> (raw)
In-Reply-To: <20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com>

On Sat, May 16, 2020 at 09:32:27PM -0700, Alexei Starovoitov wrote:
> On Wed, May 13, 2020 at 01:58:55PM +0200, Maciej Fijalkowski wrote:
> > 
> > So to me, if we would like to get rid of maxing out stack space, then we
> > would have to do some dancing for preserving the tail call counter - keep
> > it in some unused register? Or epilogue would pop it from stack to some
> > register and target program's prologue would push it to stack from that
> > register (I am making this up probably). And rbp/rsp would need to be
> > created/destroyed during the program-to-program transition that happens
> > via tailcall. That would mean also more instructions.
> 
> How about the following:
> The prologue will look like:
> nop5
> xor eax,eax  // two new bytes if bpf_tail_call() is used in this function
> push rbp
> mov rbp, rsp
> sub rsp, rounded_stack_depth
> push rax // zero init tail_call counter
> variable number of push rbx,r13,r14,r15
> 
> Then bpf_tail_call will pop variable number rbx,..
> and final 'pop rax'
> Then 'add rsp, size_of_current_stack_frame'
> jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp'
> 
> This way new function will set its own stack size and will init tail call
> counter with whatever value the parent had.
> 
> If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> Instead it would need to have 'nop2' in there.
> That's the only downside I see.
> Any other ideas?

Not really - had a thought with Bjorn about using one callee-saved
register that is yet unused by x64 JIT (%r12) and i was also thinking
about some freaky usage of SSE register as a general purpose one. However,
your idea is pretty neat - I gave it already a shot and with a single
tweak I managed to got it working, e.g. selftests are fine as well as two
samples that utilize tail calls. Note also that I got rid of the stack
clamp being done in fixup_bpf_calls.

About a tweak:
- RETPOLINE_RAX_BPF_JIT used for indirect tail calls needed to become a
  RETPOLINE_RCX_BPF_JIT, so that we preserve the content of %rax across
  jumping between programs via tail calls. I looked up GCC commit that
  Daniel quoted on a patch that implements RETPOLINE_RAX_BPF_JIT and it
  said that for register that is holding the address of function that we
  will be jumping onto, we are free to use most of GP registers. I picked
  %rcx.

I was also thinking about a minor optimization where we would replace the
add/sub %rsp, $off32 with a nop7 if stack depth is 0.

About a way forward - I reached out to Bjorn to co-operate on providing
the benchmark for measuring the impact of new tail call handling as well
as providing a proof in a form of selftests that bpf2bpf is working
together with tail calls.

About a benchmark, we think that having tests for best and worst cases
would tell us what is going on. So:
- have a main program that is not using any of callee registers that will
  be tailcalling onto another program that is also not using any of R6-R9.
- have the same flow but both programs will be using R6, R7, R8, R9; main
  program needs to use them because we will be popping these registers
  before the tail call and target program will be doing pushes.

Daniel, John, is there some Cilium benchmark that we could incorporate? I
don't think we be able to come up with a program that would mimic what you
have previously described, e.g. 6 static jumps where every program would
be utilizing every callee-saved register. Any help/pointers on how should
we approach it would be very appreciated.

Does that sound like a plan, overall?

Thank you,
Maciej

  reply	other threads:[~2020-05-18 18:48 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-11 14:39 [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation Maciej Fijalkowski
2020-05-11 14:39 ` [RFC PATCH bpf-next 1/1] " Maciej Fijalkowski
2020-05-11 20:05 ` [RFC PATCH bpf-next 0/1] " Daniel Borkmann
2020-05-12  0:01   ` Alexei Starovoitov
2020-05-13 11:58     ` Maciej Fijalkowski
2020-05-17  4:32       ` getting bpf_tail_call to work with bpf function calls. Was: " Alexei Starovoitov
2020-05-18 18:44         ` Maciej Fijalkowski [this message]
2020-05-21  4:05           ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200518184458.GC6472@ranger.igk.intel.com \
    --to=maciej.fijalkowski@intel.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=ast@kernel.org \
    --cc=bjorn.topel@intel.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=john.fastabend@gmail.com \
    --cc=lmb@cloudflare.com \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).