BPF Archive on lore.kernel.org
 help / color / Atom feed
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>,
	ast@kernel.org, bpf@vger.kernel.org, netdev@vger.kernel.org,
	bjorn.topel@intel.com, magnus.karlsson@intel.com,
	lmb@cloudflare.com, john.fastabend@gmail.com
Subject: Re: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation
Date: Mon, 11 May 2020 17:01:53 -0700
Message-ID: <20200512000153.hfdeh653v533qbe6@ast-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <2e3c6be0-e482-d856-7cc1-b1d03a26428e@iogearbox.net>

On Mon, May 11, 2020 at 10:05:25PM +0200, Daniel Borkmann wrote:
> Hey Maciej,
> 
> On 5/11/20 4:39 PM, Maciej Fijalkowski wrote:
> > Hi!
> > 
> > Today, BPF x86-64 JIT is preserving all of the callee-saved registers
> > for each BPF program being JITed, even when none of the R6-R9 registers
> > are used by the BPF program. Furthermore the tail call counter is always
> > pushed/popped to/from the stack even when there is no tail call usage in
> > BPF program being JITed. Optimization can be introduced that would
> > detect the usage of R6-R9 and based on that push/pop to/from the stack
> > only what is needed. Same goes for tail call counter.
> > 
> > Results look promising for such instruction reduction. Below are the
> > numbers for xdp1 sample on FVL 40G NIC receiving traffic from pktgen:
> > 
> > * With optimization: 22.3 Mpps
> > * Without:           19.0 mpps
> > 
> > So it's around 15% of performance improvement. Note that xdp1 is not
> > using any of callee saved registers, nor the tail call, hence such
> > speed-up.
> > 
> > There is one detail that needs to be handled though.
> > 
> > Currently, x86-64 JIT tail call implementation is skipping the prologue
> > of target BPF program that has constant size. With the mentioned
> > optimization implemented, each particular BPF program that might be
> > inserted onto the prog array map and therefore be the target of tail
> > call, could have various prologue size.
> > 
> > Let's have some pseudo-code example:
> > 
> > func1:
> > pro
> > code
> > epi
> > 
> > func2:
> > pro
> > code'
> > epi
> > 
> > func3:
> > pro
> > code''
> > epi
> > 
> > Today, pro and epi are always the same (9/7) instructions. So a tail
> > call from func1 to func2 is just a:
> > 
> > jump func2 + sizeof pro in bytes (PROLOGUE_SIZE)
> > 
> > With the optimization:
> > 
> > func1:
> > pro
> > code
> > epi
> > 
> > func2:
> > pro'
> > code'
> > epi'
> > 
> > func3:
> > pro''
> > code''
> > epi''
> > 
> > For making the tail calls up and running with the mentioned optimization
> > in place, x86-64 JIT should emit the pop registers instructions
> > that were pushed on prologue before the actual jump. Jump offset should
> > skip the instructions that are handling rbp/rsp, not the whole prologue.
> > 
> > A tail call within func1 would then need to be:
> > epi -> pop what pro pushed, but no leave/ret instructions
> > jump func2 + 16 // first push insn of pro'; if no push, then this would
> >                  // a direct jump to code'
> > 
> > Magic value of 16 comes from count of bytes that represent instructions
> > that are skipped:
> > 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > 55                      push   %rbp
> > 48 89 e5                mov    %rsp,%rbp
> > 48 81 ec 08 00 00 00    sub    $0x8,%rsp
> > 
> > which would in many cases add *more* instructions for tailcalls. If none
> > of callee-saved registers are used, then there would be no overhead with
> > such optimization in place.
> > 
> > I'm not sure how to measure properly the impact on the BPF programs that
> > are utilizing tail calls. Any suggestions?
> 
> Right, so far the numbers above (no callee saved registers, no tail calls)
> are really the best case scenario. I think programs not using callee saved
> registers are probably very limited in what they do, and tail calls are often
> used as well (although good enough for AF_XDP, for example). So I wonder how
> far we would regress with callee saved registers and tail calls. For Cilium
> right now you can roughly assume a worst case tail call depth of ~6 with static
> jumps (that we patch to jmp/nop). Only in one case we have a tail call map
> index that is non-static. In terms of registers, assume all of them are used
> one way or another. If you could check the impact in such setting, that would
> be great.
> 
> > Daniel, Alexei, what is your view on this?
> 
> I think performance wise this would be both pro and con depending how tail
> calls are used. One upside however, and I think you didn't mention it here
> would be that we don't need to clamp used stack space to 512, so we could
> actually track how much stack is used (or if any is used at all) and adapt
> it between tail calls? Depending on the numbers, if we'd go that route, it
> should rather be generalized and tracked via verifier so all JITs can behave
> the same (and these workarounds in verifier lifted). But then 15% performance
> improvement as you state above is a lot, probably we might regress at least
> as much as well in your benchmark. I wonder whether there should be a knob
> for it, though it's mainly implementation detail..

I was thinking about knob too, but users are rarely going to touch it,
so if we go for opt-in it will mostly be unused except by few folks.
So I think it's better to go with this approach unconditionally,
but first I'd like to see the performance numbers in how it regresses
the common case. AF_XDP's empty prog that does 'return XDP_PASS' is
a rare case and imo not worth optimizing for, but I see a lot of value
if this approach allows to lift tail_call vs bpf2bpf restriction.
It looks to me that bpf2bpf will be able to work. prog_A will call into prog_B
and if that prog does any kind of tail_call that tail_call will
eventually finish and the execution will return to prog_A as normal.
So I'd like to ask for two things:
1. perf numbers for something like progs/bpf_flow.c before and after
2. removal of bpf2bpf vs tail_call restriction and new selftests to prove
that it's working. that will include droping 512 stack clamp.

  reply index

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-11 14:39 Maciej Fijalkowski
2020-05-11 14:39 ` [RFC PATCH bpf-next 1/1] " Maciej Fijalkowski
2020-05-11 20:05 ` [RFC PATCH bpf-next 0/1] " Daniel Borkmann
2020-05-12  0:01   ` Alexei Starovoitov [this message]
2020-05-13 11:58     ` Maciej Fijalkowski
2020-05-17  4:32       ` getting bpf_tail_call to work with bpf function calls. Was: " Alexei Starovoitov
2020-05-18 18:44         ` Maciej Fijalkowski
2020-05-21  4:05           ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200512000153.hfdeh653v533qbe6@ast-mbp.dhcp.thefacebook.com \
    --to=alexei.starovoitov@gmail.com \
    --cc=ast@kernel.org \
    --cc=bjorn.topel@intel.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=john.fastabend@gmail.com \
    --cc=lmb@cloudflare.com \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git