From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=1NzR=62=vger.kernel.org=bpf-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-0.6 required=3.0 tests=DKIM_SIGNED,DKIM_VALID,
	DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
	HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS
	autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C5630C47255
	for <bpf@archiver.kernel.org>; Tue, 12 May 2020 00:01:58 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 91548206DB
	for <bpf@archiver.kernel.org>; Tue, 12 May 2020 00:01:58 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="assIV3zV"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1725854AbgELAB6 (ORCPT <rfc822;bpf@archiver.kernel.org>);
        Mon, 11 May 2020 20:01:58 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44180 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-FAIL-OK-FAIL)
        by vger.kernel.org with ESMTP id S1725836AbgELAB5 (ORCPT
        <rfc822;bpf@vger.kernel.org>); Mon, 11 May 2020 20:01:57 -0400
Received: from mail-pj1-x1043.google.com (mail-pj1-x1043.google.com [IPv6:2607:f8b0:4864:20::1043])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BBAB1C061A0C;
        Mon, 11 May 2020 17:01:57 -0700 (PDT)
Received: by mail-pj1-x1043.google.com with SMTP id mq3so8633736pjb.1;
        Mon, 11 May 2020 17:01:57 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=date:from:to:cc:subject:message-id:references:mime-version
         :content-disposition:in-reply-to;
        bh=9Svr3YAjn2yrRKCIaggaB6qtioXJpolfEgJytca6lcg=;
        b=assIV3zVCORlQwJIBK6MUhT8ARddmeZoOLfjTgMyGKLjrn9Kbn2BUJlyurnw3tHsge
         PLCbhgNmuvl8j6kwuCkK2flC2KDILNcRMC4DguTG0BrPvZN/m89IXQvVRcmKNChaGCt3
         9DW8UNS5qdcm+97DO2oXje5yfGprN1ePhBR7SEVmA1HfD4e4S6ApyOCs6GSAUVxbf0sZ
         iaX4GRwnJB1iysQhquB6i0YrIjaAcM1CnNRDNSOVjHV8XrJncTSCI3R07NEJIQUAW9BW
         zl/6jXYbUkqlfu0Io4Aj/cuFK43PKDOW/pUXhmkaBcjJ3ka2BSsQiiakP6qloJNF+yuf
         J5cQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to;
        bh=9Svr3YAjn2yrRKCIaggaB6qtioXJpolfEgJytca6lcg=;
        b=Jc4xXJX5iAY/olK4V1qiQveYzT4fHYSdvrI4k7n6YF8ewo1nfaB72q7S6MwWwGHBAA
         XG6mAKjPnvrpRJG1dMbXXfTI0C0Fn84grESNOfJcLhVY0TB1Oxk6eKg3L2UXVawcTluM
         KTRLMM1eYUkUNMVollIPPOxu3+8G/sMaPjYBCnkwzK9HVyKmKyCRcIpqeYTQEMCF88nN
         nQnhcM+cyMD5NOAdDSknoLWrZ6V3PI8VH42cK8WVOWiAW6ALATyE6zn90dcf3XUMeeHj
         f466sAH54jnYC9qJVMc34kk4z8y3jwyWx/uCScocIByJgkv9J6GjSouYzf4mfBcaVCrS
         7SNQ==
X-Gm-Message-State: AGi0PuYt4FXAsEmJ1ilEWbxFF340rxs7NR2mXETTussIWbaXvF4Wis5z
        vE9RGg/XED0Y/RlVmGnn+gE=
X-Google-Smtp-Source: APiQypLlofye8gqxGNKaf5nqsxbwmxLsohbiJLxbvf2cDDF67OQlMKfrRxfSWpGh7G1QpIRbqGX2dQ==
X-Received: by 2002:a17:902:7003:: with SMTP id y3mr18096521plk.18.1589241717015;
        Mon, 11 May 2020 17:01:57 -0700 (PDT)
Received: from ast-mbp.dhcp.thefacebook.com ([2620:10d:c090:400::5:ec64])
        by smtp.gmail.com with ESMTPSA id a7sm10384764pfg.157.2020.05.11.17.01.55
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Mon, 11 May 2020 17:01:56 -0700 (PDT)
Date:   Mon, 11 May 2020 17:01:53 -0700
From:   Alexei Starovoitov <alexei.starovoitov@gmail.com>
To:     Daniel Borkmann <daniel@iogearbox.net>
Cc:     Maciej Fijalkowski <maciej.fijalkowski@intel.com>, ast@kernel.org,
        bpf@vger.kernel.org, netdev@vger.kernel.org, bjorn.topel@intel.com,
        magnus.karlsson@intel.com, lmb@cloudflare.com,
        john.fastabend@gmail.com
Subject: Re: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT
 prologue/epilogue generation
Message-ID: <20200512000153.hfdeh653v533qbe6@ast-mbp.dhcp.thefacebook.com>
References: <20200511143912.34086-1-maciej.fijalkowski@intel.com>
 <2e3c6be0-e482-d856-7cc1-b1d03a26428e@iogearbox.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <2e3c6be0-e482-d856-7cc1-b1d03a26428e@iogearbox.net>
Sender: bpf-owner@vger.kernel.org
Precedence: bulk
List-ID: <bpf.vger.kernel.org>
X-Mailing-List: bpf@vger.kernel.org

On Mon, May 11, 2020 at 10:05:25PM +0200, Daniel Borkmann wrote:
> Hey Maciej,
> 
> On 5/11/20 4:39 PM, Maciej Fijalkowski wrote:
> > Hi!
> > 
> > Today, BPF x86-64 JIT is preserving all of the callee-saved registers
> > for each BPF program being JITed, even when none of the R6-R9 registers
> > are used by the BPF program. Furthermore the tail call counter is always
> > pushed/popped to/from the stack even when there is no tail call usage in
> > BPF program being JITed. Optimization can be introduced that would
> > detect the usage of R6-R9 and based on that push/pop to/from the stack
> > only what is needed. Same goes for tail call counter.
> > 
> > Results look promising for such instruction reduction. Below are the
> > numbers for xdp1 sample on FVL 40G NIC receiving traffic from pktgen:
> > 
> > * With optimization: 22.3 Mpps
> > * Without:           19.0 mpps
> > 
> > So it's around 15% of performance improvement. Note that xdp1 is not
> > using any of callee saved registers, nor the tail call, hence such
> > speed-up.
> > 
> > There is one detail that needs to be handled though.
> > 
> > Currently, x86-64 JIT tail call implementation is skipping the prologue
> > of target BPF program that has constant size. With the mentioned
> > optimization implemented, each particular BPF program that might be
> > inserted onto the prog array map and therefore be the target of tail
> > call, could have various prologue size.
> > 
> > Let's have some pseudo-code example:
> > 
> > func1:
> > pro
> > code
> > epi
> > 
> > func2:
> > pro
> > code'
> > epi
> > 
> > func3:
> > pro
> > code''
> > epi
> > 
> > Today, pro and epi are always the same (9/7) instructions. So a tail
> > call from func1 to func2 is just a:
> > 
> > jump func2 + sizeof pro in bytes (PROLOGUE_SIZE)
> > 
> > With the optimization:
> > 
> > func1:
> > pro
> > code
> > epi
> > 
> > func2:
> > pro'
> > code'
> > epi'
> > 
> > func3:
> > pro''
> > code''
> > epi''
> > 
> > For making the tail calls up and running with the mentioned optimization
> > in place, x86-64 JIT should emit the pop registers instructions
> > that were pushed on prologue before the actual jump. Jump offset should
> > skip the instructions that are handling rbp/rsp, not the whole prologue.
> > 
> > A tail call within func1 would then need to be:
> > epi -> pop what pro pushed, but no leave/ret instructions
> > jump func2 + 16 // first push insn of pro'; if no push, then this would
> >                  // a direct jump to code'
> > 
> > Magic value of 16 comes from count of bytes that represent instructions
> > that are skipped:
> > 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > 55                      push   %rbp
> > 48 89 e5                mov    %rsp,%rbp
> > 48 81 ec 08 00 00 00    sub    $0x8,%rsp
> > 
> > which would in many cases add *more* instructions for tailcalls. If none
> > of callee-saved registers are used, then there would be no overhead with
> > such optimization in place.
> > 
> > I'm not sure how to measure properly the impact on the BPF programs that
> > are utilizing tail calls. Any suggestions?
> 
> Right, so far the numbers above (no callee saved registers, no tail calls)
> are really the best case scenario. I think programs not using callee saved
> registers are probably very limited in what they do, and tail calls are often
> used as well (although good enough for AF_XDP, for example). So I wonder how
> far we would regress with callee saved registers and tail calls. For Cilium
> right now you can roughly assume a worst case tail call depth of ~6 with static
> jumps (that we patch to jmp/nop). Only in one case we have a tail call map
> index that is non-static. In terms of registers, assume all of them are used
> one way or another. If you could check the impact in such setting, that would
> be great.
> 
> > Daniel, Alexei, what is your view on this?
> 
> I think performance wise this would be both pro and con depending how tail
> calls are used. One upside however, and I think you didn't mention it here
> would be that we don't need to clamp used stack space to 512, so we could
> actually track how much stack is used (or if any is used at all) and adapt
> it between tail calls? Depending on the numbers, if we'd go that route, it
> should rather be generalized and tracked via verifier so all JITs can behave
> the same (and these workarounds in verifier lifted). But then 15% performance
> improvement as you state above is a lot, probably we might regress at least
> as much as well in your benchmark. I wonder whether there should be a knob
> for it, though it's mainly implementation detail..

I was thinking about knob too, but users are rarely going to touch it,
so if we go for opt-in it will mostly be unused except by few folks.
So I think it's better to go with this approach unconditionally,
but first I'd like to see the performance numbers in how it regresses
the common case. AF_XDP's empty prog that does 'return XDP_PASS' is
a rare case and imo not worth optimizing for, but I see a lot of value
if this approach allows to lift tail_call vs bpf2bpf restriction.
It looks to me that bpf2bpf will be able to work. prog_A will call into prog_B
and if that prog does any kind of tail_call that tail_call will
eventually finish and the execution will return to prog_A as normal.
So I'd like to ask for two things:
1. perf numbers for something like progs/bpf_flow.c before and after
2. removal of bpf2bpf vs tail_call restriction and new selftests to prove
that it's working. that will include droping 512 stack clamp.