Re: BPF sk_lookup v5 - TCP SYN and UDP 0-len flood benchmarks

From: Jakub Sitnicki <jakub@cloudflare.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>,
	Network Development <netdev@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Andrii Nakryiko <andriin@fb.com>,
	Lorenz Bauer <lmb@cloudflare.com>,
	Marek Majkowski <marek@cloudflare.com>,
	Martin KaFai Lau <kafai@fb.com>, Yonghong Song <yhs@fb.com>
Subject: Re: BPF sk_lookup v5 - TCP SYN and UDP 0-len flood benchmarks
Date: Thu, 20 Aug 2020 12:29:30 +0200	[thread overview]
Message-ID: <87k0xtsj91.fsf@cloudflare.com> (raw)
In-Reply-To: <CAADnVQKE6y9h2fwX6OS837v-Uf+aBXnT_JXiN_bbo2gitZQ3tA@mail.gmail.com>

On Tue, Aug 18, 2020 at 08:19 PM CEST, Alexei Starovoitov wrote:
> On Tue, Aug 18, 2020 at 8:49 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>          :                      rcu_read_lock();
>>          :                      run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
>>     0.01 :   ffffffff817f8624:       mov    0xd68(%r12),%rsi
>>          :                      if (run_array) {
>>     0.00 :   ffffffff817f862c:       test   %rsi,%rsi
>>     0.00 :   ffffffff817f862f:       je     ffffffff817f87a9 <__udp4_lib_lookup+0x2c9>
>>          :                      struct bpf_sk_lookup_kern ctx = {
>>     1.05 :   ffffffff817f8635:       xor    %eax,%eax
>>     0.00 :   ffffffff817f8637:       mov    $0x6,%ecx
>>     0.01 :   ffffffff817f863c:       movl   $0x110002,0x40(%rsp)
>>     0.00 :   ffffffff817f8644:       lea    0x48(%rsp),%rdi
>>    18.76 :   ffffffff817f8649:       rep stos %rax,%es:(%rdi)
>>     1.12 :   ffffffff817f864c:       mov    0xc(%rsp),%eax
>>     0.00 :   ffffffff817f8650:       mov    %ebp,0x48(%rsp)
>>     0.00 :   ffffffff817f8654:       mov    %eax,0x44(%rsp)
>>     0.00 :   ffffffff817f8658:       movzwl 0x10(%rsp),%eax
>>     1.21 :   ffffffff817f865d:       mov    %ax,0x60(%rsp)
>>     0.00 :   ffffffff817f8662:       movzwl 0x20(%rsp),%eax
>>     0.00 :   ffffffff817f8667:       mov    %ax,0x62(%rsp)
>>          :                      .sport          = sport,
>>          :                      .dport          = dport,
>>          :                      };
>
> Such heavy hit to zero init 56-byte structure is surprising.
> There are two 4-byte holes in this struct. You can try to pack it and
> make sure that 'rep stoq' is used instead of 'rep stos' (8 byte at a time vs 4).

Thanks for the tip. I'll give it a try.

> Long term we should probably stop doing *_kern style of ctx passing
> into bpf progs.
> We have BTF, CO-RE and freplace now. This old style of memset *_kern and manual
> ctx conversion has performance implications and annoying copy-paste of ctx
> conversion routines.
> For this particular case instead of introducing udp4_lookup_run_bpf()
> and copying registers into stack we could have used freplace of
> udp4_lib_lookup2.
> More verifier work needed, of course.
> My main point that existing approach "lets prep args for bpf prog to
> run" that is used
> pretty much in every bpf hook is no longer necessary.

Andrii has also suggested leveraging BTF [0], but to expose the *_kern
struct directly to BPF prog instead of emitting ctx access instructions.

What I'm curious about is if we get rid of prepping args and ctx
conversion, then how do we limit what memory BPF prog can access?

Say, I'm passing a struct sock * to my BPF prog. If it's not a tracing
prog, then I don't want it to have access to everything that is
reachable from struct sock *. This is where this approach currently
breaks down for me.

[0] https://lore.kernel.org/bpf/CAEf4BzZ7-0TFD4+NqpK9X=Yuiem89Ug27v90fev=nn+3anCTpA@mail.gmail.com/