From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.1 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 37A0EC04AB4 for ; Fri, 17 May 2019 14:15:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id EC71F216F4 for ; Fri, 17 May 2019 14:15:52 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=cloudflare.com header.i=@cloudflare.com header.b="peazptBP" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728929AbfEQOPw (ORCPT ); Fri, 17 May 2019 10:15:52 -0400 Received: from mail-oi1-f178.google.com ([209.85.167.178]:39789 "EHLO mail-oi1-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728535AbfEQOPw (ORCPT ); Fri, 17 May 2019 10:15:52 -0400 Received: by mail-oi1-f178.google.com with SMTP id v2so5260389oie.6 for ; Fri, 17 May 2019 07:15:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudflare.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=qEdqQ0TVwMnwono/AVNH0gD3Yp9fek4Cvk4YPyWSv98=; b=peazptBPmA9W8okD7apzXztwROK1+YjIH4KzQYYcIPkBRrd4FqTW8uOamkDT3wAKjz tN4O7PIuIlApxXb2oBifv9r+jeRQFKnSrWz9+DxQb3fZ3rymdHE5VbvGwDcIW3i5HSx6 4WMJdDSWa6QVIuGccYqE00BKD7OGYAnbCmq/8= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=qEdqQ0TVwMnwono/AVNH0gD3Yp9fek4Cvk4YPyWSv98=; b=IepxcoY/nEuFW9UDTRxc7u9xFJ0b0e1wVEJTW42eEXCHcCmHxtYxCD95DfEMhrrEuP wy424TECiIMHc6CVawsYvdRECymItH7yDfkHObYyFJKtiMBp7q7lPIj2mjWgx2g7XAhm U01eWSnc1sy2BUkvG0ClyE3U01RG8enAKoztMcCyjYhRK4eZCmMK1M9shNXRWYLP2zg/ EVOZcSRswV0X2HYmPtLP+zDPJ8AgngqT2dpQSBdmf2dN7azCfj35t/B4bdQaHjOXh7et bH+BHGHTnhv0YNdjQbEi8HVGdg7GkvDPqkFljNCXkFJqk53tUiiBaIO9rDYw2/1Pezet 3jfw== X-Gm-Message-State: APjAAAVEUqYv30pgXYm+MO1KWZbrsirduDdvTT54N0DIi2wNCnHSJHb3 vwyYjbFBmxm2fDVEyLm4EVrMGmq1msMrGfXq7DSnAA== X-Google-Smtp-Source: APXvYqyiJdxfY4j12zpzWuhY9zRT1KkusvtBkF6eOWCx7oS7ACpD3n5A2kswFpst4qD4QMYv3k2UmwtX9xLWw8fwWlA= X-Received: by 2002:aca:43d5:: with SMTP id q204mr14890216oia.13.1558102551024; Fri, 17 May 2019 07:15:51 -0700 (PDT) MIME-Version: 1.0 References: <20190516203325.uhg7c5sr45od7lzm@ast-mbp> In-Reply-To: <20190516203325.uhg7c5sr45od7lzm@ast-mbp> From: Lorenz Bauer Date: Fri, 17 May 2019 15:15:39 +0100 Message-ID: Subject: Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers To: Alexei Starovoitov Cc: Joe Stringer , Networking , bpf@vger.kernel.org, Martin Lau , Daniel Borkmann , edumazet@google.com Content-Type: text/plain; charset="UTF-8" Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On Thu, 16 May 2019 at 21:33, Alexei Starovoitov wrote: > > On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote: > > On Wed, 15 May 2019 at 18:16, Joe Stringer wrote: > > > > > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer wrote: > > > > > > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned > > > > that the sk_lookup_* helpers currently return inconsistent results if > > > > SK_REUSEPORT programs are in play. > > > > > > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access > > > > to the full packet > > > > that triggered the look up. To support this, inet_lookup gained a new > > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT > > > > program is skipped and instead the socket is selected by its hash. > > > > > > > > The first problem is that not all callers to inet_lookup from BPF have > > > > an skb, e.g. XDP. This means that a look up from XDP gives an > > > > incorrect result. For now that is not a huge problem. However, once we > > > > get sk_assign as proposed by Joe, we can end up circumventing > > > > SK_REUSEPORT. > > > > > > To clarify a bit, the reason this is a problem is that a > > > straightforward implementation may just consider passing the skb > > > context into the sk_lookup_*() and through to the inet_lookup() so > > > that it would run the SK_REUSEPORT BPF program for socket selection on > > > the skb when the packet-path BPF program performs the socket lookup. > > > However, as this paragraph describes, the skb context is not always > > > available. > > > > > > > At the conference, someone suggested using a similar approach to the > > > > work done on the flow dissector by Stanislav: create a dedicated > > > > context sk_reuseport which can either take an skb or a plain pointer. > > > > Patch up load_bytes to deal with both. Pass the context to > > > > inet_lookup. > > > > > > > > This is when we hit the second problem: using the skb or XDP context > > > > directly is incorrect, because it assumes that the relevant protocol > > > > headers are at the start of the buffer. In our use case, the correct > > > > headers are at an offset since we're inspecting encapsulated packets. > > > > > > > > The best solution I've come up with is to steal 17 bits from the flags > > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for > > > > the offset itself. > > > > > > FYI there's also the upper 32 bits of the netns_id parameter, another > > > option would be to steal 16 bits from there. > > > > Or len, which is only 16 bits realistically. The offset doesn't really fit into > > either of them very well, using flags seemed the cleanest to me. > > Is there some best practice around this? > > > > > > > > > Thoughts? > > > > > > Internally with skbs, we use `skb_pull()` to manage header offsets, > > > could we do something similar with `bpf_xdp_adjust_head()` prior to > > > the call to `bpf_sk_lookup_*()`? > > > > That would only work if it retained the contents of the skipped > > buffer, and if there > > was a way to undo the adjustment later. We're doing the sk_lookup to > > decide whether to > > accept or forward the packet, so at the point of the call we might still need > > that data. Is that feasible with skb / XDP ctx? > > While discussing the solution for reuseport I propose to use > progs/test_select_reuseport_kern.c as an example of realistic program. > It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes() > including payload after the header. > It also uses bpf_skb_load_bytes_relative() to fetch IP. > I think if we're fixing the sk_lookup from XDP the above program > would need to work. Agreed. > And I think we can make it work by adding new requirement that > 'struct bpf_sock_tuple *' argument to bpf_sk_lookup_* must be > a pointer to the packet and not a pointer to bpf program stack. This would break existing users, no? The sk_assign use case Joe Stringer is working on would also break, because its impossible to look up a tuple that hasn't come from the network. It occurs to me that it's impossible to reconcile this use case with SK_REUSEPORT in general. It would be great if we could return an error in such case. > Then helper can construct a fake skb and assign > fake_skb->data = &bpf_sock_tuple_arg.sport That isn't valid if the packet contains IP options or extension headers, because the offset of sport is variable. > It can check that struct bpf_sock_tuple * pointer is within 100-ish bytes > from xdp->data and within xdp->data_end Why the 100-byte limitation? > This way the reuseport program's assumption that ctx->data points to tcp/udp > will be preserved and it can access it all including payload. How about the following: sk_lookup(ctx, &saddr, len, netns, BPF_F_IPV4 | BPF_F_OFFSET(offsetof(sport)) SK_REUSEPORT can then access from saddr+offsetof(sport) to saddr+len. The helper uses offsetof(sport) to retrieve the tuple. - Works with stack, map, packet pointers - The verifier does bounds checking on the buffer for us due to ARG_CONST_SIZE - If no BPF_F_IPV? is present, we fall back to current behaviour > > This approach doesn't need to mess with xdp_adjust_head and adjust uapi to pass length. > Existing progs/test_sk_lookup_kern.c will magically start working with XDP > even when reuseport prog is attached. > Thoughts? > -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com