From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7C03AC2D0E8 for ; Fri, 27 Mar 2020 19:42:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 49C5B206F1 for ; Fri, 27 Mar 2020 19:42:36 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="QE7qx52c" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727639AbgC0Tmg (ORCPT ); Fri, 27 Mar 2020 15:42:36 -0400 Received: from mail-qk1-f195.google.com ([209.85.222.195]:44804 "EHLO mail-qk1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726738AbgC0Tmf (ORCPT ); Fri, 27 Mar 2020 15:42:35 -0400 Received: by mail-qk1-f195.google.com with SMTP id j4so12084930qkc.11; Fri, 27 Mar 2020 12:42:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=rKL+VFjmUVuqurp/2aQxdNqmyU9h2BvyuAzk/lejr0M=; b=QE7qx52c5uU7otJedEAGqLh7z72neXsaeOu/9TeTe0mXPoGy7UTj8Ugj8f+po4bWqo DPoJQquEYKvF1g02ta+EMSU/ve4jkYTI3DLoC+d/KiJYkOQEjgRAyfaBY44TIoSko34F Yln6RAgYvI3Apb/nIL6el/sKDokUH2PNNZKl9cS/HSEQAd0brCtw46o8iCliCbvaQ3TX BO6CIAypMKqOeNjifTh/fJS9fl+EjgWlp+5wfK/dLC1oacmWq3JxPhfiO6y5wFlOubA/ 9NDy01d48teEmBJ11WzYJUnueBrbqHk6gHOsuFVj99YANashoeE36yPhYytagAPKyWly RosQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=rKL+VFjmUVuqurp/2aQxdNqmyU9h2BvyuAzk/lejr0M=; b=lNM3+SKSpCuU7BbafFD1LvsqlxBoKcof+u1oewWu7MAJseZD9/Gunok44c6/bVSTsw VfQ87Cl6PGHP92AQvhYWwNgykeHLnrP6sO+ScXnfqkbNKQNoIlo8KD/tGM11aonQN7GH GctW0WpiX8yUhb+iOH9dFmQTNQLykNbA93c4EQ6dOsOHwylKllex8rtejwnHnYBMOojY Jk17MarDkxbfxkcLLh+DI17Ck1xo8j/1ITR0tvGKmDYMvTJm67bV7Qi9Ua5PYRKoI9tY mpYpGHFsDX8KYl3DDqboHu3DHmHekO3Z76bygLK6PQJ0N+hPrU4/W33fTsTPkaYcvQGl GCPg== X-Gm-Message-State: ANhLgQ1XIh8nM6pMHZzu5QJyPx2MXVm+pqSbiaLRNJCRY62WdtCYHZUV xx/3jobdPnlX+E5BezD1SZRdRor+gIbGda1E4Wo= X-Google-Smtp-Source: ADFU+vuICBI/8pH0HKxQowt2zL+CvN3o5UpzUK1m4NcibUolVrrMyPoTSdLh2ubwLYMkkpcjjimZAMzKyKNyuCS43zc= X-Received: by 2002:a37:6411:: with SMTP id y17mr985067qkb.437.1585338154019; Fri, 27 Mar 2020 12:42:34 -0700 (PDT) MIME-Version: 1.0 References: <158462359206.164779.15902346296781033076.stgit@toke.dk> <158462359315.164779.13931660750493121404.stgit@toke.dk> <20200319155236.3d8537c5@kicinski-fedora-PC1C0HJN> <875zez76ph.fsf@toke.dk> <20200320103530.2853c573@kicinski-fedora-PC1C0HJN> <5e750bd4ebf8d_233f2ab4c81425c4ce@john-XPS-13-9370.notmuch> <87tv2f48lp.fsf@toke.dk> <87h7ye3mf3.fsf@toke.dk> <87tv2e10ly.fsf@toke.dk> <87369wrcyv.fsf@toke.dk> <87pncznvjy.fsf@toke.dk> In-Reply-To: From: Andrii Nakryiko Date: Fri, 27 Mar 2020 12:42:20 -0700 Message-ID: Subject: Re: [PATCH bpf-next 1/4] xdp: Support specifying expected existing program when attaching XDP To: Lorenz Bauer Cc: =?UTF-8?B?VG9rZSBIw7hpbGFuZC1Kw7hyZ2Vuc2Vu?= , John Fastabend , Jakub Kicinski , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau , Song Liu , Yonghong Song , Andrii Nakryiko , "David S. Miller" , Jesper Dangaard Brouer , Andrey Ignatov , Networking , bpf Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On Fri, Mar 27, 2020 at 4:07 AM Lorenz Bauer wrote: > > On Thu, 26 Mar 2020 at 19:06, Andrii Nakryiko = wrote: > > > > On Thu, Mar 26, 2020 at 5:35 AM Toke H=C3=B8iland-J=C3=B8rgensen wrote: > > > > > > Andrii Nakryiko writes: > > > > > > > Now for XDP. It has same flawed model. And even if it seems to you > > > > that it's not a big issue, and even if Jakub thinks we are trying t= o > > > > solve non-existing problem, it is a real problem and a real concern > > > > from people that have to support XDP in production with many > > > > well-meaning developers developing BPF applications independently. > > > > Copying what you wrote in another thread: > > > > > > > >> Setting aside the question of which is the best abstraction to rep= resent > > > >> an attachment, it seems to me that the actual behavioural problem = (XDP > > > >> programs being overridden by mistake) would be solvable by this pa= tch, > > > >> assuming well-behaved userspace applications. > > > > > > > > ... this is a horrible and unrealistic assumption that we just cann= ot > > > > make and accept. However well-behaved userspace applications are, t= hey > > > > are written by people that make mistakes. And rather than blissfull= y > > > > expect that everything will be fine, we want to have enforcements i= n > > > > place that will prevent some buggy application to wreck havoc in > > > > production. > > > > > > Look, I'm not trying to tell you how to managed your internal systems= . > > > I'm just objecting to your assertion that your deployment model is th= e > > > only one that can possibly work, and the refusal to consider other > > > alternatives that comes with it. > > > > Your assumption doesn't work for us. Because of that we need something > > like bpf_link. Existing attachment API doesn't go away and is still > > supported. Feel free to use existing API. As for EXPECTED_FD API you > > are adding, it will be up to maintainers to decide, ultimately, I > > can't block it, even if I wanted to. > > > > > > > > >> You're saying that like we didn't already have the netlink API. We > > > >> essentially already have (the equivalent of) LINK_CREATE and LINK_= QUERY, > > > >> this is just adding LINK_UPDATE. It's a straight-forward fix of an > > > >> existing API; essentially you're saying we should keep the old API= in a > > > >> crippled state in order to promote your (proposed) new API. > > > > > > > > This is the fundamental disagreement that we seem to have. XDP's BP= F > > > > program attachment is not in any way equivalent to bpf_link. So no, > > > > netlink API currently doesn't have anything that's close to bpf_lin= k. > > > > Let me try to summarize what bpf_link is and what are its fundament= al > > > > properties regardless of type of BPF programs. > > > > > > First of all, thank you for this summary; that is very useful! > > > > Sure, you're welcome. > > > > > > > > > 1. bpf_link represents a connection (pairing?) of BPF program and s= ome > > > > BPF hook it is attached to. BPF hook could be perf event, cgroup, > > > > netdev, etc. It's a completely independent object in itself, along = the > > > > bpf_map and bpf_prog, which has its own lifetime and kernel > > > > representation. To user-space application it is returned as an > > > > installed FD, similar to loaded BPF program and BPF map. It is > > > > important that it's not just a BPF program, because BPF program can= be > > > > attached to multiple BPF hooks (e.g., same XDP program can be attac= hed > > > > to multiple interface; same kprobe handler can be installed multipl= e > > > > times), which means that having BPF program FD isn't enough to > > > > uniquely represent that one specific BPF program attachment and det= ach > > > > it or query it. Having kernel object for this allows to encapsulate > > > > all these various details of what is attached were and present to > > > > user-space a single handle (FD) to work with. > > > > > > For XDP there is already a unique handle, it's just implicit: Each > > > netdev can have exactly one XDP program loaded. So I don't really see > > > how bpf_link adds anything, other than another API for the same thing= ? > > > > I certainly failed to explain things clearly if you are still asking > > this. See point #2, once you attach bpf_link you can't just replace > > it. This is what XDP doesn't have right now. > > From your description I like bpf_link, because it'll make attachment easi= er > to support, and the pinning behaviour also seems nice. I'm really not fus= sed > by netlink vs syscall, whatever. Great, thanks. > > However, this behaviour concerns me. It's like Windows not > letting you delete a file while an application has it opened, which just = leads > to randomly killing programs until you find the right one. It's frustrati= ng > and counter productive. > > You're taking power away from the operator. In your deployment scenario > this might make sense, but I think it's a really bad model in general. If= I am > privileged I need to be able to exercise that privilege. This means that = if > there is a netdevice in my network namespace, and I have CAP_NET_ADMIN > or whatever, I can break the association. > > So, to be constructive: I'd prefer bpf_link to replace a netlink attachme= nt and > vice versa. If you need to restrict control, use network namespaces > to hide the devices, instead of hiding the bpffs. Alexei mentioned a "nuke" option few times already, that will solve this. The idea is that human operation should be able to do this, but not applications, even though they have CAP_NET_ADMIN. I don't know how exactly interface will look like, but it shouldn't allow applications just randomly replace bpf_link. There are legitimate use cases where application has to have CAP_NET_ADMIN and we can't hide netdevice from them, unfortunately. > > > > > It's a game of picking features/properties in isolation and "we can do > > this particular thing this different way with what we have". Please, > > try consider all of it together, it's important. Every single aspect > > of bpf_link is not that unique, but it's all of them together that > > matter. > > > > > > > > > 2. Due to having FD associated with bpf_link, it's not possible to > > > > talk about "owning" bpf_link. If application created link and never > > > > shared its FD with any other application, it is the sole owner of i= t. > > > > But it also means that you can share it, if you need it. Now, once > > > > application closes FD or app crashes and kernel automatically close= s > > > > that FD, bpf_link refcount is decremented. If it was the last or on= ly > > > > FD, it will trigger automatica detachment and clean up of that > > > > particular BPF program attachment. Note, not a clean up of BPF > > > > program, which can still be attached somewhere else: only that > > > > particular attachment. > > > > > > This behaviour is actually one of my reservations against bpf_link fo= r > > > XDP: I think that automatically detaching XDP programs when the FD is > > > closed is very much the wrong behaviour. An XDP program processes > > > packets, and when loading one I very much expect it to keep doing tha= t > > > until I explicitly tell it to stop. > > > > As you mentioned earlier, "it's not the only one mode". Just like with > > tracing APIs, you can imagine scripts that would adds their > > packet-sniffing XDP program temporarily. If they crash, "temporarily" > > turns into "permanently, but no one knows". This is bad. And again, > > it's a choice, just with a default to auto-cleanup, because it's safe, > > even if it requires extra step for applications willing to do > > permanent XDP attachment. > > > > > > > > > 3. This derives from the concept of ownership of bpf_link. Once > > > > bpf_link is attached, no other application that doesn't own that > > > > bpf_link can replace, detach or modify the link. For some cases it > > > > doesn't matter. E.g., for tracing, all attachment to the same fentr= y > > > > trampoline are completely independent. But for other cases this is > > > > crucial property. E.g., when you attach BPF program in an exclusive > > > > (single) mode, it means that particular cgroup and any of its child= ren > > > > cgroups can have any more BPF programs attached. This is important = for > > > > container management systems to enforce invariants and correct > > > > functioning of the system. Right now it's very easy to violate that= - > > > > you just go and attach your own BPF program, and previous BPF progr= am > > > > gets automatically detached without original application that put i= t > > > > there knowing about this. Chaos ensues after that and real people h= ave > > > > to deal with this. Which is why existing > > > > BPF_PROG_ATTACH/BPF_PROG_DETACH API is inadequate and we are adding > > > > bpf_link support. > > > > > > I can totally see how having an option to enforce a policy such as > > > locking out others from installing cgroup BPF programs is useful. But > > > such an option is just that: policy. So building this policy in as a > > > fundamental property of the API seems like a bad idea; that is > > > effectively enforcing policy in the kernel, isn't it? > > > > I hope we won't go into a dictionary definition of what "policy" means > > here :). For me it's about guarantee that kernel gives to user-space. > > bpf_link doesn't care about dictating policies. If you don't want this > > guarantee - don't use bpf_link, use direct program attachment. As > > simple as that. Policy is implemented by user-space application by > > using APIs with just the right guarantees. > > > > > > > > > Those same folks have similar concern with XDP. In the world where > > > > container management installs "root" XDP program which other user > > > > applications can plug into (libxdp use case, right?), it's crucial = to > > > > ensure that this root XDP program is not accidentally overwritten b= y > > > > some well-meaning, but not overly cautious developer experimenting = in > > > > his own container with XDP programs. This is where bpf_link ownersh= ip > > > > plays a huge role. Tupperware agent (FB's container management agen= t) > > > > would install root XDP program and will hold onto this bpf_link > > > > without sharing it with other applications. That will guarantee tha= t > > > > the system will be stable and can't be compromised. > > > > > > See this is where we get into "deployment-model specific territory". = I > > > mean, sure, in the "central management daemon" model, it makes sense > > > that no other applications can replace the XDP program. But, erm, we > > > already have a mechanism to ensure that: Just don't grant those > > > applications CAP_NET_ADMIN? So again, bpf_link doesn't really seem to > > > add anything other than a different way to do the same thing? > > > > Because there are still applications that need CAP_NET_ADMIN in order > > to function (for other reasons than attaching XDP), so it's impossible > > to enforce with for everyone. > > I think I'm missing some context. CAP_NET_ADMIN is trusted by definition, > so trust these applications to not fiddle with XDP? Are there many of the= se? > Are they inside a user namespace or something? > > > > > > > > > Additionally, in the case where there is *not* a central management > > > daemon (i.e., what I'm implementing with libxdp), this would be the f= low > > > implemented by the library without bpf_link: > > > > > > 1. Query kernel for current BPF prog loaded on $IFACE > > > 2. Sanity-check that this program is a dispatcher program installed b= y > > > libxdp > > > 3. Create a new dispatcher program with whatever changes we want to d= o > > > (such as adding another component program). > > > 4. Atomically replace the old program with the new one using the netl= ink > > > API in this patch series. > > > > > > Whereas with bpf_link, it would be: > > > > > > 1. Find the pinned bpf_link for $IFACE (e.g., load from > > > /sys/fs/bpf/iface-links/$IFNAME). > > > > But now you can hide this mount point from containerized > > root/CAP_NET_ADMIN application, can't you? See the difference? One > > might think about bpf_link as a fine-grained capability in this sense. > > > > > > > 2. Query kernel for current BPF prog linked to $LINK > > > 3. Sanity-check that this program is a dispatcher program installed b= y > > > libxdp > > > 4. Create a new dispatcher program with whatever changes we want to d= o > > > (such as adding another component program). > > > 5. Atomically replace the old program with the new one using the > > > LINK_UPDATE bpf() API. > > > > > > > > > So all this does is add an additional step, and another dependency on > > > bpffs. And crucially, I really don't see how the "bpf_link is the onl= y > > > thing that is not fundamentally broken" argument holds up. > > > > > > -Toke > > > > > > > -- > Lorenz Bauer | Systems Engineer > 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK > > www.cloudflare.com