From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.9 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B5C4BC48BE6 for ; Wed, 16 Jun 2021 12:04:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8E5986128C for ; Wed, 16 Jun 2021 12:04:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232742AbhFPMGK (ORCPT ); Wed, 16 Jun 2021 08:06:10 -0400 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]:35945 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232797AbhFPMGI (ORCPT ); Wed, 16 Jun 2021 08:06:08 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1623845041; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=pBL/cWZLjH7AvvpeVIeZW3LuRVTP6jy3YXY7ILAJQ3Q=; b=FPRGDX3nT7hHH5TRMCJzcdNy6Mv0i2OoRabRRCeKl2xs2y2MH1IkC+mFoBrrO9+zf8ej61 LmlJuNhtDcDZ2cP64Tb79BJc1oB9Cwb84JE0Gd0lFBxmb2RKihvNwK/OEmAFDW0em2trqa g0ZQ3wT2zwRFveQ4ohxtoKNQ7yc9RW0= Received: from mail-ej1-f71.google.com (mail-ej1-f71.google.com [209.85.218.71]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-589-1r8xJ6NZMxmCdRN7AkSupA-1; Wed, 16 Jun 2021 08:04:00 -0400 X-MC-Unique: 1r8xJ6NZMxmCdRN7AkSupA-1 Received: by mail-ej1-f71.google.com with SMTP id nd10-20020a170907628ab02903a324b229bfso851542ejc.7 for ; Wed, 16 Jun 2021 05:04:00 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:in-reply-to:references:date :message-id:mime-version:content-transfer-encoding; bh=pBL/cWZLjH7AvvpeVIeZW3LuRVTP6jy3YXY7ILAJQ3Q=; b=iLnSzinwryQQyxt4ykwbOChvEhrPFLd9ViXL3lm8pyXC80NqcpNNmMSm446HkT9BRt VoI7IU6C2NDM6jLve0CR6kl8bnilZYnXZs3x99TIE+7g67QGbbK2FESR0BD6ju4DsNKL oeT9hsE5/42K0Kq9H386BBYcawzTtUj1S1CDWToJHjGVHWKFnXJXEFfqpKag1SsBiMw8 WVNphL45VeT+HQjh2VM4xSpA3GnfuYpDcELiOr/47H4sX0KRdcKXnvTS64mt98B3vqok bn7ybubJvQxUV5UufP6D7yxkwMm/Y2sOHXipk7pPL3uPwTPBe3w/wliRJBO8JVmKy/7z Ubcg== X-Gm-Message-State: AOAM53064eEqI8EtRZ/0jLjEeif8jH7zaY+cCAf5ifZSaW3KQPHo67MD UmrZElnC604cjyiRrL+XznLJ7IsPewNsboC6UfJitYnNrDRbRHFmTMsn0eBTgb7ap72d6j2Iznk x9mhS8kHkLPiE X-Received: by 2002:aa7:c7cd:: with SMTP id o13mr3723530eds.269.1623845039556; Wed, 16 Jun 2021 05:03:59 -0700 (PDT) X-Google-Smtp-Source: ABdhPJz1fCsxeMQFLiuXlDkhshkO5c8JQbYXu6dp+IeYZNq+FBqnplEU62m2iAFyL8pPxX7VHJ70kA== X-Received: by 2002:aa7:c7cd:: with SMTP id o13mr3723486eds.269.1623845039352; Wed, 16 Jun 2021 05:03:59 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([45.145.92.2]) by smtp.gmail.com with ESMTPSA id df20sm1058528edb.76.2021.06.16.05.03.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 16 Jun 2021 05:03:58 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id E58641802C0; Wed, 16 Jun 2021 14:03:56 +0200 (CEST) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: Daniel Borkmann , Cong Wang , Kumar Kartikeya Dwivedi Cc: bpf , Alexei Starovoitov , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Jamal Hadi Salim , Vlad Buslov , Jiri Pirko , "David S. Miller" , Jakub Kicinski , Joe Stringer , Quentin Monnet , Jesper Dangaard Brouer , Linux Kernel Network Developers Subject: Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API In-Reply-To: <15cd0a9c-95a1-9766-fca1-4bf9d09e4100@iogearbox.net> References: <20210528195946.2375109-1-memxor@gmail.com> <20210607033724.wn6qn4v42dlm4j4o@apollo> <20210607060724.4nidap5eywb23l3d@apollo> <20210608071908.sos275adj3gunewo@apollo> <20210613025308.75uia7rnt4ue2k7q@apollo> <877divs5py.fsf@toke.dk> <15cd0a9c-95a1-9766-fca1-4bf9d09e4100@iogearbox.net> X-Clacks-Overhead: GNU Terry Pratchett Date: Wed, 16 Jun 2021 14:03:56 +0200 Message-ID: <87a6nqqamr.fsf@toke.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org Daniel Borkmann writes: > On 6/15/21 1:54 PM, Toke H=C3=B8iland-J=C3=B8rgensen wrote: >> Cong Wang writes: > [...] >>>>> I offer two different views here: >>>>> >>>>> 1. If you view a TC filter as an instance as a netdev/qdisc/action, t= hey >>>>> are no different from this perspective. Maybe the fact that a TC filt= er >>>>> resides in a qdisc makes a slight difference here, but like I mention= ed, it >>>>> actually makes sense to let TC filters be standalone, qdisc's just ha= ve to >>>>> bind with them, like how we bind TC filters with standalone TC action= s. >>>> >>>> You propose something different below IIUC, but I explained why I'm wa= ry of >>>> these unbound filters. They seem to add a step to classifier setup for= no real >>>> benefit to the user (except keeping track of one more object and clean= ing it >>>> up with the link when done). >>> >>> I am not even sure if unbound filters help your case at all, making >>> them unbound merely changes their residence, not ownership. >>> You are trying to pass the ownership from TC to bpf_link, which >>> is what I am against. >>=20 >> So what do you propose instead? >>=20 >> bpf_link is solving a specific problem: ensuring automatic cleanup of >> kernel resources held by a userspace application with a BPF component. >> Not all applications work this way, but for the ones that do it's very >> useful. But if the TC filter stays around after bpf_link detaches, that >> kinda defeats the point of the automatic cleanup. >>=20 >> So I don't really see any way around transferring ownership somehow. >> Unless you have some other idea that I'm missing? > > Just to keep on brainstorming here, I wanted to bring back Alexei's earli= er quote: > > > I think it makes sense to create these objects as part of establishi= ng bpf_link. > > ingress qdisc is a fake qdisc anyway. > > If we could go back in time I would argue that its existence doesn't > > need to be shown in iproute2. It's an object that serves no purpose > > other than attaching filters to it. It doesn't do any queuing unlike > > real qdiscs. > > It's an artifact of old choices. Old doesn't mean good. > > The kernel is full of such quirks and oddities. New api-s shouldn't > > blindly follow them. > > tc qdisc add dev eth0 clsact > > is a useless command with nop effect. > > The whole bpf_link in this context feels somewhat awkward because both ar= e two > different worlds, one accessible via netlink with its own lifetime etc, t= he other > one tied to fds and bpf syscall. Back in the days we did the cls_bpf inte= gration > since it felt the most natural at that time and it had support for both t= he ingress > and egress side, along with the direct action support which was added lat= er to have > a proper fast path for BPF. One thing that I personally never liked is th= at later > on tc sadly became a complex, quirky dumping ground for all the nic hw of= floads (I > guess mainly driven from ovs side) for which I have a hard time convincin= g myself > that this is used at scale in production. Stuff like af699626ee26 just to= pick one > which annoyingly also adds to the fast path given distros will just compi= le in most > of these things (like NET_TC_SKB_EXT)... what if such bpf_link object is = not tied > at all to cls_bpf or cls_act qdisc, and instead would implement the tcf_c= lassify_ > {egress,ingress}() as-is in that sense, similar like the bpf_lsm hooks. M= eaning, > you could run existing tc BPF prog without any modifications and without = additional > extra overhead (no need to walk the clsact qdisc and then again into the = cls_bpf > one). These tc BPF programs would be managed only from bpf() via tc bpf_l= ink api, > and are otherwise not bothering to classic tc command (though they could = be dumped > there as well for sake of visibility, though bpftool would be fitting too= ). However, > if there is something attached from classic tc side, it would also go int= o the old > style tcf_classify_ingress() implementation and walk whatever is there so= that nothing > existing breaks (same as when no bpf_link would be present so that there = is no extra > overhead). This would also allow for a migration path of multi prog from = cls_bpf to > this new implementation. Details still tbd, but I would much rather like = such an > approach than the current discussed one, and it would also fit better giv= en we don't > run into this current mismatch of both worlds. So this would entail adding a separate list of BPF programs and run through those at the start of sch_handle_{egress,ingress}() I suppose? And that list of filters would only contain bpf_link-attached BPF programs, sorted by priority like TC filters? And return codes of TC_ACT_OK or TC_ACT_RECLASSIFY would continue through to tcf_classify_{egress,ingress}()? I suppose that could work; we could even stick the second filter list in struct mini_Qdisc and have clsact and bpf_link cooperate on managing that, no? That way it would also be easy to dump the BPF filters via netlink: I do think that will be the least surprising thing to do (so people can at least see there's something there with existing tools). The overhead would be a single extra branch when only one of clsact or bpf_link is in use (to check if the other list of filters is set); that's probably acceptable at this level... -Toke