Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Stanislav Fomichev <sdf@google.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	ast@kernel.org, andrii@kernel.org,  martin.lau@linux.dev,
	razor@blackwall.org, john.fastabend@gmail.com,  kuba@kernel.org,
	dxu@dxuuu.xyz, joe@cilium.io, toke@kernel.org,
	 davem@davemloft.net, bpf@vger.kernel.org,
	netdev@vger.kernel.org
Subject: Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
Date: Thu, 8 Jun 2023 13:59:33 -0700	[thread overview]
Message-ID: <CAEf4BzbEf+U53UY6o+g5OZ6rg+T65_Aou4Nvrdbo-8sAjmdJmA@mail.gmail.com> (raw)
In-Reply-To: <ZIIOr1zvdRNTFKR7@google.com>

On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 06/07, Daniel Borkmann wrote:
> > This adds a generic layer called bpf_mprog which can be reused by different
> > attachment layers to enable multi-program attachment and dependency resolution.
> > In-kernel users of the bpf_mprog don't need to care about the dependency
> > resolution internals, they can just consume it with few API calls.
> >
> > The initial idea of having a generic API sparked out of discussion [0] from an
> > earlier revision of this work where tc's priority was reused and exposed via
> > BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> > as-is for classic tc BPF. The feedback was that priority provides a bad user
> > experience and is hard to use [1], e.g.:
> >
> >   I cannot help but feel that priority logic copy-paste from old tc, netfilter
> >   and friends is done because "that's how things were done in the past". [...]
> >   Priority gets exposed everywhere in uapi all the way to bpftool when it's
> >   right there for users to understand. And that's the main problem with it.
> >
> >   The user don't want to and don't need to be aware of it, but uapi forces them
> >   to pick the priority. [...] Your cover letter [0] example proves that in
> >   real life different service pick the same priority. They simply don't know
> >   any better. Priority is an unnecessary magic that apps _have_ to pick, so
> >   they just copy-paste and everyone ends up using the same.
> >
> > The course of the discussion showed more and more the need for a generic,
> > reusable API where the "same look and feel" can be applied for various other
> > program types beyond just tc BPF, for example XDP today does not have multi-
> > program support in kernel, but also there was interest around this API for
> > improving management of cgroup program types. Such common multi-program
> > management concept is useful for BPF management daemons or user space BPF
> > applications coordinating about their attachments.
> >
> > Both from Cilium and Meta side [2], we've collected the following requirements
> > for a generic attach/detach/query API for multi-progs which has been implemented
> > as part of this work:
> >
> >   - Support prog-based attach/detach and link API
> >   - Dependency directives (can also be combined):
> >     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> >       - BPF_F_ID flag as {fd,id} toggle
> >       - BPF_F_LINK flag as {prog,link} toggle
> >       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> >         BPF_F_AFTER will just append for the case of attaching
> >       - Enforced only at attach time
> >     - BPF_F_{FIRST,LAST}
> >       - Enforced throughout the bpf_mprog state's lifetime
> >       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
> >   - Internal revision counter and optionally being able to pass expected_revision
> >   - User space daemon can query current state with revision, and pass it along
> >     for attachment to assert current state before doing updates
> >   - Query also gets extension for link_ids array and link_attach_flags:
> >     - prog_ids are always filled with program IDs
> >     - link_ids are filled with link IDs when link was used, otherwise 0
> >     - {prog,link}_attach_flags for holding {prog,link}-specific flags
> >   - Must be easy to integrate/reuse for in-kernel users
> >
> > The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> > consisting of the additions of the attachment flags, revision counter, and
> > expanding existing union with relative_{fd,id} member.
> >
> > The bpf_mprog framework consists of an bpf_mprog_entry object which holds
> > an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> > structure). Both have been separated, so that fast-path gets efficient packing
> > of bpf_prog pointers for maximum cache efficieny. Also, array has been chosen
> > instead of linked list or other structures to remove unnecessary indirections
> > for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> > via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> > is populated and then just swapped which avoids additional allocations that
> > could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> > currently static, but they could be converted to dynamic allocation if necessary
> > at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
> > for example, in case of tcx which uses this API in the next patch, it piggy-
> > backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> > add,del} implementation and an extensive test suite for checking all aspects
> > of this API for prog-based attach/detach and link API as BPF selftests in
> > this series.
> >
> > Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
> >
> >   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
> >   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
> >   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> >
> > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > ---
> >  MAINTAINERS                    |   1 +
> >  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
> >  include/uapi/linux/bpf.h       |  37 ++-
> >  kernel/bpf/Makefile            |   2 +-
> >  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  37 ++-
> >  6 files changed, 781 insertions(+), 17 deletions(-)
> >  create mode 100644 include/linux/bpf_mprog.h
> >  create mode 100644 kernel/bpf/mprog.c
> >

[...]

> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index a7b5e91dd768..207f8a37b327 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> >   */
> >  #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> >  #define BPF_F_ALLOW_MULTI    (1U << 1)
> > +/* Generic attachment flags. */
> >  #define BPF_F_REPLACE                (1U << 2)
> > +#define BPF_F_BEFORE         (1U << 3)
> > +#define BPF_F_AFTER          (1U << 4)
>
> [..]
>
> > +#define BPF_F_FIRST          (1U << 5)
> > +#define BPF_F_LAST           (1U << 6)
>
> I'm still not sure whether the hard semantics of first/last is really
> useful. My worry is that some prog will just use BPF_F_FIRST which
> would prevent the rest of the users.. (starting with only
> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> need first/laste).

Without FIRST/LAST some scenarios cannot be guaranteed to be safely
implemented. E.g., if I have some hard audit requirements and I need
to guarantee that my program runs first and observes each event, I'll
enforce BPF_F_FIRST when attaching it. And if that attachment fails,
then server setup is broken and my application cannot function.

In a setup where we expect multiple applications to co-exist, it
should be a rule that no one is using FIRST/LAST (unless it's
absolutely required). And if someone doesn't comply, then that's a bug
and has to be reported to application owners.

But it's not up to the kernel to enforce this cooperation by
disallowing FIRST/LAST semantics, because that semantics is critical
for some applications, IMO.

>
> But if everyone besides myself is on board with first/last, maybe at least
> put a comment here saying that only a single program can be first/last?
> And the users are advised not to use these unless they really really really
> need to be first/last. (IOW, feels like first/last should be reserved
> for observability tools/etc).

+1, we can definitely make it clear in API that this will prevent
anyone else from being attached as FIRST/LAST, so it's not cooperative
in nature and has to be very consciously evaluated.

>
> > +#define BPF_F_ID             (1U << 7)
> > +#define BPF_F_LINK           BPF_F_LINK /* 1 << 13 */
> >
> >  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> >   * verifier will perform strict alignment checking as if the kernel

[...]