All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Stanislav Fomichev <sdf@google.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	ast@kernel.org, andrii@kernel.org,  martin.lau@linux.dev,
	razor@blackwall.org, john.fastabend@gmail.com,  kuba@kernel.org,
	dxu@dxuuu.xyz, joe@cilium.io, toke@kernel.org,
	 davem@davemloft.net, bpf@vger.kernel.org,
	netdev@vger.kernel.org
Subject: Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
Date: Thu, 8 Jun 2023 13:59:33 -0700	[thread overview]
Message-ID: <CAEf4BzbEf+U53UY6o+g5OZ6rg+T65_Aou4Nvrdbo-8sAjmdJmA@mail.gmail.com> (raw)
In-Reply-To: <ZIIOr1zvdRNTFKR7@google.com>

On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 06/07, Daniel Borkmann wrote:
> > This adds a generic layer called bpf_mprog which can be reused by different
> > attachment layers to enable multi-program attachment and dependency resolution.
> > In-kernel users of the bpf_mprog don't need to care about the dependency
> > resolution internals, they can just consume it with few API calls.
> >
> > The initial idea of having a generic API sparked out of discussion [0] from an
> > earlier revision of this work where tc's priority was reused and exposed via
> > BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> > as-is for classic tc BPF. The feedback was that priority provides a bad user
> > experience and is hard to use [1], e.g.:
> >
> >   I cannot help but feel that priority logic copy-paste from old tc, netfilter
> >   and friends is done because "that's how things were done in the past". [...]
> >   Priority gets exposed everywhere in uapi all the way to bpftool when it's
> >   right there for users to understand. And that's the main problem with it.
> >
> >   The user don't want to and don't need to be aware of it, but uapi forces them
> >   to pick the priority. [...] Your cover letter [0] example proves that in
> >   real life different service pick the same priority. They simply don't know
> >   any better. Priority is an unnecessary magic that apps _have_ to pick, so
> >   they just copy-paste and everyone ends up using the same.
> >
> > The course of the discussion showed more and more the need for a generic,
> > reusable API where the "same look and feel" can be applied for various other
> > program types beyond just tc BPF, for example XDP today does not have multi-
> > program support in kernel, but also there was interest around this API for
> > improving management of cgroup program types. Such common multi-program
> > management concept is useful for BPF management daemons or user space BPF
> > applications coordinating about their attachments.
> >
> > Both from Cilium and Meta side [2], we've collected the following requirements
> > for a generic attach/detach/query API for multi-progs which has been implemented
> > as part of this work:
> >
> >   - Support prog-based attach/detach and link API
> >   - Dependency directives (can also be combined):
> >     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> >       - BPF_F_ID flag as {fd,id} toggle
> >       - BPF_F_LINK flag as {prog,link} toggle
> >       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> >         BPF_F_AFTER will just append for the case of attaching
> >       - Enforced only at attach time
> >     - BPF_F_{FIRST,LAST}
> >       - Enforced throughout the bpf_mprog state's lifetime
> >       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
> >   - Internal revision counter and optionally being able to pass expected_revision
> >   - User space daemon can query current state with revision, and pass it along
> >     for attachment to assert current state before doing updates
> >   - Query also gets extension for link_ids array and link_attach_flags:
> >     - prog_ids are always filled with program IDs
> >     - link_ids are filled with link IDs when link was used, otherwise 0
> >     - {prog,link}_attach_flags for holding {prog,link}-specific flags
> >   - Must be easy to integrate/reuse for in-kernel users
> >
> > The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> > consisting of the additions of the attachment flags, revision counter, and
> > expanding existing union with relative_{fd,id} member.
> >
> > The bpf_mprog framework consists of an bpf_mprog_entry object which holds
> > an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> > structure). Both have been separated, so that fast-path gets efficient packing
> > of bpf_prog pointers for maximum cache efficieny. Also, array has been chosen
> > instead of linked list or other structures to remove unnecessary indirections
> > for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> > via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> > is populated and then just swapped which avoids additional allocations that
> > could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> > currently static, but they could be converted to dynamic allocation if necessary
> > at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
> > for example, in case of tcx which uses this API in the next patch, it piggy-
> > backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> > add,del} implementation and an extensive test suite for checking all aspects
> > of this API for prog-based attach/detach and link API as BPF selftests in
> > this series.
> >
> > Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
> >
> >   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
> >   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
> >   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> >
> > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > ---
> >  MAINTAINERS                    |   1 +
> >  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
> >  include/uapi/linux/bpf.h       |  37 ++-
> >  kernel/bpf/Makefile            |   2 +-
> >  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  37 ++-
> >  6 files changed, 781 insertions(+), 17 deletions(-)
> >  create mode 100644 include/linux/bpf_mprog.h
> >  create mode 100644 kernel/bpf/mprog.c
> >

[...]

> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index a7b5e91dd768..207f8a37b327 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> >   */
> >  #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> >  #define BPF_F_ALLOW_MULTI    (1U << 1)
> > +/* Generic attachment flags. */
> >  #define BPF_F_REPLACE                (1U << 2)
> > +#define BPF_F_BEFORE         (1U << 3)
> > +#define BPF_F_AFTER          (1U << 4)
>
> [..]
>
> > +#define BPF_F_FIRST          (1U << 5)
> > +#define BPF_F_LAST           (1U << 6)
>
> I'm still not sure whether the hard semantics of first/last is really
> useful. My worry is that some prog will just use BPF_F_FIRST which
> would prevent the rest of the users.. (starting with only
> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> need first/laste).

Without FIRST/LAST some scenarios cannot be guaranteed to be safely
implemented. E.g., if I have some hard audit requirements and I need
to guarantee that my program runs first and observes each event, I'll
enforce BPF_F_FIRST when attaching it. And if that attachment fails,
then server setup is broken and my application cannot function.

In a setup where we expect multiple applications to co-exist, it
should be a rule that no one is using FIRST/LAST (unless it's
absolutely required). And if someone doesn't comply, then that's a bug
and has to be reported to application owners.

But it's not up to the kernel to enforce this cooperation by
disallowing FIRST/LAST semantics, because that semantics is critical
for some applications, IMO.

>
> But if everyone besides myself is on board with first/last, maybe at least
> put a comment here saying that only a single program can be first/last?
> And the users are advised not to use these unless they really really really
> need to be first/last. (IOW, feels like first/last should be reserved
> for observability tools/etc).

+1, we can definitely make it clear in API that this will prevent
anyone else from being attached as FIRST/LAST, so it's not cooperative
in nature and has to be very consciously evaluated.

>
> > +#define BPF_F_ID             (1U << 7)
> > +#define BPF_F_LINK           BPF_F_LINK /* 1 << 13 */
> >
> >  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> >   * verifier will perform strict alignment checking as if the kernel

[...]

  reply	other threads:[~2023-06-08 20:59 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
2023-06-07 19:26 ` [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs Daniel Borkmann
2023-06-08 17:23   ` Stanislav Fomichev
2023-06-08 20:59     ` Andrii Nakryiko [this message]
2023-06-08 21:52       ` Stanislav Fomichev
2023-06-08 22:13         ` Andrii Nakryiko
2023-06-08 23:06           ` Stanislav Fomichev
2023-06-08 23:54             ` Alexei Starovoitov
2023-06-09  0:08               ` Andrii Nakryiko
2023-06-09  0:38                 ` Stanislav Fomichev
2023-06-09  0:29             ` Toke Høiland-Jørgensen
2023-06-09  6:52               ` Daniel Borkmann
2023-06-09  7:15                 ` Daniel Borkmann
2023-06-09 11:04                 ` Toke Høiland-Jørgensen
2023-06-09 12:34                   ` Timo Beckers
2023-06-09 13:11                     ` Toke Høiland-Jørgensen
2023-06-09 14:15                       ` Daniel Borkmann
2023-06-09 16:41                         ` Stanislav Fomichev
2023-06-09 19:03                           ` Andrii Nakryiko
2023-06-10  2:52                             ` Daniel Xu
2023-06-09 18:58                         ` Andrii Nakryiko
2023-06-09 20:28                         ` Toke Høiland-Jørgensen
2023-06-12 11:21                         ` Dave Tucker
2023-06-12 12:43                           ` Daniel Borkmann
2023-06-09 18:56                       ` Andrii Nakryiko
2023-06-09 20:08                         ` Alexei Starovoitov
     [not found]                           ` <20230610022721.2950602-1-prankgup@fb.com>
2023-06-10  3:37                             ` Alexei Starovoitov
2023-06-09 20:20                         ` Toke Høiland-Jørgensen
2023-06-08 20:53   ` Andrii Nakryiko
2023-06-07 19:26 ` [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support Daniel Borkmann
2023-06-08  1:25   ` Jamal Hadi Salim
2023-06-08 10:11     ` Daniel Borkmann
2023-06-08 19:46       ` Jamal Hadi Salim
2023-06-08 21:24         ` Andrii Nakryiko
2023-07-04 21:36           ` Jamal Hadi Salim
2023-07-04 22:01             ` Daniel Borkmann
2023-07-04 22:38               ` Jamal Hadi Salim
2023-07-05  7:34                 ` Daniel Borkmann
2023-07-06 13:31                   ` Jamal Hadi Salim
2023-06-08 17:50   ` Stanislav Fomichev
2023-06-08 21:20   ` Andrii Nakryiko
2023-06-09  3:06   ` Jakub Kicinski
2023-06-07 19:26 ` [PATCH bpf-next v2 3/7] libbpf: Add opts-based attach/detach/query API for tcx Daniel Borkmann
2023-06-08 21:37   ` Andrii Nakryiko
2023-06-07 19:26 ` [PATCH bpf-next v2 4/7] libbpf: Add link-based " Daniel Borkmann
2023-06-08 21:45   ` Andrii Nakryiko
2023-06-07 19:26 ` [PATCH bpf-next v2 5/7] bpftool: Extend net dump with tcx progs Daniel Borkmann
2023-06-07 19:26 ` [PATCH bpf-next v2 6/7] selftests/bpf: Add mprog API tests for BPF tcx opts Daniel Borkmann
2023-06-07 19:26 ` [PATCH bpf-next v2 7/7] selftests/bpf: Add mprog API tests for BPF tcx links Daniel Borkmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAEf4BzbEf+U53UY6o+g5OZ6rg+T65_Aou4Nvrdbo-8sAjmdJmA@mail.gmail.com \
    --to=andrii.nakryiko@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dxu@dxuuu.xyz \
    --cc=joe@cilium.io \
    --cc=john.fastabend@gmail.com \
    --cc=kuba@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=razor@blackwall.org \
    --cc=sdf@google.com \
    --cc=toke@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.