From: Stanislav Fomichev <sdf@google.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: ast@kernel.org, andrii@kernel.org, martin.lau@linux.dev,
razor@blackwall.org, john.fastabend@gmail.com, kuba@kernel.org,
dxu@dxuuu.xyz, joe@cilium.io, toke@kernel.org,
davem@davemloft.net, bpf@vger.kernel.org,
netdev@vger.kernel.org
Subject: Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
Date: Thu, 8 Jun 2023 10:23:59 -0700
Message-ID: <ZIIOr1zvdRNTFKR7@google.com>
In-Reply-To: <20230607192625.22641-2-daniel@iogearbox.net>

On 06/07, Daniel Borkmann wrote:
> This adds a generic layer called bpf_mprog which can be reused by different
> attachment layers to enable multi-program attachment and dependency resolution.
> In-kernel users of the bpf_mprog don't need to care about the dependency
> resolution internals, they can just consume it with few API calls.
>
> The initial idea of having a generic API sparked out of discussion [0] from an
> earlier revision of this work where tc's priority was reused and exposed via
> BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> as-is for classic tc BPF. The feedback was that priority provides a bad user
> experience and is hard to use [1], e.g.:
>
> I cannot help but feel that priority logic copy-paste from old tc, netfilter
> and friends is done because "that's how things were done in the past". [...]
> Priority gets exposed everywhere in uapi all the way to bpftool when it's
> right there for users to understand. And that's the main problem with it.
>
> The user don't want to and don't need to be aware of it, but uapi forces them
> to pick the priority. [...] Your cover letter [0] example proves that in
> real life different service pick the same priority. They simply don't know
> any better. Priority is an unnecessary magic that apps _have_ to pick, so
> they just copy-paste and everyone ends up using the same.
>
> The course of the discussion showed more and more the need for a generic,
> reusable API where the "same look and feel" can be applied for various other
> program types beyond just tc BPF, for example XDP today does not have multi-
> program support in kernel, but also there was interest around this API for
> improving management of cgroup program types. Such common multi-program
> management concept is useful for BPF management daemons or user space BPF
> applications coordinating about their attachments.
>
> Both from Cilium and Meta side [2], we've collected the following requirements
> for a generic attach/detach/query API for multi-progs which has been implemented
> as part of this work:
>
> - Support prog-based attach/detach and link API
> - Dependency directives (can also be combined):
> - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> - BPF_F_ID flag as {fd,id} toggle
> - BPF_F_LINK flag as {prog,link} toggle
> - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> BPF_F_AFTER will just append for the case of attaching
> - Enforced only at attach time
> - BPF_F_{FIRST,LAST}
> - Enforced throughout the bpf_mprog state's lifetime
> - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
> - Internal revision counter and optionally being able to pass expected_revision
> - User space daemon can query current state with revision, and pass it along
> for attachment to assert current state before doing updates
> - Query also gets extension for link_ids array and link_attach_flags:
> - prog_ids are always filled with program IDs
> - link_ids are filled with link IDs when link was used, otherwise 0
> - {prog,link}_attach_flags for holding {prog,link}-specific flags
> - Must be easy to integrate/reuse for in-kernel users
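The expected_revision requirement above is essentially an optimistic
concurrency check. A minimal userspace sketch of how it might behave, assuming
the semantics visible later in bpf_mprog_attach() (0 means "do not check", a
mismatch yields -ESTALE); struct toy_state and toy_attach() are hypothetical
names, not part of the patch:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the per-hook multi-prog state. */
struct toy_state {
	unsigned int revision;	/* starts at 1, bumped on every update */
	int nprogs;
};

/* expected_revision == 0 means "do not check", as in the patch. */
static int toy_attach(struct toy_state *s, unsigned int expected_revision)
{
	if (expected_revision && expected_revision != s->revision)
		return -ESTALE;
	s->nprogs++;
	s->revision++;	/* wrap-around handling omitted in this sketch */
	return 0;
}
```

A daemon would query the revision first, then pass it along with the attach
request; a second attach reusing the same stale revision is rejected, while
passing 0 skips the check.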
>
> The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> consisting of the additions of the attachment flags, revision counter, and
> expanding existing union with relative_{fd,id} member.
>
> The bpf_mprog framework consists of a bpf_mprog_entry object which holds
> an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> structure). Both have been separated, so that fast-path gets efficient packing
> of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen
> instead of linked list or other structures to remove unnecessary indirections
> for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> is populated and then just swapped which avoids additional allocations that
> could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> currently static, but they could be converted to dynamic allocation if necessary
> at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
> for example, in case of tcx which uses this API in the next patch, it piggy-
> backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> add,del} implementation and an extensive test suite for checking all aspects
> of this API for prog-based attach/detach and link API as BPF selftests in
> this series.
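The a/b pairing described above can be illustrated with a small userspace
model. This is only a sketch under assumptions: plain ints stand in for
bpf_prog pointers, all toy_* names are hypothetical, and publishing the
returned entry (an RCU pointer swap done by the in-kernel user) is left to the
caller:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8

struct toy_bundle;

struct toy_entry {
	int progs[TOY_MAX];		/* 0 terminates the array */
	struct toy_bundle *parent;
};

struct toy_bundle {
	struct toy_entry a;
	struct toy_entry b;
};

static void toy_bundle_init(struct toy_bundle *bundle)
{
	memset(bundle, 0, sizeof(*bundle));
	bundle->a.parent = bundle;
	bundle->b.parent = bundle;
}

/* Return the other half of the pair, like bpf_mprog_peer(). */
static struct toy_entry *toy_peer(struct toy_entry *entry)
{
	return entry == &entry->parent->a ? &entry->parent->b
					  : &entry->parent->a;
}

static int toy_total(const struct toy_entry *entry)
{
	int n = 0;

	while (n < TOY_MAX && entry->progs[n])
		n++;
	return n;
}

/* Build the updated array in the peer and return it for publishing.
 * Nothing is allocated here, so the update itself cannot fail with -ENOMEM.
 */
static struct toy_entry *toy_prepend(struct toy_entry *entry, int prog)
{
	struct toy_entry *peer = toy_peer(entry);
	int n = toy_total(entry);

	assert(prog != 0);
	if (n >= TOY_MAX - 1)	/* keep one slot as terminator */
		return NULL;
	memset(peer->progs, 0, sizeof(peer->progs));
	peer->progs[0] = prog;
	memcpy(&peer->progs[1], &entry->progs[0], n * sizeof(entry->progs[0]));
	return peer;
}
```

Each update flips between the two halves: the currently active entry is never
modified, only its peer is rewritten and then handed back.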
>
> Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
>
> [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
> [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
> [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
> MAINTAINERS | 1 +
> include/linux/bpf_mprog.h | 245 +++++++++++++++++
> include/uapi/linux/bpf.h | 37 ++-
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/mprog.c | 476 +++++++++++++++++++++++++++++++++
> tools/include/uapi/linux/bpf.h | 37 ++-
> 6 files changed, 781 insertions(+), 17 deletions(-)
> create mode 100644 include/linux/bpf_mprog.h
> create mode 100644 kernel/bpf/mprog.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c904dba1733b..754a9eeca0a1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3733,6 +3733,7 @@ F: include/linux/filter.h
> F: include/linux/tnum.h
> F: kernel/bpf/core.c
> F: kernel/bpf/dispatcher.c
> +F: kernel/bpf/mprog.c
> F: kernel/bpf/syscall.c
> F: kernel/bpf/tnum.c
> F: kernel/bpf/trampoline.c
> diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h
> new file mode 100644
> index 000000000000..7399181d8e6c
> --- /dev/null
> +++ b/include/linux/bpf_mprog.h
> @@ -0,0 +1,245 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __BPF_MPROG_H
> +#define __BPF_MPROG_H
> +
> +#include <linux/bpf.h>
> +
> +#define BPF_MPROG_MAX 64
> +#define BPF_MPROG_SWAP 1
> +#define BPF_MPROG_FREE 2
> +
> +struct bpf_mprog_fp {
> + struct bpf_prog *prog;
> +};
> +
> +struct bpf_mprog_cp {
> + struct bpf_link *link;
> + u32 flags;
> +};
> +
> +struct bpf_mprog_entry {
> + struct bpf_mprog_fp fp_items[BPF_MPROG_MAX] ____cacheline_aligned;
> + struct bpf_mprog_cp cp_items[BPF_MPROG_MAX] ____cacheline_aligned;
> + struct bpf_mprog_bundle *parent;
> +};
> +
> +struct bpf_mprog_bundle {
> + struct bpf_mprog_entry a;
> + struct bpf_mprog_entry b;
> + struct rcu_head rcu;
> + struct bpf_prog *ref;
> + atomic_t revision;
> +};
> +
> +struct bpf_tuple {
> + struct bpf_prog *prog;
> + struct bpf_link *link;
> +};
> +
> +static inline struct bpf_mprog_entry *
> +bpf_mprog_peer(const struct bpf_mprog_entry *entry)
> +{
> + if (entry == &entry->parent->a)
> + return &entry->parent->b;
> + else
> + return &entry->parent->a;
> +}
> +
> +#define bpf_mprog_foreach_tuple(entry, fp, cp, t) \
> + for (fp = &entry->fp_items[0], cp = &entry->cp_items[0]; \
> + ({ \
> + t.prog = READ_ONCE(fp->prog); \
> + t.link = cp->link; \
> + t.prog; \
> + }); \
> + fp++, cp++)
> +
> +#define bpf_mprog_foreach_prog(entry, fp, p) \
> + for (fp = &entry->fp_items[0]; \
> + (p = READ_ONCE(fp->prog)); \
> + fp++)
> +
> +static inline struct bpf_mprog_entry *bpf_mprog_create(size_t extra_size)
> +{
> + struct bpf_mprog_bundle *bundle;
> +
> + /* Fast-path items are not extensible, must only contain prog pointer! */
> + BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
> + /* Control-path items can be extended w/o affecting fast-path. */
> + BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) != ARRAY_SIZE(bundle->a.cp_items));
> +
> + bundle = kzalloc(sizeof(*bundle) + extra_size, GFP_KERNEL);
> + if (bundle) {
> + atomic_set(&bundle->revision, 1);
> + bundle->a.parent = bundle;
> + bundle->b.parent = bundle;
> + return &bundle->a;
> + }
> + return NULL;
> +}
> +
> +static inline void bpf_mprog_free(struct bpf_mprog_entry *entry)
> +{
> + kfree_rcu(entry->parent, rcu);
> +}
> +
> +static inline void bpf_mprog_mark_ref(struct bpf_mprog_entry *entry,
> + struct bpf_prog *prog)
> +{
> + WARN_ON_ONCE(entry->parent->ref);
> + entry->parent->ref = prog;
> +}
> +
> +static inline u32 bpf_mprog_flags(u32 cur_flags, u32 req_flags, u32 flag)
> +{
> + if (req_flags & flag)
> + cur_flags |= flag;
> + else
> + cur_flags &= ~flag;
> + return cur_flags;
> +}
> +
> +static inline u32 bpf_mprog_max(void)
> +{
> + return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_first(struct bpf_mprog_entry *entry)
> +{
> + return READ_ONCE(entry->fp_items[0].prog);
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_last(struct bpf_mprog_entry *entry)
> +{
> + struct bpf_prog *tmp, *prog = NULL;
> + struct bpf_mprog_fp *fp;
> +
> + bpf_mprog_foreach_prog(entry, fp, tmp)
> + prog = tmp;
> + return prog;
> +}
> +
> +static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
> + struct bpf_prog *prog)
> +{
> + const struct bpf_mprog_fp *fp;
> + const struct bpf_prog *tmp;
> +
> + bpf_mprog_foreach_prog(entry, fp, tmp) {
> + if (tmp == prog)
> + return true;
> + }
> + return false;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_first_reg(struct bpf_mprog_entry *entry)
> +{
> + struct bpf_tuple tuple = {};
> + struct bpf_mprog_fp *fp;
> + struct bpf_mprog_cp *cp;
> +
> + bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> + if (cp->flags & BPF_F_FIRST)
> + continue;
> + return tuple.prog;
> + }
> + return NULL;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_last_reg(struct bpf_mprog_entry *entry)
> +{
> + struct bpf_tuple tuple = {};
> + struct bpf_prog *prog = NULL;
> + struct bpf_mprog_fp *fp;
> + struct bpf_mprog_cp *cp;
> +
> + bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> + if (cp->flags & BPF_F_LAST)
> + break;
> + prog = tuple.prog;
> + }
> + return prog;
> +}
> +
> +static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
> +{
[..]
> + do {
> + atomic_inc(&entry->parent->revision);
> + } while (atomic_read(&entry->parent->revision) == 0);
Can you explain more what's going on here? Maybe with a comment?
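One plausible reading, to be confirmed: the loop skips revision 0 on
wrap-around, since bpf_mprog_attach()/bpf_mprog_detach() treat
expected_revision == 0 as "no check". A minimal userspace sketch of that
reading, with C11 atomics standing in for the kernel's atomic_t
(bump_revision() is a hypothetical name, not part of the patch):

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical stand-in for bumping bpf_mprog_bundle::revision. */
static unsigned int bump_revision(atomic_uint *revision)
{
	unsigned int now;

	/* Increment; if the counter wrapped to 0, increment once more so
	 * that 0 never becomes a live revision.
	 */
	do {
		now = atomic_fetch_add(revision, 1) + 1;
	} while (now == 0);
	assert(now != 0);
	return now;
}
```

With this, a counter sitting at UINT_MAX moves to 1 rather than 0 on the next
update.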
> + synchronize_rcu();
> + if (entry->parent->ref) {
> + bpf_prog_put(entry->parent->ref);
> + entry->parent->ref = NULL;
> + }
I'm assuming this is to guard the detach path? But isn't bpf_prog_put
already doing the deferred dealloc? So calling it without synchronize_rcu
here should be ok?
> +}
> +
> +static inline void bpf_mprog_entry_clear(struct bpf_mprog_entry *entry)
> +{
> + memset(entry->fp_items, 0, sizeof(entry->fp_items));
> + memset(entry->cp_items, 0, sizeof(entry->cp_items));
> +}
> +
> +static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry)
> +{
> + return atomic_read(&entry->parent->revision);
> +}
> +
> +static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 which,
> + struct bpf_mprog_fp **fp_dst,
> + struct bpf_mprog_cp **cp_dst)
> +{
> + *fp_dst = &entry->fp_items[which];
> + *cp_dst = &entry->cp_items[which];
> +}
> +
> +static inline void bpf_mprog_write(struct bpf_mprog_fp *fp_dst,
> + struct bpf_mprog_cp *cp_dst,
> + struct bpf_tuple *tuple, u32 flags)
> +{
> + WRITE_ONCE(fp_dst->prog, tuple->prog);
> + cp_dst->link = tuple->link;
> + cp_dst->flags = flags;
> +}
> +
> +static inline void bpf_mprog_copy(struct bpf_mprog_fp *fp_dst,
> + struct bpf_mprog_cp *cp_dst,
> + struct bpf_mprog_fp *fp_src,
> + struct bpf_mprog_cp *cp_src)
> +{
> + WRITE_ONCE(fp_dst->prog, READ_ONCE(fp_src->prog));
> + memcpy(cp_dst, cp_src, sizeof(*cp_src));
nit: why not simply *cp_dst = *cp_src? memcpy somewhat implies (in my
mind) that we are copying several entries..
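To spell out the suggestion, a small standalone sketch; struct cp_item is a
hypothetical mirror of struct bpf_mprog_cp and both helper names are made up
for illustration:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of struct bpf_mprog_cp from the patch. */
struct cp_item {
	void *link;
	unsigned int flags;
};

/* Copy via memcpy, as the patch does. */
static void cp_copy_memcpy(struct cp_item *dst, const struct cp_item *src)
{
	memcpy(dst, src, sizeof(*src));
}

/* Copy via plain struct assignment, as suggested. */
static void cp_copy_assign(struct cp_item *dst, const struct cp_item *src)
{
	*dst = *src;
}
```

Both leave every member of dst equal to src; the assignment form also lets the
compiler type-check the two operands.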
> +}
> +
> +static inline void bpf_mprog_copy_range(struct bpf_mprog_entry *peer,
> + struct bpf_mprog_entry *entry,
> + u32 idx_peer, u32 idx_entry, u32 num)
> +{
> + memcpy(&peer->fp_items[idx_peer], &entry->fp_items[idx_entry],
> + num * sizeof(peer->fp_items[0]));
> + memcpy(&peer->cp_items[idx_peer], &entry->cp_items[idx_entry],
> + num * sizeof(peer->cp_items[0]));
> +}
> +
> +static inline u32 bpf_mprog_total(struct bpf_mprog_entry *entry)
> +{
> + const struct bpf_mprog_fp *fp;
> + const struct bpf_prog *tmp;
> + u32 num = 0;
> +
> + bpf_mprog_foreach_prog(entry, fp, tmp)
> + num++;
> + return num;
> +}
> +
> +int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> + struct bpf_link *link, u32 flags, u32 object,
> + u32 expected_revision);
> +int bpf_mprog_detach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> + struct bpf_link *link, u32 flags, u32 object,
> + u32 expected_revision);
> +
> +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> + struct bpf_mprog_entry *entry);
> +
> +#endif /* __BPF_MPROG_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a7b5e91dd768..207f8a37b327 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> */
> #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> #define BPF_F_ALLOW_MULTI (1U << 1)
> +/* Generic attachment flags. */
> #define BPF_F_REPLACE (1U << 2)
> +#define BPF_F_BEFORE (1U << 3)
> +#define BPF_F_AFTER (1U << 4)
> +#define BPF_F_FIRST (1U << 5)
> +#define BPF_F_LAST (1U << 6)
> +#define BPF_F_ID (1U << 7)
> +#define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
>
> /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> * verifier will perform strict alignment checking as if the kernel
> @@ -1433,14 +1440,19 @@ union bpf_attr {
> };
>
> struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> - __u32 target_fd; /* container object to attach to */
> - __u32 attach_bpf_fd; /* eBPF program to attach */
> + union {
> + __u32 target_fd; /* target object to attach to or ... */
> + __u32 target_ifindex; /* target ifindex */
> + };
> + __u32 attach_bpf_fd;
> __u32 attach_type;
> __u32 attach_flags;
> - __u32 replace_bpf_fd; /* previously attached eBPF
> - * program to replace if
> - * BPF_F_REPLACE is used
> - */
> + union {
> + __u32 relative_fd;
> + __u32 relative_id;
> + __u32 replace_bpf_fd;
> + };
> + __u32 expected_revision;
> };
>
> struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1486,16 +1498,25 @@ union bpf_attr {
> } info;
>
> struct { /* anonymous struct used by BPF_PROG_QUERY command */
> - __u32 target_fd; /* container object to query */
> + union {
> + __u32 target_fd; /* target object to query or ... */
> + __u32 target_ifindex; /* target ifindex */
> + };
> __u32 attach_type;
> __u32 query_flags;
> __u32 attach_flags;
> __aligned_u64 prog_ids;
> - __u32 prog_cnt;
> + union {
> + __u32 prog_cnt;
> + __u32 count;
> + };
> + __u32 revision;
> /* output: per-program attach_flags.
> * not allowed to be set during effective query.
> */
> __aligned_u64 prog_attach_flags;
> + __aligned_u64 link_ids;
> + __aligned_u64 link_attach_flags;
> } query;
>
> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 1d3892168d32..1bea2eb912cd 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
> obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
> obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
> -obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> +obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
> obj-$(CONFIG_BPF_JIT) += trampoline.o
> obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
> obj-$(CONFIG_BPF_JIT) += dispatcher.o
> diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c
> new file mode 100644
> index 000000000000..efc3b73f8bf5
> --- /dev/null
> +++ b/kernel/bpf/mprog.c
> @@ -0,0 +1,476 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +#include <linux/filter.h>
> +
> +static int bpf_mprog_tuple_relative(struct bpf_tuple *tuple,
> + u32 object, u32 flags,
> + enum bpf_prog_type type)
> +{
> + struct bpf_prog *prog;
> + struct bpf_link *link;
> +
> + memset(tuple, 0, sizeof(*tuple));
> + if (!(flags & (BPF_F_REPLACE | BPF_F_BEFORE | BPF_F_AFTER)))
> + return object || (flags & (BPF_F_ID | BPF_F_LINK)) ?
> + -EINVAL : 0;
> + if (flags & BPF_F_LINK) {
> + if (flags & BPF_F_ID)
> + link = bpf_link_by_id(object);
> + else
> + link = bpf_link_get_from_fd(object);
> + if (IS_ERR(link))
> + return PTR_ERR(link);
> + if (type && link->prog->type != type) {
> + bpf_link_put(link);
> + return -EINVAL;
> + }
> + tuple->link = link;
> + tuple->prog = link->prog;
> + } else {
> + if (flags & BPF_F_ID)
> + prog = bpf_prog_by_id(object);
> + else
> + prog = bpf_prog_get(object);
> + if (IS_ERR(prog)) {
> + if (!object &&
> + !(flags & BPF_F_ID))
> + return 0;
> + return PTR_ERR(prog);
> + }
> + if (type && prog->type != type) {
> + bpf_prog_put(prog);
> + return -EINVAL;
> + }
> + tuple->link = NULL;
> + tuple->prog = prog;
> + }
> + return 0;
> +}
> +
> +static void bpf_mprog_tuple_put(struct bpf_tuple *tuple)
> +{
> + if (tuple->link)
> + bpf_link_put(tuple->link);
> + else if (tuple->prog)
> + bpf_prog_put(tuple->prog);
> +}
> +
> +static int bpf_mprog_replace(struct bpf_mprog_entry *entry,
> + struct bpf_tuple *ntuple,
> + struct bpf_tuple *rtuple, u32 rflags)
> +{
> + struct bpf_mprog_fp *fp;
> + struct bpf_mprog_cp *cp;
> + struct bpf_prog *oprog;
> + u32 iflags;
> + int i;
> +
> + if (rflags & (BPF_F_BEFORE | BPF_F_AFTER | BPF_F_LINK))
> + return -EINVAL;
> + if (rtuple->prog != ntuple->prog &&
> + bpf_mprog_exists(entry, ntuple->prog))
> + return -EEXIST;
> + for (i = 0; i < bpf_mprog_max(); i++) {
> + bpf_mprog_read(entry, i, &fp, &cp);
> + oprog = READ_ONCE(fp->prog);
> + if (!oprog)
> + break;
> + if (oprog != rtuple->prog)
> + continue;
> + if (cp->link != ntuple->link)
> + return -EBUSY;
> + iflags = cp->flags;
> + if ((iflags & BPF_F_FIRST) !=
> + (rflags & BPF_F_FIRST)) {
> + iflags = bpf_mprog_flags(iflags, rflags,
> + BPF_F_FIRST);
> + if ((iflags & BPF_F_FIRST) &&
> + rtuple->prog != bpf_mprog_first(entry))
> + return -EACCES;
> + }
> + if ((iflags & BPF_F_LAST) !=
> + (rflags & BPF_F_LAST)) {
> + iflags = bpf_mprog_flags(iflags, rflags,
> + BPF_F_LAST);
> + if ((iflags & BPF_F_LAST) &&
> + rtuple->prog != bpf_mprog_last(entry))
> + return -EACCES;
> + }
> + bpf_mprog_write(fp, cp, ntuple, iflags);
> + if (!ntuple->link)
> + bpf_prog_put(oprog);
> + return 0;
> + }
> + return -ENOENT;
> +}
> +
> +static int bpf_mprog_head_tail(struct bpf_mprog_entry *entry,
> + struct bpf_tuple *ntuple,
> + struct bpf_tuple *rtuple, u32 aflags)
> +{
> + struct bpf_mprog_entry *peer;
> + struct bpf_mprog_fp *fp;
> + struct bpf_mprog_cp *cp;
> + struct bpf_prog *oprog;
> + u32 iflags, items;
> +
> + if (bpf_mprog_exists(entry, ntuple->prog))
> + return -EEXIST;
> + items = bpf_mprog_total(entry);
> + peer = bpf_mprog_peer(entry);
> + bpf_mprog_entry_clear(peer);
> + if (aflags & BPF_F_FIRST) {
> + if (aflags & BPF_F_AFTER)
> + return -EINVAL;
> + bpf_mprog_read(entry, 0, &fp, &cp);
> + iflags = cp->flags;
> + if (iflags & BPF_F_FIRST)
> + return -EBUSY;
> + if (aflags & BPF_F_LAST) {
> + if (aflags & BPF_F_BEFORE)
> + return -EINVAL;
> + if (items)
> + return -EBUSY;
> + bpf_mprog_read(peer, 0, &fp, &cp);
> + bpf_mprog_write(fp, cp, ntuple,
> + BPF_F_FIRST | BPF_F_LAST);
> + return BPF_MPROG_SWAP;
> + }
> + if (aflags & BPF_F_BEFORE) {
> + oprog = READ_ONCE(fp->prog);
> + if (oprog != rtuple->prog ||
> + (rtuple->link &&
> + rtuple->link != cp->link))
> + return -EBUSY;
> + }
> + if (items >= bpf_mprog_max())
> + return -ENOSPC;
> + bpf_mprog_read(peer, 0, &fp, &cp);
> + bpf_mprog_write(fp, cp, ntuple, BPF_F_FIRST);
> + bpf_mprog_copy_range(peer, entry, 1, 0, items);
> + return BPF_MPROG_SWAP;
> + }
> + if (aflags & BPF_F_LAST) {
> + if (aflags & BPF_F_BEFORE)
> + return -EINVAL;
> + if (items) {
> + bpf_mprog_read(entry, items - 1, &fp, &cp);
> + iflags = cp->flags;
> + if (iflags & BPF_F_LAST)
> + return -EBUSY;
> + if (aflags & BPF_F_AFTER) {
> + oprog = READ_ONCE(fp->prog);
> + if (oprog != rtuple->prog ||
> + (rtuple->link &&
> + rtuple->link != cp->link))
> + return -EBUSY;
> + }
> + if (items >= bpf_mprog_max())
> + return -ENOSPC;
> + } else {
> + if (aflags & BPF_F_AFTER)
> + return -EBUSY;
> + }
> + bpf_mprog_read(peer, items, &fp, &cp);
> + bpf_mprog_write(fp, cp, ntuple, BPF_F_LAST);
> + bpf_mprog_copy_range(peer, entry, 0, 0, items);
> + return BPF_MPROG_SWAP;
> + }
> + return -ENOENT;
> +}
> +
> +static int bpf_mprog_add(struct bpf_mprog_entry *entry,
> + struct bpf_tuple *ntuple,
> + struct bpf_tuple *rtuple, u32 aflags)
> +{
> + struct bpf_mprog_fp *fp_dst, *fp_src;
> + struct bpf_mprog_cp *cp_dst, *cp_src;
> + struct bpf_mprog_entry *peer;
> + struct bpf_prog *oprog;
> + bool found = false;
> + u32 items;
> + int i, j;
> +
> + items = bpf_mprog_total(entry);
> + if (items >= bpf_mprog_max())
> + return -ENOSPC;
> + if ((aflags & (BPF_F_BEFORE | BPF_F_AFTER)) ==
> + (BPF_F_BEFORE | BPF_F_AFTER))
> + return -EINVAL;
> + if (bpf_mprog_exists(entry, ntuple->prog))
> + return -EEXIST;
> + if (!rtuple->prog && (aflags & (BPF_F_BEFORE | BPF_F_AFTER))) {
> + if (!items)
> + aflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
> + if (aflags & BPF_F_BEFORE)
> + rtuple->prog = bpf_mprog_first_reg(entry);
> + if (aflags & BPF_F_AFTER)
> + rtuple->prog = bpf_mprog_last_reg(entry);
> + if (!rtuple->prog)
> + aflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
> + else
> + bpf_prog_inc(rtuple->prog);
> + }
> + peer = bpf_mprog_peer(entry);
> + bpf_mprog_entry_clear(peer);
> + for (i = 0, j = 0; i < bpf_mprog_max(); i++, j++) {
> + bpf_mprog_read(entry, i, &fp_src, &cp_src);
> + bpf_mprog_read(peer, j, &fp_dst, &cp_dst);
> + oprog = READ_ONCE(fp_src->prog);
> + if (!oprog) {
> + if (i != j)
> + break;
> + if (i > 0) {
> + bpf_mprog_read(entry, i - 1,
> + &fp_src, &cp_src);
> + if (cp_src->flags & BPF_F_LAST) {
> + if (cp_src->flags & BPF_F_FIRST)
> + return -EBUSY;
> + bpf_mprog_copy(fp_dst, cp_dst,
> + fp_src, cp_src);
> + bpf_mprog_read(peer, --j,
> + &fp_dst, &cp_dst);
> + }
> + }
> + bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
> + break;
> + }
> + if (aflags & (BPF_F_BEFORE | BPF_F_AFTER)) {
> + if (rtuple->prog != oprog ||
> + (rtuple->link &&
> + rtuple->link != cp_src->link))
> + goto next;
> + found = true;
> + if (aflags & BPF_F_BEFORE) {
> + if (cp_src->flags & BPF_F_FIRST)
> + return -EBUSY;
> + bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
> + bpf_mprog_read(peer, ++j, &fp_dst, &cp_dst);
> + goto next;
> + }
> + if (aflags & BPF_F_AFTER) {
> + if (cp_src->flags & BPF_F_LAST)
> + return -EBUSY;
> + bpf_mprog_copy(fp_dst, cp_dst,
> + fp_src, cp_src);
> + bpf_mprog_read(peer, ++j, &fp_dst, &cp_dst);
> + bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
> + continue;
> + }
> + }
> +next:
> + bpf_mprog_copy(fp_dst, cp_dst,
> + fp_src, cp_src);
> + }
> + if (rtuple->prog && !found)
> + return -ENOENT;
> + return BPF_MPROG_SWAP;
> +}
> +
> +static int bpf_mprog_del(struct bpf_mprog_entry *entry,
> + struct bpf_tuple *dtuple,
> + struct bpf_tuple *rtuple, u32 dflags)
> +{
> + struct bpf_mprog_fp *fp_dst, *fp_src;
> + struct bpf_mprog_cp *cp_dst, *cp_src;
> + struct bpf_mprog_entry *peer;
> + struct bpf_prog *oprog;
> + bool found = false;
> + int i, j, ret;
> +
> + if (dflags & BPF_F_REPLACE)
> + return -EINVAL;
> + if (dflags & BPF_F_FIRST) {
> + oprog = bpf_mprog_first(entry);
> + if (dtuple->prog &&
> + dtuple->prog != oprog)
> + return -ENOENT;
> + dtuple->prog = oprog;
> + }
> + if (dflags & BPF_F_LAST) {
> + oprog = bpf_mprog_last(entry);
> + if (dtuple->prog &&
> + dtuple->prog != oprog)
> + return -ENOENT;
> + dtuple->prog = oprog;
> + }
> + if (!rtuple->prog && (dflags & (BPF_F_BEFORE | BPF_F_AFTER))) {
> + if (dtuple->prog)
> + return -EINVAL;
> + if (dflags & BPF_F_BEFORE)
> + dtuple->prog = bpf_mprog_first_reg(entry);
> + if (dflags & BPF_F_AFTER)
> + dtuple->prog = bpf_mprog_last_reg(entry);
> + if (dtuple->prog)
> + dflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
> + }
> + for (i = 0; i < bpf_mprog_max(); i++) {
> + bpf_mprog_read(entry, i, &fp_src, &cp_src);
> + oprog = READ_ONCE(fp_src->prog);
> + if (!oprog)
> + break;
> + if (dflags & (BPF_F_BEFORE | BPF_F_AFTER)) {
> + if (rtuple->prog != oprog ||
> + (rtuple->link &&
> + rtuple->link != cp_src->link))
> + continue;
> + found = true;
> + if (dflags & BPF_F_BEFORE) {
> + if (!i)
> + return -ENOENT;
> + bpf_mprog_read(entry, i - 1,
> + &fp_src, &cp_src);
> + oprog = READ_ONCE(fp_src->prog);
> + if (dtuple->prog &&
> + dtuple->prog != oprog)
> + return -ENOENT;
> + dtuple->prog = oprog;
> + break;
> + }
> + if (dflags & BPF_F_AFTER) {
> + bpf_mprog_read(entry, i + 1,
> + &fp_src, &cp_src);
> + oprog = READ_ONCE(fp_src->prog);
> + if (dtuple->prog &&
> + dtuple->prog != oprog)
> + return -ENOENT;
> + dtuple->prog = oprog;
> + break;
> + }
> + }
> + }
> + if (!dtuple->prog || (rtuple->prog && !found))
> + return -ENOENT;
> + peer = bpf_mprog_peer(entry);
> + bpf_mprog_entry_clear(peer);
> + ret = -ENOENT;
> + for (i = 0, j = 0; i < bpf_mprog_max(); i++) {
> + bpf_mprog_read(entry, i, &fp_src, &cp_src);
> + bpf_mprog_read(peer, j, &fp_dst, &cp_dst);
> + oprog = READ_ONCE(fp_src->prog);
> + if (!oprog)
> + break;
> + if (oprog != dtuple->prog) {
> + bpf_mprog_copy(fp_dst, cp_dst,
> + fp_src, cp_src);
> + j++;
> + } else {
> + if (cp_src->link != dtuple->link)
> + return -EBUSY;
> + if (!cp_src->link)
> + bpf_mprog_mark_ref(entry, dtuple->prog);
> + ret = BPF_MPROG_SWAP;
> + }
> + }
> + if (!bpf_mprog_total(peer))
> + ret = BPF_MPROG_FREE;
> + return ret;
> +}
> +
> +int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> + struct bpf_link *link, u32 flags, u32 object,
> + u32 expected_revision)
> +{
> + struct bpf_tuple rtuple, ntuple = {
> + .prog = prog,
> + .link = link,
> + };
> + int ret;
> +
> + if (expected_revision &&
> + expected_revision != bpf_mprog_revision(entry))
> + return -ESTALE;
> + ret = bpf_mprog_tuple_relative(&rtuple, object, flags, prog->type);
> + if (ret)
> + return ret;
> + if (flags & BPF_F_REPLACE)
> + ret = bpf_mprog_replace(entry, &ntuple, &rtuple, flags);
> + else if (flags & (BPF_F_FIRST | BPF_F_LAST))
> + ret = bpf_mprog_head_tail(entry, &ntuple, &rtuple, flags);
> + else
> + ret = bpf_mprog_add(entry, &ntuple, &rtuple, flags);
> + bpf_mprog_tuple_put(&rtuple);
> + return ret;
> +}
> +
> +int bpf_mprog_detach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> + struct bpf_link *link, u32 flags, u32 object,
> + u32 expected_revision)
> +{
> + struct bpf_tuple rtuple, dtuple = {
> + .prog = prog,
> + .link = link,
> + };
> + int ret;
> +
> + if (expected_revision &&
> + expected_revision != bpf_mprog_revision(entry))
> + return -ESTALE;
> + ret = bpf_mprog_tuple_relative(&rtuple, object, flags,
> + prog ? prog->type :
> + BPF_PROG_TYPE_UNSPEC);
> + if (ret)
> + return ret;
> + ret = bpf_mprog_del(entry, &dtuple, &rtuple, flags);
> + bpf_mprog_tuple_put(&rtuple);
> + return ret;
> +}
> +
> +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> + struct bpf_mprog_entry *entry)
> +{
> + u32 i, id, flags = 0, count, revision;
> + u32 __user *uprog_id, *uprog_af;
> + u32 __user *ulink_id, *ulink_af;
> + struct bpf_mprog_fp *fp;
> + struct bpf_mprog_cp *cp;
> + struct bpf_prog *prog;
> + int ret = 0;
> +
> + if (attr->query.query_flags || attr->query.attach_flags)
> + return -EINVAL;
> + revision = bpf_mprog_revision(entry);
> + count = bpf_mprog_total(entry);
> + if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
> + return -EFAULT;
> + if (copy_to_user(&uattr->query.revision, &revision, sizeof(revision)))
> + return -EFAULT;
> + if (copy_to_user(&uattr->query.count, &count, sizeof(count)))
> + return -EFAULT;
> + uprog_id = u64_to_user_ptr(attr->query.prog_ids);
> + if (attr->query.count == 0 || !uprog_id || !count)
> + return 0;
> + if (attr->query.count < count) {
> + count = attr->query.count;
> + ret = -ENOSPC;
> + }
> + uprog_af = u64_to_user_ptr(attr->query.prog_attach_flags);
> + ulink_id = u64_to_user_ptr(attr->query.link_ids);
> + ulink_af = u64_to_user_ptr(attr->query.link_attach_flags);
> + for (i = 0; i < ARRAY_SIZE(entry->fp_items); i++) {
> + bpf_mprog_read(entry, i, &fp, &cp);
> + prog = READ_ONCE(fp->prog);
> + if (!prog)
> + break;
> + id = prog->aux->id;
> + if (copy_to_user(uprog_id + i, &id, sizeof(id)))
> + return -EFAULT;
> + id = cp->link ? cp->link->id : 0;
> + if (ulink_id &&
> + copy_to_user(ulink_id + i, &id, sizeof(id)))
> + return -EFAULT;
> + flags = cp->flags;
> + if (uprog_af && !id &&
> + copy_to_user(uprog_af + i, &flags, sizeof(flags)))
> + return -EFAULT;
> + if (ulink_af && id &&
> + copy_to_user(ulink_af + i, &flags, sizeof(flags)))
> + return -EFAULT;
> + if (i + 1 == count)
> + break;
> + }
> + return ret;
> +}
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index a7b5e91dd768..207f8a37b327 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> */
> #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> #define BPF_F_ALLOW_MULTI (1U << 1)
> +/* Generic attachment flags. */
> #define BPF_F_REPLACE (1U << 2)
> +#define BPF_F_BEFORE (1U << 3)
> +#define BPF_F_AFTER (1U << 4)
[..]
> +#define BPF_F_FIRST (1U << 5)
> +#define BPF_F_LAST (1U << 6)
I'm still not sure whether the hard semantics of first/last are really
useful. My worry is that some prog will just use BPF_F_FIRST, which
would lock out the rest of the users. (Starting with only
F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
need first/last.)
But if everyone besides myself is on board with first/last, maybe at least
put a comment here saying that only a single program can be first/last?
And that users are advised not to use these unless they really really really
need to be first/last. (IOW, feels like first/last should be reserved
for observability tools/etc.)
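To make the concern concrete, here is a toy model of the two -EBUSY cases as
read from bpf_mprog_head_tail() and bpf_mprog_add(). It is not the patch's
code: ints stand in for programs and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>

#define TOY_F_FIRST (1U << 5)	/* same bit as BPF_F_FIRST in the patch */
#define TOY_MAX 8

struct toy_chain {
	int ids[TOY_MAX];
	unsigned int flags[TOY_MAX];
	int n;
};

/* Insert id at position pos, shifting later items up by one. */
static int toy_insert_at(struct toy_chain *c, int pos, int id,
			 unsigned int flags)
{
	int i;

	assert(pos >= 0 && pos <= c->n);
	if (c->n >= TOY_MAX)
		return -ENOSPC;
	for (i = c->n; i > pos; i--) {
		c->ids[i] = c->ids[i - 1];
		c->flags[i] = c->flags[i - 1];
	}
	c->ids[pos] = id;
	c->flags[pos] = flags;
	c->n++;
	return 0;
}

/* Claim the head slot; fails once any program already holds it. */
static int toy_attach_first(struct toy_chain *c, int id)
{
	if (c->n && (c->flags[0] & TOY_F_FIRST))
		return -EBUSY;
	return toy_insert_at(c, 0, id, TOY_F_FIRST);
}

/* Attach id directly before rel_id; a pinned head cannot be preceded. */
static int toy_attach_before(struct toy_chain *c, int id, int rel_id)
{
	int i;

	for (i = 0; i < c->n; i++) {
		if (c->ids[i] != rel_id)
			continue;
		if (c->flags[i] & TOY_F_FIRST)
			return -EBUSY;
		return toy_insert_at(c, i, id, 0);
	}
	return -ENOENT;
}
```

Once one program has taken the head slot, a second "first" request and any
"before the head" request both fail until the owner goes away.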
> +#define BPF_F_ID (1U << 7)
> +#define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
>
> /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> * verifier will perform strict alignment checking as if the kernel
> @@ -1433,14 +1440,19 @@ union bpf_attr {
> };
>
> struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> - __u32 target_fd; /* container object to attach to */
> - __u32 attach_bpf_fd; /* eBPF program to attach */
> + union {
> + __u32 target_fd; /* target object to attach to or ... */
> + __u32 target_ifindex; /* target ifindex */
> + };
> + __u32 attach_bpf_fd;
> __u32 attach_type;
> __u32 attach_flags;
> - __u32 replace_bpf_fd; /* previously attached eBPF
> - * program to replace if
> - * BPF_F_REPLACE is used
> - */
> + union {
> + __u32 relative_fd;
> + __u32 relative_id;
> + __u32 replace_bpf_fd;
> + };
> + __u32 expected_revision;
> };
>
> struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1486,16 +1498,25 @@ union bpf_attr {
> } info;
>
> struct { /* anonymous struct used by BPF_PROG_QUERY command */
> - __u32 target_fd; /* container object to query */
> + union {
> + __u32 target_fd; /* target object to query or ... */
> + __u32 target_ifindex; /* target ifindex */
> + };
> __u32 attach_type;
> __u32 query_flags;
> __u32 attach_flags;
> __aligned_u64 prog_ids;
> - __u32 prog_cnt;
> + union {
> + __u32 prog_cnt;
> + __u32 count;
> + };
> + __u32 revision;
> /* output: per-program attach_flags.
> * not allowed to be set during effective query.
> */
> __aligned_u64 prog_attach_flags;
> + __aligned_u64 link_ids;
> + __aligned_u64 link_attach_flags;
> } query;
>
> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> --
> 2.34.1
>