* [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs
@ 2023-06-07 19:26 Daniel Borkmann
  2023-06-07 19:26 ` [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs Daniel Borkmann
                   ` (6 more replies)
  0 siblings, 7 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-07 19:26 UTC (permalink / raw)
  To: ast
  Cc: andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev, Daniel Borkmann

This series adds BPF link support for tc BPF programs. We initially
presented the motivation, related work and design at last year's LPC
conference in the networking & BPF track [0], and gave a recent update
on the progress of the rework at this year's LSF/MM/BPF summit [1].
The main changes are in the first two patches, and the last two carry
an extensive batch of test cases we developed along with them; please
see the individual patches for details. We tested this series with the
tc-testing selftest suite as well as BPF CI/selftests. Thanks!

  [0] https://lpc.events/event/16/contributions/1353/
  [1] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf

Daniel Borkmann (7):
  bpf: Add generic attach/detach/query API for multi-progs
  bpf: Add fd-based tcx multi-prog infra with link support
  libbpf: Add opts-based attach/detach/query API for tcx
  libbpf: Add link-based API for tcx
  bpftool: Extend net dump with tcx progs
  selftests/bpf: Add mprog API tests for BPF tcx opts
  selftests/bpf: Add mprog API tests for BPF tcx links

 MAINTAINERS                                   |    5 +-
 include/linux/bpf_mprog.h                     |  245 ++
 include/linux/netdevice.h                     |   15 +-
 include/linux/skbuff.h                        |    4 +-
 include/net/sch_generic.h                     |    2 +-
 include/net/tcx.h                             |  157 +
 include/uapi/linux/bpf.h                      |   72 +-
 kernel/bpf/Kconfig                            |    1 +
 kernel/bpf/Makefile                           |    3 +-
 kernel/bpf/mprog.c                            |  476 +++
 kernel/bpf/syscall.c                          |   95 +-
 kernel/bpf/tcx.c                              |  347 +++
 net/Kconfig                                   |    5 +
 net/core/dev.c                                |  267 +-
 net/core/filter.c                             |    4 +-
 net/sched/Kconfig                             |    4 +-
 net/sched/sch_ingress.c                       |   45 +-
 tools/bpf/bpftool/net.c                       |   92 +-
 tools/include/uapi/linux/bpf.h                |   72 +-
 tools/lib/bpf/bpf.c                           |   83 +-
 tools/lib/bpf/bpf.h                           |   61 +-
 tools/lib/bpf/libbpf.c                        |   50 +-
 tools/lib/bpf/libbpf.h                        |   17 +
 tools/lib/bpf/libbpf.map                      |    2 +
 .../selftests/bpf/prog_tests/tc_helpers.h     |   72 +
 .../selftests/bpf/prog_tests/tc_links.c       | 2279 ++++++++++++++
 .../selftests/bpf/prog_tests/tc_opts.c        | 2698 +++++++++++++++++
 .../selftests/bpf/progs/test_tc_link.c        |   40 +
 28 files changed, 6995 insertions(+), 218 deletions(-)
 create mode 100644 include/linux/bpf_mprog.h
 create mode 100644 include/net/tcx.h
 create mode 100644 kernel/bpf/mprog.c
 create mode 100644 kernel/bpf/tcx.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_links.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_opts.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_link.c

-- 
2.34.1



* [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
@ 2023-06-07 19:26 ` Daniel Borkmann
  2023-06-08 17:23   ` Stanislav Fomichev
  2023-06-08 20:53   ` Andrii Nakryiko
  2023-06-07 19:26 ` [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support Daniel Borkmann
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-07 19:26 UTC (permalink / raw)
  To: ast
  Cc: andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev, Daniel Borkmann

This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of bpf_mprog don't need to care about the dependency
resolution internals; they can just consume it with a few API calls.

The initial idea of having a generic API came out of a discussion [0] on an
earlier revision of this work, where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
to classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:

  I cannot help but feel that priority logic copy-paste from old tc, netfilter
  and friends is done because "that's how things were done in the past". [...]
  Priority gets exposed everywhere in uapi all the way to bpftool when it's
  right there for users to understand. And that's the main problem with it.

  The user don't want to and don't need to be aware of it, but uapi forces them
  to pick the priority. [...] Your cover letter [0] example proves that in
  real life different service pick the same priority. They simply don't know
  any better. Priority is an unnecessary magic that apps _have_ to pick, so
  they just copy-paste and everyone ends up using the same.

The course of the discussion showed more and more the need for a generic,
reusable API where the same "look and feel" can be applied to various other
program types beyond just tc BPF; for example, XDP today does not have
multi-program support in the kernel, and there was also interest in this API
for improving management of cgroup program types. Such a common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating their attachments.

From both the Cilium and Meta side [2], we've collected the following
requirements for a generic attach/detach/query API for multi-progs, which
have been implemented as part of this work (a minimal sketch of the intended
uapi usage follows the list below):

  - Support prog-based attach/detach and link API
  - Dependency directives (can also be combined):
    - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
      - BPF_F_ID flag as {fd,id} toggle
      - BPF_F_LINK flag as {prog,link} toggle
      - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
        BPF_F_AFTER will just append for the case of attaching
      - Enforced only at attach time
    - BPF_F_{FIRST,LAST}
      - Enforced throughout the bpf_mprog state's lifetime
      - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
  - Internal revision counter and optionally being able to pass expected_revision
  - User space daemon can query current state with revision, and pass it along
    for attachment to assert current state before doing updates
  - Query also gets extension for link_ids array and link_attach_flags:
    - prog_ids are always filled with program IDs
    - link_ids are filled with link IDs when link was used, otherwise 0
    - {prog,link}_attach_flags for holding {prog,link}-specific flags
  - Must be easy to integrate/reuse for in-kernel users
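
As a loose illustration of how these directives are meant to surface at the
bpf(2) syscall level, here is a minimal sketch (not part of this patch) which
attaches a program before an already attached one while asserting the
expected revision. It uses the tcx attach type added later in this series;
fd and ifindex values are placeholders and error handling is omitted:

  /* Sketch only: attach prog_fd before relative_prog_fd on tcx ingress,
   * failing with -ESTALE if the revision changed in the meantime.
   */
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int attach_before(int prog_fd, int relative_prog_fd, int ifindex,
                           __u32 expected_revision)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.target_ifindex    = ifindex;
          attr.attach_bpf_fd     = prog_fd;
          attr.attach_type       = BPF_TCX_INGRESS;
          attr.attach_flags      = BPF_F_BEFORE;
          attr.relative_fd       = relative_prog_fd;
          attr.expected_revision = expected_revision;

          return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
  }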

The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the addition of the attachment flags and the revision counter,
and of expanding the existing union with a relative_{fd,id} member.
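
For completeness, a small hedged sketch of the extended query side follows,
again using the tcx attach type from the next patch; a daemon would typically
read the revision here and pass it back as expected_revision on a subsequent
attach or detach:

  /* Sketch only: query attached programs/links plus the current revision
   * on tcx ingress of a given device. Buffers are caller-provided.
   */
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int query_ingress(int ifindex, __u32 *prog_ids, __u32 *link_ids,
                           __u32 *count, __u32 *revision)
  {
          union bpf_attr attr;
          int ret;

          memset(&attr, 0, sizeof(attr));
          attr.query.target_ifindex = ifindex;
          attr.query.attach_type    = BPF_TCX_INGRESS;
          attr.query.prog_ids       = (__u64)(unsigned long)prog_ids;
          attr.query.link_ids       = (__u64)(unsigned long)link_ids;
          attr.query.count          = *count;

          ret = syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr));
          if (!ret) {
                  *count    = attr.query.count;
                  *revision = attr.query.revision;
          }
          return ret;
  }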

The bpf_mprog framework consists of a bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
structure). The two have been separated so that the fast path gets efficient
packing of bpf_prog pointers for maximum cache efficiency. Also, an array has
been chosen instead of a linked list or other structures to remove unnecessary
indirections for a fast point of entry into tc for BPF. The bpf_mprog_entry
comes as a pair via bpf_mprog_bundle so that in case of updates the peer
bpf_mprog_entry is populated and then just swapped in, which avoids additional
allocations that could otherwise fail, for example, in the detach case. The
bpf_mprog_{fp,cp} arrays are currently static, but they could be converted to
dynamic allocation if necessary at a later point. Locking is deferred to the
in-kernel user of bpf_mprog; for example, tcx, which uses this API in the next
patch, piggybacks on rtnl. The nitty-gritty details are in the
bpf_mprog_{replace,head_tail,add,del} implementations, and an extensive BPF
selftest suite in this series checks all aspects of this API for both
prog-based attach/detach and the link API.
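
For orientation, below is a heavily condensed sketch (not part of this patch)
of how an in-kernel user is expected to consume the attach API; it mirrors
the tcx attach path from the next patch, with locking (e.g. rtnl) and storage
of the active entry pointer left to the caller, and with the hypothetical
example_publish() helper standing in for the actual pointer swap:

  /* Condensed sketch of an in-kernel bpf_mprog consumer, modelled on the
   * tcx attach path in the next patch. Locking and publication of the
   * updated peer entry (example_publish() is a placeholder, e.g. done via
   * rcu_assign_pointer()) are the caller's responsibility.
   */
  #include <linux/bpf.h>
  #include <linux/bpf_mprog.h>

  static int example_attach(struct bpf_mprog_entry *entry,
                            struct bpf_prog *prog, u32 flags,
                            u32 relative_fd, u32 expected_revision)
  {
          int ret;

          ret = bpf_mprog_attach(entry, prog, NULL, flags, relative_fd,
                                 expected_revision);
          if (ret < 0)
                  return ret;
          if (ret == BPF_MPROG_SWAP)
                  example_publish(bpf_mprog_peer(entry));
          /* Bumps the revision, synchronizes RCU and drops a deferred
           * program reference, if any.
           */
          bpf_mprog_commit(entry);
          return 0;
  }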

Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.

  [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
  [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
  [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 MAINTAINERS                    |   1 +
 include/linux/bpf_mprog.h      | 245 +++++++++++++++++
 include/uapi/linux/bpf.h       |  37 ++-
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  37 ++-
 6 files changed, 781 insertions(+), 17 deletions(-)
 create mode 100644 include/linux/bpf_mprog.h
 create mode 100644 kernel/bpf/mprog.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c904dba1733b..754a9eeca0a1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3733,6 +3733,7 @@ F:	include/linux/filter.h
 F:	include/linux/tnum.h
 F:	kernel/bpf/core.c
 F:	kernel/bpf/dispatcher.c
+F:	kernel/bpf/mprog.c
 F:	kernel/bpf/syscall.c
 F:	kernel/bpf/tnum.c
 F:	kernel/bpf/trampoline.c
diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h
new file mode 100644
index 000000000000..7399181d8e6c
--- /dev/null
+++ b/include/linux/bpf_mprog.h
@@ -0,0 +1,245 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Isovalent */
+#ifndef __BPF_MPROG_H
+#define __BPF_MPROG_H
+
+#include <linux/bpf.h>
+
+#define BPF_MPROG_MAX	64
+#define BPF_MPROG_SWAP	1
+#define BPF_MPROG_FREE	2
+
+struct bpf_mprog_fp {
+	struct bpf_prog *prog;
+};
+
+struct bpf_mprog_cp {
+	struct bpf_link *link;
+	u32 flags;
+};
+
+struct bpf_mprog_entry {
+	struct bpf_mprog_fp fp_items[BPF_MPROG_MAX] ____cacheline_aligned;
+	struct bpf_mprog_cp cp_items[BPF_MPROG_MAX] ____cacheline_aligned;
+	struct bpf_mprog_bundle *parent;
+};
+
+struct bpf_mprog_bundle {
+	struct bpf_mprog_entry a;
+	struct bpf_mprog_entry b;
+	struct rcu_head rcu;
+	struct bpf_prog *ref;
+	atomic_t revision;
+};
+
+struct bpf_tuple {
+	struct bpf_prog *prog;
+	struct bpf_link *link;
+};
+
+static inline struct bpf_mprog_entry *
+bpf_mprog_peer(const struct bpf_mprog_entry *entry)
+{
+	if (entry == &entry->parent->a)
+		return &entry->parent->b;
+	else
+		return &entry->parent->a;
+}
+
+#define bpf_mprog_foreach_tuple(entry, fp, cp, t)			\
+	for (fp = &entry->fp_items[0], cp = &entry->cp_items[0];	\
+	     ({								\
+		t.prog = READ_ONCE(fp->prog);				\
+		t.link = cp->link;					\
+		t.prog;							\
+	      });							\
+	     fp++, cp++)
+
+#define bpf_mprog_foreach_prog(entry, fp, p)				\
+	for (fp = &entry->fp_items[0];					\
+	     (p = READ_ONCE(fp->prog));					\
+	     fp++)
+
+static inline struct bpf_mprog_entry *bpf_mprog_create(size_t extra_size)
+{
+	struct bpf_mprog_bundle *bundle;
+
+	/* Fast-path items are not extensible, must only contain prog pointer! */
+	BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
+	/* Control-path items can be extended w/o affecting fast-path. */
+	BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) != ARRAY_SIZE(bundle->a.cp_items));
+
+	bundle = kzalloc(sizeof(*bundle) + extra_size, GFP_KERNEL);
+	if (bundle) {
+		atomic_set(&bundle->revision, 1);
+		bundle->a.parent = bundle;
+		bundle->b.parent = bundle;
+		return &bundle->a;
+	}
+	return NULL;
+}
+
+static inline void bpf_mprog_free(struct bpf_mprog_entry *entry)
+{
+	kfree_rcu(entry->parent, rcu);
+}
+
+static inline void bpf_mprog_mark_ref(struct bpf_mprog_entry *entry,
+				      struct bpf_prog *prog)
+{
+	WARN_ON_ONCE(entry->parent->ref);
+	entry->parent->ref = prog;
+}
+
+static inline u32 bpf_mprog_flags(u32 cur_flags, u32 req_flags, u32 flag)
+{
+	if (req_flags & flag)
+		cur_flags |= flag;
+	else
+		cur_flags &= ~flag;
+	return cur_flags;
+}
+
+static inline u32 bpf_mprog_max(void)
+{
+	return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
+}
+
+static inline struct bpf_prog *bpf_mprog_first(struct bpf_mprog_entry *entry)
+{
+	return READ_ONCE(entry->fp_items[0].prog);
+}
+
+static inline struct bpf_prog *bpf_mprog_last(struct bpf_mprog_entry *entry)
+{
+	struct bpf_prog *tmp, *prog = NULL;
+	struct bpf_mprog_fp *fp;
+
+	bpf_mprog_foreach_prog(entry, fp, tmp)
+		prog = tmp;
+	return prog;
+}
+
+static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
+				    struct bpf_prog *prog)
+{
+	const struct bpf_mprog_fp *fp;
+	const struct bpf_prog *tmp;
+
+	bpf_mprog_foreach_prog(entry, fp, tmp) {
+		if (tmp == prog)
+			return true;
+	}
+	return false;
+}
+
+static inline struct bpf_prog *bpf_mprog_first_reg(struct bpf_mprog_entry *entry)
+{
+	struct bpf_tuple tuple = {};
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+
+	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
+		if (cp->flags & BPF_F_FIRST)
+			continue;
+		return tuple.prog;
+	}
+	return NULL;
+}
+
+static inline struct bpf_prog *bpf_mprog_last_reg(struct bpf_mprog_entry *entry)
+{
+	struct bpf_tuple tuple = {};
+	struct bpf_prog *prog = NULL;
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+
+	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
+		if (cp->flags & BPF_F_LAST)
+			break;
+		prog = tuple.prog;
+	}
+	return prog;
+}
+
+static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
+{
+	do {
+		atomic_inc(&entry->parent->revision);
+	} while (atomic_read(&entry->parent->revision) == 0);
+	synchronize_rcu();
+	if (entry->parent->ref) {
+		bpf_prog_put(entry->parent->ref);
+		entry->parent->ref = NULL;
+	}
+}
+
+static inline void bpf_mprog_entry_clear(struct bpf_mprog_entry *entry)
+{
+	memset(entry->fp_items, 0, sizeof(entry->fp_items));
+	memset(entry->cp_items, 0, sizeof(entry->cp_items));
+}
+
+static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry)
+{
+	return atomic_read(&entry->parent->revision);
+}
+
+static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 which,
+				  struct bpf_mprog_fp **fp_dst,
+				  struct bpf_mprog_cp **cp_dst)
+{
+	*fp_dst = &entry->fp_items[which];
+	*cp_dst = &entry->cp_items[which];
+}
+
+static inline void bpf_mprog_write(struct bpf_mprog_fp *fp_dst,
+				   struct bpf_mprog_cp *cp_dst,
+				   struct bpf_tuple *tuple, u32 flags)
+{
+	WRITE_ONCE(fp_dst->prog, tuple->prog);
+	cp_dst->link  = tuple->link;
+	cp_dst->flags = flags;
+}
+
+static inline void bpf_mprog_copy(struct bpf_mprog_fp *fp_dst,
+				  struct bpf_mprog_cp *cp_dst,
+				  struct bpf_mprog_fp *fp_src,
+				  struct bpf_mprog_cp *cp_src)
+{
+	WRITE_ONCE(fp_dst->prog, READ_ONCE(fp_src->prog));
+	memcpy(cp_dst, cp_src, sizeof(*cp_src));
+}
+
+static inline void bpf_mprog_copy_range(struct bpf_mprog_entry *peer,
+					struct bpf_mprog_entry *entry,
+					u32 idx_peer, u32 idx_entry, u32 num)
+{
+	memcpy(&peer->fp_items[idx_peer], &entry->fp_items[idx_entry],
+	       num * sizeof(peer->fp_items[0]));
+	memcpy(&peer->cp_items[idx_peer], &entry->cp_items[idx_entry],
+	       num * sizeof(peer->cp_items[0]));
+}
+
+static inline u32 bpf_mprog_total(struct bpf_mprog_entry *entry)
+{
+	const struct bpf_mprog_fp *fp;
+	const struct bpf_prog *tmp;
+	u32 num = 0;
+
+	bpf_mprog_foreach_prog(entry, fp, tmp)
+		num++;
+	return num;
+}
+
+int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
+		     struct bpf_link *link, u32 flags, u32 object,
+		     u32 expected_revision);
+int bpf_mprog_detach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
+		     struct bpf_link *link, u32 flags, u32 object,
+		     u32 expected_revision);
+
+int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
+		    struct bpf_mprog_entry *entry);
+
+#endif /* __BPF_MPROG_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a7b5e91dd768..207f8a37b327 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1102,7 +1102,14 @@ enum bpf_link_type {
  */
 #define BPF_F_ALLOW_OVERRIDE	(1U << 0)
 #define BPF_F_ALLOW_MULTI	(1U << 1)
+/* Generic attachment flags. */
 #define BPF_F_REPLACE		(1U << 2)
+#define BPF_F_BEFORE		(1U << 3)
+#define BPF_F_AFTER		(1U << 4)
+#define BPF_F_FIRST		(1U << 5)
+#define BPF_F_LAST		(1U << 6)
+#define BPF_F_ID		(1U << 7)
+#define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
 
 /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
  * verifier will perform strict alignment checking as if the kernel
@@ -1433,14 +1440,19 @@ union bpf_attr {
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
-		__u32		target_fd;	/* container object to attach to */
-		__u32		attach_bpf_fd;	/* eBPF program to attach */
+		union {
+			__u32	target_fd;	/* target object to attach to or ... */
+			__u32	target_ifindex;	/* target ifindex */
+		};
+		__u32		attach_bpf_fd;
 		__u32		attach_type;
 		__u32		attach_flags;
-		__u32		replace_bpf_fd;	/* previously attached eBPF
-						 * program to replace if
-						 * BPF_F_REPLACE is used
-						 */
+		union {
+			__u32	relative_fd;
+			__u32	relative_id;
+			__u32	replace_bpf_fd;
+		};
+		__u32		expected_revision;
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
@@ -1486,16 +1498,25 @@ union bpf_attr {
 	} info;
 
 	struct { /* anonymous struct used by BPF_PROG_QUERY command */
-		__u32		target_fd;	/* container object to query */
+		union {
+			__u32	target_fd;	/* target object to query or ... */
+			__u32	target_ifindex;	/* target ifindex */
+		};
 		__u32		attach_type;
 		__u32		query_flags;
 		__u32		attach_flags;
 		__aligned_u64	prog_ids;
-		__u32		prog_cnt;
+		union {
+			__u32	prog_cnt;
+			__u32	count;
+		};
+		__u32		revision;
 		/* output: per-program attach_flags.
 		 * not allowed to be set during effective query.
 		 */
 		__aligned_u64	prog_attach_flags;
+		__aligned_u64	link_ids;
+		__aligned_u64	link_attach_flags;
 	} query;
 
 	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1d3892168d32..1bea2eb912cd 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
-obj-$(CONFIG_BPF_SYSCALL) += disasm.o
+obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c
new file mode 100644
index 000000000000..efc3b73f8bf5
--- /dev/null
+++ b/kernel/bpf/mprog.c
@@ -0,0 +1,476 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+
+#include <linux/bpf.h>
+#include <linux/bpf_mprog.h>
+#include <linux/filter.h>
+
+static int bpf_mprog_tuple_relative(struct bpf_tuple *tuple,
+				    u32 object, u32 flags,
+				    enum bpf_prog_type type)
+{
+	struct bpf_prog *prog;
+	struct bpf_link *link;
+
+	memset(tuple, 0, sizeof(*tuple));
+	if (!(flags & (BPF_F_REPLACE | BPF_F_BEFORE | BPF_F_AFTER)))
+		return object || (flags & (BPF_F_ID | BPF_F_LINK)) ?
+		       -EINVAL : 0;
+	if (flags & BPF_F_LINK) {
+		if (flags & BPF_F_ID)
+			link = bpf_link_by_id(object);
+		else
+			link = bpf_link_get_from_fd(object);
+		if (IS_ERR(link))
+			return PTR_ERR(link);
+		if (type && link->prog->type != type) {
+			bpf_link_put(link);
+			return -EINVAL;
+		}
+		tuple->link = link;
+		tuple->prog = link->prog;
+	} else {
+		if (flags & BPF_F_ID)
+			prog = bpf_prog_by_id(object);
+		else
+			prog = bpf_prog_get(object);
+		if (IS_ERR(prog)) {
+			if (!object &&
+			    !(flags & BPF_F_ID))
+				return 0;
+			return PTR_ERR(prog);
+		}
+		if (type && prog->type != type) {
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+		tuple->link = NULL;
+		tuple->prog = prog;
+	}
+	return 0;
+}
+
+static void bpf_mprog_tuple_put(struct bpf_tuple *tuple)
+{
+	if (tuple->link)
+		bpf_link_put(tuple->link);
+	else if (tuple->prog)
+		bpf_prog_put(tuple->prog);
+}
+
+static int bpf_mprog_replace(struct bpf_mprog_entry *entry,
+			     struct bpf_tuple *ntuple,
+			     struct bpf_tuple *rtuple, u32 rflags)
+{
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+	struct bpf_prog *oprog;
+	u32 iflags;
+	int i;
+
+	if (rflags & (BPF_F_BEFORE | BPF_F_AFTER | BPF_F_LINK))
+		return -EINVAL;
+	if (rtuple->prog != ntuple->prog &&
+	    bpf_mprog_exists(entry, ntuple->prog))
+		return -EEXIST;
+	for (i = 0; i < bpf_mprog_max(); i++) {
+		bpf_mprog_read(entry, i, &fp, &cp);
+		oprog = READ_ONCE(fp->prog);
+		if (!oprog)
+			break;
+		if (oprog != rtuple->prog)
+			continue;
+		if (cp->link != ntuple->link)
+			return -EBUSY;
+		iflags = cp->flags;
+		if ((iflags & BPF_F_FIRST) !=
+		    (rflags & BPF_F_FIRST)) {
+			iflags = bpf_mprog_flags(iflags, rflags,
+						 BPF_F_FIRST);
+			if ((iflags & BPF_F_FIRST) &&
+			    rtuple->prog != bpf_mprog_first(entry))
+				return -EACCES;
+		}
+		if ((iflags & BPF_F_LAST) !=
+		    (rflags & BPF_F_LAST)) {
+			iflags = bpf_mprog_flags(iflags, rflags,
+						 BPF_F_LAST);
+			if ((iflags & BPF_F_LAST) &&
+			    rtuple->prog != bpf_mprog_last(entry))
+				return -EACCES;
+		}
+		bpf_mprog_write(fp, cp, ntuple, iflags);
+		if (!ntuple->link)
+			bpf_prog_put(oprog);
+		return 0;
+	}
+	return -ENOENT;
+}
+
+static int bpf_mprog_head_tail(struct bpf_mprog_entry *entry,
+			       struct bpf_tuple *ntuple,
+			       struct bpf_tuple *rtuple, u32 aflags)
+{
+	struct bpf_mprog_entry *peer;
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+	struct bpf_prog *oprog;
+	u32 iflags, items;
+
+	if (bpf_mprog_exists(entry, ntuple->prog))
+		return -EEXIST;
+	items = bpf_mprog_total(entry);
+	peer = bpf_mprog_peer(entry);
+	bpf_mprog_entry_clear(peer);
+	if (aflags & BPF_F_FIRST) {
+		if (aflags & BPF_F_AFTER)
+			return -EINVAL;
+		bpf_mprog_read(entry, 0, &fp, &cp);
+		iflags = cp->flags;
+		if (iflags & BPF_F_FIRST)
+			return -EBUSY;
+		if (aflags & BPF_F_LAST) {
+			if (aflags & BPF_F_BEFORE)
+				return -EINVAL;
+			if (items)
+				return -EBUSY;
+			bpf_mprog_read(peer, 0, &fp, &cp);
+			bpf_mprog_write(fp, cp, ntuple,
+					BPF_F_FIRST | BPF_F_LAST);
+			return BPF_MPROG_SWAP;
+		}
+		if (aflags & BPF_F_BEFORE) {
+			oprog = READ_ONCE(fp->prog);
+			if (oprog != rtuple->prog ||
+			    (rtuple->link &&
+			     rtuple->link != cp->link))
+				return -EBUSY;
+		}
+		if (items >= bpf_mprog_max())
+			return -ENOSPC;
+		bpf_mprog_read(peer, 0, &fp, &cp);
+		bpf_mprog_write(fp, cp, ntuple, BPF_F_FIRST);
+		bpf_mprog_copy_range(peer, entry, 1, 0, items);
+		return BPF_MPROG_SWAP;
+	}
+	if (aflags & BPF_F_LAST) {
+		if (aflags & BPF_F_BEFORE)
+			return -EINVAL;
+		if (items) {
+			bpf_mprog_read(entry, items - 1, &fp, &cp);
+			iflags = cp->flags;
+			if (iflags & BPF_F_LAST)
+				return -EBUSY;
+			if (aflags & BPF_F_AFTER) {
+				oprog = READ_ONCE(fp->prog);
+				if (oprog != rtuple->prog ||
+				    (rtuple->link &&
+				     rtuple->link != cp->link))
+					return -EBUSY;
+			}
+			if (items >= bpf_mprog_max())
+				return -ENOSPC;
+		} else {
+			if (aflags & BPF_F_AFTER)
+				return -EBUSY;
+		}
+		bpf_mprog_read(peer, items, &fp, &cp);
+		bpf_mprog_write(fp, cp, ntuple, BPF_F_LAST);
+		bpf_mprog_copy_range(peer, entry, 0, 0, items);
+		return BPF_MPROG_SWAP;
+	}
+	return -ENOENT;
+}
+
+static int bpf_mprog_add(struct bpf_mprog_entry *entry,
+			 struct bpf_tuple *ntuple,
+			 struct bpf_tuple *rtuple, u32 aflags)
+{
+	struct bpf_mprog_fp *fp_dst, *fp_src;
+	struct bpf_mprog_cp *cp_dst, *cp_src;
+	struct bpf_mprog_entry *peer;
+	struct bpf_prog *oprog;
+	bool found = false;
+	u32 items;
+	int i, j;
+
+	items = bpf_mprog_total(entry);
+	if (items >= bpf_mprog_max())
+		return -ENOSPC;
+	if ((aflags & (BPF_F_BEFORE | BPF_F_AFTER)) ==
+	    (BPF_F_BEFORE | BPF_F_AFTER))
+		return -EINVAL;
+	if (bpf_mprog_exists(entry, ntuple->prog))
+		return -EEXIST;
+	if (!rtuple->prog && (aflags & (BPF_F_BEFORE | BPF_F_AFTER))) {
+		if (!items)
+			aflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
+		if (aflags & BPF_F_BEFORE)
+			rtuple->prog = bpf_mprog_first_reg(entry);
+		if (aflags & BPF_F_AFTER)
+			rtuple->prog = bpf_mprog_last_reg(entry);
+		if (!rtuple->prog)
+			aflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
+		else
+			bpf_prog_inc(rtuple->prog);
+	}
+	peer = bpf_mprog_peer(entry);
+	bpf_mprog_entry_clear(peer);
+	for (i = 0, j = 0; i < bpf_mprog_max(); i++, j++) {
+		bpf_mprog_read(entry, i, &fp_src, &cp_src);
+		bpf_mprog_read(peer,  j, &fp_dst, &cp_dst);
+		oprog = READ_ONCE(fp_src->prog);
+		if (!oprog) {
+			if (i != j)
+				break;
+			if (i > 0) {
+				bpf_mprog_read(entry, i - 1,
+					       &fp_src, &cp_src);
+				if (cp_src->flags & BPF_F_LAST) {
+					if (cp_src->flags & BPF_F_FIRST)
+						return -EBUSY;
+					bpf_mprog_copy(fp_dst, cp_dst,
+						       fp_src, cp_src);
+					bpf_mprog_read(peer, --j,
+						       &fp_dst, &cp_dst);
+				}
+			}
+			bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
+			break;
+		}
+		if (aflags & (BPF_F_BEFORE | BPF_F_AFTER)) {
+			if (rtuple->prog != oprog ||
+			    (rtuple->link &&
+			     rtuple->link != cp_src->link))
+				goto next;
+			found = true;
+			if (aflags & BPF_F_BEFORE) {
+				if (cp_src->flags & BPF_F_FIRST)
+					return -EBUSY;
+				bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
+				bpf_mprog_read(peer, ++j, &fp_dst, &cp_dst);
+				goto next;
+			}
+			if (aflags & BPF_F_AFTER) {
+				if (cp_src->flags & BPF_F_LAST)
+					return -EBUSY;
+				bpf_mprog_copy(fp_dst, cp_dst,
+					       fp_src, cp_src);
+				bpf_mprog_read(peer, ++j, &fp_dst, &cp_dst);
+				bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
+				continue;
+			}
+		}
+next:
+		bpf_mprog_copy(fp_dst, cp_dst,
+			       fp_src, cp_src);
+	}
+	if (rtuple->prog && !found)
+		return -ENOENT;
+	return BPF_MPROG_SWAP;
+}
+
+static int bpf_mprog_del(struct bpf_mprog_entry *entry,
+			 struct bpf_tuple *dtuple,
+			 struct bpf_tuple *rtuple, u32 dflags)
+{
+	struct bpf_mprog_fp *fp_dst, *fp_src;
+	struct bpf_mprog_cp *cp_dst, *cp_src;
+	struct bpf_mprog_entry *peer;
+	struct bpf_prog *oprog;
+	bool found = false;
+	int i, j, ret;
+
+	if (dflags & BPF_F_REPLACE)
+		return -EINVAL;
+	if (dflags & BPF_F_FIRST) {
+		oprog = bpf_mprog_first(entry);
+		if (dtuple->prog &&
+		    dtuple->prog != oprog)
+			return -ENOENT;
+		dtuple->prog = oprog;
+	}
+	if (dflags & BPF_F_LAST) {
+		oprog = bpf_mprog_last(entry);
+		if (dtuple->prog &&
+		    dtuple->prog != oprog)
+			return -ENOENT;
+		dtuple->prog = oprog;
+	}
+	if (!rtuple->prog && (dflags & (BPF_F_BEFORE | BPF_F_AFTER))) {
+		if (dtuple->prog)
+			return -EINVAL;
+		if (dflags & BPF_F_BEFORE)
+			dtuple->prog = bpf_mprog_first_reg(entry);
+		if (dflags & BPF_F_AFTER)
+			dtuple->prog = bpf_mprog_last_reg(entry);
+		if (dtuple->prog)
+			dflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
+	}
+	for (i = 0; i < bpf_mprog_max(); i++) {
+		bpf_mprog_read(entry, i, &fp_src, &cp_src);
+		oprog = READ_ONCE(fp_src->prog);
+		if (!oprog)
+			break;
+		if (dflags & (BPF_F_BEFORE | BPF_F_AFTER)) {
+			if (rtuple->prog != oprog ||
+			    (rtuple->link &&
+			     rtuple->link != cp_src->link))
+				continue;
+			found = true;
+			if (dflags & BPF_F_BEFORE) {
+				if (!i)
+					return -ENOENT;
+				bpf_mprog_read(entry, i - 1,
+					       &fp_src, &cp_src);
+				oprog = READ_ONCE(fp_src->prog);
+				if (dtuple->prog &&
+				    dtuple->prog != oprog)
+					return -ENOENT;
+				dtuple->prog = oprog;
+				break;
+			}
+			if (dflags & BPF_F_AFTER) {
+				bpf_mprog_read(entry, i + 1,
+					       &fp_src, &cp_src);
+				oprog = READ_ONCE(fp_src->prog);
+				if (dtuple->prog &&
+				    dtuple->prog != oprog)
+					return -ENOENT;
+				dtuple->prog = oprog;
+				break;
+			}
+		}
+	}
+	if (!dtuple->prog || (rtuple->prog && !found))
+		return -ENOENT;
+	peer = bpf_mprog_peer(entry);
+	bpf_mprog_entry_clear(peer);
+	ret = -ENOENT;
+	for (i = 0, j = 0; i < bpf_mprog_max(); i++) {
+		bpf_mprog_read(entry, i, &fp_src, &cp_src);
+		bpf_mprog_read(peer,  j, &fp_dst, &cp_dst);
+		oprog = READ_ONCE(fp_src->prog);
+		if (!oprog)
+			break;
+		if (oprog != dtuple->prog) {
+			bpf_mprog_copy(fp_dst, cp_dst,
+				       fp_src, cp_src);
+			j++;
+		} else {
+			if (cp_src->link != dtuple->link)
+				return -EBUSY;
+			if (!cp_src->link)
+				bpf_mprog_mark_ref(entry, dtuple->prog);
+			ret = BPF_MPROG_SWAP;
+		}
+	}
+	if (!bpf_mprog_total(peer))
+		ret = BPF_MPROG_FREE;
+	return ret;
+}
+
+int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
+		     struct bpf_link *link, u32 flags, u32 object,
+		     u32 expected_revision)
+{
+	struct bpf_tuple rtuple, ntuple = {
+		.prog = prog,
+		.link = link,
+	};
+	int ret;
+
+	if (expected_revision &&
+	    expected_revision != bpf_mprog_revision(entry))
+		return -ESTALE;
+	ret = bpf_mprog_tuple_relative(&rtuple, object, flags, prog->type);
+	if (ret)
+		return ret;
+	if (flags & BPF_F_REPLACE)
+		ret = bpf_mprog_replace(entry, &ntuple, &rtuple, flags);
+	else if (flags & (BPF_F_FIRST | BPF_F_LAST))
+		ret = bpf_mprog_head_tail(entry, &ntuple, &rtuple, flags);
+	else
+		ret = bpf_mprog_add(entry, &ntuple, &rtuple, flags);
+	bpf_mprog_tuple_put(&rtuple);
+	return ret;
+}
+
+int bpf_mprog_detach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
+		     struct bpf_link *link, u32 flags, u32 object,
+		     u32 expected_revision)
+{
+	struct bpf_tuple rtuple, dtuple = {
+		.prog = prog,
+		.link = link,
+	};
+	int ret;
+
+	if (expected_revision &&
+	    expected_revision != bpf_mprog_revision(entry))
+		return -ESTALE;
+	ret = bpf_mprog_tuple_relative(&rtuple, object, flags,
+				       prog ? prog->type :
+				       BPF_PROG_TYPE_UNSPEC);
+	if (ret)
+		return ret;
+	ret = bpf_mprog_del(entry, &dtuple, &rtuple, flags);
+	bpf_mprog_tuple_put(&rtuple);
+	return ret;
+}
+
+int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
+		    struct bpf_mprog_entry *entry)
+{
+	u32 i, id, flags = 0, count, revision;
+	u32 __user *uprog_id, *uprog_af;
+	u32 __user *ulink_id, *ulink_af;
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	if (attr->query.query_flags || attr->query.attach_flags)
+		return -EINVAL;
+	revision = bpf_mprog_revision(entry);
+	count = bpf_mprog_total(entry);
+	if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
+		return -EFAULT;
+	if (copy_to_user(&uattr->query.revision, &revision, sizeof(revision)))
+		return -EFAULT;
+	if (copy_to_user(&uattr->query.count, &count, sizeof(count)))
+		return -EFAULT;
+	uprog_id = u64_to_user_ptr(attr->query.prog_ids);
+	if (attr->query.count == 0 || !uprog_id || !count)
+		return 0;
+	if (attr->query.count < count) {
+		count = attr->query.count;
+		ret = -ENOSPC;
+	}
+	uprog_af = u64_to_user_ptr(attr->query.prog_attach_flags);
+	ulink_id = u64_to_user_ptr(attr->query.link_ids);
+	ulink_af = u64_to_user_ptr(attr->query.link_attach_flags);
+	for (i = 0; i < ARRAY_SIZE(entry->fp_items); i++) {
+		bpf_mprog_read(entry, i, &fp, &cp);
+		prog = READ_ONCE(fp->prog);
+		if (!prog)
+			break;
+		id = prog->aux->id;
+		if (copy_to_user(uprog_id + i, &id, sizeof(id)))
+			return -EFAULT;
+		id = cp->link ? cp->link->id : 0;
+		if (ulink_id &&
+		    copy_to_user(ulink_id + i, &id, sizeof(id)))
+			return -EFAULT;
+		flags = cp->flags;
+		if (uprog_af && !id &&
+		    copy_to_user(uprog_af + i, &flags, sizeof(flags)))
+			return -EFAULT;
+		if (ulink_af && id &&
+		    copy_to_user(ulink_af + i, &flags, sizeof(flags)))
+			return -EFAULT;
+		if (i + 1 == count)
+			break;
+	}
+	return ret;
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a7b5e91dd768..207f8a37b327 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1102,7 +1102,14 @@ enum bpf_link_type {
  */
 #define BPF_F_ALLOW_OVERRIDE	(1U << 0)
 #define BPF_F_ALLOW_MULTI	(1U << 1)
+/* Generic attachment flags. */
 #define BPF_F_REPLACE		(1U << 2)
+#define BPF_F_BEFORE		(1U << 3)
+#define BPF_F_AFTER		(1U << 4)
+#define BPF_F_FIRST		(1U << 5)
+#define BPF_F_LAST		(1U << 6)
+#define BPF_F_ID		(1U << 7)
+#define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
 
 /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
  * verifier will perform strict alignment checking as if the kernel
@@ -1433,14 +1440,19 @@ union bpf_attr {
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
-		__u32		target_fd;	/* container object to attach to */
-		__u32		attach_bpf_fd;	/* eBPF program to attach */
+		union {
+			__u32	target_fd;	/* target object to attach to or ... */
+			__u32	target_ifindex;	/* target ifindex */
+		};
+		__u32		attach_bpf_fd;
 		__u32		attach_type;
 		__u32		attach_flags;
-		__u32		replace_bpf_fd;	/* previously attached eBPF
-						 * program to replace if
-						 * BPF_F_REPLACE is used
-						 */
+		union {
+			__u32	relative_fd;
+			__u32	relative_id;
+			__u32	replace_bpf_fd;
+		};
+		__u32		expected_revision;
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
@@ -1486,16 +1498,25 @@ union bpf_attr {
 	} info;
 
 	struct { /* anonymous struct used by BPF_PROG_QUERY command */
-		__u32		target_fd;	/* container object to query */
+		union {
+			__u32	target_fd;	/* target object to query or ... */
+			__u32	target_ifindex;	/* target ifindex */
+		};
 		__u32		attach_type;
 		__u32		query_flags;
 		__u32		attach_flags;
 		__aligned_u64	prog_ids;
-		__u32		prog_cnt;
+		union {
+			__u32	prog_cnt;
+			__u32	count;
+		};
+		__u32		revision;
 		/* output: per-program attach_flags.
 		 * not allowed to be set during effective query.
 		 */
 		__aligned_u64	prog_attach_flags;
+		__aligned_u64	link_ids;
+		__aligned_u64	link_attach_flags;
 	} query;
 
 	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
-- 
2.34.1



* [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
  2023-06-07 19:26 ` [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs Daniel Borkmann
@ 2023-06-07 19:26 ` Daniel Borkmann
  2023-06-08  1:25   ` Jamal Hadi Salim
                     ` (3 more replies)
  2023-06-07 19:26 ` [PATCH bpf-next v2 3/7] libbpf: Add opts-based attach/detach/query API for tcx Daniel Borkmann
                   ` (4 subsequent siblings)
  6 siblings, 4 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-07 19:26 UTC (permalink / raw)
  To: ast
  Cc: andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev, Daniel Borkmann

This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side to allow fd-based BPF program management
via the bpf() syscall through the newly added generic multi-prog API. The
main goal behind this work, which we presented at LPC [0] last year along
with a recent update at LSF/MM/BPF this year [3], is to support the
long-awaited BPF link functionality for tc BPF programs, which allows for a
model of safe ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard-to-debug incidents caused either by stale leftover
programs or by 3rd party applications accidentally stepping on each other's
toes. As a recap, a BPF link represents the attachment of a BPF program to a
BPF hook point. The BPF link holds a single reference to keep the BPF program
alive. Moreover, hook points do not reference a BPF link; only the
application's fd or pinning does. A BPF link holds metadata specific to the
attachment and implements operations for link creation, (atomic) BPF program
update, detachment and introspection. The motivation for BPF links for tc BPF
programs is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From the Cilium side, the tc BPF programs we attach to host-facing veth
    devices and phys devices build the core datapath for Kubernetes Pods, and
    they implement forwarding, load-balancing, policy, EDT-management, etc,
    within BPF. Currently there is no concept of 'safe' ownership; e.g. we've
    recently experienced hard-to-debug issues in a user's staging environment
    where another Kubernetes application using tc BPF attached to the same
    prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs
    from underneath it. The goal is to establish a clear/safe ownership model
    via links which cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics
are in line with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links,
and lastly only non-BPF links can replace non-BPF links. In the case of
Cilium, this would solve the mentioned issue of a safe ownership model, as
3rd party applications would not be able to accidentally wipe Cilium
programs, even if they are not BPF link aware.

Earlier attempts [4] have tried to integrate BPF links into the core tc
machinery to solve this for cls_bpf, which was intrusive to the generic tc
kernel API with extensions specific only to cls_bpf, and suboptimal/complex
since cls_bpf could also be wiped from the qdisc. Locking a tc BPF program in
place this way runs into layering hacks given the two object models are
vastly different.

We instead implemented the tcx (tc 'express') layer, which is an fd-based tc
BPF attach API, so that the BPF link implementation blends in naturally,
similar to other link types which are fd-based, and without the need for
changing core tc internal APIs. BPF programs for tc can then be successively
migrated from classic cls_bpf to the new tc BPF link without needing to
change the program's source code; only the BPF loader mechanics for attaching
need to change.
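
To give a rough idea of what the loader-side change amounts to, below is a
minimal sketch (not part of this patch) creating a tcx BPF link on the
ingress hook through the raw bpf(2) syscall; real loaders would go through
the libbpf link API added later in this series, and prog_fd/ifindex are
placeholders:

  /* Sketch only: create a tcx link on ingress via the raw bpf(2) syscall.
   * Closing the returned link fd (if not pinned) auto-detaches the program.
   */
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int tcx_link_create(int prog_fd, int ifindex, __u32 expected_revision)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.link_create.prog_fd               = prog_fd;
          attr.link_create.target_ifindex        = ifindex;
          attr.link_create.attach_type           = BPF_TCX_INGRESS;
          attr.link_create.tcx.expected_revision = expected_revision;

          return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
  }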

For the current tc framework, there is no change in behavior with this
patch, and it does not touch tc core kernel APIs either. The gist of this
patch is that the ingress and egress hooks get a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx was suggested in the discussion of
earlier revisions of this work as a good fit, and to more easily
differentiate between the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly
array with program pointers which is separated from control plane (slow-path)
data. Earlier versions of this work used priority to determine ordering and
to express dependencies, similar to classic tc, but it was challenged that
for something more future-proof a better user experience is required. This
resulted in the design and development of the generic attach/detach/query API
for multi-progs; see the prior patch with its discussion on the API design.
tcx is the first user, and later we plan to integrate others as well; for
example, one candidate is multi-prog support for XDP, which would benefit
from and have the same 'look and feel' from an API perspective.

The goal with tcx is to have maximum compatibility with existing tc BPF
programs, so they don't need to be rewritten specifically. Compatibility to
call into classic tcf_classify() is also provided in order to allow
successive migration, or for both to cleanly co-exist where needed, given it
is all one logical tc layer. tcx supports the simplified return code TCX_NEXT,
which is non-terminating (go to the next program), and the terminating codes
TCX_PASS, TCX_DROP and TCX_REDIRECT. The fd-based API is behind a static key,
so that when unused the code is not entered. The struct tcx_entry's program
array is currently static, but could be made dynamic if necessary at a point
in the future. The a/b pair swap design has been chosen so that for
detachment there are no allocations which otherwise could fail. The work has
been tested with the tc-testing selftest suite, which fully passes, as well
as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
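
As a concrete, hedged example of the compatibility claim, the minimal program
sketch below uses the simplified return codes; since their values remain
compatible with the TC_ACT_* counterparts, an existing cls_bpf program
returning e.g. TC_ACT_OK keeps working unchanged when attached via tcx
(section name and updated uapi headers are assumptions of the sketch):

  /* Minimal sketch of a tc BPF program using the simplified tcx return
   * codes. The program source needs no tcx-specific changes; only the
   * loader-side attach mechanics differ.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int tcx_example(struct __sk_buff *skb)
  {
          /* Non-terminating: continue with the next tcx program, if any.
           * TCX_PASS, TCX_DROP and TCX_REDIRECT terminate processing like
           * their TC_ACT_* counterparts.
           */
          return TCX_NEXT;
  }

  char LICENSE[] SEC("license") = "GPL";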

Kudos also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com/
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 MAINTAINERS                    |   4 +-
 include/linux/netdevice.h      |  15 +-
 include/linux/skbuff.h         |   4 +-
 include/net/sch_generic.h      |   2 +-
 include/net/tcx.h              | 157 +++++++++++++++
 include/uapi/linux/bpf.h       |  35 +++-
 kernel/bpf/Kconfig             |   1 +
 kernel/bpf/Makefile            |   1 +
 kernel/bpf/syscall.c           |  95 +++++++--
 kernel/bpf/tcx.c               | 347 +++++++++++++++++++++++++++++++++
 net/Kconfig                    |   5 +
 net/core/dev.c                 | 267 +++++++++++++++----------
 net/core/filter.c              |   4 +-
 net/sched/Kconfig              |   4 +-
 net/sched/sch_ingress.c        |  45 ++++-
 tools/include/uapi/linux/bpf.h |  35 +++-
 16 files changed, 877 insertions(+), 144 deletions(-)
 create mode 100644 include/net/tcx.h
 create mode 100644 kernel/bpf/tcx.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 754a9eeca0a1..7a0d0b0c5a5e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3827,13 +3827,15 @@ L:	netdev@vger.kernel.org
 S:	Maintained
 F:	kernel/bpf/bpf_struct*
 
-BPF [NETWORKING] (tc BPF, sock_addr)
+BPF [NETWORKING] (tcx & tc BPF, sock_addr)
 M:	Martin KaFai Lau <martin.lau@linux.dev>
 M:	Daniel Borkmann <daniel@iogearbox.net>
 R:	John Fastabend <john.fastabend@gmail.com>
 L:	bpf@vger.kernel.org
 L:	netdev@vger.kernel.org
 S:	Maintained
+F:	include/net/tcx.h
+F:	kernel/bpf/tcx.c
 F:	net/core/filter.c
 F:	net/sched/act_bpf.c
 F:	net/sched/cls_bpf.c
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 08fbd4622ccf..fd4281d1cdbb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1927,8 +1927,7 @@ enum netdev_ml_priv_type {
  *
  *	@rx_handler:		handler for received packets
  *	@rx_handler_data: 	XXX: need comments on this one
- *	@miniq_ingress:		ingress/clsact qdisc specific data for
- *				ingress processing
+ *	@tcx_ingress:		BPF & clsact qdisc specific data for ingress processing
  *	@ingress_queue:		XXX: need comments on this one
  *	@nf_hooks_ingress:	netfilter hooks executed for ingress packets
  *	@broadcast:		hw bcast address
@@ -1949,8 +1948,7 @@ enum netdev_ml_priv_type {
  *	@xps_maps:		all CPUs/RXQs maps for XPS device
  *
  *	@xps_maps:	XXX: need comments on this one
- *	@miniq_egress:		clsact qdisc specific data for
- *				egress processing
+ *	@tcx_egress:		BPF & clsact qdisc specific data for egress processing
  *	@nf_hooks_egress:	netfilter hooks executed for egress packets
  *	@qdisc_hash:		qdisc hash table
  *	@watchdog_timeo:	Represents the timeout that is used by
@@ -2249,9 +2247,8 @@ struct net_device {
 	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
-
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc __rcu	*miniq_ingress;
+#ifdef CONFIG_NET_XGRESS
+	struct bpf_mprog_entry __rcu *tcx_ingress;
 #endif
 	struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -2279,8 +2276,8 @@ struct net_device {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 #endif
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc __rcu	*miniq_egress;
+#ifdef CONFIG_NET_XGRESS
+	struct bpf_mprog_entry __rcu *tcx_egress;
 #endif
 #ifdef CONFIG_NETFILTER_EGRESS
 	struct nf_hook_entries __rcu *nf_hooks_egress;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5951904413ab..48c3e307f057 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -943,7 +943,7 @@ struct sk_buff {
 	__u8			__mono_tc_offset[0];
 	/* public: */
 	__u8			mono_delivery_time:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
 	__u8			tc_skip_classify:1;
 #endif
@@ -992,7 +992,7 @@ struct sk_buff {
 	__u8			csum_not_inet:1;
 #endif
 
-#ifdef CONFIG_NET_SCHED
+#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
 	__u16			tc_index;	/* traffic control index */
 #endif
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..0ade5d1a72b2 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -695,7 +695,7 @@ int skb_do_redirect(struct sk_buff *);
 
 static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
 {
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	return skb->tc_at_ingress;
 #else
 	return false;
diff --git a/include/net/tcx.h b/include/net/tcx.h
new file mode 100644
index 000000000000..27885ecedff9
--- /dev/null
+++ b/include/net/tcx.h
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Isovalent */
+#ifndef __NET_TCX_H
+#define __NET_TCX_H
+
+#include <linux/bpf.h>
+#include <linux/bpf_mprog.h>
+
+#include <net/sch_generic.h>
+
+struct mini_Qdisc;
+
+struct tcx_entry {
+	struct bpf_mprog_bundle		bundle;
+	struct mini_Qdisc __rcu		*miniq;
+};
+
+struct tcx_link {
+	struct bpf_link link;
+	struct net_device *dev;
+	u32 location;
+	u32 flags;
+};
+
+static inline struct tcx_link *tcx_link(struct bpf_link *link)
+{
+	return container_of(link, struct tcx_link, link);
+}
+
+static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link)
+{
+	return tcx_link((struct bpf_link *)link);
+}
+
+static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress)
+{
+#ifdef CONFIG_NET_XGRESS
+	skb->tc_at_ingress = ingress;
+#endif
+}
+
+#ifdef CONFIG_NET_XGRESS
+void tcx_inc(void);
+void tcx_dec(void);
+
+static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry)
+{
+	return container_of(entry->parent, struct tcx_entry, bundle);
+}
+
+static inline void
+tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry, bool ingress)
+{
+	ASSERT_RTNL();
+	if (ingress)
+		rcu_assign_pointer(dev->tcx_ingress, entry);
+	else
+		rcu_assign_pointer(dev->tcx_egress, entry);
+}
+
+static inline struct bpf_mprog_entry *
+dev_tcx_entry_fetch(struct net_device *dev, bool ingress)
+{
+	ASSERT_RTNL();
+	if (ingress)
+		return rcu_dereference_rtnl(dev->tcx_ingress);
+	else
+		return rcu_dereference_rtnl(dev->tcx_egress);
+}
+
+static inline struct bpf_mprog_entry *
+dev_tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created)
+{
+	struct bpf_mprog_entry *entry = dev_tcx_entry_fetch(dev, ingress);
+
+	*created = false;
+	if (!entry) {
+		entry = bpf_mprog_create(sizeof_field(struct tcx_entry,
+						      miniq));
+		if (!entry)
+			return NULL;
+		*created = true;
+	}
+	return entry;
+}
+
+static inline void tcx_skeys_inc(bool ingress)
+{
+	tcx_inc();
+	if (ingress)
+		net_inc_ingress_queue();
+	else
+		net_inc_egress_queue();
+}
+
+static inline void tcx_skeys_dec(bool ingress)
+{
+	if (ingress)
+		net_dec_ingress_queue();
+	else
+		net_dec_egress_queue();
+	tcx_dec();
+}
+
+static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb, int code)
+{
+	switch (code) {
+	case TCX_PASS:
+		skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
+		fallthrough;
+	case TCX_DROP:
+	case TCX_REDIRECT:
+		return code;
+	case TCX_NEXT:
+	default:
+		return TCX_NEXT;
+	}
+}
+#endif /* CONFIG_NET_XGRESS */
+
+#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL)
+int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
+int tcx_prog_query(const union bpf_attr *attr,
+		   union bpf_attr __user *uattr);
+void dev_tcx_uninstall(struct net_device *dev);
+#else
+static inline int tcx_prog_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int tcx_link_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int tcx_prog_detach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int tcx_prog_query(const union bpf_attr *attr,
+				 union bpf_attr __user *uattr)
+{
+	return -EINVAL;
+}
+
+static inline void dev_tcx_uninstall(struct net_device *dev)
+{
+}
+#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */
+#endif /* __NET_TCX_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 207f8a37b327..e7584e24bc83 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1035,6 +1035,8 @@ enum bpf_attach_type {
 	BPF_TRACE_KPROBE_MULTI,
 	BPF_LSM_CGROUP,
 	BPF_STRUCT_OPS,
+	BPF_TCX_INGRESS,
+	BPF_TCX_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1052,7 +1054,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
 	BPF_LINK_TYPE_NETFILTER = 10,
-
+	BPF_LINK_TYPE_TCX = 11,
 	MAX_BPF_LINK_TYPE,
 };
 
@@ -1559,13 +1561,13 @@ union bpf_attr {
 			__u32		map_fd;		/* struct_ops to attach */
 		};
 		union {
-			__u32		target_fd;	/* object to attach to */
-			__u32		target_ifindex; /* target ifindex */
+			__u32	target_fd;	/* target object to attach to or ... */
+			__u32	target_ifindex; /* target ifindex */
 		};
 		__u32		attach_type;	/* attach type */
 		__u32		flags;		/* extra flags */
 		union {
-			__u32		target_btf_id;	/* btf_id of target to attach to */
+			__u32	target_btf_id;	/* btf_id of target to attach to */
 			struct {
 				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
 				__u32		iter_info_len;	/* iter_info length */
@@ -1599,6 +1601,13 @@ union bpf_attr {
 				__s32		priority;
 				__u32		flags;
 			} netfilter;
+			struct {
+				union {
+					__u32	relative_fd;
+					__u32	relative_id;
+				};
+				__u32		expected_revision;
+			} tcx;
 		};
 	} link_create;
 
@@ -6207,6 +6216,19 @@ struct bpf_sock_tuple {
 	};
 };
 
+/* (Simplified) user return codes for tcx prog type.
+ * A valid tcx program must return one of these defined values. All other
+ * return codes are reserved for future use. Must remain compatible with
+ * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
+ * return codes are mapped to TCX_NEXT.
+ */
+enum tcx_action_base {
+	TCX_NEXT	= -1,
+	TCX_PASS	= 0,
+	TCX_DROP	= 2,
+	TCX_REDIRECT	= 7,
+};
+
 struct bpf_xdp_sock {
 	__u32 queue_id;
 };
@@ -6459,6 +6481,11 @@ struct bpf_link_info {
 			__s32 priority;
 			__u32 flags;
 		} netfilter;
+		struct {
+			__u32 ifindex;
+			__u32 attach_type;
+			__u32 flags;
+		} tcx;
 	};
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 2dfe1079f772..6a906ff93006 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -31,6 +31,7 @@ config BPF_SYSCALL
 	select TASKS_TRACE_RCU
 	select BINARY_PRINTF
 	select NET_SOCK_MSG if NET
+	select NET_XGRESS if NET
 	select PAGE_POOL if NET
 	default n
 	help
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1bea2eb912cd..f526b7573e97 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
+obj-$(CONFIG_BPF_SYSCALL) += tcx.o
 endif
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 92a57efc77de..e2c219d053f4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -37,6 +37,8 @@
 #include <linux/trace_events.h>
 #include <net/netfilter/nf_bpf_link.h>
 
+#include <net/tcx.h>
+
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
@@ -3522,31 +3524,57 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_XDP;
 	case BPF_LSM_CGROUP:
 		return BPF_PROG_TYPE_LSM;
+	case BPF_TCX_INGRESS:
+	case BPF_TCX_EGRESS:
+		return BPF_PROG_TYPE_SCHED_CLS;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
 }
 
-#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd
+#define BPF_PROG_ATTACH_LAST_FIELD expected_revision
+
+#define BPF_F_ATTACH_MASK_BASE	\
+	(BPF_F_ALLOW_OVERRIDE |	\
+	 BPF_F_ALLOW_MULTI |	\
+	 BPF_F_REPLACE)
+
+#define BPF_F_ATTACH_MASK_MPROG	\
+	(BPF_F_REPLACE |	\
+	 BPF_F_BEFORE |		\
+	 BPF_F_AFTER |		\
+	 BPF_F_FIRST |		\
+	 BPF_F_LAST |		\
+	 BPF_F_ID |		\
+	 BPF_F_LINK)
 
-#define BPF_F_ATTACH_MASK \
-	(BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE)
+static bool bpf_supports_mprog(enum bpf_prog_type ptype)
+{
+	switch (ptype) {
+	case BPF_PROG_TYPE_SCHED_CLS:
+		return true;
+	default:
+		return false;
+	}
+}
 
 static int bpf_prog_attach(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
 	struct bpf_prog *prog;
+	u32 mask;
 	int ret;
 
 	if (CHECK_ATTR(BPF_PROG_ATTACH))
 		return -EINVAL;
 
-	if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
-		return -EINVAL;
-
 	ptype = attach_type_to_prog_type(attr->attach_type);
 	if (ptype == BPF_PROG_TYPE_UNSPEC)
 		return -EINVAL;
+	mask = bpf_supports_mprog(ptype) ?
+	       BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE;
+	if (attr->attach_flags & ~mask)
+		return -EINVAL;
 
 	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
 	if (IS_ERR(prog))
@@ -3582,6 +3610,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 		else
 			ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_prog_attach(attr, prog);
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -3591,25 +3622,42 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	return ret;
 }
 
-#define BPF_PROG_DETACH_LAST_FIELD attach_type
+#define BPF_PROG_DETACH_LAST_FIELD expected_revision
 
 static int bpf_prog_detach(const union bpf_attr *attr)
 {
+	struct bpf_prog *prog = NULL;
 	enum bpf_prog_type ptype;
+	int ret;
 
 	if (CHECK_ATTR(BPF_PROG_DETACH))
 		return -EINVAL;
 
 	ptype = attach_type_to_prog_type(attr->attach_type);
+	if (bpf_supports_mprog(ptype)) {
+		if (ptype == BPF_PROG_TYPE_UNSPEC)
+			return -EINVAL;
+		if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG)
+			return -EINVAL;
+		prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+		if (IS_ERR(prog)) {
+			if ((int)attr->attach_bpf_fd > 0)
+				return PTR_ERR(prog);
+			prog = NULL;
+		}
+	}
 
 	switch (ptype) {
 	case BPF_PROG_TYPE_SK_MSG:
 	case BPF_PROG_TYPE_SK_SKB:
-		return sock_map_prog_detach(attr, ptype);
+		ret = sock_map_prog_detach(attr, ptype);
+		break;
 	case BPF_PROG_TYPE_LIRC_MODE2:
-		return lirc_prog_detach(attr);
+		ret = lirc_prog_detach(attr);
+		break;
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
-		return netns_bpf_prog_detach(attr, ptype);
+		ret = netns_bpf_prog_detach(attr, ptype);
+		break;
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SKB:
 	case BPF_PROG_TYPE_CGROUP_SOCK:
@@ -3618,13 +3666,21 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_SOCK_OPS:
 	case BPF_PROG_TYPE_LSM:
-		return cgroup_bpf_prog_detach(attr, ptype);
+		ret = cgroup_bpf_prog_detach(attr, ptype);
+		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_prog_detach(attr, prog);
+		break;
 	default:
-		return -EINVAL;
+		ret = -EINVAL;
 	}
+
+	if (prog)
+		bpf_prog_put(prog);
+	return ret;
 }
 
-#define BPF_PROG_QUERY_LAST_FIELD query.prog_attach_flags
+#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags
 
 static int bpf_prog_query(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr)
@@ -3672,6 +3728,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_SK_MSG_VERDICT:
 	case BPF_SK_SKB_VERDICT:
 		return sock_map_bpf_prog_query(attr, uattr);
+	case BPF_TCX_INGRESS:
+	case BPF_TCX_EGRESS:
+		return tcx_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
 	}
@@ -4629,6 +4688,13 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 			goto out;
 		}
 		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		if (attr->link_create.attach_type != BPF_TCX_INGRESS &&
+		    attr->link_create.attach_type != BPF_TCX_EGRESS) {
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
 	default:
 		ptype = attach_type_to_prog_type(attr->link_create.attach_type);
 		if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
@@ -4680,6 +4746,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	case BPF_PROG_TYPE_XDP:
 		ret = bpf_xdp_link_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_link_attach(attr, prog);
+		break;
 	case BPF_PROG_TYPE_NETFILTER:
 		ret = bpf_nf_link_attach(attr, prog);
 		break;
diff --git a/kernel/bpf/tcx.c b/kernel/bpf/tcx.c
new file mode 100644
index 000000000000..d3d23b4ed4f0
--- /dev/null
+++ b/kernel/bpf/tcx.c
@@ -0,0 +1,347 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+
+#include <linux/bpf.h>
+#include <linux/bpf_mprog.h>
+#include <linux/netdevice.h>
+
+#include <net/tcx.h>
+
+int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	bool created, ingress = attr->attach_type == BPF_TCX_INGRESS;
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_mprog_entry *entry;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->target_ifindex);
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+	entry = dev_tcx_entry_fetch_or_create(dev, ingress, &created);
+	if (!entry) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = bpf_mprog_attach(entry, prog, NULL, attr->attach_flags,
+			       attr->relative_fd, attr->expected_revision);
+	if (ret >= 0) {
+		if (ret == BPF_MPROG_SWAP)
+			tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
+		bpf_mprog_commit(entry);
+		tcx_skeys_inc(ingress);
+		ret = 0;
+	} else if (created) {
+		bpf_mprog_free(entry);
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static bool tcx_release_entry(struct bpf_mprog_entry *entry, int code)
+{
+	return code == BPF_MPROG_FREE && !tcx_entry(entry)->miniq;
+}
+
+int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	bool tcx_release, ingress = attr->attach_type == BPF_TCX_INGRESS;
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_mprog_entry *entry, *peer;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->target_ifindex);
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+	entry = dev_tcx_entry_fetch(dev, ingress);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_detach(entry, prog, NULL, attr->attach_flags,
+			       attr->relative_fd, attr->expected_revision);
+	if (ret >= 0) {
+		tcx_release = tcx_release_entry(entry, ret);
+		peer = tcx_release ? NULL : bpf_mprog_peer(entry);
+		if (ret == BPF_MPROG_SWAP || ret == BPF_MPROG_FREE)
+			tcx_entry_update(dev, peer, ingress);
+		bpf_mprog_commit(entry);
+		tcx_skeys_dec(ingress);
+		if (tcx_release)
+			bpf_mprog_free(entry);
+		ret = 0;
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static void tcx_uninstall(struct net_device *dev, bool ingress)
+{
+	struct bpf_tuple tuple = {};
+	struct bpf_mprog_entry *entry;
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+
+	entry = dev_tcx_entry_fetch(dev, ingress);
+	if (!entry)
+		return;
+	tcx_entry_update(dev, NULL, ingress);
+	bpf_mprog_commit(entry);
+	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
+		if (tuple.link)
+			tcx_link(tuple.link)->dev = NULL;
+		else
+			bpf_prog_put(tuple.prog);
+		tcx_skeys_dec(ingress);
+	}
+	WARN_ON_ONCE(tcx_entry(entry)->miniq);
+	bpf_mprog_free(entry);
+}
+
+void dev_tcx_uninstall(struct net_device *dev)
+{
+	ASSERT_RTNL();
+	tcx_uninstall(dev, true);
+	tcx_uninstall(dev, false);
+}
+
+int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+	bool ingress = attr->query.attach_type == BPF_TCX_INGRESS;
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_mprog_entry *entry;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->query.target_ifindex);
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+	entry = dev_tcx_entry_fetch(dev, ingress);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_query(attr, uattr, entry);
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static int tcx_link_prog_attach(struct bpf_link *l, u32 flags, u32 object,
+				u32 expected_revision)
+{
+	struct tcx_link *link = tcx_link(l);
+	bool created, ingress = link->location == BPF_TCX_INGRESS;
+	struct net_device *dev = link->dev;
+	struct bpf_mprog_entry *entry;
+	int ret;
+
+	ASSERT_RTNL();
+	entry = dev_tcx_entry_fetch_or_create(dev, ingress, &created);
+	if (!entry)
+		return -ENOMEM;
+	ret = bpf_mprog_attach(entry, l->prog, l, flags, object,
+			       expected_revision);
+	if (ret >= 0) {
+		if (ret == BPF_MPROG_SWAP)
+			tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
+		bpf_mprog_commit(entry);
+		tcx_skeys_inc(ingress);
+		ret = 0;
+	} else if (created) {
+		bpf_mprog_free(entry);
+	}
+	return ret;
+}
+
+static void tcx_link_release(struct bpf_link *l)
+{
+	struct tcx_link *link = tcx_link(l);
+	bool tcx_release, ingress = link->location == BPF_TCX_INGRESS;
+	struct bpf_mprog_entry *entry, *peer;
+	struct net_device *dev;
+	int ret = 0;
+
+	rtnl_lock();
+	dev = link->dev;
+	if (!dev)
+		goto out;
+	entry = dev_tcx_entry_fetch(dev, ingress);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_detach(entry, l->prog, l, link->flags, 0, 0);
+	if (ret >= 0) {
+		tcx_release = tcx_release_entry(entry, ret);
+		peer = tcx_release ? NULL : bpf_mprog_peer(entry);
+		if (ret == BPF_MPROG_SWAP || ret == BPF_MPROG_FREE)
+			tcx_entry_update(dev, peer, ingress);
+		bpf_mprog_commit(entry);
+		tcx_skeys_dec(ingress);
+		if (tcx_release)
+			bpf_mprog_free(entry);
+		link->dev = NULL;
+		ret = 0;
+	}
+out:
+	WARN_ON_ONCE(ret);
+	rtnl_unlock();
+}
+
+static int tcx_link_update(struct bpf_link *l, struct bpf_prog *nprog,
+			   struct bpf_prog *oprog)
+{
+	struct tcx_link *link = tcx_link(l);
+	bool ingress = link->location == BPF_TCX_INGRESS;
+	struct net_device *dev = link->dev;
+	struct bpf_mprog_entry *entry;
+	int ret = 0;
+
+	rtnl_lock();
+	if (!link->dev) {
+		ret = -ENOLINK;
+		goto out;
+	}
+	if (oprog && l->prog != oprog) {
+		ret = -EPERM;
+		goto out;
+	}
+	oprog = l->prog;
+	if (oprog == nprog) {
+		bpf_prog_put(nprog);
+		goto out;
+	}
+	entry = dev_tcx_entry_fetch(dev, ingress);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_attach(entry, nprog, l,
+			       BPF_F_REPLACE | BPF_F_ID | link->flags,
+			       l->prog->aux->id, 0);
+	if (ret >= 0) {
+		if (ret == BPF_MPROG_SWAP)
+			tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
+		bpf_mprog_commit(entry);
+		tcx_skeys_inc(ingress);
+		oprog = xchg(&l->prog, nprog);
+		bpf_prog_put(oprog);
+		ret = 0;
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static void tcx_link_dealloc(struct bpf_link *l)
+{
+	kfree(tcx_link(l));
+}
+
+static void tcx_link_fdinfo(const struct bpf_link *l, struct seq_file *seq)
+{
+	const struct tcx_link *link = tcx_link_const(l);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (link->dev)
+		ifindex = link->dev->ifindex;
+	rtnl_unlock();
+
+	seq_printf(seq, "ifindex:\t%u\n", ifindex);
+	seq_printf(seq, "attach_type:\t%u (%s)\n",
+		   link->location,
+		   link->location == BPF_TCX_INGRESS ? "ingress" : "egress");
+	seq_printf(seq, "flags:\t%u\n", link->flags);
+}
+
+static int tcx_link_fill_info(const struct bpf_link *l,
+			      struct bpf_link_info *info)
+{
+	const struct tcx_link *link = tcx_link_const(l);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (link->dev)
+		ifindex = link->dev->ifindex;
+	rtnl_unlock();
+
+	info->tcx.ifindex = ifindex;
+	info->tcx.attach_type = link->location;
+	info->tcx.flags = link->flags;
+	return 0;
+}
+
+static int tcx_link_detach(struct bpf_link *l)
+{
+	tcx_link_release(l);
+	return 0;
+}
+
+static const struct bpf_link_ops tcx_link_lops = {
+	.release	= tcx_link_release,
+	.detach		= tcx_link_detach,
+	.dealloc	= tcx_link_dealloc,
+	.update_prog	= tcx_link_update,
+	.show_fdinfo	= tcx_link_fdinfo,
+	.fill_link_info	= tcx_link_fill_info,
+};
+
+int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_link_primer link_primer;
+	struct net_device *dev;
+	struct tcx_link *link;
+	int fd, err;
+
+	dev = dev_get_by_index(net, attr->link_create.target_ifindex);
+	if (!dev)
+		return -EINVAL;
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link) {
+		err = -ENOMEM;
+		goto out_put;
+	}
+
+	bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
+	link->location = attr->link_create.attach_type;
+	link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST);
+	link->dev = dev;
+
+	err = bpf_link_prime(&link->link, &link_primer);
+	if (err) {
+		kfree(link);
+		goto out_put;
+	}
+	rtnl_lock();
+	err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
+				   attr->link_create.tcx.relative_fd,
+				   attr->link_create.tcx.expected_revision);
+	if (!err)
+		fd = bpf_link_settle(&link_primer);
+	rtnl_unlock();
+	if (err) {
+		link->dev = NULL;
+		bpf_link_cleanup(&link_primer);
+		goto out_put;
+	}
+	dev_put(dev);
+	return fd;
+out_put:
+	dev_put(dev);
+	return err;
+}
diff --git a/net/Kconfig b/net/Kconfig
index 2fb25b534df5..d532ec33f1fe 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -52,6 +52,11 @@ config NET_INGRESS
 config NET_EGRESS
 	bool
 
+config NET_XGRESS
+	select NET_INGRESS
+	select NET_EGRESS
+	bool
+
 config NET_REDIRECT
 	bool
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 3393c2f3dbe8..95c7e3189884 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -107,6 +107,7 @@
 #include <net/pkt_cls.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
+#include <net/tcx.h>
 #include <linux/highmem.h>
 #include <linux/init.h>
 #include <linux/module.h>
@@ -154,7 +155,6 @@
 #include "dev.h"
 #include "net-sysfs.h"
 
-
 static DEFINE_SPINLOCK(ptype_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;	/* Taps */
@@ -3923,69 +3923,200 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 EXPORT_SYMBOL(dev_loopback_xmit);
 
 #ifdef CONFIG_NET_EGRESS
-static struct sk_buff *
-sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+static struct netdev_queue *
+netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
+{
+	int qm = skb_get_queue_mapping(skb);
+
+	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
+}
+
+static bool netdev_xmit_txqueue_skipped(void)
 {
+	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+}
+
+void netdev_xmit_skip_txqueue(bool skip)
+{
+	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
+}
+EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
+#endif /* CONFIG_NET_EGRESS */
+
+#ifdef CONFIG_NET_XGRESS
+static int tc_run(struct tcx_entry *entry, struct sk_buff *skb)
+{
+	int ret = TC_ACT_UNSPEC;
 #ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
-	struct tcf_result cl_res;
+	struct mini_Qdisc *miniq = rcu_dereference_bh(entry->miniq);
+	struct tcf_result res;
 
 	if (!miniq)
-		return skb;
+		return ret;
 
-	/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
 	tc_skb_cb(skb)->mru = 0;
 	tc_skb_cb(skb)->post_ct = false;
-	mini_qdisc_bstats_cpu_update(miniq, skb);
 
-	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
+	mini_qdisc_bstats_cpu_update(miniq, skb);
+	ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
+	/* Only tcf related quirks below. */
+	switch (ret) {
+	case TC_ACT_SHOT:
+		mini_qdisc_qstats_cpu_drop(miniq);
+		break;
 	case TC_ACT_OK:
 	case TC_ACT_RECLASSIFY:
-		skb->tc_index = TC_H_MIN(cl_res.classid);
+		skb->tc_index = TC_H_MIN(res.classid);
 		break;
+	}
+#endif /* CONFIG_NET_CLS_ACT */
+	return ret;
+}
+
+static DEFINE_STATIC_KEY_FALSE(tcx_needed_key);
+
+void tcx_inc(void)
+{
+	static_branch_inc(&tcx_needed_key);
+}
+EXPORT_SYMBOL_GPL(tcx_inc);
+
+void tcx_dec(void)
+{
+	static_branch_dec(&tcx_needed_key);
+}
+EXPORT_SYMBOL_GPL(tcx_dec);
+
+static __always_inline enum tcx_action_base
+tcx_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
+	const bool needs_mac)
+{
+	const struct bpf_mprog_fp *fp;
+	const struct bpf_prog *prog;
+	int ret = TCX_NEXT;
+
+	if (needs_mac)
+		__skb_push(skb, skb->mac_len);
+	bpf_mprog_foreach_prog(entry, fp, prog) {
+		bpf_compute_data_pointers(skb);
+		ret = bpf_prog_run(prog, skb);
+		if (ret != TCX_NEXT)
+			break;
+	}
+	if (needs_mac)
+		__skb_pull(skb, skb->mac_len);
+	return tcx_action_code(skb, ret);
+}
+
+static __always_inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev, bool *another)
+{
+	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	int sch_ret;
+
+	if (!entry)
+		return skb;
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	qdisc_skb_cb(skb)->pkt_len = skb->len;
+	tcx_set_ingress(skb, true);
+
+	if (static_branch_unlikely(&tcx_needed_key)) {
+		sch_ret = tcx_run(entry, skb, true);
+		if (sch_ret != TC_ACT_UNSPEC)
+			goto ingress_verdict;
+	}
+	sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb);
+ingress_verdict:
+	switch (sch_ret) {
+	case TC_ACT_REDIRECT:
+		/* skb_mac_header check was done by BPF, so we can safely
+		 * push the L2 header back before redirecting to another
+		 * netdev.
+		 */
+		__skb_push(skb, skb->mac_len);
+		if (skb_do_redirect(skb) == -EAGAIN) {
+			__skb_pull(skb, skb->mac_len);
+			*another = true;
+			break;
+		}
+		*ret = NET_RX_SUCCESS;
+		return NULL;
 	case TC_ACT_SHOT:
-		mini_qdisc_qstats_cpu_drop(miniq);
-		*ret = NET_XMIT_DROP;
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
+		*ret = NET_RX_DROP;
 		return NULL;
+	/* used by tc_run */
 	case TC_ACT_STOLEN:
 	case TC_ACT_QUEUED:
 	case TC_ACT_TRAP:
-		*ret = NET_XMIT_SUCCESS;
 		consume_skb(skb);
+		fallthrough;
+	case TC_ACT_CONSUMED:
+		*ret = NET_RX_SUCCESS;
 		return NULL;
+	}
+
+	return skb;
+}
+
+static __always_inline struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+{
+	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
+	int sch_ret;
+
+	if (!entry)
+		return skb;
+
+	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
+	 * already set by the caller.
+	 */
+	if (static_branch_unlikely(&tcx_needed_key)) {
+		sch_ret = tcx_run(entry, skb, false);
+		if (sch_ret != TC_ACT_UNSPEC)
+			goto egress_verdict;
+	}
+	sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb);
+egress_verdict:
+	switch (sch_ret) {
 	case TC_ACT_REDIRECT:
 		/* No need to push/pop skb's mac_header here on egress! */
 		skb_do_redirect(skb);
 		*ret = NET_XMIT_SUCCESS;
 		return NULL;
-	default:
-		break;
+	case TC_ACT_SHOT:
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
+		*ret = NET_XMIT_DROP;
+		return NULL;
+	/* used by tc_run */
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+	case TC_ACT_TRAP:
+		*ret = NET_XMIT_SUCCESS;
+		return NULL;
 	}
-#endif /* CONFIG_NET_CLS_ACT */
 
 	return skb;
 }
-
-static struct netdev_queue *
-netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
-{
-	int qm = skb_get_queue_mapping(skb);
-
-	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
-}
-
-static bool netdev_xmit_txqueue_skipped(void)
+#else
+static __always_inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev, bool *another)
 {
-	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+	return skb;
 }
 
-void netdev_xmit_skip_txqueue(bool skip)
+static __always_inline struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
-	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
+	return skb;
 }
-EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
-#endif /* CONFIG_NET_EGRESS */
+#endif /* CONFIG_NET_XGRESS */
 
 #ifdef CONFIG_XPS
 static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
@@ -4169,9 +4300,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	skb_update_prio(skb);
 
 	qdisc_pkt_len_init(skb);
-#ifdef CONFIG_NET_CLS_ACT
-	skb->tc_at_ingress = 0;
-#endif
+	tcx_set_ingress(skb, false);
 #ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
 		if (nf_hook_egress_active()) {
@@ -5103,72 +5232,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-static inline struct sk_buff *
-sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
-		   struct net_device *orig_dev, bool *another)
-{
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
-	struct tcf_result cl_res;
-
-	/* If there's at least one ingress present somewhere (so
-	 * we get here via enabled static key), remaining devices
-	 * that are not configured with an ingress qdisc will bail
-	 * out here.
-	 */
-	if (!miniq)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	qdisc_skb_cb(skb)->pkt_len = skb->len;
-	tc_skb_cb(skb)->mru = 0;
-	tc_skb_cb(skb)->post_ct = false;
-	skb->tc_at_ingress = 1;
-	mini_qdisc_bstats_cpu_update(miniq, skb);
-
-	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
-	case TC_ACT_OK:
-	case TC_ACT_RECLASSIFY:
-		skb->tc_index = TC_H_MIN(cl_res.classid);
-		break;
-	case TC_ACT_SHOT:
-		mini_qdisc_qstats_cpu_drop(miniq);
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
-		*ret = NET_RX_DROP;
-		return NULL;
-	case TC_ACT_STOLEN:
-	case TC_ACT_QUEUED:
-	case TC_ACT_TRAP:
-		consume_skb(skb);
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	case TC_ACT_REDIRECT:
-		/* skb_mac_header check was done by cls/act_bpf, so
-		 * we can safely push the L2 header back before
-		 * redirecting to another netdev
-		 */
-		__skb_push(skb, skb->mac_len);
-		if (skb_do_redirect(skb) == -EAGAIN) {
-			__skb_pull(skb, skb->mac_len);
-			*another = true;
-			break;
-		}
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	case TC_ACT_CONSUMED:
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	default:
-		break;
-	}
-#endif /* CONFIG_NET_CLS_ACT */
-	return skb;
-}
-
 /**
  *	netdev_is_rx_handler_busy - check if receive handler is registered
  *	@dev: device to check
@@ -10873,7 +10936,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
 
 		/* Shutdown queueing discipline. */
 		dev_shutdown(dev);
-
+		dev_tcx_uninstall(dev);
 		dev_xdp_uninstall(dev);
 		bpf_dev_bound_netdev_unregister(dev);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index d25d52854c21..1ff9a0988ea6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9233,7 +9233,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
 	__u8 value_reg = si->dst_reg;
 	__u8 skb_reg = si->src_reg;
 
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	/* If the tstamp_type is read,
 	 * the bpf prog is aware the tstamp could have delivery time.
 	 * Thus, read skb->tstamp as is if tstamp_type_access is true.
@@ -9267,7 +9267,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
 	__u8 value_reg = si->src_reg;
 	__u8 skb_reg = si->dst_reg;
 
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	/* If the tstamp_type is read,
 	 * the bpf prog is aware the tstamp could have delivery time.
 	 * Thus, write skb->tstamp as is if tstamp_type_access is true.
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 4b95cb1ac435..470c70deffe2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -347,8 +347,7 @@ config NET_SCH_FQ_PIE
 config NET_SCH_INGRESS
 	tristate "Ingress/classifier-action Qdisc"
 	depends on NET_CLS_ACT
-	select NET_INGRESS
-	select NET_EGRESS
+	select NET_XGRESS
 	help
 	  Say Y here if you want to use classifiers for incoming and/or outgoing
 	  packets. This qdisc doesn't do anything else besides running classifiers,
@@ -679,6 +678,7 @@ config NET_EMATCH_IPT
 config NET_CLS_ACT
 	bool "Actions"
 	select NET_CLS
+	select NET_XGRESS
 	help
 	  Say Y here if you want to use traffic control actions. Actions
 	  get attached to classifiers and are invoked after a successful
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 84838128b9c5..4af1360f537e 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -13,6 +13,7 @@
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
+#include <net/tcx.h>
 
 struct ingress_sched_data {
 	struct tcf_block *block;
@@ -78,11 +79,18 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry;
+	bool created;
 	int err;
 
 	net_inc_ingress_queue();
 
-	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
+	entry = dev_tcx_entry_fetch_or_create(dev, true, &created);
+	if (!entry)
+		return -ENOMEM;
+	mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, true);
 
 	q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	q->block_info.chain_head_change = clsact_chain_head_change;
@@ -93,15 +101,20 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 		return err;
 
 	mini_qdisc_pair_block_init(&q->miniqp, q->block);
-
 	return 0;
 }
 
 static void ingress_destroy(struct Qdisc *sch)
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry = rtnl_dereference(dev->tcx_ingress);
 
 	tcf_block_put_ext(q->block, sch, &q->block_info);
+	if (entry && !bpf_mprog_total(entry)) {
+		tcx_entry_update(dev, NULL, true);
+		bpf_mprog_free(entry);
+	}
 	net_dec_ingress_queue();
 }
 
@@ -217,12 +230,19 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry;
+	bool created;
 	int err;
 
 	net_inc_ingress_queue();
 	net_inc_egress_queue();
 
-	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
+	entry = dev_tcx_entry_fetch_or_create(dev, true, &created);
+	if (!entry)
+		return -ENOMEM;
+	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, true);
 
 	q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	q->ingress_block_info.chain_head_change = clsact_chain_head_change;
@@ -235,7 +255,12 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 
 	mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
 
-	mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
+	entry = dev_tcx_entry_fetch_or_create(dev, false, &created);
+	if (!entry)
+		return -ENOMEM;
+	mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, false);
 
 	q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
 	q->egress_block_info.chain_head_change = clsact_chain_head_change;
@@ -247,9 +272,21 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 static void clsact_destroy(struct Qdisc *sch)
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *ingress_entry = rtnl_dereference(dev->tcx_ingress);
+	struct bpf_mprog_entry *egress_entry = rtnl_dereference(dev->tcx_egress);
 
 	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
+	if (egress_entry && !bpf_mprog_total(egress_entry)) {
+		tcx_entry_update(dev, NULL, false);
+		bpf_mprog_free(egress_entry);
+	}
+
 	tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
+	if (ingress_entry && !bpf_mprog_total(ingress_entry)) {
+		tcx_entry_update(dev, NULL, true);
+		bpf_mprog_free(ingress_entry);
+	}
 
 	net_dec_ingress_queue();
 	net_dec_egress_queue();
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 207f8a37b327..e7584e24bc83 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1035,6 +1035,8 @@ enum bpf_attach_type {
 	BPF_TRACE_KPROBE_MULTI,
 	BPF_LSM_CGROUP,
 	BPF_STRUCT_OPS,
+	BPF_TCX_INGRESS,
+	BPF_TCX_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1052,7 +1054,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
 	BPF_LINK_TYPE_NETFILTER = 10,
-
+	BPF_LINK_TYPE_TCX = 11,
 	MAX_BPF_LINK_TYPE,
 };
 
@@ -1559,13 +1561,13 @@ union bpf_attr {
 			__u32		map_fd;		/* struct_ops to attach */
 		};
 		union {
-			__u32		target_fd;	/* object to attach to */
-			__u32		target_ifindex; /* target ifindex */
+			__u32	target_fd;	/* target object to attach to or ... */
+			__u32	target_ifindex; /* target ifindex */
 		};
 		__u32		attach_type;	/* attach type */
 		__u32		flags;		/* extra flags */
 		union {
-			__u32		target_btf_id;	/* btf_id of target to attach to */
+			__u32	target_btf_id;	/* btf_id of target to attach to */
 			struct {
 				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
 				__u32		iter_info_len;	/* iter_info length */
@@ -1599,6 +1601,13 @@ union bpf_attr {
 				__s32		priority;
 				__u32		flags;
 			} netfilter;
+			struct {
+				union {
+					__u32	relative_fd;
+					__u32	relative_id;
+				};
+				__u32		expected_revision;
+			} tcx;
 		};
 	} link_create;
 
@@ -6207,6 +6216,19 @@ struct bpf_sock_tuple {
 	};
 };
 
+/* (Simplified) user return codes for tcx prog type.
+ * A valid tcx program must return one of these defined values. All other
+ * return codes are reserved for future use. Must remain compatible with
+ * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
+ * return codes are mapped to TCX_NEXT.
+ */
+enum tcx_action_base {
+	TCX_NEXT	= -1,
+	TCX_PASS	= 0,
+	TCX_DROP	= 2,
+	TCX_REDIRECT	= 7,
+};
+
 struct bpf_xdp_sock {
 	__u32 queue_id;
 };
@@ -6459,6 +6481,11 @@ struct bpf_link_info {
 			__s32 priority;
 			__u32 flags;
 		} netfilter;
+		struct {
+			__u32 ifindex;
+			__u32 attach_type;
+			__u32 flags;
+		} tcx;
 	};
 } __attribute__((aligned(8)));
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH bpf-next v2 3/7] libbpf: Add opts-based attach/detach/query API for tcx
  2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
  2023-06-07 19:26 ` [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs Daniel Borkmann
  2023-06-07 19:26 ` [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support Daniel Borkmann
@ 2023-06-07 19:26 ` Daniel Borkmann
  2023-06-08 21:37   ` Andrii Nakryiko
  2023-06-07 19:26 ` [PATCH bpf-next v2 4/7] libbpf: Add link-based " Daniel Borkmann
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-07 19:26 UTC (permalink / raw)
  To: ast
  Cc: andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev, Daniel Borkmann

Extend the libbpf attach opts and add a new detach opts API so that they can
be used to add/remove fd-based tcx BPF programs. The old-style bpf_prog_detach()
and bpf_prog_detach2() APIs are refactored to reuse the new detach opts
internally.

The bpf_prog_query_opts API has been extended to handle the new link_ids,
link_attach_flags and revision fields.

For concrete usage examples, see the extensive selftests that have been
developed as part of this series.
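
As a quick orientation in addition to the selftests, a rough sketch of the
opts flow (not part of this patch; ifindex and prog_fd below are placeholders
for the caller's target netdevice and SCHED_CLS program fd):

  LIBBPF_OPTS(bpf_prog_attach_opts, opta);
  LIBBPF_OPTS(bpf_prog_detach_opts, optd);
  LIBBPF_OPTS(bpf_prog_query_opts,  optq);
  __u32 prog_ids[16] = {};
  int err;

  /* Append prog_fd to the tcx ingress list of the device. */
  err = bpf_prog_attach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &opta);
  if (err)
          return err;

  /* Dump the resulting list together with its revision. */
  optq.prog_ids = prog_ids;
  optq.count = 16;
  err = bpf_prog_query_opts(ifindex, BPF_TCX_INGRESS, &optq);
  if (err)
          return err;

  /* Detach again; expected_revision guards against concurrent changes. */
  optd.expected_revision = optq.revision;
  err = bpf_prog_detach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &optd);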

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/bpf.c      | 78 ++++++++++++++++++++++------------------
 tools/lib/bpf/bpf.h      | 54 +++++++++++++++++++++-------
 tools/lib/bpf/libbpf.c   |  6 ++++
 tools/lib/bpf/libbpf.map |  1 +
 4 files changed, 91 insertions(+), 48 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index ed86b37d8024..a3d1b7ebe224 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -629,11 +629,21 @@ int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
 	return bpf_prog_attach_opts(prog_fd, target_fd, type, &opts);
 }
 
-int bpf_prog_attach_opts(int prog_fd, int target_fd,
-			  enum bpf_attach_type type,
-			  const struct bpf_prog_attach_opts *opts)
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+	return bpf_prog_detach_opts(0, target_fd, type, NULL);
+}
+
+int bpf_prog_detach2(int prog_fd, int target_fd, enum bpf_attach_type type)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, replace_bpf_fd);
+	return bpf_prog_detach_opts(prog_fd, target_fd, type, NULL);
+}
+
+int bpf_prog_attach_opts(int prog_fd, int target,
+			 enum bpf_attach_type type,
+			 const struct bpf_prog_attach_opts *opts)
+{
+	const size_t attr_sz = offsetofend(union bpf_attr, expected_revision);
 	union bpf_attr attr;
 	int ret;
 
@@ -641,40 +651,35 @@ int bpf_prog_attach_opts(int prog_fd, int target_fd,
 		return libbpf_err(-EINVAL);
 
 	memset(&attr, 0, attr_sz);
-	attr.target_fd	   = target_fd;
-	attr.attach_bpf_fd = prog_fd;
-	attr.attach_type   = type;
-	attr.attach_flags  = OPTS_GET(opts, flags, 0);
-	attr.replace_bpf_fd = OPTS_GET(opts, replace_prog_fd, 0);
+	attr.target_fd		= target;
+	attr.attach_bpf_fd	= prog_fd;
+	attr.attach_type	= type;
+	attr.attach_flags	= OPTS_GET(opts, flags, 0);
+	attr.replace_bpf_fd	= OPTS_GET(opts, relative_fd, 0);
+	attr.expected_revision	= OPTS_GET(opts, expected_revision, 0);
 
 	ret = sys_bpf(BPF_PROG_ATTACH, &attr, attr_sz);
 	return libbpf_err_errno(ret);
 }
 
-int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+int bpf_prog_detach_opts(int prog_fd, int target,
+			 enum bpf_attach_type type,
+			 const struct bpf_prog_detach_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, replace_bpf_fd);
+	const size_t attr_sz = offsetofend(union bpf_attr, expected_revision);
 	union bpf_attr attr;
 	int ret;
 
-	memset(&attr, 0, attr_sz);
-	attr.target_fd	 = target_fd;
-	attr.attach_type = type;
-
-	ret = sys_bpf(BPF_PROG_DETACH, &attr, attr_sz);
-	return libbpf_err_errno(ret);
-}
-
-int bpf_prog_detach2(int prog_fd, int target_fd, enum bpf_attach_type type)
-{
-	const size_t attr_sz = offsetofend(union bpf_attr, replace_bpf_fd);
-	union bpf_attr attr;
-	int ret;
+	if (!OPTS_VALID(opts, bpf_prog_detach_opts))
+		return libbpf_err(-EINVAL);
 
 	memset(&attr, 0, attr_sz);
-	attr.target_fd	 = target_fd;
-	attr.attach_bpf_fd = prog_fd;
-	attr.attach_type = type;
+	attr.target_fd		= target;
+	attr.attach_bpf_fd	= prog_fd;
+	attr.attach_type	= type;
+	attr.attach_flags	= OPTS_GET(opts, flags, 0);
+	attr.replace_bpf_fd	= OPTS_GET(opts, relative_fd, 0);
+	attr.expected_revision	= OPTS_GET(opts, expected_revision, 0);
 
 	ret = sys_bpf(BPF_PROG_DETACH, &attr, attr_sz);
 	return libbpf_err_errno(ret);
@@ -833,7 +838,7 @@ int bpf_iter_create(int link_fd)
 	return libbpf_err_errno(fd);
 }
 
-int bpf_prog_query_opts(int target_fd,
+int bpf_prog_query_opts(int target,
 			enum bpf_attach_type type,
 			struct bpf_prog_query_opts *opts)
 {
@@ -846,17 +851,20 @@ int bpf_prog_query_opts(int target_fd,
 
 	memset(&attr, 0, attr_sz);
 
-	attr.query.target_fd	= target_fd;
-	attr.query.attach_type	= type;
-	attr.query.query_flags	= OPTS_GET(opts, query_flags, 0);
-	attr.query.prog_cnt	= OPTS_GET(opts, prog_cnt, 0);
-	attr.query.prog_ids	= ptr_to_u64(OPTS_GET(opts, prog_ids, NULL));
-	attr.query.prog_attach_flags = ptr_to_u64(OPTS_GET(opts, prog_attach_flags, NULL));
+	attr.query.target_fd		= target;
+	attr.query.attach_type		= type;
+	attr.query.query_flags		= OPTS_GET(opts, query_flags, 0);
+	attr.query.count		= OPTS_GET(opts, count, 0);
+	attr.query.prog_ids		= ptr_to_u64(OPTS_GET(opts, prog_ids, NULL));
+	attr.query.prog_attach_flags	= ptr_to_u64(OPTS_GET(opts, prog_attach_flags, NULL));
+	attr.query.link_ids		= ptr_to_u64(OPTS_GET(opts, link_ids, NULL));
+	attr.query.link_attach_flags	= ptr_to_u64(OPTS_GET(opts, link_attach_flags, NULL));
 
 	ret = sys_bpf(BPF_PROG_QUERY, &attr, attr_sz);
 
 	OPTS_SET(opts, attach_flags, attr.query.attach_flags);
-	OPTS_SET(opts, prog_cnt, attr.query.prog_cnt);
+	OPTS_SET(opts, revision, attr.query.revision);
+	OPTS_SET(opts, count, attr.query.count);
 
 	return libbpf_err_errno(ret);
 }
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9aa0ee473754..480c584a6f7f 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -312,22 +312,43 @@ LIBBPF_API int bpf_obj_get(const char *pathname);
 LIBBPF_API int bpf_obj_get_opts(const char *pathname,
 				const struct bpf_obj_get_opts *opts);
 
-struct bpf_prog_attach_opts {
-	size_t sz; /* size of this struct for forward/backward compatibility */
-	unsigned int flags;
-	int replace_prog_fd;
-};
-#define bpf_prog_attach_opts__last_field replace_prog_fd
-
 LIBBPF_API int bpf_prog_attach(int prog_fd, int attachable_fd,
 			       enum bpf_attach_type type, unsigned int flags);
-LIBBPF_API int bpf_prog_attach_opts(int prog_fd, int attachable_fd,
-				     enum bpf_attach_type type,
-				     const struct bpf_prog_attach_opts *opts);
 LIBBPF_API int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
 LIBBPF_API int bpf_prog_detach2(int prog_fd, int attachable_fd,
 				enum bpf_attach_type type);
 
+struct bpf_prog_attach_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+	__u32 flags;
+	union {
+		int	replace_prog_fd;
+		int	replace_fd;
+		int	relative_fd;
+		__u32	relative_id;
+	};
+	__u32 expected_revision;
+};
+#define bpf_prog_attach_opts__last_field expected_revision
+
+struct bpf_prog_detach_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+	__u32 flags;
+	union {
+		int	relative_fd;
+		__u32	relative_id;
+	};
+	__u32 expected_revision;
+};
+#define bpf_prog_detach_opts__last_field expected_revision
+
+LIBBPF_API int bpf_prog_attach_opts(int prog_fd, int target,
+				    enum bpf_attach_type type,
+				    const struct bpf_prog_attach_opts *opts);
+LIBBPF_API int bpf_prog_detach_opts(int prog_fd, int target,
+				    enum bpf_attach_type type,
+				    const struct bpf_prog_detach_opts *opts);
+
 union bpf_iter_link_info; /* defined in up-to-date linux/bpf.h */
 struct bpf_link_create_opts {
 	size_t sz; /* size of this struct for forward/backward compatibility */
@@ -489,14 +510,21 @@ struct bpf_prog_query_opts {
 	__u32 query_flags;
 	__u32 attach_flags; /* output argument */
 	__u32 *prog_ids;
-	__u32 prog_cnt; /* input+output argument */
+	union {
+		__u32 prog_cnt; /* input+output argument */
+		__u32 count;
+	};
 	__u32 *prog_attach_flags;
+	__u32 *link_ids;
+	__u32 *link_attach_flags;
+	__u32 revision;
 };
-#define bpf_prog_query_opts__last_field prog_attach_flags
+#define bpf_prog_query_opts__last_field revision
 
-LIBBPF_API int bpf_prog_query_opts(int target_fd,
+LIBBPF_API int bpf_prog_query_opts(int target,
 				   enum bpf_attach_type type,
 				   struct bpf_prog_query_opts *opts);
+
 LIBBPF_API int bpf_prog_query(int target_fd, enum bpf_attach_type type,
 			      __u32 query_flags, __u32 *attach_flags,
 			      __u32 *prog_ids, __u32 *prog_cnt);
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 47632606b06d..b89127471c6a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -117,6 +117,8 @@ static const char * const attach_type_name[] = {
 	[BPF_PERF_EVENT]		= "perf_event",
 	[BPF_TRACE_KPROBE_MULTI]	= "trace_kprobe_multi",
 	[BPF_STRUCT_OPS]		= "struct_ops",
+	[BPF_TCX_INGRESS]		= "tcx_ingress",
+	[BPF_TCX_EGRESS]		= "tcx_egress",
 };
 
 static const char * const link_type_name[] = {
@@ -8669,6 +8671,10 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("kretsyscall+",		KPROBE, 0, SEC_NONE, attach_ksyscall),
 	SEC_DEF("usdt+",		KPROBE,	0, SEC_NONE, attach_usdt),
 	SEC_DEF("tc",			SCHED_CLS, 0, SEC_NONE),
+	SEC_DEF("tc/ingress",		SCHED_CLS, BPF_TCX_INGRESS, SEC_ATTACHABLE_OPT),
+	SEC_DEF("tc/egress",		SCHED_CLS, BPF_TCX_EGRESS, SEC_ATTACHABLE_OPT),
+	SEC_DEF("tcx/ingress",		SCHED_CLS, BPF_TCX_INGRESS, SEC_ATTACHABLE_OPT),
+	SEC_DEF("tcx/egress",		SCHED_CLS, BPF_TCX_EGRESS, SEC_ATTACHABLE_OPT),
 	SEC_DEF("classifier",		SCHED_CLS, 0, SEC_NONE),
 	SEC_DEF("action",		SCHED_ACT, 0, SEC_NONE),
 	SEC_DEF("tracepoint+",		TRACEPOINT, 0, SEC_NONE, attach_tp),
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7521a2fb7626..a29b90e9713c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -395,4 +395,5 @@ LIBBPF_1.2.0 {
 LIBBPF_1.3.0 {
 	global:
 		bpf_obj_pin_opts;
+		bpf_prog_detach_opts;
 } LIBBPF_1.2.0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH bpf-next v2 4/7] libbpf: Add link-based API for tcx
  2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
                   ` (2 preceding siblings ...)
  2023-06-07 19:26 ` [PATCH bpf-next v2 3/7] libbpf: Add opts-based attach/detach/query API for tcx Daniel Borkmann
@ 2023-06-07 19:26 ` Daniel Borkmann
  2023-06-08 21:45   ` Andrii Nakryiko
  2023-06-07 19:26 ` [PATCH bpf-next v2 5/7] bpftool: Extend net dump with tcx progs Daniel Borkmann
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-07 19:26 UTC (permalink / raw)
  To: ast
  Cc: andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev, Daniel Borkmann

Implement tcx BPF link support for libbpf.

The bpf_program__attach_fd_opts() API has been refactored slightly in order to
pass a bpf_link_create_opts pointer as input.

A new bpf_program__attach_tcx_opts() API has been added on top of this, which
allows passing all relevant data via the extensible struct bpf_tcx_opts.

The program sections tcx/ingress and tcx/egress correspond to the hook locations
for tc ingress and egress, respectively.

For concrete usage examples, see the extensive selftests that have been
developed as part of this series.
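
As a quick orientation in addition to the selftests, a minimal sketch of how
the new program section and attach API play together (not part of this patch;
the skeleton handle, program name and ifindex are placeholders):

  /* BPF side: */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("tcx/ingress")
  int tc_dummy(struct __sk_buff *skb)
  {
          return TCX_PASS;
  }

  char LICENSE[] SEC("license") = "GPL";

  /* Loader side: */
  LIBBPF_OPTS(bpf_tcx_opts, opts,
          .ifindex = ifindex,
  );
  struct bpf_link *link;

  link = bpf_program__attach_tcx_opts(skel->progs.tc_dummy, &opts);
  if (!link)
          return -errno;
  /* ... */
  bpf_link__destroy(link);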

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/bpf.c      |  5 +++++
 tools/lib/bpf/bpf.h      |  7 +++++++
 tools/lib/bpf/libbpf.c   | 44 +++++++++++++++++++++++++++++++++++-----
 tools/lib/bpf/libbpf.h   | 17 ++++++++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 5 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index a3d1b7ebe224..c340d3cbc6bd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -746,6 +746,11 @@ int bpf_link_create(int prog_fd, int target_fd,
 		if (!OPTS_ZEROED(opts, tracing))
 			return libbpf_err(-EINVAL);
 		break;
+	case BPF_TCX_INGRESS:
+	case BPF_TCX_EGRESS:
+		attr.link_create.tcx.relative_fd = OPTS_GET(opts, tcx.relative_fd, 0);
+		attr.link_create.tcx.expected_revision = OPTS_GET(opts, tcx.expected_revision, 0);
+		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 480c584a6f7f..12591516dca0 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -370,6 +370,13 @@ struct bpf_link_create_opts {
 		struct {
 			__u64 cookie;
 		} tracing;
+		struct {
+			union {
+				__u32 relative_fd;
+				__u32 relative_id;
+			};
+			__u32 expected_revision;
+		} tcx;
 	};
 	size_t :0;
 };
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index b89127471c6a..d7b6ff49f02e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -133,6 +133,7 @@ static const char * const link_type_name[] = {
 	[BPF_LINK_TYPE_KPROBE_MULTI]		= "kprobe_multi",
 	[BPF_LINK_TYPE_STRUCT_OPS]		= "struct_ops",
 	[BPF_LINK_TYPE_NETFILTER]		= "netfilter",
+	[BPF_LINK_TYPE_TCX]			= "tcx",
 };
 
 static const char * const map_type_name[] = {
@@ -11685,11 +11686,10 @@ static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_li
 }
 
 static struct bpf_link *
-bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
-		       const char *target_name)
+bpf_program__attach_fd_opts(const struct bpf_program *prog,
+			    const struct bpf_link_create_opts *opts,
+			    int target_fd, const char *target_name)
 {
-	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
-			    .target_btf_id = btf_id);
 	enum bpf_attach_type attach_type;
 	char errmsg[STRERR_BUFSIZE];
 	struct bpf_link *link;
@@ -11707,7 +11707,7 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
 	link->detach = &bpf_link__detach_fd;
 
 	attach_type = bpf_program__expected_attach_type(prog);
-	link_fd = bpf_link_create(prog_fd, target_fd, attach_type, &opts);
+	link_fd = bpf_link_create(prog_fd, target_fd, attach_type, opts);
 	if (link_fd < 0) {
 		link_fd = -errno;
 		free(link);
@@ -11720,6 +11720,17 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
 	return link;
 }
 
+static struct bpf_link *
+bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
+		       const char *target_name)
+{
+	LIBBPF_OPTS(bpf_link_create_opts, opts,
+		.target_btf_id = btf_id,
+	);
+
+	return bpf_program__attach_fd_opts(prog, &opts, target_fd, target_name);
+}
+
 struct bpf_link *
 bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd)
 {
@@ -11738,6 +11749,29 @@ struct bpf_link *bpf_program__attach_xdp(const struct bpf_program *prog, int ifi
 	return bpf_program__attach_fd(prog, ifindex, 0, "xdp");
 }
 
+struct bpf_link *
+bpf_program__attach_tcx_opts(const struct bpf_program *prog,
+			     const struct bpf_tcx_opts *opts)
+{
+	LIBBPF_OPTS(bpf_link_create_opts, link_create_opts);
+	int ifindex = OPTS_GET(opts, ifindex, 0);
+
+	if (!OPTS_VALID(opts, bpf_tcx_opts))
+		return libbpf_err_ptr(-EINVAL);
+	if (!ifindex) {
+		pr_warn("prog '%s': target netdevice ifindex cannot be zero\n",
+			prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+
+	link_create_opts.tcx.expected_revision = OPTS_GET(opts, expected_revision, 0);
+	link_create_opts.tcx.relative_fd = OPTS_GET(opts, relative_fd, 0);
+	link_create_opts.flags = OPTS_GET(opts, flags, 0);
+
+	/* target_fd/target_ifindex use the same field in LINK_CREATE */
+	return bpf_program__attach_fd_opts(prog, &link_create_opts, ifindex, "tc");
+}
+
 struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
 					      int target_fd,
 					      const char *attach_func_name)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 754da73c643b..8ffba0f67c60 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -718,6 +718,23 @@ LIBBPF_API struct bpf_link *
 bpf_program__attach_freplace(const struct bpf_program *prog,
 			     int target_fd, const char *attach_func_name);
 
+struct bpf_tcx_opts {
+	/* size of this struct, for forward/backward compatibility */
+	size_t sz;
+	int ifindex;
+	__u32 flags;
+	union {
+		__u32 relative_fd;
+		__u32 relative_id;
+	};
+	__u32 expected_revision;
+};
+#define bpf_tcx_opts__last_field expected_revision
+
+LIBBPF_API struct bpf_link *
+bpf_program__attach_tcx_opts(const struct bpf_program *prog,
+			     const struct bpf_tcx_opts *opts);
+
 struct bpf_map;
 
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index a29b90e9713c..f66b714512c2 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -396,4 +396,5 @@ LIBBPF_1.3.0 {
 	global:
 		bpf_obj_pin_opts;
 		bpf_prog_detach_opts;
+		bpf_program__attach_tcx_opts;
 } LIBBPF_1.2.0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH bpf-next v2 5/7] bpftool: Extend net dump with tcx progs
  2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
                   ` (3 preceding siblings ...)
  2023-06-07 19:26 ` [PATCH bpf-next v2 4/7] libbpf: Add link-based " Daniel Borkmann
@ 2023-06-07 19:26 ` Daniel Borkmann
  2023-06-07 19:26 ` [PATCH bpf-next v2 6/7] selftests/bpf: Add mprog API tests for BPF tcx opts Daniel Borkmann
  2023-06-07 19:26 ` [PATCH bpf-next v2 7/7] selftests/bpf: Add mprog API tests for BPF tcx links Daniel Borkmann
  6 siblings, 0 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-07 19:26 UTC (permalink / raw)
  To: ast
  Cc: andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev, Daniel Borkmann

Add support for dumping fd-based attach types via bpftool. This includes both
the tc BPF link and attach ops programs. The dumped information contains the
attach location, function entry name, program ID and link ID when applicable.

Example with tc BPF link:

  # ./bpftool net
  xdp:

  tc:
  bond0(4) bpf/ingress cil_from_netdev prog id 784 link id 10
  bond0(4) bpf/egress cil_to_netdev prog id 804 link id 11

  flow_dissector:

  netfilter:

Example with tc BPF attach ops:

  # ./bpftool net
  xdp:

  tc:
  bond0(4) bpf/ingress cil_from_netdev prog id 654
  bond0(4) bpf/egress cil_to_netdev prog id 672

  flow_dissector:

  netfilter:

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/bpftool/net.c | 92 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 88 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index 26a49965bf71..b23b346b3ae9 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -76,6 +76,11 @@ static const char * const attach_type_strings[] = {
 	[NET_ATTACH_TYPE_XDP_OFFLOAD]	= "xdpoffload",
 };
 
+static const char * const attach_loc_strings[] = {
+	[BPF_TCX_INGRESS]		= "bpf/ingress",
+	[BPF_TCX_EGRESS]		= "bpf/egress",
+};
+
 const size_t net_attach_type_size = ARRAY_SIZE(attach_type_strings);
 
 static enum net_attach_type parse_attach_type(const char *str)
@@ -422,8 +427,86 @@ static int dump_filter_nlmsg(void *cookie, void *msg, struct nlattr **tb)
 			      filter_info->devname, filter_info->ifindex);
 }
 
-static int show_dev_tc_bpf(int sock, unsigned int nl_pid,
-			   struct ip_devname_ifindex *dev)
+static const char *flags_strings(__u32 flags)
+{
+	if (flags == (BPF_F_FIRST | BPF_F_LAST))
+		return json_output ? "first,last" : " first last";
+	if (flags & BPF_F_FIRST)
+		return json_output ? "first" : " first";
+	if (flags & BPF_F_LAST)
+		return json_output ? "last" : " last";
+	return json_output ? "none" : "";
+}
+
+static int __show_dev_tc_bpf_name(__u32 id, char *name, size_t len)
+{
+	struct bpf_prog_info info = {};
+	__u32 ilen = sizeof(info);
+	int fd, ret;
+
+	fd = bpf_prog_get_fd_by_id(id);
+	if (fd < 0)
+		return fd;
+	ret = bpf_obj_get_info_by_fd(fd, &info, &ilen);
+	if (ret < 0)
+		goto out;
+	ret = -ENOENT;
+	if (info.name[0]) {
+		get_prog_full_name(&info, fd, name, len);
+		ret = 0;
+	}
+out:
+	close(fd);
+	return ret;
+}
+
+static void __show_dev_tc_bpf(const struct ip_devname_ifindex *dev,
+			      const enum bpf_attach_type loc)
+{
+	__u32 prog_flags[64] = {}, link_flags[64] = {}, i;
+	__u32 prog_ids[64] = {}, link_ids[64] = {};
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	char prog_name[MAX_PROG_FULL_NAME];
+	int ret;
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	ret = bpf_prog_query_opts(dev->ifindex, loc, &optq);
+	if (ret)
+		return;
+	for (i = 0; i < optq.count; i++) {
+		NET_START_OBJECT;
+		NET_DUMP_STR("devname", "%s", dev->devname);
+		NET_DUMP_UINT("ifindex", "(%u)", dev->ifindex);
+		NET_DUMP_STR("kind", " %s", attach_loc_strings[loc]);
+		ret = __show_dev_tc_bpf_name(prog_ids[i], prog_name,
+					     sizeof(prog_name));
+		if (!ret)
+			NET_DUMP_STR("name", " %s", prog_name);
+		NET_DUMP_UINT("prog_id", " prog id %u", prog_ids[i]);
+		if (prog_flags[i])
+			NET_DUMP_STR("prog_flags", "%s", flags_strings(prog_flags[i]));
+		if (link_ids[i])
+			NET_DUMP_UINT("link_id", " link id %u",
+				      link_ids[i]);
+		if (link_flags[i])
+			NET_DUMP_STR("link_flags", "%s", flags_strings(link_flags[i]));
+		NET_END_OBJECT_FINAL;
+	}
+}
+
+static void show_dev_tc_bpf(struct ip_devname_ifindex *dev)
+{
+	__show_dev_tc_bpf(dev, BPF_TCX_INGRESS);
+	__show_dev_tc_bpf(dev, BPF_TCX_EGRESS);
+}
+
+static int show_dev_tc_bpf_classic(int sock, unsigned int nl_pid,
+				   struct ip_devname_ifindex *dev)
 {
 	struct bpf_filter_t filter_info;
 	struct bpf_tcinfo_t tcinfo;
@@ -790,8 +873,9 @@ static int do_show(int argc, char **argv)
 	if (!ret) {
 		NET_START_ARRAY("tc", "%s:\n");
 		for (i = 0; i < dev_array.used_len; i++) {
-			ret = show_dev_tc_bpf(sock, nl_pid,
-					      &dev_array.devices[i]);
+			show_dev_tc_bpf(&dev_array.devices[i]);
+			ret = show_dev_tc_bpf_classic(sock, nl_pid,
+						      &dev_array.devices[i]);
 			if (ret)
 				break;
 		}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH bpf-next v2 6/7] selftests/bpf: Add mprog API tests for BPF tcx opts
  2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
                   ` (4 preceding siblings ...)
  2023-06-07 19:26 ` [PATCH bpf-next v2 5/7] bpftool: Extend net dump with tcx progs Daniel Borkmann
@ 2023-06-07 19:26 ` Daniel Borkmann
  2023-06-07 19:26 ` [PATCH bpf-next v2 7/7] selftests/bpf: Add mprog API tests for BPF tcx links Daniel Borkmann
  6 siblings, 0 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-07 19:26 UTC (permalink / raw)
  To: ast
  Cc: andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev, Daniel Borkmann

Add a big batch of test coverage to exercise all aspects of the tcx opts
attach, detach and query API:

  # ./vmtest.sh -- ./test_progs -t tc_opts
  [...]
  #237     tc_opts_after:OK
  #238     tc_opts_append:OK
  #239     tc_opts_basic:OK
  #240     tc_opts_before:OK
  #241     tc_opts_both:OK
  #242     tc_opts_chain_classic:OK
  #243     tc_opts_demixed:OK
  #244     tc_opts_detach:OK
  #245     tc_opts_detach_after:OK
  #246     tc_opts_detach_before:OK
  #247     tc_opts_dev_cleanup:OK
  #248     tc_opts_first:OK
  #249     tc_opts_invalid:OK
  #250     tc_opts_last:OK
  #251     tc_opts_mixed:OK
  #252     tc_opts_prepend:OK
  #253     tc_opts_replace:OK
  #254     tc_opts_revision:OK
  Summary: 18/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 .../selftests/bpf/prog_tests/tc_helpers.h     |   72 +
 .../selftests/bpf/prog_tests/tc_opts.c        | 2698 +++++++++++++++++
 .../selftests/bpf/progs/test_tc_link.c        |   40 +
 3 files changed, 2810 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_opts.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_link.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tc_helpers.h b/tools/testing/selftests/bpf/prog_tests/tc_helpers.h
new file mode 100644
index 000000000000..6c93215be8a3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tc_helpers.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Isovalent */
+#ifndef TC_HELPERS
+#define TC_HELPERS
+#include <test_progs.h>
+
+static inline __u32 id_from_prog_fd(int fd)
+{
+	struct bpf_prog_info prog_info = {};
+	__u32 prog_info_len = sizeof(prog_info);
+	int err;
+
+	err = bpf_obj_get_info_by_fd(fd, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "id_from_prog_fd"))
+		return 0;
+
+	ASSERT_NEQ(prog_info.id, 0, "prog_info.id");
+	return prog_info.id;
+}
+
+static inline __u32 id_from_link_fd(int fd)
+{
+	struct bpf_link_info link_info = {};
+	__u32 link_info_len = sizeof(link_info);
+	int err;
+
+	err = bpf_link_get_info_by_fd(fd, &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "id_from_link_fd"))
+		return 0;
+
+	ASSERT_NEQ(link_info.id, 0, "link_info.id");
+	return link_info.id;
+}
+
+static inline __u32 ifindex_from_link_fd(int fd)
+{
+	struct bpf_link_info link_info = {};
+	__u32 link_info_len = sizeof(link_info);
+	int err;
+
+	err = bpf_link_get_info_by_fd(fd, &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "id_from_link_fd"))
+		return 0;
+
+	return link_info.tcx.ifindex;
+}
+
+static inline void __assert_mprog_count(int target, int expected, bool miniq, int ifindex)
+{
+	__u32 count = 0, attach_flags = 0;
+	int err;
+
+	err = bpf_prog_query(ifindex, target, 0, &attach_flags,
+			     NULL, &count);
+	ASSERT_EQ(count, expected, "count");
+	if (!expected && !miniq)
+		ASSERT_EQ(err, -ENOENT, "prog_query");
+	else
+		ASSERT_EQ(err, 0, "prog_query");
+}
+
+static inline void assert_mprog_count(int target, int expected)
+{
+	__assert_mprog_count(target, expected, false, loopback);
+}
+
+static inline void assert_mprog_count_ifindex(int ifindex, int target, int expected)
+{
+	__assert_mprog_count(target, expected, false, ifindex);
+}
+
+#endif /* TC_HELPERS */
diff --git a/tools/testing/selftests/bpf/prog_tests/tc_opts.c b/tools/testing/selftests/bpf/prog_tests/tc_opts.c
new file mode 100644
index 000000000000..273521ca364a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tc_opts.c
@@ -0,0 +1,2698 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <uapi/linux/if_link.h>
+#include <net/if.h>
+#include <test_progs.h>
+
+#define loopback 1
+#define ping_cmd "ping -q -c1 -w1 127.0.0.1 > /dev/null"
+
+#include "test_tc_link.skel.h"
+#include "tc_helpers.h"
+
+/* Test:
+ *
+ * Basic test which attaches a prog to ingress/egress, validates
+ * that the prog got attached, runs traffic through the programs,
+ * validates that traffic has been seen, and detaches everything
+ * again. Programs are attached without special flags.
+ */
+void serial_test_tc_opts_basic(void)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, id1, id2;
+	struct test_tc_link *skel;
+	__u32 prog_ids[2];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+
+	assert_mprog_count(BPF_TCX_INGRESS, 0);
+	assert_mprog_count(BPF_TCX_EGRESS, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	err = bpf_prog_attach_opts(fd1, loopback, BPF_TCX_INGRESS, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(BPF_TCX_INGRESS, 1);
+	assert_mprog_count(BPF_TCX_EGRESS, 0);
+
+	optq.prog_ids = prog_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, BPF_TCX_INGRESS, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_in;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	err = bpf_prog_attach_opts(fd2, loopback, BPF_TCX_EGRESS, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_in;
+
+	assert_mprog_count(BPF_TCX_INGRESS, 1);
+	assert_mprog_count(BPF_TCX_EGRESS, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, BPF_TCX_EGRESS, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_eg;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+
+cleanup_eg:
+	err = bpf_prog_detach_opts(fd2, loopback, BPF_TCX_EGRESS, &optd);
+	ASSERT_OK(err, "prog_detach_eg");
+
+	assert_mprog_count(BPF_TCX_INGRESS, 1);
+	assert_mprog_count(BPF_TCX_EGRESS, 0);
+
+cleanup_in:
+	err = bpf_prog_detach_opts(fd1, loopback, BPF_TCX_INGRESS, &optd);
+	ASSERT_OK(err, "prog_detach_in");
+
+	assert_mprog_count(BPF_TCX_INGRESS, 0);
+	assert_mprog_count(BPF_TCX_EGRESS, 0);
+
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+static void test_tc_opts_first_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, id1, id2;
+	struct test_tc_link *skel;
+	__u32 prog_ids[3];
+	__u32 prog_flags[3];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_FIRST;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
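+	/* fd1 holds the first position, so any attempt to claim it or
+	 * precede it with fd2 must fail with -EBUSY.
+	 */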
+	opta.flags = BPF_F_FIRST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_BEFORE;
+	opta.relative_fd = fd1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_BEFORE | BPF_F_ID;
+	opta.relative_id = id1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	optd.flags = 0;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_LAST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], BPF_F_LAST, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+cleanup_target2:
+	optd.flags = 0;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 1);
+cleanup_target:
+	optd.flags = BPF_F_FIRST;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches a prog to ingress/egress with the first flag
+ * set, validates that the prog got attached, and that any further
+ * attach attempt targeting the first position fails. Regular attach
+ * attempts or ones with the last flag set still succeed. Detaches
+ * everything again.
+ */
+void serial_test_tc_opts_first(void)
+{
+	test_tc_opts_first_target(BPF_TCX_INGRESS);
+	test_tc_opts_first_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_last_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, id1, id2;
+	struct test_tc_link *skel;
+	__u32 prog_ids[3];
+	__u32 prog_flags[3];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_LAST;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_LAST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
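+	/* fd1 holds the last position, so any attempt to claim it or
+	 * append after it with fd2 must fail with -EBUSY.
+	 */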
+	opta.flags = BPF_F_LAST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_AFTER;
+	opta.relative_fd = fd1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_AFTER | BPF_F_ID;
+	opta.relative_id = id1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], BPF_F_LAST, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	optd.flags = 0;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_FIRST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], BPF_F_LAST, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+cleanup_target2:
+	optd.flags = 0;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 1);
+
+cleanup_target:
+	optd.flags = BPF_F_LAST;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches a prog to ingress/egress with the last flag
+ * set, validates that the prog got attached, and that any further
+ * attach attempt targeting the last position fails. Regular attach
+ * attempts or ones with the first flag set still succeed. Detaches
+ * everything again.
+ */
+void serial_test_tc_opts_last(void)
+{
+	test_tc_opts_last_target(BPF_TCX_INGRESS);
+	test_tc_opts_last_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_both_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, id1, id2, detach_fd;
+	struct test_tc_link *skel;
+	__u32 prog_ids[3];
+	__u32 prog_flags[3];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	detach_fd = fd1;
+
+	opta.flags = BPF_F_FIRST | BPF_F_LAST;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST | BPF_F_LAST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
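+	/* fd1 holds both the first and the last position, so any other
+	 * placement of fd2, including a plain attach, must fail with
+	 * -EBUSY.
+	 */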
+	opta.flags = BPF_F_LAST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_FIRST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_AFTER;
+	opta.relative_fd = fd1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_AFTER | BPF_F_ID;
+	opta.relative_id = id1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_BEFORE;
+	opta.relative_fd = fd1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_BEFORE | BPF_F_ID;
+	opta.relative_id = id1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_FIRST | BPF_F_LAST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+	assert_mprog_count(target, 1);
+
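+	/* Only an in-place replace of the pinned fd1 is allowed while
+	 * keeping the first/last flags.
+	 */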
+	opta.flags = BPF_F_FIRST | BPF_F_LAST | BPF_F_REPLACE;
+	opta.replace_fd = fd1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 1);
+
+	detach_fd = fd2;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST | BPF_F_LAST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+cleanup_target:
+	optd.flags = BPF_F_FIRST | BPF_F_LAST;
+	err = bpf_prog_detach_opts(detach_fd, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches a prog to ingress/egress with both the first
+ * and last flag set, validates that the prog got attached, and that
+ * any further attach attempt fails. Replacing the prog in place
+ * still works. Detaches everything again.
+ */
+void serial_test_tc_opts_both(void)
+{
+	test_tc_opts_both_target(BPF_TCX_INGRESS);
+	test_tc_opts_both_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_before_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, fd4, id1, id2, id3, id4;
+	struct test_tc_link *skel;
+	__u32 prog_ids[5];
+	__u32 prog_flags[5];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+	fd4 = bpf_program__fd(skel->progs.tc4);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+	id4 = id_from_prog_fd(fd4);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id3, id4, "prog_ids_3_4");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, false, "seen_tc4");
+
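+	/* Insert fd3 before fd2 (middle of the chain) and fd4 before
+	 * id1 (front of the chain).
+	 */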
+	opta.flags = BPF_F_BEFORE;
+	opta.relative_fd = fd2;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target2;
+
+	opta.flags = BPF_F_BEFORE | BPF_F_ID;
+	opta.relative_id = id1;
+	err = bpf_prog_attach_opts(fd4, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target3;
+
+	assert_mprog_count(target, 4);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target4;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id4, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id3, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], id2, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, true, "seen_tc4");
+
+cleanup_target4:
+	err = bpf_prog_detach_opts(fd4, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 3);
+cleanup_target3:
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+cleanup_target2:
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+cleanup_target:
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches a prog to ingress/egress with the before flag
+ * set and validates that the prog got attached at the right location.
+ * The first insertion lands in the middle of the chain, the second
+ * one at the front. Detaches everything again.
+ */
+void serial_test_tc_opts_before(void)
+{
+	test_tc_opts_before_target(BPF_TCX_INGRESS);
+	test_tc_opts_before_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_after_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, fd4, id1, id2, id3, id4;
+	struct test_tc_link *skel;
+	__u32 prog_ids[5];
+	__u32 prog_flags[5];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+	fd4 = bpf_program__fd(skel->progs.tc4);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+	id4 = id_from_prog_fd(fd4);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id3, id4, "prog_ids_3_4");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, false, "seen_tc4");
+
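+	/* Insert fd3 after fd1 (middle of the chain) and fd4 after
+	 * id2 (end of the chain).
+	 */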
+	opta.flags = BPF_F_AFTER;
+	opta.relative_fd = fd1;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target2;
+
+	opta.flags = BPF_F_AFTER | BPF_F_ID;
+	opta.relative_id = id2;
+	err = bpf_prog_attach_opts(fd4, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target3;
+
+	assert_mprog_count(target, 4);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target4;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id3, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id2, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], id4, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, true, "seen_tc4");
+
+cleanup_target4:
+	err = bpf_prog_detach_opts(fd4, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 3);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target3;
+
+	ASSERT_EQ(optq.count, 3, "count");
+	ASSERT_EQ(optq.revision, 6, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id3, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id2, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], 0, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+
+cleanup_target3:
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 7, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+cleanup_target2:
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 8, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+
+cleanup_target:
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches a prog to ingress/egress with the after flag
+ * set and validates that the prog got attached at the right location.
+ * The first insertion lands in the middle of the chain, the second
+ * one at the end. Detaches everything again.
+ */
+void serial_test_tc_opts_after(void)
+{
+	test_tc_opts_after_target(BPF_TCX_INGRESS);
+	test_tc_opts_after_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_revision_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, id1, id2;
+	struct test_tc_link *skel;
+	__u32 prog_ids[3];
+	__u32 prog_flags[3];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	opta.expected_revision = 1;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
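+	/* The revision is now 2, so an attach with expected_revision 1
+	 * must be rejected with -ESTALE.
+	 */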
+	opta.flags = 0;
+	opta.expected_revision = 1;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, -ESTALE, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	opta.expected_revision = 2;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+
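+	/* Detach must honor expected_revision as well: the current
+	 * revision is 3, so 2 is stale.
+	 */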
+	optd.expected_revision = 2;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_EQ(err, -ESTALE, "prog_detach");
+	assert_mprog_count(target, 2);
+
+cleanup_target2:
+	optd.expected_revision = 3;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+
+cleanup_target:
+	optd.expected_revision = 0;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches a prog to ingress/egress with an expected
+ * revision set, validates that the prog got attached, and that the
+ * operation bails out when the expected revision does not match the
+ * current one. Detaches everything again.
+ */
+void serial_test_tc_opts_revision(void)
+{
+	test_tc_opts_revision_target(BPF_TCX_INGRESS);
+	test_tc_opts_revision_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_chain_classic(int target, bool chain_tc_old)
+{
+	LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);
+	LIBBPF_OPTS(bpf_tc_hook, tc_hook, .ifindex = loopback);
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	bool hook_created = false, tc_attached = false;
+	__u32 fd1, fd2, fd3, id1, id2, id3;
+	struct test_tc_link *skel;
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
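+	/* Optionally attach tc3 through the classic tc API to verify
+	 * that tcx progs and the old-style hook can coexist.
+	 */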
+	if (chain_tc_old) {
+		tc_hook.attach_point = target == BPF_TCX_INGRESS ?
+				       BPF_TC_INGRESS : BPF_TC_EGRESS;
+		err = bpf_tc_hook_create(&tc_hook);
+		if (err == 0)
+			hook_created = true;
+		err = err == -EEXIST ? 0 : err;
+		if (!ASSERT_OK(err, "bpf_tc_hook_create"))
+			goto cleanup;
+
+		tc_opts.prog_fd = fd3;
+		err = bpf_tc_attach(&tc_hook, &tc_opts);
+		if (!ASSERT_OK(err, "bpf_tc_attach"))
+			goto cleanup;
+		tc_attached = true;
+	}
+
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_detach;
+
+	assert_mprog_count(target, 2);
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, chain_tc_old, "seen_tc3");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+	skel->bss->seen_tc3 = false;
+
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup_detach;
+
+	assert_mprog_count(target, 1);
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, chain_tc_old, "seen_tc3");
+
+cleanup_detach:
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup;
+
+	__assert_mprog_count(target, 0, chain_tc_old, loopback);
+cleanup:
+	if (tc_attached) {
+		tc_opts.flags = tc_opts.prog_fd = tc_opts.prog_id = 0;
+		err = bpf_tc_detach(&tc_hook, &tc_opts);
+		ASSERT_OK(err, "bpf_tc_detach");
+	}
+	if (hook_created) {
+		tc_hook.attach_point = BPF_TC_INGRESS | BPF_TC_EGRESS;
+		bpf_tc_hook_destroy(&tc_hook);
+	}
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches two progs to ingress/egress through the new
+ * API and, optionally, one prog via the classic tc API. Traffic is
+ * run through and the test validates that the programs have been
+ * executed. One of the two tcx progs gets removed and the traffic
+ * check is repeated. Detaches everything at the end.
+ */
+void serial_test_tc_opts_chain_classic(void)
+{
+	test_tc_chain_classic(BPF_TCX_INGRESS, false);
+	test_tc_chain_classic(BPF_TCX_EGRESS, false);
+	test_tc_chain_classic(BPF_TCX_INGRESS, true);
+	test_tc_chain_classic(BPF_TCX_EGRESS, true);
+}
+
+static void test_tc_opts_replace_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, id1, id2, id3, detach_fd;
+	struct test_tc_link *skel;
+	__u32 prog_ids[4];
+	__u32 prog_flags[4];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	opta.expected_revision = 1;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_BEFORE | BPF_F_FIRST | BPF_F_ID;
+	opta.relative_id = id1;
+	opta.expected_revision = 2;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	detach_fd = fd2;
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+	skel->bss->seen_tc3 = false;
+
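+	/* Replace fd2 with fd3 in place; position and revision bump
+	 * are verified via the query below.
+	 */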
+	opta.flags = BPF_F_REPLACE;
+	opta.replace_fd = fd2;
+	opta.expected_revision = 3;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target2;
+
+	detach_fd = fd3;
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 4, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id3, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+	skel->bss->seen_tc3 = false;
+
+	opta.flags = BPF_F_FIRST | BPF_F_REPLACE;
+	opta.replace_fd = fd3;
+	opta.expected_revision = 4;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target2;
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id3, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+
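+	/* Requesting different first/last flags than the ones currently
+	 * held is rejected with -EACCES when replacing in place.
+	 */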
+	opta.flags = BPF_F_LAST | BPF_F_REPLACE;
+	opta.replace_fd = fd3;
+	opta.expected_revision = 5;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	ASSERT_EQ(err, -EACCES, "prog_attach");
+	assert_mprog_count(target, 2);
+
+	opta.flags = BPF_F_FIRST | BPF_F_LAST | BPF_F_REPLACE;
+	opta.replace_fd = fd3;
+	opta.expected_revision = 5;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	ASSERT_EQ(err, -EACCES, "prog_attach");
+	assert_mprog_count(target, 2);
+
+	optd.flags = BPF_F_FIRST | BPF_F_BEFORE | BPF_F_ID;
+	optd.relative_id = id1;
+	optd.expected_revision = 5;
+cleanup_target2:
+	err = bpf_prog_detach_opts(detach_fd, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+
+cleanup_target:
+	optd.flags = 0;
+	optd.relative_id = 0;
+	optd.expected_revision = 0;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches progs to ingress/egress and validates in-place
+ * replacement in combination with various flags. The subsequent
+ * detachment is exercised with various flags as well.
+ */
+void serial_test_tc_opts_replace(void)
+{
+	test_tc_opts_replace_target(BPF_TCX_INGRESS);
+	test_tc_opts_replace_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_invalid_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	__u32 fd1, fd2, id1, id2;
+	struct test_tc_link *skel;
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	assert_mprog_count(target, 0);
+
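+	/* Contradictory or incomplete flag and relative fd/id
+	 * combinations must be rejected before anything gets attached.
+	 */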
+	opta.flags = BPF_F_BEFORE | BPF_F_AFTER;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EINVAL, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_BEFORE | BPF_F_ID;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -ENOENT, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_AFTER | BPF_F_ID;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -ENOENT, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_LAST | BPF_F_BEFORE;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EINVAL, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_FIRST | BPF_F_AFTER;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EINVAL, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_FIRST | BPF_F_LAST;
+	opta.relative_fd = fd2;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EINVAL, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	opta.relative_fd = fd2;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EINVAL, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_BEFORE | BPF_F_AFTER;
+	opta.relative_fd = fd2;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EINVAL, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_ID;
+	opta.relative_id = id2;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EINVAL, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_BEFORE;
+	opta.relative_fd = fd1;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -ENOENT, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = BPF_F_AFTER;
+	opta.relative_fd = fd1;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -ENOENT, "prog_attach");
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
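+	/* Attaching the same prog twice is rejected with -EEXIST,
+	 * regardless of the placement flags.
+	 */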
+	opta.flags = 0;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EEXIST, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_LAST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EEXIST, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	opta.flags = BPF_F_FIRST;
+	opta.relative_fd = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	ASSERT_EQ(err, -EEXIST, "prog_attach");
+	assert_mprog_count(target, 1);
+
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test invalid flag combinations when attaching/detaching a
+ * program.
+ */
+void serial_test_tc_opts_invalid(void)
+{
+	test_tc_opts_invalid_target(BPF_TCX_INGRESS);
+	test_tc_opts_invalid_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_prepend_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, fd4, id1, id2, id3, id4;
+	struct test_tc_link *skel;
+	__u32 prog_ids[5];
+	__u32 prog_flags[5];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+	fd4 = bpf_program__fd(skel->progs.tc4);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+	id4 = id_from_prog_fd(fd4);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id3, id4, "prog_ids_3_4");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
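+	/* BPF_F_BEFORE without a relative fd/id prepends to the head
+	 * of the chain.
+	 */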
+	opta.flags = BPF_F_BEFORE;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, false, "seen_tc4");
+
+	opta.flags = BPF_F_FIRST;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target2;
+
+	opta.flags = BPF_F_BEFORE;
+	err = bpf_prog_attach_opts(fd4, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target3;
+
+	assert_mprog_count(target, 4);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target4;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id3, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], BPF_F_FIRST, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id4, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id2, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], id1, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, true, "seen_tc4");
+
+cleanup_target4:
+	err = bpf_prog_detach_opts(fd4, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 3);
+cleanup_target3:
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+cleanup_target2:
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+cleanup_target:
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches a prog to ingress/egress with the before flag
+ * set and no relative fd/id, and validates the prepend behavior,
+ * that is, the prog gets attached at the front of the chain.
+ * Detaches everything.
+ */
+void serial_test_tc_opts_prepend(void)
+{
+	test_tc_opts_prepend_target(BPF_TCX_INGRESS);
+	test_tc_opts_prepend_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_append_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, fd4, id1, id2, id3, id4;
+	struct test_tc_link *skel;
+	__u32 prog_ids[5];
+	__u32 prog_flags[5];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+	fd4 = bpf_program__fd(skel->progs.tc4);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+	id4 = id_from_prog_fd(fd4);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id3, id4, "prog_ids_3_4");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
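+	/* BPF_F_AFTER without a relative fd/id appends to the tail
+	 * of the chain.
+	 */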
+	opta.flags = BPF_F_AFTER;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target;
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, false, "seen_tc4");
+
+	opta.flags = BPF_F_LAST;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target2;
+
+	opta.flags = BPF_F_AFTER;
+	err = bpf_prog_attach_opts(fd4, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_target3;
+
+	assert_mprog_count(target, 4);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_target4;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id4, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], id3, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], BPF_F_LAST, "prog_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, true, "seen_tc4");
+
+cleanup_target4:
+	err = bpf_prog_detach_opts(fd4, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 3);
+cleanup_target3:
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+cleanup_target2:
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+cleanup_target:
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches a prog to ingress/egress with the after flag
+ * set and no relative fd/id, and validates the append behavior,
+ * that is, the prog gets attached at the end of the chain.
+ * Detaches everything.
+ */
+void serial_test_tc_opts_append(void)
+{
+	test_tc_opts_append_target(BPF_TCX_INGRESS);
+	test_tc_opts_append_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_dev_cleanup_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, fd4, id1, id2, id3, id4;
+	struct test_tc_link *skel;
+	int err, ifindex;
+
+	ASSERT_OK(system("ip link add dev tcx_opts1 type veth peer name tcx_opts2"), "add veth");
+	ifindex = if_nametoindex("tcx_opts1");
+	ASSERT_NEQ(ifindex, 0, "non_zero_ifindex");
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+	fd4 = bpf_program__fd(skel->progs.tc4);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+	id4 = id_from_prog_fd(fd4);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id3, id4, "prog_ids_3_4");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+
+	err = bpf_prog_attach_opts(fd1, ifindex, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+	assert_mprog_count_ifindex(ifindex, target, 1);
+
+	err = bpf_prog_attach_opts(fd2, ifindex, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup1;
+	assert_mprog_count_ifindex(ifindex, target, 2);
+
+	err = bpf_prog_attach_opts(fd3, ifindex, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup2;
+	assert_mprog_count_ifindex(ifindex, target, 3);
+
+	err = bpf_prog_attach_opts(fd4, ifindex, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup3;
+	assert_mprog_count_ifindex(ifindex, target, 4);
+
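+	/* Remove the device while all four progs are still attached to
+	 * exercise the tcx cleanup path on netdev destruction.
+	 */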
+	ASSERT_OK(system("ip link del dev tcx_opts1"), "del veth");
+	ASSERT_EQ(if_nametoindex("tcx_opts1"), 0, "dev1_removed");
+	ASSERT_EQ(if_nametoindex("tcx_opts2"), 0, "dev2_removed");
+	return;
+cleanup3:
+	err = bpf_prog_detach_opts(fd3, ifindex, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count_ifindex(ifindex, target, 2);
+cleanup2:
+	err = bpf_prog_detach_opts(fd2, ifindex, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count_ifindex(ifindex, target, 1);
+cleanup1:
+	err = bpf_prog_detach_opts(fd1, ifindex, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count_ifindex(ifindex, target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+
+	ASSERT_OK(system("ip link del dev tcx_opts1"), "del veth");
+	ASSERT_EQ(if_nametoindex("tcx_opts1"), 0, "dev1_removed");
+	ASSERT_EQ(if_nametoindex("tcx_opts2"), 0, "dev2_removed");
+}
+
+/* Test:
+ *
+ * Test which attaches progs to ingress/egress on a newly created
+ * veth device and then removes the device while the programs are
+ * still attached.
+ */
+void serial_test_tc_opts_dev_cleanup(void)
+{
+	test_tc_opts_dev_cleanup_target(BPF_TCX_INGRESS);
+	test_tc_opts_dev_cleanup_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_mixed_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 pid1, pid2, pid3, pid4, lid2, lid4;
+	__u32 prog_flags[4], link_flags[4];
+	__u32 prog_ids[4], link_ids[4];
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, detach_fd;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc3, target),
+		  0, "tc3_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc4, target),
+		  0, "tc4_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	pid3 = id_from_prog_fd(bpf_program__fd(skel->progs.tc3));
+	pid4 = id_from_prog_fd(bpf_program__fd(skel->progs.tc4));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+	ASSERT_NEQ(pid3, pid4, "prog_ids_3_4");
+	ASSERT_NEQ(pid2, pid3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	err = bpf_prog_attach_opts(bpf_program__fd(skel->progs.tc1),
+				   loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+	detach_fd = bpf_program__fd(skel->progs.tc1);
+
+	assert_mprog_count(target, 1);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup1;
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	assert_mprog_count(target, 2);
+
+	opta.flags = BPF_F_REPLACE;
+	opta.replace_fd = bpf_program__fd(skel->progs.tc1);
+	err = bpf_prog_attach_opts(bpf_program__fd(skel->progs.tc2),
+				   loopback, target, &opta);
+	ASSERT_EQ(err, -EEXIST, "prog_attach");
+
+	assert_mprog_count(target, 2);
+
+	opta.flags = BPF_F_REPLACE;
+	opta.replace_fd = bpf_program__fd(skel->progs.tc2);
+	err = bpf_prog_attach_opts(bpf_program__fd(skel->progs.tc1),
+				   loopback, target, &opta);
+	ASSERT_EQ(err, -EEXIST, "prog_attach");
+
+	assert_mprog_count(target, 2);
+
+	opta.flags = BPF_F_REPLACE;
+	opta.replace_fd = bpf_program__fd(skel->progs.tc2);
+	err = bpf_prog_attach_opts(bpf_program__fd(skel->progs.tc3),
+				   loopback, target, &opta);
+	ASSERT_EQ(err, -EBUSY, "prog_attach");
+
+	assert_mprog_count(target, 2);
+
+	opta.flags = BPF_F_REPLACE;
+	opta.replace_fd = bpf_program__fd(skel->progs.tc1);
+	err = bpf_prog_attach_opts(bpf_program__fd(skel->progs.tc3),
+				   loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup1;
+	detach_fd = bpf_program__fd(skel->progs.tc3);
+
+	assert_mprog_count(target, 2);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc4, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup1;
+	skel->links.tc4 = link;
+
+	lid4 = id_from_link_fd(bpf_link__fd(skel->links.tc4));
+
+	assert_mprog_count(target, 3);
+
+	opta.flags = BPF_F_REPLACE;
+	opta.replace_fd = bpf_program__fd(skel->progs.tc4);
+	err = bpf_prog_attach_opts(bpf_program__fd(skel->progs.tc2),
+				   loopback, target, &opta);
+	ASSERT_EQ(err, -EEXIST, "prog_attach");
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup1;
+
+	ASSERT_EQ(optq.count, 3, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid3, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], 0, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid2, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], pid4, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], lid4, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], 0, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.link_ids[3], 0, "link_ids[3]");
+	ASSERT_EQ(optq.link_attach_flags[3], 0, "link_flags[3]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+cleanup1:
+	err = bpf_prog_detach_opts(detach_fd, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches a link and attempts to replace/delete via opts
+ * for ingress/egress. Ensures that the link is unaffected.
+ */
+void serial_test_tc_opts_mixed(void)
+{
+	test_tc_opts_mixed_target(BPF_TCX_INGRESS);
+	test_tc_opts_mixed_target(BPF_TCX_EGRESS);
+}
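
The core property checked above, in sketch form: a program owned by a tcx
link cannot be swapped out through the raw opts API. This assumes the
extended bpf_prog_attach_opts (replace_fd field) from this series; the test
observes -EBUSY when the replacement target is link-owned and -EEXIST when
the proposed new program is already attached elsewhere in the list:

  #include <bpf/bpf.h>

  /* Attempt to replace a link-owned program via the opts API; the
   * kernel is expected to refuse this (the test above sees -EBUSY).
   */
  static int try_replace_link_owned(int new_fd, int link_owned_fd,
  				    int ifindex, int target)
  {
  	LIBBPF_OPTS(bpf_prog_attach_opts, opta,
  		.flags      = BPF_F_REPLACE,
  		.replace_fd = link_owned_fd,
  	);

  	return bpf_prog_attach_opts(new_fd, ifindex, target, &opta);
  }
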
+
+static void test_tc_opts_demixed_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	__u32 pid1, pid2;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	err = bpf_prog_attach_opts(bpf_program__fd(skel->progs.tc1),
+				   loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup1;
+	skel->links.tc2 = link;
+
+	assert_mprog_count(target, 2);
+
+	optd.flags = BPF_F_AFTER;
+	err = bpf_prog_detach_opts(0, loopback, target, &optd);
+	ASSERT_EQ(err, -EBUSY, "prog_detach");
+
+	assert_mprog_count(target, 2);
+
+	optd.flags = BPF_F_BEFORE;
+	err = bpf_prog_detach_opts(0, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 1);
+	goto cleanup;
+cleanup1:
+	err = bpf_prog_detach_opts(bpf_program__fd(skel->progs.tc1),
+				   loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches progs to ingress/egress, validates that the progs
+ * got attached in the right location, and removes them with before/after
+ * detach flag and empty detach prog. Validates that the link cannot be
+ * removed this way.
+ */
+void serial_test_tc_opts_demixed(void)
+{
+	test_tc_opts_demixed_target(BPF_TCX_INGRESS);
+	test_tc_opts_demixed_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_opts_detach_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, fd4, id1, id2, id3, id4;
+	struct test_tc_link *skel;
+	__u32 prog_ids[5];
+	__u32 prog_flags[5];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+	fd4 = bpf_program__fd(skel->progs.tc4);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+	id4 = id_from_prog_fd(fd4);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id3, id4, "prog_ids_3_4");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup1;
+
+	assert_mprog_count(target, 2);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup2;
+
+	assert_mprog_count(target, 3);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd4, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup3;
+
+	assert_mprog_count(target, 4);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id3, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], id4, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+
+	optd.flags = BPF_F_BEFORE;
+	err = bpf_prog_detach_opts(0, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 3);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 3, "count");
+	ASSERT_EQ(optq.revision, 6, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id3, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id4, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], 0, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+
+	optd.flags = BPF_F_AFTER;
+	err = bpf_prog_detach_opts(0, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 7, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id3, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	optd.flags = 0;
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+
+	optd.flags = BPF_F_BEFORE;
+	err = bpf_prog_detach_opts(0, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+
+	optd.flags = BPF_F_AFTER;
+	err = bpf_prog_detach_opts(0, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+	goto cleanup;
+cleanup4:
+	err = bpf_prog_detach_opts(fd4, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 3);
+cleanup3:
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+cleanup2:
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+cleanup1:
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches progs to ingress/egress, validates that the progs
+ * got attached in the right location, and removes them with before/after
+ * detach flag and empty detach prog. Validates that head/tail gets removed.
+ */
+void serial_test_tc_opts_detach(void)
+{
+	test_tc_opts_detach_target(BPF_TCX_INGRESS);
+	test_tc_opts_detach_target(BPF_TCX_EGRESS);
+}
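
Sketched below is the head/tail detach semantics the test relies on,
assuming the relative-detach extension to bpf_prog_detach_opts from this
series: with BPF_F_BEFORE or BPF_F_AFTER and neither a prog fd nor a
relative anchor, the first respectively last list entry is removed, and an
empty list yields -ENOENT as asserted above:

  #include <bpf/bpf.h>

  /* Pop the current head and tail of the tcx program list. */
  static void pop_head_and_tail(int ifindex, int target)
  {
  	LIBBPF_OPTS(bpf_prog_detach_opts, optd);

  	optd.flags = BPF_F_BEFORE;	/* no anchor: detach head */
  	bpf_prog_detach_opts(0, ifindex, target, &optd);

  	optd.flags = BPF_F_AFTER;	/* no anchor: detach tail */
  	bpf_prog_detach_opts(0, ifindex, target, &optd);
  }
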
+
+static void test_tc_opts_detach_before_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, fd4, id1, id2, id3, id4;
+	struct test_tc_link *skel;
+	__u32 prog_ids[5];
+	__u32 prog_flags[5];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+	fd4 = bpf_program__fd(skel->progs.tc4);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+	id4 = id_from_prog_fd(fd4);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id3, id4, "prog_ids_3_4");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup1;
+
+	assert_mprog_count(target, 2);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup2;
+
+	assert_mprog_count(target, 3);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd4, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup3;
+
+	assert_mprog_count(target, 4);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id3, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], id4, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+
+	optd.flags = BPF_F_BEFORE;
+	optd.relative_fd = fd2;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 3);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 3, "count");
+	ASSERT_EQ(optq.revision, 6, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id3, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id4, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], 0, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+
+	optd.flags = BPF_F_BEFORE;
+	optd.relative_fd = fd2;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+	assert_mprog_count(target, 3);
+
+	optd.flags = BPF_F_BEFORE;
+	optd.relative_fd = fd4;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+	assert_mprog_count(target, 3);
+
+	optd.flags = BPF_F_BEFORE;
+	optd.relative_fd = fd1;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+	assert_mprog_count(target, 3);
+
+	optd.flags = BPF_F_BEFORE;
+	optd.relative_fd = fd3;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 7, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id3, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id4, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	optd.flags = BPF_F_BEFORE;
+	optd.relative_fd = fd4;
+	err = bpf_prog_detach_opts(0, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 8, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id4, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+
+	optd.flags = 0;
+	optd.relative_fd = 0;
+	err = bpf_prog_detach_opts(fd4, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 0);
+	goto cleanup;
+cleanup4:
+	err = bpf_prog_detach_opts(fd4, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 3);
+cleanup3:
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+cleanup2:
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+cleanup1:
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches progs to ingress/egress, validates that the progs
+ * got attached in the right location, and removes them with before
+ * detach flag and non-empty detach prog. Validates that the right ones
+ * got removed.
+ */
+void serial_test_tc_opts_detach_before(void)
+{
+	test_tc_opts_detach_before_target(BPF_TCX_INGRESS);
+	test_tc_opts_detach_before_target(BPF_TCX_EGRESS);
+}
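
A compact way to read the -ENOENT checks above: with BPF_F_BEFORE plus a
relative fd, the detach only succeeds if the program being removed really
sits directly in front of the anchor (BPF_F_AFTER mirrors this for the
other direction). A sketch under that assumption, not part of the patch:

  #include <bpf/bpf.h>

  /* Detach prog_fd only if it is placed immediately before rel_fd;
   * otherwise the kernel rejects the request with -ENOENT.
   */
  static int detach_if_directly_before(int prog_fd, int rel_fd,
  				       int ifindex, int target)
  {
  	LIBBPF_OPTS(bpf_prog_detach_opts, optd,
  		.flags       = BPF_F_BEFORE,
  		.relative_fd = rel_fd,
  	);

  	return bpf_prog_detach_opts(prog_fd, ifindex, target, &optd);
  }
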
+
+static void test_tc_opts_detach_after_target(int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts,  optq);
+	__u32 fd1, fd2, fd3, fd4, id1, id2, id3, id4;
+	struct test_tc_link *skel;
+	__u32 prog_ids[5];
+	__u32 prog_flags[5];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+	fd3 = bpf_program__fd(skel->progs.tc3);
+	fd4 = bpf_program__fd(skel->progs.tc4);
+
+	id1 = id_from_prog_fd(fd1);
+	id2 = id_from_prog_fd(fd2);
+	id3 = id_from_prog_fd(fd3);
+	id4 = id_from_prog_fd(fd4);
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+	ASSERT_NEQ(id3, id4, "prog_ids_3_4");
+	ASSERT_NEQ(id2, id3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd1, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd2, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup1;
+
+	assert_mprog_count(target, 2);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd3, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup2;
+
+	assert_mprog_count(target, 3);
+
+	opta.flags = 0;
+	err = bpf_prog_attach_opts(fd4, loopback, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup3;
+
+	assert_mprog_count(target, 4);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id3, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], id4, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+
+	optd.flags = BPF_F_AFTER;
+	optd.relative_fd = fd1;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 3);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 3, "count");
+	ASSERT_EQ(optq.revision, 6, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id3, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], id4, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], 0, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+
+	optd.flags = BPF_F_AFTER;
+	optd.relative_fd = fd1;
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+	assert_mprog_count(target, 3);
+
+	optd.flags = BPF_F_AFTER;
+	optd.relative_fd = fd4;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+	assert_mprog_count(target, 3);
+
+	optd.flags = BPF_F_AFTER;
+	optd.relative_fd = fd3;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+	assert_mprog_count(target, 3);
+
+	optd.flags = BPF_F_AFTER;
+	optd.relative_fd = fd1;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_EQ(err, -ENOENT, "prog_detach");
+	assert_mprog_count(target, 3);
+
+	optd.flags = BPF_F_AFTER;
+	optd.relative_fd = fd1;
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 7, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], id4, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+
+	optd.flags = BPF_F_AFTER;
+	optd.relative_fd = fd1;
+	err = bpf_prog_detach_opts(0, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup4;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 8, "revision");
+	ASSERT_EQ(optq.prog_ids[0], id1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+
+	optd.flags = 0;
+	optd.relative_fd = 0;
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	assert_mprog_count(target, 0);
+	goto cleanup;
+cleanup4:
+	err = bpf_prog_detach_opts(fd4, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 3);
+cleanup3:
+	err = bpf_prog_detach_opts(fd3, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 2);
+cleanup2:
+	err = bpf_prog_detach_opts(fd2, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 1);
+cleanup1:
+	err = bpf_prog_detach_opts(fd1, loopback, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count(target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+/* Test:
+ *
+ * Test which attaches progs to ingress/egress, validates that the progs
+ * got attached in the right location, and removes them with after
+ * detach flag and non-empty detach prog. Validates that the right ones
+ * got removed.
+ */
+void serial_test_tc_opts_detach_after(void)
+{
+	test_tc_opts_detach_after_target(BPF_TCX_INGRESS);
+	test_tc_opts_detach_after_target(BPF_TCX_EGRESS);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tc_link.c b/tools/testing/selftests/bpf/progs/test_tc_link.c
new file mode 100644
index 000000000000..ed1fd0e9cee9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tc_link.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+char LICENSE[] SEC("license") = "GPL";
+
+bool seen_tc1;
+bool seen_tc2;
+bool seen_tc3;
+bool seen_tc4;
+
+SEC("tc/ingress")
+int tc1(struct __sk_buff *skb)
+{
+	seen_tc1 = true;
+	return TCX_NEXT;
+}
+
+SEC("tc/egress")
+int tc2(struct __sk_buff *skb)
+{
+	seen_tc2 = true;
+	return TCX_NEXT;
+}
+
+SEC("tc/egress")
+int tc3(struct __sk_buff *skb)
+{
+	seen_tc3 = true;
+	return TCX_NEXT;
+}
+
+SEC("tc/egress")
+int tc4(struct __sk_buff *skb)
+{
+	seen_tc4 = true;
+	return TCX_NEXT;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH bpf-next v2 7/7] selftests/bpf: Add mprog API tests for BPF tcx links
  2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
                   ` (5 preceding siblings ...)
  2023-06-07 19:26 ` [PATCH bpf-next v2 6/7] selftests/bpf: Add mprog API tests for BPF tcx opts Daniel Borkmann
@ 2023-06-07 19:26 ` Daniel Borkmann
  6 siblings, 0 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-07 19:26 UTC (permalink / raw)
  To: ast
  Cc: andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev, Daniel Borkmann

Add a big batch of test coverage to assert all aspects of the tcx link API:

  # ./vmtest.sh -- ./test_progs -t tc_links
  [...]
  #224     tc_links_after:OK
  #225     tc_links_append:OK
  #226     tc_links_basic:OK
  #227     tc_links_before:OK
  #228     tc_links_both:OK
  #229     tc_links_chain_classic:OK
  #230     tc_links_dev_cleanup:OK
  #231     tc_links_first:OK
  #232     tc_links_invalid:OK
  #233     tc_links_last:OK
  #234     tc_links_prepend:OK
  #235     tc_links_replace:OK
  #236     tc_links_revision:OK
  Summary: 13/0 PASSED, 0 SKIPPED, 0 FAILED
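
For context, the attach path all of these tests build on is small; a
minimal sketch, assuming the bpf_program__attach_tcx_opts() helper and
bpf_tcx_opts introduced earlier in the series (the final API surface may
still change):

  #include <bpf/libbpf.h>

  static struct bpf_link *tcx_attach(struct bpf_program *prog, int ifindex)
  {
  	LIBBPF_OPTS(bpf_tcx_opts, optl, .ifindex = ifindex);

  	/* Direction (ingress/egress) comes from the program's expected
  	 * attach type; destroying the returned link detaches again.
  	 */
  	return bpf_program__attach_tcx_opts(prog, &optl);
  }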

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 .../selftests/bpf/prog_tests/tc_links.c       | 2279 +++++++++++++++++
 1 file changed, 2279 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_links.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tc_links.c b/tools/testing/selftests/bpf/prog_tests/tc_links.c
new file mode 100644
index 000000000000..98039db6ccaf
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tc_links.c
@@ -0,0 +1,2279 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <uapi/linux/if_link.h>
+#include <net/if.h>
+#include <test_progs.h>
+
+#define loopback 1
+#define ping_cmd "ping -q -c1 -w1 127.0.0.1 > /dev/null"
+
+#include "test_tc_link.skel.h"
+#include "tc_helpers.h"
+
+/* Test:
+ *
+ * Basic test which attaches a link to ingress/egress, validates
+ * that the link got attached, runs traffic through the programs,
+ * validates that traffic has been seen, and detaches everything
+ * again. Programs are attached without special flags.
+ */
+void serial_test_tc_links_basic(void)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_ids[2], link_ids[2];
+	__u32 pid1, pid2, lid1, lid2;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count(BPF_TCX_INGRESS, 0);
+	assert_mprog_count(BPF_TCX_EGRESS, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count(BPF_TCX_INGRESS, 1);
+	assert_mprog_count(BPF_TCX_EGRESS, 0);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, BPF_TCX_INGRESS, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+	ASSERT_NEQ(lid1, lid2, "link_ids_1_2");
+
+	assert_mprog_count(BPF_TCX_INGRESS, 1);
+	assert_mprog_count(BPF_TCX_EGRESS, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, BPF_TCX_EGRESS, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count(BPF_TCX_INGRESS, 0);
+	assert_mprog_count(BPF_TCX_EGRESS, 0);
+}
+
+static void test_tc_links_first_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_flags[3], link_flags[3];
+	__u32 prog_ids[3], link_ids[3];
+	__u32 pid1, pid2, lid1, lid2;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_FIRST;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	assert_mprog_count(target, 1);
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	optl.flags = BPF_F_FIRST;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_LINK;
+	optl.relative_fd = bpf_link__fd(skel->links.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_ID | BPF_F_LINK;
+	optl.relative_id = lid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = 0;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	assert_mprog_count(target, 2);
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid2, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	bpf_link__destroy(skel->links.tc2);
+	skel->links.tc2 = NULL;
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_LAST;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	assert_mprog_count(target, 2);
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid2, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], BPF_F_LAST, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches a link to ingress/egress with first flag set,
+ * validates that the prog got attached, other attach attempts for
+ * this position should fail. Regular attach attempts or with last
+ * flag set should succeed. Detach everything again.
+ */
+void serial_test_tc_links_first(void)
+{
+	test_tc_links_first_target(BPF_TCX_INGRESS);
+	test_tc_links_first_target(BPF_TCX_EGRESS);
+}
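
In sketch form, the guarantee this test pins down: a link attached with
BPF_F_FIRST owns the head of the list, and any later attempt to claim or
insert before that position fails until the link goes away. The snippet
uses the v2 option names and is illustrative only:

  #include <bpf/libbpf.h>

  /* Claim the first position; a second BPF_F_FIRST attach, or any
   * BPF_F_BEFORE insert relative to this link, will be rejected.
   */
  static struct bpf_link *tcx_attach_first(struct bpf_program *prog, int ifindex)
  {
  	LIBBPF_OPTS(bpf_tcx_opts, optl,
  		.ifindex = ifindex,
  		.flags   = BPF_F_FIRST,
  	);

  	return bpf_program__attach_tcx_opts(prog, &optl);
  }
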
+
+static void test_tc_links_last_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_flags[3], link_flags[3];
+	__u32 prog_ids[3], link_ids[3];
+	__u32 pid1, pid2, lid1, lid2;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_LAST;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	assert_mprog_count(target, 1);
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_LAST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	optl.flags = BPF_F_LAST;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_AFTER | BPF_F_LINK;
+	optl.relative_fd = bpf_link__fd(skel->links.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_AFTER | BPF_F_ID | BPF_F_LINK;
+	optl.relative_id = lid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = 0;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	assert_mprog_count(target, 2);
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid1, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], BPF_F_LAST, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	bpf_link__destroy(skel->links.tc2);
+	skel->links.tc2 = NULL;
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_FIRST;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	assert_mprog_count(target, 2);
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid1, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], BPF_F_LAST, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "prog_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches a link to ingress/egress with last flag set,
+ * validates that the prog got attached, other attach attempts for
+ * this position should fail. Regular attach attempts or with first
+ * flag set should succeed. Detach everything again.
+ */
+void serial_test_tc_links_last(void)
+{
+	test_tc_links_last_target(BPF_TCX_INGRESS);
+	test_tc_links_last_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_links_both_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_flags[3], link_flags[3];
+	__u32 prog_ids[3], link_ids[3];
+	__u32 pid1, pid2, lid1;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_FIRST | BPF_F_LAST;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	assert_mprog_count(target, 1);
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST | BPF_F_LAST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	optl.flags = BPF_F_LAST;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_FIRST;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_AFTER | BPF_F_LINK;
+	optl.relative_fd = bpf_link__fd(skel->links.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_AFTER;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_AFTER | BPF_F_ID;
+	optl.relative_id = pid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_AFTER | BPF_F_ID | BPF_F_LINK;
+	optl.relative_id = lid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_ID;
+	optl.relative_id = pid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_ID | BPF_F_LINK;
+	optl.relative_id = lid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = 0;
+	optl.relative_id = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_FIRST | BPF_F_LAST;
+	optl.relative_id = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_FIRST | BPF_F_LAST | BPF_F_REPLACE;
+	optl.relative_id = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST | BPF_F_LAST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	err = bpf_link__update_program(skel->links.tc1, skel->progs.tc2);
+	if (!ASSERT_OK(err, "link_update"))
+		goto cleanup;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST | BPF_F_LAST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches a link to ingress/egress with first and last
+ * flag set, validates that the link got attached, other attach
+ * attempts should fail. Link update should work. Detach everything
+ * again.
+ */
+void serial_test_tc_links_both(void)
+{
+	test_tc_links_both_target(BPF_TCX_INGRESS);
+	test_tc_links_both_target(BPF_TCX_EGRESS);
+}
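
The exclusive-ownership case above in a few lines, assuming the same v2
helpers: with BPF_F_FIRST | BPF_F_LAST the link is the only entry that can
exist on the hook, yet the program behind it can still be swapped
atomically through the regular link update path:

  #include <errno.h>
  #include <bpf/libbpf.h>

  static int tcx_attach_exclusive_and_update(struct bpf_program *p1,
  					     struct bpf_program *p2, int ifindex)
  {
  	LIBBPF_OPTS(bpf_tcx_opts, optl,
  		.ifindex = ifindex,
  		.flags   = BPF_F_FIRST | BPF_F_LAST,
  	);
  	struct bpf_link *link;

  	link = bpf_program__attach_tcx_opts(p1, &optl);
  	if (!link)
  		return -errno;

  	/* No other attachment can land on this hook now, but the
  	 * owner may still replace the underlying program in place.
  	 */
  	return bpf_link__update_program(link, p2);
  }
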
+
+static void test_tc_links_before_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_flags[5], link_flags[5];
+	__u32 prog_ids[5], link_ids[5];
+	__u32 pid1, pid2, pid3, pid4;
+	__u32 lid1, lid2, lid3, lid4;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc3, target),
+		  0, "tc3_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc4, target),
+		  0, "tc4_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	pid3 = id_from_prog_fd(bpf_program__fd(skel->progs.tc3));
+	pid4 = id_from_prog_fd(bpf_program__fd(skel->progs.tc4));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+	ASSERT_NEQ(pid3, pid4, "prog_ids_3_4");
+	ASSERT_NEQ(pid2, pid3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count(target, 1);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid2, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, false, "seen_tc4");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+
+	optl.flags = BPF_F_BEFORE;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc2);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc3, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc3 = link;
+
+	lid3 = id_from_link_fd(bpf_link__fd(skel->links.tc3));
+
+	optl.flags = BPF_F_BEFORE | BPF_F_ID | BPF_F_LINK;
+	optl.relative_id = lid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc4, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc4 = link;
+
+	lid4 = id_from_link_fd(bpf_link__fd(skel->links.tc4));
+
+	assert_mprog_count(target, 4);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid4, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid4, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid1, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], pid3, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], lid3, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], pid2, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.link_ids[3], lid2, "link_ids[3]");
+	ASSERT_EQ(optq.link_attach_flags[3], 0, "link_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+	ASSERT_EQ(optq.link_ids[4], 0, "link_ids[4]");
+	ASSERT_EQ(optq.link_attach_flags[4], 0, "link_flags[4]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, true, "seen_tc4");
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches links to ingress/egress with the before flag
+ * set and validates that they got attached at the right location.
+ * The first insertion lands in the middle, the second one at the front.
+ * Detach everything again.
+ */
+void serial_test_tc_links_before(void)
+{
+	test_tc_links_before_target(BPF_TCX_INGRESS);
+	test_tc_links_before_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_links_after_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_flags[5], link_flags[5];
+	__u32 prog_ids[5], link_ids[5];
+	__u32 pid1, pid2, pid3, pid4;
+	__u32 lid1, lid2, lid3, lid4;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc3, target),
+		  0, "tc3_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc4, target),
+		  0, "tc4_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	pid3 = id_from_prog_fd(bpf_program__fd(skel->progs.tc3));
+	pid4 = id_from_prog_fd(bpf_program__fd(skel->progs.tc4));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+	ASSERT_NEQ(pid3, pid4, "prog_ids_3_4");
+	ASSERT_NEQ(pid2, pid3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count(target, 1);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid2, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, false, "seen_tc4");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+
+	optl.flags = BPF_F_AFTER;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc3, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc3 = link;
+
+	lid3 = id_from_link_fd(bpf_link__fd(skel->links.tc3));
+
+	optl.flags = BPF_F_AFTER | BPF_F_LINK;
+	optl.relative_fd = bpf_link__fd(skel->links.tc2);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc4, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc4 = link;
+
+	lid4 = id_from_link_fd(bpf_link__fd(skel->links.tc4));
+
+	assert_mprog_count(target, 4);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid3, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid3, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], pid2, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], lid2, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], pid4, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.link_ids[3], lid4, "link_ids[3]");
+	ASSERT_EQ(optq.link_attach_flags[3], 0, "link_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+	ASSERT_EQ(optq.link_ids[4], 0, "link_ids[4]");
+	ASSERT_EQ(optq.link_attach_flags[4], 0, "link_flags[4]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, true, "seen_tc4");
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches links to ingress/egress with the after flag
+ * set and validates that they got attached at the right location.
+ * The first insertion lands in the middle, the second one at the end.
+ * Detach everything again.
+ */
+void serial_test_tc_links_after(void)
+{
+	test_tc_links_after_target(BPF_TCX_INGRESS);
+	test_tc_links_after_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_links_revision_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_flags[3], link_flags[3];
+	__u32 prog_ids[3], link_ids[3];
+	__u32 pid1, pid2, lid1, lid2;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	optl.flags = 0;
+	optl.expected_revision = 1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count(target, 1);
+
+	optl.flags = 0;
+	optl.expected_revision = 1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = 0;
+	optl.expected_revision = 2;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid2, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches links to ingress/egress with the expected
+ * revision count set, validates that the links got attached, and
+ * validates that the operation bails out when the revision count
+ * mismatches. Detach everything again.
+ */
+void serial_test_tc_links_revision(void)
+{
+	test_tc_links_revision_target(BPF_TCX_INGRESS);
+	test_tc_links_revision_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_chain_classic(int target, bool chain_tc_old)
+{
+	LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);
+	LIBBPF_OPTS(bpf_tc_hook, tc_hook, .ifindex = loopback);
+	bool hook_created = false, tc_attached = false;
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	__u32 pid1, pid2, pid3;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	pid3 = id_from_prog_fd(bpf_program__fd(skel->progs.tc3));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+	ASSERT_NEQ(pid2, pid3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	if (chain_tc_old) {
+		tc_hook.attach_point = target == BPF_TCX_INGRESS ?
+				       BPF_TC_INGRESS : BPF_TC_EGRESS;
+		err = bpf_tc_hook_create(&tc_hook);
+		if (err == 0)
+			hook_created = true;
+		err = err == -EEXIST ? 0 : err;
+		if (!ASSERT_OK(err, "bpf_tc_hook_create"))
+			goto cleanup;
+
+		tc_opts.prog_fd = bpf_program__fd(skel->progs.tc3);
+		err = bpf_tc_attach(&tc_hook, &tc_opts);
+		if (!ASSERT_OK(err, "bpf_tc_attach"))
+			goto cleanup;
+		tc_attached = true;
+	}
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	assert_mprog_count(target, 2);
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, chain_tc_old, "seen_tc3");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+	skel->bss->seen_tc3 = false;
+
+	err = bpf_link__detach(skel->links.tc2);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, chain_tc_old, "seen_tc3");
+cleanup:
+	if (tc_attached) {
+		tc_opts.flags = tc_opts.prog_fd = tc_opts.prog_id = 0;
+		err = bpf_tc_detach(&tc_hook, &tc_opts);
+		ASSERT_OK(err, "bpf_tc_detach");
+	}
+	if (hook_created) {
+		tc_hook.attach_point = BPF_TC_INGRESS | BPF_TC_EGRESS;
+		bpf_tc_hook_destroy(&tc_hook);
+	}
+	assert_mprog_count(target, 1);
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches two links to ingress/egress through the new
+ * API and one prog via the classic API. Traffic runs through and it
+ * is validated that the programs have been executed. One of the two
+ * links then gets removed and the test is rerun. Detach everything
+ * at the end.
+ */
+void serial_test_tc_links_chain_classic(void)
+{
+	test_tc_chain_classic(BPF_TCX_INGRESS, false);
+	test_tc_chain_classic(BPF_TCX_EGRESS, false);
+	test_tc_chain_classic(BPF_TCX_INGRESS, true);
+	test_tc_chain_classic(BPF_TCX_EGRESS, true);
+}
+
+static void test_tc_links_replace_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 pid1, pid2, pid3, lid1, lid2;
+	__u32 prog_flags[4], link_flags[4];
+	__u32 prog_ids[4], link_ids[4];
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc3, target),
+		  0, "tc3_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	pid3 = id_from_prog_fd(bpf_program__fd(skel->progs.tc3));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+	ASSERT_NEQ(pid2, pid3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	optl.flags = 0;
+	optl.expected_revision = 1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_FIRST | BPF_F_ID;
+	optl.relative_id = pid1;
+	optl.expected_revision = 2;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid1, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+	skel->bss->seen_tc3 = false;
+
+	optl.flags = BPF_F_REPLACE;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc2);
+	optl.expected_revision = 3;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc3, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 2);
+
+	optl.flags = BPF_F_REPLACE | BPF_F_LINK;
+	optl.relative_fd = bpf_link__fd(skel->links.tc2);
+	optl.expected_revision = 3;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc3, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 2);
+
+	optl.flags = BPF_F_REPLACE | BPF_F_LINK | BPF_F_FIRST | BPF_F_ID;
+	optl.relative_id = lid2;
+	optl.expected_revision = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc3, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 2);
+
+	err = bpf_link__update_program(skel->links.tc2, skel->progs.tc3);
+	if (!ASSERT_OK(err, "link_update"))
+		goto cleanup;
+
+	assert_mprog_count(target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 4, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid3, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid1, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+	skel->bss->seen_tc3 = false;
+
+	err = bpf_link__detach(skel->links.tc2);
+	if (!ASSERT_OK(err, "link_detach"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+	skel->bss->seen_tc3 = false;
+
+	err = bpf_link__update_program(skel->links.tc1, skel->progs.tc1);
+	if (!ASSERT_OK(err, "link_update_self"))
+		goto cleanup;
+
+	assert_mprog_count(target, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches links to ingress/egress and validates program
+ * replacement in combination with various ops/flags. The same is
+ * then exercised for detachment.
+ */
+void serial_test_tc_links_replace(void)
+{
+	test_tc_links_replace_target(BPF_TCX_INGRESS);
+	test_tc_links_replace_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_links_invalid_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 pid1, pid2, lid1;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_AFTER;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_ID;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_AFTER | BPF_F_ID;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_ID;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_LAST | BPF_F_BEFORE;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_FIRST | BPF_F_AFTER;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_FIRST | BPF_F_LAST;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc2);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_FIRST | BPF_F_LAST | BPF_F_LINK;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc2);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_FIRST | BPF_F_LAST | BPF_F_LINK;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = 0;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc2);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_AFTER;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc2);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_BEFORE;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_ID;
+	optl.relative_id = pid2;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_BEFORE;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_LINK;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_AFTER;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = BPF_F_AFTER | BPF_F_LINK;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 0);
+
+	optl.flags = 0;
+	optl.relative_fd = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_AFTER | BPF_F_LINK;
+	optl.relative_fd = bpf_program__fd(skel->progs.tc1);
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_LINK | BPF_F_ID;
+	optl.relative_id = ~0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_LINK | BPF_F_ID;
+	optl.relative_id = lid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_ID;
+	optl.relative_id = pid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE | BPF_F_LINK | BPF_F_ID;
+	optl.relative_id = lid1;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	assert_mprog_count(target, 2);
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test invalid flag combinations when attaching/detaching a
+ * link.
+ */
+void serial_test_tc_links_invalid(void)
+{
+	test_tc_links_invalid_target(BPF_TCX_INGRESS);
+	test_tc_links_invalid_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_links_prepend_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_flags[5], link_flags[5];
+	__u32 prog_ids[5], link_ids[5];
+	__u32 pid1, pid2, pid3, pid4;
+	__u32 lid1, lid2, lid3, lid4;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc3, target),
+		  0, "tc3_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc4, target),
+		  0, "tc4_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	pid3 = id_from_prog_fd(bpf_program__fd(skel->progs.tc3));
+	pid4 = id_from_prog_fd(bpf_program__fd(skel->progs.tc4));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+	ASSERT_NEQ(pid3, pid4, "prog_ids_3_4");
+	ASSERT_NEQ(pid2, pid3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	optl.flags = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_BEFORE;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid1, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, false, "seen_tc4");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+
+	optl.flags = BPF_F_FIRST;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc3, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc3 = link;
+
+	lid3 = id_from_link_fd(bpf_link__fd(skel->links.tc3));
+
+	optl.flags = BPF_F_BEFORE;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc4, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc4 = link;
+
+	lid4 = id_from_link_fd(bpf_link__fd(skel->links.tc4));
+
+	assert_mprog_count(target, 4);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid3, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid3, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], BPF_F_FIRST, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid4, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid4, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], pid2, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], lid2, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], pid1, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.link_ids[3], lid1, "link_ids[3]");
+	ASSERT_EQ(optq.link_attach_flags[3], 0, "link_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+	ASSERT_EQ(optq.link_ids[4], 0, "link_ids[4]");
+	ASSERT_EQ(optq.link_attach_flags[4], 0, "link_flags[4]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, true, "seen_tc4");
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches links to ingress/egress with the before flag
+ * set and no relative fd/id, and validates the prepend behavior,
+ * i.e. that the links got attached at the front. Detaches everything.
+ */
+void serial_test_tc_links_prepend(void)
+{
+	test_tc_links_prepend_target(BPF_TCX_INGRESS);
+	test_tc_links_prepend_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_links_append_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl,
+		.ifindex = loopback,
+	);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 prog_flags[5], link_flags[5];
+	__u32 prog_ids[5], link_ids[5];
+	__u32 pid1, pid2, pid3, pid4;
+	__u32 lid1, lid2, lid3, lid4;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc3, target),
+		  0, "tc3_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc4, target),
+		  0, "tc4_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	pid3 = id_from_prog_fd(bpf_program__fd(skel->progs.tc3));
+	pid4 = id_from_prog_fd(bpf_program__fd(skel->progs.tc4));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+	ASSERT_NEQ(pid3, pid4, "prog_ids_3_4");
+	ASSERT_NEQ(pid2, pid3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	optl.flags = 0;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count(target, 1);
+
+	optl.flags = BPF_F_AFTER;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+
+	assert_mprog_count(target, 2);
+
+	optq.prog_ids = prog_ids;
+	optq.prog_attach_flags = prog_flags;
+	optq.link_ids = link_ids;
+	optq.link_attach_flags = link_flags;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid2, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, false, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, false, "seen_tc4");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+
+	optl.flags = BPF_F_LAST;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc3, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc3 = link;
+
+	lid3 = id_from_link_fd(bpf_link__fd(skel->links.tc3));
+
+	optl.flags = BPF_F_AFTER;
+	link = bpf_program__attach_tcx_opts(skel->progs.tc4, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc4 = link;
+
+	lid4 = id_from_link_fd(bpf_link__fd(skel->links.tc4));
+
+	assert_mprog_count(target, 4);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(prog_flags, 0, sizeof(prog_flags));
+	memset(link_ids, 0, sizeof(link_ids));
+	memset(link_flags, 0, sizeof(link_flags));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(loopback, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 4, "count");
+	ASSERT_EQ(optq.revision, 5, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_attach_flags[0], 0, "prog_flags[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.link_attach_flags[0], 0, "link_flags[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid2, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_attach_flags[1], 0, "prog_flags[1]");
+	ASSERT_EQ(optq.link_ids[1], lid2, "link_ids[1]");
+	ASSERT_EQ(optq.link_attach_flags[1], 0, "link_flags[1]");
+	ASSERT_EQ(optq.prog_ids[2], pid4, "prog_ids[2]");
+	ASSERT_EQ(optq.prog_attach_flags[2], 0, "prog_flags[2]");
+	ASSERT_EQ(optq.link_ids[2], lid4, "link_ids[2]");
+	ASSERT_EQ(optq.link_attach_flags[2], 0, "link_flags[2]");
+	ASSERT_EQ(optq.prog_ids[3], pid3, "prog_ids[3]");
+	ASSERT_EQ(optq.prog_attach_flags[3], 0, "prog_flags[3]");
+	ASSERT_EQ(optq.link_ids[3], lid3, "link_ids[3]");
+	ASSERT_EQ(optq.link_attach_flags[3], BPF_F_LAST, "link_flags[3]");
+	ASSERT_EQ(optq.prog_ids[4], 0, "prog_ids[4]");
+	ASSERT_EQ(optq.prog_attach_flags[4], 0, "prog_flags[4]");
+	ASSERT_EQ(optq.link_ids[4], 0, "link_ids[4]");
+	ASSERT_EQ(optq.link_attach_flags[4], 0, "link_flags[4]");
+
+	ASSERT_OK(system(ping_cmd), ping_cmd);
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+	ASSERT_EQ(skel->bss->seen_tc3, true, "seen_tc3");
+	ASSERT_EQ(skel->bss->seen_tc4, true, "seen_tc4");
+cleanup:
+	test_tc_link__destroy(skel);
+	assert_mprog_count(target, 0);
+}
+
+/* Test:
+ *
+ * Test which attaches links to ingress/egress with the after flag
+ * set and no relative fd/id, and validates the append behavior,
+ * i.e. that the links got attached at the end. Detaches everything.
+ */
+void serial_test_tc_links_append(void)
+{
+	test_tc_links_append_target(BPF_TCX_INGRESS);
+	test_tc_links_append_target(BPF_TCX_EGRESS);
+}
+
+static void test_tc_links_dev_cleanup_target(int target)
+{
+	LIBBPF_OPTS(bpf_tcx_opts, optl);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 pid1, pid2, pid3, pid4;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex;
+
+	ASSERT_OK(system("ip link add dev tcx_opts1 type veth peer name tcx_opts2"), "add veth");
+	ifindex = if_nametoindex("tcx_opts1");
+	ASSERT_NEQ(ifindex, 0, "non_zero_ifindex");
+	optl.ifindex = ifindex;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1, target),
+		  0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2, target),
+		  0, "tc2_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc3, target),
+		  0, "tc3_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc4, target),
+		  0, "tc4_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+	pid3 = id_from_prog_fd(bpf_program__fd(skel->progs.tc3));
+	pid4 = id_from_prog_fd(bpf_program__fd(skel->progs.tc4));
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+	ASSERT_NEQ(pid3, pid4, "prog_ids_3_4");
+	ASSERT_NEQ(pid2, pid3, "prog_ids_2_3");
+
+	assert_mprog_count(target, 0);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc1, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc1 = link;
+	assert_mprog_count_ifindex(ifindex, target, 1);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc2, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc2 = link;
+	assert_mprog_count_ifindex(ifindex, target, 2);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc3, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc3 = link;
+	assert_mprog_count_ifindex(ifindex, target, 3);
+
+	link = bpf_program__attach_tcx_opts(skel->progs.tc4, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc4 = link;
+	assert_mprog_count_ifindex(ifindex, target, 4);
+
+	ASSERT_OK(system("ip link del dev tcx_opts1"), "del veth");
+	ASSERT_EQ(if_nametoindex("tcx_opts1"), 0, "dev1_removed");
+	ASSERT_EQ(if_nametoindex("tcx_opts2"), 0, "dev2_removed");
+
+	ASSERT_EQ(ifindex_from_link_fd(bpf_link__fd(skel->links.tc1)), 0, "tc1_ifindex");
+	ASSERT_EQ(ifindex_from_link_fd(bpf_link__fd(skel->links.tc2)), 0, "tc2_ifindex");
+	ASSERT_EQ(ifindex_from_link_fd(bpf_link__fd(skel->links.tc3)), 0, "tc3_ifindex");
+	ASSERT_EQ(ifindex_from_link_fd(bpf_link__fd(skel->links.tc4)), 0, "tc4_ifindex");
+
+	test_tc_link__destroy(skel);
+	return;
+cleanup:
+	test_tc_link__destroy(skel);
+
+	ASSERT_OK(system("ip link del dev tcx_opts1"), "del veth");
+	ASSERT_EQ(if_nametoindex("tcx_opts1"), 0, "dev1_removed");
+	ASSERT_EQ(if_nametoindex("tcx_opts2"), 0, "dev2_removed");
+}
+
+/* Test:
+ *
+ * Test which attaches links to ingress/egress on a newly created
+ * device, then removes the device while the links are still attached
+ * and checks that the links end up in detached state.
+ */
+void serial_test_tc_links_dev_cleanup(void)
+{
+	test_tc_links_dev_cleanup_target(BPF_TCX_INGRESS);
+	test_tc_links_dev_cleanup_target(BPF_TCX_EGRESS);
+}
-- 
2.34.1



* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-07 19:26 ` [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support Daniel Borkmann
@ 2023-06-08  1:25   ` Jamal Hadi Salim
  2023-06-08 10:11     ` Daniel Borkmann
  2023-06-08 17:50   ` Stanislav Fomichev
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 49+ messages in thread
From: Jamal Hadi Salim @ 2023-06-08  1:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu,
	joe, toke, davem, bpf, netdev

Daniel,

A general question (which I think I asked last time as well): who
decides what comes after/before what prog in this setup? And would
that same entity not have been able to make the same decision using tc
priorities?

The idea of protecting programs from being unloaded is very welcome,
but it feels like it would have made sense as a separate patchset (we
have good need for it). Would it be possible to use that feature in tc
and xdp?

comments inline:

On Wed, Jun 7, 2023 at 3:29 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This work refactors and adds a lightweight extension ("tcx") to the tc BPF
> ingress and egress data path side for allowing BPF program management based
> on fds via bpf() syscall through the newly added generic multi-prog API.
> The main goal behind this work which we also presented at LPC [0] last year
> and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
> BPF link functionality for tc BPF programs, which allows for a model of safe
> ownership and program detachment.
>
> Given the rise in tc BPF users in cloud native environments, this becomes
> necessary to avoid hard-to-debug incidents either through stale leftover
> programs or 3rd party applications accidentally stepping on each other's toes.
> As a recap, a BPF link represents the attachment of a BPF program to a BPF
> hook point. The BPF link holds a single reference to keep the BPF program alive.
> Moreover, hook points do not reference a BPF link, only the application's
> fd or pinning does. A BPF link holds meta-data specific to attachment and
> implements operations for link creation, (atomic) BPF program update,
> detachment and introspection. The motivation for BPF links for tc BPF programs
> is multi-fold, for example:
>
>   - From Meta: "It's especially important for applications that are deployed
>     fleet-wide and that don't "control" hosts they are deployed to. If such
>     application crashes and no one notices and does anything about that, BPF
>     program will keep running draining resources or even just, say, dropping
>     packets. We at FB had outages due to such permanent BPF attachment
>     semantics. With fd-based BPF link we are getting a framework, which allows
>     safe, auto-detachable behavior by default, unless application explicitly
>     opts in by pinning the BPF link." [1]
>
>   - From Cilium-side the tc BPF programs we attach to host-facing veth devices
>     and phys devices build the core datapath for Kubernetes Pods, and they
>     implement forwarding, load-balancing, policy, EDT-management, etc, within
>     BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
>     experienced hard-to-debug issues in a user's staging environment where
>     another Kubernetes application using tc BPF attached to the same prio/handle
>     of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
>     it. The goal is to establish a clear/safe ownership model via links which
>     cannot accidentally be overridden. [0,2]
>
> BPF links for tc can co-exist with non-link attachments, and the semantics are
> in line also with XDP links: BPF links cannot replace other BPF links, BPF
> links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
> lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
> would solve mentioned issue of safe ownership model as 3rd party applications
> would not be able to accidentally wipe Cilium programs, even if they are not
> BPF link aware.
>
> Earlier attempts [4] have tried to integrate BPF links into core tc machinery
> to solve cls_bpf, which has been intrusive to the generic tc kernel API with
> extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
> be wiped from the qdisc also. Locking a tc BPF program in place this way, is
> getting into layering hacks given the two object models are vastly different.
>
> We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
> attach API, so that the BPF link implementation blends in naturally similar to
> other link types which are fd-based and without the need for changing core tc
> internal APIs. BPF programs for tc can then be successively migrated from classic
> cls_bpf to the new tc BPF link without needing to change the program's source
> code, just the BPF loader mechanics for attaching is sufficient.
>
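To make the loader-side migration concrete, here is a minimal sketch based on
the libbpf additions later in this series (patches 3/7 and 4/7) and their use
in the selftests; the function and option names are taken from this series and
may still change during review:

#include <errno.h>
#include <bpf/libbpf.h>

/* Attach an already-loaded tc BPF program (expected_attach_type set to
 * BPF_TCX_INGRESS before load) to the ingress side of ifindex via a BPF
 * link, without modifying the program itself.
 */
static int attach_tcx_ingress(struct bpf_program *prog, int ifindex)
{
	LIBBPF_OPTS(bpf_tcx_opts, optl,
		.ifindex = ifindex,
	);
	struct bpf_link *link;

	link = bpf_program__attach_tcx_opts(prog, &optl);
	if (!link)
		return -errno;
	/* Pin the link if the attachment must outlive this process;
	 * otherwise closing the link fd auto-detaches the program.
	 */
	return bpf_link__pin(link, "/sys/fs/bpf/tcx_ingress_link");
}
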
> For the current tc framework, there is no change in behavior with this change
> and neither does this change touch on tc core kernel APIs. The gist of this
> patch is that the ingress and egress hook have a lightweight, qdisc-less
> extension for BPF to attach its tc BPF programs, in other words, a minimal
> entry point for tc BPF. The name tcx has been suggested from discussion of
> earlier revisions of this work as a good fit, and to more easily differ between
> the classic cls_bpf attachment and the fd-based one.
>
> For the ingress and egress tcx points, the device holds a cache-friendly array
> with program pointers which is separated from control plane (slow-path) data.
> Earlier versions of this work used priority to determine ordering and expression
> of dependencies similar to classic tc, but it was challenged that for
> something more future-proof a better user experience is required. Hence this
> resulted in the design and development of the generic attach/detach/query API
> for multi-progs. See prior patch with its discussion on the API design. tcx is
> the first user and later we plan to integrate also others, for example, one
> candidate is multi-prog support for XDP which would benefit and have the same
> 'look and feel' from API perspective.
>
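For illustration, querying the resulting ordering from user space could look
roughly like the following sketch, which mirrors what the selftests in patches
6/7 and 7/7 do (field names as used there; the revision type is assumed and
printed via a cast):

#include <stdio.h>
#include <bpf/bpf.h>

/* Dump the ordered list of programs currently attached to the tcx
 * ingress hook of an interface via the multi-prog query API.
 */
static void dump_tcx_ingress(int ifindex)
{
	__u32 prog_ids[16] = {};
	LIBBPF_OPTS(bpf_prog_query_opts, optq,
		.prog_ids = prog_ids,
		.count = 16,
	);

	if (bpf_prog_query_opts(ifindex, BPF_TCX_INGRESS, &optq))
		return;
	printf("revision %llu, %u prog(s)\n",
	       (unsigned long long)optq.revision, optq.count);
	for (__u32 i = 0; i < optq.count; i++)
		printf("  pos %u: prog id %u\n", i, prog_ids[i]);
}
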
> The goal with tcx is to have maximum compatibility with existing tc BPF programs,
> so they don't need to be rewritten specifically. Compatibility for calling into
> classic tcf_classify() is also provided in order to allow successive migration
> or for both to cleanly co-exist where needed given it's all one logical tc layer.
> tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
> to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
> The fd-based API is behind a static key, so that when unused the code is also
> not entered. The struct tcx_entry's program array is currently static, but
> could be made dynamic if necessary at a point in future. The a/b pair swap
> design has been chosen so that for detachment there are no allocations which
> otherwise could fail. The work has been tested with tc-testing selftest suite
> which all passes, as well as the tc BPF tests from the BPF CI, and also with
> Cilium's L4LB.
>
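And on the BPF side, a minimal sketch of a program using the simplified return
codes named above (section name and logic purely illustrative, assuming the
updated uapi header from this series which defines the TCX_* codes):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int tcx_next_example(struct __sk_buff *skb)
{
	/* ... inspect/count/mark the packet here ... */

	/* Non-terminating verdict: continue with the next program in the
	 * tcx array. TCX_PASS, TCX_DROP and TCX_REDIRECT terminate instead.
	 */
	return TCX_NEXT;
}

char __license[] SEC("license") = "GPL";
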
> Kudos also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
> of this work.
>
>   [0] https://lpc.events/event/16/contributions/1353/
>   [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com/
>   [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
>   [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>   [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com/
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/netdevice.h      |  15 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/tcx.h              | 157 +++++++++++++++
>  include/uapi/linux/bpf.h       |  35 +++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/syscall.c           |  95 +++++++--
>  kernel/bpf/tcx.c               | 347 +++++++++++++++++++++++++++++++++
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 267 +++++++++++++++----------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  45 ++++-
>  tools/include/uapi/linux/bpf.h |  35 +++-
>  16 files changed, 877 insertions(+), 144 deletions(-)
>  create mode 100644 include/net/tcx.h
>  create mode 100644 kernel/bpf/tcx.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 754a9eeca0a1..7a0d0b0c5a5e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3827,13 +3827,15 @@ L:      netdev@vger.kernel.org
>  S:     Maintained
>  F:     kernel/bpf/bpf_struct*
>
> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (tcx & tc BPF, sock_addr)
>  M:     Martin KaFai Lau <martin.lau@linux.dev>
>  M:     Daniel Borkmann <daniel@iogearbox.net>
>  R:     John Fastabend <john.fastabend@gmail.com>
>  L:     bpf@vger.kernel.org
>  L:     netdev@vger.kernel.org
>  S:     Maintained
> +F:     include/net/tcx.h
> +F:     kernel/bpf/tcx.c
>  F:     net/core/filter.c
>  F:     net/sched/act_bpf.c
>  F:     net/sched/cls_bpf.c
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 08fbd4622ccf..fd4281d1cdbb 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1927,8 +1927,7 @@ enum netdev_ml_priv_type {
>   *
>   *     @rx_handler:            handler for received packets
>   *     @rx_handler_data:       XXX: need comments on this one
> - *     @miniq_ingress:         ingress/clsact qdisc specific data for
> - *                             ingress processing
> + *     @tcx_ingress:           BPF & clsact qdisc specific data for ingress processing
>   *     @ingress_queue:         XXX: need comments on this one
>   *     @nf_hooks_ingress:      netfilter hooks executed for ingress packets
>   *     @broadcast:             hw bcast address
> @@ -1949,8 +1948,7 @@ enum netdev_ml_priv_type {
>   *     @xps_maps:              all CPUs/RXQs maps for XPS device
>   *
>   *     @xps_maps:      XXX: need comments on this one
> - *     @miniq_egress:          clsact qdisc specific data for
> - *                             egress processing
> + *     @tcx_egress:            BPF & clsact qdisc specific data for egress processing
>   *     @nf_hooks_egress:       netfilter hooks executed for egress packets
>   *     @qdisc_hash:            qdisc hash table
>   *     @watchdog_timeo:        Represents the timeout that is used by
> @@ -2249,9 +2247,8 @@ struct net_device {
>         unsigned int            gro_ipv4_max_size;
>         rx_handler_func_t __rcu *rx_handler;
>         void __rcu              *rx_handler_data;
> -
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc __rcu *miniq_ingress;
> +#ifdef CONFIG_NET_XGRESS
> +       struct bpf_mprog_entry __rcu *tcx_ingress;
>  #endif
>         struct netdev_queue __rcu *ingress_queue;
>  #ifdef CONFIG_NETFILTER_INGRESS
> @@ -2279,8 +2276,8 @@ struct net_device {
>  #ifdef CONFIG_XPS
>         struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
>  #endif
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc __rcu *miniq_egress;
> +#ifdef CONFIG_NET_XGRESS
> +       struct bpf_mprog_entry __rcu *tcx_egress;
>  #endif
>  #ifdef CONFIG_NETFILTER_EGRESS
>         struct nf_hook_entries __rcu *nf_hooks_egress;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 5951904413ab..48c3e307f057 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -943,7 +943,7 @@ struct sk_buff {
>         __u8                    __mono_tc_offset[0];
>         /* public: */
>         __u8                    mono_delivery_time:1;   /* See SKB_MONO_DELIVERY_TIME_MASK */
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         __u8                    tc_at_ingress:1;        /* See TC_AT_INGRESS_MASK */
>         __u8                    tc_skip_classify:1;
>  #endif
> @@ -992,7 +992,7 @@ struct sk_buff {
>         __u8                    csum_not_inet:1;
>  #endif
>
> -#ifdef CONFIG_NET_SCHED
> +#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>         __u16                   tc_index;       /* traffic control index */
>  #endif
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..0ade5d1a72b2 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -695,7 +695,7 @@ int skb_do_redirect(struct sk_buff *);
>
>  static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
>  {
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         return skb->tc_at_ingress;
>  #else
>         return false;
> diff --git a/include/net/tcx.h b/include/net/tcx.h
> new file mode 100644
> index 000000000000..27885ecedff9
> --- /dev/null
> +++ b/include/net/tcx.h
> @@ -0,0 +1,157 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __NET_TCX_H
> +#define __NET_TCX_H
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +
> +#include <net/sch_generic.h>
> +
> +struct mini_Qdisc;
> +
> +struct tcx_entry {
> +       struct bpf_mprog_bundle         bundle;
> +       struct mini_Qdisc __rcu         *miniq;
> +};
> +

Can you please move miniq to the front? From where I sit this looks like:
struct tcx_entry {
        struct bpf_mprog_bundle    bundle
__attribute__((__aligned__(64))); /*     0  3264 */

        /* XXX last struct has 36 bytes of padding */

        /* --- cacheline 51 boundary (3264 bytes) --- */
        struct mini_Qdisc *        miniq;                /*  3264     8 */

        /* size: 3328, cachelines: 52, members: 2 */
        /* padding: 56 */
        /* paddings: 1, sum paddings: 36 */
        /* forced alignments: 1 */
} __attribute__((__aligned__(64)));

That is a _lot_ of cachelines - at the expense of the status quo
clsact/ingress qdiscs which access miniq.
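(For illustration, the suggested reordering would simply be, with member names
taken from the patch:

  struct tcx_entry {
          struct mini_Qdisc __rcu         *miniq;
          struct bpf_mprog_bundle         bundle;
  };

so that the miniq pointer dereferenced by the clsact/ingress fast path sits in
the first cacheline instead of behind the ~3k bundle.)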

> +struct tcx_link {
> +       struct bpf_link link;
> +       struct net_device *dev;
> +       u32 location;
> +       u32 flags;
> +};
> +
> +static inline struct tcx_link *tcx_link(struct bpf_link *link)
> +{
> +       return container_of(link, struct tcx_link, link);
> +}
> +
> +static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link)
> +{
> +       return tcx_link((struct bpf_link *)link);
> +}
> +
> +static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress)
> +{
> +#ifdef CONFIG_NET_XGRESS
> +       skb->tc_at_ingress = ingress;
> +#endif
> +}
> +
> +#ifdef CONFIG_NET_XGRESS
> +void tcx_inc(void);
> +void tcx_dec(void);
> +
> +static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry)
> +{
> +       return container_of(entry->parent, struct tcx_entry, bundle);
> +}
> +
> +static inline void
> +tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry, bool ingress)
> +{
> +       ASSERT_RTNL();
> +       if (ingress)
> +               rcu_assign_pointer(dev->tcx_ingress, entry);
> +       else
> +               rcu_assign_pointer(dev->tcx_egress, entry);
> +}
> +
> +static inline struct bpf_mprog_entry *
> +dev_tcx_entry_fetch(struct net_device *dev, bool ingress)
> +{
> +       ASSERT_RTNL();
> +       if (ingress)
> +               return rcu_dereference_rtnl(dev->tcx_ingress);
> +       else
> +               return rcu_dereference_rtnl(dev->tcx_egress);
> +}
> +
> +static inline struct bpf_mprog_entry *
> +dev_tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created)
> +{
> +       struct bpf_mprog_entry *entry = dev_tcx_entry_fetch(dev, ingress);
> +
> +       *created = false;
> +       if (!entry) {
> +               entry = bpf_mprog_create(sizeof_field(struct tcx_entry,
> +                                                     miniq));
> +               if (!entry)
> +                       return NULL;
> +               *created = true;
> +       }
> +       return entry;
> +}
> +
> +static inline void tcx_skeys_inc(bool ingress)
> +{
> +       tcx_inc();
> +       if (ingress)
> +               net_inc_ingress_queue();
> +       else
> +               net_inc_egress_queue();
> +}
> +
> +static inline void tcx_skeys_dec(bool ingress)
> +{
> +       if (ingress)
> +               net_dec_ingress_queue();
> +       else
> +               net_dec_egress_queue();
> +       tcx_dec();
> +}
> +
> +static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb, int code)
> +{
> +       switch (code) {
> +       case TCX_PASS:
> +               skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
> +               fallthrough;
> +       case TCX_DROP:
> +       case TCX_REDIRECT:
> +               return code;
> +       case TCX_NEXT:
> +       default:
> +               return TCX_NEXT;
> +       }
> +}
> +#endif /* CONFIG_NET_XGRESS */
> +
> +#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL)
> +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int tcx_prog_query(const union bpf_attr *attr,
> +                  union bpf_attr __user *uattr);
> +void dev_tcx_uninstall(struct net_device *dev);
> +#else
> +static inline int tcx_prog_attach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int tcx_link_attach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int tcx_prog_detach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int tcx_prog_query(const union bpf_attr *attr,
> +                                union bpf_attr __user *uattr)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline void dev_tcx_uninstall(struct net_device *dev)
> +{
> +}
> +#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */
> +#endif /* __NET_TCX_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 207f8a37b327..e7584e24bc83 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1035,6 +1035,8 @@ enum bpf_attach_type {
>         BPF_TRACE_KPROBE_MULTI,
>         BPF_LSM_CGROUP,
>         BPF_STRUCT_OPS,
> +       BPF_TCX_INGRESS,
> +       BPF_TCX_EGRESS,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1052,7 +1054,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_KPROBE_MULTI = 8,
>         BPF_LINK_TYPE_STRUCT_OPS = 9,
>         BPF_LINK_TYPE_NETFILTER = 10,
> -
> +       BPF_LINK_TYPE_TCX = 11,
>         MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1559,13 +1561,13 @@ union bpf_attr {
>                         __u32           map_fd;         /* struct_ops to attach */
>                 };
>                 union {
> -                       __u32           target_fd;      /* object to attach to */
> -                       __u32           target_ifindex; /* target ifindex */
> +                       __u32   target_fd;      /* target object to attach to or ... */
> +                       __u32   target_ifindex; /* target ifindex */
>                 };
>                 __u32           attach_type;    /* attach type */
>                 __u32           flags;          /* extra flags */
>                 union {
> -                       __u32           target_btf_id;  /* btf_id of target to attach to */
> +                       __u32   target_btf_id;  /* btf_id of target to attach to */
>                         struct {
>                                 __aligned_u64   iter_info;      /* extra bpf_iter_link_info */
>                                 __u32           iter_info_len;  /* iter_info length */
> @@ -1599,6 +1601,13 @@ union bpf_attr {
>                                 __s32           priority;
>                                 __u32           flags;
>                         } netfilter;
> +                       struct {
> +                               union {
> +                                       __u32   relative_fd;
> +                                       __u32   relative_id;
> +                               };
> +                               __u32           expected_revision;
> +                       } tcx;
>                 };
>         } link_create;
>
> @@ -6207,6 +6216,19 @@ struct bpf_sock_tuple {
>         };
>  };
>
> +/* (Simplified) user return codes for tcx prog type.
> + * A valid tcx program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TCX_NEXT.
> + */
> +enum tcx_action_base {
> +       TCX_NEXT        = -1,
> +       TCX_PASS        = 0,
> +       TCX_DROP        = 2,
> +       TCX_REDIRECT    = 7,
> +};
> +
>  struct bpf_xdp_sock {
>         __u32 queue_id;
>  };
> @@ -6459,6 +6481,11 @@ struct bpf_link_info {
>                         __s32 priority;
>                         __u32 flags;
>                 } netfilter;
> +               struct {
> +                       __u32 ifindex;
> +                       __u32 attach_type;
> +                       __u32 flags;
> +               } tcx;
>         };
>  } __attribute__((aligned(8)));
>
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 2dfe1079f772..6a906ff93006 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -31,6 +31,7 @@ config BPF_SYSCALL
>         select TASKS_TRACE_RCU
>         select BINARY_PRINTF
>         select NET_SOCK_MSG if NET
> +       select NET_XGRESS if NET
>         select PAGE_POOL if NET
>         default n
>         help
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 1bea2eb912cd..f526b7573e97 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -21,6 +21,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o
>  obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
>  obj-$(CONFIG_BPF_SYSCALL) += offload.o
>  obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
> +obj-$(CONFIG_BPF_SYSCALL) += tcx.o
>  endif
>  ifeq ($(CONFIG_PERF_EVENTS),y)
>  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 92a57efc77de..e2c219d053f4 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -37,6 +37,8 @@
>  #include <linux/trace_events.h>
>  #include <net/netfilter/nf_bpf_link.h>
>
> +#include <net/tcx.h>
> +
>  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>                           (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
>                           (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> @@ -3522,31 +3524,57 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>                 return BPF_PROG_TYPE_XDP;
>         case BPF_LSM_CGROUP:
>                 return BPF_PROG_TYPE_LSM;
> +       case BPF_TCX_INGRESS:
> +       case BPF_TCX_EGRESS:
> +               return BPF_PROG_TYPE_SCHED_CLS;
>         default:
>                 return BPF_PROG_TYPE_UNSPEC;
>         }
>  }
>
> -#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd
> +#define BPF_PROG_ATTACH_LAST_FIELD expected_revision
> +
> +#define BPF_F_ATTACH_MASK_BASE \
> +       (BPF_F_ALLOW_OVERRIDE | \
> +        BPF_F_ALLOW_MULTI |    \
> +        BPF_F_REPLACE)
> +
> +#define BPF_F_ATTACH_MASK_MPROG        \
> +       (BPF_F_REPLACE |        \
> +        BPF_F_BEFORE |         \
> +        BPF_F_AFTER |          \
> +        BPF_F_FIRST |          \
> +        BPF_F_LAST |           \
> +        BPF_F_ID |             \
> +        BPF_F_LINK)
>
> -#define BPF_F_ATTACH_MASK \
> -       (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE)
> +static bool bpf_supports_mprog(enum bpf_prog_type ptype)
> +{
> +       switch (ptype) {
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               return true;
> +       default:
> +               return false;
> +       }
> +}
>
>  static int bpf_prog_attach(const union bpf_attr *attr)
>  {
>         enum bpf_prog_type ptype;
>         struct bpf_prog *prog;
> +       u32 mask;
>         int ret;
>
>         if (CHECK_ATTR(BPF_PROG_ATTACH))
>                 return -EINVAL;
>
> -       if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
> -               return -EINVAL;
> -
>         ptype = attach_type_to_prog_type(attr->attach_type);
>         if (ptype == BPF_PROG_TYPE_UNSPEC)
>                 return -EINVAL;
> +       mask = bpf_supports_mprog(ptype) ?
> +              BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE;
> +       if (attr->attach_flags & ~mask)
> +               return -EINVAL;
>
>         prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
>         if (IS_ERR(prog))
> @@ -3582,6 +3610,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>                 else
>                         ret = cgroup_bpf_prog_attach(attr, ptype, prog);
>                 break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               ret = tcx_prog_attach(attr, prog);
> +               break;
>         default:
>                 ret = -EINVAL;
>         }
> @@ -3591,25 +3622,42 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>         return ret;
>  }
>
> -#define BPF_PROG_DETACH_LAST_FIELD attach_type
> +#define BPF_PROG_DETACH_LAST_FIELD expected_revision
>
>  static int bpf_prog_detach(const union bpf_attr *attr)
>  {
> +       struct bpf_prog *prog = NULL;
>         enum bpf_prog_type ptype;
> +       int ret;
>
>         if (CHECK_ATTR(BPF_PROG_DETACH))
>                 return -EINVAL;
>
>         ptype = attach_type_to_prog_type(attr->attach_type);
> +       if (bpf_supports_mprog(ptype)) {
> +               if (ptype == BPF_PROG_TYPE_UNSPEC)
> +                       return -EINVAL;
> +               if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG)
> +                       return -EINVAL;
> +               prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
> +               if (IS_ERR(prog)) {
> +                       if ((int)attr->attach_bpf_fd > 0)
> +                               return PTR_ERR(prog);
> +                       prog = NULL;
> +               }
> +       }
>
>         switch (ptype) {
>         case BPF_PROG_TYPE_SK_MSG:
>         case BPF_PROG_TYPE_SK_SKB:
> -               return sock_map_prog_detach(attr, ptype);
> +               ret = sock_map_prog_detach(attr, ptype);
> +               break;
>         case BPF_PROG_TYPE_LIRC_MODE2:
> -               return lirc_prog_detach(attr);
> +               ret = lirc_prog_detach(attr);
> +               break;
>         case BPF_PROG_TYPE_FLOW_DISSECTOR:
> -               return netns_bpf_prog_detach(attr, ptype);
> +               ret = netns_bpf_prog_detach(attr, ptype);
> +               break;
>         case BPF_PROG_TYPE_CGROUP_DEVICE:
>         case BPF_PROG_TYPE_CGROUP_SKB:
>         case BPF_PROG_TYPE_CGROUP_SOCK:
> @@ -3618,13 +3666,21 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>         case BPF_PROG_TYPE_CGROUP_SYSCTL:
>         case BPF_PROG_TYPE_SOCK_OPS:
>         case BPF_PROG_TYPE_LSM:
> -               return cgroup_bpf_prog_detach(attr, ptype);
> +               ret = cgroup_bpf_prog_detach(attr, ptype);
> +               break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               ret = tcx_prog_detach(attr, prog);
> +               break;
>         default:
> -               return -EINVAL;
> +               ret = -EINVAL;
>         }
> +
> +       if (prog)
> +               bpf_prog_put(prog);
> +       return ret;
>  }
>
> -#define BPF_PROG_QUERY_LAST_FIELD query.prog_attach_flags
> +#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags
>
>  static int bpf_prog_query(const union bpf_attr *attr,
>                           union bpf_attr __user *uattr)
> @@ -3672,6 +3728,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
>         case BPF_SK_MSG_VERDICT:
>         case BPF_SK_SKB_VERDICT:
>                 return sock_map_bpf_prog_query(attr, uattr);
> +       case BPF_TCX_INGRESS:
> +       case BPF_TCX_EGRESS:
> +               return tcx_prog_query(attr, uattr);
>         default:
>                 return -EINVAL;
>         }
> @@ -4629,6 +4688,13 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>                         goto out;
>                 }
>                 break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               if (attr->link_create.attach_type != BPF_TCX_INGRESS &&
> +                   attr->link_create.attach_type != BPF_TCX_EGRESS) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +               break;
>         default:
>                 ptype = attach_type_to_prog_type(attr->link_create.attach_type);
>                 if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
> @@ -4680,6 +4746,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>         case BPF_PROG_TYPE_XDP:
>                 ret = bpf_xdp_link_attach(attr, prog);
>                 break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               ret = tcx_link_attach(attr, prog);
> +               break;
>         case BPF_PROG_TYPE_NETFILTER:
>                 ret = bpf_nf_link_attach(attr, prog);
>                 break;
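(Illustrative aside: at the raw syscall level, exercising the extended
BPF_PROG_ATTACH path above to attach a SCHED_CLS program on ingress before an
already attached one could look roughly like the sketch below. Field names
follow the kernel-side accesses in tcx_prog_attach() further down; the exact
uapi layout comes from patch 1 of the series and the helper names in the
libbpf patches may differ, so treat this purely as a sketch.)

  // Sketch only; error handling omitted. prog_fd and other_fd are fds of
  // already loaded BPF_PROG_TYPE_SCHED_CLS programs.
  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int tcx_attach_before(int ifindex, int prog_fd, int other_fd)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.target_ifindex    = ifindex;
          attr.attach_bpf_fd     = prog_fd;
          attr.attach_type       = BPF_TCX_INGRESS;
          attr.attach_flags      = BPF_F_BEFORE;
          attr.relative_fd       = other_fd;
          attr.expected_revision = 0; /* assumed: 0 == no revision check */

          return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
  }
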
> diff --git a/kernel/bpf/tcx.c b/kernel/bpf/tcx.c
> new file mode 100644
> index 000000000000..d3d23b4ed4f0
> --- /dev/null
> +++ b/kernel/bpf/tcx.c
> @@ -0,0 +1,347 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/tcx.h>
> +
> +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +       bool created, ingress = attr->attach_type == BPF_TCX_INGRESS;
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_mprog_entry *entry;
> +       struct net_device *dev;
> +       int ret;
> +
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->target_ifindex);
> +       if (!dev) {
> +               ret = -ENODEV;
> +               goto out;
> +       }
> +       entry = dev_tcx_entry_fetch_or_create(dev, ingress, &created);
> +       if (!entry) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +       ret = bpf_mprog_attach(entry, prog, NULL, attr->attach_flags,
> +                              attr->relative_fd, attr->expected_revision);
> +       if (ret >= 0) {
> +               if (ret == BPF_MPROG_SWAP)
> +                       tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
> +               bpf_mprog_commit(entry);
> +               tcx_skeys_inc(ingress);
> +               ret = 0;
> +       } else if (created) {
> +               bpf_mprog_free(entry);
> +       }
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static bool tcx_release_entry(struct bpf_mprog_entry *entry, int code)
> +{
> +       return code == BPF_MPROG_FREE && !tcx_entry(entry)->miniq;
> +}
> +
> +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +       bool tcx_release, ingress = attr->attach_type == BPF_TCX_INGRESS;
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_mprog_entry *entry, *peer;
> +       struct net_device *dev;
> +       int ret;
> +
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->target_ifindex);
> +       if (!dev) {
> +               ret = -ENODEV;
> +               goto out;
> +       }
> +       entry = dev_tcx_entry_fetch(dev, ingress);
> +       if (!entry) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +       ret = bpf_mprog_detach(entry, prog, NULL, attr->attach_flags,
> +                              attr->relative_fd, attr->expected_revision);
> +       if (ret >= 0) {
> +               tcx_release = tcx_release_entry(entry, ret);
> +               peer = tcx_release ? NULL : bpf_mprog_peer(entry);
> +               if (ret == BPF_MPROG_SWAP || ret == BPF_MPROG_FREE)
> +                       tcx_entry_update(dev, peer, ingress);
> +               bpf_mprog_commit(entry);
> +               tcx_skeys_dec(ingress);
> +               if (tcx_release)
> +                       bpf_mprog_free(entry);
> +               ret = 0;
> +       }
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static void tcx_uninstall(struct net_device *dev, bool ingress)
> +{
> +       struct bpf_tuple tuple = {};
> +       struct bpf_mprog_entry *entry;
> +       struct bpf_mprog_fp *fp;
> +       struct bpf_mprog_cp *cp;
> +
> +       entry = dev_tcx_entry_fetch(dev, ingress);
> +       if (!entry)
> +               return;
> +       tcx_entry_update(dev, NULL, ingress);
> +       bpf_mprog_commit(entry);
> +       bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> +               if (tuple.link)
> +                       tcx_link(tuple.link)->dev = NULL;
> +               else
> +                       bpf_prog_put(tuple.prog);
> +               tcx_skeys_dec(ingress);
> +       }
> +       WARN_ON_ONCE(tcx_entry(entry)->miniq);
> +       bpf_mprog_free(entry);
> +}
> +
> +void dev_tcx_uninstall(struct net_device *dev)
> +{
> +       ASSERT_RTNL();
> +       tcx_uninstall(dev, true);
> +       tcx_uninstall(dev, false);
> +}
> +
> +int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> +       bool ingress = attr->query.attach_type == BPF_TCX_INGRESS;
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_mprog_entry *entry;
> +       struct net_device *dev;
> +       int ret;
> +
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->query.target_ifindex);
> +       if (!dev) {
> +               ret = -ENODEV;
> +               goto out;
> +       }
> +       entry = dev_tcx_entry_fetch(dev, ingress);
> +       if (!entry) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +       ret = bpf_mprog_query(attr, uattr, entry);
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static int tcx_link_prog_attach(struct bpf_link *l, u32 flags, u32 object,
> +                               u32 expected_revision)
> +{
> +       struct tcx_link *link = tcx_link(l);
> +       bool created, ingress = link->location == BPF_TCX_INGRESS;
> +       struct net_device *dev = link->dev;
> +       struct bpf_mprog_entry *entry;
> +       int ret;
> +
> +       ASSERT_RTNL();
> +       entry = dev_tcx_entry_fetch_or_create(dev, ingress, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       ret = bpf_mprog_attach(entry, l->prog, l, flags, object,
> +                              expected_revision);
> +       if (ret >= 0) {
> +               if (ret == BPF_MPROG_SWAP)
> +                       tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
> +               bpf_mprog_commit(entry);
> +               tcx_skeys_inc(ingress);
> +               ret = 0;
> +       } else if (created) {
> +               bpf_mprog_free(entry);
> +       }
> +       return ret;
> +}
> +
> +static void tcx_link_release(struct bpf_link *l)
> +{
> +       struct tcx_link *link = tcx_link(l);
> +       bool tcx_release, ingress = link->location == BPF_TCX_INGRESS;
> +       struct bpf_mprog_entry *entry, *peer;
> +       struct net_device *dev;
> +       int ret = 0;
> +
> +       rtnl_lock();
> +       dev = link->dev;
> +       if (!dev)
> +               goto out;
> +       entry = dev_tcx_entry_fetch(dev, ingress);
> +       if (!entry) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +       ret = bpf_mprog_detach(entry, l->prog, l, link->flags, 0, 0);
> +       if (ret >= 0) {
> +               tcx_release = tcx_release_entry(entry, ret);
> +               peer = tcx_release ? NULL : bpf_mprog_peer(entry);
> +               if (ret == BPF_MPROG_SWAP || ret == BPF_MPROG_FREE)
> +                       tcx_entry_update(dev, peer, ingress);
> +               bpf_mprog_commit(entry);
> +               tcx_skeys_dec(ingress);
> +               if (tcx_release)
> +                       bpf_mprog_free(entry);
> +               link->dev = NULL;
> +               ret = 0;
> +       }
> +out:
> +       WARN_ON_ONCE(ret);
> +       rtnl_unlock();
> +}
> +
> +static int tcx_link_update(struct bpf_link *l, struct bpf_prog *nprog,
> +                          struct bpf_prog *oprog)
> +{
> +       struct tcx_link *link = tcx_link(l);
> +       bool ingress = link->location == BPF_TCX_INGRESS;
> +       struct net_device *dev = link->dev;
> +       struct bpf_mprog_entry *entry;
> +       int ret = 0;
> +
> +       rtnl_lock();
> +       if (!link->dev) {
> +               ret = -ENOLINK;
> +               goto out;
> +       }
> +       if (oprog && l->prog != oprog) {
> +               ret = -EPERM;
> +               goto out;
> +       }
> +       oprog = l->prog;
> +       if (oprog == nprog) {
> +               bpf_prog_put(nprog);
> +               goto out;
> +       }
> +       entry = dev_tcx_entry_fetch(dev, ingress);
> +       if (!entry) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +       ret = bpf_mprog_attach(entry, nprog, l,
> +                              BPF_F_REPLACE | BPF_F_ID | link->flags,
> +                              l->prog->aux->id, 0);
> +       if (ret >= 0) {
> +               if (ret == BPF_MPROG_SWAP)
> +                       tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
> +               bpf_mprog_commit(entry);
> +               tcx_skeys_inc(ingress);
> +               oprog = xchg(&l->prog, nprog);
> +               bpf_prog_put(oprog);
> +               ret = 0;
> +       }
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static void tcx_link_dealloc(struct bpf_link *l)
> +{
> +       kfree(tcx_link(l));
> +}
> +
> +static void tcx_link_fdinfo(const struct bpf_link *l, struct seq_file *seq)
> +{
> +       const struct tcx_link *link = tcx_link_const(l);
> +       u32 ifindex = 0;
> +
> +       rtnl_lock();
> +       if (link->dev)
> +               ifindex = link->dev->ifindex;
> +       rtnl_unlock();
> +
> +       seq_printf(seq, "ifindex:\t%u\n", ifindex);
> +       seq_printf(seq, "attach_type:\t%u (%s)\n",
> +                  link->location,
> +                  link->location == BPF_TCX_INGRESS ? "ingress" : "egress");
> +       seq_printf(seq, "flags:\t%u\n", link->flags);
> +}
> +
> +static int tcx_link_fill_info(const struct bpf_link *l,
> +                             struct bpf_link_info *info)
> +{
> +       const struct tcx_link *link = tcx_link_const(l);
> +       u32 ifindex = 0;
> +
> +       rtnl_lock();
> +       if (link->dev)
> +               ifindex = link->dev->ifindex;
> +       rtnl_unlock();
> +
> +       info->tcx.ifindex = ifindex;
> +       info->tcx.attach_type = link->location;
> +       info->tcx.flags = link->flags;
> +       return 0;
> +}
> +
> +static int tcx_link_detach(struct bpf_link *l)
> +{
> +       tcx_link_release(l);
> +       return 0;
> +}
> +
> +static const struct bpf_link_ops tcx_link_lops = {
> +       .release        = tcx_link_release,
> +       .detach         = tcx_link_detach,
> +       .dealloc        = tcx_link_dealloc,
> +       .update_prog    = tcx_link_update,
> +       .show_fdinfo    = tcx_link_fdinfo,
> +       .fill_link_info = tcx_link_fill_info,
> +};
> +
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_link_primer link_primer;
> +       struct net_device *dev;
> +       struct tcx_link *link;
> +       int fd, err;
> +
> +       dev = dev_get_by_index(net, attr->link_create.target_ifindex);
> +       if (!dev)
> +               return -EINVAL;
> +       link = kzalloc(sizeof(*link), GFP_USER);
> +       if (!link) {
> +               err = -ENOMEM;
> +               goto out_put;
> +       }
> +
> +       bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
> +       link->location = attr->link_create.attach_type;
> +       link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST);
> +       link->dev = dev;
> +
> +       err = bpf_link_prime(&link->link, &link_primer);
> +       if (err) {
> +               kfree(link);
> +               goto out_put;
> +       }
> +       rtnl_lock();
> +       err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
> +                                  attr->link_create.tcx.relative_fd,
> +                                  attr->link_create.tcx.expected_revision);
> +       if (!err)
> +               fd = bpf_link_settle(&link_primer);
> +       rtnl_unlock();
> +       if (err) {
> +               link->dev = NULL;
> +               bpf_link_cleanup(&link_primer);
> +               goto out_put;
> +       }
> +       dev_put(dev);
> +       return fd;
> +out_put:
> +       dev_put(dev);
> +       return err;
> +}
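(Illustrative aside: a rough user-space sketch of creating such a link via the
raw syscall, with field names following the link_create.tcx layout from the
uapi hunk above; the libbpf wrappers added later in the series are the
intended interface, so treat this purely as a sketch.)

  // Sketch only; error handling omitted. prog_fd is a loaded
  // BPF_PROG_TYPE_SCHED_CLS program.
  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int tcx_link_create(int ifindex, int prog_fd)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.link_create.prog_fd        = prog_fd;
          attr.link_create.target_ifindex = ifindex;
          attr.link_create.attach_type    = BPF_TCX_INGRESS;
          attr.link_create.flags          = 0; /* or e.g. BPF_F_FIRST */
          attr.link_create.tcx.expected_revision = 0;

          /* Returns a link fd; the program stays attached as long as the
           * link exists, see tcx_link_release() above.
           */
          return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
  }
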
> diff --git a/net/Kconfig b/net/Kconfig
> index 2fb25b534df5..d532ec33f1fe 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -52,6 +52,11 @@ config NET_INGRESS
>  config NET_EGRESS
>         bool
>
> +config NET_XGRESS
> +       select NET_INGRESS
> +       select NET_EGRESS
> +       bool
> +
>  config NET_REDIRECT
>         bool
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 3393c2f3dbe8..95c7e3189884 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -107,6 +107,7 @@
>  #include <net/pkt_cls.h>
>  #include <net/checksum.h>
>  #include <net/xfrm.h>
> +#include <net/tcx.h>
>  #include <linux/highmem.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
> @@ -154,7 +155,6 @@
>  #include "dev.h"
>  #include "net-sysfs.h"
>
> -
>  static DEFINE_SPINLOCK(ptype_lock);
>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>  struct list_head ptype_all __read_mostly;      /* Taps */
> @@ -3923,69 +3923,200 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
>  EXPORT_SYMBOL(dev_loopback_xmit);
>
>  #ifdef CONFIG_NET_EGRESS
> -static struct sk_buff *
> -sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +static struct netdev_queue *
> +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> +{
> +       int qm = skb_get_queue_mapping(skb);
> +
> +       return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> +}
> +
> +static bool netdev_xmit_txqueue_skipped(void)
>  {
> +       return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +}
> +
> +void netdev_xmit_skip_txqueue(bool skip)
> +{
> +       __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +}
> +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> +#endif /* CONFIG_NET_EGRESS */
> +
> +#ifdef CONFIG_NET_XGRESS
> +static int tc_run(struct tcx_entry *entry, struct sk_buff *skb)
> +{
> +       int ret = TC_ACT_UNSPEC;
>  #ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
> -       struct tcf_result cl_res;
> +       struct mini_Qdisc *miniq = rcu_dereference_bh(entry->miniq);
> +       struct tcf_result res;
>
>         if (!miniq)
> -               return skb;
> +               return ret;
>
> -       /* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
>         tc_skb_cb(skb)->mru = 0;
>         tc_skb_cb(skb)->post_ct = false;
> -       mini_qdisc_bstats_cpu_update(miniq, skb);
>
> -       switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> +       mini_qdisc_bstats_cpu_update(miniq, skb);
> +       ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
> +       /* Only tcf related quirks below. */
> +       switch (ret) {
> +       case TC_ACT_SHOT:
> +               mini_qdisc_qstats_cpu_drop(miniq);
> +               break;
>         case TC_ACT_OK:
>          case TC_ACT_RECLASSIFY:
> -               skb->tc_index = TC_H_MIN(cl_res.classid);
> +               skb->tc_index = TC_H_MIN(res.classid);
>                 break;
> +       }
> +#endif /* CONFIG_NET_CLS_ACT */
> +       return ret;
> +}
> +
> +static DEFINE_STATIC_KEY_FALSE(tcx_needed_key);
> +
> +void tcx_inc(void)
> +{
> +       static_branch_inc(&tcx_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(tcx_inc);
> +
> +void tcx_dec(void)
> +{
> +       static_branch_dec(&tcx_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(tcx_dec);
> +
> +static __always_inline enum tcx_action_base
> +tcx_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
> +       const bool needs_mac)
> +{
> +       const struct bpf_mprog_fp *fp;
> +       const struct bpf_prog *prog;
> +       int ret = TCX_NEXT;
> +
> +       if (needs_mac)
> +               __skb_push(skb, skb->mac_len);
> +       bpf_mprog_foreach_prog(entry, fp, prog) {
> +               bpf_compute_data_pointers(skb);
> +               ret = bpf_prog_run(prog, skb);
> +               if (ret != TCX_NEXT)
> +                       break;
> +       }
> +       if (needs_mac)
> +               __skb_pull(skb, skb->mac_len);
> +       return tcx_action_code(skb, ret);
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> +                  struct net_device *orig_dev, bool *another)
> +{
> +       struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
> +       int sch_ret;
> +
> +       if (!entry)
> +               return skb;
> +       if (*pt_prev) {
> +               *ret = deliver_skb(skb, *pt_prev, orig_dev);
> +               *pt_prev = NULL;
> +       }
> +
> +       qdisc_skb_cb(skb)->pkt_len = skb->len;
> +       tcx_set_ingress(skb, true);
> +
> +       if (static_branch_unlikely(&tcx_needed_key)) {
> +               sch_ret = tcx_run(entry, skb, true);
> +               if (sch_ret != TC_ACT_UNSPEC)
> +                       goto ingress_verdict;
> +       }
> +       sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb);

This...
Essentially, if we have a mix of tc and tcx, tcx gets to run
first/faster. It would be fairer to have tc in front.
Same for egress...
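
For illustration only, swapping the order on ingress (using the fact that
TCX_NEXT and TC_ACT_UNSPEC are both -1) could look roughly like:

          /* Sketch: give the existing tcf_classify() path the first shot and
           * only run the BPF prog array when it did not return a verdict.
           */
          sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb);
          if (sch_ret != TC_ACT_UNSPEC)
                  goto ingress_verdict;
          if (static_branch_unlikely(&tcx_needed_key))
                  sch_ret = tcx_run(entry, skb, true);
  ingress_verdict:
          ...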

cheers,
jamal

> +ingress_verdict:
> +       switch (sch_ret) {
> +       case TC_ACT_REDIRECT:
> +               /* skb_mac_header check was done by BPF, so we can safely
> +                * push the L2 header back before redirecting to another
> +                * netdev.
> +                */
> +               __skb_push(skb, skb->mac_len);
> +               if (skb_do_redirect(skb) == -EAGAIN) {
> +                       __skb_pull(skb, skb->mac_len);
> +                       *another = true;
> +                       break;
> +               }
> +               *ret = NET_RX_SUCCESS;
> +               return NULL;
>         case TC_ACT_SHOT:
> -               mini_qdisc_qstats_cpu_drop(miniq);
> -               *ret = NET_XMIT_DROP;
> -               kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +               kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> +               *ret = NET_RX_DROP;
>                 return NULL;
> +       /* used by tc_run */
>         case TC_ACT_STOLEN:
>         case TC_ACT_QUEUED:
>         case TC_ACT_TRAP:
> -               *ret = NET_XMIT_SUCCESS;
>                 consume_skb(skb);
> +               fallthrough;
> +       case TC_ACT_CONSUMED:
> +               *ret = NET_RX_SUCCESS;
>                 return NULL;
> +       }
> +
> +       return skb;
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +{
> +       struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
> +       int sch_ret;
> +
> +       if (!entry)
> +               return skb;
> +
> +       /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
> +        * already set by the caller.
> +        */
> +       if (static_branch_unlikely(&tcx_needed_key)) {
> +               sch_ret = tcx_run(entry, skb, false);
> +               if (sch_ret != TC_ACT_UNSPEC)
> +                       goto egress_verdict;
> +       }
> +       sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb);
> +egress_verdict:
> +       switch (sch_ret) {
>         case TC_ACT_REDIRECT:
>                 /* No need to push/pop skb's mac_header here on egress! */
>                 skb_do_redirect(skb);
>                 *ret = NET_XMIT_SUCCESS;
>                 return NULL;
> -       default:
> -               break;
> +       case TC_ACT_SHOT:
> +               kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +               *ret = NET_XMIT_DROP;
> +               return NULL;
> +       /* used by tc_run */
> +       case TC_ACT_STOLEN:
> +       case TC_ACT_QUEUED:
> +       case TC_ACT_TRAP:
> +               *ret = NET_XMIT_SUCCESS;
> +               return NULL;
>         }
> -#endif /* CONFIG_NET_CLS_ACT */
>
>         return skb;
>  }
> -
> -static struct netdev_queue *
> -netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> -{
> -       int qm = skb_get_queue_mapping(skb);
> -
> -       return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> -}
> -
> -static bool netdev_xmit_txqueue_skipped(void)
> +#else
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> +                  struct net_device *orig_dev, bool *another)
>  {
> -       return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +       return skb;
>  }
>
> -void netdev_xmit_skip_txqueue(bool skip)
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  {
> -       __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +       return skb;
>  }
> -EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> -#endif /* CONFIG_NET_EGRESS */
> +#endif /* CONFIG_NET_XGRESS */
>
>  #ifdef CONFIG_XPS
>  static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
> @@ -4169,9 +4300,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>         skb_update_prio(skb);
>
>         qdisc_pkt_len_init(skb);
> -#ifdef CONFIG_NET_CLS_ACT
> -       skb->tc_at_ingress = 0;
> -#endif
> +       tcx_set_ingress(skb, false);
>  #ifdef CONFIG_NET_EGRESS
>         if (static_branch_unlikely(&egress_needed_key)) {
>                 if (nf_hook_egress_active()) {
> @@ -5103,72 +5232,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
>  EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
>  #endif
>
> -static inline struct sk_buff *
> -sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> -                  struct net_device *orig_dev, bool *another)
> -{
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
> -       struct tcf_result cl_res;
> -
> -       /* If there's at least one ingress present somewhere (so
> -        * we get here via enabled static key), remaining devices
> -        * that are not configured with an ingress qdisc will bail
> -        * out here.
> -        */
> -       if (!miniq)
> -               return skb;
> -
> -       if (*pt_prev) {
> -               *ret = deliver_skb(skb, *pt_prev, orig_dev);
> -               *pt_prev = NULL;
> -       }
> -
> -       qdisc_skb_cb(skb)->pkt_len = skb->len;
> -       tc_skb_cb(skb)->mru = 0;
> -       tc_skb_cb(skb)->post_ct = false;
> -       skb->tc_at_ingress = 1;
> -       mini_qdisc_bstats_cpu_update(miniq, skb);
> -
> -       switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> -       case TC_ACT_OK:
> -       case TC_ACT_RECLASSIFY:
> -               skb->tc_index = TC_H_MIN(cl_res.classid);
> -               break;
> -       case TC_ACT_SHOT:
> -               mini_qdisc_qstats_cpu_drop(miniq);
> -               kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> -               *ret = NET_RX_DROP;
> -               return NULL;
> -       case TC_ACT_STOLEN:
> -       case TC_ACT_QUEUED:
> -       case TC_ACT_TRAP:
> -               consume_skb(skb);
> -               *ret = NET_RX_SUCCESS;
> -               return NULL;
> -       case TC_ACT_REDIRECT:
> -               /* skb_mac_header check was done by cls/act_bpf, so
> -                * we can safely push the L2 header back before
> -                * redirecting to another netdev
> -                */
> -               __skb_push(skb, skb->mac_len);
> -               if (skb_do_redirect(skb) == -EAGAIN) {
> -                       __skb_pull(skb, skb->mac_len);
> -                       *another = true;
> -                       break;
> -               }
> -               *ret = NET_RX_SUCCESS;
> -               return NULL;
> -       case TC_ACT_CONSUMED:
> -               *ret = NET_RX_SUCCESS;
> -               return NULL;
> -       default:
> -               break;
> -       }
> -#endif /* CONFIG_NET_CLS_ACT */
> -       return skb;
> -}
> -
>  /**
>   *     netdev_is_rx_handler_busy - check if receive handler is registered
>   *     @dev: device to check
> @@ -10873,7 +10936,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
>
>                 /* Shutdown queueing discipline. */
>                 dev_shutdown(dev);
> -
> +               dev_tcx_uninstall(dev);
>                 dev_xdp_uninstall(dev);
>                 bpf_dev_bound_netdev_unregister(dev);
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index d25d52854c21..1ff9a0988ea6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9233,7 +9233,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
>         __u8 value_reg = si->dst_reg;
>         __u8 skb_reg = si->src_reg;
>
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         /* If the tstamp_type is read,
>          * the bpf prog is aware the tstamp could have delivery time.
>          * Thus, read skb->tstamp as is if tstamp_type_access is true.
> @@ -9267,7 +9267,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
>         __u8 value_reg = si->src_reg;
>         __u8 skb_reg = si->dst_reg;
>
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         /* If the tstamp_type is read,
>          * the bpf prog is aware the tstamp could have delivery time.
>          * Thus, write skb->tstamp as is if tstamp_type_access is true.
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 4b95cb1ac435..470c70deffe2 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -347,8 +347,7 @@ config NET_SCH_FQ_PIE
>  config NET_SCH_INGRESS
>         tristate "Ingress/classifier-action Qdisc"
>         depends on NET_CLS_ACT
> -       select NET_INGRESS
> -       select NET_EGRESS
> +       select NET_XGRESS
>         help
>           Say Y here if you want to use classifiers for incoming and/or outgoing
>           packets. This qdisc doesn't do anything else besides running classifiers,
> @@ -679,6 +678,7 @@ config NET_EMATCH_IPT
>  config NET_CLS_ACT
>         bool "Actions"
>         select NET_CLS
> +       select NET_XGRESS
>         help
>           Say Y here if you want to use traffic control actions. Actions
>           get attached to classifiers and are invoked after a successful
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..4af1360f537e 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -13,6 +13,7 @@
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
>  #include <net/pkt_cls.h>
> +#include <net/tcx.h>
>
>  struct ingress_sched_data {
>         struct tcf_block *block;
> @@ -78,11 +79,18 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>  {
>         struct ingress_sched_data *q = qdisc_priv(sch);
>         struct net_device *dev = qdisc_dev(sch);
> +       struct bpf_mprog_entry *entry;
> +       bool created;
>         int err;
>
>         net_inc_ingress_queue();
>
> -       mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> +       entry = dev_tcx_entry_fetch_or_create(dev, true, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq);
> +       if (created)
> +               tcx_entry_update(dev, entry, true);
>
>         q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>         q->block_info.chain_head_change = clsact_chain_head_change;
> @@ -93,15 +101,20 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>                 return err;
>
>         mini_qdisc_pair_block_init(&q->miniqp, q->block);
> -
>         return 0;
>  }
>
>  static void ingress_destroy(struct Qdisc *sch)
>  {
>         struct ingress_sched_data *q = qdisc_priv(sch);
> +       struct net_device *dev = qdisc_dev(sch);
> +       struct bpf_mprog_entry *entry = rtnl_dereference(dev->tcx_ingress);
>
>         tcf_block_put_ext(q->block, sch, &q->block_info);
> +       if (entry && !bpf_mprog_total(entry)) {
> +               tcx_entry_update(dev, NULL, true);
> +               bpf_mprog_free(entry);
> +       }
>         net_dec_ingress_queue();
>  }
>
> @@ -217,12 +230,19 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  {
>         struct clsact_sched_data *q = qdisc_priv(sch);
>         struct net_device *dev = qdisc_dev(sch);
> +       struct bpf_mprog_entry *entry;
> +       bool created;
>         int err;
>
>         net_inc_ingress_queue();
>         net_inc_egress_queue();
>
> -       mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
> +       entry = dev_tcx_entry_fetch_or_create(dev, true, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq);
> +       if (created)
> +               tcx_entry_update(dev, entry, true);
>
>         q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>         q->ingress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -235,7 +255,12 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>
>         mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
>
> -       mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
> +       entry = dev_tcx_entry_fetch_or_create(dev, false, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq);
> +       if (created)
> +               tcx_entry_update(dev, entry, false);
>
>         q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
>         q->egress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -247,9 +272,21 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  static void clsact_destroy(struct Qdisc *sch)
>  {
>         struct clsact_sched_data *q = qdisc_priv(sch);
> +       struct net_device *dev = qdisc_dev(sch);
> +       struct bpf_mprog_entry *ingress_entry = rtnl_dereference(dev->tcx_ingress);
> +       struct bpf_mprog_entry *egress_entry = rtnl_dereference(dev->tcx_egress);
>
>         tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
> +       if (egress_entry && !bpf_mprog_total(egress_entry)) {
> +               tcx_entry_update(dev, NULL, false);
> +               bpf_mprog_free(egress_entry);
> +       }
> +
>         tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
> +       if (ingress_entry && !bpf_mprog_total(ingress_entry)) {
> +               tcx_entry_update(dev, NULL, true);
> +               bpf_mprog_free(ingress_entry);
> +       }
>
>         net_dec_ingress_queue();
>         net_dec_egress_queue();
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 207f8a37b327..e7584e24bc83 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1035,6 +1035,8 @@ enum bpf_attach_type {
>         BPF_TRACE_KPROBE_MULTI,
>         BPF_LSM_CGROUP,
>         BPF_STRUCT_OPS,
> +       BPF_TCX_INGRESS,
> +       BPF_TCX_EGRESS,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1052,7 +1054,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_KPROBE_MULTI = 8,
>         BPF_LINK_TYPE_STRUCT_OPS = 9,
>         BPF_LINK_TYPE_NETFILTER = 10,
> -
> +       BPF_LINK_TYPE_TCX = 11,
>         MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1559,13 +1561,13 @@ union bpf_attr {
>                         __u32           map_fd;         /* struct_ops to attach */
>                 };
>                 union {
> -                       __u32           target_fd;      /* object to attach to */
> -                       __u32           target_ifindex; /* target ifindex */
> +                       __u32   target_fd;      /* target object to attach to or ... */
> +                       __u32   target_ifindex; /* target ifindex */
>                 };
>                 __u32           attach_type;    /* attach type */
>                 __u32           flags;          /* extra flags */
>                 union {
> -                       __u32           target_btf_id;  /* btf_id of target to attach to */
> +                       __u32   target_btf_id;  /* btf_id of target to attach to */
>                         struct {
>                                 __aligned_u64   iter_info;      /* extra bpf_iter_link_info */
>                                 __u32           iter_info_len;  /* iter_info length */
> @@ -1599,6 +1601,13 @@ union bpf_attr {
>                                 __s32           priority;
>                                 __u32           flags;
>                         } netfilter;
> +                       struct {
> +                               union {
> +                                       __u32   relative_fd;
> +                                       __u32   relative_id;
> +                               };
> +                               __u32           expected_revision;
> +                       } tcx;
>                 };
>         } link_create;
>
> @@ -6207,6 +6216,19 @@ struct bpf_sock_tuple {
>         };
>  };
>
> +/* (Simplified) user return codes for tcx prog type.
> + * A valid tcx program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TCX_NEXT.
> + */
> +enum tcx_action_base {
> +       TCX_NEXT        = -1,
> +       TCX_PASS        = 0,
> +       TCX_DROP        = 2,
> +       TCX_REDIRECT    = 7,
> +};
> +
>  struct bpf_xdp_sock {
>         __u32 queue_id;
>  };
> @@ -6459,6 +6481,11 @@ struct bpf_link_info {
>                         __s32 priority;
>                         __u32 flags;
>                 } netfilter;
> +               struct {
> +                       __u32 ifindex;
> +                       __u32 attach_type;
> +                       __u32 flags;
> +               } tcx;
>         };
>  } __attribute__((aligned(8)));
>
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-08  1:25   ` Jamal Hadi Salim
@ 2023-06-08 10:11     ` Daniel Borkmann
  2023-06-08 19:46       ` Jamal Hadi Salim
  0 siblings, 1 reply; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-08 10:11 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu,
	joe, toke, davem, bpf, netdev

Hi Jamal,

On 6/8/23 3:25 AM, Jamal Hadi Salim wrote:
[...]
> A general question (which I think I asked last time as well): who
> decides what comes after/before what prog in this setup? And would
> that same entity not have been able to make the same decision using tc
> priorities?

Back in the first version of the series I initially coded up an option where
tc_run() would basically be a fake 'bpf_prog' with, say, a fixed prio of 1000.
It would get executed via tcx_run() when iterating via
bpf_mprog_foreach_prog() where bpf_prog_run() is called, and users could then
pick a native BPF prio before or after that. But then the feedback was that
sticking to prio is a bad user experience, which led to the development of
what is in patch 1 of this series (see the details there).

> The idea of protecting programs from being unloaded is very welcome
> but feels like it would have made sense to be a separate patchset (we have
> good need for it). Would it be possible to use that feature in tc and
> xdp?

BPF links are supported for XDP today; tc BPF is just one of the few
remainders where that is not the case, hence the work in this series. What
XDP lacks today, however, is multi-prog support. With the bpf_mprog concept,
that could be addressed via the same common/uniform API (and Andrii expressed
interest in integrating this also for cgroup progs), so yes, various hook
points/program types could benefit from it.

>> +struct tcx_entry {
>> +       struct bpf_mprog_bundle         bundle;
>> +       struct mini_Qdisc __rcu         *miniq;
>> +};
>> +
> 
> Can you please move miniq to the front? From where i sit this looks:
> struct tcx_entry {
>          struct bpf_mprog_bundle    bundle
> __attribute__((__aligned__(64))); /*     0  3264 */
> 
>          /* XXX last struct has 36 bytes of padding */
> 
>          /* --- cacheline 51 boundary (3264 bytes) --- */
>          struct mini_Qdisc *        miniq;                /*  3264     8 */
> 
>          /* size: 3328, cachelines: 52, members: 2 */
>          /* padding: 56 */
>          /* paddings: 1, sum paddings: 36 */
>          /* forced alignments: 1 */
> } __attribute__((__aligned__(64)));
> 
> That is a _lot_ of cachelines - at the expense of the status quo
> clsact/ingress qdiscs which access miniq.

Ah yes, I'll fix this up.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-07 19:26 ` [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs Daniel Borkmann
@ 2023-06-08 17:23   ` Stanislav Fomichev
  2023-06-08 20:59     ` Andrii Nakryiko
  2023-06-08 20:53   ` Andrii Nakryiko
  1 sibling, 1 reply; 49+ messages in thread
From: Stanislav Fomichev @ 2023-06-08 17:23 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev

On 06/07, Daniel Borkmann wrote:
> This adds a generic layer called bpf_mprog which can be reused by different
> attachment layers to enable multi-program attachment and dependency resolution.
> In-kernel users of bpf_mprog don't need to care about the dependency
> resolution internals; they can just consume it with a few API calls.
> 
> The initial idea of having a generic API sparked from a discussion [0] on an
> earlier revision of this work where tc's priority was reused and exposed via
> BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> to classic tc BPF. The feedback was that priority provides a bad user
> experience and is hard to use [1], e.g.:
> 
>   I cannot help but feel that priority logic copy-paste from old tc, netfilter
>   and friends is done because "that's how things were done in the past". [...]
>   Priority gets exposed everywhere in uapi all the way to bpftool when it's
>   right there for users to understand. And that's the main problem with it.
> 
>   The user don't want to and don't need to be aware of it, but uapi forces them
>   to pick the priority. [...] Your cover letter [0] example proves that in
>   real life different service pick the same priority. They simply don't know
>   any better. Priority is an unnecessary magic that apps _have_ to pick, so
>   they just copy-paste and everyone ends up using the same.
> 
> The course of the discussion increasingly showed the need for a generic,
> reusable API where the "same look and feel" can be applied to various other
> program types beyond just tc BPF. For example, XDP today does not have
> multi-program support in the kernel, and there was also interest in this API
> for improving management of cgroup program types. Such a common multi-program
> management concept is useful for BPF management daemons or user-space BPF
> applications coordinating their attachments.
> 
> From both the Cilium and Meta side [2], we've collected the following requirements
> for a generic attach/detach/query API for multi-progs which has been implemented
> as part of this work:
> 
>   - Support prog-based attach/detach and link API
>   - Dependency directives (can also be combined):
>     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
>       - BPF_F_ID flag as {fd,id} toggle
>       - BPF_F_LINK flag as {prog,link} toggle
>       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
>         BPF_F_AFTER will just append for the case of attaching
>       - Enforced only at attach time
>     - BPF_F_{FIRST,LAST}
>       - Enforced throughout the bpf_mprog state's lifetime
>       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
>   - Internal revision counter and optionally being able to pass expected_revision
>   - User space daemon can query current state with revision, and pass it along
>     for attachment to assert current state before doing updates
>   - Query also gets extension for link_ids array and link_attach_flags:
>     - prog_ids are always filled with program IDs
>     - link_ids are filled with link IDs when link was used, otherwise 0
>     - {prog,link}_attach_flags for holding {prog,link}-specific flags
>   - Must be easy to integrate/reuse for in-kernel users
> 
> The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> consisting of the addition of the attachment flags and revision counter, and
> of expanding the existing union with a relative_{fd,id} member.
> 
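
As an illustration of the intended query/attach workflow (not part of the
patch itself), here is a rough user-space sketch against the new bpf_attr
fields; it assumes a kernel with this series applied plus the tcx attach type
from the next patch, an updated uapi header, pre-existing ifindex and prog
fds, and omits error handling:

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int example_attach(int ifindex, int prog_fd, int relative_fd)
  {
          union bpf_attr attr;
          __u32 revision;

          /* Query current state to learn the revision ... */
          memset(&attr, 0, sizeof(attr));
          attr.query.target_ifindex = ifindex;
          attr.query.attach_type = BPF_TCX_INGRESS;
          if (syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr)) < 0)
                  return -1;
          revision = attr.query.revision;

          /* ... then attach before relative_fd, asserting that revision. */
          memset(&attr, 0, sizeof(attr));
          attr.target_ifindex = ifindex;
          attr.attach_bpf_fd = prog_fd;
          attr.attach_type = BPF_TCX_INGRESS;
          attr.attach_flags = BPF_F_BEFORE;
          attr.relative_fd = relative_fd;
          attr.expected_revision = revision;
          return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
  }
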
> The bpf_mprog framework consists of a bpf_mprog_entry object which holds
> an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> structure). Both have been separated, so that fast-path gets efficient packing
> of bpf_prog pointers for maximum cache efficiency. Also, an array has been chosen
> instead of linked list or other structures to remove unnecessary indirections
> for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> is populated and then just swapped which avoids additional allocations that
> could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> currently static, but they could be converted to dynamic allocation if necessary
> at a point in the future. Locking is deferred to the in-kernel user of
> bpf_mprog; for example, tcx, which uses this API in the next patch, piggybacks
> on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> add,del} implementation, and an extensive test suite checking all aspects of
> this API for both prog-based attach/detach and the link API is included as
> BPF selftests in this series.
> 
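
To make the a/b swap and commit protocol above concrete, a rough sketch of
how an in-kernel user is expected to drive it; this is a hypothetical caller,
the hook->entry pointer is made up, and locking is the caller's
responsibility (tcx in the next patch piggybacks on rtnl for this):

  ret = bpf_mprog_attach(entry, prog, NULL, flags, relative_fd, revision);
  if (ret >= 0) {
          /* Attach populated the peer entry, so publish it on swap. */
          if (ret == BPF_MPROG_SWAP)
                  rcu_assign_pointer(hook->entry, bpf_mprog_peer(entry));
          /* Bump revision, wait for readers, drop a marked prog if any. */
          bpf_mprog_commit(entry);
          ret = 0;
  }
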
> Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
> 
>   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
>   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
>   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  MAINTAINERS                    |   1 +
>  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
>  include/uapi/linux/bpf.h       |  37 ++-
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  37 ++-
>  6 files changed, 781 insertions(+), 17 deletions(-)
>  create mode 100644 include/linux/bpf_mprog.h
>  create mode 100644 kernel/bpf/mprog.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c904dba1733b..754a9eeca0a1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3733,6 +3733,7 @@ F:	include/linux/filter.h
>  F:	include/linux/tnum.h
>  F:	kernel/bpf/core.c
>  F:	kernel/bpf/dispatcher.c
> +F:	kernel/bpf/mprog.c
>  F:	kernel/bpf/syscall.c
>  F:	kernel/bpf/tnum.c
>  F:	kernel/bpf/trampoline.c
> diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h
> new file mode 100644
> index 000000000000..7399181d8e6c
> --- /dev/null
> +++ b/include/linux/bpf_mprog.h
> @@ -0,0 +1,245 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __BPF_MPROG_H
> +#define __BPF_MPROG_H
> +
> +#include <linux/bpf.h>
> +
> +#define BPF_MPROG_MAX	64
> +#define BPF_MPROG_SWAP	1
> +#define BPF_MPROG_FREE	2
> +
> +struct bpf_mprog_fp {
> +	struct bpf_prog *prog;
> +};
> +
> +struct bpf_mprog_cp {
> +	struct bpf_link *link;
> +	u32 flags;
> +};
> +
> +struct bpf_mprog_entry {
> +	struct bpf_mprog_fp fp_items[BPF_MPROG_MAX] ____cacheline_aligned;
> +	struct bpf_mprog_cp cp_items[BPF_MPROG_MAX] ____cacheline_aligned;
> +	struct bpf_mprog_bundle *parent;
> +};
> +
> +struct bpf_mprog_bundle {
> +	struct bpf_mprog_entry a;
> +	struct bpf_mprog_entry b;
> +	struct rcu_head rcu;
> +	struct bpf_prog *ref;
> +	atomic_t revision;
> +};
> +
> +struct bpf_tuple {
> +	struct bpf_prog *prog;
> +	struct bpf_link *link;
> +};
> +
> +static inline struct bpf_mprog_entry *
> +bpf_mprog_peer(const struct bpf_mprog_entry *entry)
> +{
> +	if (entry == &entry->parent->a)
> +		return &entry->parent->b;
> +	else
> +		return &entry->parent->a;
> +}
> +
> +#define bpf_mprog_foreach_tuple(entry, fp, cp, t)			\
> +	for (fp = &entry->fp_items[0], cp = &entry->cp_items[0];	\
> +	     ({								\
> +		t.prog = READ_ONCE(fp->prog);				\
> +		t.link = cp->link;					\
> +		t.prog;							\
> +	      });							\
> +	     fp++, cp++)
> +
> +#define bpf_mprog_foreach_prog(entry, fp, p)				\
> +	for (fp = &entry->fp_items[0];					\
> +	     (p = READ_ONCE(fp->prog));					\
> +	     fp++)
> +
> +static inline struct bpf_mprog_entry *bpf_mprog_create(size_t extra_size)
> +{
> +	struct bpf_mprog_bundle *bundle;
> +
> +	/* Fast-path items are not extensible, must only contain prog pointer! */
> +	BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
> +	/* Control-path items can be extended w/o affecting fast-path. */
> +	BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) != ARRAY_SIZE(bundle->a.cp_items));
> +
> +	bundle = kzalloc(sizeof(*bundle) + extra_size, GFP_KERNEL);
> +	if (bundle) {
> +		atomic_set(&bundle->revision, 1);
> +		bundle->a.parent = bundle;
> +		bundle->b.parent = bundle;
> +		return &bundle->a;
> +	}
> +	return NULL;
> +}
> +
> +static inline void bpf_mprog_free(struct bpf_mprog_entry *entry)
> +{
> +	kfree_rcu(entry->parent, rcu);
> +}
> +
> +static inline void bpf_mprog_mark_ref(struct bpf_mprog_entry *entry,
> +				      struct bpf_prog *prog)
> +{
> +	WARN_ON_ONCE(entry->parent->ref);
> +	entry->parent->ref = prog;
> +}
> +
> +static inline u32 bpf_mprog_flags(u32 cur_flags, u32 req_flags, u32 flag)
> +{
> +	if (req_flags & flag)
> +		cur_flags |= flag;
> +	else
> +		cur_flags &= ~flag;
> +	return cur_flags;
> +}
> +
> +static inline u32 bpf_mprog_max(void)
> +{
> +	return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_first(struct bpf_mprog_entry *entry)
> +{
> +	return READ_ONCE(entry->fp_items[0].prog);
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_last(struct bpf_mprog_entry *entry)
> +{
> +	struct bpf_prog *tmp, *prog = NULL;
> +	struct bpf_mprog_fp *fp;
> +
> +	bpf_mprog_foreach_prog(entry, fp, tmp)
> +		prog = tmp;
> +	return prog;
> +}
> +
> +static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
> +				    struct bpf_prog *prog)
> +{
> +	const struct bpf_mprog_fp *fp;
> +	const struct bpf_prog *tmp;
> +
> +	bpf_mprog_foreach_prog(entry, fp, tmp) {
> +		if (tmp == prog)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_first_reg(struct bpf_mprog_entry *entry)
> +{
> +	struct bpf_tuple tuple = {};
> +	struct bpf_mprog_fp *fp;
> +	struct bpf_mprog_cp *cp;
> +
> +	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> +		if (cp->flags & BPF_F_FIRST)
> +			continue;
> +		return tuple.prog;
> +	}
> +	return NULL;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_last_reg(struct bpf_mprog_entry *entry)
> +{
> +	struct bpf_tuple tuple = {};
> +	struct bpf_prog *prog = NULL;
> +	struct bpf_mprog_fp *fp;
> +	struct bpf_mprog_cp *cp;
> +
> +	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> +		if (cp->flags & BPF_F_LAST)
> +			break;
> +		prog = tuple.prog;
> +	}
> +	return prog;
> +}
> +
> +static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
> +{

[..]

> +	do {
> +		atomic_inc(&entry->parent->revision);
> +	} while (atomic_read(&entry->parent->revision) == 0);

Can you explain more what's going on here? Maybe with a comment?

> +	synchronize_rcu();
> +	if (entry->parent->ref) {
> +		bpf_prog_put(entry->parent->ref);
> +		entry->parent->ref = NULL;
> +	}

I'm assuming this is to guard the detach path? But isn't bpf_prog_put
already doing the deferred dealloc? So calling it without synchronize_rcu
here should be ok?

> +}
> +
> +static inline void bpf_mprog_entry_clear(struct bpf_mprog_entry *entry)
> +{
> +	memset(entry->fp_items, 0, sizeof(entry->fp_items));
> +	memset(entry->cp_items, 0, sizeof(entry->cp_items));
> +}
> +
> +static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry)
> +{
> +	return atomic_read(&entry->parent->revision);
> +}
> +
> +static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 which,
> +				  struct bpf_mprog_fp **fp_dst,
> +				  struct bpf_mprog_cp **cp_dst)
> +{
> +	*fp_dst = &entry->fp_items[which];
> +	*cp_dst = &entry->cp_items[which];
> +}
> +
> +static inline void bpf_mprog_write(struct bpf_mprog_fp *fp_dst,
> +				   struct bpf_mprog_cp *cp_dst,
> +				   struct bpf_tuple *tuple, u32 flags)
> +{
> +	WRITE_ONCE(fp_dst->prog, tuple->prog);
> +	cp_dst->link  = tuple->link;
> +	cp_dst->flags = flags;
> +}
> +
> +static inline void bpf_mprog_copy(struct bpf_mprog_fp *fp_dst,
> +				  struct bpf_mprog_cp *cp_dst,
> +				  struct bpf_mprog_fp *fp_src,
> +				  struct bpf_mprog_cp *cp_src)
> +{
> +	WRITE_ONCE(fp_dst->prog, READ_ONCE(fp_src->prog));
> +	memcpy(cp_dst, cp_src, sizeof(*cp_src));

nit: why not simply *cp_dst = *cp_src? memcpy somewhat implies (in my
mind) that we are copying several entries..

> +}
> +
> +static inline void bpf_mprog_copy_range(struct bpf_mprog_entry *peer,
> +					struct bpf_mprog_entry *entry,
> +					u32 idx_peer, u32 idx_entry, u32 num)
> +{
> +	memcpy(&peer->fp_items[idx_peer], &entry->fp_items[idx_entry],
> +	       num * sizeof(peer->fp_items[0]));
> +	memcpy(&peer->cp_items[idx_peer], &entry->cp_items[idx_entry],
> +	       num * sizeof(peer->cp_items[0]));
> +}
> +
> +static inline u32 bpf_mprog_total(struct bpf_mprog_entry *entry)
> +{
> +	const struct bpf_mprog_fp *fp;
> +	const struct bpf_prog *tmp;
> +	u32 num = 0;
> +
> +	bpf_mprog_foreach_prog(entry, fp, tmp)
> +		num++;
> +	return num;
> +}
> +
> +int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> +		     struct bpf_link *link, u32 flags, u32 object,
> +		     u32 expected_revision);
> +int bpf_mprog_detach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> +		     struct bpf_link *link, u32 flags, u32 object,
> +		     u32 expected_revision);
> +
> +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> +		    struct bpf_mprog_entry *entry);
> +
> +#endif /* __BPF_MPROG_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a7b5e91dd768..207f8a37b327 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1102,7 +1102,14 @@ enum bpf_link_type {
>   */
>  #define BPF_F_ALLOW_OVERRIDE	(1U << 0)
>  #define BPF_F_ALLOW_MULTI	(1U << 1)
> +/* Generic attachment flags. */
>  #define BPF_F_REPLACE		(1U << 2)
> +#define BPF_F_BEFORE		(1U << 3)
> +#define BPF_F_AFTER		(1U << 4)
> +#define BPF_F_FIRST		(1U << 5)
> +#define BPF_F_LAST		(1U << 6)
> +#define BPF_F_ID		(1U << 7)
> +#define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
>  
>  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>   * verifier will perform strict alignment checking as if the kernel
> @@ -1433,14 +1440,19 @@ union bpf_attr {
>  	};
>  
>  	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -		__u32		target_fd;	/* container object to attach to */
> -		__u32		attach_bpf_fd;	/* eBPF program to attach */
> +		union {
> +			__u32	target_fd;	/* target object to attach to or ... */
> +			__u32	target_ifindex;	/* target ifindex */
> +		};
> +		__u32		attach_bpf_fd;
>  		__u32		attach_type;
>  		__u32		attach_flags;
> -		__u32		replace_bpf_fd;	/* previously attached eBPF
> -						 * program to replace if
> -						 * BPF_F_REPLACE is used
> -						 */
> +		union {
> +			__u32	relative_fd;
> +			__u32	relative_id;
> +			__u32	replace_bpf_fd;
> +		};
> +		__u32		expected_revision;
>  	};
>  
>  	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1486,16 +1498,25 @@ union bpf_attr {
>  	} info;
>  
>  	struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -		__u32		target_fd;	/* container object to query */
> +		union {
> +			__u32	target_fd;	/* target object to query or ... */
> +			__u32	target_ifindex;	/* target ifindex */
> +		};
>  		__u32		attach_type;
>  		__u32		query_flags;
>  		__u32		attach_flags;
>  		__aligned_u64	prog_ids;
> -		__u32		prog_cnt;
> +		union {
> +			__u32	prog_cnt;
> +			__u32	count;
> +		};
> +		__u32		revision;
>  		/* output: per-program attach_flags.
>  		 * not allowed to be set during effective query.
>  		 */
>  		__aligned_u64	prog_attach_flags;
> +		__aligned_u64	link_ids;
> +		__aligned_u64	link_attach_flags;
>  	} query;
>  
>  	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 1d3892168d32..1bea2eb912cd 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>  obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
> -obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> +obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
>  obj-$(CONFIG_BPF_JIT) += dispatcher.o
> diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c
> new file mode 100644
> index 000000000000..efc3b73f8bf5
> --- /dev/null
> +++ b/kernel/bpf/mprog.c
> @@ -0,0 +1,476 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +#include <linux/filter.h>
> +
> +static int bpf_mprog_tuple_relative(struct bpf_tuple *tuple,
> +				    u32 object, u32 flags,
> +				    enum bpf_prog_type type)
> +{
> +	struct bpf_prog *prog;
> +	struct bpf_link *link;
> +
> +	memset(tuple, 0, sizeof(*tuple));
> +	if (!(flags & (BPF_F_REPLACE | BPF_F_BEFORE | BPF_F_AFTER)))
> +		return object || (flags & (BPF_F_ID | BPF_F_LINK)) ?
> +		       -EINVAL : 0;
> +	if (flags & BPF_F_LINK) {
> +		if (flags & BPF_F_ID)
> +			link = bpf_link_by_id(object);
> +		else
> +			link = bpf_link_get_from_fd(object);
> +		if (IS_ERR(link))
> +			return PTR_ERR(link);
> +		if (type && link->prog->type != type) {
> +			bpf_link_put(link);
> +			return -EINVAL;
> +		}
> +		tuple->link = link;
> +		tuple->prog = link->prog;
> +	} else {
> +		if (flags & BPF_F_ID)
> +			prog = bpf_prog_by_id(object);
> +		else
> +			prog = bpf_prog_get(object);
> +		if (IS_ERR(prog)) {
> +			if (!object &&
> +			    !(flags & BPF_F_ID))
> +				return 0;
> +			return PTR_ERR(prog);
> +		}
> +		if (type && prog->type != type) {
> +			bpf_prog_put(prog);
> +			return -EINVAL;
> +		}
> +		tuple->link = NULL;
> +		tuple->prog = prog;
> +	}
> +	return 0;
> +}
> +
> +static void bpf_mprog_tuple_put(struct bpf_tuple *tuple)
> +{
> +	if (tuple->link)
> +		bpf_link_put(tuple->link);
> +	else if (tuple->prog)
> +		bpf_prog_put(tuple->prog);
> +}
> +
> +static int bpf_mprog_replace(struct bpf_mprog_entry *entry,
> +			     struct bpf_tuple *ntuple,
> +			     struct bpf_tuple *rtuple, u32 rflags)
> +{
> +	struct bpf_mprog_fp *fp;
> +	struct bpf_mprog_cp *cp;
> +	struct bpf_prog *oprog;
> +	u32 iflags;
> +	int i;
> +
> +	if (rflags & (BPF_F_BEFORE | BPF_F_AFTER | BPF_F_LINK))
> +		return -EINVAL;
> +	if (rtuple->prog != ntuple->prog &&
> +	    bpf_mprog_exists(entry, ntuple->prog))
> +		return -EEXIST;
> +	for (i = 0; i < bpf_mprog_max(); i++) {
> +		bpf_mprog_read(entry, i, &fp, &cp);
> +		oprog = READ_ONCE(fp->prog);
> +		if (!oprog)
> +			break;
> +		if (oprog != rtuple->prog)
> +			continue;
> +		if (cp->link != ntuple->link)
> +			return -EBUSY;
> +		iflags = cp->flags;
> +		if ((iflags & BPF_F_FIRST) !=
> +		    (rflags & BPF_F_FIRST)) {
> +			iflags = bpf_mprog_flags(iflags, rflags,
> +						 BPF_F_FIRST);
> +			if ((iflags & BPF_F_FIRST) &&
> +			    rtuple->prog != bpf_mprog_first(entry))
> +				return -EACCES;
> +		}
> +		if ((iflags & BPF_F_LAST) !=
> +		    (rflags & BPF_F_LAST)) {
> +			iflags = bpf_mprog_flags(iflags, rflags,
> +						 BPF_F_LAST);
> +			if ((iflags & BPF_F_LAST) &&
> +			    rtuple->prog != bpf_mprog_last(entry))
> +				return -EACCES;
> +		}
> +		bpf_mprog_write(fp, cp, ntuple, iflags);
> +		if (!ntuple->link)
> +			bpf_prog_put(oprog);
> +		return 0;
> +	}
> +	return -ENOENT;
> +}
> +
> +static int bpf_mprog_head_tail(struct bpf_mprog_entry *entry,
> +			       struct bpf_tuple *ntuple,
> +			       struct bpf_tuple *rtuple, u32 aflags)
> +{
> +	struct bpf_mprog_entry *peer;
> +	struct bpf_mprog_fp *fp;
> +	struct bpf_mprog_cp *cp;
> +	struct bpf_prog *oprog;
> +	u32 iflags, items;
> +
> +	if (bpf_mprog_exists(entry, ntuple->prog))
> +		return -EEXIST;
> +	items = bpf_mprog_total(entry);
> +	peer = bpf_mprog_peer(entry);
> +	bpf_mprog_entry_clear(peer);
> +	if (aflags & BPF_F_FIRST) {
> +		if (aflags & BPF_F_AFTER)
> +			return -EINVAL;
> +		bpf_mprog_read(entry, 0, &fp, &cp);
> +		iflags = cp->flags;
> +		if (iflags & BPF_F_FIRST)
> +			return -EBUSY;
> +		if (aflags & BPF_F_LAST) {
> +			if (aflags & BPF_F_BEFORE)
> +				return -EINVAL;
> +			if (items)
> +				return -EBUSY;
> +			bpf_mprog_read(peer, 0, &fp, &cp);
> +			bpf_mprog_write(fp, cp, ntuple,
> +					BPF_F_FIRST | BPF_F_LAST);
> +			return BPF_MPROG_SWAP;
> +		}
> +		if (aflags & BPF_F_BEFORE) {
> +			oprog = READ_ONCE(fp->prog);
> +			if (oprog != rtuple->prog ||
> +			    (rtuple->link &&
> +			     rtuple->link != cp->link))
> +				return -EBUSY;
> +		}
> +		if (items >= bpf_mprog_max())
> +			return -ENOSPC;
> +		bpf_mprog_read(peer, 0, &fp, &cp);
> +		bpf_mprog_write(fp, cp, ntuple, BPF_F_FIRST);
> +		bpf_mprog_copy_range(peer, entry, 1, 0, items);
> +		return BPF_MPROG_SWAP;
> +	}
> +	if (aflags & BPF_F_LAST) {
> +		if (aflags & BPF_F_BEFORE)
> +			return -EINVAL;
> +		if (items) {
> +			bpf_mprog_read(entry, items - 1, &fp, &cp);
> +			iflags = cp->flags;
> +			if (iflags & BPF_F_LAST)
> +				return -EBUSY;
> +			if (aflags & BPF_F_AFTER) {
> +				oprog = READ_ONCE(fp->prog);
> +				if (oprog != rtuple->prog ||
> +				    (rtuple->link &&
> +				     rtuple->link != cp->link))
> +					return -EBUSY;
> +			}
> +			if (items >= bpf_mprog_max())
> +				return -ENOSPC;
> +		} else {
> +			if (aflags & BPF_F_AFTER)
> +				return -EBUSY;
> +		}
> +		bpf_mprog_read(peer, items, &fp, &cp);
> +		bpf_mprog_write(fp, cp, ntuple, BPF_F_LAST);
> +		bpf_mprog_copy_range(peer, entry, 0, 0, items);
> +		return BPF_MPROG_SWAP;
> +	}
> +	return -ENOENT;
> +}
> +
> +static int bpf_mprog_add(struct bpf_mprog_entry *entry,
> +			 struct bpf_tuple *ntuple,
> +			 struct bpf_tuple *rtuple, u32 aflags)
> +{
> +	struct bpf_mprog_fp *fp_dst, *fp_src;
> +	struct bpf_mprog_cp *cp_dst, *cp_src;
> +	struct bpf_mprog_entry *peer;
> +	struct bpf_prog *oprog;
> +	bool found = false;
> +	u32 items;
> +	int i, j;
> +
> +	items = bpf_mprog_total(entry);
> +	if (items >= bpf_mprog_max())
> +		return -ENOSPC;
> +	if ((aflags & (BPF_F_BEFORE | BPF_F_AFTER)) ==
> +	    (BPF_F_BEFORE | BPF_F_AFTER))
> +		return -EINVAL;
> +	if (bpf_mprog_exists(entry, ntuple->prog))
> +		return -EEXIST;
> +	if (!rtuple->prog && (aflags & (BPF_F_BEFORE | BPF_F_AFTER))) {
> +		if (!items)
> +			aflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
> +		if (aflags & BPF_F_BEFORE)
> +			rtuple->prog = bpf_mprog_first_reg(entry);
> +		if (aflags & BPF_F_AFTER)
> +			rtuple->prog = bpf_mprog_last_reg(entry);
> +		if (!rtuple->prog)
> +			aflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
> +		else
> +			bpf_prog_inc(rtuple->prog);
> +	}
> +	peer = bpf_mprog_peer(entry);
> +	bpf_mprog_entry_clear(peer);
> +	for (i = 0, j = 0; i < bpf_mprog_max(); i++, j++) {
> +		bpf_mprog_read(entry, i, &fp_src, &cp_src);
> +		bpf_mprog_read(peer,  j, &fp_dst, &cp_dst);
> +		oprog = READ_ONCE(fp_src->prog);
> +		if (!oprog) {
> +			if (i != j)
> +				break;
> +			if (i > 0) {
> +				bpf_mprog_read(entry, i - 1,
> +					       &fp_src, &cp_src);
> +				if (cp_src->flags & BPF_F_LAST) {
> +					if (cp_src->flags & BPF_F_FIRST)
> +						return -EBUSY;
> +					bpf_mprog_copy(fp_dst, cp_dst,
> +						       fp_src, cp_src);
> +					bpf_mprog_read(peer, --j,
> +						       &fp_dst, &cp_dst);
> +				}
> +			}
> +			bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
> +			break;
> +		}
> +		if (aflags & (BPF_F_BEFORE | BPF_F_AFTER)) {
> +			if (rtuple->prog != oprog ||
> +			    (rtuple->link &&
> +			     rtuple->link != cp_src->link))
> +				goto next;
> +			found = true;
> +			if (aflags & BPF_F_BEFORE) {
> +				if (cp_src->flags & BPF_F_FIRST)
> +					return -EBUSY;
> +				bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
> +				bpf_mprog_read(peer, ++j, &fp_dst, &cp_dst);
> +				goto next;
> +			}
> +			if (aflags & BPF_F_AFTER) {
> +				if (cp_src->flags & BPF_F_LAST)
> +					return -EBUSY;
> +				bpf_mprog_copy(fp_dst, cp_dst,
> +					       fp_src, cp_src);
> +				bpf_mprog_read(peer, ++j, &fp_dst, &cp_dst);
> +				bpf_mprog_write(fp_dst, cp_dst, ntuple, 0);
> +				continue;
> +			}
> +		}
> +next:
> +		bpf_mprog_copy(fp_dst, cp_dst,
> +			       fp_src, cp_src);
> +	}
> +	if (rtuple->prog && !found)
> +		return -ENOENT;
> +	return BPF_MPROG_SWAP;
> +}
> +
> +static int bpf_mprog_del(struct bpf_mprog_entry *entry,
> +			 struct bpf_tuple *dtuple,
> +			 struct bpf_tuple *rtuple, u32 dflags)
> +{
> +	struct bpf_mprog_fp *fp_dst, *fp_src;
> +	struct bpf_mprog_cp *cp_dst, *cp_src;
> +	struct bpf_mprog_entry *peer;
> +	struct bpf_prog *oprog;
> +	bool found = false;
> +	int i, j, ret;
> +
> +	if (dflags & BPF_F_REPLACE)
> +		return -EINVAL;
> +	if (dflags & BPF_F_FIRST) {
> +		oprog = bpf_mprog_first(entry);
> +		if (dtuple->prog &&
> +		    dtuple->prog != oprog)
> +			return -ENOENT;
> +		dtuple->prog = oprog;
> +	}
> +	if (dflags & BPF_F_LAST) {
> +		oprog = bpf_mprog_last(entry);
> +		if (dtuple->prog &&
> +		    dtuple->prog != oprog)
> +			return -ENOENT;
> +		dtuple->prog = oprog;
> +	}
> +	if (!rtuple->prog && (dflags & (BPF_F_BEFORE | BPF_F_AFTER))) {
> +		if (dtuple->prog)
> +			return -EINVAL;
> +		if (dflags & BPF_F_BEFORE)
> +			dtuple->prog = bpf_mprog_first_reg(entry);
> +		if (dflags & BPF_F_AFTER)
> +			dtuple->prog = bpf_mprog_last_reg(entry);
> +		if (dtuple->prog)
> +			dflags &= ~(BPF_F_AFTER | BPF_F_BEFORE);
> +	}
> +	for (i = 0; i < bpf_mprog_max(); i++) {
> +		bpf_mprog_read(entry, i, &fp_src, &cp_src);
> +		oprog = READ_ONCE(fp_src->prog);
> +		if (!oprog)
> +			break;
> +		if (dflags & (BPF_F_BEFORE | BPF_F_AFTER)) {
> +			if (rtuple->prog != oprog ||
> +			    (rtuple->link &&
> +			     rtuple->link != cp_src->link))
> +				continue;
> +			found = true;
> +			if (dflags & BPF_F_BEFORE) {
> +				if (!i)
> +					return -ENOENT;
> +				bpf_mprog_read(entry, i - 1,
> +					       &fp_src, &cp_src);
> +				oprog = READ_ONCE(fp_src->prog);
> +				if (dtuple->prog &&
> +				    dtuple->prog != oprog)
> +					return -ENOENT;
> +				dtuple->prog = oprog;
> +				break;
> +			}
> +			if (dflags & BPF_F_AFTER) {
> +				bpf_mprog_read(entry, i + 1,
> +					       &fp_src, &cp_src);
> +				oprog = READ_ONCE(fp_src->prog);
> +				if (dtuple->prog &&
> +				    dtuple->prog != oprog)
> +					return -ENOENT;
> +				dtuple->prog = oprog;
> +				break;
> +			}
> +		}
> +	}
> +	if (!dtuple->prog || (rtuple->prog && !found))
> +		return -ENOENT;
> +	peer = bpf_mprog_peer(entry);
> +	bpf_mprog_entry_clear(peer);
> +	ret = -ENOENT;
> +	for (i = 0, j = 0; i < bpf_mprog_max(); i++) {
> +		bpf_mprog_read(entry, i, &fp_src, &cp_src);
> +		bpf_mprog_read(peer,  j, &fp_dst, &cp_dst);
> +		oprog = READ_ONCE(fp_src->prog);
> +		if (!oprog)
> +			break;
> +		if (oprog != dtuple->prog) {
> +			bpf_mprog_copy(fp_dst, cp_dst,
> +				       fp_src, cp_src);
> +			j++;
> +		} else {
> +			if (cp_src->link != dtuple->link)
> +				return -EBUSY;
> +			if (!cp_src->link)
> +				bpf_mprog_mark_ref(entry, dtuple->prog);
> +			ret = BPF_MPROG_SWAP;
> +		}
> +	}
> +	if (!bpf_mprog_total(peer))
> +		ret = BPF_MPROG_FREE;
> +	return ret;
> +}
> +
> +int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> +		     struct bpf_link *link, u32 flags, u32 object,
> +		     u32 expected_revision)
> +{
> +	struct bpf_tuple rtuple, ntuple = {
> +		.prog = prog,
> +		.link = link,
> +	};
> +	int ret;
> +
> +	if (expected_revision &&
> +	    expected_revision != bpf_mprog_revision(entry))
> +		return -ESTALE;
> +	ret = bpf_mprog_tuple_relative(&rtuple, object, flags, prog->type);
> +	if (ret)
> +		return ret;
> +	if (flags & BPF_F_REPLACE)
> +		ret = bpf_mprog_replace(entry, &ntuple, &rtuple, flags);
> +	else if (flags & (BPF_F_FIRST | BPF_F_LAST))
> +		ret = bpf_mprog_head_tail(entry, &ntuple, &rtuple, flags);
> +	else
> +		ret = bpf_mprog_add(entry, &ntuple, &rtuple, flags);
> +	bpf_mprog_tuple_put(&rtuple);
> +	return ret;
> +}
> +
> +int bpf_mprog_detach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> +		     struct bpf_link *link, u32 flags, u32 object,
> +		     u32 expected_revision)
> +{
> +	struct bpf_tuple rtuple, dtuple = {
> +		.prog = prog,
> +		.link = link,
> +	};
> +	int ret;
> +
> +	if (expected_revision &&
> +	    expected_revision != bpf_mprog_revision(entry))
> +		return -ESTALE;
> +	ret = bpf_mprog_tuple_relative(&rtuple, object, flags,
> +				       prog ? prog->type :
> +				       BPF_PROG_TYPE_UNSPEC);
> +	if (ret)
> +		return ret;
> +	ret = bpf_mprog_del(entry, &dtuple, &rtuple, flags);
> +	bpf_mprog_tuple_put(&rtuple);
> +	return ret;
> +}
> +
> +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> +		    struct bpf_mprog_entry *entry)
> +{
> +	u32 i, id, flags = 0, count, revision;
> +	u32 __user *uprog_id, *uprog_af;
> +	u32 __user *ulink_id, *ulink_af;
> +	struct bpf_mprog_fp *fp;
> +	struct bpf_mprog_cp *cp;
> +	struct bpf_prog *prog;
> +	int ret = 0;
> +
> +	if (attr->query.query_flags || attr->query.attach_flags)
> +		return -EINVAL;
> +	revision = bpf_mprog_revision(entry);
> +	count = bpf_mprog_total(entry);
> +	if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
> +		return -EFAULT;
> +	if (copy_to_user(&uattr->query.revision, &revision, sizeof(revision)))
> +		return -EFAULT;
> +	if (copy_to_user(&uattr->query.count, &count, sizeof(count)))
> +		return -EFAULT;
> +	uprog_id = u64_to_user_ptr(attr->query.prog_ids);
> +	if (attr->query.count == 0 || !uprog_id || !count)
> +		return 0;
> +	if (attr->query.count < count) {
> +		count = attr->query.count;
> +		ret = -ENOSPC;
> +	}
> +	uprog_af = u64_to_user_ptr(attr->query.prog_attach_flags);
> +	ulink_id = u64_to_user_ptr(attr->query.link_ids);
> +	ulink_af = u64_to_user_ptr(attr->query.link_attach_flags);
> +	for (i = 0; i < ARRAY_SIZE(entry->fp_items); i++) {
> +		bpf_mprog_read(entry, i, &fp, &cp);
> +		prog = READ_ONCE(fp->prog);
> +		if (!prog)
> +			break;
> +		id = prog->aux->id;
> +		if (copy_to_user(uprog_id + i, &id, sizeof(id)))
> +			return -EFAULT;
> +		id = cp->link ? cp->link->id : 0;
> +		if (ulink_id &&
> +		    copy_to_user(ulink_id + i, &id, sizeof(id)))
> +			return -EFAULT;
> +		flags = cp->flags;
> +		if (uprog_af && !id &&
> +		    copy_to_user(uprog_af + i, &flags, sizeof(flags)))
> +			return -EFAULT;
> +		if (ulink_af && id &&
> +		    copy_to_user(ulink_af + i, &flags, sizeof(flags)))
> +			return -EFAULT;
> +		if (i + 1 == count)
> +			break;
> +	}
> +	return ret;
> +}
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index a7b5e91dd768..207f8a37b327 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1102,7 +1102,14 @@ enum bpf_link_type {
>   */
>  #define BPF_F_ALLOW_OVERRIDE	(1U << 0)
>  #define BPF_F_ALLOW_MULTI	(1U << 1)
> +/* Generic attachment flags. */
>  #define BPF_F_REPLACE		(1U << 2)
> +#define BPF_F_BEFORE		(1U << 3)
> +#define BPF_F_AFTER		(1U << 4)

[..]

> +#define BPF_F_FIRST		(1U << 5)
> +#define BPF_F_LAST		(1U << 6)

I'm still not sure whether the hard semantics of first/last is really
useful. My worry is that some prog will just use BPF_F_FIRST which
would prevent the rest of the users.. (starting with only
F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
need first/last).

But if everyone besides myself is on board with first/last, maybe at least
put a comment here saying that only a single program can be first/last?
And the users are advised not to use these unless they really really really
need to be first/last. (IOW, feels like first/last should be reserved
for observability tools/etc).

> +#define BPF_F_ID		(1U << 7)
> +#define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
>  
>  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>   * verifier will perform strict alignment checking as if the kernel
> @@ -1433,14 +1440,19 @@ union bpf_attr {
>  	};
>  
>  	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -		__u32		target_fd;	/* container object to attach to */
> -		__u32		attach_bpf_fd;	/* eBPF program to attach */
> +		union {
> +			__u32	target_fd;	/* target object to attach to or ... */
> +			__u32	target_ifindex;	/* target ifindex */
> +		};
> +		__u32		attach_bpf_fd;
>  		__u32		attach_type;
>  		__u32		attach_flags;
> -		__u32		replace_bpf_fd;	/* previously attached eBPF
> -						 * program to replace if
> -						 * BPF_F_REPLACE is used
> -						 */
> +		union {
> +			__u32	relative_fd;
> +			__u32	relative_id;
> +			__u32	replace_bpf_fd;
> +		};
> +		__u32		expected_revision;
>  	};
>  
>  	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1486,16 +1498,25 @@ union bpf_attr {
>  	} info;
>  
>  	struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -		__u32		target_fd;	/* container object to query */
> +		union {
> +			__u32	target_fd;	/* target object to query or ... */
> +			__u32	target_ifindex;	/* target ifindex */
> +		};
>  		__u32		attach_type;
>  		__u32		query_flags;
>  		__u32		attach_flags;
>  		__aligned_u64	prog_ids;
> -		__u32		prog_cnt;
> +		union {
> +			__u32	prog_cnt;
> +			__u32	count;
> +		};
> +		__u32		revision;
>  		/* output: per-program attach_flags.
>  		 * not allowed to be set during effective query.
>  		 */
>  		__aligned_u64	prog_attach_flags;
> +		__aligned_u64	link_ids;
> +		__aligned_u64	link_attach_flags;
>  	} query;
>  
>  	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-07 19:26 ` [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support Daniel Borkmann
  2023-06-08  1:25   ` Jamal Hadi Salim
@ 2023-06-08 17:50   ` Stanislav Fomichev
  2023-06-08 21:20   ` Andrii Nakryiko
  2023-06-09  3:06   ` Jakub Kicinski
  3 siblings, 0 replies; 49+ messages in thread
From: Stanislav Fomichev @ 2023-06-08 17:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	toke, davem, bpf, netdev

On 06/07, Daniel Borkmann wrote:
> This work refactors and adds a lightweight extension ("tcx") to the tc BPF
> ingress and egress data path side to allow BPF program management based
> on fds via the bpf() syscall through the newly added generic multi-prog API.
> The main goal behind this work, which we presented at LPC [0] last year
> with a recent update at LSF/MM/BPF this year [3], is to support the
> long-awaited BPF link functionality for tc BPF programs, which allows for a
> model of safe ownership and program detachment.
> 
> Given the rise in tc BPF users in cloud native environments, this becomes
> necessary to avoid hard-to-debug incidents caused either by stale leftover
> programs or by 3rd party applications accidentally stepping on each other's toes.
> As a recap, a BPF link represents the attachment of a BPF program to a BPF
> hook point. The BPF link holds a single reference to keep the BPF program alive.
> Moreover, hook points do not reference a BPF link, only the application's
> fd or pinning does. A BPF link holds meta-data specific to attachment and
> implements operations for link creation, (atomic) BPF program update,
> detachment and introspection. The motivation for BPF links for tc BPF programs
> is multi-fold, for example:
> 
>   - From Meta: "It's especially important for applications that are deployed
>     fleet-wide and that don't "control" hosts they are deployed to. If such
>     application crashes and no one notices and does anything about that, BPF
>     program will keep running draining resources or even just, say, dropping
>     packets. We at FB had outages due to such permanent BPF attachment
>     semantics. With fd-based BPF link we are getting a framework, which allows
>     safe, auto-detachable behavior by default, unless application explicitly
>     opts in by pinning the BPF link." [1]
> 
>   - From Cilium-side the tc BPF programs we attach to host-facing veth devices
>     and phys devices build the core datapath for Kubernetes Pods, and they
>     implement forwarding, load-balancing, policy, EDT-management, etc, within
>     BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
>     experienced hard-to-debug issues in a user's staging environment where
>     another Kubernetes application using tc BPF attached to the same prio/handle
>     of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
>     it. The goal is to establish a clear/safe ownership model via links which
>     cannot accidentally be overridden. [0,2]
> 
> BPF links for tc can co-exist with non-link attachments, and the semantics are
> in line also with XDP links: BPF links cannot replace other BPF links, BPF
> links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
> lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
> would solve mentioned issue of safe ownership model as 3rd party applications
> would not be able to accidentally wipe Cilium programs, even if they are not
> BPF link aware.
> 
> Earlier attempts [4] tried to integrate BPF links into the core tc machinery
> to solve this for cls_bpf, which was intrusive to the generic tc kernel API with
> extensions only specific to cls_bpf, and suboptimal/complex since cls_bpf could
> also be wiped from the qdisc. Locking a tc BPF program in place this way gets
> into layering hacks given the two object models are vastly different.
> 
> We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
> attach API, so that the BPF link implementation blends in naturally similar to
> other link types which are fd-based and without the need for changing core tc
> internal APIs. BPF programs for tc can then be successively migrated from classic
> cls_bpf to the new tc BPF link without needing to change the program's source
> code, just the BPF loader mechanics for attaching is sufficient.
> 
> For the current tc framework, there is no change in behavior, and this work
> does not touch core tc kernel APIs. The gist of this patch is that the
> ingress and egress hooks gain a lightweight, qdisc-less extension for BPF to
> attach its tc BPF programs, in other words, a minimal entry point for tc BPF.
> The name tcx was suggested in discussion of earlier revisions of this work as
> a good fit, and to more easily differentiate between the classic cls_bpf
> attachment and the fd-based one.
> 
> For the ingress and egress tcx points, the device holds a cache-friendly array
> with program pointers which is separated from control plane (slow-path) data.
> Earlier versions of this work used priority to determine ordering and expression
> of dependencies, similar to classic tc, but this was challenged on the grounds
> that something more future-proof with a better user experience is required. This
> resulted in the design and development of the generic attach/detach/query API
> for multi-progs. See prior patch with its discussion on the API design. tcx is
> the first user and later we plan to integrate also others, for example, one
> candidate is multi-prog support for XDP which would benefit and have the same
> 'look and feel' from API perspective.
> 
> The goal with tcx is to have maximum compatibility with existing tc BPF programs,
> so they don't need to be rewritten specifically. Compatibility to call into
> classic tcf_classify() is also provided in order to allow successive migration,
> or for both to cleanly co-exist where needed, given it's all one logical tc layer.
> tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
> to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
> The fd-based API is behind a static key, so that when unused the code is also
> not entered. The struct tcx_entry's program array is currently static, but
> could be made dynamic if necessary at a point in future. The a/b pair swap
> design has been chosen so that for detachment there are no allocations which
> otherwise could fail. The work has been tested with tc-testing selftest suite
> which all passes, as well as the tc BPF tests from the BPF CI, and also with
> Cilium's L4LB.
> 
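
As a side note for readers of the archive, a rough sketch of what the
fd-based attach looks like from user space via a raw bpf(2) link create;
prog_fd/ifindex are assumed to exist, error handling is omitted, an updated
uapi header is assumed, and in practice this would go through the libbpf APIs
added later in this series:

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int example_tcx_link(int ifindex, int prog_fd)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.link_create.prog_fd = prog_fd;
          attr.link_create.target_ifindex = ifindex;
          attr.link_create.attach_type = BPF_TCX_INGRESS;
          attr.link_create.tcx.expected_revision = 0;	/* 0: no expectation */

          /* The returned link fd owns the attachment; closing it without
           * pinning auto-detaches the program again.
           */
          return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
  }
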
> Kudos also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
> of this work.
> 
>   [0] https://lpc.events/event/16/contributions/1353/
>   [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com/
>   [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
>   [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>   [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com/
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/netdevice.h      |  15 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/tcx.h              | 157 +++++++++++++++
>  include/uapi/linux/bpf.h       |  35 +++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/syscall.c           |  95 +++++++--
>  kernel/bpf/tcx.c               | 347 +++++++++++++++++++++++++++++++++
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 267 +++++++++++++++----------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  45 ++++-
>  tools/include/uapi/linux/bpf.h |  35 +++-
>  16 files changed, 877 insertions(+), 144 deletions(-)
>  create mode 100644 include/net/tcx.h
>  create mode 100644 kernel/bpf/tcx.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 754a9eeca0a1..7a0d0b0c5a5e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3827,13 +3827,15 @@ L:	netdev@vger.kernel.org
>  S:	Maintained
>  F:	kernel/bpf/bpf_struct*
>  
> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (tcx & tc BPF, sock_addr)
>  M:	Martin KaFai Lau <martin.lau@linux.dev>
>  M:	Daniel Borkmann <daniel@iogearbox.net>
>  R:	John Fastabend <john.fastabend@gmail.com>
>  L:	bpf@vger.kernel.org
>  L:	netdev@vger.kernel.org
>  S:	Maintained
> +F:	include/net/tcx.h
> +F:	kernel/bpf/tcx.c
>  F:	net/core/filter.c
>  F:	net/sched/act_bpf.c
>  F:	net/sched/cls_bpf.c
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 08fbd4622ccf..fd4281d1cdbb 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1927,8 +1927,7 @@ enum netdev_ml_priv_type {
>   *
>   *	@rx_handler:		handler for received packets
>   *	@rx_handler_data: 	XXX: need comments on this one
> - *	@miniq_ingress:		ingress/clsact qdisc specific data for
> - *				ingress processing
> + *	@tcx_ingress:		BPF & clsact qdisc specific data for ingress processing
>   *	@ingress_queue:		XXX: need comments on this one
>   *	@nf_hooks_ingress:	netfilter hooks executed for ingress packets
>   *	@broadcast:		hw bcast address
> @@ -1949,8 +1948,7 @@ enum netdev_ml_priv_type {
>   *	@xps_maps:		all CPUs/RXQs maps for XPS device
>   *
>   *	@xps_maps:	XXX: need comments on this one
> - *	@miniq_egress:		clsact qdisc specific data for
> - *				egress processing
> + *	@tcx_egress:		BPF & clsact qdisc specific data for egress processing
>   *	@nf_hooks_egress:	netfilter hooks executed for egress packets
>   *	@qdisc_hash:		qdisc hash table
>   *	@watchdog_timeo:	Represents the timeout that is used by
> @@ -2249,9 +2247,8 @@ struct net_device {
>  	unsigned int		gro_ipv4_max_size;
>  	rx_handler_func_t __rcu	*rx_handler;
>  	void __rcu		*rx_handler_data;
> -
> -#ifdef CONFIG_NET_CLS_ACT
> -	struct mini_Qdisc __rcu	*miniq_ingress;
> +#ifdef CONFIG_NET_XGRESS
> +	struct bpf_mprog_entry __rcu *tcx_ingress;
>  #endif
>  	struct netdev_queue __rcu *ingress_queue;
>  #ifdef CONFIG_NETFILTER_INGRESS
> @@ -2279,8 +2276,8 @@ struct net_device {
>  #ifdef CONFIG_XPS
>  	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
>  #endif
> -#ifdef CONFIG_NET_CLS_ACT
> -	struct mini_Qdisc __rcu	*miniq_egress;
> +#ifdef CONFIG_NET_XGRESS
> +	struct bpf_mprog_entry __rcu *tcx_egress;
>  #endif
>  #ifdef CONFIG_NETFILTER_EGRESS
>  	struct nf_hook_entries __rcu *nf_hooks_egress;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 5951904413ab..48c3e307f057 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -943,7 +943,7 @@ struct sk_buff {
>  	__u8			__mono_tc_offset[0];
>  	/* public: */
>  	__u8			mono_delivery_time:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>  	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
>  	__u8			tc_skip_classify:1;
>  #endif
> @@ -992,7 +992,7 @@ struct sk_buff {
>  	__u8			csum_not_inet:1;
>  #endif
>  
> -#ifdef CONFIG_NET_SCHED
> +#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>  	__u16			tc_index;	/* traffic control index */
>  #endif
>  
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..0ade5d1a72b2 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -695,7 +695,7 @@ int skb_do_redirect(struct sk_buff *);
>  
>  static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
>  {
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>  	return skb->tc_at_ingress;
>  #else
>  	return false;
> diff --git a/include/net/tcx.h b/include/net/tcx.h
> new file mode 100644
> index 000000000000..27885ecedff9
> --- /dev/null
> +++ b/include/net/tcx.h
> @@ -0,0 +1,157 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __NET_TCX_H
> +#define __NET_TCX_H
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +
> +#include <net/sch_generic.h>
> +
> +struct mini_Qdisc;
> +
> +struct tcx_entry {
> +	struct bpf_mprog_bundle		bundle;
> +	struct mini_Qdisc __rcu		*miniq;
> +};
> +
> +struct tcx_link {
> +	struct bpf_link link;
> +	struct net_device *dev;
> +	u32 location;
> +	u32 flags;
> +};
> +
> +static inline struct tcx_link *tcx_link(struct bpf_link *link)
> +{
> +	return container_of(link, struct tcx_link, link);
> +}
> +
> +static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link)
> +{
> +	return tcx_link((struct bpf_link *)link);
> +}
> +
> +static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress)
> +{
> +#ifdef CONFIG_NET_XGRESS
> +	skb->tc_at_ingress = ingress;
> +#endif
> +}
> +
> +#ifdef CONFIG_NET_XGRESS
> +void tcx_inc(void);
> +void tcx_dec(void);
> +
> +static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry)
> +{
> +	return container_of(entry->parent, struct tcx_entry, bundle);
> +}
> +
> +static inline void
> +tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry, bool ingress)
> +{
> +	ASSERT_RTNL();
> +	if (ingress)
> +		rcu_assign_pointer(dev->tcx_ingress, entry);
> +	else
> +		rcu_assign_pointer(dev->tcx_egress, entry);
> +}
> +
> +static inline struct bpf_mprog_entry *
> +dev_tcx_entry_fetch(struct net_device *dev, bool ingress)
> +{
> +	ASSERT_RTNL();
> +	if (ingress)
> +		return rcu_dereference_rtnl(dev->tcx_ingress);
> +	else
> +		return rcu_dereference_rtnl(dev->tcx_egress);
> +}
> +
> +static inline struct bpf_mprog_entry *

[..]

> +dev_tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created)

Regarding 'created' argument: any reason we are not doing conventional
reference counting on bpf_mprog_entry? I wonder if there is a better way
to hide those places where we handle BPF_MPROG_FREE explicitly.

Btw, thinking of these a/b arrays, should we call them active/inactive?

> +{
> +	struct bpf_mprog_entry *entry = dev_tcx_entry_fetch(dev, ingress);
> +
> +	*created = false;
> +	if (!entry) {
> +		entry = bpf_mprog_create(sizeof_field(struct tcx_entry,
> +						      miniq));
> +		if (!entry)
> +			return NULL;
> +		*created = true;
> +	}
> +	return entry;
> +}
> +
> +static inline void tcx_skeys_inc(bool ingress)
> +{
> +	tcx_inc();
> +	if (ingress)
> +		net_inc_ingress_queue();
> +	else
> +		net_inc_egress_queue();
> +}
> +
> +static inline void tcx_skeys_dec(bool ingress)
> +{
> +	if (ingress)
> +		net_dec_ingress_queue();
> +	else
> +		net_dec_egress_queue();
> +	tcx_dec();
> +}
> +
> +static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb, int code)
> +{
> +	switch (code) {
> +	case TCX_PASS:
> +		skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
> +		fallthrough;
> +	case TCX_DROP:
> +	case TCX_REDIRECT:
> +		return code;
> +	case TCX_NEXT:
> +	default:
> +		return TCX_NEXT;
> +	}
> +}
> +#endif /* CONFIG_NET_XGRESS */
> +
> +#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL)
> +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int tcx_prog_query(const union bpf_attr *attr,
> +		   union bpf_attr __user *uattr);
> +void dev_tcx_uninstall(struct net_device *dev);
> +#else
> +static inline int tcx_prog_attach(const union bpf_attr *attr,
> +				  struct bpf_prog *prog)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int tcx_link_attach(const union bpf_attr *attr,
> +				  struct bpf_prog *prog)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int tcx_prog_detach(const union bpf_attr *attr,
> +				  struct bpf_prog *prog)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int tcx_prog_query(const union bpf_attr *attr,
> +				 union bpf_attr __user *uattr)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline void dev_tcx_uninstall(struct net_device *dev)
> +{
> +}
> +#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */
> +#endif /* __NET_TCX_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 207f8a37b327..e7584e24bc83 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1035,6 +1035,8 @@ enum bpf_attach_type {
>  	BPF_TRACE_KPROBE_MULTI,
>  	BPF_LSM_CGROUP,
>  	BPF_STRUCT_OPS,
> +	BPF_TCX_INGRESS,
> +	BPF_TCX_EGRESS,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -1052,7 +1054,7 @@ enum bpf_link_type {
>  	BPF_LINK_TYPE_KPROBE_MULTI = 8,
>  	BPF_LINK_TYPE_STRUCT_OPS = 9,
>  	BPF_LINK_TYPE_NETFILTER = 10,
> -
> +	BPF_LINK_TYPE_TCX = 11,
>  	MAX_BPF_LINK_TYPE,
>  };
>  
> @@ -1559,13 +1561,13 @@ union bpf_attr {
>  			__u32		map_fd;		/* struct_ops to attach */
>  		};
>  		union {
> -			__u32		target_fd;	/* object to attach to */
> -			__u32		target_ifindex; /* target ifindex */
> +			__u32	target_fd;	/* target object to attach to or ... */
> +			__u32	target_ifindex; /* target ifindex */
>  		};
>  		__u32		attach_type;	/* attach type */
>  		__u32		flags;		/* extra flags */
>  		union {
> -			__u32		target_btf_id;	/* btf_id of target to attach to */
> +			__u32	target_btf_id;	/* btf_id of target to attach to */
>  			struct {
>  				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
>  				__u32		iter_info_len;	/* iter_info length */
> @@ -1599,6 +1601,13 @@ union bpf_attr {
>  				__s32		priority;
>  				__u32		flags;
>  			} netfilter;
> +			struct {
> +				union {
> +					__u32	relative_fd;
> +					__u32	relative_id;
> +				};
> +				__u32		expected_revision;
> +			} tcx;
>  		};
>  	} link_create;
>  
> @@ -6207,6 +6216,19 @@ struct bpf_sock_tuple {
>  	};
>  };
>  
> +/* (Simplified) user return codes for tcx prog type.
> + * A valid tcx program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TCX_NEXT.
> + */
> +enum tcx_action_base {
> +	TCX_NEXT	= -1,
> +	TCX_PASS	= 0,
> +	TCX_DROP	= 2,
> +	TCX_REDIRECT	= 7,
> +};
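
As an aside, a minimal sketch of a tc BPF program written against these
return codes; the TCX_* values come from the updated uapi header above, while
the section name and loader mechanics are assumptions, not taken from this
patch:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int dummy_filter(struct __sk_buff *skb)
  {
          /* Placeholder condition just for illustration. */
          if (skb->mark == 42)
                  return TCX_DROP;	/* terminating verdict */
          return TCX_NEXT;		/* continue with the next program */
  }

  char LICENSE[] SEC("license") = "GPL";
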
> +
>  struct bpf_xdp_sock {
>  	__u32 queue_id;
>  };
> @@ -6459,6 +6481,11 @@ struct bpf_link_info {
>  			__s32 priority;
>  			__u32 flags;
>  		} netfilter;
> +		struct {
> +			__u32 ifindex;
> +			__u32 attach_type;
> +			__u32 flags;
> +		} tcx;
>  	};
>  } __attribute__((aligned(8)));
>  
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 2dfe1079f772..6a906ff93006 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -31,6 +31,7 @@ config BPF_SYSCALL
>  	select TASKS_TRACE_RCU
>  	select BINARY_PRINTF
>  	select NET_SOCK_MSG if NET
> +	select NET_XGRESS if NET
>  	select PAGE_POOL if NET
>  	default n
>  	help
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 1bea2eb912cd..f526b7573e97 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -21,6 +21,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o
>  obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
>  obj-$(CONFIG_BPF_SYSCALL) += offload.o
>  obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
> +obj-$(CONFIG_BPF_SYSCALL) += tcx.o
>  endif
>  ifeq ($(CONFIG_PERF_EVENTS),y)
>  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 92a57efc77de..e2c219d053f4 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -37,6 +37,8 @@
>  #include <linux/trace_events.h>
>  #include <net/netfilter/nf_bpf_link.h>
>  
> +#include <net/tcx.h>
> +
>  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>  			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
>  			  (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> @@ -3522,31 +3524,57 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>  		return BPF_PROG_TYPE_XDP;
>  	case BPF_LSM_CGROUP:
>  		return BPF_PROG_TYPE_LSM;
> +	case BPF_TCX_INGRESS:
> +	case BPF_TCX_EGRESS:
> +		return BPF_PROG_TYPE_SCHED_CLS;
>  	default:
>  		return BPF_PROG_TYPE_UNSPEC;
>  	}
>  }
>  
> -#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd
> +#define BPF_PROG_ATTACH_LAST_FIELD expected_revision
> +
> +#define BPF_F_ATTACH_MASK_BASE	\
> +	(BPF_F_ALLOW_OVERRIDE |	\
> +	 BPF_F_ALLOW_MULTI |	\
> +	 BPF_F_REPLACE)
> +
> +#define BPF_F_ATTACH_MASK_MPROG	\
> +	(BPF_F_REPLACE |	\
> +	 BPF_F_BEFORE |		\
> +	 BPF_F_AFTER |		\
> +	 BPF_F_FIRST |		\
> +	 BPF_F_LAST |		\
> +	 BPF_F_ID |		\
> +	 BPF_F_LINK)
>  
> -#define BPF_F_ATTACH_MASK \
> -	(BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE)
> +static bool bpf_supports_mprog(enum bpf_prog_type ptype)
> +{
> +	switch (ptype) {
> +	case BPF_PROG_TYPE_SCHED_CLS:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
>  
>  static int bpf_prog_attach(const union bpf_attr *attr)
>  {
>  	enum bpf_prog_type ptype;
>  	struct bpf_prog *prog;
> +	u32 mask;
>  	int ret;
>  
>  	if (CHECK_ATTR(BPF_PROG_ATTACH))
>  		return -EINVAL;
>  
> -	if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
> -		return -EINVAL;
> -
>  	ptype = attach_type_to_prog_type(attr->attach_type);
>  	if (ptype == BPF_PROG_TYPE_UNSPEC)
>  		return -EINVAL;
> +	mask = bpf_supports_mprog(ptype) ?
> +	       BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE;
> +	if (attr->attach_flags & ~mask)
> +		return -EINVAL;
>  
>  	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
>  	if (IS_ERR(prog))
> @@ -3582,6 +3610,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  		else
>  			ret = cgroup_bpf_prog_attach(attr, ptype, prog);
>  		break;
> +	case BPF_PROG_TYPE_SCHED_CLS:
> +		ret = tcx_prog_attach(attr, prog);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  	}
> @@ -3591,25 +3622,42 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  	return ret;
>  }
>  
> -#define BPF_PROG_DETACH_LAST_FIELD attach_type
> +#define BPF_PROG_DETACH_LAST_FIELD expected_revision
>  
>  static int bpf_prog_detach(const union bpf_attr *attr)
>  {
> +	struct bpf_prog *prog = NULL;
>  	enum bpf_prog_type ptype;
> +	int ret;
>  
>  	if (CHECK_ATTR(BPF_PROG_DETACH))
>  		return -EINVAL;
>  
>  	ptype = attach_type_to_prog_type(attr->attach_type);
> +	if (bpf_supports_mprog(ptype)) {
> +		if (ptype == BPF_PROG_TYPE_UNSPEC)
> +			return -EINVAL;
> +		if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG)
> +			return -EINVAL;
> +		prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
> +		if (IS_ERR(prog)) {
> +			if ((int)attr->attach_bpf_fd > 0)
> +				return PTR_ERR(prog);
> +			prog = NULL;
> +		}
> +	}
>  
>  	switch (ptype) {
>  	case BPF_PROG_TYPE_SK_MSG:
>  	case BPF_PROG_TYPE_SK_SKB:
> -		return sock_map_prog_detach(attr, ptype);
> +		ret = sock_map_prog_detach(attr, ptype);
> +		break;
>  	case BPF_PROG_TYPE_LIRC_MODE2:
> -		return lirc_prog_detach(attr);
> +		ret = lirc_prog_detach(attr);
> +		break;
>  	case BPF_PROG_TYPE_FLOW_DISSECTOR:
> -		return netns_bpf_prog_detach(attr, ptype);
> +		ret = netns_bpf_prog_detach(attr, ptype);
> +		break;
>  	case BPF_PROG_TYPE_CGROUP_DEVICE:
>  	case BPF_PROG_TYPE_CGROUP_SKB:
>  	case BPF_PROG_TYPE_CGROUP_SOCK:
> @@ -3618,13 +3666,21 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>  	case BPF_PROG_TYPE_CGROUP_SYSCTL:
>  	case BPF_PROG_TYPE_SOCK_OPS:
>  	case BPF_PROG_TYPE_LSM:
> -		return cgroup_bpf_prog_detach(attr, ptype);
> +		ret = cgroup_bpf_prog_detach(attr, ptype);
> +		break;
> +	case BPF_PROG_TYPE_SCHED_CLS:
> +		ret = tcx_prog_detach(attr, prog);
> +		break;
>  	default:
> -		return -EINVAL;
> +		ret = -EINVAL;
>  	}
> +
> +	if (prog)
> +		bpf_prog_put(prog);
> +	return ret;
>  }
>  
> -#define BPF_PROG_QUERY_LAST_FIELD query.prog_attach_flags
> +#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags
>  
>  static int bpf_prog_query(const union bpf_attr *attr,
>  			  union bpf_attr __user *uattr)
> @@ -3672,6 +3728,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
>  	case BPF_SK_MSG_VERDICT:
>  	case BPF_SK_SKB_VERDICT:
>  		return sock_map_bpf_prog_query(attr, uattr);
> +	case BPF_TCX_INGRESS:
> +	case BPF_TCX_EGRESS:
> +		return tcx_prog_query(attr, uattr);
>  	default:
>  		return -EINVAL;
>  	}
> @@ -4629,6 +4688,13 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>  			goto out;
>  		}
>  		break;
> +	case BPF_PROG_TYPE_SCHED_CLS:
> +		if (attr->link_create.attach_type != BPF_TCX_INGRESS &&
> +		    attr->link_create.attach_type != BPF_TCX_EGRESS) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		break;
>  	default:
>  		ptype = attach_type_to_prog_type(attr->link_create.attach_type);
>  		if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
> @@ -4680,6 +4746,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>  	case BPF_PROG_TYPE_XDP:
>  		ret = bpf_xdp_link_attach(attr, prog);
>  		break;
> +	case BPF_PROG_TYPE_SCHED_CLS:
> +		ret = tcx_link_attach(attr, prog);
> +		break;
>  	case BPF_PROG_TYPE_NETFILTER:
>  		ret = bpf_nf_link_attach(attr, prog);
>  		break;
> diff --git a/kernel/bpf/tcx.c b/kernel/bpf/tcx.c
> new file mode 100644
> index 000000000000..d3d23b4ed4f0
> --- /dev/null
> +++ b/kernel/bpf/tcx.c
> @@ -0,0 +1,347 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/tcx.h>
> +
> +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	bool created, ingress = attr->attach_type == BPF_TCX_INGRESS;
> +	struct net *net = current->nsproxy->net_ns;
> +	struct bpf_mprog_entry *entry;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = __dev_get_by_index(net, attr->target_ifindex);
> +	if (!dev) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +	entry = dev_tcx_entry_fetch_or_create(dev, ingress, &created);
> +	if (!entry) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	ret = bpf_mprog_attach(entry, prog, NULL, attr->attach_flags,
> +			       attr->relative_fd, attr->expected_revision);
> +	if (ret >= 0) {
> +		if (ret == BPF_MPROG_SWAP)
> +			tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
> +		bpf_mprog_commit(entry);
> +		tcx_skeys_inc(ingress);
> +		ret = 0;
> +	} else if (created) {
> +		bpf_mprog_free(entry);
> +	}
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static bool tcx_release_entry(struct bpf_mprog_entry *entry, int code)
> +{
> +	return code == BPF_MPROG_FREE && !tcx_entry(entry)->miniq;
> +}
> +
> +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	bool tcx_release, ingress = attr->attach_type == BPF_TCX_INGRESS;
> +	struct net *net = current->nsproxy->net_ns;
> +	struct bpf_mprog_entry *entry, *peer;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = __dev_get_by_index(net, attr->target_ifindex);
> +	if (!dev) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +	entry = dev_tcx_entry_fetch(dev, ingress);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_detach(entry, prog, NULL, attr->attach_flags,
> +			       attr->relative_fd, attr->expected_revision);
> +	if (ret >= 0) {
> +		tcx_release = tcx_release_entry(entry, ret);
> +		peer = tcx_release ? NULL : bpf_mprog_peer(entry);
> +		if (ret == BPF_MPROG_SWAP || ret == BPF_MPROG_FREE)
> +			tcx_entry_update(dev, peer, ingress);
> +		bpf_mprog_commit(entry);
> +		tcx_skeys_dec(ingress);
> +		if (tcx_release)
> +			bpf_mprog_free(entry);
> +		ret = 0;
> +	}
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static void tcx_uninstall(struct net_device *dev, bool ingress)
> +{
> +	struct bpf_tuple tuple = {};
> +	struct bpf_mprog_entry *entry;
> +	struct bpf_mprog_fp *fp;
> +	struct bpf_mprog_cp *cp;
> +
> +	entry = dev_tcx_entry_fetch(dev, ingress);
> +	if (!entry)
> +		return;
> +	tcx_entry_update(dev, NULL, ingress);
> +	bpf_mprog_commit(entry);
> +	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> +		if (tuple.link)
> +			tcx_link(tuple.link)->dev = NULL;
> +		else
> +			bpf_prog_put(tuple.prog);
> +		tcx_skeys_dec(ingress);
> +	}
> +	WARN_ON_ONCE(tcx_entry(entry)->miniq);
> +	bpf_mprog_free(entry);
> +}
> +
> +void dev_tcx_uninstall(struct net_device *dev)
> +{
> +	ASSERT_RTNL();
> +	tcx_uninstall(dev, true);
> +	tcx_uninstall(dev, false);
> +}
> +
> +int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> +	bool ingress = attr->query.attach_type == BPF_TCX_INGRESS;
> +	struct net *net = current->nsproxy->net_ns;
> +	struct bpf_mprog_entry *entry;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = __dev_get_by_index(net, attr->query.target_ifindex);
> +	if (!dev) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +	entry = dev_tcx_entry_fetch(dev, ingress);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_query(attr, uattr, entry);
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static int tcx_link_prog_attach(struct bpf_link *l, u32 flags, u32 object,
> +				u32 expected_revision)
> +{
> +	struct tcx_link *link = tcx_link(l);
> +	bool created, ingress = link->location == BPF_TCX_INGRESS;
> +	struct net_device *dev = link->dev;
> +	struct bpf_mprog_entry *entry;
> +	int ret;
> +
> +	ASSERT_RTNL();
> +	entry = dev_tcx_entry_fetch_or_create(dev, ingress, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +	ret = bpf_mprog_attach(entry, l->prog, l, flags, object,
> +			       expected_revision);
> +	if (ret >= 0) {
> +		if (ret == BPF_MPROG_SWAP)
> +			tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
> +		bpf_mprog_commit(entry);
> +		tcx_skeys_inc(ingress);
> +		ret = 0;
> +	} else if (created) {
> +		bpf_mprog_free(entry);
> +	}
> +	return ret;
> +}
> +
> +static void tcx_link_release(struct bpf_link *l)
> +{
> +	struct tcx_link *link = tcx_link(l);
> +	bool tcx_release, ingress = link->location == BPF_TCX_INGRESS;
> +	struct bpf_mprog_entry *entry, *peer;
> +	struct net_device *dev;
> +	int ret = 0;
> +
> +	rtnl_lock();
> +	dev = link->dev;
> +	if (!dev)
> +		goto out;
> +	entry = dev_tcx_entry_fetch(dev, ingress);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_detach(entry, l->prog, l, link->flags, 0, 0);
> +	if (ret >= 0) {
> +		tcx_release = tcx_release_entry(entry, ret);
> +		peer = tcx_release ? NULL : bpf_mprog_peer(entry);
> +		if (ret == BPF_MPROG_SWAP || ret == BPF_MPROG_FREE)
> +			tcx_entry_update(dev, peer, ingress);
> +		bpf_mprog_commit(entry);
> +		tcx_skeys_dec(ingress);
> +		if (tcx_release)
> +			bpf_mprog_free(entry);
> +		link->dev = NULL;
> +		ret = 0;
> +	}
> +out:
> +	WARN_ON_ONCE(ret);
> +	rtnl_unlock();
> +}
> +
> +static int tcx_link_update(struct bpf_link *l, struct bpf_prog *nprog,
> +			   struct bpf_prog *oprog)
> +{
> +	struct tcx_link *link = tcx_link(l);
> +	bool ingress = link->location == BPF_TCX_INGRESS;
> +	struct net_device *dev = link->dev;
> +	struct bpf_mprog_entry *entry;
> +	int ret = 0;
> +
> +	rtnl_lock();
> +	if (!link->dev) {
> +		ret = -ENOLINK;
> +		goto out;
> +	}
> +	if (oprog && l->prog != oprog) {
> +		ret = -EPERM;
> +		goto out;
> +	}
> +	oprog = l->prog;
> +	if (oprog == nprog) {
> +		bpf_prog_put(nprog);
> +		goto out;
> +	}
> +	entry = dev_tcx_entry_fetch(dev, ingress);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_attach(entry, nprog, l,
> +			       BPF_F_REPLACE | BPF_F_ID | link->flags,
> +			       l->prog->aux->id, 0);
> +	if (ret >= 0) {
> +		if (ret == BPF_MPROG_SWAP)
> +			tcx_entry_update(dev, bpf_mprog_peer(entry), ingress);
> +		bpf_mprog_commit(entry);
> +		tcx_skeys_inc(ingress);
> +		oprog = xchg(&l->prog, nprog);
> +		bpf_prog_put(oprog);
> +		ret = 0;
> +	}
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static void tcx_link_dealloc(struct bpf_link *l)
> +{
> +	kfree(tcx_link(l));
> +}
> +
> +static void tcx_link_fdinfo(const struct bpf_link *l, struct seq_file *seq)
> +{
> +	const struct tcx_link *link = tcx_link_const(l);
> +	u32 ifindex = 0;
> +
> +	rtnl_lock();
> +	if (link->dev)
> +		ifindex = link->dev->ifindex;
> +	rtnl_unlock();
> +
> +	seq_printf(seq, "ifindex:\t%u\n", ifindex);
> +	seq_printf(seq, "attach_type:\t%u (%s)\n",
> +		   link->location,
> +		   link->location == BPF_TCX_INGRESS ? "ingress" : "egress");
> +	seq_printf(seq, "flags:\t%u\n", link->flags);
> +}
> +
> +static int tcx_link_fill_info(const struct bpf_link *l,
> +			      struct bpf_link_info *info)
> +{
> +	const struct tcx_link *link = tcx_link_const(l);
> +	u32 ifindex = 0;
> +
> +	rtnl_lock();
> +	if (link->dev)
> +		ifindex = link->dev->ifindex;
> +	rtnl_unlock();
> +
> +	info->tcx.ifindex = ifindex;
> +	info->tcx.attach_type = link->location;
> +	info->tcx.flags = link->flags;
> +	return 0;
> +}
> +
> +static int tcx_link_detach(struct bpf_link *l)
> +{
> +	tcx_link_release(l);
> +	return 0;
> +}
> +
> +static const struct bpf_link_ops tcx_link_lops = {
> +	.release	= tcx_link_release,
> +	.detach		= tcx_link_detach,
> +	.dealloc	= tcx_link_dealloc,
> +	.update_prog	= tcx_link_update,
> +	.show_fdinfo	= tcx_link_fdinfo,
> +	.fill_link_info	= tcx_link_fill_info,
> +};
> +
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	struct bpf_link_primer link_primer;
> +	struct net_device *dev;
> +	struct tcx_link *link;
> +	int fd, err;
> +
> +	dev = dev_get_by_index(net, attr->link_create.target_ifindex);
> +	if (!dev)
> +		return -EINVAL;
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put;
> +	}
> +
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
> +	link->location = attr->link_create.attach_type;
> +	link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST);
> +	link->dev = dev;
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err) {
> +		kfree(link);
> +		goto out_put;
> +	}
> +	rtnl_lock();
> +	err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
> +				   attr->link_create.tcx.relative_fd,
> +				   attr->link_create.tcx.expected_revision);
> +	if (!err)
> +		fd = bpf_link_settle(&link_primer);
> +	rtnl_unlock();
> +	if (err) {
> +		link->dev = NULL;
> +		bpf_link_cleanup(&link_primer);
> +		goto out_put;
> +	}
> +	dev_put(dev);
> +	return fd;
> +out_put:
> +	dev_put(dev);
> +	return err;
> +}
> diff --git a/net/Kconfig b/net/Kconfig
> index 2fb25b534df5..d532ec33f1fe 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -52,6 +52,11 @@ config NET_INGRESS
>  config NET_EGRESS
>  	bool
>  
> +config NET_XGRESS
> +	select NET_INGRESS
> +	select NET_EGRESS
> +	bool
> +
>  config NET_REDIRECT
>  	bool
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 3393c2f3dbe8..95c7e3189884 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -107,6 +107,7 @@
>  #include <net/pkt_cls.h>
>  #include <net/checksum.h>
>  #include <net/xfrm.h>
> +#include <net/tcx.h>
>  #include <linux/highmem.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
> @@ -154,7 +155,6 @@
>  #include "dev.h"
>  #include "net-sysfs.h"
>  
> -
>  static DEFINE_SPINLOCK(ptype_lock);
>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>  struct list_head ptype_all __read_mostly;	/* Taps */
> @@ -3923,69 +3923,200 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
>  EXPORT_SYMBOL(dev_loopback_xmit);
>  
>  #ifdef CONFIG_NET_EGRESS
> -static struct sk_buff *
> -sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +static struct netdev_queue *
> +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> +{
> +	int qm = skb_get_queue_mapping(skb);
> +
> +	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> +}
> +
> +static bool netdev_xmit_txqueue_skipped(void)
>  {
> +	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +}
> +
> +void netdev_xmit_skip_txqueue(bool skip)
> +{
> +	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +}
> +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> +#endif /* CONFIG_NET_EGRESS */
> +
> +#ifdef CONFIG_NET_XGRESS
> +static int tc_run(struct tcx_entry *entry, struct sk_buff *skb)
> +{
> +	int ret = TC_ACT_UNSPEC;
>  #ifdef CONFIG_NET_CLS_ACT
> -	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
> -	struct tcf_result cl_res;
> +	struct mini_Qdisc *miniq = rcu_dereference_bh(entry->miniq);
> +	struct tcf_result res;
>  
>  	if (!miniq)
> -		return skb;
> +		return ret;
>  
> -	/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
>  	tc_skb_cb(skb)->mru = 0;
>  	tc_skb_cb(skb)->post_ct = false;
> -	mini_qdisc_bstats_cpu_update(miniq, skb);
>  
> -	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> +	mini_qdisc_bstats_cpu_update(miniq, skb);
> +	ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
> +	/* Only tcf related quirks below. */
> +	switch (ret) {
> +	case TC_ACT_SHOT:
> +		mini_qdisc_qstats_cpu_drop(miniq);
> +		break;
>  	case TC_ACT_OK:
>  	case TC_ACT_RECLASSIFY:
> -		skb->tc_index = TC_H_MIN(cl_res.classid);
> +		skb->tc_index = TC_H_MIN(res.classid);
>  		break;
> +	}
> +#endif /* CONFIG_NET_CLS_ACT */
> +	return ret;
> +}
> +
> +static DEFINE_STATIC_KEY_FALSE(tcx_needed_key);
> +
> +void tcx_inc(void)
> +{
> +	static_branch_inc(&tcx_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(tcx_inc);
> +
> +void tcx_dec(void)
> +{
> +	static_branch_dec(&tcx_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(tcx_dec);
> +
> +static __always_inline enum tcx_action_base
> +tcx_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
> +	const bool needs_mac)
> +{
> +	const struct bpf_mprog_fp *fp;
> +	const struct bpf_prog *prog;
> +	int ret = TCX_NEXT;
> +
> +	if (needs_mac)
> +		__skb_push(skb, skb->mac_len);
> +	bpf_mprog_foreach_prog(entry, fp, prog) {
> +		bpf_compute_data_pointers(skb);
> +		ret = bpf_prog_run(prog, skb);
> +		if (ret != TCX_NEXT)
> +			break;
> +	}
> +	if (needs_mac)
> +		__skb_pull(skb, skb->mac_len);
> +	return tcx_action_code(skb, ret);
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> +		   struct net_device *orig_dev, bool *another)
> +{
> +	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
> +	int sch_ret;
> +
> +	if (!entry)
> +		return skb;
> +	if (*pt_prev) {
> +		*ret = deliver_skb(skb, *pt_prev, orig_dev);
> +		*pt_prev = NULL;
> +	}
> +
> +	qdisc_skb_cb(skb)->pkt_len = skb->len;
> +	tcx_set_ingress(skb, true);
> +
> +	if (static_branch_unlikely(&tcx_needed_key)) {
> +		sch_ret = tcx_run(entry, skb, true);
> +		if (sch_ret != TC_ACT_UNSPEC)
> +			goto ingress_verdict;
> +	}
> +	sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb);
> +ingress_verdict:
> +	switch (sch_ret) {
> +	case TC_ACT_REDIRECT:
> +		/* skb_mac_header check was done by BPF, so we can safely
> +		 * push the L2 header back before redirecting to another
> +		 * netdev.
> +		 */
> +		__skb_push(skb, skb->mac_len);
> +		if (skb_do_redirect(skb) == -EAGAIN) {
> +			__skb_pull(skb, skb->mac_len);
> +			*another = true;
> +			break;
> +		}
> +		*ret = NET_RX_SUCCESS;
> +		return NULL;
>  	case TC_ACT_SHOT:
> -		mini_qdisc_qstats_cpu_drop(miniq);
> -		*ret = NET_XMIT_DROP;
> -		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> +		*ret = NET_RX_DROP;
>  		return NULL;
> +	/* used by tc_run */
>  	case TC_ACT_STOLEN:
>  	case TC_ACT_QUEUED:
>  	case TC_ACT_TRAP:
> -		*ret = NET_XMIT_SUCCESS;
>  		consume_skb(skb);
> +		fallthrough;
> +	case TC_ACT_CONSUMED:
> +		*ret = NET_RX_SUCCESS;
>  		return NULL;
> +	}
> +
> +	return skb;
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +{
> +	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
> +	int sch_ret;
> +
> +	if (!entry)
> +		return skb;
> +
> +	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
> +	 * already set by the caller.
> +	 */
> +	if (static_branch_unlikely(&tcx_needed_key)) {
> +		sch_ret = tcx_run(entry, skb, false);
> +		if (sch_ret != TC_ACT_UNSPEC)
> +			goto egress_verdict;
> +	}
> +	sch_ret = tc_run(container_of(entry->parent, struct tcx_entry, bundle), skb);
> +egress_verdict:
> +	switch (sch_ret) {
>  	case TC_ACT_REDIRECT:
>  		/* No need to push/pop skb's mac_header here on egress! */
>  		skb_do_redirect(skb);
>  		*ret = NET_XMIT_SUCCESS;
>  		return NULL;
> -	default:
> -		break;
> +	case TC_ACT_SHOT:
> +		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +		*ret = NET_XMIT_DROP;
> +		return NULL;
> +	/* used by tc_run */
> +	case TC_ACT_STOLEN:
> +	case TC_ACT_QUEUED:
> +	case TC_ACT_TRAP:
> +		*ret = NET_XMIT_SUCCESS;
> +		return NULL;
>  	}
> -#endif /* CONFIG_NET_CLS_ACT */
>  
>  	return skb;
>  }
> -
> -static struct netdev_queue *
> -netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> -{
> -	int qm = skb_get_queue_mapping(skb);
> -
> -	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> -}
> -
> -static bool netdev_xmit_txqueue_skipped(void)
> +#else
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> +		   struct net_device *orig_dev, bool *another)
>  {
> -	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +	return skb;
>  }
>  
> -void netdev_xmit_skip_txqueue(bool skip)
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  {
> -	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +	return skb;
>  }
> -EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> -#endif /* CONFIG_NET_EGRESS */
> +#endif /* CONFIG_NET_XGRESS */
>  
>  #ifdef CONFIG_XPS
>  static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
> @@ -4169,9 +4300,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>  	skb_update_prio(skb);
>  
>  	qdisc_pkt_len_init(skb);
> -#ifdef CONFIG_NET_CLS_ACT
> -	skb->tc_at_ingress = 0;
> -#endif
> +	tcx_set_ingress(skb, false);
>  #ifdef CONFIG_NET_EGRESS
>  	if (static_branch_unlikely(&egress_needed_key)) {
>  		if (nf_hook_egress_active()) {
> @@ -5103,72 +5232,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
>  EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
>  #endif
>  
> -static inline struct sk_buff *
> -sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> -		   struct net_device *orig_dev, bool *another)
> -{
> -#ifdef CONFIG_NET_CLS_ACT
> -	struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
> -	struct tcf_result cl_res;
> -
> -	/* If there's at least one ingress present somewhere (so
> -	 * we get here via enabled static key), remaining devices
> -	 * that are not configured with an ingress qdisc will bail
> -	 * out here.
> -	 */
> -	if (!miniq)
> -		return skb;
> -
> -	if (*pt_prev) {
> -		*ret = deliver_skb(skb, *pt_prev, orig_dev);
> -		*pt_prev = NULL;
> -	}
> -
> -	qdisc_skb_cb(skb)->pkt_len = skb->len;
> -	tc_skb_cb(skb)->mru = 0;
> -	tc_skb_cb(skb)->post_ct = false;
> -	skb->tc_at_ingress = 1;
> -	mini_qdisc_bstats_cpu_update(miniq, skb);
> -
> -	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> -	case TC_ACT_OK:
> -	case TC_ACT_RECLASSIFY:
> -		skb->tc_index = TC_H_MIN(cl_res.classid);
> -		break;
> -	case TC_ACT_SHOT:
> -		mini_qdisc_qstats_cpu_drop(miniq);
> -		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> -		*ret = NET_RX_DROP;
> -		return NULL;
> -	case TC_ACT_STOLEN:
> -	case TC_ACT_QUEUED:
> -	case TC_ACT_TRAP:
> -		consume_skb(skb);
> -		*ret = NET_RX_SUCCESS;
> -		return NULL;
> -	case TC_ACT_REDIRECT:
> -		/* skb_mac_header check was done by cls/act_bpf, so
> -		 * we can safely push the L2 header back before
> -		 * redirecting to another netdev
> -		 */
> -		__skb_push(skb, skb->mac_len);
> -		if (skb_do_redirect(skb) == -EAGAIN) {
> -			__skb_pull(skb, skb->mac_len);
> -			*another = true;
> -			break;
> -		}
> -		*ret = NET_RX_SUCCESS;
> -		return NULL;
> -	case TC_ACT_CONSUMED:
> -		*ret = NET_RX_SUCCESS;
> -		return NULL;
> -	default:
> -		break;
> -	}
> -#endif /* CONFIG_NET_CLS_ACT */
> -	return skb;
> -}
> -
>  /**
>   *	netdev_is_rx_handler_busy - check if receive handler is registered
>   *	@dev: device to check
> @@ -10873,7 +10936,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
>  
>  		/* Shutdown queueing discipline. */
>  		dev_shutdown(dev);
> -
> +		dev_tcx_uninstall(dev);
>  		dev_xdp_uninstall(dev);
>  		bpf_dev_bound_netdev_unregister(dev);
>  
> diff --git a/net/core/filter.c b/net/core/filter.c
> index d25d52854c21..1ff9a0988ea6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9233,7 +9233,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
>  	__u8 value_reg = si->dst_reg;
>  	__u8 skb_reg = si->src_reg;
>  
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>  	/* If the tstamp_type is read,
>  	 * the bpf prog is aware the tstamp could have delivery time.
>  	 * Thus, read skb->tstamp as is if tstamp_type_access is true.
> @@ -9267,7 +9267,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
>  	__u8 value_reg = si->src_reg;
>  	__u8 skb_reg = si->dst_reg;
>  
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>  	/* If the tstamp_type is read,
>  	 * the bpf prog is aware the tstamp could have delivery time.
>  	 * Thus, write skb->tstamp as is if tstamp_type_access is true.
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 4b95cb1ac435..470c70deffe2 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -347,8 +347,7 @@ config NET_SCH_FQ_PIE
>  config NET_SCH_INGRESS
>  	tristate "Ingress/classifier-action Qdisc"
>  	depends on NET_CLS_ACT
> -	select NET_INGRESS
> -	select NET_EGRESS
> +	select NET_XGRESS
>  	help
>  	  Say Y here if you want to use classifiers for incoming and/or outgoing
>  	  packets. This qdisc doesn't do anything else besides running classifiers,
> @@ -679,6 +678,7 @@ config NET_EMATCH_IPT
>  config NET_CLS_ACT
>  	bool "Actions"
>  	select NET_CLS
> +	select NET_XGRESS
>  	help
>  	  Say Y here if you want to use traffic control actions. Actions
>  	  get attached to classifiers and are invoked after a successful
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..4af1360f537e 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -13,6 +13,7 @@
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
>  #include <net/pkt_cls.h>
> +#include <net/tcx.h>
>  
>  struct ingress_sched_data {
>  	struct tcf_block *block;
> @@ -78,11 +79,18 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>  {
>  	struct ingress_sched_data *q = qdisc_priv(sch);
>  	struct net_device *dev = qdisc_dev(sch);
> +	struct bpf_mprog_entry *entry;
> +	bool created;
>  	int err;
>  
>  	net_inc_ingress_queue();
>  
> -	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> +	entry = dev_tcx_entry_fetch_or_create(dev, true, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +	mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq);
> +	if (created)
> +		tcx_entry_update(dev, entry, true);
>  
>  	q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>  	q->block_info.chain_head_change = clsact_chain_head_change;
> @@ -93,15 +101,20 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>  		return err;
>  
>  	mini_qdisc_pair_block_init(&q->miniqp, q->block);
> -
>  	return 0;
>  }
>  
>  static void ingress_destroy(struct Qdisc *sch)
>  {
>  	struct ingress_sched_data *q = qdisc_priv(sch);
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct bpf_mprog_entry *entry = rtnl_dereference(dev->tcx_ingress);
>  
>  	tcf_block_put_ext(q->block, sch, &q->block_info);
> +	if (entry && !bpf_mprog_total(entry)) {
> +		tcx_entry_update(dev, NULL, true);
> +		bpf_mprog_free(entry);
> +	}
>  	net_dec_ingress_queue();
>  }
>  
> @@ -217,12 +230,19 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  {
>  	struct clsact_sched_data *q = qdisc_priv(sch);
>  	struct net_device *dev = qdisc_dev(sch);
> +	struct bpf_mprog_entry *entry;
> +	bool created;
>  	int err;
>  
>  	net_inc_ingress_queue();
>  	net_inc_egress_queue();
>  
> -	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
> +	entry = dev_tcx_entry_fetch_or_create(dev, true, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq);
> +	if (created)
> +		tcx_entry_update(dev, entry, true);
>  
>  	q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>  	q->ingress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -235,7 +255,12 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  
>  	mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
>  
> -	mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
> +	entry = dev_tcx_entry_fetch_or_create(dev, false, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +	mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq);
> +	if (created)
> +		tcx_entry_update(dev, entry, false);
>  
>  	q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
>  	q->egress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -247,9 +272,21 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  static void clsact_destroy(struct Qdisc *sch)
>  {
>  	struct clsact_sched_data *q = qdisc_priv(sch);
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct bpf_mprog_entry *ingress_entry = rtnl_dereference(dev->tcx_ingress);
> +	struct bpf_mprog_entry *egress_entry = rtnl_dereference(dev->tcx_egress);
>  
>  	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
> +	if (egress_entry && !bpf_mprog_total(egress_entry)) {
> +		tcx_entry_update(dev, NULL, false);
> +		bpf_mprog_free(egress_entry);
> +	}
> +
>  	tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
> +	if (ingress_entry && !bpf_mprog_total(ingress_entry)) {
> +		tcx_entry_update(dev, NULL, true);
> +		bpf_mprog_free(ingress_entry);
> +	}
>  
>  	net_dec_ingress_queue();
>  	net_dec_egress_queue();
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 207f8a37b327..e7584e24bc83 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1035,6 +1035,8 @@ enum bpf_attach_type {
>  	BPF_TRACE_KPROBE_MULTI,
>  	BPF_LSM_CGROUP,
>  	BPF_STRUCT_OPS,
> +	BPF_TCX_INGRESS,
> +	BPF_TCX_EGRESS,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -1052,7 +1054,7 @@ enum bpf_link_type {
>  	BPF_LINK_TYPE_KPROBE_MULTI = 8,
>  	BPF_LINK_TYPE_STRUCT_OPS = 9,
>  	BPF_LINK_TYPE_NETFILTER = 10,
> -
> +	BPF_LINK_TYPE_TCX = 11,
>  	MAX_BPF_LINK_TYPE,
>  };
>  
> @@ -1559,13 +1561,13 @@ union bpf_attr {
>  			__u32		map_fd;		/* struct_ops to attach */
>  		};
>  		union {
> -			__u32		target_fd;	/* object to attach to */
> -			__u32		target_ifindex; /* target ifindex */
> +			__u32	target_fd;	/* target object to attach to or ... */
> +			__u32	target_ifindex; /* target ifindex */
>  		};
>  		__u32		attach_type;	/* attach type */
>  		__u32		flags;		/* extra flags */
>  		union {
> -			__u32		target_btf_id;	/* btf_id of target to attach to */
> +			__u32	target_btf_id;	/* btf_id of target to attach to */
>  			struct {
>  				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
>  				__u32		iter_info_len;	/* iter_info length */
> @@ -1599,6 +1601,13 @@ union bpf_attr {
>  				__s32		priority;
>  				__u32		flags;
>  			} netfilter;
> +			struct {
> +				union {
> +					__u32	relative_fd;
> +					__u32	relative_id;
> +				};
> +				__u32		expected_revision;
> +			} tcx;
>  		};
>  	} link_create;
>  
> @@ -6207,6 +6216,19 @@ struct bpf_sock_tuple {
>  	};
>  };
>  
> +/* (Simplified) user return codes for tcx prog type.
> + * A valid tcx program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TCX_NEXT.
> + */
> +enum tcx_action_base {
> +	TCX_NEXT	= -1,
> +	TCX_PASS	= 0,
> +	TCX_DROP	= 2,
> +	TCX_REDIRECT	= 7,
> +};
> +
>  struct bpf_xdp_sock {
>  	__u32 queue_id;
>  };
> @@ -6459,6 +6481,11 @@ struct bpf_link_info {
>  			__s32 priority;
>  			__u32 flags;
>  		} netfilter;
> +		struct {
> +			__u32 ifindex;
> +			__u32 attach_type;
> +			__u32 flags;
> +		} tcx;
>  	};
>  } __attribute__((aligned(8)));
>  
> -- 
> 2.34.1
> 
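
To make the new uapi pieces above a bit more concrete, here is a rough,
illustrative sketch (not part of the patch) of creating a tcx ingress link
through the raw bpf(2) syscall, using the link_create fields added in this
series; prog_fd is assumed to be an already loaded BPF_PROG_TYPE_SCHED_CLS
program returning the TCX_* codes above, and ifindex the target netdev:

	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>

	static int tcx_link_create(int prog_fd, int ifindex)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.link_create.prog_fd        = prog_fd;
		attr.link_create.target_ifindex = ifindex;
		attr.link_create.attach_type    = BPF_TCX_INGRESS;
		/* 0 means: do not assert a specific revision of the attach state. */
		attr.link_create.tcx.expected_revision = 0;

		/* On success this returns a link fd; holding or pinning that fd
		 * keeps the attachment alive and under the owner's control.
		 */
		return (int)syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
	}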

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-08 10:11     ` Daniel Borkmann
@ 2023-06-08 19:46       ` Jamal Hadi Salim
  2023-06-08 21:24         ` Andrii Nakryiko
  0 siblings, 1 reply; 49+ messages in thread
From: Jamal Hadi Salim @ 2023-06-08 19:46 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu,
	joe, toke, davem, bpf, netdev

Hi Daniel,

On Thu, Jun 8, 2023 at 6:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Hi Jamal,
>
> On 6/8/23 3:25 AM, Jamal Hadi Salim wrote:
> [...]
> > A general question (which I think I asked last time as well): who
> > decides what comes after/before what prog in this setup? And would
> > that same entity not have been able to make the same decision using tc
> > priorities?
>
> Back in the first version of the series I initially coded up an option
> where tc_run() would basically be a fake 'bpf_prog' with, say, a fixed
> prio of 1000. It would get executed via tcx_run() when iterating via
> bpf_mprog_foreach_prog() where bpf_prog_run() is called, and users could
> then pick a native BPF prio before or after that. But then the feedback
> was that sticking to prio is a bad user experience, which led to the
> development of what is in patch 1 of this series (see the details there).
>

Thanks. I read the commit message in patch 1 and followed the thread
back, including some of the discussion we had, and I still disagree
that this couldn't be solved with a smart priority-based scheme - but
I think we can move on since this is standalone and doesn't affect tc.

Daniel - I am still curious how, in the new scheme of things, a Cilium
vs. Datadog food fight would get resolved without some arbitration
entity?

> > The idea of protecting programs from being unloaded is very welcome
> > but it feels like it would have made sense as a separate patchset (we
> > have good need for it). Would it be possible to use that feature in
> > tc and xdp?
> BPF links are supported for XDP today; tc BPF is one of the few remaining
> cases where that is not possible yet, hence the work in this series. What
> XDP lacks today, however, is multi-prog support. With the bpf_mprog concept
> that could be addressed via the same common/uniform API (and Andrii expressed
> interest in integrating this also for cgroup progs), so yes, various hook
> points/program types could benefit from it.

Is there some XDP-related sample I could look at? Let me describe our
use case: let's say we load an eBPF program foo attached to XDP on a
netdev, and then something further up the stack consumes the results
of that eBPF XDP program. For some reason someone, at some point,
decides to replace the XDP prog with a different one - and the new
prog does a very different thing. Could we stop the replacement with
the link mechanism you describe? I.e., the program is still loaded
but is no longer attached to the netdev.
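
(For reference, a link-based XDP attachment with libbpf looks roughly like
the sketch below; the point of the link is that, while the link fd is held
or the link is pinned in bpffs, the attachment cannot be detached or
replaced by anything going around the link. Illustrative only - object
path, program name and ifindex are assumptions, error handling omitted:)

	#include <bpf/libbpf.h>

	static struct bpf_link *attach_xdp_with_link(int ifindex)
	{
		struct bpf_object *obj = bpf_object__open_file("xdp_prog.bpf.o", NULL);
		struct bpf_program *prog;
		struct bpf_link *link;

		bpf_object__load(obj);
		prog = bpf_object__find_program_by_name(obj, "xdp_main");
		/* The returned link owns the attachment on this ifindex. */
		link = bpf_program__attach_xdp(prog, ifindex);
		/* Pinning keeps the attachment past process exit and under
		 * control of whoever holds/opens the pinned link.
		 */
		bpf_link__pin(link, "/sys/fs/bpf/links/xdp_main");
		return link;
	}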


> >> +struct tcx_entry {
> >> +       struct bpf_mprog_bundle         bundle;
> >> +       struct mini_Qdisc __rcu         *miniq;
> >> +};
> >> +
> >
> > Can you please move miniq to the front? From where I sit this looks like:
> > struct tcx_entry {
> >          struct bpf_mprog_bundle    bundle
> > __attribute__((__aligned__(64))); /*     0  3264 */
> >
> >          /* XXX last struct has 36 bytes of padding */
> >
> >          /* --- cacheline 51 boundary (3264 bytes) --- */
> >          struct mini_Qdisc *        miniq;                /*  3264     8 */
> >
> >          /* size: 3328, cachelines: 52, members: 2 */
> >          /* padding: 56 */
> >          /* paddings: 1, sum paddings: 36 */
> >          /* forced alignments: 1 */
> > } __attribute__((__aligned__(64)));
> >
> > That is a _lot_ of cachelines - at the expense of the status quo
> > clsact/ingress qdiscs which access miniq.
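
(Concretely, something along these lines, so that miniq stays at the start
of the struct - just a sketch of the suggestion:)

	struct tcx_entry {
		struct mini_Qdisc __rcu		*miniq;
		struct bpf_mprog_bundle		bundle;
	};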
>
> Ah yes, I'll fix this up.

Thanks.

cheers,
jamal
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-07 19:26 ` [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs Daniel Borkmann
  2023-06-08 17:23   ` Stanislav Fomichev
@ 2023-06-08 20:53   ` Andrii Nakryiko
  1 sibling, 0 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-08 20:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu,
	joe, toke, davem, bpf, netdev

On Wed, Jun 7, 2023 at 12:27 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This adds a generic layer called bpf_mprog which can be reused by different
> attachment layers to enable multi-program attachment and dependency resolution.
> In-kernel users of the bpf_mprog don't need to care about the dependency
> resolution internals, they can just consume it with few API calls.
>
> The initial idea of having a generic API sparked out of discussion [0] from an
> earlier revision of this work where tc's priority was reused and exposed via
> BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> as-is for classic tc BPF. The feedback was that priority provides a bad user
> experience and is hard to use [1], e.g.:
>
>   I cannot help but feel that priority logic copy-paste from old tc, netfilter
>   and friends is done because "that's how things were done in the past". [...]
>   Priority gets exposed everywhere in uapi all the way to bpftool when it's
>   right there for users to understand. And that's the main problem with it.
>
>   The user don't want to and don't need to be aware of it, but uapi forces them
>   to pick the priority. [...] Your cover letter [0] example proves that in
>   real life different service pick the same priority. They simply don't know
>   any better. Priority is an unnecessary magic that apps _have_ to pick, so
>   they just copy-paste and everyone ends up using the same.
>
> The course of the discussion showed more and more the need for a generic,
> reusable API where the "same look and feel" can be applied for various other
> program types beyond just tc BPF, for example XDP today does not have multi-
> program support in kernel, but also there was interest around this API for
> improving management of cgroup program types. Such common multi-program
> management concept is useful for BPF management daemons or user space BPF
> applications coordinating about their attachments.
>
> Both from Cilium and Meta side [2], we've collected the following requirements
> for a generic attach/detach/query API for multi-progs which has been implemented
> as part of this work:
>
>   - Support prog-based attach/detach and link API
>   - Dependency directives (can also be combined):
>     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
>       - BPF_F_ID flag as {fd,id} toggle
>       - BPF_F_LINK flag as {prog,link} toggle
>       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
>         BPF_F_AFTER will just append for the case of attaching
>       - Enforced only at attach time
>     - BPF_F_{FIRST,LAST}
>       - Enforced throughout the bpf_mprog state's lifetime
>       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
>   - Internal revision counter and optionally being able to pass expected_revision
>   - User space daemon can query current state with revision, and pass it along
>     for attachment to assert current state before doing updates
>   - Query also gets extension for link_ids array and link_attach_flags:
>     - prog_ids are always filled with program IDs
>     - link_ids are filled with link IDs when link was used, otherwise 0
>     - {prog,link}_attach_flags for holding {prog,link}-specific flags
>   - Must be easy to integrate/reuse for in-kernel users
>
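As a rough illustration (not part of the patch) of how a user-space agent
could drive these directives through the raw bpf(2) syscall - field names
as in the uapi changes further below; the usual raw-syscall includes
(string.h, unistd.h, sys/syscall.h, linux/bpf.h) plus fds and revision from
a prior load/query are assumed:

	static int attach_before(int ifindex, int new_prog_fd, int other_prog_fd,
				 __u32 last_revision)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.target_ifindex    = ifindex;        /* tcx: netdev to attach to */
		attr.attach_bpf_fd     = new_prog_fd;    /* program being attached */
		attr.attach_type       = BPF_TCX_INGRESS;
		attr.attach_flags      = BPF_F_BEFORE;   /* place it in front of ... */
		attr.relative_fd       = other_prog_fd;  /* ... this already attached prog */
		attr.expected_revision = last_revision;  /* mismatch fails with -ESTALE */

		return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
	}
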
> The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> consisting of the additions of the attachment flags, revision counter, and
> expanding existing union with relative_{fd,id} member.
>
> The bpf_mprog framework consists of an bpf_mprog_entry object which holds
> an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> structure). Both have been separated, so that fast-path gets efficient packing
> of bpf_prog pointers for maximum cache efficiency. Also, an array has been
> chosen instead of a linked list or other structures to remove unnecessary
> indirections
> for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> is populated and then just swapped which avoids additional allocations that
> could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> currently static, but they could be converted to dynamic allocation if necessary
> at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
> for example, in case of tcx which uses this API in the next patch, it piggy-
> backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> add,del} implementation and an extensive test suite for checking all aspects
> of this API for prog-based attach/detach and link API as BPF selftests in
> this series.
>
> Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
>
>   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
>   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
>   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  MAINTAINERS                    |   1 +
>  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
>  include/uapi/linux/bpf.h       |  37 ++-
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  37 ++-
>  6 files changed, 781 insertions(+), 17 deletions(-)
>  create mode 100644 include/linux/bpf_mprog.h
>  create mode 100644 kernel/bpf/mprog.c
>

I like the API itself; I think it strikes the right balance. My
questions and comments below are mostly about specific implementation
details.

> diff --git a/MAINTAINERS b/MAINTAINERS
> index c904dba1733b..754a9eeca0a1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3733,6 +3733,7 @@ F:        include/linux/filter.h
>  F:     include/linux/tnum.h
>  F:     kernel/bpf/core.c
>  F:     kernel/bpf/dispatcher.c
> +F:     kernel/bpf/mprog.c
>  F:     kernel/bpf/syscall.c
>  F:     kernel/bpf/tnum.c
>  F:     kernel/bpf/trampoline.c
> diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h
> new file mode 100644
> index 000000000000..7399181d8e6c
> --- /dev/null
> +++ b/include/linux/bpf_mprog.h
> @@ -0,0 +1,245 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __BPF_MPROG_H
> +#define __BPF_MPROG_H
> +
> +#include <linux/bpf.h>
> +
> +#define BPF_MPROG_MAX  64
> +#define BPF_MPROG_SWAP 1
> +#define BPF_MPROG_FREE 2
> +
> +struct bpf_mprog_fp {
> +       struct bpf_prog *prog;
> +};
> +
> +struct bpf_mprog_cp {
> +       struct bpf_link *link;
> +       u32 flags;
> +};
> +
> +struct bpf_mprog_entry {
> +       struct bpf_mprog_fp fp_items[BPF_MPROG_MAX] ____cacheline_aligned;
> +       struct bpf_mprog_cp cp_items[BPF_MPROG_MAX] ____cacheline_aligned;
> +       struct bpf_mprog_bundle *parent;
> +};
> +
> +struct bpf_mprog_bundle {
> +       struct bpf_mprog_entry a;
> +       struct bpf_mprog_entry b;

I get why we want a and b for the bpf_mprog_fp items, as we don't want
to modify the "effective" array while it might be read by
bpf_prog_run(). But can't we just modify the bpf_mprog_cp items in
place instead of having an entire BPF_MPROG_MAX worth of copies?

> +       struct rcu_head rcu;
> +       struct bpf_prog *ref;
> +       atomic_t revision;
> +};
> +
> +struct bpf_tuple {
> +       struct bpf_prog *prog;
> +       struct bpf_link *link;
> +};
> +
> +static inline struct bpf_mprog_entry *
> +bpf_mprog_peer(const struct bpf_mprog_entry *entry)
> +{
> +       if (entry == &entry->parent->a)
> +               return &entry->parent->b;
> +       else
> +               return &entry->parent->a;
> +}
> +
> +#define bpf_mprog_foreach_tuple(entry, fp, cp, t)                      \
> +       for (fp = &entry->fp_items[0], cp = &entry->cp_items[0];        \
> +            ({                                                         \
> +               t.prog = READ_ONCE(fp->prog);                           \
> +               t.link = cp->link;                                      \
> +               t.prog;                                                 \
> +             });                                                       \
> +            fp++, cp++)
> +
> +#define bpf_mprog_foreach_prog(entry, fp, p)                           \
> +       for (fp = &entry->fp_items[0];                                  \
> +            (p = READ_ONCE(fp->prog));                                 \
> +            fp++)
> +
> +static inline struct bpf_mprog_entry *bpf_mprog_create(size_t extra_size)
> +{
> +       struct bpf_mprog_bundle *bundle;
> +
> +       /* Fast-path items are not extensible, must only contain prog pointer! */
> +       BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
> +       /* Control-path items can be extended w/o affecting fast-path. */
> +       BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) != ARRAY_SIZE(bundle->a.cp_items));
> +
> +       bundle = kzalloc(sizeof(*bundle) + extra_size, GFP_KERNEL);
> +       if (bundle) {
> +               atomic_set(&bundle->revision, 1);
> +               bundle->a.parent = bundle;
> +               bundle->b.parent = bundle;
> +               return &bundle->a;
> +       }
> +       return NULL;
> +}
> +
> +static inline void bpf_mprog_free(struct bpf_mprog_entry *entry)
> +{
> +       kfree_rcu(entry->parent, rcu);
> +}
> +
> +static inline void bpf_mprog_mark_ref(struct bpf_mprog_entry *entry,
> +                                     struct bpf_prog *prog)
> +{
> +       WARN_ON_ONCE(entry->parent->ref);
> +       entry->parent->ref = prog;
> +}
> +
> +static inline u32 bpf_mprog_flags(u32 cur_flags, u32 req_flags, u32 flag)
> +{
> +       if (req_flags & flag)
> +               cur_flags |= flag;
> +       else
> +               cur_flags &= ~flag;
> +       return cur_flags;
> +}
> +
> +static inline u32 bpf_mprog_max(void)
> +{
> +       return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_first(struct bpf_mprog_entry *entry)
> +{
> +       return READ_ONCE(entry->fp_items[0].prog);
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_last(struct bpf_mprog_entry *entry)
> +{
> +       struct bpf_prog *tmp, *prog = NULL;
> +       struct bpf_mprog_fp *fp;
> +
> +       bpf_mprog_foreach_prog(entry, fp, tmp)
> +               prog = tmp;
> +       return prog;
> +}
> +
> +static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
> +                                   struct bpf_prog *prog)
> +{
> +       const struct bpf_mprog_fp *fp;
> +       const struct bpf_prog *tmp;
> +
> +       bpf_mprog_foreach_prog(entry, fp, tmp) {
> +               if (tmp == prog)
> +                       return true;
> +       }
> +       return false;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_first_reg(struct bpf_mprog_entry *entry)
> +{
> +       struct bpf_tuple tuple = {};
> +       struct bpf_mprog_fp *fp;
> +       struct bpf_mprog_cp *cp;
> +
> +       bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> +               if (cp->flags & BPF_F_FIRST)
> +                       continue;
> +               return tuple.prog;
> +       }
> +       return NULL;
> +}
> +
> +static inline struct bpf_prog *bpf_mprog_last_reg(struct bpf_mprog_entry *entry)
> +{
> +       struct bpf_tuple tuple = {};
> +       struct bpf_prog *prog = NULL;
> +       struct bpf_mprog_fp *fp;
> +       struct bpf_mprog_cp *cp;
> +
> +       bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> +               if (cp->flags & BPF_F_LAST)
> +                       break;
> +               prog = tuple.prog;
> +       }
> +       return prog;
> +}
> +
> +static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
> +{
> +       do {
> +               atomic_inc(&entry->parent->revision);
> +       } while (atomic_read(&entry->parent->revision) == 0);

why not just use atomic64_t and never care about zero wrap-around?

> +       synchronize_rcu();
> +       if (entry->parent->ref) {
> +               bpf_prog_put(entry->parent->ref);
> +               entry->parent->ref = NULL;
> +       }
> +}
> +
> +static inline void bpf_mprog_entry_clear(struct bpf_mprog_entry *entry)
> +{
> +       memset(entry->fp_items, 0, sizeof(entry->fp_items));
> +       memset(entry->cp_items, 0, sizeof(entry->cp_items));
> +}
> +
> +static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry)
> +{
> +       return atomic_read(&entry->parent->revision);
> +}
> +
> +static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 which,
> +                                 struct bpf_mprog_fp **fp_dst,
> +                                 struct bpf_mprog_cp **cp_dst)
> +{
> +       *fp_dst = &entry->fp_items[which];
> +       *cp_dst = &entry->cp_items[which];
> +}
> +
> +static inline void bpf_mprog_write(struct bpf_mprog_fp *fp_dst,
> +                                  struct bpf_mprog_cp *cp_dst,
> +                                  struct bpf_tuple *tuple, u32 flags)
> +{
> +       WRITE_ONCE(fp_dst->prog, tuple->prog);
> +       cp_dst->link  = tuple->link;
> +       cp_dst->flags = flags;
> +}
> +
> +static inline void bpf_mprog_copy(struct bpf_mprog_fp *fp_dst,
> +                                 struct bpf_mprog_cp *cp_dst,
> +                                 struct bpf_mprog_fp *fp_src,
> +                                 struct bpf_mprog_cp *cp_src)
> +{
> +       WRITE_ONCE(fp_dst->prog, READ_ONCE(fp_src->prog));
> +       memcpy(cp_dst, cp_src, sizeof(*cp_src));
> +}
> +
> +static inline void bpf_mprog_copy_range(struct bpf_mprog_entry *peer,
> +                                       struct bpf_mprog_entry *entry,

It's not clear which one is the source and which is the destination;
why not just use "src" and "dst" naming?

> +                                       u32 idx_peer, u32 idx_entry, u32 num)
> +{
> +       memcpy(&peer->fp_items[idx_peer], &entry->fp_items[idx_entry],
> +              num * sizeof(peer->fp_items[0]));
> +       memcpy(&peer->cp_items[idx_peer], &entry->cp_items[idx_entry],
> +              num * sizeof(peer->cp_items[0]));
> +}
> +
> +static inline u32 bpf_mprog_total(struct bpf_mprog_entry *entry)
> +{
> +       const struct bpf_mprog_fp *fp;
> +       const struct bpf_prog *tmp;
> +       u32 num = 0;
> +
> +       bpf_mprog_foreach_prog(entry, fp, tmp)
> +               num++;
> +       return num;
> +}
> +
> +int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> +                    struct bpf_link *link, u32 flags, u32 object,
> +                    u32 expected_revision);
> +int bpf_mprog_detach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> +                    struct bpf_link *link, u32 flags, u32 object,
> +                    u32 expected_revision);
> +
> +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> +                   struct bpf_mprog_entry *entry);
> +
> +#endif /* __BPF_MPROG_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a7b5e91dd768..207f8a37b327 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1102,7 +1102,14 @@ enum bpf_link_type {
>   */
>  #define BPF_F_ALLOW_OVERRIDE   (1U << 0)
>  #define BPF_F_ALLOW_MULTI      (1U << 1)
> +/* Generic attachment flags. */
>  #define BPF_F_REPLACE          (1U << 2)
> +#define BPF_F_BEFORE           (1U << 3)
> +#define BPF_F_AFTER            (1U << 4)
> +#define BPF_F_FIRST            (1U << 5)
> +#define BPF_F_LAST             (1U << 6)
> +#define BPF_F_ID               (1U << 7)
> +#define BPF_F_LINK             BPF_F_LINK /* 1 << 13 */
>
>  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>   * verifier will perform strict alignment checking as if the kernel
> @@ -1433,14 +1440,19 @@ union bpf_attr {
>         };
>
>         struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -               __u32           target_fd;      /* container object to attach to */
> -               __u32           attach_bpf_fd;  /* eBPF program to attach */
> +               union {
> +                       __u32   target_fd;      /* target object to attach to or ... */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };
> +               __u32           attach_bpf_fd;
>                 __u32           attach_type;
>                 __u32           attach_flags;
> -               __u32           replace_bpf_fd; /* previously attached eBPF
> -                                                * program to replace if
> -                                                * BPF_F_REPLACE is used
> -                                                */
> +               union {
> +                       __u32   relative_fd;
> +                       __u32   relative_id;
> +                       __u32   replace_bpf_fd;
> +               };
> +               __u32           expected_revision;
>         };
>
>         struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1486,16 +1498,25 @@ union bpf_attr {
>         } info;
>
>         struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -               __u32           target_fd;      /* container object to query */
> +               union {
> +                       __u32   target_fd;      /* target object to query or ... */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };
>                 __u32           attach_type;
>                 __u32           query_flags;
>                 __u32           attach_flags;
>                 __aligned_u64   prog_ids;
> -               __u32           prog_cnt;
> +               union {
> +                       __u32   prog_cnt;
> +                       __u32   count;
> +               };
> +               __u32           revision;
>                 /* output: per-program attach_flags.
>                  * not allowed to be set during effective query.
>                  */
>                 __aligned_u64   prog_attach_flags;
> +               __aligned_u64   link_ids;
> +               __aligned_u64   link_attach_flags;

flags are 32-bit, no?

>         } query;
>
>         struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 1d3892168d32..1bea2eb912cd 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>  obj-${CONFIG_BPF_LSM}    += bpf_inode_storage.o
> -obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> +obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
>  obj-$(CONFIG_BPF_JIT) += dispatcher.o
> diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c
> new file mode 100644
> index 000000000000..efc3b73f8bf5
> --- /dev/null
> +++ b/kernel/bpf/mprog.c
> @@ -0,0 +1,476 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +#include <linux/filter.h>
> +
> +static int bpf_mprog_tuple_relative(struct bpf_tuple *tuple,
> +                                   u32 object, u32 flags,
> +                                   enum bpf_prog_type type)
> +{
> +       struct bpf_prog *prog;
> +       struct bpf_link *link;
> +
> +       memset(tuple, 0, sizeof(*tuple));
> +       if (!(flags & (BPF_F_REPLACE | BPF_F_BEFORE | BPF_F_AFTER)))
> +               return object || (flags & (BPF_F_ID | BPF_F_LINK)) ?
> +                      -EINVAL : 0;
> +       if (flags & BPF_F_LINK) {
> +               if (flags & BPF_F_ID)
> +                       link = bpf_link_by_id(object);
> +               else
> +                       link = bpf_link_get_from_fd(object);
> +               if (IS_ERR(link))
> +                       return PTR_ERR(link);
> +               if (type && link->prog->type != type) {
> +                       bpf_link_put(link);
> +                       return -EINVAL;
> +               }
> +               tuple->link = link;
> +               tuple->prog = link->prog;
> +       } else {
> +               if (flags & BPF_F_ID)
> +                       prog = bpf_prog_by_id(object);
> +               else
> +                       prog = bpf_prog_get(object);
> +               if (IS_ERR(prog)) {
> +                       if (!object &&
> +                           !(flags & BPF_F_ID))
> +                               return 0;
> +                       return PTR_ERR(prog);
> +               }
> +               if (type && prog->type != type) {
> +                       bpf_prog_put(prog);
> +                       return -EINVAL;
> +               }
> +               tuple->link = NULL;
> +               tuple->prog = prog;
> +       }
> +       return 0;
> +}
> +
> +static void bpf_mprog_tuple_put(struct bpf_tuple *tuple)
> +{
> +       if (tuple->link)
> +               bpf_link_put(tuple->link);
> +       else if (tuple->prog)
> +               bpf_prog_put(tuple->prog);
> +}
> +
> +static int bpf_mprog_replace(struct bpf_mprog_entry *entry,
> +                            struct bpf_tuple *ntuple,
> +                            struct bpf_tuple *rtuple, u32 rflags)
> +{
> +       struct bpf_mprog_fp *fp;
> +       struct bpf_mprog_cp *cp;
> +       struct bpf_prog *oprog;
> +       u32 iflags;
> +       int i;
> +
> +       if (rflags & (BPF_F_BEFORE | BPF_F_AFTER | BPF_F_LINK))
> +               return -EINVAL;
> +       if (rtuple->prog != ntuple->prog &&
> +           bpf_mprog_exists(entry, ntuple->prog))
> +               return -EEXIST;
> +       for (i = 0; i < bpf_mprog_max(); i++) {

Why not just keep track of the actual count in bpf_mprog_bundle? That would
speed up bpf_mprog_last() as well.

> +               bpf_mprog_read(entry, i, &fp, &cp);
> +               oprog = READ_ONCE(fp->prog);
> +               if (!oprog)
> +                       break;
> +               if (oprog != rtuple->prog)
> +                       continue;
> +               if (cp->link != ntuple->link)
> +                       return -EBUSY;
> +               iflags = cp->flags;
> +               if ((iflags & BPF_F_FIRST) !=
> +                   (rflags & BPF_F_FIRST)) {
> +                       iflags = bpf_mprog_flags(iflags, rflags,
> +                                                BPF_F_FIRST);
> +                       if ((iflags & BPF_F_FIRST) &&
> +                           rtuple->prog != bpf_mprog_first(entry))
> +                               return -EACCES;
> +               }
> +               if ((iflags & BPF_F_LAST) !=
> +                   (rflags & BPF_F_LAST)) {
> +                       iflags = bpf_mprog_flags(iflags, rflags,
> +                                                BPF_F_LAST);
> +                       if ((iflags & BPF_F_LAST) &&
> +                           rtuple->prog != bpf_mprog_last(entry))
> +                               return -EACCES;
> +               }
> +               bpf_mprog_write(fp, cp, ntuple, iflags);
> +               if (!ntuple->link)
> +                       bpf_prog_put(oprog);
> +               return 0;
> +       }
> +       return -ENOENT;
> +}
> +

[...]

> +
> +int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> +                    struct bpf_link *link, u32 flags, u32 object,
> +                    u32 expected_revision)
> +{
> +       struct bpf_tuple rtuple, ntuple = {
> +               .prog = prog,
> +               .link = link,
> +       };
> +       int ret;
> +
> +       if (expected_revision &&
> +           expected_revision != bpf_mprog_revision(entry))
> +               return -ESTALE;
> +       ret = bpf_mprog_tuple_relative(&rtuple, object, flags, prog->type);
> +       if (ret)
> +               return ret;
> +       if (flags & BPF_F_REPLACE)
> +               ret = bpf_mprog_replace(entry, &ntuple, &rtuple, flags);
> +       else if (flags & (BPF_F_FIRST | BPF_F_LAST))
> +               ret = bpf_mprog_head_tail(entry, &ntuple, &rtuple, flags);
> +       else
> +               ret = bpf_mprog_add(entry, &ntuple, &rtuple, flags);
> +       bpf_mprog_tuple_put(&rtuple);
> +       return ret;
> +}

We chatted about this a bit offline, but let me expand on it in writing. I
still find the need to have the mprog_add, mprog_head_tail and mprog_replace
variants a bit unnecessary, especially since each of them ends up
reimplementing all these BEFORE/AFTER and FIRST/LAST constraints.

I was wondering whether it would be more straightforward to evaluate the
target position for the prog/link that needs to be replaced/inserted/deleted
for each "rule" independently, and, if all rules agree on that position or
existing element, then enact the replacement/addition/deletion.
Something like below, in very pseudo-code-like manner:

int idx = -1, tidx;

if (flag & BPF_F_REPLACE) {
    tidx = mprog_pos_exact(object, flags);
    if (tidx < 0 || (idx >= 0 && tidx != idx))
        return -EINVAL;
    idx = tidx;
}
if (flag & BPF_F_BEFORE) {
    tidx = mprog_pos_before(object, flags);
    if (tidx < 0 || (idx >= 0 && tidx != idx))
        return -EINVAL;
    idx = tidx;
}
if (flag & BPF_F_AFTER) {
    tidx = mprog_pos_after(object, flags);
    if (tidx < 0 || (idx >= 0 && tidx != idx))
        return -EINVAL;
    idx = tidx;
}
if (flag & BPF_F_FIRST) {
    if (idx >= 0 && idx != 0)
        return -EINVAL;
    idx = 0;
    if (idx < bpf_mprog_cnt() && (prog_flag_at(idx) & BPF_F_FIRST))
        return -EBUSY;
}
if (flag & BPF_F_LAST) {
    if (idx >= 0 && idx != bpf_mprog_cnt())
        return -EINVAL;
    idx = bpf_mprog_cnt();
    if (bpf_mprog_cnt() > 0 && (prog_flag_at(bpf_mprog_cnt() - 1) & BPF_F_LAST))
        return -EBUSY;
}

if (flag & BPF_F_REPLACE)
   replace_in_place(idx, flags & (BPF_F_FIRST | BPF_F_LAST));
else if (deleting)
   delete_at_pos(idx);
else /* add new element */
   insert_at_pos(idx, flags & (BPF_F_FIRST | BPF_F_LAST));

Each of those mprog_pos_{exact,before,after} should be trivial to implement
(they just find the position that satisfies the respective condition). idx
then means either the position at which we need to insert the new element
(so everything at that position and to the right gets shifted), or, for a
replacement/deletion where we expect to find an existing prog/link, the
position of that matching element.
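
For instance, a rough sketch of the exact-match variant against the helpers
already visible in this patch (bpf_mprog_read()/bpf_mprog_max()/READ_ONCE());
the function name and signature here are mine, just for illustration:

static int mprog_pos_exact(struct bpf_mprog_entry *entry, struct bpf_prog *prog)
{
        struct bpf_mprog_fp *fp;
        struct bpf_mprog_cp *cp;
        struct bpf_prog *tmp;
        int i;

        for (i = 0; i < bpf_mprog_max(); i++) {
                bpf_mprog_read(entry, i, &fp, &cp);
                tmp = READ_ONCE(fp->prog);
                if (!tmp)
                        break;
                if (tmp == prog)
                        return i;
        }
        return -ENOENT;
}

mprog_pos_before()/mprog_pos_after() would be the same loop, just returning a
position relative to the match.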

I guess REPLACE and BEFORE/AFTER are currently fundamentally incompatible
because they reuse the same field to specify the FD/ID, so we'd have to check
that both are not specified at the same time. Or we could choose to have a
separate replace_fd and relative_fd/id. Then you could even express "replace
prog if it's before another prog X", similar to how you can express
REPLACE + FIRST (replace if it's first).

You mentioned the actual implementation gets hard, so I'm curious which parts
become convoluted with such an approach in the actual implementation?


> +
> +int bpf_mprog_detach(struct bpf_mprog_entry *entry, struct bpf_prog *prog,
> +                    struct bpf_link *link, u32 flags, u32 object,
> +                    u32 expected_revision)
> +{
> +       struct bpf_tuple rtuple, dtuple = {
> +               .prog = prog,
> +               .link = link,
> +       };
> +       int ret;
> +
> +       if (expected_revision &&
> +           expected_revision != bpf_mprog_revision(entry))
> +               return -ESTALE;
> +       ret = bpf_mprog_tuple_relative(&rtuple, object, flags,
> +                                      prog ? prog->type :
> +                                      BPF_PROG_TYPE_UNSPEC);
> +       if (ret)
> +               return ret;
> +       ret = bpf_mprog_del(entry, &dtuple, &rtuple, flags);
> +       bpf_mprog_tuple_put(&rtuple);
> +       return ret;
> +}
> +



[...]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-08 17:23   ` Stanislav Fomichev
@ 2023-06-08 20:59     ` Andrii Nakryiko
  2023-06-08 21:52       ` Stanislav Fomichev
  0 siblings, 1 reply; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-08 20:59 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Daniel Borkmann, ast, andrii, martin.lau, razor, john.fastabend,
	kuba, dxu, joe, toke, davem, bpf, netdev

On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 06/07, Daniel Borkmann wrote:
> > This adds a generic layer called bpf_mprog which can be reused by different
> > attachment layers to enable multi-program attachment and dependency resolution.
> > In-kernel users of the bpf_mprog don't need to care about the dependency
> > resolution internals, they can just consume it with few API calls.
> >
> > The initial idea of having a generic API sparked out of discussion [0] from an
> > earlier revision of this work where tc's priority was reused and exposed via
> > BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> > as-is for classic tc BPF. The feedback was that priority provides a bad user
> > experience and is hard to use [1], e.g.:
> >
> >   I cannot help but feel that priority logic copy-paste from old tc, netfilter
> >   and friends is done because "that's how things were done in the past". [...]
> >   Priority gets exposed everywhere in uapi all the way to bpftool when it's
> >   right there for users to understand. And that's the main problem with it.
> >
> >   The user don't want to and don't need to be aware of it, but uapi forces them
> >   to pick the priority. [...] Your cover letter [0] example proves that in
> >   real life different service pick the same priority. They simply don't know
> >   any better. Priority is an unnecessary magic that apps _have_ to pick, so
> >   they just copy-paste and everyone ends up using the same.
> >
> > The course of the discussion showed more and more the need for a generic,
> > reusable API where the "same look and feel" can be applied for various other
> > program types beyond just tc BPF, for example XDP today does not have multi-
> > program support in kernel, but also there was interest around this API for
> > improving management of cgroup program types. Such common multi-program
> > management concept is useful for BPF management daemons or user space BPF
> > applications coordinating about their attachments.
> >
> > Both from Cilium and Meta side [2], we've collected the following requirements
> > for a generic attach/detach/query API for multi-progs which has been implemented
> > as part of this work:
> >
> >   - Support prog-based attach/detach and link API
> >   - Dependency directives (can also be combined):
> >     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> >       - BPF_F_ID flag as {fd,id} toggle
> >       - BPF_F_LINK flag as {prog,link} toggle
> >       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> >         BPF_F_AFTER will just append for the case of attaching
> >       - Enforced only at attach time
> >     - BPF_F_{FIRST,LAST}
> >       - Enforced throughout the bpf_mprog state's lifetime
> >       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
> >   - Internal revision counter and optionally being able to pass expected_revision
> >   - User space daemon can query current state with revision, and pass it along
> >     for attachment to assert current state before doing updates
> >   - Query also gets extension for link_ids array and link_attach_flags:
> >     - prog_ids are always filled with program IDs
> >     - link_ids are filled with link IDs when link was used, otherwise 0
> >     - {prog,link}_attach_flags for holding {prog,link}-specific flags
> >   - Must be easy to integrate/reuse for in-kernel users
> >
> > The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> > consisting of the additions of the attachment flags, revision counter, and
> > expanding existing union with relative_{fd,id} member.
> >
> > The bpf_mprog framework consists of an bpf_mprog_entry object which holds
> > an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> > structure). Both have been separated, so that fast-path gets efficient packing
> > of bpf_prog pointers for maximum cache efficieny. Also, array has been chosen
> > instead of linked list or other structures to remove unnecessary indirections
> > for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> > via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> > is populated and then just swapped which avoids additional allocations that
> > could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> > currently static, but they could be converted to dynamic allocation if necessary
> > at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
> > for example, in case of tcx which uses this API in the next patch, it piggy-
> > backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> > add,del} implementation and an extensive test suite for checking all aspects
> > of this API for prog-based attach/detach and link API as BPF selftests in
> > this series.
> >
> > Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
> >
> >   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
> >   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
> >   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> >
> > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > ---
> >  MAINTAINERS                    |   1 +
> >  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
> >  include/uapi/linux/bpf.h       |  37 ++-
> >  kernel/bpf/Makefile            |   2 +-
> >  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  37 ++-
> >  6 files changed, 781 insertions(+), 17 deletions(-)
> >  create mode 100644 include/linux/bpf_mprog.h
> >  create mode 100644 kernel/bpf/mprog.c
> >

[...]

> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index a7b5e91dd768..207f8a37b327 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> >   */
> >  #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> >  #define BPF_F_ALLOW_MULTI    (1U << 1)
> > +/* Generic attachment flags. */
> >  #define BPF_F_REPLACE                (1U << 2)
> > +#define BPF_F_BEFORE         (1U << 3)
> > +#define BPF_F_AFTER          (1U << 4)
>
> [..]
>
> > +#define BPF_F_FIRST          (1U << 5)
> > +#define BPF_F_LAST           (1U << 6)
>
> I'm still not sure whether the hard semantics of first/last is really
> useful. My worry is that some prog will just use BPF_F_FIRST which
> would prevent the rest of the users.. (starting with only
> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> need first/laste).

Without FIRST/LAST some scenarios cannot be guaranteed to be safely
implemented. E.g., if I have some hard audit requirements and I need
to guarantee that my program runs first and observes each event, I'll
enforce BPF_F_FIRST when attaching it. And if that attachment fails,
then server setup is broken and my application cannot function.

In a setup where we expect multiple applications to co-exist, it
should be a rule that no one is using FIRST/LAST (unless it's
absolutely required). And if someone doesn't comply, then that's a bug
and has to be reported to application owners.

But it's not up to the kernel to enforce this cooperation by disallowing
FIRST/LAST semantics, because those semantics are critical for some
applications, IMO.
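
To illustrate, a rough sketch of how such an audit application could express
that requirement with the opts API added later in this series (the helper name
is made up, flags/fields as in patch 3):

static int attach_audit_prog(int prog_fd, int ifindex)
{
        LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                .flags = BPF_F_FIRST,
        );

        /* hard failure if someone else already claimed the first slot */
        return bpf_prog_attach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &opts);
}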

>
> But if everyone besides myself is on board with first/last, maybe at least
> put a comment here saying that only a single program can be first/last?
> And the users are advised not to use these unless they really really really
> need to be first/last. (IOW, feels like first/last should be reserved
> for observability tools/etc).

+1, we can definitely make it clear in the API that this will prevent anyone
else from being attached as FIRST/LAST, so it's not cooperative in nature and
has to be very consciously evaluated.

>
> > +#define BPF_F_ID             (1U << 7)
> > +#define BPF_F_LINK           BPF_F_LINK /* 1 << 13 */
> >
> >  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> >   * verifier will perform strict alignment checking as if the kernel

[...]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-07 19:26 ` [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support Daniel Borkmann
  2023-06-08  1:25   ` Jamal Hadi Salim
  2023-06-08 17:50   ` Stanislav Fomichev
@ 2023-06-08 21:20   ` Andrii Nakryiko
  2023-06-09  3:06   ` Jakub Kicinski
  3 siblings, 0 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-08 21:20 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu,
	joe, toke, davem, bpf, netdev

On Wed, Jun 7, 2023 at 12:27 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This work refactors and adds a lightweight extension ("tcx") to the tc BPF
> ingress and egress data path side for allowing BPF program management based
> on fds via bpf() syscall through the newly added generic multi-prog API.
> The main goal behind this work which we also presented at LPC [0] last year
> and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
> BPF link functionality for tc BPF programs, which allows for a model of safe
> ownership and program detachment.
>
> Given the rise in tc BPF users in cloud native environments, this becomes
> necessary to avoid hard to debug incidents either through stale leftover
> programs or 3rd party applications accidentally stepping on each others toes.
> As a recap, a BPF link represents the attachment of a BPF program to a BPF
> hook point. The BPF link holds a single reference to keep BPF program alive.
> Moreover, hook points do not reference a BPF link, only the application's
> fd or pinning does. A BPF link holds meta-data specific to attachment and
> implements operations for link creation, (atomic) BPF program update,
> detachment and introspection. The motivation for BPF links for tc BPF programs
> is multi-fold, for example:
>
>   - From Meta: "It's especially important for applications that are deployed
>     fleet-wide and that don't "control" hosts they are deployed to. If such
>     application crashes and no one notices and does anything about that, BPF
>     program will keep running draining resources or even just, say, dropping
>     packets. We at FB had outages due to such permanent BPF attachment
>     semantics. With fd-based BPF link we are getting a framework, which allows
>     safe, auto-detachable behavior by default, unless application explicitly
>     opts in by pinning the BPF link." [1]
>
>   - From Cilium-side the tc BPF programs we attach to host-facing veth devices
>     and phys devices build the core datapath for Kubernetes Pods, and they
>     implement forwarding, load-balancing, policy, EDT-management, etc, within
>     BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
>     experienced hard-to-debug issues in a user's staging environment where
>     another Kubernetes application using tc BPF attached to the same prio/handle
>     of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
>     it. The goal is to establish a clear/safe ownership model via links which
>     cannot accidentally be overridden. [0,2]
>
> BPF links for tc can co-exist with non-link attachments, and the semantics are
> in line also with XDP links: BPF links cannot replace other BPF links, BPF
> links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
> lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
> would solve mentioned issue of safe ownership model as 3rd party applications
> would not be able to accidentally wipe Cilium programs, even if they are not
> BPF link aware.
>
> Earlier attempts [4] have tried to integrate BPF links into core tc machinery
> to solve cls_bpf, which has been intrusive to the generic tc kernel API with
> extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
> be wiped from the qdisc also. Locking a tc BPF program in place this way, is
> getting into layering hacks given the two object models are vastly different.
>
> We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
> attach API, so that the BPF link implementation blends in naturally similar to
> other link types which are fd-based and without the need for changing core tc
> internal APIs. BPF programs for tc can then be successively migrated from classic
> cls_bpf to the new tc BPF link without needing to change the program's source
> code, just the BPF loader mechanics for attaching is sufficient.
>
> For the current tc framework, there is no change in behavior with this change
> and neither does this change touch on tc core kernel APIs. The gist of this
> patch is that the ingress and egress hook have a lightweight, qdisc-less
> extension for BPF to attach its tc BPF programs, in other words, a minimal
> entry point for tc BPF. The name tcx has been suggested from discussion of
> earlier revisions of this work as a good fit, and to more easily differ between
> the classic cls_bpf attachment and the fd-based one.
>
> For the ingress and egress tcx points, the device holds a cache-friendly array
> with program pointers which is separated from control plane (slow-path) data.
> Earlier versions of this work used priority to determine ordering and expression
> of dependencies similar as with classic tc, but it was challenged that for
> something more future-proof a better user experience is required. Hence this
> resulted in the design and development of the generic attach/detach/query API
> for multi-progs. See prior patch with its discussion on the API design. tcx is
> the first user and later we plan to integrate also others, for example, one
> candidate is multi-prog support for XDP which would benefit and have the same
> 'look and feel' from API perspective.
>
> The goal with tcx is to have maximum compatibility to existing tc BPF programs,
> so they don't need to be rewritten specifically. Compatibility to call into
> classic tcf_classify() is also provided in order to allow successive migration
> or both to cleanly co-exist where needed given its all one logical tc layer.
> tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
> to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
> The fd-based API is behind a static key, so that when unused the code is also
> not entered. The struct tcx_entry's program array is currently static, but
> could be made dynamic if necessary at a point in future. The a/b pair swap
> design has been chosen so that for detachment there are no allocations which
> otherwise could fail. The work has been tested with tc-testing selftest suite
> which all passes, as well as the tc BPF tests from the BPF CI, and also with
> Cilium's L4LB.
>
> Kudos also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
> of this work.
>
>   [0] https://lpc.events/event/16/contributions/1353/
>   [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com/
>   [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
>   [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>   [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com/
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/netdevice.h      |  15 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/tcx.h              | 157 +++++++++++++++
>  include/uapi/linux/bpf.h       |  35 +++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/syscall.c           |  95 +++++++--
>  kernel/bpf/tcx.c               | 347 +++++++++++++++++++++++++++++++++
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 267 +++++++++++++++----------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  45 ++++-
>  tools/include/uapi/linux/bpf.h |  35 +++-
>  16 files changed, 877 insertions(+), 144 deletions(-)
>  create mode 100644 include/net/tcx.h
>  create mode 100644 kernel/bpf/tcx.c
>

[...]

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 207f8a37b327..e7584e24bc83 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1035,6 +1035,8 @@ enum bpf_attach_type {
>         BPF_TRACE_KPROBE_MULTI,
>         BPF_LSM_CGROUP,
>         BPF_STRUCT_OPS,
> +       BPF_TCX_INGRESS,
> +       BPF_TCX_EGRESS,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1052,7 +1054,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_KPROBE_MULTI = 8,
>         BPF_LINK_TYPE_STRUCT_OPS = 9,
>         BPF_LINK_TYPE_NETFILTER = 10,
> -
> +       BPF_LINK_TYPE_TCX = 11,
>         MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1559,13 +1561,13 @@ union bpf_attr {
>                         __u32           map_fd;         /* struct_ops to attach */
>                 };
>                 union {
> -                       __u32           target_fd;      /* object to attach to */
> -                       __u32           target_ifindex; /* target ifindex */
> +                       __u32   target_fd;      /* target object to attach to or ... */
> +                       __u32   target_ifindex; /* target ifindex */
>                 };
>                 __u32           attach_type;    /* attach type */
>                 __u32           flags;          /* extra flags */
>                 union {
> -                       __u32           target_btf_id;  /* btf_id of target to attach to */
> +                       __u32   target_btf_id;  /* btf_id of target to attach to */

nit: should this part be in patch 1?

>                         struct {
>                                 __aligned_u64   iter_info;      /* extra bpf_iter_link_info */
>                                 __u32           iter_info_len;  /* iter_info length */
> @@ -1599,6 +1601,13 @@ union bpf_attr {
>                                 __s32           priority;
>                                 __u32           flags;
>                         } netfilter;
> +                       struct {
> +                               union {
> +                                       __u32   relative_fd;
> +                                       __u32   relative_id;
> +                               };
> +                               __u32           expected_revision;
> +                       } tcx;
>                 };
>         } link_create;
>

[...]

> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_link_primer link_primer;
> +       struct net_device *dev;
> +       struct tcx_link *link;
> +       int fd, err;
> +
> +       dev = dev_get_by_index(net, attr->link_create.target_ifindex);
> +       if (!dev)
> +               return -EINVAL;
> +       link = kzalloc(sizeof(*link), GFP_USER);
> +       if (!link) {
> +               err = -ENOMEM;
> +               goto out_put;
> +       }
> +
> +       bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
> +       link->location = attr->link_create.attach_type;
> +       link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST);
> +       link->dev = dev;
> +
> +       err = bpf_link_prime(&link->link, &link_primer);
> +       if (err) {
> +               kfree(link);
> +               goto out_put;
> +       }
> +       rtnl_lock();
> +       err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
> +                                  attr->link_create.tcx.relative_fd,
> +                                  attr->link_create.tcx.expected_revision);
> +       if (!err)
> +               fd = bpf_link_settle(&link_primer);

Why the early settle? It makes the error handling logic more convoluted.
Maybe leave link->dev as is and let bpf_link_cleanup() handle dev_put(dev)?
Can it be just:

err = tcx_link_prog_attach(...);

rtnl_unlock();

if (err) {
    link->dev = NULL;
    bpf_link_cleanup(&link_primer);
    goto out_put;
}

dev_put(dev);
return bpf_link_settle(&link_primer);

?

> +       rtnl_unlock();
> +       if (err) {
> +               link->dev = NULL;
> +               bpf_link_cleanup(&link_primer);
> +               goto out_put;
> +       }
> +       dev_put(dev);
> +       return fd;
> +out_put:
> +       dev_put(dev);
> +       return err;
> +}

[...]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-08 19:46       ` Jamal Hadi Salim
@ 2023-06-08 21:24         ` Andrii Nakryiko
  2023-07-04 21:36           ` Jamal Hadi Salim
  0 siblings, 1 reply; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-08 21:24 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Daniel Borkmann, ast, andrii, martin.lau, razor, sdf,
	john.fastabend, kuba, dxu, joe, toke, davem, bpf, netdev

On Thu, Jun 8, 2023 at 12:46 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> Hi Daniel,
>
> On Thu, Jun 8, 2023 at 6:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > Hi Jamal,
> >
> > On 6/8/23 3:25 AM, Jamal Hadi Salim wrote:
> > [...]
> > > A general question (which i think i asked last time as well): who
> > > decides what comes after/before what prog in this setup? And would
> > > that same entity not have been able to make the same decision using tc
> > > priorities?
> >
> > Back in the first version of the series I initially coded up this option
> > that the tc_run() would basically be a fake 'bpf_prog' and it would have,
> > say, fixed prio 1000. It would get executed via tcx_run() when iterating
> > via bpf_mprog_foreach_prog() where bpf_prog_run() is called, and then users
> > could pick for native BPF prio before or after that. But then the feedback
> > was that sticking to prio is a bad user experience which led to the
> > development of what is in patch 1 of this series (see the details there).
> >
>
> Thanks. I read the commit message in patch 1 and followed the thread
> back including some of the discussion we had and i am still
> disagreeing that this couldnt be solved with a smart priority based
> scheme - but i think we can move on since this is standalone and
> doesnt affect tc.
>
> Daniel - i am still curious in the new scheme of things how would
> cilium vs datadog food fight get resolved without some arbitration
> entity?
>
> > > The idea of protecting programs from being unloaded is very welcome
> > > but feels would have made sense to be a separate patchset (we have
> > > good need for it). Would it be possible to use that feature in tc and
> > > xdp?
> > BPF links are supported for XDP today, just tc BPF is one of the few
> > remainders where it is not the case, hence the work of this series. What
> > XDP lacks today however is multi-prog support. With the bpf_mprog concept
> > that could be addressed with that common/uniform api (and Andrii expressed
> > interest in integrating this also for cgroup progs), so yes, various hook
> > points/program types could benefit from it.
>
> Is there some sample XDP related i could look at?  Let me describe our
> use case: lets say we load an ebpf program foo attached to XDP of a
> netdev  and then something further upstream in the stack is consuming
> the results of that ebpf XDP program. For some reason someone, at some
> point, decides to replace the XDP prog with a different one - and the
> new prog does a very different thing. Could we stop the replacement
> with the link mechanism you describe? i.e the program is still loaded
> but is no longer attached to the netdev.

If you initially attached an XDP program using the BPF link API (the
LINK_CREATE command of the bpf() syscall), then a subsequent attachment to
the same interface (of a new link or of a program with BPF_PROG_ATTACH) will
fail until the current BPF link is detached by closing its last fd.

That is, until we allow multiple attachments of XDP programs to the same
network interface. But even then, no one will be able to accidentally replace
an attached link, unless they have that link FD and replace the underlying
BPF program.
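
For reference, the link-based XDP attach from user space with today's libbpf
looks roughly like this (the pin path is just an example):

static int attach_owned_xdp(struct bpf_program *prog, int ifindex)
{
        struct bpf_link *link;

        link = bpf_program__attach_xdp(prog, ifindex);
        if (!link)
                return -errno;
        /* while this link exists (fd open or pinned), replacing the
         * attachment requires the link fd, as described above
         */
        return bpf_link__pin(link, "/sys/fs/bpf/my_xdp_link");
}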

>
>
> > >> +struct tcx_entry {
> > >> +       struct bpf_mprog_bundle         bundle;
> > >> +       struct mini_Qdisc __rcu         *miniq;
> > >> +};
> > >> +
> > >
> > > Can you please move miniq to the front? From where i sit this looks:
> > > struct tcx_entry {
> > >          struct bpf_mprog_bundle    bundle
> > > __attribute__((__aligned__(64))); /*     0  3264 */
> > >
> > >          /* XXX last struct has 36 bytes of padding */
> > >
> > >          /* --- cacheline 51 boundary (3264 bytes) --- */
> > >          struct mini_Qdisc *        miniq;                /*  3264     8 */
> > >
> > >          /* size: 3328, cachelines: 52, members: 2 */
> > >          /* padding: 56 */
> > >          /* paddings: 1, sum paddings: 36 */
> > >          /* forced alignments: 1 */
> > > } __attribute__((__aligned__(64)));
> > >
> > > That is a _lot_ of cachelines - at the expense of the status quo
> > > clsact/ingress qdiscs which access miniq.
> >
> > Ah yes, I'll fix this up.
>
> Thanks.
>
> cheers,
> jamal
> > Thanks,
> > Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 3/7] libbpf: Add opts-based attach/detach/query API for tcx
  2023-06-07 19:26 ` [PATCH bpf-next v2 3/7] libbpf: Add opts-based attach/detach/query API for tcx Daniel Borkmann
@ 2023-06-08 21:37   ` Andrii Nakryiko
  0 siblings, 0 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-08 21:37 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu,
	joe, toke, davem, bpf, netdev

On Wed, Jun 7, 2023 at 12:26 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Extend libbpf attach opts and add a new detach opts API so this can be used
> to add/remove fd-based tcx BPF programs. The old-style bpf_prog_detach and
> bpf_prog_detach2 APIs are refactored to reuse the detach opts internally.
>
> The bpf_prog_query_opts API got extended to be able to handle the new link_ids,
> link_attach_flags and revision fields.
>
> For concrete usage examples, see the extensive selftests that have been
> developed as part of this series.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/lib/bpf/bpf.c      | 78 ++++++++++++++++++++++------------------
>  tools/lib/bpf/bpf.h      | 54 +++++++++++++++++++++-------
>  tools/lib/bpf/libbpf.c   |  6 ++++
>  tools/lib/bpf/libbpf.map |  1 +
>  4 files changed, 91 insertions(+), 48 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index ed86b37d8024..a3d1b7ebe224 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -629,11 +629,21 @@ int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
>         return bpf_prog_attach_opts(prog_fd, target_fd, type, &opts);
>  }
>
> -int bpf_prog_attach_opts(int prog_fd, int target_fd,
> -                         enum bpf_attach_type type,
> -                         const struct bpf_prog_attach_opts *opts)
> +int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
> +{
> +       return bpf_prog_detach_opts(0, target_fd, type, NULL);
> +}
> +
> +int bpf_prog_detach2(int prog_fd, int target_fd, enum bpf_attach_type type)
>  {
> -       const size_t attr_sz = offsetofend(union bpf_attr, replace_bpf_fd);
> +       return bpf_prog_detach_opts(prog_fd, target_fd, type, NULL);
> +}

Please put these wrappers after bpf_prog_detach_opts(); it will make the diff
cleaner and keep them closer to the full version of bpf_prog_detach_opts().

> +
> +int bpf_prog_attach_opts(int prog_fd, int target,
> +                        enum bpf_attach_type type,
> +                        const struct bpf_prog_attach_opts *opts)
> +{
> +       const size_t attr_sz = offsetofend(union bpf_attr, expected_revision);
>         union bpf_attr attr;
>         int ret;
>

[...]

> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 9aa0ee473754..480c584a6f7f 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -312,22 +312,43 @@ LIBBPF_API int bpf_obj_get(const char *pathname);
>  LIBBPF_API int bpf_obj_get_opts(const char *pathname,
>                                 const struct bpf_obj_get_opts *opts);
>
> -struct bpf_prog_attach_opts {
> -       size_t sz; /* size of this struct for forward/backward compatibility */
> -       unsigned int flags;
> -       int replace_prog_fd;
> -};
> -#define bpf_prog_attach_opts__last_field replace_prog_fd
> -
>  LIBBPF_API int bpf_prog_attach(int prog_fd, int attachable_fd,
>                                enum bpf_attach_type type, unsigned int flags);
> -LIBBPF_API int bpf_prog_attach_opts(int prog_fd, int attachable_fd,
> -                                    enum bpf_attach_type type,
> -                                    const struct bpf_prog_attach_opts *opts);
>  LIBBPF_API int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
>  LIBBPF_API int bpf_prog_detach2(int prog_fd, int attachable_fd,
>                                 enum bpf_attach_type type);
>
> +struct bpf_prog_attach_opts {
> +       size_t sz; /* size of this struct for forward/backward compatibility */
> +       __u32 flags;
> +       union {
> +               int     replace_prog_fd;
> +               int     replace_fd;
> +               int     relative_fd;
> +               __u32   relative_id;
> +       };

I tried to not use unions for such cases in OPTS-based interfaces, see
bpf_link_create(). Let's keep them all as separate fields and then return an
error if, say, both relative_fd and relative_id are specified at the same
time.

It's fine to have replace_prog_fd and replace_fd as a union, as they
are basically just synonyms.
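
I.e., with separate fields, the check inside bpf_prog_attach_opts() could be
as simple as (rough sketch):

        int relative_fd = OPTS_GET(opts, relative_fd, 0);
        __u32 relative_id = OPTS_GET(opts, relative_id, 0);

        if (relative_fd && relative_id)
                /* only one of fd/id may identify the relative object */
                return libbpf_err(-EINVAL);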


> +       __u32 expected_revision;
> +};
> +#define bpf_prog_attach_opts__last_field expected_revision
> +
> +struct bpf_prog_detach_opts {
> +       size_t sz; /* size of this struct for forward/backward compatibility */
> +       __u32 flags;
> +       union {
> +               int     relative_fd;
> +               __u32   relative_id;
> +       };

same as above

> +       __u32 expected_revision;
> +};
> +#define bpf_prog_detach_opts__last_field expected_revision
> +
> +LIBBPF_API int bpf_prog_attach_opts(int prog_fd, int target,

Let's add doc comments to both these APIs, where `target` is explained. Right
now, because it doesn't have an "_fd" suffix, it's not very clear what sort of
value it is (I know why it's no longer target_fd, due to target_ifindex).

> +                                   enum bpf_attach_type type,
> +                                   const struct bpf_prog_attach_opts *opts);
> +LIBBPF_API int bpf_prog_detach_opts(int prog_fd, int target,
> +                                   enum bpf_attach_type type,
> +                                   const struct bpf_prog_detach_opts *opts);
> +
>  union bpf_iter_link_info; /* defined in up-to-date linux/bpf.h */
>  struct bpf_link_create_opts {
>         size_t sz; /* size of this struct for forward/backward compatibility */
> @@ -489,14 +510,21 @@ struct bpf_prog_query_opts {
>         __u32 query_flags;
>         __u32 attach_flags; /* output argument */
>         __u32 *prog_ids;
> -       __u32 prog_cnt; /* input+output argument */
> +       union {
> +               __u32 prog_cnt; /* input+output argument */
> +               __u32 count;
> +       };
>         __u32 *prog_attach_flags;
> +       __u32 *link_ids;
> +       __u32 *link_attach_flags;
> +       __u32 revision;
>  };
> -#define bpf_prog_query_opts__last_field prog_attach_flags
> +#define bpf_prog_query_opts__last_field revision
>
> -LIBBPF_API int bpf_prog_query_opts(int target_fd,
> +LIBBPF_API int bpf_prog_query_opts(int target,

same here for doc comment

>                                    enum bpf_attach_type type,
>                                    struct bpf_prog_query_opts *opts);
> +
>  LIBBPF_API int bpf_prog_query(int target_fd, enum bpf_attach_type type,
>                               __u32 query_flags, __u32 *attach_flags,
>                               __u32 *prog_ids, __u32 *prog_cnt);
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 47632606b06d..b89127471c6a 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -117,6 +117,8 @@ static const char * const attach_type_name[] = {
>         [BPF_PERF_EVENT]                = "perf_event",
>         [BPF_TRACE_KPROBE_MULTI]        = "trace_kprobe_multi",
>         [BPF_STRUCT_OPS]                = "struct_ops",
> +       [BPF_TCX_INGRESS]               = "tcx_ingress",
> +       [BPF_TCX_EGRESS]                = "tcx_egress",
>  };
>
>  static const char * const link_type_name[] = {
> @@ -8669,6 +8671,10 @@ static const struct bpf_sec_def section_defs[] = {
>         SEC_DEF("kretsyscall+",         KPROBE, 0, SEC_NONE, attach_ksyscall),
>         SEC_DEF("usdt+",                KPROBE, 0, SEC_NONE, attach_usdt),
>         SEC_DEF("tc",                   SCHED_CLS, 0, SEC_NONE),
> +       SEC_DEF("tc/ingress",           SCHED_CLS, BPF_TCX_INGRESS, SEC_ATTACHABLE_OPT),
> +       SEC_DEF("tc/egress",            SCHED_CLS, BPF_TCX_EGRESS, SEC_ATTACHABLE_OPT),

For tc/ingress and tc/egress, is it intentional that libbpf should set
expected_attach_type to zero if the kernel doesn't support BPF_TCX_INGRESS
or BPF_TCX_EGRESS? Or is it just an alias for tcx/ingress and tcx/egress?

If it's an alias, why do we need it?

If not, let's replace SEC_ATTACHABLE_OPT with just SEC_EXP_ATTACH_OPT?

> +       SEC_DEF("tcx/ingress",          SCHED_CLS, BPF_TCX_INGRESS, SEC_ATTACHABLE_OPT),
> +       SEC_DEF("tcx/egress",           SCHED_CLS, BPF_TCX_EGRESS, SEC_ATTACHABLE_OPT),

At least for tcx, attach_type is not optional, right? So I'd drop
SEC_ATTACHABLE_OPT.

>         SEC_DEF("classifier",           SCHED_CLS, 0, SEC_NONE),
>         SEC_DEF("action",               SCHED_ACT, 0, SEC_NONE),
>         SEC_DEF("tracepoint+",          TRACEPOINT, 0, SEC_NONE, attach_tp),
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 7521a2fb7626..a29b90e9713c 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -395,4 +395,5 @@ LIBBPF_1.2.0 {
>  LIBBPF_1.3.0 {
>         global:
>                 bpf_obj_pin_opts;
> +               bpf_prog_detach_opts;
>  } LIBBPF_1.2.0;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 4/7] libbpf: Add link-based API for tcx
  2023-06-07 19:26 ` [PATCH bpf-next v2 4/7] libbpf: Add link-based " Daniel Borkmann
@ 2023-06-08 21:45   ` Andrii Nakryiko
  0 siblings, 0 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-08 21:45 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu,
	joe, toke, davem, bpf, netdev

On Wed, Jun 7, 2023 at 12:26 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Implement tcx BPF link support for libbpf.
>
> The bpf_program__attach_fd_opts() API has been refactored slightly in order to
> pass bpf_link_create_opts pointer as input.
>
> A new bpf_program__attach_tcx_opts() has been added on top of this which allows
> for passing all relevant data via extensible struct bpf_tcx_opts.
>
> The program sections tcx/ingress and tcx/egress correspond to the hook locations
> for tc ingress and egress, respectively.
>
> For concrete usage examples, see the extensive selftests that have been
> developed as part of this series.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/lib/bpf/bpf.c      |  5 +++++
>  tools/lib/bpf/bpf.h      |  7 +++++++
>  tools/lib/bpf/libbpf.c   | 44 +++++++++++++++++++++++++++++++++++-----
>  tools/lib/bpf/libbpf.h   | 17 ++++++++++++++++
>  tools/lib/bpf/libbpf.map |  1 +
>  5 files changed, 69 insertions(+), 5 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index a3d1b7ebe224..c340d3cbc6bd 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -746,6 +746,11 @@ int bpf_link_create(int prog_fd, int target_fd,
>                 if (!OPTS_ZEROED(opts, tracing))
>                         return libbpf_err(-EINVAL);
>                 break;
> +       case BPF_TCX_INGRESS:
> +       case BPF_TCX_EGRESS:
> +               attr.link_create.tcx.relative_fd = OPTS_GET(opts, tcx.relative_fd, 0);
> +               attr.link_create.tcx.expected_revision = OPTS_GET(opts, tcx.expected_revision, 0);

Can you also add an OPTS_ZEROED check, like for the other link types?

> +               break;
>         default:
>                 if (!OPTS_ZEROED(opts, flags))
>                         return libbpf_err(-EINVAL);
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 480c584a6f7f..12591516dca0 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -370,6 +370,13 @@ struct bpf_link_create_opts {
>                 struct {
>                         __u64 cookie;
>                 } tracing;
> +               struct {
> +                       union {
> +                               __u32 relative_fd;
> +                               __u32 relative_id;
> +                       };

Same comment about the union: let's not add it and have two separate fields.


> +                       __u32 expected_revision;
> +               } tcx;
>         };
>         size_t :0;
>  };
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index b89127471c6a..d7b6ff49f02e 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -133,6 +133,7 @@ static const char * const link_type_name[] = {
>         [BPF_LINK_TYPE_KPROBE_MULTI]            = "kprobe_multi",
>         [BPF_LINK_TYPE_STRUCT_OPS]              = "struct_ops",
>         [BPF_LINK_TYPE_NETFILTER]               = "netfilter",
> +       [BPF_LINK_TYPE_TCX]                     = "tcx",
>  };
>
>  static const char * const map_type_name[] = {
> @@ -11685,11 +11686,10 @@ static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_li
>  }
>
>  static struct bpf_link *
> -bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
> -                      const char *target_name)
> +bpf_program__attach_fd_opts(const struct bpf_program *prog,
> +                           const struct bpf_link_create_opts *opts,
> +                           int target_fd, const char *target_name)

nit: please keep opts as the last argument

>  {
> -       DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
> -                           .target_btf_id = btf_id);
>         enum bpf_attach_type attach_type;
>         char errmsg[STRERR_BUFSIZE];
>         struct bpf_link *link;
> @@ -11707,7 +11707,7 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
>         link->detach = &bpf_link__detach_fd;
>
>         attach_type = bpf_program__expected_attach_type(prog);
> -       link_fd = bpf_link_create(prog_fd, target_fd, attach_type, &opts);
> +       link_fd = bpf_link_create(prog_fd, target_fd, attach_type, opts);
>         if (link_fd < 0) {
>                 link_fd = -errno;
>                 free(link);
> @@ -11720,6 +11720,17 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
>         return link;
>  }
>
> +static struct bpf_link *
> +bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
> +                      const char *target_name)
> +{
> +       LIBBPF_OPTS(bpf_link_create_opts, opts,
> +               .target_btf_id = btf_id,
> +       );
> +
> +       return bpf_program__attach_fd_opts(prog, &opts, target_fd, target_name);

It seems like the only user of btf_id is bpf_program__attach_freplace, so I'd
just inline this there, and for the other 4 cases let's just pass NULL as
options?

That means we don't really need bpf_program__attach_fd_opts() and can just
add opts to bpf_program__attach_fd(). We'll have a shorter name. BTW, given
it's not an exposed API, let's drop the double underscore and call it just
bpf_program_attach_fd()?

> +}
> +
>  struct bpf_link *
>  bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd)
>  {
> @@ -11738,6 +11749,29 @@ struct bpf_link *bpf_program__attach_xdp(const struct bpf_program *prog, int ifi
>         return bpf_program__attach_fd(prog, ifindex, 0, "xdp");
>  }
>
> +struct bpf_link *
> +bpf_program__attach_tcx_opts(const struct bpf_program *prog,
> +                            const struct bpf_tcx_opts *opts)

We don't have a non-opts variant, so let's keep the name short (like we did
with bpf_program__attach_netlink): bpf_program__attach_tcx().

> +{
> +       LIBBPF_OPTS(bpf_link_create_opts, link_create_opts);
> +       int ifindex = OPTS_GET(opts, ifindex, 0);

Let's not do OPTS_GET before we've checked OPTS_VALID.

> +
> +       if (!OPTS_VALID(opts, bpf_tcx_opts))
> +               return libbpf_err_ptr(-EINVAL);
> +       if (!ifindex) {
> +               pr_warn("prog '%s': target netdevice ifindex cannot be zero\n",
> +                       prog->name);
> +               return libbpf_err_ptr(-EINVAL);
> +       }
> +
> +       link_create_opts.tcx.expected_revision = OPTS_GET(opts, expected_revision, 0);
> +       link_create_opts.tcx.relative_fd = OPTS_GET(opts, relative_fd, 0);
> +       link_create_opts.flags = OPTS_GET(opts, flags, 0);
> +
> +       /* target_fd/target_ifindex use the same field in LINK_CREATE */
> +       return bpf_program__attach_fd_opts(prog, &link_create_opts, ifindex, "tc");
> +}
> +
>  struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
>                                               int target_fd,
>                                               const char *attach_func_name)
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 754da73c643b..8ffba0f67c60 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -718,6 +718,23 @@ LIBBPF_API struct bpf_link *
>  bpf_program__attach_freplace(const struct bpf_program *prog,
>                              int target_fd, const char *attach_func_name);
>
> +struct bpf_tcx_opts {
> +       /* size of this struct, for forward/backward compatibility */
> +       size_t sz;
> +       int ifindex;
> +       __u32 flags;
> +       union {
> +               __u32 relative_fd;
> +               __u32 relative_id;
> +       };

same thing about not using unions here :)

> +       __u32 expected_revision;

and let's add a `size_t :0;` to prevent the compiler from leaving garbage
values in the padding at the end of the struct (once you drop the union there
will be padding); see the sketch right after the quoted struct below

> +};
> +#define bpf_tcx_opts__last_field expected_revision
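
With both of the above applied (no union, trailing anonymous bitfield), the
struct could then look something like this; same fields as in the patch, just
laid out separately:

struct bpf_tcx_opts {
        /* size of this struct, for forward/backward compatibility */
        size_t sz;
        int ifindex;
        __u32 flags;
        __u32 relative_fd;
        __u32 relative_id;
        __u32 expected_revision;
        size_t :0;
};
#define bpf_tcx_opts__last_field expected_revision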
> +
> +LIBBPF_API struct bpf_link *
> +bpf_program__attach_tcx_opts(const struct bpf_program *prog,
> +                            const struct bpf_tcx_opts *opts);
> +
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index a29b90e9713c..f66b714512c2 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -396,4 +396,5 @@ LIBBPF_1.3.0 {
>         global:
>                 bpf_obj_pin_opts;
>                 bpf_prog_detach_opts;
> +               bpf_program__attach_tcx_opts;
>  } LIBBPF_1.2.0;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-08 20:59     ` Andrii Nakryiko
@ 2023-06-08 21:52       ` Stanislav Fomichev
  2023-06-08 22:13         ` Andrii Nakryiko
  0 siblings, 1 reply; 49+ messages in thread
From: Stanislav Fomichev @ 2023-06-08 21:52 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Daniel Borkmann, ast, andrii, martin.lau, razor, john.fastabend,
	kuba, dxu, joe, toke, davem, bpf, netdev

On 06/08, Andrii Nakryiko wrote:
> On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 06/07, Daniel Borkmann wrote:
> > > This adds a generic layer called bpf_mprog which can be reused by different
> > > attachment layers to enable multi-program attachment and dependency resolution.
> > > In-kernel users of the bpf_mprog don't need to care about the dependency
> > > resolution internals, they can just consume it with few API calls.
> > >
> > > The initial idea of having a generic API sparked out of discussion [0] from an
> > > earlier revision of this work where tc's priority was reused and exposed via
> > > BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> > > as-is for classic tc BPF. The feedback was that priority provides a bad user
> > > experience and is hard to use [1], e.g.:
> > >
> > >   I cannot help but feel that priority logic copy-paste from old tc, netfilter
> > >   and friends is done because "that's how things were done in the past". [...]
> > >   Priority gets exposed everywhere in uapi all the way to bpftool when it's
> > >   right there for users to understand. And that's the main problem with it.
> > >
> > >   The user don't want to and don't need to be aware of it, but uapi forces them
> > >   to pick the priority. [...] Your cover letter [0] example proves that in
> > >   real life different service pick the same priority. They simply don't know
> > >   any better. Priority is an unnecessary magic that apps _have_ to pick, so
> > >   they just copy-paste and everyone ends up using the same.
> > >
> > > The course of the discussion showed more and more the need for a generic,
> > > reusable API where the "same look and feel" can be applied for various other
> > > program types beyond just tc BPF, for example XDP today does not have multi-
> > > program support in kernel, but also there was interest around this API for
> > > improving management of cgroup program types. Such common multi-program
> > > management concept is useful for BPF management daemons or user space BPF
> > > applications coordinating about their attachments.
> > >
> > > Both from Cilium and Meta side [2], we've collected the following requirements
> > > for a generic attach/detach/query API for multi-progs which has been implemented
> > > as part of this work:
> > >
> > >   - Support prog-based attach/detach and link API
> > >   - Dependency directives (can also be combined):
> > >     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> > >       - BPF_F_ID flag as {fd,id} toggle
> > >       - BPF_F_LINK flag as {prog,link} toggle
> > >       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> > >         BPF_F_AFTER will just append for the case of attaching
> > >       - Enforced only at attach time
> > >     - BPF_F_{FIRST,LAST}
> > >       - Enforced throughout the bpf_mprog state's lifetime
> > >       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
> > >   - Internal revision counter and optionally being able to pass expected_revision
> > >   - User space daemon can query current state with revision, and pass it along
> > >     for attachment to assert current state before doing updates
> > >   - Query also gets extension for link_ids array and link_attach_flags:
> > >     - prog_ids are always filled with program IDs
> > >     - link_ids are filled with link IDs when link was used, otherwise 0
> > >     - {prog,link}_attach_flags for holding {prog,link}-specific flags
> > >   - Must be easy to integrate/reuse for in-kernel users
> > >
> > > The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> > > consisting of the additions of the attachment flags, revision counter, and
> > > expanding existing union with relative_{fd,id} member.
> > >
> > > The bpf_mprog framework consists of an bpf_mprog_entry object which holds
> > > an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> > > structure). Both have been separated, so that fast-path gets efficient packing
> > > of bpf_prog pointers for maximum cache efficieny. Also, array has been chosen
> > > instead of linked list or other structures to remove unnecessary indirections
> > > for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> > > via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> > > is populated and then just swapped which avoids additional allocations that
> > > could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> > > currently static, but they could be converted to dynamic allocation if necessary
> > > at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
> > > for example, in case of tcx which uses this API in the next patch, it piggy-
> > > backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> > > add,del} implementation and an extensive test suite for checking all aspects
> > > of this API for prog-based attach/detach and link API as BPF selftests in
> > > this series.
> > >
> > > Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
> > >
> > >   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
> > >   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
> > >   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> > >
> > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > > ---
> > >  MAINTAINERS                    |   1 +
> > >  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
> > >  include/uapi/linux/bpf.h       |  37 ++-
> > >  kernel/bpf/Makefile            |   2 +-
> > >  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h |  37 ++-
> > >  6 files changed, 781 insertions(+), 17 deletions(-)
> > >  create mode 100644 include/linux/bpf_mprog.h
> > >  create mode 100644 kernel/bpf/mprog.c
> > >
> 
> [...]
> 
> > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > index a7b5e91dd768..207f8a37b327 100644
> > > --- a/tools/include/uapi/linux/bpf.h
> > > +++ b/tools/include/uapi/linux/bpf.h
> > > @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> > >   */
> > >  #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> > >  #define BPF_F_ALLOW_MULTI    (1U << 1)
> > > +/* Generic attachment flags. */
> > >  #define BPF_F_REPLACE                (1U << 2)
> > > +#define BPF_F_BEFORE         (1U << 3)
> > > +#define BPF_F_AFTER          (1U << 4)
> >
> > [..]
> >
> > > +#define BPF_F_FIRST          (1U << 5)
> > > +#define BPF_F_LAST           (1U << 6)
> >
> > I'm still not sure whether the hard semantics of first/last is really
> > useful. My worry is that some prog will just use BPF_F_FIRST which
> > would prevent the rest of the users.. (starting with only
> > F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> > need first/last).
> 
> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
> implemented. E.g., if I have some hard audit requirements and I need
> to guarantee that my program runs first and observes each event, I'll
> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
> then server setup is broken and my application cannot function.
> 
> In a setup where we expect multiple applications to co-exist, it
> should be a rule that no one is using FIRST/LAST (unless it's
> absolutely required). And if someone doesn't comply, then that's a bug
> and has to be reported to application owners.
> 
> But it's not up to the kernel to enforce this cooperation by
> disallowing FIRST/LAST semantics, because that semantics is critical
> for some applications, IMO.

Maybe that's something that should be done by some other mechanism?
(and as a follow up, if needed) Something akin to what Toke
mentioned with another program doing sorting or similar.

Otherwise, those first/last are just plain simple old priority bands;
only we have two now, not u16.

I'm mostly coming from the observability point: imagine I have my fancy
tc_ingress_tcpdump program that I want to attach as a first program to debug
some issue, but it won't work because there is already a 'first' program
installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?

> > But if everyone besides myself is on board with first/last, maybe at least
> > put a comment here saying that only a single program can be first/last?
> > And the users are advised not to use these unless they really really really
> > need to be first/last. (IOW, feels like first/last should be reserved
> > for observability tools/etc).
> 
> +1, we can definitely make it clear in API that this will prevent
> anyone else from being attached as FIRST/LAST, so it's not cooperative
> in nature and has to be very consciously evaluated.
> 
> >
> > > +#define BPF_F_ID             (1U << 7)
> > > +#define BPF_F_LINK           BPF_F_LINK /* 1 << 13 */
> > >
> > >  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> > >   * verifier will perform strict alignment checking as if the kernel
> 
> [...]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-08 21:52       ` Stanislav Fomichev
@ 2023-06-08 22:13         ` Andrii Nakryiko
  2023-06-08 23:06           ` Stanislav Fomichev
  0 siblings, 1 reply; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-08 22:13 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Daniel Borkmann, ast, andrii, martin.lau, razor, john.fastabend,
	kuba, dxu, joe, toke, davem, bpf, netdev

On Thu, Jun 8, 2023 at 2:52 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 06/08, Andrii Nakryiko wrote:
> > On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 06/07, Daniel Borkmann wrote:
> > > > This adds a generic layer called bpf_mprog which can be reused by different
> > > > attachment layers to enable multi-program attachment and dependency resolution.
> > > > In-kernel users of the bpf_mprog don't need to care about the dependency
> > > > resolution internals, they can just consume it with few API calls.
> > > >
> > > > The initial idea of having a generic API sparked out of discussion [0] from an
> > > > earlier revision of this work where tc's priority was reused and exposed via
> > > > BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> > > > as-is for classic tc BPF. The feedback was that priority provides a bad user
> > > > experience and is hard to use [1], e.g.:
> > > >
> > > >   I cannot help but feel that priority logic copy-paste from old tc, netfilter
> > > >   and friends is done because "that's how things were done in the past". [...]
> > > >   Priority gets exposed everywhere in uapi all the way to bpftool when it's
> > > >   right there for users to understand. And that's the main problem with it.
> > > >
> > > >   The user don't want to and don't need to be aware of it, but uapi forces them
> > > >   to pick the priority. [...] Your cover letter [0] example proves that in
> > > >   real life different service pick the same priority. They simply don't know
> > > >   any better. Priority is an unnecessary magic that apps _have_ to pick, so
> > > >   they just copy-paste and everyone ends up using the same.
> > > >
> > > > The course of the discussion showed more and more the need for a generic,
> > > > reusable API where the "same look and feel" can be applied for various other
> > > > program types beyond just tc BPF, for example XDP today does not have multi-
> > > > program support in kernel, but also there was interest around this API for
> > > > improving management of cgroup program types. Such common multi-program
> > > > management concept is useful for BPF management daemons or user space BPF
> > > > applications coordinating about their attachments.
> > > >
> > > > Both from Cilium and Meta side [2], we've collected the following requirements
> > > > for a generic attach/detach/query API for multi-progs which has been implemented
> > > > as part of this work:
> > > >
> > > >   - Support prog-based attach/detach and link API
> > > >   - Dependency directives (can also be combined):
> > > >     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> > > >       - BPF_F_ID flag as {fd,id} toggle
> > > >       - BPF_F_LINK flag as {prog,link} toggle
> > > >       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> > > >         BPF_F_AFTER will just append for the case of attaching
> > > >       - Enforced only at attach time
> > > >     - BPF_F_{FIRST,LAST}
> > > >       - Enforced throughout the bpf_mprog state's lifetime
> > > >       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
> > > >   - Internal revision counter and optionally being able to pass expected_revision
> > > >   - User space daemon can query current state with revision, and pass it along
> > > >     for attachment to assert current state before doing updates
> > > >   - Query also gets extension for link_ids array and link_attach_flags:
> > > >     - prog_ids are always filled with program IDs
> > > >     - link_ids are filled with link IDs when link was used, otherwise 0
> > > >     - {prog,link}_attach_flags for holding {prog,link}-specific flags
> > > >   - Must be easy to integrate/reuse for in-kernel users
> > > >
> > > > The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> > > > consisting of the additions of the attachment flags, revision counter, and
> > > > expanding existing union with relative_{fd,id} member.
> > > >
> > > > The bpf_mprog framework consists of a bpf_mprog_entry object which holds
> > > > an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> > > > structure). Both have been separated, so that fast-path gets efficient packing
> > > > of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen
> > > > instead of linked list or other structures to remove unnecessary indirections
> > > > for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> > > > via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> > > > is populated and then just swapped which avoids additional allocations that
> > > > could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> > > > currently static, but they could be converted to dynamic allocation if necessary
> > > > at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
> > > > for example, in case of tcx which uses this API in the next patch, it piggy-
> > > > backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> > > > add,del} implementation and an extensive test suite for checking all aspects
> > > > of this API for prog-based attach/detach and link API as BPF selftests in
> > > > this series.
> > > >
> > > > Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
> > > >
> > > >   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
> > > >   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
> > > >   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> > > >
> > > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > > > ---
> > > >  MAINTAINERS                    |   1 +
> > > >  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
> > > >  include/uapi/linux/bpf.h       |  37 ++-
> > > >  kernel/bpf/Makefile            |   2 +-
> > > >  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
> > > >  tools/include/uapi/linux/bpf.h |  37 ++-
> > > >  6 files changed, 781 insertions(+), 17 deletions(-)
> > > >  create mode 100644 include/linux/bpf_mprog.h
> > > >  create mode 100644 kernel/bpf/mprog.c
> > > >
> >
> > [...]
> >
> > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > index a7b5e91dd768..207f8a37b327 100644
> > > > --- a/tools/include/uapi/linux/bpf.h
> > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> > > >   */
> > > >  #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> > > >  #define BPF_F_ALLOW_MULTI    (1U << 1)
> > > > +/* Generic attachment flags. */
> > > >  #define BPF_F_REPLACE                (1U << 2)
> > > > +#define BPF_F_BEFORE         (1U << 3)
> > > > +#define BPF_F_AFTER          (1U << 4)
> > >
> > > [..]
> > >
> > > > +#define BPF_F_FIRST          (1U << 5)
> > > > +#define BPF_F_LAST           (1U << 6)
> > >
> > > I'm still not sure whether the hard semantics of first/last is really
> > > useful. My worry is that some prog will just use BPF_F_FIRST which
> > > would prevent the rest of the users.. (starting with only
> > > F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> > > need first/last).
> >
> > Without FIRST/LAST some scenarios cannot be guaranteed to be safely
> > implemented. E.g., if I have some hard audit requirements and I need
> > to guarantee that my program runs first and observes each event, I'll
> > enforce BPF_F_FIRST when attaching it. And if that attachment fails,
> > then server setup is broken and my application cannot function.
> >
> > In a setup where we expect multiple applications to co-exist, it
> > should be a rule that no one is using FIRST/LAST (unless it's
> > absolutely required). And if someone doesn't comply, then that's a bug
> > and has to be reported to application owners.
> >
> > But it's not up to the kernel to enforce this cooperation by
> > disallowing FIRST/LAST semantics, because that semantics is critical
> > for some applications, IMO.
>
> Maybe that's something that should be done by some other mechanism?
> (and as a follow up, if needed) Something akin to what Toke
> mentioned with another program doing sorting or similar.

The goal of this API is to avoid needing some extra special program to
do this sorting

>
> Otherwise, those first/last are just plain simple old priority bands;
> only we have two now, not u16.

I think it's different. FIRST/LAST has to be used judiciously, of
course, but when they are needed, they will have no alternative.

Also, specifying FIRST + LAST is the way to say "I want my program to
be the only one attached". Should we encourage such use cases? No, of
course. But I think it's fair  for users to be able to express this.

>
> I'm mostly coming from the observability point: imagine I have my fancy
> tc_ingress_tcpdump program that I want to attach as a first program to debug
> some issue, but it won't work because there is already a 'first' program
> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?

If your production setup requires that some important program has to
be FIRST, then yeah, your "let me debug something" program shouldn't
interfere with it (assuming that FIRST requirement is a real
requirement and not someone just thinking they need to be first; but
that's up to user space to decide). Maybe the solution for you in that
case would be freplace program installed on top of that stubborn FIRST
program? And if we are talking about local debugging and development,
then you are a sysadmin and you should be able to force-detach that
program that is getting in the way.
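
For context, the freplace route mentioned here would look roughly like the
following on the loader side (sketch; "observer" and "handle_ingress" are
hypothetical program/function names, and the target needs to be a global
function in the already-attached program):

  #include <errno.h>
  #include <bpf/libbpf.h>

  /* obj is expected to contain a program declared as
   * SEC("freplace/handle_ingress") named "observer".
   */
  static int attach_observer(struct bpf_object *obj, int first_prog_fd)
  {
          struct bpf_program *prog;
          struct bpf_link *link;
          int err;

          prog = bpf_object__find_program_by_name(obj, "observer");
          if (!prog)
                  return -ENOENT;
          /* the replacement target must be known at load time */
          err = bpf_program__set_attach_target(prog, first_prog_fd,
                                               "handle_ingress");
          if (err)
                  return err;
          err = bpf_object__load(obj);
          if (err)
                  return err;
          link = bpf_program__attach_freplace(prog, first_prog_fd,
                                              "handle_ingress");
          return libbpf_get_error(link);
  }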


>
> > > But if everyone besides myself is on board with first/last, maybe at least
> > > put a comment here saying that only a single program can be first/last?
> > > And the users are advised not to use these unless they really really really
> > > need to be first/last. (IOW, feels like first/last should be reserved
> > > for observability tools/etc).
> >
> > +1, we can definitely make it clear in API that this will prevent
> > anyone else from being attached as FIRST/LAST, so it's not cooperative
> > in nature and has to be very consciously evaluated.
> >
> > >
> > > > +#define BPF_F_ID             (1U << 7)
> > > > +#define BPF_F_LINK           BPF_F_LINK /* 1 << 13 */
> > > >
> > > >  /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> > > >   * verifier will perform strict alignment checking as if the kernel
> >
> > [...]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-08 22:13         ` Andrii Nakryiko
@ 2023-06-08 23:06           ` Stanislav Fomichev
  2023-06-08 23:54             ` Alexei Starovoitov
  2023-06-09  0:29             ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 49+ messages in thread
From: Stanislav Fomichev @ 2023-06-08 23:06 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Daniel Borkmann, ast, andrii, martin.lau, razor, john.fastabend,
	kuba, dxu, joe, toke, davem, bpf, netdev

On 06/08, Andrii Nakryiko wrote:
> On Thu, Jun 8, 2023 at 2:52 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 06/08, Andrii Nakryiko wrote:
> > > On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On 06/07, Daniel Borkmann wrote:
> > > > > This adds a generic layer called bpf_mprog which can be reused by different
> > > > > attachment layers to enable multi-program attachment and dependency resolution.
> > > > > In-kernel users of the bpf_mprog don't need to care about the dependency
> > > > > resolution internals, they can just consume it with few API calls.
> > > > >
> > > > > The initial idea of having a generic API sparked out of discussion [0] from an
> > > > > earlier revision of this work where tc's priority was reused and exposed via
> > > > > BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> > > > > as-is for classic tc BPF. The feedback was that priority provides a bad user
> > > > > experience and is hard to use [1], e.g.:
> > > > >
> > > > >   I cannot help but feel that priority logic copy-paste from old tc, netfilter
> > > > >   and friends is done because "that's how things were done in the past". [...]
> > > > >   Priority gets exposed everywhere in uapi all the way to bpftool when it's
> > > > >   right there for users to understand. And that's the main problem with it.
> > > > >
> > > > >   The user don't want to and don't need to be aware of it, but uapi forces them
> > > > >   to pick the priority. [...] Your cover letter [0] example proves that in
> > > > >   real life different service pick the same priority. They simply don't know
> > > > >   any better. Priority is an unnecessary magic that apps _have_ to pick, so
> > > > >   they just copy-paste and everyone ends up using the same.
> > > > >
> > > > > The course of the discussion showed more and more the need for a generic,
> > > > > reusable API where the "same look and feel" can be applied for various other
> > > > > program types beyond just tc BPF, for example XDP today does not have multi-
> > > > > program support in kernel, but also there was interest around this API for
> > > > > improving management of cgroup program types. Such common multi-program
> > > > > management concept is useful for BPF management daemons or user space BPF
> > > > > applications coordinating about their attachments.
> > > > >
> > > > > Both from Cilium and Meta side [2], we've collected the following requirements
> > > > > for a generic attach/detach/query API for multi-progs which has been implemented
> > > > > as part of this work:
> > > > >
> > > > >   - Support prog-based attach/detach and link API
> > > > >   - Dependency directives (can also be combined):
> > > > >     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> > > > >       - BPF_F_ID flag as {fd,id} toggle
> > > > >       - BPF_F_LINK flag as {prog,link} toggle
> > > > >       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> > > > >         BPF_F_AFTER will just append for the case of attaching
> > > > >       - Enforced only at attach time
> > > > >     - BPF_F_{FIRST,LAST}
> > > > >       - Enforced throughout the bpf_mprog state's lifetime
> > > > >       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
> > > > >   - Internal revision counter and optionally being able to pass expected_revision
> > > > >   - User space daemon can query current state with revision, and pass it along
> > > > >     for attachment to assert current state before doing updates
> > > > >   - Query also gets extension for link_ids array and link_attach_flags:
> > > > >     - prog_ids are always filled with program IDs
> > > > >     - link_ids are filled with link IDs when link was used, otherwise 0
> > > > >     - {prog,link}_attach_flags for holding {prog,link}-specific flags
> > > > >   - Must be easy to integrate/reuse for in-kernel users
> > > > >
> > > > > The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> > > > > consisting of the additions of the attachment flags, revision counter, and
> > > > > expanding existing union with relative_{fd,id} member.
> > > > >
> > > > > The bpf_mprog framework consists of a bpf_mprog_entry object which holds
> > > > > an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
> > > > > structure). Both have been separated, so that fast-path gets efficient packing
> > > > > of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen
> > > > > instead of linked list or other structures to remove unnecessary indirections
> > > > > for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
> > > > > via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
> > > > > is populated and then just swapped which avoids additional allocations that
> > > > > could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
> > > > > currently static, but they could be converted to dynamic allocation if necessary
> > > > > at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
> > > > > for example, in case of tcx which uses this API in the next patch, it piggy-
> > > > > backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
> > > > > add,del} implementation and an extensive test suite for checking all aspects
> > > > > of this API for prog-based attach/detach and link API as BPF selftests in
> > > > > this series.
> > > > >
> > > > > Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
> > > > >
> > > > >   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
> > > > >   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
> > > > >   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> > > > >
> > > > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > > > > ---
> > > > >  MAINTAINERS                    |   1 +
> > > > >  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
> > > > >  include/uapi/linux/bpf.h       |  37 ++-
> > > > >  kernel/bpf/Makefile            |   2 +-
> > > > >  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
> > > > >  tools/include/uapi/linux/bpf.h |  37 ++-
> > > > >  6 files changed, 781 insertions(+), 17 deletions(-)
> > > > >  create mode 100644 include/linux/bpf_mprog.h
> > > > >  create mode 100644 kernel/bpf/mprog.c
> > > > >
> > >
> > > [...]
> > >
> > > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > > index a7b5e91dd768..207f8a37b327 100644
> > > > > --- a/tools/include/uapi/linux/bpf.h
> > > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > > @@ -1102,7 +1102,14 @@ enum bpf_link_type {
> > > > >   */
> > > > >  #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> > > > >  #define BPF_F_ALLOW_MULTI    (1U << 1)
> > > > > +/* Generic attachment flags. */
> > > > >  #define BPF_F_REPLACE                (1U << 2)
> > > > > +#define BPF_F_BEFORE         (1U << 3)
> > > > > +#define BPF_F_AFTER          (1U << 4)
> > > >
> > > > [..]
> > > >
> > > > > +#define BPF_F_FIRST          (1U << 5)
> > > > > +#define BPF_F_LAST           (1U << 6)
> > > >
> > > > I'm still not sure whether the hard semantics of first/last is really
> > > > useful. My worry is that some prog will just use BPF_F_FIRST which
> > > > would prevent the rest of the users.. (starting with only
> > > > F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> > > > need first/last).
> > >
> > > Without FIRST/LAST some scenarios cannot be guaranteed to be safely
> > > implemented. E.g., if I have some hard audit requirements and I need
> > > to guarantee that my program runs first and observes each event, I'll
> > > enforce BPF_F_FIRST when attaching it. And if that attachment fails,
> > > then server setup is broken and my application cannot function.
> > >
> > > In a setup where we expect multiple applications to co-exist, it
> > > should be a rule that no one is using FIRST/LAST (unless it's
> > > absolutely required). And if someone doesn't comply, then that's a bug
> > > and has to be reported to application owners.
> > >
> > > But it's not up to the kernel to enforce this cooperation by
> > > disallowing FIRST/LAST semantics, because that semantics is critical
> > > for some applications, IMO.
> >
> > Maybe that's something that should be done by some other mechanism?
> > (and as a follow up, if needed) Something akin to what Toke
> > mentioned with another program doing sorting or similar.
> 
> The goal of this API is to avoid needing some extra special program to
> do this sorting
> 
> >
> > Otherwise, those first/last are just plain simple old priority bands;
> > only we have two now, not u16.
> 
> I think it's different. FIRST/LAST has to be used judiciously, of
> course, but when they are needed, they will have no alternative.
> 
> Also, specifying FIRST + LAST is the way to say "I want my program to
> be the only one attached". Should we encourage such use cases? No, of
> course. But I think it's fair  for users to be able to express this.
> 
> >
> > I'm mostly coming from the observability point: imagine I have my fancy
> > tc_ingress_tcpdump program that I want to attach as a first program to debug
> > some issue, but it won't work because there is already a 'first' program
> > installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
> 
> If your production setup requires that some important program has to
> be FIRST, then yeah, your "let me debug something" program shouldn't
> interfere with it (assuming that FIRST requirement is a real
> requirement and not someone just thinking they need to be first; but
> that's up to user space to decide). Maybe the solution for you in that
> case would be freplace program installed on top of that stubborn FIRST
> program? And if we are talking about local debugging and development,
> then you are a sysadmin and you should be able to force-detach that
> program that is getting in the way.

I'm not really concerned about our production environment. It's pretty
controlled and restricted and I'm pretty certain we can avoid doing
something stupid. Probably the same for your env.

I'm mostly fantasizing about upstream world where different users don't
know about each other and start doing stupid things like F_FIRST where
they don't really have to be first. It's that "used judiciously" part
that I'm a bit skeptical about :-D

Because even with this new ordering scheme, there still should be
some entity to do relative ordering (systemd-style, maybe CNI?).
And if it does the ordering, I don't really see why we need
F_FIRST/F_LAST.

But, if you think you need F_FIRST/F_LAST, let's have them. I just
personally don't see us using them (nor do I see why they have to
be used upstream). The only thing that makes sense is probably for
Cilium to do F_FIRST|F_LAST to prevent other things from breaking it?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-08 23:06           ` Stanislav Fomichev
@ 2023-06-08 23:54             ` Alexei Starovoitov
  2023-06-09  0:08               ` Andrii Nakryiko
  2023-06-09  0:29             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 49+ messages in thread
From: Alexei Starovoitov @ 2023-06-08 23:54 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, Nikolay Aleksandrov,
	John Fastabend, Jakub Kicinski, Daniel Xu, Joe Stringer,
	Toke Høiland-Jørgensen, David S. Miller, bpf,
	Network Development

On Thu, Jun 8, 2023 at 4:06 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> I'm not really concerned about our production environment. It's pretty
> controlled and restricted and I'm pretty certain we can avoid doing
> something stupid. Probably the same for your env.
>
> I'm mostly fantasizing about upstream world where different users don't
> know about each other and start doing stupid things like F_FIRST where
> they don't really have to be first. It's that "used judiciously" part
> that I'm a bit skeptical about :-D
>
> Because even with this new ordering scheme, there still should be
> some entity to do relative ordering (systemd-style, maybe CNI?).
> And if it does the ordering, I don't really see why we need
> F_FIRST/F_LAST.

+1.
I have the same concerns as expressed during lsfmmbpf.
This first/last is a foot gun.
It puts the whole API back into a single user situation.
Without "first api" the users are forced to talk to each other
and come up with an arbitration mechanism. A daemon to control
the order or something like that.
With "first api" there is no incentive to do so.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-08 23:54             ` Alexei Starovoitov
@ 2023-06-09  0:08               ` Andrii Nakryiko
  2023-06-09  0:38                 ` Stanislav Fomichev
  0 siblings, 1 reply; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-09  0:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, Nikolay Aleksandrov,
	John Fastabend, Jakub Kicinski, Daniel Xu, Joe Stringer,
	Toke Høiland-Jørgensen, David S. Miller, bpf,
	Network Development

On Thu, Jun 8, 2023 at 4:55 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 8, 2023 at 4:06 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > I'm not really concerned about our production environment. It's pretty
> > controlled and restricted and I'm pretty certain we can avoid doing
> > something stupid. Probably the same for your env.
> >
> > I'm mostly fantasizing about upstream world where different users don't
> > know about each other and start doing stupid things like F_FIRST where
> > they don't really have to be first. It's that "used judiciously" part
> > that I'm a bit skeptical about :-D
> >
> > Because even with this new ordering scheme, there still should be
> > some entity to do relative ordering (systemd-style, maybe CNI?).
> > And if it does the ordering, I don't really see why we need
> > F_FIRST/F_LAST.
>
> +1.
> I have the same concerns as expressed during lsfmmbpf.
> This first/last is a foot gun.
> It puts the whole API back into a single user situation.
> Without "first api" the users are forced to talk to each other
> and come up with an arbitration mechanism. A daemon to control
> the order or something like that.
> With "first api" there is no incentive to do so.

If Cilium and some other company X both produce, say, anti-DDOS
solution which cannot co-exist with any other anti-DDOS program and
either of them needs to guarantee that their program runs first, then
FIRST is what would be used by both to prevent accidental breakage of
each other (which is basically what happened with Cilium and some
other networking solution, I don't remember the name). It's better for
one of them to loudly fail to attach than to silently break the other
solution, with end users struggling to understand what's going on.

You and Stanislav keep insisting that any combination of any BPF
programs should co-exist, and I don't understand why we can or should
presume that. I think we are conflating generic API (and kernel *not*
making any assumptions about such API usage) with encouraging
collaborative BPF attachment policies. They are orthogonal and are not
in conflict with each other.

But we lived without FIRST/LAST guarantees till now, that's fine, I'll
stop fighting this.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-08 23:06           ` Stanislav Fomichev
  2023-06-08 23:54             ` Alexei Starovoitov
@ 2023-06-09  0:29             ` Toke Høiland-Jørgensen
  2023-06-09  6:52               ` Daniel Borkmann
  1 sibling, 1 reply; 49+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-09  0:29 UTC (permalink / raw)
  To: Stanislav Fomichev, Andrii Nakryiko
  Cc: Daniel Borkmann, ast, andrii, martin.lau, razor, john.fastabend,
	kuba, dxu, joe, davem, bpf, netdev

Stanislav Fomichev <sdf@google.com> writes:

> On 06/08, Andrii Nakryiko wrote:
>> On Thu, Jun 8, 2023 at 2:52 PM Stanislav Fomichev <sdf@google.com> wrote:
>> >
>> > On 06/08, Andrii Nakryiko wrote:
>> > > On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
>> > > >
>> > > > On 06/07, Daniel Borkmann wrote:
>> > > > > This adds a generic layer called bpf_mprog which can be reused by different
>> > > > > attachment layers to enable multi-program attachment and dependency resolution.
>> > > > > In-kernel users of the bpf_mprog don't need to care about the dependency
>> > > > > resolution internals, they can just consume it with few API calls.
>> > > > >
>> > > > > The initial idea of having a generic API sparked out of discussion [0] from an
>> > > > > earlier revision of this work where tc's priority was reused and exposed via
>> > > > > BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
>> > > > > as-is for classic tc BPF. The feedback was that priority provides a bad user
>> > > > > experience and is hard to use [1], e.g.:
>> > > > >
>> > > > >   I cannot help but feel that priority logic copy-paste from old tc, netfilter
>> > > > >   and friends is done because "that's how things were done in the past". [...]
>> > > > >   Priority gets exposed everywhere in uapi all the way to bpftool when it's
>> > > > >   right there for users to understand. And that's the main problem with it.
>> > > > >
>> > > > >   The user don't want to and don't need to be aware of it, but uapi forces them
>> > > > >   to pick the priority. [...] Your cover letter [0] example proves that in
>> > > > >   real life different service pick the same priority. They simply don't know
>> > > > >   any better. Priority is an unnecessary magic that apps _have_ to pick, so
>> > > > >   they just copy-paste and everyone ends up using the same.
>> > > > >
>> > > > > The course of the discussion showed more and more the need for a generic,
>> > > > > reusable API where the "same look and feel" can be applied for various other
>> > > > > program types beyond just tc BPF, for example XDP today does not have multi-
>> > > > > program support in kernel, but also there was interest around this API for
>> > > > > improving management of cgroup program types. Such common multi-program
>> > > > > management concept is useful for BPF management daemons or user space BPF
>> > > > > applications coordinating about their attachments.
>> > > > >
>> > > > > Both from Cilium and Meta side [2], we've collected the following requirements
>> > > > > for a generic attach/detach/query API for multi-progs which has been implemented
>> > > > > as part of this work:
>> > > > >
>> > > > >   - Support prog-based attach/detach and link API
>> > > > >   - Dependency directives (can also be combined):
>> > > > >     - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
>> > > > >       - BPF_F_ID flag as {fd,id} toggle
>> > > > >       - BPF_F_LINK flag as {prog,link} toggle
>> > > > >       - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
>> > > > >         BPF_F_AFTER will just append for the case of attaching
>> > > > >       - Enforced only at attach time
>> > > > >     - BPF_F_{FIRST,LAST}
>> > > > >       - Enforced throughout the bpf_mprog state's lifetime
>> > > > >       - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
>> > > > >   - Internal revision counter and optionally being able to pass expected_revision
>> > > > >   - User space daemon can query current state with revision, and pass it along
>> > > > >     for attachment to assert current state before doing updates
>> > > > >   - Query also gets extension for link_ids array and link_attach_flags:
>> > > > >     - prog_ids are always filled with program IDs
>> > > > >     - link_ids are filled with link IDs when link was used, otherwise 0
>> > > > >     - {prog,link}_attach_flags for holding {prog,link}-specific flags
>> > > > >   - Must be easy to integrate/reuse for in-kernel users
>> > > > >
>> > > > > The uapi-side changes needed for supporting bpf_mprog are rather minimal,
>> > > > > consisting of the additions of the attachment flags, revision counter, and
>> > > > > expanding existing union with relative_{fd,id} member.
>> > > > >
>> > > > > The bpf_mprog framework consists of a bpf_mprog_entry object which holds
>> > > > > an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
>> > > > > structure). Both have been separated, so that fast-path gets efficient packing
>> > > > > of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen
>> > > > > instead of linked list or other structures to remove unnecessary indirections
>> > > > > for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
>> > > > > via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
>> > > > > is populated and then just swapped which avoids additional allocations that
>> > > > > could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
>> > > > > currently static, but they could be converted to dynamic allocation if necessary
>> > > > > at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
>> > > > > for example, in case of tcx which uses this API in the next patch, it piggy-
>> > > > > backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
>> > > > > add,del} implementation and an extensive test suite for checking all aspects
>> > > > > of this API for prog-based attach/detach and link API as BPF selftests in
>> > > > > this series.
>> > > > >
>> > > > > Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
>> > > > >
>> > > > >   [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
>> > > > >   [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
>> > > > >   [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>> > > > >
>> > > > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> > > > > ---
>> > > > >  MAINTAINERS                    |   1 +
>> > > > >  include/linux/bpf_mprog.h      | 245 +++++++++++++++++
>> > > > >  include/uapi/linux/bpf.h       |  37 ++-
>> > > > >  kernel/bpf/Makefile            |   2 +-
>> > > > >  kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
>> > > > >  tools/include/uapi/linux/bpf.h |  37 ++-
>> > > > >  6 files changed, 781 insertions(+), 17 deletions(-)
>> > > > >  create mode 100644 include/linux/bpf_mprog.h
>> > > > >  create mode 100644 kernel/bpf/mprog.c
>> > > > >
>> > >
>> > > [...]
>> > >
>> > > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> > > > > index a7b5e91dd768..207f8a37b327 100644
>> > > > > --- a/tools/include/uapi/linux/bpf.h
>> > > > > +++ b/tools/include/uapi/linux/bpf.h
>> > > > > @@ -1102,7 +1102,14 @@ enum bpf_link_type {
>> > > > >   */
>> > > > >  #define BPF_F_ALLOW_OVERRIDE (1U << 0)
>> > > > >  #define BPF_F_ALLOW_MULTI    (1U << 1)
>> > > > > +/* Generic attachment flags. */
>> > > > >  #define BPF_F_REPLACE                (1U << 2)
>> > > > > +#define BPF_F_BEFORE         (1U << 3)
>> > > > > +#define BPF_F_AFTER          (1U << 4)
>> > > >
>> > > > [..]
>> > > >
>> > > > > +#define BPF_F_FIRST          (1U << 5)
>> > > > > +#define BPF_F_LAST           (1U << 6)
>> > > >
>> > > > I'm still not sure whether the hard semantics of first/last is really
>> > > > useful. My worry is that some prog will just use BPF_F_FIRST which
>> > > > would prevent the rest of the users.. (starting with only
>> > > > F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>> > > > need first/last).
>> > >
>> > > Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>> > > implemented. E.g., if I have some hard audit requirements and I need
>> > > to guarantee that my program runs first and observes each event, I'll
>> > > enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>> > > then server setup is broken and my application cannot function.
>> > >
>> > > In a setup where we expect multiple applications to co-exist, it
>> > > should be a rule that no one is using FIRST/LAST (unless it's
>> > > absolutely required). And if someone doesn't comply, then that's a bug
>> > > and has to be reported to application owners.
>> > >
>> > > But it's not up to the kernel to enforce this cooperation by
>> > > disallowing FIRST/LAST semantics, because that semantics is critical
>> > > for some applications, IMO.
>> >
>> > Maybe that's something that should be done by some other mechanism?
>> > (and as a follow up, if needed) Something akin to what Toke
>> > mentioned with another program doing sorting or similar.
>> 
>> The goal of this API is to avoid needing some extra special program to
>> do this sorting
>> 
>> >
>> > Otherwise, those first/last are just plain simple old priority bands;
>> > only we have two now, not u16.
>> 
>> I think it's different. FIRST/LAST has to be used judiciously, of
>> course, but when they are needed, they will have no alternative.
>> 
>> Also, specifying FIRST + LAST is the way to say "I want my program to
>> be the only one attached". Should we encourage such use cases? No, of
>> course. But I think it's fair  for users to be able to express this.
>> 
>> >
>> > I'm mostly coming from the observability point: imagine I have my fancy
>> > tc_ingress_tcpdump program that I want to attach as a first program to debug
>> > some issue, but it won't work because there is already a 'first' program
>> > installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>> 
>> If your production setup requires that some important program has to
>> be FIRST, then yeah, your "let me debug something" program shouldn't
>> interfere with it (assuming that FIRST requirement is a real
>> requirement and not someone just thinking they need to be first; but
>> that's up to user space to decide). Maybe the solution for you in that
>> case would be freplace program installed on top of that stubborn FIRST
>> program? And if we are talking about local debugging and development,
>> then you are a sysadmin and you should be able to force-detach that
>> program that is getting in the way.
>
> I'm not really concerned about our production environment. It's pretty
> controlled and restricted and I'm pretty certain we can avoid doing
> something stupid. Probably the same for your env.
>
> I'm mostly fantasizing about upstream world where different users don't
> know about each other and start doing stupid things like F_FIRST where
> they don't really have to be first. It's that "used judiciously" part
> that I'm a bit skeptical about :-D
>
> Because even with this new ordering scheme, there still should be
> some entity to do relative ordering (systemd-style, maybe CNI?).
> And if it does the ordering, I don't really see why we need
> F_FIRST/F_LAST.

I can see I'm a bit late to the party, but FWIW I agree with this:
FIRST/LAST will definitely be abused if we add it. It also seems to me
to be policy in the kernel, which would be much better handled in
userspace like we do for so many other things. So we should rather
expose a hook to allow userspace to set the policy, as we've discussed
before; I definitely think we should add that at some point! Although
obviously it doesn't have to be part of this series...

-Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09  0:08               ` Andrii Nakryiko
@ 2023-06-09  0:38                 ` Stanislav Fomichev
  0 siblings, 0 replies; 49+ messages in thread
From: Stanislav Fomichev @ 2023-06-09  0:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, Nikolay Aleksandrov,
	John Fastabend, Jakub Kicinski, Daniel Xu, Joe Stringer,
	Toke Høiland-Jørgensen, David S. Miller, bpf,
	Network Development

On 06/08, Andrii Nakryiko wrote:
> On Thu, Jun 8, 2023 at 4:55 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jun 8, 2023 at 4:06 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > I'm not really concerned about our production environment. It's pretty
> > > controlled and restricted and I'm pretty certain we can avoid doing
> > > something stupid. Probably the same for your env.
> > >
> > > I'm mostly fantasizing about upstream world where different users don't
> > > know about each other and start doing stupid things like F_FIRST where
> > > they don't really have to be first. It's that "used judiciously" part
> > > that I'm a bit skeptical about :-D
> > >
> > > Because even with this new ordering scheme, there still should be
> > > some entity to do relative ordering (systemd-style, maybe CNI?).
> > > And if it does the ordering, I don't really see why we need
> > > F_FIRST/F_LAST.
> >
> > +1.
> > I have the same concerns as expressed during lsfmmbpf.
> > This first/last is a foot gun.
> > It puts the whole API back into a single user situation.
> > Without "first api" the users are forced to talk to each other
> > and come up with an arbitration mechanism. A daemon to control
> > the order or something like that.
> > With "first api" there is no incentive to do so.
> 
> If Cilium and some other company X both produce, say, anti-DDOS
> solution which cannot co-exist with any other anti-DDOS program and
> either of them needs to guarantee that their program runs first, then
> FIRST is what would be used by both to prevent accidental breakage of
> each other (which is basically what happened with Cilium and some
> other networking solution, don't remember the name). It's better for
> one of them to loudly fail to attach than silently break other
> solution with end users struggling to understand what's going on.
> 
> You and Stanislav keep insisting that any combination of any BPF
> programs should co-exist, and I don't understand why we can or should
> presume that. I think we are conflating generic API (and kernel *not*
> making any assumptions about such API usage) with encouraging
> collaborative BPF attachment policies. They are orthogonal and are not
> in conflict with each other.
> 
> But we lived without FIRST/LAST guarantees till now, that's fine, I'll
> stop fighting this.

I'm not saying this situation where there are several incompatible programs
doesn't exist. All I'm saying is that, imo, this is a policy that doesn't
belong in the kernel. Or maybe let's put it this way: F_FIRST and F_LAST
aren't flexible enough to express this policy. An external systemd-like
arbiter should express the dependencies/ordering/conflicts/etc., and
F_BEFORE and F_AFTER are enough for that systemd-like entity to do the
rest.
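
As a sketch of what such an arbiter's placement loop could look like on top
of the proposed API (all three helpers are hypothetical stand-ins:
query_tcx_revision() for a BPF_PROG_QUERY on the tcx hook,
lookup_anchor_prog_fd() for the arbiter's own dependency bookkeeping, and
tcx_attach_before() for a BPF_LINK_CREATE with BPF_F_BEFORE plus
relative_fd/expected_revision):

  #include <linux/types.h>

  extern __u64 query_tcx_revision(int ifindex);   /* hypothetical */
  extern int lookup_anchor_prog_fd(int ifindex);  /* hypothetical */
  extern int tcx_attach_before(int prog_fd, int ifindex,
                               int relative_fd, __u64 expected_revision);

  /* Query the hook's revision, decide where the program belongs relative to
   * what is already attached, and let the kernel reject the attach if the
   * chain changed in between, then re-evaluate. A real arbiter would also
   * distinguish a revision mismatch from other attach errors.
   */
  static void place_prog(int prog_fd, int ifindex)
  {
          for (;;) {
                  __u64 rev = query_tcx_revision(ifindex);
                  int anchor_fd = lookup_anchor_prog_fd(ifindex);

                  if (tcx_attach_before(prog_fd, ifindex, anchor_fd, rev) >= 0)
                          break;  /* attached at the intended spot */
          }
  }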

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-07 19:26 ` [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support Daniel Borkmann
                     ` (2 preceding siblings ...)
  2023-06-08 21:20   ` Andrii Nakryiko
@ 2023-06-09  3:06   ` Jakub Kicinski
  3 siblings, 0 replies; 49+ messages in thread
From: Jakub Kicinski @ 2023-06-09  3:06 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, dxu, joe,
	toke, davem, bpf, netdev

On Wed,  7 Jun 2023 21:26:20 +0200 Daniel Borkmann wrote:
> +	dev = dev_get_by_index(net, attr->link_create.target_ifindex);
> +	if (!dev)
> +		return -EINVAL;
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put;
> +	}
> +
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
> +	link->location = attr->link_create.attach_type;
> +	link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST);
> +	link->dev = dev;
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err) {
> +		kfree(link);
> +		goto out_put;
> +	}
> +	rtnl_lock();

How does this work vs device unregistering? 

Best I can tell (and it is a large patch :() the device may have passed
dev_tcx_uninstall() by the time we take the lock.

> +	err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
> +				   attr->link_create.tcx.relative_fd,
> +				   attr->link_create.tcx.expected_revision);
> +	if (!err)
> +		fd = bpf_link_settle(&link_primer);
> +	rtnl_unlock();
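
For reference, one conventional way to close that window would be to resolve
the ifindex under rtnl in the first place, e.g. (sketch only, not necessarily
how the series will end up handling it):

  rtnl_lock();
  dev = __dev_get_by_index(net, attr->link_create.target_ifindex);
  if (!dev) {
          rtnl_unlock();
          return -ENODEV;
  }
  /* ... kzalloc + bpf_link_init + bpf_link_prime as in the quoted hunk ... */
  err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
                             attr->link_create.tcx.relative_fd,
                             attr->link_create.tcx.expected_revision);
  if (!err)
          fd = bpf_link_settle(&link_primer);
  rtnl_unlock();

Since unregistration also runs under rtnl, the device should not be able to
get past dev_tcx_uninstall() while the lock is held; the trade-off is holding
rtnl across the link allocation and priming.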

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09  0:29             ` Toke Høiland-Jørgensen
@ 2023-06-09  6:52               ` Daniel Borkmann
  2023-06-09  7:15                 ` Daniel Borkmann
  2023-06-09 11:04                 ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-09  6:52 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev, Andrii Nakryiko
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

On 6/9/23 2:29 AM, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
>> On 06/08, Andrii Nakryiko wrote:
>>> On Thu, Jun 8, 2023 at 2:52 PM Stanislav Fomichev <sdf@google.com> wrote:
>>>> On 06/08, Andrii Nakryiko wrote:
>>>>> On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
>>>>>> On 06/07, Daniel Borkmann wrote:
>>>>>>> This adds a generic layer called bpf_mprog which can be reused by different
>>>>>>> attachment layers to enable multi-program attachment and dependency resolution.
>>>>>>> In-kernel users of the bpf_mprog don't need to care about the dependency
>>>>>>> resolution internals, they can just consume it with few API calls.
>>>>>>>
>>>>>>> The initial idea of having a generic API sparked out of discussion [0] from an
>>>>>>> earlier revision of this work where tc's priority was reused and exposed via
>>>>>>> BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
>>>>>>> as-is for classic tc BPF. The feedback was that priority provides a bad user
>>>>>>> experience and is hard to use [1], e.g.:
>>>>>>>
>>>>>>>    I cannot help but feel that priority logic copy-paste from old tc, netfilter
>>>>>>>    and friends is done because "that's how things were done in the past". [...]
>>>>>>>    Priority gets exposed everywhere in uapi all the way to bpftool when it's
>>>>>>>    right there for users to understand. And that's the main problem with it.
>>>>>>>
>>>>>>>    The user don't want to and don't need to be aware of it, but uapi forces them
>>>>>>>    to pick the priority. [...] Your cover letter [0] example proves that in
>>>>>>>    real life different service pick the same priority. They simply don't know
>>>>>>>    any better. Priority is an unnecessary magic that apps _have_ to pick, so
>>>>>>>    they just copy-paste and everyone ends up using the same.
>>>>>>>
>>>>>>> The course of the discussion showed more and more the need for a generic,
>>>>>>> reusable API where the "same look and feel" can be applied for various other
>>>>>>> program types beyond just tc BPF, for example XDP today does not have multi-
>>>>>>> program support in kernel, but also there was interest around this API for
>>>>>>> improving management of cgroup program types. Such common multi-program
>>>>>>> management concept is useful for BPF management daemons or user space BPF
>>>>>>> applications coordinating about their attachments.
>>>>>>>
>>>>>>> Both from Cilium and Meta side [2], we've collected the following requirements
>>>>>>> for a generic attach/detach/query API for multi-progs which has been implemented
>>>>>>> as part of this work:
>>>>>>>
>>>>>>>    - Support prog-based attach/detach and link API
>>>>>>>    - Dependency directives (can also be combined):
>>>>>>>      - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
>>>>>>>        - BPF_F_ID flag as {fd,id} toggle
>>>>>>>        - BPF_F_LINK flag as {prog,link} toggle
>>>>>>>        - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
>>>>>>>          BPF_F_AFTER will just append for the case of attaching
>>>>>>>        - Enforced only at attach time
>>>>>>>      - BPF_F_{FIRST,LAST}
>>>>>>>        - Enforced throughout the bpf_mprog state's lifetime
>>>>>>>        - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
>>>>>>>    - Internal revision counter and optionally being able to pass expected_revision
>>>>>>>    - User space daemon can query current state with revision, and pass it along
>>>>>>>      for attachment to assert current state before doing updates
>>>>>>>    - Query also gets extension for link_ids array and link_attach_flags:
>>>>>>>      - prog_ids are always filled with program IDs
>>>>>>>      - link_ids are filled with link IDs when link was used, otherwise 0
>>>>>>>      - {prog,link}_attach_flags for holding {prog,link}-specific flags
>>>>>>>    - Must be easy to integrate/reuse for in-kernel users
>>>>>>>
>>>>>>> The uapi-side changes needed for supporting bpf_mprog are rather minimal,
>>>>>>> consisting of the additions of the attachment flags, revision counter, and
>>>>>>> expanding existing union with relative_{fd,id} member.
>>>>>>>
>>>>>>> The bpf_mprog framework consists of a bpf_mprog_entry object which holds
>>>>>>> an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
>>>>>>> structure). Both have been separated, so that fast-path gets efficient packing
>>>>>>> of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen
>>>>>>> instead of linked list or other structures to remove unnecessary indirections
>>>>>>> for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
>>>>>>> via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
>>>>>>> is populated and then just swapped which avoids additional allocations that
>>>>>>> could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are
>>>>>>> currently static, but they could be converted to dynamic allocation if necessary
>>>>>>> at a point in future. Locking is deferred to the in-kernel user of bpf_mprog,
>>>>>>> for example, in case of tcx which uses this API in the next patch, it piggy-
>>>>>>> backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
>>>>>>> add,del} implementation and an extensive test suite for checking all aspects
>>>>>>> of this API for prog-based attach/detach and link API as BPF selftests in
>>>>>>> this series.
>>>>>>>
>>>>>>> Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
>>>>>>>
>>>>>>>    [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
>>>>>>>    [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
>>>>>>>    [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>>>>>>>
>>>>>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>>> ---
>>>>>>>   MAINTAINERS                    |   1 +
>>>>>>>   include/linux/bpf_mprog.h      | 245 +++++++++++++++++
>>>>>>>   include/uapi/linux/bpf.h       |  37 ++-
>>>>>>>   kernel/bpf/Makefile            |   2 +-
>>>>>>>   kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
>>>>>>>   tools/include/uapi/linux/bpf.h |  37 ++-
>>>>>>>   6 files changed, 781 insertions(+), 17 deletions(-)
>>>>>>>   create mode 100644 include/linux/bpf_mprog.h
>>>>>>>   create mode 100644 kernel/bpf/mprog.c
>>>>>
>>>>> [...]
>>>>>
>>>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>>>> index a7b5e91dd768..207f8a37b327 100644
>>>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>>>> @@ -1102,7 +1102,14 @@ enum bpf_link_type {
>>>>>>>    */
>>>>>>>   #define BPF_F_ALLOW_OVERRIDE (1U << 0)
>>>>>>>   #define BPF_F_ALLOW_MULTI    (1U << 1)
>>>>>>> +/* Generic attachment flags. */
>>>>>>>   #define BPF_F_REPLACE                (1U << 2)
>>>>>>> +#define BPF_F_BEFORE         (1U << 3)
>>>>>>> +#define BPF_F_AFTER          (1U << 4)
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> +#define BPF_F_FIRST          (1U << 5)
>>>>>>> +#define BPF_F_LAST           (1U << 6)
>>>>>>
>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>> would prevent the rest of the users.. (starting with only
>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>> need first/last).
>>>>>
>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>> then server setup is broken and my application cannot function.
>>>>>
>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>> and has to be reported to application owners.
>>>>>
>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>> for some applications, IMO.
>>>>
>>>> Maybe that's something that should be done by some other mechanism?
>>>> (and as a follow up, if needed) Something akin to what Toke
>>>> mentioned with another program doing sorting or similar.
>>>
>>> The goal of this API is to avoid needing some extra special program to
>>> do this sorting
>>>
>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>> only we have two now, not u16.
>>>
>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>> course, but when they are needed, they will have no alternative.
>>>
>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>> be the only one attached". Should we encourage such use cases? No, of
>>> course. But I think it's fair  for users to be able to express this.
>>>
>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>> some issue, but it won't work because there is already a 'first' program
>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>
>>> If your production setup requires that some important program has to
>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>> interfere with it (assuming that FIRST requirement is a real
>>> requirement and not someone just thinking they need to be first; but
>>> that's up to user space to decide). Maybe the solution for you in that
>>> case would be freplace program installed on top of that stubborn FIRST
>>> program? And if we are talking about local debugging and development,
>>> then you are a sysadmin and you should be able to force-detach that
>>> program that is getting in the way.
>>
>> I'm not really concerned about our production environment. It's pretty
>> controlled and restricted and I'm pretty certain we can avoid doing
>> something stupid. Probably the same for your env.
>>
>> I'm mostly fantasizing about upstream world where different users don't
>> know about each other and start doing stupid things like F_FIRST where
>> they don't really have to be first. It's that "used judiciously" part
>> that I'm a bit skeptical about :-D

But in the end how is that different from just attaching themselves blindly
into the first position (e.g. with before and relative_fd as 0 or the fd/id
of the current first program) - same, they don't really have to be first.
How would that not result in doing something stupid? ;) To add to Andrii's
earlier DDoS mitigation example ... think of K8s environment: one project
is implementing DDoS mitigation with BPF, another one wants to monitor/
sample traffic to user space with BPF. Both install as first position by
default (before + 0). In K8s, there is no built-in Pod dependency management
so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
up in a situation where sometimes the monitor runs before the DDoS mitigation
and on some other nodes it's vice versa. The other case where this gets
broken (assuming a node where we get first the DDoS mitigation, then the
monitoring) is when you need to upgrade one of the Pods: monitoring Pod
gets a new stable update and is being re-rolled out, then it inserts
itself before the DDoS mitigation mechanism, potentially causing outage.
With the first/last mechanism these two situations cannot happen. The DDoS
mitigation software uses first and the monitoring uses before + 0, then no
matter the re-rollouts or the ordering in which Pods come up, it's always
at the expected/correct location.
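
Just to sketch what I mean with the API from this series (the libbpf
plumbing below follows the later patches of the series, so take the exact
struct/enum names as an assumption rather than as final):

#include <bpf/bpf.h>

/* DDoS mitigation tool: needs the hard guarantee, so it claims the first
 * slot. Attachment fails if some other program already claimed "first".
 */
static int attach_ddos(int prog_fd, int ifindex)
{
        LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                .flags = BPF_F_FIRST,
        );

        return bpf_prog_attach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &opts);
}

/* Monitoring tool: no hard requirement, it just prepends relative to
 * whatever is currently attached (before + relative fd/id of 0).
 */
static int attach_monitor(int prog_fd, int ifindex)
{
        LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                .flags = BPF_F_BEFORE,
        );

        return bpf_prog_attach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &opts);
}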

>> Because even with this new ordering scheme, there still should be
>> some entity to do relative ordering (systemd-style, maybe CNI?).
>> And if it does the ordering, I don't really see why we need
>> F_FIRST/F_LAST.
> 
> I can see I'm a bit late to the party, but FWIW I agree with this:
> FIRST/LAST will definitely be abused if we add it. It also seems to me

See above on the issues w/o the first/last. How would you work around them
in practice so they cannot happen?

> to be policy in the kernel, which would be much better handled in
> userspace like we do for so many other things. So we should rather
> expose a hook to allow userspace to set the policy, as we've discussed
> before; I definitely think we should add that at some point! Although
> obviously it doesn't have to be part of this series...

Imo, it would be better if we could avoid that.. it feels like we're
trying to shoot sparrows with cannon, e.g. when this API gets reused
for other attach hooks, then for each of them you need yet another
policy program. I don't think that's a good user experience, and I
presume this is then single-user program, thus you'll run into the same
race in the end - whichever management daemon or application gets to
install this policy program first wins. This is potentially just
shifting the same issue one level higher, imo.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09  6:52               ` Daniel Borkmann
@ 2023-06-09  7:15                 ` Daniel Borkmann
  2023-06-09 11:04                 ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-09  7:15 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev, Andrii Nakryiko
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

On 6/9/23 8:52 AM, Daniel Borkmann wrote:
> On 6/9/23 2:29 AM, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>>> On 06/08, Andrii Nakryiko wrote:
>>>> On Thu, Jun 8, 2023 at 2:52 PM Stanislav Fomichev <sdf@google.com> wrote:
>>>>> On 06/08, Andrii Nakryiko wrote:
>>>>>> On Thu, Jun 8, 2023 at 10:24 AM Stanislav Fomichev <sdf@google.com> wrote:
>>>>>>> On 06/07, Daniel Borkmann wrote:
>>>>>>>> This adds a generic layer called bpf_mprog which can be reused by different
>>>>>>>> attachment layers to enable multi-program attachment and dependency resolution.
>>>>>>>> In-kernel users of the bpf_mprog don't need to care about the dependency
>>>>>>>> resolution internals, they can just consume it with few API calls.
>>>>>>>>
>>>>>>>> The initial idea of having a generic API sparked out of discussion [0] from an
>>>>>>>> earlier revision of this work where tc's priority was reused and exposed via
>>>>>>>> BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
>>>>>>>> as-is for classic tc BPF. The feedback was that priority provides a bad user
>>>>>>>> experience and is hard to use [1], e.g.:
>>>>>>>>
>>>>>>>>    I cannot help but feel that priority logic copy-paste from old tc, netfilter
>>>>>>>>    and friends is done because "that's how things were done in the past". [...]
>>>>>>>>    Priority gets exposed everywhere in uapi all the way to bpftool when it's
>>>>>>>>    right there for users to understand. And that's the main problem with it.
>>>>>>>>
>>>>>>>>    The user don't want to and don't need to be aware of it, but uapi forces them
>>>>>>>>    to pick the priority. [...] Your cover letter [0] example proves that in
>>>>>>>>    real life different service pick the same priority. They simply don't know
>>>>>>>>    any better. Priority is an unnecessary magic that apps _have_ to pick, so
>>>>>>>>    they just copy-paste and everyone ends up using the same.
>>>>>>>>
>>>>>>>> The course of the discussion showed more and more the need for a generic,
>>>>>>>> reusable API where the "same look and feel" can be applied for various other
>>>>>>>> program types beyond just tc BPF, for example XDP today does not have multi-
>>>>>>>> program support in kernel, but also there was interest around this API for
>>>>>>>> improving management of cgroup program types. Such common multi-program
>>>>>>>> management concept is useful for BPF management daemons or user space BPF
>>>>>>>> applications coordinating about their attachments.
>>>>>>>>
>>>>>>>> Both from Cilium and Meta side [2], we've collected the following requirements
>>>>>>>> for a generic attach/detach/query API for multi-progs which has been implemented
>>>>>>>> as part of this work:
>>>>>>>>
>>>>>>>>    - Support prog-based attach/detach and link API
>>>>>>>>    - Dependency directives (can also be combined):
>>>>>>>>      - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
>>>>>>>>        - BPF_F_ID flag as {fd,id} toggle
>>>>>>>>        - BPF_F_LINK flag as {prog,link} toggle
>>>>>>>>        - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
>>>>>>>>          BPF_F_AFTER will just append for the case of attaching
>>>>>>>>        - Enforced only at attach time
>>>>>>>>      - BPF_F_{FIRST,LAST}
>>>>>>>>        - Enforced throughout the bpf_mprog state's lifetime
>>>>>>>>        - Admin override possible (e.g. link detach, prog-based BPF_F_REPLACE)
>>>>>>>>    - Internal revision counter and optionally being able to pass expected_revision
>>>>>>>>    - User space daemon can query current state with revision, and pass it along
>>>>>>>>      for attachment to assert current state before doing updates
>>>>>>>>    - Query also gets extension for link_ids array and link_attach_flags:
>>>>>>>>      - prog_ids are always filled with program IDs
>>>>>>>>      - link_ids are filled with link IDs when link was used, otherwise 0
>>>>>>>>      - {prog,link}_attach_flags for holding {prog,link}-specific flags
>>>>>>>>    - Must be easy to integrate/reuse for in-kernel users
>>>>>>>>
>>>>>>>> The uapi-side changes needed for supporting bpf_mprog are rather minimal,
>>>>>>>> consisting of the additions of the attachment flags, revision counter, and
>>>>>>>> expanding existing union with relative_{fd,id} member.
>>>>>>>>
>>>>>>>> The bpf_mprog framework consists of a bpf_mprog_entry object which holds
>>>>>>>> an array of bpf_mprog_fp (fast-path structure) and bpf_mprog_cp (control-path
>>>>>>>> structure). Both have been separated, so that fast-path gets efficient packing
>>>>>>>> of bpf_prog pointers for maximum cache efficiency. Also, an array has been chosen
>>>>>>>> instead of linked list or other structures to remove unnecessary indirections
>>>>>>>> for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair
>>>>>>>> via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry
>>>>>>>> is populated and then just swapped which avoids additional allocations that
>>>>>>>> could otherwise fail, for example, in the detach case. bpf_mprog_{fp,cp} arrays are
>>>>>>>> currently static, but they could be converted to dynamic allocation if necessary
>>>>>>>> at a point in the future. Locking is deferred to the in-kernel user of bpf_mprog,
>>>>>>>> for example, in case of tcx which uses this API in the next patch, it piggy-
>>>>>>>> backs on rtnl. The nitty-gritty details are in the bpf_mprog_{replace,head_tail,
>>>>>>>> add,del} implementation and an extensive test suite for checking all aspects
>>>>>>>> of this API for prog-based attach/detach and link API as BPF selftests in
>>>>>>>> this series.
>>>>>>>>
>>>>>>>> Kudos also to Andrii Nakryiko for API discussions wrt Meta's BPF management daemon.
>>>>>>>>
>>>>>>>>    [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net/
>>>>>>>>    [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
>>>>>>>>    [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>>>>>>>>
>>>>>>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>>>> ---
>>>>>>>>   MAINTAINERS                    |   1 +
>>>>>>>>   include/linux/bpf_mprog.h      | 245 +++++++++++++++++
>>>>>>>>   include/uapi/linux/bpf.h       |  37 ++-
>>>>>>>>   kernel/bpf/Makefile            |   2 +-
>>>>>>>>   kernel/bpf/mprog.c             | 476 +++++++++++++++++++++++++++++++++
>>>>>>>>   tools/include/uapi/linux/bpf.h |  37 ++-
>>>>>>>>   6 files changed, 781 insertions(+), 17 deletions(-)
>>>>>>>>   create mode 100644 include/linux/bpf_mprog.h
>>>>>>>>   create mode 100644 kernel/bpf/mprog.c
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>>>>> index a7b5e91dd768..207f8a37b327 100644
>>>>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>>>>> @@ -1102,7 +1102,14 @@ enum bpf_link_type {
>>>>>>>>    */
>>>>>>>>   #define BPF_F_ALLOW_OVERRIDE (1U << 0)
>>>>>>>>   #define BPF_F_ALLOW_MULTI    (1U << 1)
>>>>>>>> +/* Generic attachment flags. */
>>>>>>>>   #define BPF_F_REPLACE                (1U << 2)
>>>>>>>> +#define BPF_F_BEFORE         (1U << 3)
>>>>>>>> +#define BPF_F_AFTER          (1U << 4)
>>>>>>>
>>>>>>> [..]
>>>>>>>
>>>>>>>> +#define BPF_F_FIRST          (1U << 5)
>>>>>>>> +#define BPF_F_LAST           (1U << 6)
>>>>>>>
>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>>> would prevent the rest of the users.. (starting with only
>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>>> need first/last).
>>>>>>
>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>>> then server setup is broken and my application cannot function.
>>>>>>
>>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>>> and has to be reported to application owners.
>>>>>>
>>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>>> for some applications, IMO.
>>>>>
>>>>> Maybe that's something that should be done by some other mechanism?
>>>>> (and as a follow up, if needed) Something akin to what Toke
>>>>> mentioned with another program doing sorting or similar.
>>>>
>>>> The goal of this API is to avoid needing some extra special program to
>>>> do this sorting
>>>>
>>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>>> only we have two now, not u16.
>>>>
>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>>> course, but when they are needed, they will have no alternative.
>>>>
>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>>> be the only one attached". Should we encourage such use cases? No, of
>>>> course. But I think it's fair  for users to be able to express this.
>>>>
>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>>> some issue, but it won't work because there is already a 'first' program
>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>>
>>>> If your production setup requires that some important program has to
>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>>> interfere with it (assuming that FIRST requirement is a real
>>>> requirement and not someone just thinking they need to be first; but
>>>> that's up to user space to decide). Maybe the solution for you in that
>>>> case would be freplace program installed on top of that stubborn FIRST
>>>> program? And if we are talking about local debugging and development,
>>>> then you are a sysadmin and you should be able to force-detach that
>>>> program that is getting in the way.
>>>
>>> I'm not really concerned about our production environment. It's pretty
>>> controlled and restricted and I'm pretty certain we can avoid doing
>>> something stupid. Probably the same for your env.
>>>
>>> I'm mostly fantasizing about upstream world where different users don't
>>> know about each other and start doing stupid things like F_FIRST where
>>> they don't really have to be first. It's that "used judiciously" part
>>> that I'm a bit skeptical about :-D
> 
> But in the end how is that different from just attaching themselves blindly
> into the first position (e.g. with before and relative_fd as 0 or the fd/id
> of the current first program) - same, they don't really have to be first.
> How would that not result in doing something stupid? ;) To add to Andrii's
> earlier DDoS mitigation example ... think of K8s environment: one project
> is implementing DDoS mitigation with BPF, another one wants to monitor/
> sample traffic to user space with BPF. Both install as first position by
> default (before + 0). In K8s, there is no built-in Pod dependency management
> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
> up in a situation where sometimes the monitor runs before the DDoS mitigation
> and on some other nodes it's vice versa. The other case where this gets
> broken (assuming a node where we get first the DDoS mitigation, then the
> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
> gets a new stable update and is being re-rolled out, then it inserts
> itself before the DDoS mitigation mechanism, potentially causing outage.
> With the first/last mechanism these two situations cannot happen. The DDoS
> mitigation software uses first and the monitoring uses before + 0, then no
> matter the re-rollouts or the ordering in which Pods come up, it's always
> at the expected/correct location.
> 
>>> Because even with this new ordering scheme, there still should be
>>> some entity to do relative ordering (systemd-style, maybe CNI?).

Just to add, in K8s there can be multiple CNIs chained together, and there
is also no common management daemon as you have in G or Meta. So yes, K8s
is a special snowflake, but everyone outside of the big hyperscalers is
relying on it as a platform, so we do need to have a solution for the
trivial, above-mentioned scenario if we drop the first/last.

>>> And if it does the ordering, I don't really see why we need
>>> F_FIRST/F_LAST.
>>
>> I can see I'm a bit late to the party, but FWIW I agree with this:
>> FIRST/LAST will definitely be abused if we add it. It also seems to me
> 
> See above on the issues w/o the first/last. How would you work around them
> in practice so they cannot happen?
> 
>> to be policy in the kernel, which would be much better handled in
>> userspace like we do for so many other things. So we should rather
>> expose a hook to allow userspace to set the policy, as we've discussed
>> before; I definitely think we should add that at some point! Although
>> obviously it doesn't have to be part of this series...
> 
> Imo, it would be better if we could avoid that.. it feels like we're
> trying to shoot sparrows with cannon, e.g. when this API gets reused
> for other attach hooks, then for each of them you need yet another
> policy program. I don't think that's a good user experience, and I
> presume this is then single-user program, thus you'll run into the same
> race in the end - whichever management daemon or application gets to
> install this policy program first wins. This is potentially just
> shifting the same issue one level higher, imo.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09  6:52               ` Daniel Borkmann
  2023-06-09  7:15                 ` Daniel Borkmann
@ 2023-06-09 11:04                 ` Toke Høiland-Jørgensen
  2023-06-09 12:34                   ` Timo Beckers
  1 sibling, 1 reply; 49+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-09 11:04 UTC (permalink / raw)
  To: Daniel Borkmann, Stanislav Fomichev, Andrii Nakryiko
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

Daniel Borkmann <daniel@iogearbox.net> writes:

>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>>> would prevent the rest of the users.. (starting with only
>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>>> need first/last).
>>>>>>
>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>>> then server setup is broken and my application cannot function.
>>>>>>
>>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>>> and has to be reported to application owners.
>>>>>>
>>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>>> for some applications, IMO.
>>>>>
>>>>> Maybe that's something that should be done by some other mechanism?
>>>>> (and as a follow up, if needed) Something akin to what Toke
>>>>> mentioned with another program doing sorting or similar.
>>>>
>>>> The goal of this API is to avoid needing some extra special program to
>>>> do this sorting
>>>>
>>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>>> only we have two now, not u16.
>>>>
>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>>> course, but when they are needed, they will have no alternative.
>>>>
>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>>> be the only one attached". Should we encourage such use cases? No, of
>>>> course. But I think it's fair  for users to be able to express this.
>>>>
>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>>> some issue, but it won't work because there is already a 'first' program
>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>>
>>>> If your production setup requires that some important program has to
>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>>> interfere with it (assuming that FIRST requirement is a real
>>>> requirement and not someone just thinking they need to be first; but
>>>> that's up to user space to decide). Maybe the solution for you in that
>>>> case would be freplace program installed on top of that stubborn FIRST
>>>> program? And if we are talking about local debugging and development,
>>>> then you are a sysadmin and you should be able to force-detach that
>>>> program that is getting in the way.
>>>
>>> I'm not really concerned about our production environment. It's pretty
>>> controlled and restricted and I'm pretty certain we can avoid doing
>>> something stupid. Probably the same for your env.
>>>
>>> I'm mostly fantasizing about upstream world where different users don't
>>> know about each other and start doing stupid things like F_FIRST where
>>> they don't really have to be first. It's that "used judiciously" part
>>> that I'm a bit skeptical about :-D
>
> But in the end how is that different from just attaching themselves blindly
> into the first position (e.g. with before and relative_fd as 0 or the fd/id
> of the current first program) - same, they don't really have to be first.
> How would that not result in doing something stupid? ;) To add to Andrii's
> earlier DDoS mitigation example ... think of K8s environment: one project
> is implementing DDoS mitigation with BPF, another one wants to monitor/
> sample traffic to user space with BPF. Both install as first position by
> default (before + 0). In K8s, there is no built-in Pod dependency management
> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
> up in a situation where sometimes the monitor runs before the DDoS mitigation
> and on some other nodes it's vice versa. The other case where this gets
> broken (assuming a node where we get first the DDoS mitigation, then the
> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
> gets a new stable update and is being re-rolled out, then it inserts
> itself before the DDoS mitigation mechanism, potentially causing outage.
> With the first/last mechanism these two situations cannot happen. The DDoS
> mitigation software uses first and the monitoring uses before + 0, then no
> matter the re-rollouts or the ordering in which Pods come up, it's always
> at the expected/correct location.

I'm not disputing that these kinds of policy issues need to be solved
somehow. But adding the first/last pinning as part of the kernel hooks
doesn't solve the policy problem, it just hard-codes a solution for one
particular instance of the problem.

Taking your example from above, what happens when someone wants to
deploy those tools in reverse order? Say the monitoring tool counts
packets and someone wants to also count the DDOS traffic; but the DDOS
protection tool has decided for itself (by setting the FIRST) flag that
it can *only* run as the first program, so there is no way to achieve
this without modifying the application itself.

>>> Because even with this new ordering scheme, there still should be
>>> some entity to do relative ordering (systemd-style, maybe CNI?).
>>> And if it does the ordering, I don't really see why we need
>>> F_FIRST/F_LAST.
>> 
>> I can see I'm a bit late to the party, but FWIW I agree with this:
>> FIRST/LAST will definitely be abused if we add it. It also seems to me
>
> See above on the issues w/o the first/last. How would you work around them
> in practice so they cannot happen?

By having an ordering configuration that is deterministic. Enforced by
the system-wide management daemon by whichever mechanism suits it. We
could implement a minimal reference policy agent that just reads a
config file in /etc somewhere, and *that* could implement FIRST/LAST
semantics.
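
Something like this minimal sketch, maybe (config path and format are made
up, and I'm assuming the opts-based attach API from this series with its
relative_fd member); it just enforces whatever order the config file lists:

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <bpf/bpf.h>

/* Each config line is the bpffs path of a program pinned by the tool that
 * owns it, in the order the admin wants them to run. The agent walks the
 * file and chains every program after the previous one via
 * BPF_F_AFTER + relative_fd.
 */
static int apply_policy(const char *conf, int ifindex)
{
        char path[PATH_MAX];
        int prev_fd = 0, prog_fd, err = 0;
        FILE *f = fopen(conf, "r");

        if (!f)
                return -errno;
        while (fgets(path, sizeof(path), f)) {
                path[strcspn(path, "\n")] = '\0';
                prog_fd = bpf_obj_get(path);
                if (prog_fd < 0) {
                        err = prog_fd;
                        break;
                }
                LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                        .flags = prev_fd ? BPF_F_AFTER : 0,
                        .relative_fd = prev_fd,
                );
                err = bpf_prog_attach_opts(prog_fd, ifindex,
                                           BPF_TCX_INGRESS, &opts);
                if (err)
                        break;
                prev_fd = prog_fd;
        }
        fclose(f);
        return err;
}

FIRST/LAST-style guarantees would then just be reserved slots in that
config file, enforced by the agent instead of by the kernel.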

>> to be policy in the kernel, which would be much better handled in
>> userspace like we do for so many other things. So we should rather
>> expose a hook to allow userspace to set the policy, as we've discussed
>> before; I definitely think we should add that at some point! Although
>> obviously it doesn't have to be part of this series...
>
> Imo, it would be better if we could avoid that.. it feels like we're
> trying to shoot sparrows with cannon, e.g. when this API gets reused
> for other attach hooks, then for each of them you need yet another
> policy program.

Or a single one that understands multiple program types. Sharing the
multi-prog implementation is helpful here.

> I don't think that's a good user experience, and I presume this is
> then single-user program, thus you'll run into the same race in the
> end - whichever management daemon or application gets to install this
> policy program first wins. This is potentially just shifting the same
> issue one level higher, imo.

Sure, we're shifting the problem one level higher, i.e., out of the
kernel. That's the point: this is better solved in userspace, so
different environments can solve it according to their needs :)

I'm not against having one policy agent on the system, I just don't
think the kernel should hard-code one particular solution to the policy
problem. Much better to merge this without it, and then iterate on
different options (and happy to help with this!), instead of locking the
UAPI into a single solution straight away.

-Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 11:04                 ` Toke Høiland-Jørgensen
@ 2023-06-09 12:34                   ` Timo Beckers
  2023-06-09 13:11                     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 49+ messages in thread
From: Timo Beckers @ 2023-06-09 12:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Daniel Borkmann,
	Stanislav Fomichev, Andrii Nakryiko
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>
>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>>>> would prevent the rest of the users.. (starting with only
>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>>>> need first/last).
>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>>>> then server setup is broken and my application cannot function.
>>>>>>>
>>>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>>>> and has to be reported to application owners.
>>>>>>>
>>>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>>>> for some applications, IMO.
>>>>>> Maybe that's something that should be done by some other mechanism?
>>>>>> (and as a follow up, if needed) Something akin to what Toke
>>>>>> mentioned with another program doing sorting or similar.
>>>>> The goal of this API is to avoid needing some extra special program to
>>>>> do this sorting
>>>>>
>>>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>>>> only we have two now, not u16.
>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>>>> course, but when they are needed, they will have no alternative.
>>>>>
>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>>>> be the only one attached". Should we encourage such use cases? No, of
>>>>> course. But I think it's fair  for users to be able to express this.
>>>>>
>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>>>> some issue, but it won't work because there is already a 'first' program
>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>>> If your production setup requires that some important program has to
>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>>>> interfere with it (assuming that FIRST requirement is a real
>>>>> requirement and not someone just thinking they need to be first; but
>>>>> that's up to user space to decide). Maybe the solution for you in that
>>>>> case would be freplace program installed on top of that stubborn FIRST
>>>>> program? And if we are talking about local debugging and development,
>>>>> then you are a sysadmin and you should be able to force-detach that
>>>>> program that is getting in the way.
>>>> I'm not really concerned about our production environment. It's pretty
>>>> controlled and restricted and I'm pretty certain we can avoid doing
>>>> something stupid. Probably the same for your env.
>>>>
>>>> I'm mostly fantasizing about upstream world where different users don't
>>>> know about each other and start doing stupid things like F_FIRST where
>>>> they don't really have to be first. It's that "used judiciously" part
>>>> that I'm a bit skeptical about :-D
>> But in the end how is that different from just attaching themselves blindly
>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
>> of the current first program) - same, they don't really have to be first.
>> How would that not result in doing something stupid? ;) To add to Andrii's
>> earlier DDoS mitigation example ... think of K8s environment: one project
>> is implementing DDoS mitigation with BPF, another one wants to monitor/
>> sample traffic to user space with BPF. Both install as first position by
>> default (before + 0). In K8s, there is no built-in Pod dependency management
>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
>> up in a situation where sometimes the monitor runs before the DDoS mitigation
>> and on some other nodes it's vice versa. The other case where this gets
>> broken (assuming a node where we get first the DDoS mitigation, then the
>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
>> gets a new stable update and is being re-rolled out, then it inserts
>> itself before the DDoS mitigation mechanism, potentially causing outage.
>> With the first/last mechanism these two situations cannot happen. The DDoS
>> mitigation software uses first and the monitoring uses before + 0, then no
>> matter the re-rollouts or the ordering in which Pods come up, it's always
>> at the expected/correct location.
> I'm not disputing that these kinds of policy issues need to be solved
> somehow. But adding the first/last pinning as part of the kernel hooks
> doesn't solve the policy problem, it just hard-codes a solution for one
> particular instance of the problem.
>
> Taking your example from above, what happens when someone wants to
> deploy those tools in reverse order? Say the monitoring tool counts
> packets and someone wants to also count the DDOS traffic; but the DDOS
> protection tool has decided for itself (by setting the FIRST) flag that
> it can *only* run as the first program, so there is no way to achieve
> this without modifying the application itself.
>
>>>> Because even with this new ordering scheme, there still should be
>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
>>>> And if it does the ordering, I don't really see why we need
>>>> F_FIRST/F_LAST.
>>> I can see I'm a bit late to the party, but FWIW I agree with this:
>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
It's in the prisoners' best interest to collaborate (and they do! see
https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
prio system is limiting and turns out to be really fragile in practice.

If your tool wants to attach to tc prio 1 and there's already a prog
attached, the most reliable option is basically to blindly replace the
attachment, unless you have the possibility to inspect the attached prog
and try to figure out if it belongs to another tool. This is fragile in
and of itself, and only possible on more recent kernels iirc.
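
For reference, "blindly replace" with the existing netlink-based libbpf tc
API looks roughly like this (sketch only, error handling omitted; this is
the status quo, not the new tcx API):

#include <bpf/libbpf.h>

/* Claim handle 1 / prio 1 on ingress and clobber whatever another tool
 * may have attached there before us.
 */
static int attach_prio1_blindly(int prog_fd, int ifindex)
{
        LIBBPF_OPTS(bpf_tc_hook, hook,
                .ifindex = ifindex,
                .attach_point = BPF_TC_INGRESS,
        );
        LIBBPF_OPTS(bpf_tc_opts, opts,
                .handle = 1,
                .priority = 1,
                .prog_fd = prog_fd,
                .flags = BPF_TC_F_REPLACE,
        );

        bpf_tc_hook_create(&hook); /* may already exist, ignored here */
        return bpf_tc_attach(&hook, &opts);
}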

With tcx, Cilium could make an initial attachment using F_FIRST and simply
update a link at a well-known path on subsequent startups. If there's no
existing link, and F_FIRST is taken, bail out with an error. The owner of the
existing F_FIRST program can be queried and logged; we know for sure the
program doesn't belong to Cilium, and we have no interest in detaching it.
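
Roughly, with the link API from this series (function and opts names are
taken from the libbpf patches, so treat them as assumptions; error handling
trimmed, pin path is hypothetical):

#include <errno.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#define LINK_PIN "/sys/fs/bpf/cilium/links/tcx-ingress-eth0"

static int attach_or_update(struct bpf_program *prog, int ifindex)
{
        struct bpf_link *link;
        int link_fd;

        /* Subsequent startups: swap the program behind the pinned link;
         * the position in the chain and the ownership are preserved.
         */
        link_fd = bpf_obj_get(LINK_PIN);
        if (link_fd >= 0)
                return bpf_link_update(link_fd, bpf_program__fd(prog), NULL);

        /* Initial attachment: claim the first slot and pin the link.
         * If F_FIRST is already taken, this fails and we bail out.
         */
        LIBBPF_OPTS(bpf_tcx_opts, opts, .flags = BPF_F_FIRST);
        link = bpf_program__attach_tcx(prog, ifindex, &opts);
        if (!link)
                return -errno;
        return bpf_link__pin(link, LINK_PIN);
}
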
>> See above on the issues w/o the first/last. How would you work around them
>> in practice so they cannot happen?
> By having an ordering configuration that is deterministic. Enforced by
> the system-wide management daemon by whichever mechanism suits it. We
> could implement a minimal reference policy agent that just reads a
> config file in /etc somewhere, and *that* could implement FIRST/LAST
> semantics.
I think this particular perspective is what's deadlocking this discussion.
To me, it looks like distros and hyperscalers are in the same boat with
regards to the possibility of coordination between tools. Distros are only
responsible for the tools they package themselves, and hyperscalers
run a tight ship with mostly in-house tooling already. When it comes to
projects out in the wild, that all goes out the window.

Regardless of merit or feasibility of a system-wide bpf management
daemon for k8s, there _is no ordering configuration possible_. K8s is not
a distro where package maintainers (or anyone else, really) can coordinate
on correctly defining priority of each of the tools they ship. This is
effectively the prisoner's dilemma. I feel like most of the discussion so
far has been very hand-wavy in 'user space should solve it'. Well, we are
user space, and we're here trying to solve it. :)

A hypothetical policy/gatekeeper/ordering daemon doesn't possess
implicit knowledge about which program needs to go where in the chain,
nor is there an obvious heuristic about how to order things. Maintaining
such a configuration for all cloud-native tooling out there that possibly
uses bpf is simply impossible, as even a tool like Cilium can change
dramatically from one release to the next. Having to manage this too
would put a significant burden on velocity and flexibility for arguably
little benefit to the user.

So, daemon/kernel will need to be told how to order things, preferably by
the tools (Cilium/datadog-agent) themselves, since the user/admin of the
system cannot be expected to know where to position the hundreds of progs
loaded by Cilium and how they might interfere with other tools. Figuring
this out is the job of the tool, daemon or not.

The prisoners _must_ communicate (so, not abuse F_FIRST) for things to
work correctly, and it's 100% in their best interest to do so. Let's not
pretend like we're able to solve game theory on this mailing list. :)
We'll have to settle for the next-best thing: give user space a safe and
clear API to allow it to coordinate and make the right decisions.

To circle back to the observability case: in offline discussions with
Daniel, I've mentioned the need for 'shadow' progs that only collect data
and pump it to user space, attached at specific points in the chain (still
within tcx!). Their retcodes would be ignored, and context modifications
would be rejected, so attaching multiple to the same hook can always
succeed, much like cgroup multi. Consider the following:

To attach a shadow prog before F_FIRST, a caller could use
F_BEFORE | F_FIRST | F_RDONLY. Attaching between first and the 'relative'
section: F_AFTER | F_FIRST | F_RDONLY, etc. The rdonly flag could even be
made redundant if a new prog/attach type is added for progs like these.
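
In (pseudo) code, and to be explicit that the rdonly flag is only my
hypothetical addition on top of this series, not something the series
itself defines:

#include <bpf/bpf.h>

/* Hypothetical flag; value picked arbitrarily for illustration. */
#define BPF_F_RDONLY    (1U << 15)

/* Attach a read-only "shadow" observer in front of whatever program holds
 * the first slot; since it cannot change verdicts or the ctx, it does not
 * conflict with the F_FIRST owner.
 */
static int attach_shadow(int shadow_prog_fd, int ifindex)
{
        LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                .flags = BPF_F_BEFORE | BPF_F_FIRST | BPF_F_RDONLY,
        );

        return bpf_prog_attach_opts(shadow_prog_fd, ifindex,
                                    BPF_TCX_INGRESS, &opts);
}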

This is still perfectly possible to implement on top of Daniel's proposal,
and to me looks like it could address many of the concerns around ordering
of progs I've seen in this thread, many of which mention data exfiltration.

Please give this some consideration; we've been trying to figure out a way
forward for years at this point. Try not to defer to a daemon too much, it
won't actually address any of the pain points with developing k8s tooling.

Thanks,

T
>>> to be policy in the kernel, which would be much better handled in
>>> userspace like we do for so many other things. So we should rather
>>> expose a hook to allow userspace to set the policy, as we've discussed
>>> before; I definitely think we should add that at some point! Although
>>> obviously it doesn't have to be part of this series...
>> Imo, it would be better if we could avoid that.. it feels like we're
>> trying to shoot sparrows with cannon, e.g. when this API gets reused
>> for other attach hooks, then for each of them you need yet another
>> policy program.
> Or a single one that understands multiple program types. Sharing the
> multi-prog implementation is helpful here.
>
>> I don't think that's a good user experience, and I presume this is
>> then single-user program, thus you'll run into the same race in the
>> end - whichever management daemon or application gets to install this
>> policy program first wins. This is potentially just shifting the same
>> issue one level higher, imo.
> Sure, we're shifting the problem one level higher, i.e., out of the
> kernel. That's the point: this is better solved in userspace, so
> different environments can solve it according to their needs :)
>
> I'm not against having one policy agent on the system, I just don't
> think the kernel should hard-code one particular solution to the policy
> problem. Much better to merge this without it, and then iterate on
> different options (and happy to help with this!), instead of locking the
> UAPI into a single solution straight away.
>
> -Toke
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 12:34                   ` Timo Beckers
@ 2023-06-09 13:11                     ` Toke Høiland-Jørgensen
  2023-06-09 14:15                       ` Daniel Borkmann
  2023-06-09 18:56                       ` Andrii Nakryiko
  0 siblings, 2 replies; 49+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-09 13:11 UTC (permalink / raw)
  To: Timo Beckers, Daniel Borkmann, Stanislav Fomichev, Andrii Nakryiko
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

Timo Beckers <timo@incline.eu> writes:

> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>
>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>>>>> would prevent the rest of the users.. (starting with only
>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>>>>> need first/last).
>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>>>>> then server setup is broken and my application cannot function.
>>>>>>>>
>>>>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>>>>> and has to be reported to application owners.
>>>>>>>>
>>>>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>>>>> for some applications, IMO.
>>>>>>> Maybe that's something that should be done by some other mechanism?
>>>>>>> (and as a follow up, if needed) Something akin to what Toke
>>>>>>> mentioned with another program doing sorting or similar.
>>>>>> The goal of this API is to avoid needing some extra special program to
>>>>>> do this sorting
>>>>>>
>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>>>>> only we have two now, not u16.
>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>>>>> course, but when they are needed, they will have no alternative.
>>>>>>
>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>>>>> be the only one attached". Should we encourage such use cases? No, of
>>>>>> course. But I think it's fair  for users to be able to express this.
>>>>>>
>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>>>>> some issue, but it won't work because there is already a 'first' program
>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>>>> If your production setup requires that some important program has to
>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>>>>> interfere with it (assuming that FIRST requirement is a real
>>>>>> requirement and not someone just thinking they need to be first; but
>>>>>> that's up to user space to decide). Maybe the solution for you in that
>>>>>> case would be freplace program installed on top of that stubborn FIRST
>>>>>> program? And if we are talking about local debugging and development,
>>>>>> then you are a sysadmin and you should be able to force-detach that
>>>>>> program that is getting in the way.
>>>>> I'm not really concerned about our production environment. It's pretty
>>>>> controlled and restricted and I'm pretty certain we can avoid doing
>>>>> something stupid. Probably the same for your env.
>>>>>
>>>>> I'm mostly fantasizing about upstream world where different users don't
>>>>> know about each other and start doing stupid things like F_FIRST where
>>>>> they don't really have to be first. It's that "used judiciously" part
>>>>> that I'm a bit skeptical about :-D
>>> But in the end how is that different from just attaching themselves blindly
>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
>>> of the current first program) - same, they don't really have to be first.
>>> How would that not result in doing something stupid? ;) To add to Andrii's
>>> earlier DDoS mitigation example ... think of K8s environment: one project
>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
>>> sample traffic to user space with BPF. Both install as first position by
>>> default (before + 0). In K8s, there is no built-in Pod dependency management
>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
>>> and on some other nodes it's vice versa. The other case where this gets
>>> broken (assuming a node where we get first the DDoS mitigation, then the
>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
>>> gets a new stable update and is being re-rolled out, then it inserts
>>> itself before the DDoS mitigation mechanism, potentially causing outage.
>>> With the first/last mechanism these two situations cannot happen. The DDoS
>>> mitigation software uses first and the monitoring uses before + 0, then no
>>> matter the re-rollouts or the ordering in which Pods come up, it's always
>>> at the expected/correct location.
>> I'm not disputing that these kinds of policy issues need to be solved
>> somehow. But adding the first/last pinning as part of the kernel hooks
>> doesn't solve the policy problem, it just hard-codes a solution for one
>> particular instance of the problem.
>>
>> Taking your example from above, what happens when someone wants to
>> deploy those tools in reverse order? Say the monitoring tool counts
>> packets and someone wants to also count the DDOS traffic; but the DDOS
>> protection tool has decided for itself (by setting the FIRST) flag that
>> it can *only* run as the first program, so there is no way to achieve
>> this without modifying the application itself.
>>
>>>>> Because even with this new ordering scheme, there still should be
>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
>>>>> And if it does the ordering, I don't really see why we need
>>>>> F_FIRST/F_LAST.
>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
> It's in the prisoners' best interest to collaborate (and they do! see
> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
> prio system is limiting and turns out to be really fragile in practice.
>
> If your tool wants to attach to tc prio 1 and there's already a prog
> attached, the most reliable option is basically to blindly replace the
> attachment, unless you have the possibility to inspect the attached prog
> and try to figure out if it belongs to another tool. This is fragile in
> and of itself, and only possible on more recent kernels iirc.
>
> With tcx, Cilium could make an initial attachment using F_FIRST and simply
> update a link at a well-known path on subsequent startups. If there's no
> existing link, and F_FIRST is taken, bail out with an error. The owner of the
> existing F_FIRST program can be queried and logged; we know for sure the
> program doesn't belong to Cilium, and we have no interest in detaching it.

That's conflating the benefit of F_FIRST with that of bpf_link, though;
you can have the replace thing without the exclusive locking.

>>> See above on the issues w/o the first/last. How would you work around them
>>> in practice so they cannot happen?
>> By having an ordering configuration that is deterministic. Enforced by
>> the system-wide management daemon by whichever mechanism suits it. We
>> could implement a minimal reference policy agent that just reads a
>> config file in /etc somewhere, and *that* could implement FIRST/LAST
>> semantics.
> I think this particular perspective is what's deadlocking this discussion.
> To me, it looks like distros and hyperscalers are in the same boat with
> regards to the possibility of coordination between tools. Distros are only
> responsible for the tools they package themselves, and hyperscalers
> run a tight ship with mostly in-house tooling already. When it comes to
> projects out in the wild, that all goes out the window.

Not really: from the distro PoV we absolutely care about arbitrary
combinations of programs with different authors. Which is why I'm
arguing against putting anything into the kernel where the first program
to come along can just grab a hook and lock everyone out.

My assumption is basically this: A system administrator installs
packages A and B that both use the TC hook. The developers of A and B
have never heard about each other. It should be possible for that admin
to run A and B in whichever order they like, without making any changes
to A and B themselves.

> Regardless of merit or feasibility of a system-wide bpf management
> daemon for k8s, there _is no ordering configuration possible_. K8s is not
> a distro where package maintainers (or anyone else, really) can coordinate
> on correctly defining priority of each of the tools they ship. This is
> effectively the prisoner's dilemma. I feel like most of the discussion so
> far has been very hand-wavy in 'user space should solve it'. Well, we are
> user space, and we're here trying to solve it. :)
>
> A hypothetical policy/gatekeeper/ordering daemon doesn't possess
> implicit knowledge about which program needs to go where in the chain,
> nor is there an obvious heuristic about how to order things. Maintaining
> such a configuration for all cloud-native tooling out there that possibly
> uses bpf is simply impossible, as even a tool like Cilium can change
> dramatically from one release to the next. Having to manage this too
> would put a significant burden on velocity and flexibility for arguably
> little benefit to the user.
>
> So, daemon/kernel will need to be told how to order things, preferably by
> the tools (Cilium/datadog-agent) themselves, since the user/admin of the
> system cannot be expected to know where to position the hundreds of progs
> loaded by Cilium and how they might interfere with other tools. Figuring
> this out is the job of the tool, daemon or not.
>
> The prisoners _must_ communicate (so, not abuse F_FIRST) for things to
> work correctly, and it's 100% in their best interest to do so. Let's not
> pretend like we're able to solve game theory on this mailing list. :)
> We'll have to settle for the next-best thing: give user space a safe and
> clear API to allow it to coordinate and make the right decisions.

But "always first" is not a meaningful concept. It's just what we have
today (everyone picks priority 1), except now if there are two programs
that want the same hook, it will be the first program that wins the
contest (by locking the second one out), instead of the second program
winning (by overriding the first one) as is the case with the silent
override semantics we have with TC today. So we haven't solved the
problem, we've just shifted the breakage.

> To circle back to the observability case: in offline discussions with
> Daniel, I've mentioned the need for 'shadow' progs that only collect data
> and pump it to user space, attached at specific points in the chain (still
> within tcx!). Their retcodes would be ignored, and context modifications
> would be rejected, so attaching multiple to the same hook can always
> succeed, much like cgroup multi. Consider the following:
>
> To attach a shadow prog before F_FIRST, a caller could use
> F_BEFORE | F_FIRST | F_RDONLY. Attaching between first and the 'relative'
> section: F_AFTER | F_FIRST | F_RDONLY, etc. The rdonly flag could even be
> made redundant if a new prog/attach type is added for progs like these.
>
> This is still perfectly possible to implement on top of Daniel's proposal,
> and to me looks like it could address many of the concerns around ordering
> of progs I've seen in this thread, many of which mention data exfiltration.

It may well be that semantics like this will turn out to be enough. Or
it may not (I personally believe we'll need something more expressive
still, and where the system admin has the option to override things; but
I may turn out to be wrong). Ultimately, my main point wrt this series
is that this kind of policy decision can be added later, and it's better
to merge the TCX infrastructure without it, instead of locking ourselves
into an API that is way too limited today. TCX (and in-kernel XDP
multiprog) has value without it, so let's merge that first and iterate
on the policy aspects.

-Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 13:11                     ` Toke Høiland-Jørgensen
@ 2023-06-09 14:15                       ` Daniel Borkmann
  2023-06-09 16:41                         ` Stanislav Fomichev
                                           ` (3 more replies)
  2023-06-09 18:56                       ` Andrii Nakryiko
  1 sibling, 4 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-09 14:15 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Timo Beckers,
	Stanislav Fomichev, Andrii Nakryiko
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

On 6/9/23 3:11 PM, Toke Høiland-Jørgensen wrote:
> Timo Beckers <timo@incline.eu> writes:
>> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
>>> Daniel Borkmann <daniel@iogearbox.net> writes:
[...]
>>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>>>>>> would prevent the rest of the users.. (starting with only
>>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>>>>>> need first/last).
>>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>>>>>> then server setup is broken and my application cannot function.
>>>>>>>>>
>>>>>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>>>>>> and has to be reported to application owners.
>>>>>>>>>
>>>>>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>>>>>> for some applications, IMO.
>>>>>>>> Maybe that's something that should be done by some other mechanism?
>>>>>>>> (and as a follow up, if needed) Something akin to what Toke
>>>>>>>> mentioned with another program doing sorting or similar.
>>>>>>> The goal of this API is to avoid needing some extra special program to
>>>>>>> do this sorting
>>>>>>>
>>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>>>>>> only we have two now, not u16.
>>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>>>>>> course, but when they are needed, they will have no alternative.
>>>>>>>
>>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>>>>>> be the only one attached". Should we encourage such use cases? No, of
>>>>>>> course. But I think it's fair  for users to be able to express this.
>>>>>>>
>>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>>>>>> some issue, but it won't work because there is already a 'first' program
>>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>>>>> If your production setup requires that some important program has to
>>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>>>>>> interfere with it (assuming that FIRST requirement is a real
>>>>>>> requirement and not someone just thinking they need to be first; but
>>>>>>> that's up to user space to decide). Maybe the solution for you in that
>>>>>>> case would be freplace program installed on top of that stubborn FIRST
>>>>>>> program? And if we are talking about local debugging and development,
>>>>>>> then you are a sysadmin and you should be able to force-detach that
>>>>>>> program that is getting in the way.
>>>>>> I'm not really concerned about our production environment. It's pretty
>>>>>> controlled and restricted and I'm pretty certain we can avoid doing
>>>>>> something stupid. Probably the same for your env.
>>>>>>
>>>>>> I'm mostly fantasizing about upstream world where different users don't
>>>>>> know about each other and start doing stupid things like F_FIRST where
>>>>>> they don't really have to be first. It's that "used judiciously" part
>>>>>> that I'm a bit skeptical about :-D
>>>> But in the end how is that different from just attaching themselves blindly
>>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
>>>> of the current first program) - same, they don't really have to be first.
>>>> How would that not result in doing something stupid? ;) To add to Andrii's
>>>> earlier DDoS mitigation example ... think of K8s environment: one project
>>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
>>>> sample traffic to user space with BPF. Both install as first position by
>>>> default (before + 0). In K8s, there is no built-in Pod dependency management
>>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
>>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
>>>> and on some other nodes it's vice versa. The other case where this gets
>>>> broken (assuming a node where we get first the DDoS mitigation, then the
>>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
>>>> gets a new stable update and is being re-rolled out, then it inserts
>>>> itself before the DDoS mitigation mechanism, potentially causing outage.
>>>> With the first/last mechanism these two situations cannot happen. The DDoS
>>>> mitigation software uses first and the monitoring uses before + 0, then no
>>>> matter the re-rollouts or the ordering in which Pods come up, it's always
>>>> at the expected/correct location.
>>> I'm not disputing that these kinds of policy issues need to be solved
>>> somehow. But adding the first/last pinning as part of the kernel hooks
>>> doesn't solve the policy problem, it just hard-codes a solution for one
>>> particular instance of the problem.
>>>
>>> Taking your example from above, what happens when someone wants to
>>> deploy those tools in reverse order? Say the monitoring tool counts
>>> packets and someone wants to also count the DDOS traffic; but the DDOS
>>> protection tool has decided for itself (by setting the FIRST) flag that
>>> it can *only* run as the first program, so there is no way to achieve
>>> this without modifying the application itself.
>>>
>>>>>> Because even with this new ordering scheme, there still should be
>>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
>>>>>> And if it does the ordering, I don't really see why we need
>>>>>> F_FIRST/F_LAST.
>>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
>>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
>> It's in the prisoners' best interest to collaborate (and they do! see
>> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
>> prio system is limiting and turns out to be really fragile in practice.
>>
>> If your tool wants to attach to tc prio 1 and there's already a prog
>> attached,
>> the most reliable option is basically to blindly replace the attachment,
>> unless
>> you have the possibility to inspect the attached prog and try to figure
>> out if it
>> belongs to another tool. This is fragile in and of itself, and only
>> possible on
>> more recent kernels iirc.
>>
>> With tcx, Cilium could make an initial attachment using F_FIRST and simply
>> update a link at well-known path on subsequent startups. If there's no
>> existing
>> link, and F_FIRST is taken, bail out with an error. The owner of the
>> existing
>> F_FIRST program can be queried and logged; we know for sure the program
>> doesn't belong to Cilium, and we have no interest in detaching it.
> 
> That's conflating the benefit of F_FIRST with that of bpf_link, though;
> you can have the replace thing without the exclusive locking.
> 
>>>> See above on the issues w/o the first/last. How would you work around them
>>>> in practice so they cannot happen?
>>> By having an ordering configuration that is deterministic. Enforced by
>>> the system-wide management daemon by whichever mechanism suits it. We
>>> could implement a minimal reference policy agent that just reads a
>>> config file in /etc somewhere, and *that* could implement FIRST/LAST
>>> semantics.
>> I think this particular perspective is what's deadlocking this discussion.
>> To me, it looks like distros and hyperscalers are in the same boat with
>> regards to the possibility of coordination between tools. Distros are only
>> responsible for the tools they package themselves, and hyperscalers
>> run a tight ship with mostly in-house tooling already. When it comes to
>> projects out in the wild, that all goes out the window.
> 
> Not really: from the distro PoV we absolutely care about arbitrary
> combinations of programs with different authors. Which is why I'm
> arguing against putting anything into the kernel where the first program
> to come along can just grab a hook and lock everyone out.
> 
> My assumption is basically this: A system administrator installs
> packages A and B that both use the TC hook. The developers of A and B
> have never heard about each other. It should be possible for that admin
> to run A and B in whichever order they like, without making any changes
> to A and B themselves.

I would come at this from the point of view of the K8s cluster operator or
platform engineer, if you will: someone deeply familiar with K8s, but not
necessarily knowledgeable about kernel internals. I know my org needs to run
container A and container B, so I'll deploy the daemon-sets for both and they
get deployed into my cluster. That platform engineer might never have heard
of BPF or might not even know that container A or container B ships software
with BPF. As mentioned, K8s itself has no concept of Pod ordering since its
paradigm is that everything is loosely coupled. We are now expecting that
person to make a concrete decision about BPF kernel internals, namely in
which order programs on the various hooks should be executed, given that if
they don't, the system becomes non-deterministic. I think that is quite a big
burden and a big ask. Eventually that person will say that he/she cannot make
this technical decision and that only one of the two containers can be
deployed. I agree with you that there should be an option for a technically
versed person to change the ordering to avoid lock-out, but I don't think it
will fly to ask users to come up on their own with policies for BPF software
in the wild ... similar to how you probably don't want to have to write
systemd unit files for software xyz before you can use your laptop. It's a
burden. You expect this to magically work by default, and to make custom
changes only if needed for good reasons. The one difference is that the
latter ships with the OS (a priori known / the tight-ship analogy).

>> Regardless of merit or feasability of a system-wide bpf management
>> daemon for k8s, there _is no ordering configuration possible_. K8s is not
>> a distro where package maintainers (or anyone else, really) can coordinate
>> on correctly defining priority of each of the tools they ship. This is
>> effectively
>> the prisoner's dilemma. I feel like most of the discussion so far has been
>> very hand-wavy in 'user space should solve it'. Well, we are user space, and
>> we're here trying to solve it. :)
>>
>> A hypothetical policy/gatekeeper/ordering daemon doesn't possess
>> implicit knowledge about which program needs to go where in the chain,
>> nor is there an obvious heuristic about how to order things. Maintaining
>> such a configuration for all cloud-native tooling out there that possibly
>> uses bpf is simply impossible, as even a tool like Cilium can change
>> dramatically from one release to the next. Having to manage this too
>> would put a significant burden on velocity and flexibility for arguably
>> little benefit to the user.
>>
>> So, daemon/kernel will need to be told how to order things, preferably by
>> the tools (Cilium/datadog-agent) themselves, since the user/admin of the
>> system cannot be expected to know where to position the hundreds of progs
>> loaded by Cilium and how they might interfere with other tools. Figuring
>> this out is the job of the tool, daemon or not.
>>
>> The prisoners _must_ communicate (so, not abuse F_FIRST) for things to
>> work correctly, and it's 100% in their best interest in doing so. Let's not
>> pretend like we're able to solve game theory on this mailing list. :)
>> We'll have to settle for the next-best thing: give user space a safe and
>> clear
>> API to allow it to coordinate and make the right decisions.
> 
> But "always first" is not a meaningful concept. It's just what we have
> today (everyone picks priority 1), except now if there are two programs
> that want the same hook, it will be the first program that wins the
> contest (by locking the second one out), instead of the second program
> winning (by overriding the first one) as is the case with the silent
> override semantics we have with TC today. So we haven't solved the
> problem, we've just shifted the breakage.

Fwiw, it's deterministic, and I think this is 1000x better than silently
having a non-deterministic deployment where the two programs ship with
before + 0. That is much harder to debug.
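
To make the difference concrete, here is a minimal sketch, assuming the
opts-based attach API and flag names from the v2 patches in this series
(BPF_TCX_INGRESS, BPF_F_BEFORE, BPF_F_FIRST and the extended
bpf_prog_attach_opts; these may still change in later revisions):

#include <bpf/bpf.h>

/* "before + 0": insert at the current head of the tcx chain. Another
 * "before + 0" attachment made later silently lands in front of this
 * one, so the final order depends on which tool started last.
 */
static int attach_monitor_default(int prog_fd, int ifindex)
{
        LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                    .flags = BPF_F_BEFORE); /* relative_fd/relative_id stay 0 */

        return bpf_prog_attach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &opts);
}

/* F_FIRST: claim the head position. A conflicting attachment fails
 * loudly instead of silently reordering the chain, which is what makes
 * the resulting order deterministic.
 */
static int attach_ddos_first(int prog_fd, int ifindex)
{
        LIBBPF_OPTS(bpf_prog_attach_opts, opts, .flags = BPF_F_FIRST);

        return bpf_prog_attach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &opts);
}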

>> To circle back to the observability case: in offline discussions with
>> Daniel,
>> I've mentioned the need for 'shadow' progs that only collect data and
>> pump it to user space, attached at specific points in the chain (still
>> within tcx!).
>> Their retcodes would be ignored, and context modifications would be
>> rejected, so attaching multiple to the same hook can always succeed,
>> much like cgroup multi. Consider the following:
>>
>> To attach a shadow prog before F_FIRST, a caller could use F_BEFORE |
>> F_FIRST |
>> F_RDONLY. Attaching between first and the 'relative' section: F_AFTER |
>> F_FIRST |
>> F_RDONLY, etc. The rdonly flag could even be made redundant if a new prog/
>> attach type is added for progs like these.
>>
>> This is still perfectly possible to implement on top of Daniel's
>> proposal, and
>> to me looks like it could address many of the concerns around ordering of
>> progs I've seen in this thread, many mention data exfiltration.
> 
> It may well be that semantics like this will turn out to be enough. Or
> it may not (I personally believe we'll need something more expressive
> still, and where the system admin has the option to override things; but
> I may turn out to be wrong). Ultimately, my main point wrt this series
> is that this kind of policy decision can be added later, and it's better
> to merge the TCX infrastructure without it, instead of locking ourselves
> into an API that is way too limited today. TCX (and in-kernel XDP
> multiprog) has value without it, so let's merge that first and iterate
> on the policy aspects.

That's okay and I'll do that for v3 to move on.

I feel we might repeat the same discussion with no good solution for K8s
users once we come back to this point again.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 14:15                       ` Daniel Borkmann
@ 2023-06-09 16:41                         ` Stanislav Fomichev
  2023-06-09 19:03                           ` Andrii Nakryiko
  2023-06-09 18:58                         ` Andrii Nakryiko
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 49+ messages in thread
From: Stanislav Fomichev @ 2023-06-09 16:41 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Toke Høiland-Jørgensen, Timo Beckers, Andrii Nakryiko,
	ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

On Fri, Jun 9, 2023 at 7:15 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/9/23 3:11 PM, Toke Høiland-Jørgensen wrote:
> > Timo Beckers <timo@incline.eu> writes:
> >> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
> >>> Daniel Borkmann <daniel@iogearbox.net> writes:
> [...]
> >>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
> >>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
> >>>>>>>>>> would prevent the rest of the users.. (starting with only
> >>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> >>>>>>>>>> need first/laste).
> >>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
> >>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
> >>>>>>>>> to guarantee that my program runs first and observes each event, I'll
> >>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
> >>>>>>>>> then server setup is broken and my application cannot function.
> >>>>>>>>>
> >>>>>>>>> In a setup where we expect multiple applications to co-exist, it
> >>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
> >>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
> >>>>>>>>> and has to be reported to application owners.
> >>>>>>>>>
> >>>>>>>>> But it's not up to the kernel to enforce this cooperation by
> >>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
> >>>>>>>>> for some applications, IMO.
> >>>>>>>> Maybe that's something that should be done by some other mechanism?
> >>>>>>>> (and as a follow up, if needed) Something akin to what Toke
> >>>>>>>> mentioned with another program doing sorting or similar.
> >>>>>>> The goal of this API is to avoid needing some extra special program to
> >>>>>>> do this sorting
> >>>>>>>
> >>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
> >>>>>>>> only we have two now, not u16.
> >>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
> >>>>>>> course, but when they are needed, they will have no alternative.
> >>>>>>>
> >>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
> >>>>>>> be the only one attached". Should we encourage such use cases? No, of
> >>>>>>> course. But I think it's fair  for users to be able to express this.
> >>>>>>>
> >>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
> >>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
> >>>>>>>> some issue, but it won't work because there is already a 'first' program
> >>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
> >>>>>>> If your production setup requires that some important program has to
> >>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
> >>>>>>> interfere with it (assuming that FIRST requirement is a real
> >>>>>>> requirement and not someone just thinking they need to be first; but
> >>>>>>> that's up to user space to decide). Maybe the solution for you in that
> >>>>>>> case would be freplace program installed on top of that stubborn FIRST
> >>>>>>> program? And if we are talking about local debugging and development,
> >>>>>>> then you are a sysadmin and you should be able to force-detach that
> >>>>>>> program that is getting in the way.
> >>>>>> I'm not really concerned about our production environment. It's pretty
> >>>>>> controlled and restricted and I'm pretty certain we can avoid doing
> >>>>>> something stupid. Probably the same for your env.
> >>>>>>
> >>>>>> I'm mostly fantasizing about upstream world where different users don't
> >>>>>> know about each other and start doing stupid things like F_FIRST where
> >>>>>> they don't really have to be first. It's that "used judiciously" part
> >>>>>> that I'm a bit skeptical about :-D
> >>>> But in the end how is that different from just attaching themselves blindly
> >>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
> >>>> of the current first program) - same, they don't really have to be first.
> >>>> How would that not result in doing something stupid? ;) To add to Andrii's
> >>>> earlier DDoS mitigation example ... think of K8s environment: one project
> >>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
> >>>> sample traffic to user space with BPF. Both install as first position by
> >>>> default (before + 0). In K8s, there is no built-in Pod dependency management
> >>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
> >>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
> >>>> and on some other nodes it's vice versa. The other case where this gets
> >>>> broken (assuming a node where we get first the DDoS mitigation, then the
> >>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
> >>>> gets a new stable update and is being re-rolled out, then it inserts
> >>>> itself before the DDoS mitigation mechanism, potentially causing outage.
> >>>> With the first/last mechanism these two situations cannot happen. The DDoS
> >>>> mitigation software uses first and the monitoring uses before + 0, then no
> >>>> matter the re-rollouts or the ordering in which Pods come up, it's always
> >>>> at the expected/correct location.
> >>> I'm not disputing that these kinds of policy issues need to be solved
> >>> somehow. But adding the first/last pinning as part of the kernel hooks
> >>> doesn't solve the policy problem, it just hard-codes a solution for one
> >>> particular instance of the problem.
> >>>
> >>> Taking your example from above, what happens when someone wants to
> >>> deploy those tools in reverse order? Say the monitoring tool counts
> >>> packets and someone wants to also count the DDOS traffic; but the DDOS
> >>> protection tool has decided for itself (by setting the FIRST) flag that
> >>> it can *only* run as the first program, so there is no way to achieve
> >>> this without modifying the application itself.
> >>>
> >>>>>> Because even with this new ordering scheme, there still should be
> >>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
> >>>>>> And if it does the ordering, I don't really see why we need
> >>>>>> F_FIRST/F_LAST.
> >>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
> >>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
> >> It's in the prisoners' best interest to collaborate (and they do! see
> >> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
> >> prio system is limiting and turns out to be really fragile in practice.
> >>
> >> If your tool wants to attach to tc prio 1 and there's already a prog
> >> attached,
> >> the most reliable option is basically to blindly replace the attachment,
> >> unless
> >> you have the possibility to inspect the attached prog and try to figure
> >> out if it
> >> belongs to another tool. This is fragile in and of itself, and only
> >> possible on
> >> more recent kernels iirc.
> >>
> >> With tcx, Cilium could make an initial attachment using F_FIRST and simply
> >> update a link at well-known path on subsequent startups. If there's no
> >> existing
> >> link, and F_FIRST is taken, bail out with an error. The owner of the
> >> existing
> >> F_FIRST program can be queried and logged; we know for sure the program
> >> doesn't belong to Cilium, and we have no interest in detaching it.
> >
> > That's conflating the benefit of F_FIRST with that of bpf_link, though;
> > you can have the replace thing without the exclusive locking.
> >
> >>>> See above on the issues w/o the first/last. How would you work around them
> >>>> in practice so they cannot happen?
> >>> By having an ordering configuration that is deterministic. Enforced by
> >>> the system-wide management daemon by whichever mechanism suits it. We
> >>> could implement a minimal reference policy agent that just reads a
> >>> config file in /etc somewhere, and *that* could implement FIRST/LAST
> >>> semantics.
> >> I think this particular perspective is what's deadlocking this discussion.
> >> To me, it looks like distros and hyperscalers are in the same boat with
> >> regards to the possibility of coordination between tools. Distros are only
> >> responsible for the tools they package themselves, and hyperscalers
> >> run a tight ship with mostly in-house tooling already. When it comes to
> >> projects out in the wild, that all goes out the window.
> >
> > Not really: from the distro PoV we absolutely care about arbitrary
> > combinations of programs with different authors. Which is why I'm
> > arguing against putting anything into the kernel where the first program
> > to come along can just grab a hook and lock everyone out.
> >
> > My assumption is basically this: A system administrator installs
> > packages A and B that both use the TC hook. The developers of A and B
> > have never heard about each other. It should be possible for that admin
> > to run A and B in whichever order they like, without making any changes
> > to A and B themselves.
>
> I would come at this from the point of view of the K8s cluster operator or
> platform engineer, if you will: someone deeply familiar with K8s, but not
> necessarily knowledgeable about kernel internals. I know my org needs to run
> container A and container B, so I'll deploy the daemon-sets for both and they
> get deployed into my cluster. That platform engineer might never have heard
> of BPF or might not even know that container A or container B ships software
> with BPF. As mentioned, K8s itself has no concept of Pod ordering since its
> paradigm is that everything is loosely coupled. We are now expecting that
> person to make a concrete decision about BPF kernel internals, namely in
> which order programs on the various hooks should be executed, given that if
> they don't, the system becomes non-deterministic. I think that is quite a big
> burden and a big ask. Eventually that person will say that he/she cannot make
> this technical decision and that only one of the two containers can be
> deployed. I agree with you that there should be an option for a technically
> versed person to change the ordering to avoid lock-out, but I don't think it
> will fly to ask users to come up on their own with policies for BPF software
> in the wild ... similar to how you probably don't want to have to write
> systemd unit files for software xyz before you can use your laptop. It's a
> burden. You expect this to magically work by default, and to make custom
> changes only if needed for good reasons. The one difference is that the
> latter ships with the OS (a priori known / the tight-ship analogy).
>
> >> Regardless of merit or feasability of a system-wide bpf management
> >> daemon for k8s, there _is no ordering configuration possible_. K8s is not
> >> a distro where package maintainers (or anyone else, really) can coordinate
> >> on correctly defining priority of each of the tools they ship. This is
> >> effectively
> >> the prisoner's dilemma. I feel like most of the discussion so far has been
> >> very hand-wavy in 'user space should solve it'. Well, we are user space, and
> >> we're here trying to solve it. :)
> >>
> >> A hypothetical policy/gatekeeper/ordering daemon doesn't possess
> >> implicit knowledge about which program needs to go where in the chain,
> >> nor is there an obvious heuristic about how to order things. Maintaining
> >> such a configuration for all cloud-native tooling out there that possibly
> >> uses bpf is simply impossible, as even a tool like Cilium can change
> >> dramatically from one release to the next. Having to manage this too
> >> would put a significant burden on velocity and flexibility for arguably
> >> little benefit to the user.
> >>
> >> So, daemon/kernel will need to be told how to order things, preferably by
> >> the tools (Cilium/datadog-agent) themselves, since the user/admin of the
> >> system cannot be expected to know where to position the hundreds of progs
> >> loaded by Cilium and how they might interfere with other tools. Figuring
> >> this out is the job of the tool, daemon or not.
> >>
> >> The prisoners _must_ communicate (so, not abuse F_FIRST) for things to
> >> work correctly, and it's 100% in their best interest in doing so. Let's not
> >> pretend like we're able to solve game theory on this mailing list. :)
> >> We'll have to settle for the next-best thing: give user space a safe and
> >> clear
> >> API to allow it to coordinate and make the right decisions.
> >
> > But "always first" is not a meaningful concept. It's just what we have
> > today (everyone picks priority 1), except now if there are two programs
> > that want the same hook, it will be the first program that wins the
> > contest (by locking the second one out), instead of the second program
> > winning (by overriding the first one) as is the case with the silent
> > override semantics we have with TC today. So we haven't solved the
> > problem, we've just shifted the breakage.
>
> Fwiw, it's deterministic, and I think this is 1000x better than silently
> having a non-deterministic deployment where the two programs ship with
> before + 0. That is much harder to debug.
>
> >> To circle back to the observability case: in offline discussions with
> >> Daniel,
> >> I've mentioned the need for 'shadow' progs that only collect data and
> >> pump it to user space, attached at specific points in the chain (still
> >> within tcx!).
> >> Their retcodes would be ignored, and context modifications would be
> >> rejected, so attaching multiple to the same hook can always succeed,
> >> much like cgroup multi. Consider the following:
> >>
> >> To attach a shadow prog before F_FIRST, a caller could use F_BEFORE |
> >> F_FIRST |
> >> F_RDONLY. Attaching between first and the 'relative' section: F_AFTER |
> >> F_FIRST |
> >> F_RDONLY, etc. The rdonly flag could even be made redundant if a new prog/
> >> attach type is added for progs like these.
> >>
> >> This is still perfectly possible to implement on top of Daniel's
> >> proposal, and
> >> to me looks like it could address many of the concerns around ordering of
> >> progs I've seen in this thread, many mention data exfiltration.
> >
> > It may well be that semantics like this will turn out to be enough. Or
> > it may not (I personally believe we'll need something more expressive
> > still, and where the system admin has the option to override things; but
> > I may turn out to be wrong). Ultimately, my main point wrt this series
> > is that this kind of policy decision can be added later, and it's better
> > to merge the TCX infrastructure without it, instead of locking ourselves
> > into an API that is way too limited today. TCX (and in-kernel XDP
> > multiprog) has value without it, so let's merge that first and iterate
> > on the policy aspects.
>
> That's okay and I'll do that for v3 to move on.
>
> I feel we might repeat the same discussion with no good solution for K8s
> users once we come back to this point again.

With your Cilium vs DDoS example, maybe all we really need is for the
program to have some signal about whether it's ok to have somebody
modify/drop the packets before it?
For example, the verifier, depending on whether it sees that the
program writes to the data, uses some helpers, or returns
TC_ACT_SHOT/etc, can classify the program as readonly or non-readonly.
And then, we'll have some extra flag during program load/attach that
Cilium will pass to express "I'm not ok with having a non-readonly
program before me".

Seems doable? If it makes sense, we can try to do this as a follow up.
It should solve some simple cases without an external arbiter.
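
A purely hypothetical sketch of what the attach side of that could look
like; BPF_F_EXPECT_RDONLY_BEFORE does not exist anywhere and is invented
here only to illustrate the idea, on top of the opts-based attach API from
this series:

#include <bpf/bpf.h>

#define BPF_F_EXPECT_RDONLY_BEFORE (1U << 30) /* hypothetical flag */

static int attach_ddos_prog(int prog_fd, int ifindex)
{
        /* "I'm not ok with a non-readonly program before me": the kernel
         * would reject this attachment, or a later conflicting one, if a
         * packet-modifying program ends up ahead of this prog.
         */
        LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                    .flags = BPF_F_EXPECT_RDONLY_BEFORE);

        return bpf_prog_attach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &opts);
}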

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 13:11                     ` Toke Høiland-Jørgensen
  2023-06-09 14:15                       ` Daniel Borkmann
@ 2023-06-09 18:56                       ` Andrii Nakryiko
  2023-06-09 20:08                         ` Alexei Starovoitov
  2023-06-09 20:20                         ` Toke Høiland-Jørgensen
  1 sibling, 2 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-09 18:56 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Timo Beckers, Daniel Borkmann, Stanislav Fomichev, ast, andrii,
	martin.lau, razor, john.fastabend, kuba, dxu, joe, davem, bpf,
	netdev

On Fri, Jun 9, 2023 at 6:11 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Timo Beckers <timo@incline.eu> writes:
>
> > On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
> >> Daniel Borkmann <daniel@iogearbox.net> writes:
> >>
> >>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
> >>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
> >>>>>>>>> would prevent the rest of the users.. (starting with only
> >>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> >>>>>>>>> need first/laste).
> >>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
> >>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
> >>>>>>>> to guarantee that my program runs first and observes each event, I'll
> >>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
> >>>>>>>> then server setup is broken and my application cannot function.
> >>>>>>>>
> >>>>>>>> In a setup where we expect multiple applications to co-exist, it
> >>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
> >>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
> >>>>>>>> and has to be reported to application owners.
> >>>>>>>>
> >>>>>>>> But it's not up to the kernel to enforce this cooperation by
> >>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
> >>>>>>>> for some applications, IMO.
> >>>>>>> Maybe that's something that should be done by some other mechanism?
> >>>>>>> (and as a follow up, if needed) Something akin to what Toke
> >>>>>>> mentioned with another program doing sorting or similar.
> >>>>>> The goal of this API is to avoid needing some extra special program to
> >>>>>> do this sorting
> >>>>>>
> >>>>>>> Otherwise, those first/last are just plain simple old priority bands;
> >>>>>>> only we have two now, not u16.
> >>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
> >>>>>> course, but when they are needed, they will have no alternative.
> >>>>>>
> >>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
> >>>>>> be the only one attached". Should we encourage such use cases? No, of
> >>>>>> course. But I think it's fair  for users to be able to express this.
> >>>>>>
> >>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
> >>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
> >>>>>>> some issue, but it won't work because there is already a 'first' program
> >>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
> >>>>>> If your production setup requires that some important program has to
> >>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
> >>>>>> interfere with it (assuming that FIRST requirement is a real
> >>>>>> requirement and not someone just thinking they need to be first; but
> >>>>>> that's up to user space to decide). Maybe the solution for you in that
> >>>>>> case would be freplace program installed on top of that stubborn FIRST
> >>>>>> program? And if we are talking about local debugging and development,
> >>>>>> then you are a sysadmin and you should be able to force-detach that
> >>>>>> program that is getting in the way.
> >>>>> I'm not really concerned about our production environment. It's pretty
> >>>>> controlled and restricted and I'm pretty certain we can avoid doing
> >>>>> something stupid. Probably the same for your env.
> >>>>>
> >>>>> I'm mostly fantasizing about upstream world where different users don't
> >>>>> know about each other and start doing stupid things like F_FIRST where
> >>>>> they don't really have to be first. It's that "used judiciously" part
> >>>>> that I'm a bit skeptical about :-D
> >>> But in the end how is that different from just attaching themselves blindly
> >>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
> >>> of the current first program) - same, they don't really have to be first.
> >>> How would that not result in doing something stupid? ;) To add to Andrii's
> >>> earlier DDoS mitigation example ... think of K8s environment: one project
> >>> is implementing DDoS mitigation with BPF, another one wants to monitor/
> >>> sample traffic to user space with BPF. Both install as first position by
> >>> default (before + 0). In K8s, there is no built-in Pod dependency management
> >>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
> >>> up in a situation where sometimes the monitor runs before the DDoS mitigation
> >>> and on some other nodes it's vice versa. The other case where this gets
> >>> broken (assuming a node where we get first the DDoS mitigation, then the
> >>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
> >>> gets a new stable update and is being re-rolled out, then it inserts
> >>> itself before the DDoS mitigation mechanism, potentially causing outage.
> >>> With the first/last mechanism these two situations cannot happen. The DDoS
> >>> mitigation software uses first and the monitoring uses before + 0, then no
> >>> matter the re-rollouts or the ordering in which Pods come up, it's always
> >>> at the expected/correct location.
> >> I'm not disputing that these kinds of policy issues need to be solved
> >> somehow. But adding the first/last pinning as part of the kernel hooks
> >> doesn't solve the policy problem, it just hard-codes a solution for one
> >> particular instance of the problem.
> >>
> >> Taking your example from above, what happens when someone wants to
> >> deploy those tools in reverse order? Say the monitoring tool counts
> >> packets and someone wants to also count the DDOS traffic; but the DDOS
> >> protection tool has decided for itself (by setting the FIRST) flag that
> >> it can *only* run as the first program, so there is no way to achieve
> >> this without modifying the application itself.
> >>
> >>>>> Because even with this new ordering scheme, there still should be
> >>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
> >>>>> And if it does the ordering, I don't really see why we need
> >>>>> F_FIRST/F_LAST.
> >>>> I can see I'm a bit late to the party, but FWIW I agree with this:
> >>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
> > It's in the prisoners' best interest to collaborate (and they do! see
> > https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
> > prio system is limiting and turns out to be really fragile in practice.
> >
> > If your tool wants to attach to tc prio 1 and there's already a prog
> > attached,
> > the most reliable option is basically to blindly replace the attachment,
> > unless
> > you have the possibility to inspect the attached prog and try to figure
> > out if it
> > belongs to another tool. This is fragile in and of itself, and only
> > possible on
> > more recent kernels iirc.
> >
> > With tcx, Cilium could make an initial attachment using F_FIRST and simply
> > update a link at well-known path on subsequent startups. If there's no
> > existing
> > link, and F_FIRST is taken, bail out with an error. The owner of the
> > existing
> > F_FIRST program can be queried and logged; we know for sure the program
> > doesn't belong to Cilium, and we have no interest in detaching it.
>
> That's conflating the benefit of F_FIRST with that of bpf_link, though;
> you can have the replace thing without the exclusive locking.

I think Timo says that he wants to install his bpf_link as the very
first decision-making BPF program (with F_FIRST) and make sure that
spot stays reserved for his program. And then he can just do
LINK_UPDATE to upgrade the underlying program.

I don't see anything being conflated here.
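
A minimal sketch of that flow, assuming the link-based tcx API from this
series (bpf_program__attach_tcx(), bpf_tcx_opts and BPF_F_FIRST as proposed
in v2; exact names and opts layout may differ) together with existing
libbpf link helpers; the pin path is just an example:

#include <errno.h>
#include <bpf/libbpf.h>

#define PIN_PATH "/sys/fs/bpf/tcx_ingress_first" /* example path */

static int ensure_first(struct bpf_program *prog, int ifindex)
{
        LIBBPF_OPTS(bpf_tcx_opts, opts, .flags = BPF_F_FIRST);
        struct bpf_link *link;

        /* Subsequent startups: reuse the pinned link and swap only the
         * underlying program; the position in the chain is preserved.
         */
        link = bpf_link__open(PIN_PATH);
        if (link)
                return bpf_link__update_program(link, prog);

        /* First startup: claim the head of the tcx chain and pin the
         * link. If F_FIRST is already taken, this fails loudly instead
         * of silently reordering anyone else's programs.
         */
        link = bpf_program__attach_tcx(prog, ifindex, &opts);
        if (!link)
                return -errno;

        return bpf_link__pin(link, PIN_PATH);
}

Pinning the link is what gives the owner update/replace semantics across
restarts without ever detaching other tools' programs.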

>
> >>> See above on the issues w/o the first/last. How would you work around them
> >>> in practice so they cannot happen?
> >> By having an ordering configuration that is deterministic. Enforced by
> >> the system-wide management daemon by whichever mechanism suits it. We
> >> could implement a minimal reference policy agent that just reads a
> >> config file in /etc somewhere, and *that* could implement FIRST/LAST
> >> semantics.
> > I think this particular perspective is what's deadlocking this discussion.
> > To me, it looks like distros and hyperscalers are in the same boat with
> > regards to the possibility of coordination between tools. Distros are only
> > responsible for the tools they package themselves, and hyperscalers
> > run a tight ship with mostly in-house tooling already. When it comes to
> > projects out in the wild, that all goes out the window.
>
> Not really: from the distro PoV we absolutely care about arbitrary
> combinations of programs with different authors. Which is why I'm
> arguing against putting anything into the kernel where the first program
> to come along can just grab a hook and lock everyone out.

What if some combinations of programs just cannot co-exist?


Daniel, Timo and I are arguing that there are real situations where
you have to be first or need to die. And the counter-argument we are
getting is "but someone can accidentally or in bad faith overuse
F_FIRST". The former is causing real problems and silent failures. The
latter is about fixing bugs and/or fighting bad actors. We don't
propose any real solution for the first, real problem, because we are
afraid of hypothetical bad actors. The former has a technical solution
(F_FIRST/F_LAST); the latter is a matter of bug fixing and pushing
back on bad actors. This is where distros can actually help, by making
sure that bad actors that don't really need F_FIRST/F_LAST are not
using them.

It's disturbing that we use the hypothetical "but users can be bad"
argument to prevent a solution to a technical problem that already
happened and will keep happening because there is no solution
available.

And the mythical user-space daemon that will solve all these problems
is not a convincing argument. I haven't seen any concrete proposals
beyond hand-wavy arguments.

>
> My assumption is basically this: A system administrator installs
> packages A and B that both use the TC hook. The developers of A and B
> have never heard about each other. It should be possible for that admin
> to run A and B in whichever order they like, without making any changes
> to A and B themselves.

That's impossible if A and B just cannot co-exist. E.g., if both A and
B are setting some socket options that are fundamentally in conflict.
You will have to either choose A or B, but not both. Or make sure A
and B somehow don't step on each other's toes. But it's not an ordering
problem. It's something that developers of A and B have to coordinate
between each other outside of the kernel.

>
> > Regardless of merit or feasability of a system-wide bpf management
> > daemon for k8s, there _is no ordering configuration possible_. K8s is not
> > a distro where package maintainers (or anyone else, really) can coordinate
> > on correctly defining priority of each of the tools they ship. This is
> > effectively
> > the prisoner's dilemma. I feel like most of the discussion so far has been
> > very hand-wavy in 'user space should solve it'. Well, we are user space, and
> > we're here trying to solve it. :)
> >
> > A hypothetical policy/gatekeeper/ordering daemon doesn't possess
> > implicit knowledge about which program needs to go where in the chain,
> > nor is there an obvious heuristic about how to order things. Maintaining
> > such a configuration for all cloud-native tooling out there that possibly
> > uses bpf is simply impossible, as even a tool like Cilium can change
> > dramatically from one release to the next. Having to manage this too
> > would put a significant burden on velocity and flexibility for arguably
> > little benefit to the user.
> >
> > So, daemon/kernel will need to be told how to order things, preferably by
> > the tools (Cilium/datadog-agent) themselves, since the user/admin of the
> > system cannot be expected to know where to position the hundreds of progs
> > loaded by Cilium and how they might interfere with other tools. Figuring
> > this out is the job of the tool, daemon or not.
> >
> > The prisoners _must_ communicate (so, not abuse F_FIRST) for things to
> > work correctly, and it's 100% in their best interest in doing so. Let's not
> > pretend like we're able to solve game theory on this mailing list. :)
> > We'll have to settle for the next-best thing: give user space a safe and
> > clear
> > API to allow it to coordinate and make the right decisions.
>
> But "always first" is not a meaningful concept. It's just what we have
> today (everyone picks priority 1), except now if there are two programs
> that want the same hook, it will be the first program that wins the
> contest (by locking the second one out), instead of the second program
> winning (by overriding the first one) as is the case with the silent
> override semantics we have with TC today. So we haven't solved the
> problem, we've just shifted the breakage.
>
> > To circle back to the observability case: in offline discussions with
> > Daniel,
> > I've mentioned the need for 'shadow' progs that only collect data and
> > pump it to user space, attached at specific points in the chain (still
> > within tcx!).
> > Their retcodes would be ignored, and context modifications would be
> > rejected, so attaching multiple to the same hook can always succeed,
> > much like cgroup multi. Consider the following:
> >
> > To attach a shadow prog before F_FIRST, a caller could use F_BEFORE |
> > F_FIRST |
> > F_RDONLY. Attaching between first and the 'relative' section: F_AFTER |
> > F_FIRST |
> > F_RDONLY, etc. The rdonly flag could even be made redundant if a new prog/
> > attach type is added for progs like these.
> >
> > This is still perfectly possible to implement on top of Daniel's
> > proposal, and
> > to me looks like it could address many of the concerns around ordering of
> > progs I've seen in this thread, many mention data exfiltration.
>
> It may well be that semantics like this will turn out to be enough. Or
> it may not (I personally believe we'll need something more expressive
> still, and where the system admin has the option to override things; but
> I may turn out to be wrong). Ultimately, my main point wrt this series
> is that this kind of policy decision can be added later, and it's better
> to merge the TCX infrastructure without it, instead of locking ourselves
> into an API that is way too limited today. TCX (and in-kernel XDP
> multiprog) has value without it, so let's merge that first and iterate
> on the policy aspects.
>
> -Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 14:15                       ` Daniel Borkmann
  2023-06-09 16:41                         ` Stanislav Fomichev
@ 2023-06-09 18:58                         ` Andrii Nakryiko
  2023-06-09 20:28                         ` Toke Høiland-Jørgensen
  2023-06-12 11:21                         ` Dave Tucker
  3 siblings, 0 replies; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-09 18:58 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Toke Høiland-Jørgensen, Timo Beckers,
	Stanislav Fomichev, ast, andrii, martin.lau, razor,
	john.fastabend, kuba, dxu, joe, davem, bpf, netdev

On Fri, Jun 9, 2023 at 7:15 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/9/23 3:11 PM, Toke Høiland-Jørgensen wrote:
> > Timo Beckers <timo@incline.eu> writes:
> >> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
> >>> Daniel Borkmann <daniel@iogearbox.net> writes:
> [...]
> >>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
> >>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
> >>>>>>>>>> would prevent the rest of the users.. (starting with only
> >>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> >>>>>>>>>> need first/laste).
> >>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
> >>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
> >>>>>>>>> to guarantee that my program runs first and observes each event, I'll
> >>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
> >>>>>>>>> then server setup is broken and my application cannot function.
> >>>>>>>>>
> >>>>>>>>> In a setup where we expect multiple applications to co-exist, it
> >>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
> >>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
> >>>>>>>>> and has to be reported to application owners.
> >>>>>>>>>
> >>>>>>>>> But it's not up to the kernel to enforce this cooperation by
> >>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
> >>>>>>>>> for some applications, IMO.
> >>>>>>>> Maybe that's something that should be done by some other mechanism?
> >>>>>>>> (and as a follow up, if needed) Something akin to what Toke
> >>>>>>>> mentioned with another program doing sorting or similar.
> >>>>>>> The goal of this API is to avoid needing some extra special program to
> >>>>>>> do this sorting
> >>>>>>>
> >>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
> >>>>>>>> only we have two now, not u16.
> >>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
> >>>>>>> course, but when they are needed, they will have no alternative.
> >>>>>>>
> >>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
> >>>>>>> be the only one attached". Should we encourage such use cases? No, of
> >>>>>>> course. But I think it's fair  for users to be able to express this.
> >>>>>>>
> >>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
> >>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
> >>>>>>>> some issue, but it won't work because there is already a 'first' program
> >>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
> >>>>>>> If your production setup requires that some important program has to
> >>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
> >>>>>>> interfere with it (assuming that FIRST requirement is a real
> >>>>>>> requirement and not someone just thinking they need to be first; but
> >>>>>>> that's up to user space to decide). Maybe the solution for you in that
> >>>>>>> case would be freplace program installed on top of that stubborn FIRST
> >>>>>>> program? And if we are talking about local debugging and development,
> >>>>>>> then you are a sysadmin and you should be able to force-detach that
> >>>>>>> program that is getting in the way.
> >>>>>> I'm not really concerned about our production environment. It's pretty
> >>>>>> controlled and restricted and I'm pretty certain we can avoid doing
> >>>>>> something stupid. Probably the same for your env.
> >>>>>>
> >>>>>> I'm mostly fantasizing about upstream world where different users don't
> >>>>>> know about each other and start doing stupid things like F_FIRST where
> >>>>>> they don't really have to be first. It's that "used judiciously" part
> >>>>>> that I'm a bit skeptical about :-D
> >>>> But in the end how is that different from just attaching themselves blindly
> >>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
> >>>> of the current first program) - same, they don't really have to be first.
> >>>> How would that not result in doing something stupid? ;) To add to Andrii's
> >>>> earlier DDoS mitigation example ... think of K8s environment: one project
> >>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
> >>>> sample traffic to user space with BPF. Both install as first position by
> >>>> default (before + 0). In K8s, there is no built-in Pod dependency management
> >>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
> >>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
> >>>> and on some other nodes it's vice versa. The other case where this gets
> >>>> broken (assuming a node where we get first the DDoS mitigation, then the
> >>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
> >>>> gets a new stable update and is being re-rolled out, then it inserts
> >>>> itself before the DDoS mitigation mechanism, potentially causing outage.
> >>>> With the first/last mechanism these two situations cannot happen. The DDoS
> >>>> mitigation software uses first and the monitoring uses before + 0, then no
> >>>> matter the re-rollouts or the ordering in which Pods come up, it's always
> >>>> at the expected/correct location.
> >>> I'm not disputing that these kinds of policy issues need to be solved
> >>> somehow. But adding the first/last pinning as part of the kernel hooks
> >>> doesn't solve the policy problem, it just hard-codes a solution for one
> >>> particular instance of the problem.
> >>>
> >>> Taking your example from above, what happens when someone wants to
> >>> deploy those tools in reverse order? Say the monitoring tool counts
> >>> packets and someone wants to also count the DDOS traffic; but the DDOS
> >>> protection tool has decided for itself (by setting the FIRST) flag that
> >>> it can *only* run as the first program, so there is no way to achieve
> >>> this without modifying the application itself.
> >>>
> >>>>>> Because even with this new ordering scheme, there still should be
> >>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
> >>>>>> And if it does the ordering, I don't really see why we need
> >>>>>> F_FIRST/F_LAST.
> >>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
> >>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
> >> It's in the prisoners' best interest to collaborate (and they do! see
> >> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
> >> prio system is limiting and turns out to be really fragile in practice.
> >>
> >> If your tool wants to attach to tc prio 1 and there's already a prog
> >> attached,
> >> the most reliable option is basically to blindly replace the attachment,
> >> unless
> >> you have the possibility to inspect the attached prog and try to figure
> >> out if it
> >> belongs to another tool. This is fragile in and of itself, and only
> >> possible on
> >> more recent kernels iirc.
> >>
> >> With tcx, Cilium could make an initial attachment using F_FIRST and simply
> >> update a link at well-known path on subsequent startups. If there's no
> >> existing
> >> link, and F_FIRST is taken, bail out with an error. The owner of the
> >> existing
> >> F_FIRST program can be queried and logged; we know for sure the program
> >> doesn't belong to Cilium, and we have no interest in detaching it.
> >
> > That's conflating the benefit of F_FIRST with that of bpf_link, though;
> > you can have the replace thing without the exclusive locking.
> >
> >>>> See above on the issues w/o the first/last. How would you work around them
> >>>> in practice so they cannot happen?
> >>> By having an ordering configuration that is deterministic. Enforced by
> >>> the system-wide management daemon by whichever mechanism suits it. We
> >>> could implement a minimal reference policy agent that just reads a
> >>> config file in /etc somewhere, and *that* could implement FIRST/LAST
> >>> semantics.
> >> I think this particular perspective is what's deadlocking this discussion.
> >> To me, it looks like distros and hyperscalers are in the same boat with
> >> regards to the possibility of coordination between tools. Distros are only
> >> responsible for the tools they package themselves, and hyperscalers
> >> run a tight ship with mostly in-house tooling already. When it comes to
> >> projects out in the wild, that all goes out the window.
> >
> > Not really: from the distro PoV we absolutely care about arbitrary
> > combinations of programs with different authors. Which is why I'm
> > arguing against putting anything into the kernel where the first program
> > to come along can just grab a hook and lock everyone out.
> >
> > My assumption is basically this: A system administrator installs
> > packages A and B that both use the TC hook. The developers of A and B
> > have never heard about each other. It should be possible for that admin
> > to run A and B in whichever order they like, without making any changes
> > to A and B themselves.
>
> I would come with the point of view of the K8s cluster operator or platform
> engineer, if you will. Someone deeply familiar with K8s, but not necessarily
> knowing about kernel internals. I know my org needs to run container A and
> container B, so I'll deploy the daemon-sets for both and they get deployed
> into my cluster. That platform engineer might have never heard of BPF or might
> not even know that container A or container B ships software with BPF. As
> mentioned, K8s itself has no concept of Pod ordering as its paradigm is that
> with BPF. As mentioned, K8s itself has no concept of Pod ordering since its
> paradigm is that everything is loosely coupled. We are now expecting that
> person to make a concrete decision about BPF kernel internals, namely in
> which order programs on the various hooks should be executed, given that if
> they don't, the system becomes non-deterministic. I think that is quite a big
> burden and a big ask. Eventually that person will say that he/she cannot make
> this technical decision and that only one of the two containers can be
> deployed. I agree with you that there should be an option for a technically
> versed person to change the ordering to avoid lock-out, but I don't think it
> will fly to ask users to come up on their own with policies for BPF software
> in the wild ... similar to how you probably don't want to have to write
> systemd unit files for software xyz before you can use your laptop. It's a
> burden. You expect this to magically work by default, and to make custom
> changes only if needed for good reasons. The one difference is that the
> latter ships with the OS (a priori known / the tight-ship analogy).
>
> >> Regardless of merit or feasability of a system-wide bpf management
> >> daemon for k8s, there _is no ordering configuration possible_. K8s is not
> >> a distro where package maintainers (or anyone else, really) can coordinate
> >> on correctly defining priority of each of the tools they ship. This is
> >> effectively
> >> the prisoner's dilemma. I feel like most of the discussion so far has been
> >> very hand-wavy in 'user space should solve it'. Well, we are user space, and
> >> we're here trying to solve it. :)
> >>
> >> A hypothetical policy/gatekeeper/ordering daemon doesn't possess
> >> implicit knowledge about which program needs to go where in the chain,
> >> nor is there an obvious heuristic about how to order things. Maintaining
> >> such a configuration for all cloud-native tooling out there that possibly
> >> uses bpf is simply impossible, as even a tool like Cilium can change
> >> dramatically from one release to the next. Having to manage this too
> >> would put a significant burden on velocity and flexibility for arguably
> >> little benefit to the user.
> >>
> >> So, daemon/kernel will need to be told how to order things, preferably by
> >> the tools (Cilium/datadog-agent) themselves, since the user/admin of the
> >> system cannot be expected to know where to position the hundreds of progs
> >> loaded by Cilium and how they might interfere with other tools. Figuring
> >> this out is the job of the tool, daemon or not.
> >>
> >> The prisoners _must_ communicate (so, not abuse F_FIRST) for things to
> >> work correctly, and it's 100% in their best interest in doing so. Let's not
> >> pretend like we're able to solve game theory on this mailing list. :)
> >> We'll have to settle for the next-best thing: give user space a safe and
> >> clear
> >> API to allow it to coordinate and make the right decisions.
> >
> > But "always first" is not a meaningful concept. It's just what we have
> > today (everyone picks priority 1), except now if there are two programs
> > that want the same hook, it will be the first program that wins the
> > contest (by locking the second one out), instead of the second program
> > winning (by overriding the first one) as is the case with the silent
> > override semantics we have with TC today. So we haven't solved the
> > problem, we've just shifted the breakage.
>
> Fwiw, it's deterministic, and I think this is 1000x better than silently
> having a non-deterministic deployment where the two programs ship with
> before + 0. That is much harder to debug.
>

Totally agree. Silent overriding of the BPF program while user-space
is still running (and completely unaware) is MUCH WORSE than not being
able to start if your spot is taken.

The latter is explicit and gives a lot of signal early on. The former
is confusing and results in hours of painful guessing on what's going
on.

How is this even controversial?
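
For context, this is roughly what the silent override looks like with today's
API. A minimal sketch using libbpf's legacy TC attach (ifindex and prog_fd are
placeholders); both tools end up doing something like this at priority 1, with
the blind BPF_TC_F_REPLACE being the "most reliable option" mentioned earlier
in the thread, so whichever tool runs last wins and the other one never sees
an error:

  #include <errno.h>
  #include <bpf/libbpf.h>

  /* Sketch only: the "blindly replace at prio 1" pattern. */
  static int attach_prio1(int ifindex, int prog_fd)
  {
          DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
                              .ifindex = ifindex,
                              .attach_point = BPF_TC_INGRESS);
          DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts,
                              .handle = 1,
                              .priority = 1,
                              .prog_fd = prog_fd);
          int err;

          /* Creating the clsact qdisc may legitimately return -EEXIST. */
          err = bpf_tc_hook_create(&hook);
          if (err && err != -EEXIST)
                  return err;

          /* BPF_TC_F_REPLACE silently swaps out whatever currently sits at
           * handle 1 / priority 1, no matter which tool attached it.
           */
          opts.flags = BPF_TC_F_REPLACE;
          return bpf_tc_attach(&hook, &opts);
  }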


> >> To circle back to the observability case: in offline discussions with
> >> Daniel,
> >> I've mentioned the need for 'shadow' progs that only collect data and
> >> pump it to user space, attached at specific points in the chain (still
> >> within tcx!).
> >> Their retcodes would be ignored, and context modifications would be
> >> rejected, so attaching multiple to the same hook can always succeed,
> >> much like cgroup multi. Consider the following:
> >>
> >> To attach a shadow prog before F_FIRST, a caller could use F_BEFORE |
> >> F_FIRST |
> >> F_RDONLY. Attaching between first and the 'relative' section: F_AFTER |
> >> F_FIRST |
> >> F_RDONLY, etc. The rdonly flag could even be made redundant if a new prog/
> >> attach type is added for progs like these.
> >>
> >> This is still perfectly possible to implement on top of Daniel's
> >> proposal, and
> >> to me looks like it could address many of the concerns around ordering of
> >> progs I've seen in this thread, many mention data exfiltration.
> >
> > It may well be that semantics like this will turn out to be enough. Or
> > it may not (I personally believe we'll need something more expressive
> > still, and where the system admin has the option to override things; but
> > I may turn out to be wrong). Ultimately, my main point wrt this series
> > is that this kind of policy decision can be added later, and it's better
> > to merge the TCX infrastructure without it, instead of locking ourselves
> > into an API that is way too limited today. TCX (and in-kernel XDP
> > multiprog) has value without it, so let's merge that first and iterate
> > on the policy aspects.
>
> That's okay and I'll do that for v3 to move on.
>
> I feel we might repeat the same discussion with no good solution for K8s
> users once we come back to this point again.
>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 16:41                         ` Stanislav Fomichev
@ 2023-06-09 19:03                           ` Andrii Nakryiko
  2023-06-10  2:52                             ` Daniel Xu
  0 siblings, 1 reply; 49+ messages in thread
From: Andrii Nakryiko @ 2023-06-09 19:03 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Daniel Borkmann, Toke Høiland-Jørgensen, Timo Beckers,
	ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

On Fri, Jun 9, 2023 at 9:41 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jun 9, 2023 at 7:15 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > On 6/9/23 3:11 PM, Toke Høiland-Jørgensen wrote:
> > > Timo Beckers <timo@incline.eu> writes:
> > >> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
> > >>> Daniel Borkmann <daniel@iogearbox.net> writes:
> > [...]
> > >>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
> > >>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
> > >>>>>>>>>> would prevent the rest of the users.. (starting with only
> > >>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
> > >>>>>>>>>> need first/last).
> > >>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
> > >>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
> > >>>>>>>>> to guarantee that my program runs first and observes each event, I'll
> > >>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
> > >>>>>>>>> then server setup is broken and my application cannot function.
> > >>>>>>>>>
> > >>>>>>>>> In a setup where we expect multiple applications to co-exist, it
> > >>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
> > >>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
> > >>>>>>>>> and has to be reported to application owners.
> > >>>>>>>>>
> > >>>>>>>>> But it's not up to the kernel to enforce this cooperation by
> > >>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
> > >>>>>>>>> for some applications, IMO.
> > >>>>>>>> Maybe that's something that should be done by some other mechanism?
> > >>>>>>>> (and as a follow up, if needed) Something akin to what Toke
> > >>>>>>>> mentioned with another program doing sorting or similar.
> > >>>>>>> The goal of this API is to avoid needing some extra special program to
> > >>>>>>> do this sorting
> > >>>>>>>
> > >>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
> > >>>>>>>> only we have two now, not u16.
> > >>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
> > >>>>>>> course, but when they are needed, they will have no alternative.
> > >>>>>>>
> > >>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
> > >>>>>>> be the only one attached". Should we encourage such use cases? No, of
> > >>>>>>> course. But I think it's fair  for users to be able to express this.
> > >>>>>>>
> > >>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
> > >>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
> > >>>>>>>> some issue, but it won't work because there is already a 'first' program
> > >>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
> > >>>>>>> If your production setup requires that some important program has to
> > >>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
> > >>>>>>> interfere with it (assuming that FIRST requirement is a real
> > >>>>>>> requirement and not someone just thinking they need to be first; but
> > >>>>>>> that's up to user space to decide). Maybe the solution for you in that
> > >>>>>>> case would be freplace program installed on top of that stubborn FIRST
> > >>>>>>> program? And if we are talking about local debugging and development,
> > >>>>>>> then you are a sysadmin and you should be able to force-detach that
> > >>>>>>> program that is getting in the way.
> > >>>>>> I'm not really concerned about our production environment. It's pretty
> > >>>>>> controlled and restricted and I'm pretty certain we can avoid doing
> > >>>>>> something stupid. Probably the same for your env.
> > >>>>>>
> > >>>>>> I'm mostly fantasizing about upstream world where different users don't
> > >>>>>> know about each other and start doing stupid things like F_FIRST where
> > >>>>>> they don't really have to be first. It's that "used judiciously" part
> > >>>>>> that I'm a bit skeptical about :-D
> > >>>> But in the end how is that different from just attaching themselves blindly
> > >>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
> > >>>> of the current first program) - same, they don't really have to be first.
> > >>>> How would that not result in doing something stupid? ;) To add to Andrii's
> > >>>> earlier DDoS mitigation example ... think of K8s environment: one project
> > >>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
> > >>>> sample traffic to user space with BPF. Both install as first position by
> > >>>> default (before + 0). In K8s, there is no built-in Pod dependency management
> > >>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
> > >>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
> > >>>> and on some other nodes it's vice versa. The other case where this gets
> > >>>> broken (assuming a node where we get first the DDoS mitigation, then the
> > >>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
> > >>>> gets a new stable update and is being re-rolled out, then it inserts
> > >>>> itself before the DDoS mitigation mechanism, potentially causing outage.
> > >>>> With the first/last mechanism these two situations cannot happen. The DDoS
> > >>>> mitigation software uses first and the monitoring uses before + 0, then no
> > >>>> matter the re-rollouts or the ordering in which Pods come up, it's always
> > >>>> at the expected/correct location.
> > >>> I'm not disputing that these kinds of policy issues need to be solved
> > >>> somehow. But adding the first/last pinning as part of the kernel hooks
> > >>> doesn't solve the policy problem, it just hard-codes a solution for one
> > >>> particular instance of the problem.
> > >>>
> > >>> Taking your example from above, what happens when someone wants to
> > >>> deploy those tools in reverse order? Say the monitoring tool counts
> > >>> packets and someone wants to also count the DDOS traffic; but the DDOS
> > >>> protection tool has decided for itself (by setting the FIRST) flag that
> > >>> it can *only* run as the first program, so there is no way to achieve
> > >>> this without modifying the application itself.
> > >>>
> > >>>>>> Because even with this new ordering scheme, there still should be
> > >>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
> > >>>>>> And if it does the ordering, I don't really see why we need
> > >>>>>> F_FIRST/F_LAST.
> > >>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
> > >>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
> > >> It's in the prisoners' best interest to collaborate (and they do! see
> > >> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
> > >> prio system is limiting and turns out to be really fragile in practice.
> > >>
> > >> If your tool wants to attach to tc prio 1 and there's already a prog
> > >> attached,
> > >> the most reliable option is basically to blindly replace the attachment,
> > >> unless
> > >> you have the possibility to inspect the attached prog and try to figure
> > >> out if it
> > >> belongs to another tool. This is fragile in and of itself, and only
> > >> possible on
> > >> more recent kernels iirc.
> > >>
> > >> With tcx, Cilium could make an initial attachment using F_FIRST and simply
> > >> update a link at well-known path on subsequent startups. If there's no
> > >> existing
> > >> link, and F_FIRST is taken, bail out with an error. The owner of the
> > >> existing
> > >> F_FIRST program can be queried and logged; we know for sure the program
> > >> doesn't belong to Cilium, and we have no interest in detaching it.
> > >
> > > That's conflating the benefit of F_FIRST with that of bpf_link, though;
> > > you can have the replace thing without the exclusive locking.
> > >
> > >>>> See above on the issues w/o the first/last. How would you work around them
> > >>>> in practice so they cannot happen?
> > >>> By having an ordering configuration that is deterministic. Enforced by
> > >>> the system-wide management daemon by whichever mechanism suits it. We
> > >>> could implement a minimal reference policy agent that just reads a
> > >>> config file in /etc somewhere, and *that* could implement FIRST/LAST
> > >>> semantics.
> > >> I think this particular perspective is what's deadlocking this discussion.
> > >> To me, it looks like distros and hyperscalers are in the same boat with
> > >> regards to the possibility of coordination between tools. Distros are only
> > >> responsible for the tools they package themselves, and hyperscalers
> > >> run a tight ship with mostly in-house tooling already. When it comes to
> > >> projects out in the wild, that all goes out the window.
> > >
> > > Not really: from the distro PoV we absolutely care about arbitrary
> > > combinations of programs with different authors. Which is why I'm
> > > arguing against putting anything into the kernel where the first program
> > > to come along can just grab a hook and lock everyone out.
> > >
> > > My assumption is basically this: A system administrator installs
> > > packages A and B that both use the TC hook. The developers of A and B
> > > have never heard about each other. It should be possible for that admin
> > > to run A and B in whichever order they like, without making any changes
> > > to A and B themselves.
> >
> > I would come with the point of view of the K8s cluster operator or platform
> > engineer, if you will. Someone deeply familiar with K8s, but not necessarily
> > knowing about kernel internals. I know my org needs to run container A and
> > container B, so I'll deploy the daemon-sets for both and they get deployed
> > into my cluster. That platform engineer might have never heard of BPF or might
> > not even know that container A or container B ships software with BPF. As
> > mentioned, K8s itself has no concept of Pod ordering as its paradigm is that
> > everything is loosely coupled. We are now expecting that person to make a
> > concrete decision about BPF kernel internals, namely in which order programs
> > on the various hooks should be executed, given that if they don't, the system
> > becomes non-deterministic. I think that is quite a big burden and a big ask.
> > Eventually that person will say that he/she cannot make this technical decision
> > and that only one of the two containers can be deployed. I agree with you that
> > there should be an option for a technically versed person to be able to change
> > ordering to avoid lock out, but I don't think it will fly to ask users to come
> > up with their own policies for BPF software in the wild ... similar to how you
> > probably don't want to have to deal with writing systemd unit files for
> > software xyz before you can use your laptop. It's a burden. You expect this to
> > magically work by default, and to make custom changes only if needed and for
> > good reasons.
> > Just the one difference is that the latter ships with the OS (a priori known /
> > tight-ship analogy).
> >
> > >> Regardless of merit or feasibility of a system-wide bpf management
> > >> daemon for k8s, there _is no ordering configuration possible_. K8s is not
> > >> a distro where package maintainers (or anyone else, really) can coordinate
> > >> on correctly defining priority of each of the tools they ship. This is
> > >> effectively
> > >> the prisoner's dilemma. I feel like most of the discussion so far has been
> > >> very hand-wavy in 'user space should solve it'. Well, we are user space, and
> > >> we're here trying to solve it. :)
> > >>
> > >> A hypothetical policy/gatekeeper/ordering daemon doesn't possess
> > >> implicit knowledge about which program needs to go where in the chain,
> > >> nor is there an obvious heuristic about how to order things. Maintaining
> > >> such a configuration for all cloud-native tooling out there that possibly
> > >> uses bpf is simply impossible, as even a tool like Cilium can change
> > >> dramatically from one release to the next. Having to manage this too
> > >> would put a significant burden on velocity and flexibility for arguably
> > >> little benefit to the user.
> > >>
> > >> So, daemon/kernel will need to be told how to order things, preferably by
> > >> the tools (Cilium/datadog-agent) themselves, since the user/admin of the
> > >> system cannot be expected to know where to position the hundreds of progs
> > >> loaded by Cilium and how they might interfere with other tools. Figuring
> > >> this out is the job of the tool, daemon or not.
> > >>
> > >> The prisoners _must_ communicate (so, not abuse F_FIRST) for things to
> > >> work correctly, and it's 100% in their best interest in doing so. Let's not
> > >> pretend like we're able to solve game theory on this mailing list. :)
> > >> We'll have to settle for the next-best thing: give user space a safe and
> > >> clear
> > >> API to allow it to coordinate and make the right decisions.
> > >
> > > But "always first" is not a meaningful concept. It's just what we have
> > > today (everyone picks priority 1), except now if there are two programs
> > > that want the same hook, it will be the first program that wins the
> > > contest (by locking the second one out), instead of the second program
> > > winning (by overriding the first one) as is the case with the silent
> > > override semantics we have with TC today. So we haven't solved the
> > > problem, we've just shifted the breakage.
> >
> > Fwiw, it's deterministic, and I think this is 1000x better than silently
> > having a non-deterministic deployment where the two programs ship with
> > before + 0. That is much harder to debug.
> >
> > >> To circle back to the observability case: in offline discussions with
> > >> Daniel,
> > >> I've mentioned the need for 'shadow' progs that only collect data and
> > >> pump it to user space, attached at specific points in the chain (still
> > >> within tcx!).
> > >> Their retcodes would be ignored, and context modifications would be
> > >> rejected, so attaching multiple to the same hook can always succeed,
> > >> much like cgroup multi. Consider the following:
> > >>
> > >> To attach a shadow prog before F_FIRST, a caller could use F_BEFORE |
> > >> F_FIRST |
> > >> F_RDONLY. Attaching between first and the 'relative' section: F_AFTER |
> > >> F_FIRST |
> > >> F_RDONLY, etc. The rdonly flag could even be made redundant if a new prog/
> > >> attach type is added for progs like these.
> > >>
> > >> This is still perfectly possible to implement on top of Daniel's
> > >> proposal, and
> > >> to me looks like it could address many of the concerns around ordering of
> > >> progs I've seen in this thread, many mention data exfiltration.
> > >
> > > It may well be that semantics like this will turn out to be enough. Or
> > > it may not (I personally believe we'll need something more expressive
> > > still, and where the system admin has the option to override things; but
> > > I may turn out to be wrong). Ultimately, my main point wrt this series
> > > is that this kind of policy decision can be added later, and it's better
> > > to merge the TCX infrastructure without it, instead of locking ourselves
> > > into an API that is way too limited today. TCX (and in-kernel XDP
> > > multiprog) has value without it, so let's merge that first and iterate
> > > on the policy aspects.
> >
> > That's okay and I'll do that for v3 to move on.
> >
> > I feel we might repeat the same discussion with no good solution for K8s
> > users once we come back to this point again.
>
> With your cilium vs ddos example, maybe all we really need is for the
> program to have some signal about whether it's ok to have somebody
> modify/drop the packets before it?
> For example, the verifier, depending on whether it sees that the
> program writes to the data, uses some helpers, or returns
> TC_ACT_SHOT/etc can classify the program as readonly or non-readonly.
> And then, we'll have some extra flag during program load/attach that
> cilium will pass to express "I'm not ok with having a non-readonly
> program before me".

So this is what Timo is proposing with F_READONLY. And I agree, that
makes sense and we've discussed the need for something like this
internally. Specific use case was setsockopt programs. Sometimes they
should just observe, and we'd like to enforce that.

Once we have this F_READONLY flag support and enforce that during BPF
program validation, then "I'm not ok with having a non-readonly
program before me" is exactly F_FIRST. We just say that the F_READONLY
program can be inserted anywhere because it has no effect on the state
of the system.

>
> Seems doable? If it makes sense, we can try to do this as a follow up.
> It should solve some simple cases without an external arbiter.

Yes, and we can add that on top of current F_FIRST/F_LAST. Currently
we have to pessimistically assume that every program is non-readonly
and F_FIRST/F_LAST applies to just non-readonly programs.
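
To make that concrete, a minimal sketch of the kind of program such a flag
would be meant for (F_READONLY / F_RDONLY is only a name floated in this
thread, not existing UAPI; the map and program names below are made up). It
only reads the skb and returns TC_ACT_UNSPEC, so wherever it sits in the
chain it cannot change the verdict or the packet:

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __u64);
  } bytes_seen SEC(".maps");

  SEC("tc")
  int observe(struct __sk_buff *skb)
  {
          __u32 key = 0;
          __u64 *val = bpf_map_lookup_elem(&bytes_seen, &key);

          if (val)
                  *val += skb->len;
          /* TC_ACT_UNSPEC: defer to the next program / default action. */
          return TC_ACT_UNSPEC;
  }

  char _license[] SEC("license") = "GPL";

Enforcing "read only" would then mean the verifier rejecting packet writes,
ctx writes and non-neutral return codes for programs attached with that flag.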

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 18:56                       ` Andrii Nakryiko
@ 2023-06-09 20:08                         ` Alexei Starovoitov
       [not found]                           ` <20230610022721.2950602-1-prankgup@fb.com>
  2023-06-09 20:20                         ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 49+ messages in thread
From: Alexei Starovoitov @ 2023-06-09 20:08 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Timo Beckers, Daniel Borkmann,
	Stanislav Fomichev, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Nikolay Aleksandrov, John Fastabend,
	Jakub Kicinski, Daniel Xu, Joe Stringer, David S. Miller, bpf,
	Network Development

On Fri, Jun 9, 2023 at 11:56 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> Me, Daniel, Timo are arguing that there are real situations where you
> have to be first or need to die.

afaik out of all xdp and tc progs there is not a single prog in the fb fleet
that has to be first.
fb's ddos and firewall don't have to be first.
cilium and datadog progs don't have to be first either.
The race between cilium and datadog was not the race to the first position,
but the conflict due to the same prio.
In all cases I'm aware of, prog owners care a lot about ordering,
but never about strict first.
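
To make "ordering, but not strict first" concrete, a toy user-space sketch of
relative insertion (plain C with made-up names; not the series' actual API and
not kernel code). The later program anchors itself to an already attached one,
so the resulting order no longer depends on which tool happened to start
first:

  #include <stdio.h>
  #include <string.h>

  #define MAX_PROGS 8

  static const char *chain[MAX_PROGS];
  static int nprogs;

  /* Insert @name before @anchor, or at the head if @anchor is NULL. */
  static int attach_before(const char *name, const char *anchor)
  {
          int pos = 0, i;

          if (nprogs == MAX_PROGS)
                  return -1;
          if (anchor) {
                  while (pos < nprogs && strcmp(chain[pos], anchor))
                          pos++;
                  if (pos == nprogs)
                          return -1;      /* anchor is not attached */
          }
          for (i = nprogs; i > pos; i--)
                  chain[i] = chain[i - 1];
          chain[pos] = name;
          nprogs++;
          return 0;
  }

  int main(void)
  {
          int i;

          /* The mitigation program is attached at the head of the chain. */
          attach_before("ddos_mitigation", NULL);
          /* The observer asks to run before it by naming it explicitly,
           * rather than by claiming an exclusive "first" slot.
           */
          attach_before("traffic_monitor", "ddos_mitigation");

          for (i = 0; i < nprogs; i++)
                  printf("%d: %s\n", i, chain[i]);
          return 0;
  }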

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 18:56                       ` Andrii Nakryiko
  2023-06-09 20:08                         ` Alexei Starovoitov
@ 2023-06-09 20:20                         ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 49+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-09 20:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Timo Beckers, Daniel Borkmann, Stanislav Fomichev, ast, andrii,
	martin.lau, razor, john.fastabend, kuba, dxu, joe, davem, bpf,
	netdev

>> >>> See above on the issues w/o the first/last. How would you work around them
>> >>> in practice so they cannot happen?
>> >> By having an ordering configuration that is deterministic. Enforced by
>> >> the system-wide management daemon by whichever mechanism suits it. We
>> >> could implement a minimal reference policy agent that just reads a
>> >> config file in /etc somewhere, and *that* could implement FIRST/LAST
>> >> semantics.
>> > I think this particular perspective is what's deadlocking this discussion.
>> > To me, it looks like distros and hyperscalers are in the same boat with
>> > regards to the possibility of coordination between tools. Distros are only
>> > responsible for the tools they package themselves, and hyperscalers
>> > run a tight ship with mostly in-house tooling already. When it comes to
>> > projects out in the wild, that all goes out the window.
>>
>> Not really: from the distro PoV we absolutely care about arbitrary
>> combinations of programs with different authors. Which is why I'm
>> arguing against putting anything into the kernel where the first program
>> to come along can just grab a hook and lock everyone out.
>
> What if some combinations of programs just cannot co-exist?
>
>
> Me, Daniel, Timo are arguing that there are real situations where you
> have to be first or need to die.

Right, and what I'm saying is that this decision should not be up to
individual applications to decide for the whole system. I'm OK with an
application *requesting* that, but it should be possible for a
system-level policy to override that request. I don't actually care so
much about the mechanism for doing so; I just don't want to expose a
flag in UAPI that comes with such a "lock everything" promise, because
that explicitly prevents such system overrides.

> And the counter argument we are getting is "but someone can
> accidentally or in bad faith overuse F_FIRST". The former is causing
> real problems and silent failures. The latter is about fixing bugs
> and/or fighting bad actors. We don't propose any real solution for the
> real first problem, because we are afraid of hypothetical bad actors.
> The former has a technical solution (F_FIRST/F_LAST), the latter is a
> matter of bug fixing and pushing back on bad actors. This is where
> distros can actually help by making sure that bad actors that don't
> really need F_FIRST/F_LAST are not using them.

It's not about "bad actors" in the malicious sense, it's about decisions
being made in one context not being valid in another. Take Daniel's
example of a DDOS application. It's probably a quite legitimate choice
for the developers of such an application to say "it only makes sense to
run a DDOS application first, so of course we'll set the FIRST flag".
But it is just as legitimate for a user/admin to say "I actually want to
run this DDOS application after this other application I wrote for that
specific purpose". If we enforce the FIRST flag semantics at the kernel
level we're making a decision that the first case is legitimate and the
second isn't, and that's just not true.

The whole datadog/cilium issue shows exactly that this kind of conflict
*is* how things play out in practice.

-Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 14:15                       ` Daniel Borkmann
  2023-06-09 16:41                         ` Stanislav Fomichev
  2023-06-09 18:58                         ` Andrii Nakryiko
@ 2023-06-09 20:28                         ` Toke Høiland-Jørgensen
  2023-06-12 11:21                         ` Dave Tucker
  3 siblings, 0 replies; 49+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-09 20:28 UTC (permalink / raw)
  To: Daniel Borkmann, Timo Beckers, Stanislav Fomichev, Andrii Nakryiko
  Cc: ast, andrii, martin.lau, razor, john.fastabend, kuba, dxu, joe,
	davem, bpf, netdev

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 6/9/23 3:11 PM, Toke Høiland-Jørgensen wrote:
>> Timo Beckers <timo@incline.eu> writes:
>>> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
> [...]
>>>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>>>>>>> would prevent the rest of the users.. (starting with only
>>>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>>>>>>> need first/last).
>>>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>>>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>>>>>>> then server setup is broken and my application cannot function.
>>>>>>>>>>
>>>>>>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>>>>>>> and has to be reported to application owners.
>>>>>>>>>>
>>>>>>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>>>>>>> for some applications, IMO.
>>>>>>>>> Maybe that's something that should be done by some other mechanism?
>>>>>>>>> (and as a follow up, if needed) Something akin to what Toke
>>>>>>>>> mentioned with another program doing sorting or similar.
>>>>>>>> The goal of this API is to avoid needing some extra special program to
>>>>>>>> do this sorting
>>>>>>>>
>>>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>>>>>>> only we have two now, not u16.
>>>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>>>>>>> course, but when they are needed, they will have no alternative.
>>>>>>>>
>>>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>>>>>>> be the only one attached". Should we encourage such use cases? No, of
>>>>>>>> course. But I think it's fair  for users to be able to express this.
>>>>>>>>
>>>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>>>>>>> some issue, but it won't work because there is already a 'first' program
>>>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>>>>>> If your production setup requires that some important program has to
>>>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>>>>>>> interfere with it (assuming that FIRST requirement is a real
>>>>>>>> requirement and not someone just thinking they need to be first; but
>>>>>>>> that's up to user space to decide). Maybe the solution for you in that
>>>>>>>> case would be freplace program installed on top of that stubborn FIRST
>>>>>>>> program? And if we are talking about local debugging and development,
>>>>>>>> then you are a sysadmin and you should be able to force-detach that
>>>>>>>> program that is getting in the way.
>>>>>>> I'm not really concerned about our production environment. It's pretty
>>>>>>> controlled and restricted and I'm pretty certain we can avoid doing
>>>>>>> something stupid. Probably the same for your env.
>>>>>>>
>>>>>>> I'm mostly fantasizing about upstream world where different users don't
>>>>>>> know about each other and start doing stupid things like F_FIRST where
>>>>>>> they don't really have to be first. It's that "used judiciously" part
>>>>>>> that I'm a bit skeptical about :-D
>>>>> But in the end how is that different from just attaching themselves blindly
>>>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
>>>>> of the current first program) - same, they don't really have to be first.
>>>>> How would that not result in doing something stupid? ;) To add to Andrii's
>>>>> earlier DDoS mitigation example ... think of K8s environment: one project
>>>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
>>>>> sample traffic to user space with BPF. Both install as first position by
>>>>> default (before + 0). In K8s, there is no built-in Pod dependency management
>>>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
>>>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
>>>>> and on some other nodes it's vice versa. The other case where this gets
>>>>> broken (assuming a node where we get first the DDoS mitigation, then the
>>>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
>>>>> gets a new stable update and is being re-rolled out, then it inserts
>>>>> itself before the DDoS mitigation mechanism, potentially causing outage.
>>>>> With the first/last mechanism these two situations cannot happen. The DDoS
>>>>> mitigation software uses first and the monitoring uses before + 0, then no
>>>>> matter the re-rollouts or the ordering in which Pods come up, it's always
>>>>> at the expected/correct location.
>>>> I'm not disputing that these kinds of policy issues need to be solved
>>>> somehow. But adding the first/last pinning as part of the kernel hooks
>>>> doesn't solve the policy problem, it just hard-codes a solution for one
>>>> particular instance of the problem.
>>>>
>>>> Taking your example from above, what happens when someone wants to
>>>> deploy those tools in reverse order? Say the monitoring tool counts
>>>> packets and someone wants to also count the DDOS traffic; but the DDOS
>>>> protection tool has decided for itself (by setting the FIRST) flag that
>>>> it can *only* run as the first program, so there is no way to achieve
>>>> this without modifying the application itself.
>>>>
>>>>>>> Because even with this new ordering scheme, there still should be
>>>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
>>>>>>> And if it does the ordering, I don't really see why we need
>>>>>>> F_FIRST/F_LAST.
>>>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
>>>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
>>> It's in the prisoners' best interest to collaborate (and they do! see
>>> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
>>> prio system is limiting and turns out to be really fragile in practice.
>>>
>>> If your tool wants to attach to tc prio 1 and there's already a prog
>>> attached,
>>> the most reliable option is basically to blindly replace the attachment,
>>> unless
>>> you have the possibility to inspect the attached prog and try to figure
>>> out if it
>>> belongs to another tool. This is fragile in and of itself, and only
>>> possible on
>>> more recent kernels iirc.
>>>
>>> With tcx, Cilium could make an initial attachment using F_FIRST and simply
>>> update a link at well-known path on subsequent startups. If there's no
>>> existing
>>> link, and F_FIRST is taken, bail out with an error. The owner of the
>>> existing
>>> F_FIRST program can be queried and logged; we know for sure the program
>>> doesn't belong to Cilium, and we have no interest in detaching it.
>> 
>> That's conflating the benefit of F_FIRST with that of bpf_link, though;
>> you can have the replace thing without the exclusive locking.
>> 
>>>>> See above on the issues w/o the first/last. How would you work around them
>>>>> in practice so they cannot happen?
>>>> By having an ordering configuration that is deterministic. Enforced by
>>>> the system-wide management daemon by whichever mechanism suits it. We
>>>> could implement a minimal reference policy agent that just reads a
>>>> config file in /etc somewhere, and *that* could implement FIRST/LAST
>>>> semantics.
>>> I think this particular perspective is what's deadlocking this discussion.
>>> To me, it looks like distros and hyperscalers are in the same boat with
>>> regards to the possibility of coordination between tools. Distros are only
>>> responsible for the tools they package themselves, and hyperscalers
>>> run a tight ship with mostly in-house tooling already. When it comes to
>>> projects out in the wild, that all goes out the window.
>> 
>> Not really: from the distro PoV we absolutely care about arbitrary
>> combinations of programs with different authors. Which is why I'm
>> arguing against putting anything into the kernel where the first program
>> to come along can just grab a hook and lock everyone out.
>> 
>> My assumption is basically this: A system administrator installs
>> packages A and B that both use the TC hook. The developers of A and B
>> have never heard about each other. It should be possible for that admin
>> to run A and B in whichever order they like, without making any changes
>> to A and B themselves.
>
> I would come with the point of view of the K8s cluster operator or platform
> engineer, if you will. Someone deeply familiar with K8s, but not necessarily
> knowing about kernel internals. I know my org needs to run container A and
> container B, so I'll deploy the daemon-sets for both and they get deployed
> into my cluster. That platform engineer might have never heard of BPF or might
> not even know that container A or container B ships software with BPF. As
> mentioned, K8s itself has no concept of Pod ordering as its paradigm is that
> everything is loosely coupled. We are now expecting that person to make a
> concrete decision about BPF kernel internals, namely in which order programs
> on the various hooks should be executed, given that if they don't, the system
> becomes non-deterministic. I think that is quite a big burden and a big ask.
> Eventually that person will say that he/she cannot make this technical decision
> and that only one of the two containers can be deployed. I agree with you that
> there should be an option for a technically versed person to be able to change
> ordering to avoid lock out, but I don't think it will fly to ask users to come
> up with their own policies for BPF software in the wild ... similar to how you
> probably don't want to have to deal with writing systemd unit files for
> software xyz before you can use your laptop. It's a burden. You expect this to
> magically work by default, and to make custom changes only if needed and for
> good reasons.
> Just the one difference is that the latter ships with the OS (a priori known /
> tight-ship analogy).

See my reply to Andrii: I'm not actually against having an API where an
application can say "please always run me first", I'm against the kernel
making a hard (UAPI) promise to honour that request.

>>> To circle back to the observability case: in offline discussions with
>>> Daniel,
>>> I've mentioned the need for 'shadow' progs that only collect data and
>>> pump it to user space, attached at specific points in the chain (still
>>> within tcx!).
>>> Their retcodes would be ignored, and context modifications would be
>>> rejected, so attaching multiple to the same hook can always succeed,
>>> much like cgroup multi. Consider the following:
>>>
>>> To attach a shadow prog before F_FIRST, a caller could use F_BEFORE |
>>> F_FIRST |
>>> F_RDONLY. Attaching between first and the 'relative' section: F_AFTER |
>>> F_FIRST |
>>> F_RDONLY, etc. The rdonly flag could even be made redundant if a new prog/
>>> attach type is added for progs like these.
>>>
>>> This is still perfectly possible to implement on top of Daniel's
>>> proposal, and
>>> to me looks like it could address many of the concerns around ordering of
>>> progs I've seen in this thread, many mention data exfiltration.
>> 
>> It may well be that semantics like this will turn out to be enough. Or
>> it may not (I personally believe we'll need something more expressive
>> still, and where the system admin has the option to override things; but
>> I may turn out to be wrong). Ultimately, my main point wrt this series
>> is that this kind of policy decision can be added later, and it's better
>> to merge the TCX infrastructure without it, instead of locking ourselves
>> into an API that is way too limited today. TCX (and in-kernel XDP
>> multiprog) has value without it, so let's merge that first and iterate
>> on the policy aspects.
>
> That's okay and I'll do that for v3 to move on.

Sounds good.

> I feel we might repeat the same discussion with no good solution for K8s
> users once we come back to this point again.

FWIW I do understand that we need to solve the problem for k8s as well,
and I'll try to get some people from RH who are working more with the
k8s side of things to look at this as well...

-Toke

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 19:03                           ` Andrii Nakryiko
@ 2023-06-10  2:52                             ` Daniel Xu
  0 siblings, 0 replies; 49+ messages in thread
From: Daniel Xu @ 2023-06-10  2:52 UTC (permalink / raw)
  To: Andrii Nakryiko, Stanislav Fomichev
  Cc: Daniel Borkmann, Toke Høiland-Jørgensen, Timo Beckers,
	Alexei Starovoitov, Andrii Nakryiko, martin.lau, razor,
	john.fastabend, Jakub Kicinski, joe, davem, bpf, netdev

Hi all,

On Sat, Jun 10, 2023, at 12:33 AM, Andrii Nakryiko wrote:
> On Fri, Jun 9, 2023 at 9:41 AM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Fri, Jun 9, 2023 at 7:15 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >
>> > On 6/9/23 3:11 PM, Toke Høiland-Jørgensen wrote:
>> > > Timo Beckers <timo@incline.eu> writes:
>> > >> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
>> > >>> Daniel Borkmann <daniel@iogearbox.net> writes:
>> > [...]
>> > >>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>> > >>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>> > >>>>>>>>>> would prevent the rest of the users.. (starting with only
>> > >>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>> > >>>>>>>>>> need first/last).
>> > >>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>> > >>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>> > >>>>>>>>> to guarantee that my program runs first and observes each event, I'll
>> > >>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>> > >>>>>>>>> then server setup is broken and my application cannot function.
>> > >>>>>>>>>
>> > >>>>>>>>> In a setup where we expect multiple applications to co-exist, it
>> > >>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>> > >>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>> > >>>>>>>>> and has to be reported to application owners.
>> > >>>>>>>>>
>> > >>>>>>>>> But it's not up to the kernel to enforce this cooperation by
>> > >>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>> > >>>>>>>>> for some applications, IMO.
>> > >>>>>>>> Maybe that's something that should be done by some other mechanism?
>> > >>>>>>>> (and as a follow up, if needed) Something akin to what Toke
>> > >>>>>>>> mentioned with another program doing sorting or similar.
>> > >>>>>>> The goal of this API is to avoid needing some extra special program to
>> > >>>>>>> do this sorting
>> > >>>>>>>
>> > >>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
>> > >>>>>>>> only we have two now, not u16.
>> > >>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>> > >>>>>>> course, but when they are needed, they will have no alternative.
>> > >>>>>>>
>> > >>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>> > >>>>>>> be the only one attached". Should we encourage such use cases? No, of
>> > >>>>>>> course. But I think it's fair  for users to be able to express this.
>> > >>>>>>>
>> > >>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>> > >>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>> > >>>>>>>> some issue, but it won't work because there is already a 'first' program
>> > >>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>> > >>>>>>> If your production setup requires that some important program has to
>> > >>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>> > >>>>>>> interfere with it (assuming that FIRST requirement is a real
>> > >>>>>>> requirement and not someone just thinking they need to be first; but
>> > >>>>>>> that's up to user space to decide). Maybe the solution for you in that
>> > >>>>>>> case would be freplace program installed on top of that stubborn FIRST
>> > >>>>>>> program? And if we are talking about local debugging and development,
>> > >>>>>>> then you are a sysadmin and you should be able to force-detach that
>> > >>>>>>> program that is getting in the way.
>> > >>>>>> I'm not really concerned about our production environment. It's pretty
>> > >>>>>> controlled and restricted and I'm pretty certain we can avoid doing
>> > >>>>>> something stupid. Probably the same for your env.
>> > >>>>>>
>> > >>>>>> I'm mostly fantasizing about upstream world where different users don't
>> > >>>>>> know about each other and start doing stupid things like F_FIRST where
>> > >>>>>> they don't really have to be first. It's that "used judiciously" part
>> > >>>>>> that I'm a bit skeptical about :-D
>> > >>>> But in the end how is that different from just attaching themselves blindly
>> > >>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
>> > >>>> of the current first program) - same, they don't really have to be first.
>> > >>>> How would that not result in doing something stupid? ;) To add to Andrii's
>> > >>>> earlier DDoS mitigation example ... think of K8s environment: one project
>> > >>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
>> > >>>> sample traffic to user space with BPF. Both install as first position by
>> > >>>> default (before + 0). In K8s, there is no built-in Pod dependency management
>> > >>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
>> > >>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
>> > >>>> and on some other nodes it's vice versa. The other case where this gets
>> > >>>> broken (assuming a node where we get first the DDoS mitigation, then the
>> > >>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
>> > >>>> gets a new stable update and is being re-rolled out, then it inserts
>> > >>>> itself before the DDoS mitigation mechanism, potentially causing outage.
>> > >>>> With the first/last mechanism these two situations cannot happen. The DDoS
>> > >>>> mitigation software uses first and the monitoring uses before + 0, then no
>> > >>>> matter the re-rollouts or the ordering in which Pods come up, it's always
>> > >>>> at the expected/correct location.
>> > >>> I'm not disputing that these kinds of policy issues need to be solved
>> > >>> somehow. But adding the first/last pinning as part of the kernel hooks
>> > >>> doesn't solve the policy problem, it just hard-codes a solution for one
>> > >>> particular instance of the problem.
>> > >>>
>> > >>> Taking your example from above, what happens when someone wants to
>> > >>> deploy those tools in reverse order? Say the monitoring tool counts
>> > >>> packets and someone wants to also count the DDOS traffic; but the DDOS
>> > >>> protection tool has decided for itself (by setting the FIRST) flag that
>> > >>> it can *only* run as the first program, so there is no way to achieve
>> > >>> this without modifying the application itself.
>> > >>>
>> > >>>>>> Because even with this new ordering scheme, there still should be
>> > >>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
>> > >>>>>> And if it does the ordering, I don't really see why we need
>> > >>>>>> F_FIRST/F_LAST.
>> > >>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
>> > >>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
>> > >> It's in the prisoners' best interest to collaborate (and they do! see
>> > >> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
>> > >> prio system is limiting and turns out to be really fragile in practice.
>> > >>
>> > >> If your tool wants to attach to tc prio 1 and there's already a prog
>> > >> attached,
>> > >> the most reliable option is basically to blindly replace the attachment,
>> > >> unless
>> > >> you have the possibility to inspect the attached prog and try to figure
>> > >> out if it
>> > >> belongs to another tool. This is fragile in and of itself, and only
>> > >> possible on
>> > >> more recent kernels iirc.
>> > >>
>> > >> With tcx, Cilium could make an initial attachment using F_FIRST and simply
>> > >> update a link at well-known path on subsequent startups. If there's no
>> > >> existing
>> > >> link, and F_FIRST is taken, bail out with an error. The owner of the
>> > >> existing
>> > >> F_FIRST program can be queried and logged; we know for sure the program
>> > >> doesn't belong to Cilium, and we have no interest in detaching it.
>> > >
>> > > That's conflating the benefit of F_FIRST with that of bpf_link, though;
>> > > you can have the replace thing without the exclusive locking.
>> > >
>> > >>>> See above on the issues w/o the first/last. How would you work around them
>> > >>>> in practice so they cannot happen?
>> > >>> By having an ordering configuration that is deterministic. Enforced by
>> > >>> the system-wide management daemon by whichever mechanism suits it. We
>> > >>> could implement a minimal reference policy agent that just reads a
>> > >>> config file in /etc somewhere, and *that* could implement FIRST/LAST
>> > >>> semantics.
>> > >> I think this particular perspective is what's deadlocking this discussion.
>> > >> To me, it looks like distros and hyperscalers are in the same boat with
>> > >> regards to the possibility of coordination between tools. Distros are only
>> > >> responsible for the tools they package themselves, and hyperscalers
>> > >> run a tight ship with mostly in-house tooling already. When it comes to
>> > >> projects out in the wild, that all goes out the window.
>> > >
>> > > Not really: from the distro PoV we absolutely care about arbitrary
>> > > combinations of programs with different authors. Which is why I'm
>> > > arguing against putting anything into the kernel where the first program
>> > > to come along can just grab a hook and lock everyone out.
>> > >
>> > > My assumption is basically this: A system administrator installs
>> > > packages A and B that both use the TC hook. The developers of A and B
>> > > have never heard about each other. It should be possible for that admin
>> > > to run A and B in whichever order they like, without making any changes
>> > > to A and B themselves.
>> >
>> > I would come with the point of view of the K8s cluster operator or platform
>> > engineer, if you will. Someone deeply familiar with K8s, but not necessarily
>> > knowing about kernel internals. I know my org needs to run container A and
>> > container B, so I'll deploy the daemon-sets for both and they get deployed
>> > into my cluster. That platform engineer might have never heard of BPF or might
>> > not even know that container A or container B ships software with BPF. As
>> > mentioned, K8s itself has no concept of Pod ordering as its paradigm is that
>> > everything is loosely coupled. We are now expecting that person to make a
>> > concrete decision about BPF kernel internals, namely in which order programs
>> > on the various hooks should be executed, given that if they don't, the system
>> > becomes non-deterministic. I think that is quite a big burden and a big ask.
>> > Eventually that person will say that he/she cannot make this technical decision
>> > and that only one of the two containers can be deployed. I agree with you that
>> > there should be an option for a technically versed person to be able to change
>> > ordering to avoid lock out, but I don't think it will fly to ask users to come
>> > up with their own policies for BPF software in the wild ... similar to how you
>> > probably don't want to have to deal with writing systemd unit files for
>> > software xyz before you can use your laptop. It's a burden. You expect this to
>> > magically work by default, and to make custom changes only if needed and for
>> > good reasons.
>> > Just the one difference is that the latter ships with the OS (a priori known /
>> > tight-ship analogy).
>> >
>> > >> Regardless of merit or feasibility of a system-wide bpf management
>> > >> daemon for k8s, there _is no ordering configuration possible_. K8s is not
>> > >> a distro where package maintainers (or anyone else, really) can coordinate
>> > >> on correctly defining priority of each of the tools they ship. This is
>> > >> effectively
>> > >> the prisoner's dilemma. I feel like most of the discussion so far has been
>> > >> very hand-wavy in 'user space should solve it'. Well, we are user space, and
>> > >> we're here trying to solve it. :)
>> > >>
>> > >> A hypothetical policy/gatekeeper/ordering daemon doesn't possess
>> > >> implicit knowledge about which program needs to go where in the chain,
>> > >> nor is there an obvious heuristic about how to order things. Maintaining
>> > >> such a configuration for all cloud-native tooling out there that possibly
>> > >> uses bpf is simply impossible, as even a tool like Cilium can change
>> > >> dramatically from one release to the next. Having to manage this too
>> > >> would put a significant burden on velocity and flexibility for arguably
>> > >> little benefit to the user.
>> > >>
>> > >> So, daemon/kernel will need to be told how to order things, preferably by
>> > >> the tools (Cilium/datadog-agent) themselves, since the user/admin of the
>> > >> system cannot be expected to know where to position the hundreds of progs
>> > >> loaded by Cilium and how they might interfere with other tools. Figuring
>> > >> this out is the job of the tool, daemon or not.
>> > >>
>> > >> The prisoners _must_ communicate (so, not abuse F_FIRST) for things to
>> > >> work correctly, and it's 100% in their best interest in doing so. Let's not
>> > >> pretend like we're able to solve game theory on this mailing list. :)
>> > >> We'll have to settle for the next-best thing: give user space a safe and
>> > >> clear
>> > >> API to allow it to coordinate and make the right decisions.
>> > >
>> > > But "always first" is not a meaningful concept. It's just what we have
>> > > today (everyone picks priority 1), except now if there are two programs
>> > > that want the same hook, it will be the first program that wins the
>> > > contest (by locking the second one out), instead of the second program
>> > > winning (by overriding the first one) as is the case with the silent
>> > > override semantics we have with TC today. So we haven't solved the
>> > > problem, we've just shifted the breakage.
>> >
>> > Fwiw, it's deterministic, and I think this is 1000x better than silently
>> > having a non-deterministic deployment where the two programs ship with
>> > before + 0. That is much harder to debug.
>> >
>> > >> To circle back to the observability case: in offline discussions with
>> > >> Daniel,
>> > >> I've mentioned the need for 'shadow' progs that only collect data and
>> > >> pump it to user space, attached at specific points in the chain (still
>> > >> within tcx!).
>> > >> Their retcodes would be ignored, and context modifications would be
>> > >> rejected, so attaching multiple to the same hook can always succeed,
>> > >> much like cgroup multi. Consider the following:
>> > >>
>> > >> To attach a shadow prog before F_FIRST, a caller could use F_BEFORE |
>> > >> F_FIRST |
>> > >> F_RDONLY. Attaching between first and the 'relative' section: F_AFTER |
>> > >> F_FIRST |
>> > >> F_RDONLY, etc. The rdonly flag could even be made redundant if a new prog/
>> > >> attach type is added for progs like these.
>> > >>
>> > >> This is still perfectly possible to implement on top of Daniel's
>> > >> proposal, and
>> > >> to me looks like it could address many of the concerns around ordering of
>> > >> progs I've seen in this thread, many mention data exfiltration.
>> > >
>> > > It may well be that semantics like this will turn out to be enough. Or
>> > > it may not (I personally believe we'll need something more expressive
>> > > still, and where the system admin has the option to override things; but
>> > > I may turn out to be wrong). Ultimately, my main point wrt this series
>> > > is that this kind of policy decision can be added later, and it's better
>> > > to merge the TCX infrastructure without it, instead of locking ourselves
>> > > into an API that is way too limited today. TCX (and in-kernel XDP
>> > > multiprog) has value without it, so let's merge that first and iterate
>> > > on the policy aspects.
>> >
>> > That's okay and I'll do that for v3 to move on.
>> >
>> > I feel we might repeat the same discussion with no good solution for K8s
>> > users once we come back to this point again.
>>
>> With your cilium vs ddos example, maybe all we really need is for the
>> program to have some signal about whether it's ok to have somebody
>> modify/drop the packets before it?
>> For example, the verifier, depending on whether it sees that the
>> program writes to the data, uses some helpers, or returns
>> TC_ACT_SHOT/etc can classify the program as readonly or non-readonly.
>> And then, we'll have some extra flag during program load/attach that
>> cilium will pass to express "I'm not ok with having a non-readonly
>> program before me".
>
> So this is what Timo is proposing with F_READONLY. And I agree, that
> makes sense and we've discussed the need for something like this
> internally. Specific use case was setsockopt programs. Sometimes they
> should just observe, and we'd like to enforce that.
>
> Once we have this F_READONLY flag support and enforce that during BPF
> program validation, then "I'm not ok with having a non-readonly
> program before me" is exactly F_FIRST. We just say that the F_READONLY
> program can be inserted anywhere because it has no effect on the state
> of the system.

I have a different use case for something like F_READONLY. Basically I would
like to be able to accept precompiled BPF progs from semi-trusted sources
and run / attach the prog in a trusted context. An example could be telling the customer:
"give me a prog that you'd like to run against every packet that enters your network
and I will orchestrate / distribute it across your infrastructure". F_READONLY could be
used as one of the mechanisms to uphold invariants like not being able to bring
down the network.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
       [not found]                           ` <20230610022721.2950602-1-prankgup@fb.com>
@ 2023-06-10  3:37                             ` Alexei Starovoitov
  0 siblings, 0 replies; 49+ messages in thread
From: Alexei Starovoitov @ 2023-06-10  3:37 UTC (permalink / raw)
  To: Prankur gupta
  Cc: Andrii Nakryiko, Andrii Nakryiko, Alexei Starovoitov, bpf,
	Daniel Borkmann, David S. Miller, Daniel Xu, Joe Stringer,
	John Fastabend, Jakub Kicinski, Martin KaFai Lau,
	Network Development, Nikolay Aleksandrov, Stanislav Fomichev,
	Timo Beckers, Toke Høiland-Jørgensen, prankur.07

On Fri, Jun 9, 2023 at 8:03 PM Prankur gupta <prankgup@fb.com> wrote:
>
> >>
> >> Me, Daniel, Timo are arguing that there are real situations where you
> >> have to be first or need to die.
> >
> > afaik out of all xdp and tc progs there is not a single prog in the fb fleet
> > that has to be first.
> > fb's ddos and firewall don't have to be first.
> > cilium and datadog progs don't have to be first either.
> > The race between cilium and datadog was not the race to the first position,
> > but the conflict due to the same prio.
> > In all cases, I'm aware, prog owners care a lot about ordering,
> > but never about strict first.
>
> One use case which we actively rely on in the Meta (fb) fleet is avoiding double writers for
> cgroup/sockops BPF programs. For example, we can have multiple BPF programs setting
> the skops->reply field and stepping on each other; e.g. for the ECN callback
> one program can set it to 1 and the other can set it to 0.
> We do that by creating a pre-func and a post-func before
> executing the sockops BPF programs in our custom-built chainer.
>
> We want these functions to be executed first and last, respectively, which is what
> makes the above functionality useful for us.
>
> A hypothetical use case for cgroup/sockops: the middle BPF programs will not set skops->reply,
> and the final BPF program, based on the results from each of the middle
> BPF programs, can set the appropriate value in skops->reply, thus making sure all the middle
> programs executed and the final reply is correct.
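
To make the double-writer problem above concrete, here is a minimal sketch of two
independently attached sockops programs both answering the ECN callback; whichever
runs last wins on skops->reply, which is exactly the conflict described:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* prog_a: wants ECN enabled. */
  SEC("sockops")
  int enable_ecn(struct bpf_sock_ops *skops)
  {
          if (skops->op == BPF_SOCK_OPS_NEEDS_ECN)
                  skops->reply = 1;
          return 1;
  }

  /* prog_b: attached to the same cgroup with BPF_F_ALLOW_MULTI,
   * silently overrides prog_a's answer. */
  SEC("sockops")
  int disable_ecn(struct bpf_sock_ops *skops)
  {
          if (skops->op == BPF_SOCK_OPS_NEEDS_ECN)
                  skops->reply = 0;
          return 1;
  }

  char _license[] SEC("license") = "GPL";
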

cgroup progs are more complicated than a simple list of progs in tc/xdp.
It is not really possible for the kernel to guarantee absolute last and first
in a hierarchy of cgroups. In theory that's possible within a cgroup,
but not when children and parents are involved and progs can be
attached anywhere in the hierarchy and we need to keep
uapi of BPF_F_ALLOW_OVERRIDE, BPF_F_ALLOW_MULTI intact.
The absolute first/last is not the answer for this skops issue.
A different solution is necessary.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-09 14:15                       ` Daniel Borkmann
                                           ` (2 preceding siblings ...)
  2023-06-09 20:28                         ` Toke Høiland-Jørgensen
@ 2023-06-12 11:21                         ` Dave Tucker
  2023-06-12 12:43                           ` Daniel Borkmann
  3 siblings, 1 reply; 49+ messages in thread
From: Dave Tucker @ 2023-06-12 11:21 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Toke Høiland-Jørgensen, Timo Beckers,
	Stanislav Fomichev, Andrii Nakryiko, ast, andrii, martin.lau,
	razor, john.fastabend, kuba, dxu, joe, davem, bpf, netdev



> On 9 Jun 2023, at 15:15, Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
> On 6/9/23 3:11 PM, Toke Høiland-Jørgensen wrote:
>> Timo Beckers <timo@incline.eu> writes:
>>> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
> [...]
>>>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>>>>>>> would prevent the rest of the users.. (starting with only
>>>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>>>>>>> need first/laste).
>>>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>>>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>>>>>>> then server setup is broken and my application cannot function.
>>>>>>>>>> 
>>>>>>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>>>>>>> and has to be reported to application owners.
>>>>>>>>>> 
>>>>>>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>>>>>>> for some applications, IMO.
>>>>>>>>> Maybe that's something that should be done by some other mechanism?
>>>>>>>>> (and as a follow up, if needed) Something akin to what Toke
>>>>>>>>> mentioned with another program doing sorting or similar.
>>>>>>>> The goal of this API is to avoid needing some extra special program to
>>>>>>>> do this sorting
>>>>>>>> 
>>>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>>>>>>> only we have two now, not u16.
>>>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>>>>>>> course, but when they are needed, they will have no alternative.
>>>>>>>> 
>>>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>>>>>>> be the only one attached". Should we encourage such use cases? No, of
>>>>>>>> course. But I think it's fair  for users to be able to express this.
>>>>>>>> 
>>>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>>>>>>> some issue, but it won't work because there is already a 'first' program
>>>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>>>>>> If your production setup requires that some important program has to
>>>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>>>>>>> interfere with it (assuming that FIRST requirement is a real
>>>>>>>> requirement and not someone just thinking they need to be first; but
>>>>>>>> that's up to user space to decide). Maybe the solution for you in that
>>>>>>>> case would be freplace program installed on top of that stubborn FIRST
>>>>>>>> program? And if we are talking about local debugging and development,
>>>>>>>> then you are a sysadmin and you should be able to force-detach that
>>>>>>>> program that is getting in the way.
>>>>>>> I'm not really concerned about our production environment. It's pretty
>>>>>>> controlled and restricted and I'm pretty certain we can avoid doing
>>>>>>> something stupid. Probably the same for your env.
>>>>>>> 
>>>>>>> I'm mostly fantasizing about upstream world where different users don't
>>>>>>> know about each other and start doing stupid things like F_FIRST where
>>>>>>> they don't really have to be first. It's that "used judiciously" part
>>>>>>> that I'm a bit skeptical about :-D
>>>>> But in the end how is that different from just attaching themselves blindly
>>>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
>>>>> of the current first program) - same, they don't really have to be first.
>>>>> How would that not result in doing something stupid? ;) To add to Andrii's
>>>>> earlier DDoS mitigation example ... think of K8s environment: one project
>>>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
>>>>> sample traffic to user space with BPF. Both install as first position by
>>>>> default (before + 0). In K8s, there is no built-in Pod dependency management
>>>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
>>>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
>>>>> and on some other nodes it's vice versa. The other case where this gets
>>>>> broken (assuming a node where we get first the DDoS mitigation, then the
>>>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
>>>>> gets a new stable update and is being re-rolled out, then it inserts
>>>>> itself before the DDoS mitigation mechanism, potentially causing outage.
>>>>> With the first/last mechanism these two situations cannot happen. The DDoS
>>>>> mitigation software uses first and the monitoring uses before + 0, then no
>>>>> matter the re-rollouts or the ordering in which Pods come up, it's always
>>>>> at the expected/correct location.
>>>> I'm not disputing that these kinds of policy issues need to be solved
>>>> somehow. But adding the first/last pinning as part of the kernel hooks
>>>> doesn't solve the policy problem, it just hard-codes a solution for one
>>>> particular instance of the problem.
>>>> 
>>>> Taking your example from above, what happens when someone wants to
>>>> deploy those tools in reverse order? Say the monitoring tool counts
>>>> packets and someone wants to also count the DDOS traffic; but the DDOS
>>>> protection tool has decided for itself (by setting the FIRST) flag that
>>>> it can *only* run as the first program, so there is no way to achieve
>>>> this without modifying the application itself.
>>>> 
>>>>>>> Because even with this new ordering scheme, there still should be
>>>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
>>>>>>> And if it does the ordering, I don't really see why we need
>>>>>>> F_FIRST/F_LAST.
>>>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
>>>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
>>> It's in the prisoners' best interest to collaborate (and they do! see
>>> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
>>> prio system is limiting and turns out to be really fragile in practice.
>>> 
>>> If your tool wants to attach to tc prio 1 and there's already a prog
>>> attached,
>>> the most reliable option is basically to blindly replace the attachment,
>>> unless
>>> you have the possibility to inspect the attached prog and try to figure
>>> out if it
>>> belongs to another tool. This is fragile in and of itself, and only
>>> possible on
>>> more recent kernels iirc.
>>> 
>>> With tcx, Cilium could make an initial attachment using F_FIRST and simply
>>> update a link at well-known path on subsequent startups. If there's no
>>> existing
>>> link, and F_FIRST is taken, bail out with an error. The owner of the
>>> existing
>>> F_FIRST program can be queried and logged; we know for sure the program
>>> doesn't belong to Cilium, and we have no interest in detaching it.
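
Roughly, that startup flow could look like the sketch below. bpf_link__open(),
bpf_link__update_program() and bpf_link__pin() are existing libbpf APIs, whereas
bpf_tcx_opts/bpf_program__attach_tcx and BPF_F_FIRST are only what this series
proposes; the pin path, ifindex and new_prog are assumed to come from the caller:

  int err = 0;
  struct bpf_link *link = bpf_link__open("/sys/fs/bpf/cilium/tcx_ingress");

  if (link) {
          /* Subsequent startup: atomically swap in the new program,
           * the link keeps its position in the chain. */
          err = bpf_link__update_program(link, new_prog);
  } else {
          /* First startup: take the first slot and pin the link at a
           * well-known path. If F_FIRST is already taken, this fails
           * and we bail out, logging the owner of the existing prog. */
          LIBBPF_OPTS(bpf_tcx_opts, opts, .flags = BPF_F_FIRST);

          link = bpf_program__attach_tcx(new_prog, ifindex, &opts);
          if (!link)
                  return -errno;
          err = bpf_link__pin(link, "/sys/fs/bpf/cilium/tcx_ingress");
  }
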
>> That's conflating the benefit of F_FIRST with that of bpf_link, though;
>> you can have the replace thing without the exclusive locking.
>>>>> See above on the issues w/o the first/last. How would you work around them
>>>>> in practice so they cannot happen?
>>>> By having an ordering configuration that is deterministic. Enforced by
>>>> the system-wide management daemon by whichever mechanism suits it. We
>>>> could implement a minimal reference policy agent that just reads a
>>>> config file in /etc somewhere, and *that* could implement FIRST/LAST
>>>> semantics.
>>> I think this particular perspective is what's deadlocking this discussion.
>>> To me, it looks like distros and hyperscalers are in the same boat with
>>> regards to the possibility of coordination between tools. Distros are only
>>> responsible for the tools they package themselves, and hyperscalers
>>> run a tight ship with mostly in-house tooling already. When it comes to
>>> projects out in the wild, that all goes out the window.
>> Not really: from the distro PoV we absolutely care about arbitrary
>> combinations of programs with different authors. Which is why I'm
>> arguing against putting anything into the kernel where the first program
>> to come along can just grab a hook and lock everyone out.
>> My assumption is basically this: A system administrator installs
>> packages A and B that both use the TC hook. The developers of A and B
>> have never heard about each other. It should be possible for that admin
>> to run A and B in whichever order they like, without making any changes
>> to A and B themselves.
> 
> I would come with the point of view of the K8s cluster operator or platform
> engineer, if you will. Someone deeply familiar with K8s, but not necessarily
> knowing about kernel internals. I know my org needs to run container A and
> container B, so I'll deploy the daemon-sets for both and they get deployed
> into my cluster. That platform engineer might have never heard of BPF or might
> not even know that container A or container B ships software with BPF. As
> mentioned, K8s itself has no concept of Pod ordering as its paradigm is that
> everything is loosely coupled. We are now expecting from that person to make
> a concrete decision about some BPF kernel internals on various hooks in which
> order they should be executed given if they don't then the system becomes
> non-deterministic. I think that is quite a big burden and ask to understand.
> Eventually that person will say that he/she cannot make this technical decision
> and that only one of the two containers can be deployed. I agree with you that
> there should be an option for a technically versed person to be able to change
> ordering to avoid lock out, but I don't think it will fly asking users to come
> up on their own with policies of BPF software in the wild ... similar as you
> probably don't want having to deal with writing systemd unit files for software
> xyz before you can use your laptop. It's a burden. You expect this to magically
> work by default and only if needed for good reasons to make custom changes.
> Just the one difference is that the latter ships with the OS (a priori known /
> tight-ship analogy).

Speaking as someone deeply familiar with the K8s side of the equation, I think you’re greatly oversimplifying.

You can’t just “run a daemon-set” for eBPF-enabled software and expect it to work.
<Insert Boromir stating “One does not simply walk into Mordor”>

First off, you need to find out which privileges it needs.

Just CAP_BPF? Pttf nope.
Depending on the program type, it will likely need more, up to and including CAP_SYS_ADMIN.
Scary stuff.

Beyond that, you’ll also need some “special” paths from the host mounted into your container
for vmlinux, tracefs maybe even a bpffs etc…

Furthermore, all of these things above are usually restricted/discouraged in most K8s distros
so you need to wade into the depths of how to disable these protections.

The poor platform engineer in this case will be forced to learn all of these concepts on-the-fly.
So the assumption of them being oblivious to eBPF being run in their cluster should be dismissed.

Clearly explaining the following in documentation would make coming up with policies much easier:
1. Which priority you choose if not instructed otherwise via configuration
2. The risks of attaching other programs ahead/behind this one
3. The risks of having a conflicting priority with another application

Even from the bpf-enabled software vendor standpoint, the status-quo is annoying because you’ll
need to provide recipes to deploy your software on every different K8s distro.

I’ve been working on bpfd [1] + its kube integration for the past year to solve these problems
for users/vendors.

From a kernel standpoint, give me an array that does something like this:
- If no priority is provided, pick the first free slot from the upper 16 bits of the priority range
- If a priority is provided, attach at that priority
- On conflict, use flags to decide what to do, where the options are something like:
  - BPF_F_ERR_ON_CONFLICT
  - BPF_F_ASSIGN_ON_CONFLICT

That solves the immediate problem since, given a block of u32 priorities, I’m sure affected
vendors can pick one within the lower 16 bits that would produce the desired ordering.
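
As a sketch only - neither a priority field nor these conflict flags exist in the
kernel or libbpf today, so every name below beyond the opts helper is part of the
proposal rather than a real API:

  /* Hypothetical attach with an explicit priority and a conflict policy. */
  LIBBPF_OPTS(bpf_prog_attach_opts, opts,
          .flags = BPF_F_ERR_ON_CONFLICT   /* or BPF_F_ASSIGN_ON_CONFLICT */
          /* , .priority = 0: kernel picks the first free slot from the
           *   upper 16 bits of the priority range */
  );

  err = bpf_prog_attach_opts(prog_fd, ifindex, BPF_TCX_INGRESS, &opts);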

As for how this works with a system daemon (and by extension in K8s), I’m of the
opinion that the only viable option is to move program load and
attachment to some other API, be it varlink, gRPC, or the K8s API.

It’s at that layer that policy decisions about priority are made and the kernel semantics
can remain as above.

>>> Regardless of merit or feasability of a system-wide bpf management
>>> daemon for k8s, there _is no ordering configuration possible_. K8s is not
>>> a distro where package maintainers (or anyone else, really) can coordinate
>>> on correctly defining priority of each of the tools they ship. This is
>>> effectively
>>> the prisoner's dilemma. I feel like most of the discussion so far has been
>>> very hand-wavy in 'user space should solve it'. Well, we are user space, and
>>> we're here trying to solve it. :)
>>> 
>>> A hypothetical policy/gatekeeper/ordering daemon doesn't possess
>>> implicit knowledge about which program needs to go where in the chain,
>>> nor is there an obvious heuristic about how to order things. Maintaining
>>> such a configuration for all cloud-native tooling out there that possibly
>>> uses bpf is simply impossible, as even a tool like Cilium can change
>>> dramatically from one release to the next. Having to manage this too
>>> would put a significant burden on velocity and flexibility for arguably
>>> little benefit to the user.
>>> So, daemon/kernel will need to be told how to order things, preferably by
>>> the tools (Cilium/datadog-agent) themselves, since the user/admin of the
>>> system cannot be expected to know where to position the hundreds of progs
>>> loaded by Cilium and how they might interfere with other tools. Figuring
>>> this out is the job of the tool, daemon or not.

I’m sorry but again I have to strongly disagree here.

Tools can provide hints at where they should be placed in a chain of programs, but
that eventual placement should always be down to the user.

The examples you’ve cited are large, specialised applications… but consider for a moment how
this works for smaller programs.

Let’s say you’ve got 3 programs:
- Firewall
- Load-Balancer
- Packet Logger

There are 6 ways that I can order these programs, each of which will have a very different effect.
How can any of these tools individually understand what the user actually wants?

- Dave

[1]: https://github.com/bpfd-dev/bpfd

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs
  2023-06-12 11:21                         ` Dave Tucker
@ 2023-06-12 12:43                           ` Daniel Borkmann
  0 siblings, 0 replies; 49+ messages in thread
From: Daniel Borkmann @ 2023-06-12 12:43 UTC (permalink / raw)
  To: Dave Tucker
  Cc: Toke Høiland-Jørgensen, Timo Beckers,
	Stanislav Fomichev, Andrii Nakryiko, ast, andrii, martin.lau,
	razor, john.fastabend, kuba, dxu, joe, davem, bpf, netdev

On 6/12/23 1:21 PM, Dave Tucker wrote:
>> On 9 Jun 2023, at 15:15, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 6/9/23 3:11 PM, Toke Høiland-Jørgensen wrote:
>>> Timo Beckers <timo@incline.eu> writes:
>>>> On 6/9/23 13:04, Toke Høiland-Jørgensen wrote:
>>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>> [...]
>>>>>>>>>>>> I'm still not sure whether the hard semantics of first/last is really
>>>>>>>>>>>> useful. My worry is that some prog will just use BPF_F_FIRST which
>>>>>>>>>>>> would prevent the rest of the users.. (starting with only
>>>>>>>>>>>> F_BEFORE/F_AFTER feels 'safer'; we can iterate later on if we really
>>>>>>>>>>>> need first/laste).
>>>>>>>>>>> Without FIRST/LAST some scenarios cannot be guaranteed to be safely
>>>>>>>>>>> implemented. E.g., if I have some hard audit requirements and I need
>>>>>>>>>>> to guarantee that my program runs first and observes each event, I'll
>>>>>>>>>>> enforce BPF_F_FIRST when attaching it. And if that attachment fails,
>>>>>>>>>>> then server setup is broken and my application cannot function.
>>>>>>>>>>>
>>>>>>>>>>> In a setup where we expect multiple applications to co-exist, it
>>>>>>>>>>> should be a rule that no one is using FIRST/LAST (unless it's
>>>>>>>>>>> absolutely required). And if someone doesn't comply, then that's a bug
>>>>>>>>>>> and has to be reported to application owners.
>>>>>>>>>>>
>>>>>>>>>>> But it's not up to the kernel to enforce this cooperation by
>>>>>>>>>>> disallowing FIRST/LAST semantics, because that semantics is critical
>>>>>>>>>>> for some applications, IMO.
>>>>>>>>>> Maybe that's something that should be done by some other mechanism?
>>>>>>>>>> (and as a follow up, if needed) Something akin to what Toke
>>>>>>>>>> mentioned with another program doing sorting or similar.
>>>>>>>>> The goal of this API is to avoid needing some extra special program to
>>>>>>>>> do this sorting
>>>>>>>>>
>>>>>>>>>> Otherwise, those first/last are just plain simple old priority bands;
>>>>>>>>>> only we have two now, not u16.
>>>>>>>>> I think it's different. FIRST/LAST has to be used judiciously, of
>>>>>>>>> course, but when they are needed, they will have no alternative.
>>>>>>>>>
>>>>>>>>> Also, specifying FIRST + LAST is the way to say "I want my program to
>>>>>>>>> be the only one attached". Should we encourage such use cases? No, of
>>>>>>>>> course. But I think it's fair  for users to be able to express this.
>>>>>>>>>
>>>>>>>>>> I'm mostly coming from the observability point: imagine I have my fancy
>>>>>>>>>> tc_ingress_tcpdump program that I want to attach as a first program to debug
>>>>>>>>>> some issue, but it won't work because there is already a 'first' program
>>>>>>>>>> installed.. Or the assumption that I'd do F_REPLACE | F_FIRST ?
>>>>>>>>> If your production setup requires that some important program has to
>>>>>>>>> be FIRST, then yeah, your "let me debug something" program shouldn't
>>>>>>>>> interfere with it (assuming that FIRST requirement is a real
>>>>>>>>> requirement and not someone just thinking they need to be first; but
>>>>>>>>> that's up to user space to decide). Maybe the solution for you in that
>>>>>>>>> case would be freplace program installed on top of that stubborn FIRST
>>>>>>>>> program? And if we are talking about local debugging and development,
>>>>>>>>> then you are a sysadmin and you should be able to force-detach that
>>>>>>>>> program that is getting in the way.
>>>>>>>> I'm not really concerned about our production environment. It's pretty
>>>>>>>> controlled and restricted and I'm pretty certain we can avoid doing
>>>>>>>> something stupid. Probably the same for your env.
>>>>>>>>
>>>>>>>> I'm mostly fantasizing about upstream world where different users don't
>>>>>>>> know about each other and start doing stupid things like F_FIRST where
>>>>>>>> they don't really have to be first. It's that "used judiciously" part
>>>>>>>> that I'm a bit skeptical about :-D
>>>>>> But in the end how is that different from just attaching themselves blindly
>>>>>> into the first position (e.g. with before and relative_fd as 0 or the fd/id
>>>>>> of the current first program) - same, they don't really have to be first.
>>>>>> How would that not result in doing something stupid? ;) To add to Andrii's
>>>>>> earlier DDoS mitigation example ... think of K8s environment: one project
>>>>>> is implementing DDoS mitigation with BPF, another one wants to monitor/
>>>>>> sample traffic to user space with BPF. Both install as first position by
>>>>>> default (before + 0). In K8s, there is no built-in Pod dependency management
>>>>>> so you cannot guarantee whether Pod A comes up before Pod B. So you'll end
>>>>>> up in a situation where sometimes the monitor runs before the DDoS mitigation
>>>>>> and on some other nodes it's vice versa. The other case where this gets
>>>>>> broken (assuming a node where we get first the DDoS mitigation, then the
>>>>>> monitoring) is when you need to upgrade one of the Pods: monitoring Pod
>>>>>> gets a new stable update and is being re-rolled out, then it inserts
>>>>>> itself before the DDoS mitigation mechanism, potentially causing outage.
>>>>>> With the first/last mechanism these two situations cannot happen. The DDoS
>>>>>> mitigation software uses first and the monitoring uses before + 0, then no
>>>>>> matter the re-rollouts or the ordering in which Pods come up, it's always
>>>>>> at the expected/correct location.
>>>>> I'm not disputing that these kinds of policy issues need to be solved
>>>>> somehow. But adding the first/last pinning as part of the kernel hooks
>>>>> doesn't solve the policy problem, it just hard-codes a solution for one
>>>>> particular instance of the problem.
>>>>>
>>>>> Taking your example from above, what happens when someone wants to
>>>>> deploy those tools in reverse order? Say the monitoring tool counts
>>>>> packets and someone wants to also count the DDOS traffic; but the DDOS
>>>>> protection tool has decided for itself (by setting the FIRST) flag that
>>>>> it can *only* run as the first program, so there is no way to achieve
>>>>> this without modifying the application itself.
>>>>>
>>>>>>>> Because even with this new ordering scheme, there still should be
>>>>>>>> some entity to do relative ordering (systemd-style, maybe CNI?).
>>>>>>>> And if it does the ordering, I don't really see why we need
>>>>>>>> F_FIRST/F_LAST.
>>>>>>> I can see I'm a bit late to the party, but FWIW I agree with this:
>>>>>>> FIRST/LAST will definitely be abused if we add it. It also seems to me
>>>> It's in the prisoners' best interest to collaborate (and they do! see
>>>> https://www.youtube.com/watch?v=YK7GyEJdJGo), except the current
>>>> prio system is limiting and turns out to be really fragile in practice.
>>>>
>>>> If your tool wants to attach to tc prio 1 and there's already a prog
>>>> attached,
>>>> the most reliable option is basically to blindly replace the attachment,
>>>> unless
>>>> you have the possibility to inspect the attached prog and try to figure
>>>> out if it
>>>> belongs to another tool. This is fragile in and of itself, and only
>>>> possible on
>>>> more recent kernels iirc.
>>>>
>>>> With tcx, Cilium could make an initial attachment using F_FIRST and simply
>>>> update a link at well-known path on subsequent startups. If there's no
>>>> existing
>>>> link, and F_FIRST is taken, bail out with an error. The owner of the
>>>> existing
>>>> F_FIRST program can be queried and logged; we know for sure the program
>>>> doesn't belong to Cilium, and we have no interest in detaching it.
>>> That's conflating the benefit of F_FIRST with that of bpf_link, though;
>>> you can have the replace thing without the exclusive locking.
>>>>>> See above on the issues w/o the first/last. How would you work around them
>>>>>> in practice so they cannot happen?
>>>>> By having an ordering configuration that is deterministic. Enforced by
>>>>> the system-wide management daemon by whichever mechanism suits it. We
>>>>> could implement a minimal reference policy agent that just reads a
>>>>> config file in /etc somewhere, and *that* could implement FIRST/LAST
>>>>> semantics.
>>>> I think this particular perspective is what's deadlocking this discussion.
>>>> To me, it looks like distros and hyperscalers are in the same boat with
>>>> regards to the possibility of coordination between tools. Distros are only
>>>> responsible for the tools they package themselves, and hyperscalers
>>>> run a tight ship with mostly in-house tooling already. When it comes to
>>>> projects out in the wild, that all goes out the window.
>>> Not really: from the distro PoV we absolutely care about arbitrary
>>> combinations of programs with different authors. Which is why I'm
>>> arguing against putting anything into the kernel where the first program
>>> to come along can just grab a hook and lock everyone out.
>>> My assumption is basically this: A system administrator installs
>>> packages A and B that both use the TC hook. The developers of A and B
>>> have never heard about each other. It should be possible for that admin
>>> to run A and B in whichever order they like, without making any changes
>>> to A and B themselves.
>>
>> I would come with the point of view of the K8s cluster operator or platform
>> engineer, if you will. Someone deeply familiar with K8s, but not necessarily
>> knowing about kernel internals. I know my org needs to run container A and
>> container B, so I'll deploy the daemon-sets for both and they get deployed
>> into my cluster. That platform engineer might have never heard of BPF or might
>> not even know that container A or container B ships software with BPF. As
>> mentioned, K8s itself has no concept of Pod ordering as its paradigm is that
>> everything is loosely coupled. We are now expecting from that person to make
>> a concrete decision about some BPF kernel internals on various hooks in which
>> order they should be executed given if they don't then the system becomes
>> non-deterministic. I think that is quite a big burden and ask to understand.
>> Eventually that person will say that he/she cannot make this technical decision
>> and that only one of the two containers can be deployed. I agree with you that
>> there should be an option for a technically versed person to be able to change
>> ordering to avoid lock out, but I don't think it will fly asking users to come
>> up on their own with policies of BPF software in the wild ... similar as you
>> probably don't want having to deal with writing systemd unit files for software
>> xyz before you can use your laptop. It's a burden. You expect this to magically
>> work by default and only if needed for good reasons to make custom changes.
>> Just the one difference is that the latter ships with the OS (a priori known /
>> tight-ship analogy).
> 
> As someone deeply familiar with the K8s side of the equation you’re greatly oversimplifying.
> 
> You can’t just “run a daemon-set” for eBPF-enabled software and expect it to work.
> <Insert Boromir stating “One does not simply walk into Mordor”>
> 
> First off, you need to find out which privileges it needs.
> 
> Just CAP_BPF? Pttf nope.
> Depending on the program type likely it will need more, up to and including CAP_SYS_ADMIN.
> Scary stuff.
> 
> Beyond that, you’ll also need some “special” paths from the host mounted into your container
> for vmlinux, tracefs maybe even a bpffs etc…
> 
> Furthermore, all of these things above are usually restricted/discouraged in most K8s distros
> so you need to wade into the depths of how to disable these protections.
> 
> The poor platform engineer in this case will be forced to learn all of these concepts on-the-fly.
> So the assumption of them being oblivious to eBPF being run in their cluster should be dismissed.

Sure, a lot of the above is irrelevant to the discussion, though. Yes, things need to
align and in many cases this is done by the existing projects shipping via Helm charts
or other means to make installation easier. Unless you build something from scratch, ofc.

> Clearly explaining the following in documentation would make coming up with policies much easier:
> 1. Which priority you choose if not instructed otherwise via configuration
> 2. The risks of attaching other programs ahead/behind this one
> 3. The risks of having a conflicting priority with another application
>
> Even from the bpf-enabled software vendor standpoint, the status-quo is annoying because you’ll
> need to provide recipes to deploy your software on every different K8s distro.
> 
> I’ve been working on bpfd [1] + it’s kube integration for the past year to solve these problems
> for users/vendors.
> 
>  From a kernel standpoint, give me an array that does something like this:
> - If no priority is provided picks the first free from upper 16 bits of the priority range
> - If priority is provided, attach at that priority
> - If conflict, use flags to decide what to do where the options are something like:
>    - BPF_F_ERR_ON_CONFLICT
>    - BPF_F_ASSIGN_ON_CONFLICT
> 
> That solves the immediate problem since given a block of u32 priorities I’m sure affected
> vendors can pick one within the lower 16 bits that would produce the desired ordering.

See the discussion above - we're moving away from priorities. I used them in v1, but
the community feedback was that priorities are discouraged.

> As for how this works with a system daemon (and by extension in K8s), I’m of the
> opinion that the only viable option is to move program load and
> attachment to some other API, be it varlink, gRPC, or the K8s API.

Fwiw, I doubt that this will fly, to be honest. You would have to rewrite libbpf and
all loaders out in the wild to standardize on and implement this RPC protocol, and you
also still need to support the non-RPC way of loading your BPF programs for the case
where you need to deal with old/existing environments where this is not available. If
bpfd ships as 3rd-party software, then fingers crossed that eventually all K8s distros
would actually adopt it. If it's something that comes natively with K8s it would probably
have a better chance, but you'd still force every project to support loading via both
ways. It's not as straightforward, and for it to be the only viable path you need a
clean way for projects to transition to this w/o breaking.

> It’s at that layer that policy decisions about priority are made and the kernel semantics
> can remain as above.
> 
>>>> Regardless of merit or feasability of a system-wide bpf management
>>>> daemon for k8s, there _is no ordering configuration possible_. K8s is not
>>>> a distro where package maintainers (or anyone else, really) can coordinate
>>>> on correctly defining priority of each of the tools they ship. This is
>>>> effectively
>>>> the prisoner's dilemma. I feel like most of the discussion so far has been
>>>> very hand-wavy in 'user space should solve it'. Well, we are user space, and
>>>> we're here trying to solve it. :)
>>>>
>>>> A hypothetical policy/gatekeeper/ordering daemon doesn't possess
>>>> implicit knowledge about which program needs to go where in the chain,
>>>> nor is there an obvious heuristic about how to order things. Maintaining
>>>> such a configuration for all cloud-native tooling out there that possibly
>>>> uses bpf is simply impossible, as even a tool like Cilium can change
>>>> dramatically from one release to the next. Having to manage this too
>>>> would put a significant burden on velocity and flexibility for arguably
>>>> little benefit to the user.
>>>> So, daemon/kernel will need to be told how to order things, preferably by
>>>> the tools (Cilium/datadog-agent) themselves, since the user/admin of the
>>>> system cannot be expected to know where to position the hundreds of progs
>>>> loaded by Cilium and how they might interfere with other tools. Figuring
>>>> this out is the job of the tool, daemon or not.
> 
> I’m sorry but again I have to strongly disagree here.
> 
> Tools can provide hints at where they should be placed in a chain of programs, but
> that eventual placement should always be down to the user.
> 
> The examples you’ve cited are large, specialised applications… but consider for a moment how
> this works for smaller programs.
> 
> Let’s say you’ve got 3 programs:
> - Firewall
> - Load-Balancer
> - Packet Logger
> 
> There are 6 ways that I can order these programs, each of which will have a very different effect.
> How can any of these tools individually understand what the user actually wants?
> 
> - Dave
> 
> [1]: https://github.com/bpfd-dev/bpfd
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-06-08 21:24         ` Andrii Nakryiko
@ 2023-07-04 21:36           ` Jamal Hadi Salim
  2023-07-04 22:01             ` Daniel Borkmann
  0 siblings, 1 reply; 49+ messages in thread
From: Jamal Hadi Salim @ 2023-07-04 21:36 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Daniel Borkmann, ast, andrii, martin.lau, razor, sdf,
	john.fastabend, kuba, dxu, joe, toke, davem, bpf, netdev

Sorry for the late reply, but I'm trying this out now - and I have a question:

On Thu, Jun 8, 2023 at 5:25 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Jun 8, 2023 at 12:46 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > Hi Daniel,
> >
> > On Thu, Jun 8, 2023 at 6:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >
> > > Hi Jamal,
> > >
> > > On 6/8/23 3:25 AM, Jamal Hadi Salim wrote:
> > > [...]
> > > > A general question (which i think i asked last time as well): who
> > > > decides what comes after/before what prog in this setup? And would
> > > > that same entity not have been able to make the same decision using tc
> > > > priorities?
> > >
> > > Back in the first version of the series I initially coded up this option
> > > that the tc_run() would basically be a fake 'bpf_prog' and it would have,
> > > say, fixed prio 1000. It would get executed via tcx_run() when iterating
> > > via bpf_mprog_foreach_prog() where bpf_prog_run() is called, and then users
> > > could pick for native BPF prio before or after that. But then the feedback
> > > was that sticking to prio is a bad user experience which led to the
> > > development of what is in patch 1 of this series (see the details there).
> > >
> >
> > Thanks. I read the commit message in patch 1 and followed the thread
> > back including some of the discussion we had and i am still
> > disagreeing that this couldnt be solved with a smart priority based
> > scheme - but i think we can move on since this is standalone and
> > doesnt affect tc.
> >
> > Daniel - i am still curious in the new scheme of things how would
> > cilium vs datadog food fight get resolved without some arbitration
> > entity?
> >
> > > > The idea of protecting programs from being unloaded is very welcome
> > > > but feels would have made sense to be a separate patchset (we have
> > > > good need for it). Would it be possible to use that feature in tc and
> > > > xdp?
> > > BPF links are supported for XDP today, just tc BPF is one of the few
> > > remainders where it is not the case, hence the work of this series. What
> > > XDP lacks today however is multi-prog support. With the bpf_mprog concept
> > > that could be addressed with that common/uniform api (and Andrii expressed
> > > interest in integrating this also for cgroup progs), so yes, various hook
> > > points/program types could benefit from it.
> >
> > Is there some sample XDP related i could look at?  Let me describe our
> > use case: lets say we load an ebpf program foo attached to XDP of a
> > netdev  and then something further upstream in the stack is consuming
> > the results of that ebpf XDP program. For some reason someone, at some
> > point, decides to replace the XDP prog with a different one - and the
> > new prog does a very different thing. Could we stop the replacement
> > with the link mechanism you describe? i.e the program is still loaded
> > but is no longer attached to the netdev.
>
> If you initially attached an XDP program using BPF link api
> (LINK_CREATE command in bpf() syscall), then subsequent attachment to
> the same interface (of a new link or program with BPF_PROG_ATTACH)
> will fail until the current BPF link is detached through closing its
> last fd.
>
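
For reference, a minimal sketch of such a link-based XDP attachment with existing
libbpf API (prog and ifindex are assumed to come from the surrounding setup code;
bpf_program__attach_xdp() issues LINK_CREATE under the hood):

  struct bpf_link *link;

  link = bpf_program__attach_xdp(prog, ifindex);
  if (!link)
          return -errno;

  /* As long as this link is alive - fd held by a process or pinned in
   * bpffs - a later attach to the same ifindex (netlink or another
   * link) fails instead of silently replacing the program. */
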

So this works as advertised. The problem is however not totally solved
because it seems we need a process that's alive to hold the ownership.
If we had a daemon then that would solve it, I think (we don't).
Alternatively, you pin the link. The pinning part can be
circumvented, unless I misunderstood, i.e. anybody with the right
permissions can remove it.

Am I missing something?

cheers,
jamal

> That is, until we allow multiple attachments of XDP programs to the
> same network interface. But even then, no one will be able to
> accidentally replace attached link, unless they have that link FD and
> replace underlying BPF program.
>
> >
> >
> > > >> +struct tcx_entry {
> > > >> +       struct bpf_mprog_bundle         bundle;
> > > >> +       struct mini_Qdisc __rcu         *miniq;
> > > >> +};
> > > >> +
> > > >
> > > > Can you please move miniq to the front? From where i sit this looks:
> > > > struct tcx_entry {
> > > >          struct bpf_mprog_bundle    bundle
> > > > __attribute__((__aligned__(64))); /*     0  3264 */
> > > >
> > > >          /* XXX last struct has 36 bytes of padding */
> > > >
> > > >          /* --- cacheline 51 boundary (3264 bytes) --- */
> > > >          struct mini_Qdisc *        miniq;                /*  3264     8 */
> > > >
> > > >          /* size: 3328, cachelines: 52, members: 2 */
> > > >          /* padding: 56 */
> > > >          /* paddings: 1, sum paddings: 36 */
> > > >          /* forced alignments: 1 */
> > > > } __attribute__((__aligned__(64)));
> > > >
> > > > That is a _lot_ of cachelines - at the expense of the status quo
> > > > clsact/ingress qdiscs which access miniq.
> > >
> > > Ah yes, I'll fix this up.
> >
> > Thanks.
> >
> > cheers,
> > jamal
> > > Thanks,
> > > Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-07-04 21:36           ` Jamal Hadi Salim
@ 2023-07-04 22:01             ` Daniel Borkmann
  2023-07-04 22:38               ` Jamal Hadi Salim
  0 siblings, 1 reply; 49+ messages in thread
From: Daniel Borkmann @ 2023-07-04 22:01 UTC (permalink / raw)
  To: Jamal Hadi Salim, Andrii Nakryiko
  Cc: ast, andrii, martin.lau, razor, sdf, john.fastabend, kuba, dxu,
	joe, toke, davem, bpf, netdev

On 7/4/23 11:36 PM, Jamal Hadi Salim wrote:
> On Thu, Jun 8, 2023 at 5:25 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Thu, Jun 8, 2023 at 12:46 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>> On Thu, Jun 8, 2023 at 6:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 6/8/23 3:25 AM, Jamal Hadi Salim wrote:
[...]
>>>> BPF links are supported for XDP today, just tc BPF is one of the few
>>>> remainders where it is not the case, hence the work of this series. What
>>>> XDP lacks today however is multi-prog support. With the bpf_mprog concept
>>>> that could be addressed with that common/uniform api (and Andrii expressed
>>>> interest in integrating this also for cgroup progs), so yes, various hook
>>>> points/program types could benefit from it.
>>>
>>> Is there some sample XDP related i could look at?  Let me describe our
>>> use case: lets say we load an ebpf program foo attached to XDP of a
>>> netdev  and then something further upstream in the stack is consuming
>>> the results of that ebpf XDP program. For some reason someone, at some
>>> point, decides to replace the XDP prog with a different one - and the
>>> new prog does a very different thing. Could we stop the replacement
>>> with the link mechanism you describe? i.e the program is still loaded
>>> but is no longer attached to the netdev.
>>
>> If you initially attached an XDP program using BPF link api
>> (LINK_CREATE command in bpf() syscall), then subsequent attachment to
>> the same interface (of a new link or program with BPF_PROG_ATTACH)
>> will fail until the current BPF link is detached through closing its
>> last fd.
> 
> So this works as advertised. The problem is however not totally solved
> because it seems we need a process that's alive to hold the ownership.
> If we had a daemon then that would solve it i think (we dont).
> Alternatively,  you pin the link. The pinning part can be
> circumvented, unless i misunderstood i,e anybody with the right
> permissions can remove it.
> 
> Am I missing something?

It would be either of those depending on the use case, and for pinning
removal, it would require right permissions/acls. Keep in mind that for
your application you can also use your own bpffs mount, so you don't
need to use the default /sys/fs/bpf one in hostns.
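
A small sketch of that - the mount is just a plain bpffs private to the application
and the paths are examples; bpf_link__pin() is existing libbpf API:

  #include <sys/mount.h>

  /* Application-private bpffs instead of the default /sys/fs/bpf in hostns. */
  if (mount("bpf", "/run/myapp/bpf", "bpf", 0, NULL))
          return -errno;

  /* Pin the link there so it survives the attaching process exiting,
   * while only mount namespaces that see this path can touch it. */
  err = bpf_link__pin(link, "/run/myapp/bpf/ingress_link");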

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-07-04 22:01             ` Daniel Borkmann
@ 2023-07-04 22:38               ` Jamal Hadi Salim
  2023-07-05  7:34                 ` Daniel Borkmann
  0 siblings, 1 reply; 49+ messages in thread
From: Jamal Hadi Salim @ 2023-07-04 22:38 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andrii Nakryiko, ast, andrii, martin.lau, razor, sdf,
	john.fastabend, kuba, dxu, joe, toke, davem, bpf, netdev

On Tue, Jul 4, 2023 at 6:01 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 7/4/23 11:36 PM, Jamal Hadi Salim wrote:
> > On Thu, Jun 8, 2023 at 5:25 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> >> On Thu, Jun 8, 2023 at 12:46 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >>> On Thu, Jun 8, 2023 at 6:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>> On 6/8/23 3:25 AM, Jamal Hadi Salim wrote:
> [...]
> >>>> BPF links are supported for XDP today, just tc BPF is one of the few
> >>>> remainders where it is not the case, hence the work of this series. What
> >>>> XDP lacks today however is multi-prog support. With the bpf_mprog concept
> >>>> that could be addressed with that common/uniform api (and Andrii expressed
> >>>> interest in integrating this also for cgroup progs), so yes, various hook
> >>>> points/program types could benefit from it.
> >>>
> >>> Is there some sample XDP related i could look at?  Let me describe our
> >>> use case: lets say we load an ebpf program foo attached to XDP of a
> >>> netdev  and then something further upstream in the stack is consuming
> >>> the results of that ebpf XDP program. For some reason someone, at some
> >>> point, decides to replace the XDP prog with a different one - and the
> >>> new prog does a very different thing. Could we stop the replacement
> >>> with the link mechanism you describe? i.e the program is still loaded
> >>> but is no longer attached to the netdev.
> >>
> >> If you initially attached an XDP program using BPF link api
> >> (LINK_CREATE command in bpf() syscall), then subsequent attachment to
> >> the same interface (of a new link or program with BPF_PROG_ATTACH)
> >> will fail until the current BPF link is detached through closing its
> >> last fd.
> >
> > So this works as advertised. The problem is however not totally solved
> > because it seems we need a process that's alive to hold the ownership.
> > If we had a daemon then that would solve it i think (we dont).
> > Alternatively,  you pin the link. The pinning part can be
> > circumvented, unless i misunderstood i,e anybody with the right
> > permissions can remove it.
> >
> > Am I missing something?
>
> It would be either of those depending on the use case, and for pinning
> removal, it would require right permissions/acls. Keep in mind that for
> your application you can also use your own bpffs mount, so you don't
> need to use the default /sys/fs/bpf one in hostns.

This helps for sure - it doesn't 100% solve it. It would really be nice if
we could tie in a Kerberos-like ticketing system for ownership of the
mount or something even more fine-grained like a link. It doesn't have to
be Kerberos, but anything that would allow a digest of some verifiable
credentials/token to be handed to the kernel for authorization...

cheers,
jamal

> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-07-04 22:38               ` Jamal Hadi Salim
@ 2023-07-05  7:34                 ` Daniel Borkmann
  2023-07-06 13:31                   ` Jamal Hadi Salim
  0 siblings, 1 reply; 49+ messages in thread
From: Daniel Borkmann @ 2023-07-05  7:34 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Andrii Nakryiko, ast, andrii, martin.lau, razor, sdf,
	john.fastabend, kuba, dxu, joe, toke, davem, bpf, netdev, lmb

On 7/5/23 12:38 AM, Jamal Hadi Salim wrote:
> On Tue, Jul 4, 2023 at 6:01 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 7/4/23 11:36 PM, Jamal Hadi Salim wrote:
>>> On Thu, Jun 8, 2023 at 5:25 PM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>> On Thu, Jun 8, 2023 at 12:46 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>>>> On Thu, Jun 8, 2023 at 6:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>> On 6/8/23 3:25 AM, Jamal Hadi Salim wrote:
>> [...]
>>>>>> BPF links are supported for XDP today, just tc BPF is one of the few
>>>>>> remainders where it is not the case, hence the work of this series. What
>>>>>> XDP lacks today however is multi-prog support. With the bpf_mprog concept
>>>>>> that could be addressed with that common/uniform api (and Andrii expressed
>>>>>> interest in integrating this also for cgroup progs), so yes, various hook
>>>>>> points/program types could benefit from it.
>>>>>
>>>>> Is there some sample XDP related i could look at?  Let me describe our
>>>>> use case: lets say we load an ebpf program foo attached to XDP of a
>>>>> netdev  and then something further upstream in the stack is consuming
>>>>> the results of that ebpf XDP program. For some reason someone, at some
>>>>> point, decides to replace the XDP prog with a different one - and the
>>>>> new prog does a very different thing. Could we stop the replacement
>>>>> with the link mechanism you describe? i.e the program is still loaded
>>>>> but is no longer attached to the netdev.
>>>>
>>>> If you initially attached an XDP program using BPF link api
>>>> (LINK_CREATE command in bpf() syscall), then subsequent attachment to
>>>> the same interface (of a new link or program with BPF_PROG_ATTACH)
>>>> will fail until the current BPF link is detached through closing its
>>>> last fd.
>>>
>>> So this works as advertised. The problem is however not totally solved
>>> because it seems we need a process that's alive to hold the ownership.
>>> If we had a daemon then that would solve it i think (we dont).
>>> Alternatively,  you pin the link. The pinning part can be
>>> circumvented, unless i misunderstood i,e anybody with the right
>>> permissions can remove it.
>>>
>>> Am I missing something?
>>
>> It would be either of those depending on the use case, and for pinning
>> removal, it would require right permissions/acls. Keep in mind that for
>> your application you can also use your own bpffs mount, so you don't
>> need to use the default /sys/fs/bpf one in hostns.
> 
> This helps for sure - doesnt 100% solve it. It would really be nice if
> we could tie in a kerberos-like ticketing system for ownership of the
> mount or something even more fine grained like a link. Doesnt have to
> be kerberos but anything that would allow a digest of some verifiable
> credentials/token to be handed to the kernel for authorization...

What is your use case - you don't want anyone except your own orchestration
application to access it, so any kind of ACLs, LSM policies or making the
mount only available to your container is not enough in the scenario you
have in mind?

I think the closest to that is probably the prototype which Lorenz recently
built where the user space application's digest is validated via IMA [0].

   [0] http://vger.kernel.org/bpfconf2023_material/Lorenz_Bauer_-_BPF_signing_using_fsverity_and_LSM_gatekeeper.pdf
       https://github.com/isovalent/bpf-verity

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support
  2023-07-05  7:34                 ` Daniel Borkmann
@ 2023-07-06 13:31                   ` Jamal Hadi Salim
  0 siblings, 0 replies; 49+ messages in thread
From: Jamal Hadi Salim @ 2023-07-06 13:31 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andrii Nakryiko, ast, andrii, martin.lau, razor, sdf,
	john.fastabend, kuba, dxu, joe, toke, davem, bpf, netdev, lmb

On Wed, Jul 5, 2023 at 3:34 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 7/5/23 12:38 AM, Jamal Hadi Salim wrote:
> > On Tue, Jul 4, 2023 at 6:01 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 7/4/23 11:36 PM, Jamal Hadi Salim wrote:
> >>> On Thu, Jun 8, 2023 at 5:25 PM Andrii Nakryiko
> >>> <andrii.nakryiko@gmail.com> wrote:
> >>>> On Thu, Jun 8, 2023 at 12:46 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >>>>> On Thu, Jun 8, 2023 at 6:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>>>> On 6/8/23 3:25 AM, Jamal Hadi Salim wrote:
> >> [...]
> >>>>>> BPF links are supported for XDP today, just tc BPF is one of the few
> >>>>>> remainders where it is not the case, hence the work of this series. What
> >>>>>> XDP lacks today however is multi-prog support. With the bpf_mprog concept
> >>>>>> that could be addressed with that common/uniform api (and Andrii expressed
> >>>>>> interest in integrating this also for cgroup progs), so yes, various hook
> >>>>>> points/program types could benefit from it.
> >>>>>
> >>>>> Is there some sample XDP related i could look at?  Let me describe our
> >>>>> use case: lets say we load an ebpf program foo attached to XDP of a
> >>>>> netdev  and then something further upstream in the stack is consuming
> >>>>> the results of that ebpf XDP program. For some reason someone, at some
> >>>>> point, decides to replace the XDP prog with a different one - and the
> >>>>> new prog does a very different thing. Could we stop the replacement
> >>>>> with the link mechanism you describe? i.e the program is still loaded
> >>>>> but is no longer attached to the netdev.
> >>>>
> >>>> If you initially attached an XDP program using BPF link api
> >>>> (LINK_CREATE command in bpf() syscall), then subsequent attachment to
> >>>> the same interface (of a new link or program with BPF_PROG_ATTACH)
> >>>> will fail until the current BPF link is detached through closing its
> >>>> last fd.
> >>>
> >>> So this works as advertised. The problem is however not totally solved,
> >>> because it seems we need a process that's alive to hold the ownership.
> >>> If we had a daemon then that would solve it, I think (we don't).
> >>> Alternatively, you pin the link. The pinning part can be
> >>> circumvented, unless I misunderstood, i.e. anybody with the right
> >>> permissions can remove it.
> >>>
> >>> Am I missing something?
> >>
> >> It would be either of those depending on the use case, and for pinning
> >> removal it would require the right permissions/ACLs. Keep in mind that for
> >> your application you can also use your own bpffs mount, so you don't
> >> need to use the default /sys/fs/bpf one in hostns.
> >
> > This helps for sure - it doesn't 100% solve it. It would really be nice if
> > we could tie in a Kerberos-like ticketing system for ownership of the
> > mount, or something even more fine-grained like a link. Doesn't have to
> > be Kerberos, but anything that would allow a digest of some verifiable
> > credentials/token to be handed to the kernel for authorization...
>
> What is your use case: you don't want anyone except your own orchestration
> application to access it, so any kind of ACLs, LSM policies, or making the
> mount available only to your container would not be enough in the scenario
> you have in mind?

It should work - it's not even a shared environment (unlike the
situation you have to deal with). I think I got overly paranoid
because we have gone through a couple of debug cases where an
installed parser (using ip) in XDP (with a tc prog consuming the
results) was accidentally replaced, and the tc side had expectations
built on the removed prog. I.e. the end goal is two or more programs,
in this case one running in XDP and another at tc, that are
interdependent; if you touch one you affect the other.
In a shared environment it could be problematic because all you need
is root access to remove things.
If you have second-factor authentication etc., then someone has to be
both root and have additional secret knowledge to displace things.
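To make that concrete for anyone skimming the archive, below is a minimal
libbpf sketch of the link-based ownership Andrii described (this is not code
from this series; the object file, program name, device and pin path are all
made up for illustration):

#include <net/if.h>
#include <bpf/libbpf.h>

int main(void)
{
        struct bpf_object *obj;
        struct bpf_program *prog;
        struct bpf_link *link;
        int ifindex = if_nametoindex("eth0");   /* example device */

        /* Hypothetical object/program names, just for illustration. */
        obj = bpf_object__open_file("xdp_parser.bpf.o", NULL);
        if (!obj || bpf_object__load(obj))
                return 1;
        prog = bpf_object__find_program_by_name(obj, "xdp_parser");
        if (!prog)
                return 1;

        /* Attach through a BPF link: the attachment is owned by the link. */
        link = bpf_program__attach_xdp(prog, ifindex);
        if (!link)
                return 1;

        /* Pin the link so the ownership survives the process exiting. */
        if (bpf_link__pin(link, "/sys/fs/bpf/xdp_parser_link"))
                return 1;
        return 0;
}

While that link (or its pin) is alive, a plain non-link attach to the same
device (e.g. ip link set dev eth0 xdp obj other.o) should get rejected with
-EBUSY, which is exactly the property we want; and with this series the tc
side could presumably be kept in place the same way via a tcx link.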

But: I do have some ulterior motive where the authentication could be
used at the policy level as well. For example, someone can only read a
tc rule, whereas someone else can read, update or even delete the rule.

> I think the closest to that is probably the prototype which Lorenz recently
> built where the user space application's digest is validated via IMA [0].

This may be sufficient for the atomicity requirement if we can lock
things into our own bpffs. I will take a look - thanks.
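Just to spell out what I mean by locking things into our own bpffs, a rough
sketch would be the following (the directory, mode, helper name and pin path
are made up, and it assumes the service runs in its own mount namespace):

#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <bpf/libbpf.h>

/* Pin a link into a private bpffs instance instead of the shared
 * /sys/fs/bpf of the host namespace. Needs CAP_SYS_ADMIN for mount().
 */
static int pin_in_private_bpffs(struct bpf_link *link)
{
        const char *dir = "/run/orchestrator/bpf";      /* example path */

        if (mkdir(dir, 0700) && errno != EEXIST)
                return -errno;
        /* A dedicated bpffs instance; done inside the service's own mount
         * namespace it is not even visible to the rest of the system.
         */
        if (mount("bpffs", dir, "bpf", 0, NULL) && errno != EBUSY)
                return -errno;
        if (chmod(dir, 0700))
                return -errno;
        return bpf_link__pin(link, "/run/orchestrator/bpf/tc_link");
}

That by itself is of course still only discretionary permissions, so the
IMA/LSM gatekeeper from [0] below would be the piece that ties detach/removal
to a verified binary rather than to uid 0 alone.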

cheers,
jamal
>    [0] http://vger.kernel.org/bpfconf2023_material/Lorenz_Bauer_-_BPF_signing_using_fsverity_and_LSM_gatekeeper.pdf
>        https://github.com/isovalent/bpf-verity

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2023-07-06 13:32 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-07 19:26 [PATCH bpf-next v2 0/7] BPF link support for tc BPF programs Daniel Borkmann
2023-06-07 19:26 ` [PATCH bpf-next v2 1/7] bpf: Add generic attach/detach/query API for multi-progs Daniel Borkmann
2023-06-08 17:23   ` Stanislav Fomichev
2023-06-08 20:59     ` Andrii Nakryiko
2023-06-08 21:52       ` Stanislav Fomichev
2023-06-08 22:13         ` Andrii Nakryiko
2023-06-08 23:06           ` Stanislav Fomichev
2023-06-08 23:54             ` Alexei Starovoitov
2023-06-09  0:08               ` Andrii Nakryiko
2023-06-09  0:38                 ` Stanislav Fomichev
2023-06-09  0:29             ` Toke Høiland-Jørgensen
2023-06-09  6:52               ` Daniel Borkmann
2023-06-09  7:15                 ` Daniel Borkmann
2023-06-09 11:04                 ` Toke Høiland-Jørgensen
2023-06-09 12:34                   ` Timo Beckers
2023-06-09 13:11                     ` Toke Høiland-Jørgensen
2023-06-09 14:15                       ` Daniel Borkmann
2023-06-09 16:41                         ` Stanislav Fomichev
2023-06-09 19:03                           ` Andrii Nakryiko
2023-06-10  2:52                             ` Daniel Xu
2023-06-09 18:58                         ` Andrii Nakryiko
2023-06-09 20:28                         ` Toke Høiland-Jørgensen
2023-06-12 11:21                         ` Dave Tucker
2023-06-12 12:43                           ` Daniel Borkmann
2023-06-09 18:56                       ` Andrii Nakryiko
2023-06-09 20:08                         ` Alexei Starovoitov
     [not found]                           ` <20230610022721.2950602-1-prankgup@fb.com>
2023-06-10  3:37                             ` Alexei Starovoitov
2023-06-09 20:20                         ` Toke Høiland-Jørgensen
2023-06-08 20:53   ` Andrii Nakryiko
2023-06-07 19:26 ` [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra with link support Daniel Borkmann
2023-06-08  1:25   ` Jamal Hadi Salim
2023-06-08 10:11     ` Daniel Borkmann
2023-06-08 19:46       ` Jamal Hadi Salim
2023-06-08 21:24         ` Andrii Nakryiko
2023-07-04 21:36           ` Jamal Hadi Salim
2023-07-04 22:01             ` Daniel Borkmann
2023-07-04 22:38               ` Jamal Hadi Salim
2023-07-05  7:34                 ` Daniel Borkmann
2023-07-06 13:31                   ` Jamal Hadi Salim
2023-06-08 17:50   ` Stanislav Fomichev
2023-06-08 21:20   ` Andrii Nakryiko
2023-06-09  3:06   ` Jakub Kicinski
2023-06-07 19:26 ` [PATCH bpf-next v2 3/7] libbpf: Add opts-based attach/detach/query API for tcx Daniel Borkmann
2023-06-08 21:37   ` Andrii Nakryiko
2023-06-07 19:26 ` [PATCH bpf-next v2 4/7] libbpf: Add link-based " Daniel Borkmann
2023-06-08 21:45   ` Andrii Nakryiko
2023-06-07 19:26 ` [PATCH bpf-next v2 5/7] bpftool: Extend net dump with tcx progs Daniel Borkmann
2023-06-07 19:26 ` [PATCH bpf-next v2 6/7] selftests/bpf: Add mprog API tests for BPF tcx opts Daniel Borkmann
2023-06-07 19:26 ` [PATCH bpf-next v2 7/7] selftests/bpf: Add mprog API tests for BPF tcx links Daniel Borkmann
