netdev.vger.kernel.org archive mirror
* [PATCH bpf-next 00/10] BPF link support for tc BPF programs
@ 2022-10-04 23:11 Daniel Borkmann
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
                   ` (9 more replies)
  0 siblings, 10 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

This series adds BPF link support for tc BPF programs. We initially
presented the motivation, related work and design at this year's LPC
conference in the networking & BPF track [0], and have incorporated
the feedback we received. The main changes are in the first two
patches, and the last one contains an extensive batch of test cases we
developed along with them; please see the individual patches for
details. We tested this series with the tc-testing selftest suite as
well as with the existing and newly developed tc BPF tests from the
BPF selftests, all of which pass. Thanks!

  [0] https://lpc.events/event/16/contributions/1353/

Daniel Borkmann (10):
  bpf: Add initial fd-based API to attach tc BPF programs
  bpf: Implement BPF link handling for tc BPF programs
  bpf: Implement link update for tc BPF link programs
  bpf: Implement link introspection for tc BPF link programs
  bpf: Implement link detach for tc BPF link programs
  libbpf: Change signature of bpf_prog_query
  libbpf: Add extended attach/detach opts
  libbpf: Add support for BPF tc link
  bpftool: Add support for tc fd-based attach types
  bpf, selftests: Add various BPF tc link selftests

 MAINTAINERS                                   |   4 +-
 include/linux/bpf.h                           |   4 +
 include/linux/netdevice.h                     |  14 +-
 include/linux/skbuff.h                        |   4 +-
 include/net/sch_generic.h                     |   2 +-
 include/net/xtc.h                             | 195 +++++
 include/uapi/linux/bpf.h                      |  45 +-
 kernel/bpf/Kconfig                            |   1 +
 kernel/bpf/Makefile                           |   1 +
 kernel/bpf/net.c                              | 451 +++++++++++
 kernel/bpf/syscall.c                          |  27 +-
 net/Kconfig                                   |   5 +
 net/core/dev.c                                | 262 +++---
 net/core/filter.c                             |   4 +-
 net/sched/Kconfig                             |   4 +-
 net/sched/sch_ingress.c                       |  48 +-
 tools/bpf/bpftool/net.c                       |  76 +-
 tools/include/uapi/linux/bpf.h                |  45 +-
 tools/lib/bpf/bpf.c                           |  27 +-
 tools/lib/bpf/bpf.h                           |  22 +-
 tools/lib/bpf/libbpf.c                        |  31 +-
 tools/lib/bpf/libbpf.h                        |   2 +
 tools/lib/bpf/libbpf.map                      |   2 +
 .../selftests/bpf/prog_tests/tc_link.c        | 756 ++++++++++++++++++
 .../selftests/bpf/progs/test_tc_link.c        |  43 +
 25 files changed, 1932 insertions(+), 143 deletions(-)
 create mode 100644 include/net/xtc.h
 create mode 100644 kernel/bpf/net.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_link.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_link.c

-- 
2.34.1



* [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-05  0:55   ` sdf
                     ` (6 more replies)
  2022-10-04 23:11 ` [PATCH bpf-next 02/10] bpf: Implement BPF link handling for " Daniel Borkmann
                   ` (8 subsequent siblings)
  9 siblings, 7 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

This work refactors and adds a lightweight extension to the tc BPF
ingress and egress data path to allow attaching BPF programs via an
fd-based attach / detach API. The main goal behind this work, which we
also presented at LPC [0] this year, is to eventually add BPF link
support for tc BPF programs in a second step; this patch is the prep
work required for the latter, which enables a model of safe ownership
and program detachment. Given the vast rise of tc BPF users in cloud
native / Kubernetes environments, this becomes necessary to avoid
hard-to-debug incidents caused either by stale leftover programs or by
3rd party applications stepping on each other's toes. Further details
on the BPF link rationale follow in the next patch.

For the existing tc framework, there is no change in behavior, and this
patch does not touch tc core kernel APIs. The gist of this patch is
that the ingress and egress hooks get a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs to, in other words, a
minimal tc-layer entry point for BPF. As part of the feedback from LPC,
there was a suggestion to give this infrastructure a name so that it is
easier to differentiate between the classic cls_bpf attachment and the
fd-based API. For most users, the XDP vs tc layer split is already the
default mental model of the packet processing pipeline. We therefore
refactored this with an internal 'xtc' prefix, short for 'express
traffic control', in order not to deviate too far from that model
('express' given its more lightweight/faster entry point).
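
As a rough illustration of the resulting user space API (not part of
this patch; prog_fd and ifindex are placeholders and error handling is
omitted), attaching an already loaded SCHED_CLS program to a device's
ingress hook boils down to a single BPF_PROG_ATTACH syscall:

  #include <unistd.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* Sketch: attach prog_fd to the ingress xtc hook of ifindex. On
   * success the kernel returns the (possibly auto-allocated) priority,
   * on failure a negative error.
   */
  static int tc_bpf_attach(int prog_fd, int ifindex, unsigned int prio)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.target_ifindex  = ifindex;
          attr.attach_bpf_fd   = prog_fd;
          attr.attach_type     = BPF_NET_INGRESS;
          attr.attach_priority = prio; /* 0 requests auto-allocation */

          return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
  }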

For the ingress and egress xtc points, the device holds a
cache-friendly array of programs. As with classic tc, programs are
attached with a prio that can either be specified or auto-allocated
through an idr, and the program return code determines whether to
continue in the pipeline or to terminate processing. With the
TC_ACT_UNSPEC code, processing continues (as is the case today). The
goal was maximum compatibility with existing tc BPF programs, so they
don't need to be adapted. Compatibility with calling into classic
tcf_classify() is also provided in order to allow successive migration,
or for both to cleanly co-exist where needed, given it is one logical
layer. The fd-based API is behind a static key, so that when unused the
code is not entered. The struct xtc_entry's program array is currently
static, but could be made dynamic if necessary at some point in the
future. Desire has also been expressed for future work to adapt a
similar framework for XDP to allow multi-attach from the in-kernel
side, too.
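
For illustration, a minimal tc BPF program written against the
simplified return codes could look like the sketch below (placeholder
logic only, not part of this patch; the section name assumes a
libbpf-based loader):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int tc_prog(struct __sk_buff *skb)
  {
          /* A real program would inspect the packet here. Returning
           * TC_NEXT instead would continue with the next program in
           * the xtc array (and eventually classic tcf_classify() if
           * a qdisc-based classifier is also attached).
           */
          return TC_PASS;
  }

  char LICENSE[] SEC("license") = "GPL";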

Tested with the tc-testing selftest suite, which passes in full, as
well as with the tc BPF tests from the BPF CI.

  [0] https://lpc.events/event/16/contributions/1353/

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 MAINTAINERS                    |   4 +-
 include/linux/bpf.h            |   1 +
 include/linux/netdevice.h      |  14 +-
 include/linux/skbuff.h         |   4 +-
 include/net/sch_generic.h      |   2 +-
 include/net/xtc.h              | 181 ++++++++++++++++++++++
 include/uapi/linux/bpf.h       |  35 ++++-
 kernel/bpf/Kconfig             |   1 +
 kernel/bpf/Makefile            |   1 +
 kernel/bpf/net.c               | 274 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  24 ++-
 net/Kconfig                    |   5 +
 net/core/dev.c                 | 262 +++++++++++++++++++------------
 net/core/filter.c              |   4 +-
 net/sched/Kconfig              |   4 +-
 net/sched/sch_ingress.c        |  48 +++++-
 tools/include/uapi/linux/bpf.h |  35 ++++-
 17 files changed, 769 insertions(+), 130 deletions(-)
 create mode 100644 include/net/xtc.h
 create mode 100644 kernel/bpf/net.c

diff --git a/MAINTAINERS b/MAINTAINERS
index e55a4d47324c..bb63d8d000ea 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3850,13 +3850,15 @@ S:	Maintained
 F:	kernel/trace/bpf_trace.c
 F:	kernel/bpf/stackmap.c
 
-BPF [NETWORKING] (tc BPF, sock_addr)
+BPF [NETWORKING] (xtc & tc BPF, sock_addr)
 M:	Martin KaFai Lau <martin.lau@linux.dev>
 M:	Daniel Borkmann <daniel@iogearbox.net>
 R:	John Fastabend <john.fastabend@gmail.com>
 L:	bpf@vger.kernel.org
 L:	netdev@vger.kernel.org
 S:	Maintained
+F:	include/net/xtc.h
+F:	kernel/bpf/net.c
 F:	net/core/filter.c
 F:	net/sched/act_bpf.c
 F:	net/sched/cls_bpf.c
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9e7d46d16032..71e5f43db378 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1473,6 +1473,7 @@ struct bpf_prog_array_item {
 	union {
 		struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
 		u64 bpf_cookie;
+		u32 bpf_priority;
 	};
 };
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eddf8ee270e7..43bbb2303e57 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1880,8 +1880,7 @@ enum netdev_ml_priv_type {
  *
  *	@rx_handler:		handler for received packets
  *	@rx_handler_data: 	XXX: need comments on this one
- *	@miniq_ingress:		ingress/clsact qdisc specific data for
- *				ingress processing
+ *	@xtc_ingress:		BPF/clsact qdisc specific data for ingress processing
  *	@ingress_queue:		XXX: need comments on this one
  *	@nf_hooks_ingress:	netfilter hooks executed for ingress packets
  *	@broadcast:		hw bcast address
@@ -1902,8 +1901,7 @@ enum netdev_ml_priv_type {
  *	@xps_maps:		all CPUs/RXQs maps for XPS device
  *
  *	@xps_maps:	XXX: need comments on this one
- *	@miniq_egress:		clsact qdisc specific data for
- *				egress processing
+ *	@xtc_egress:		BPF/clsact qdisc specific data for egress processing
  *	@nf_hooks_egress:	netfilter hooks executed for egress packets
  *	@qdisc_hash:		qdisc hash table
  *	@watchdog_timeo:	Represents the timeout that is used by
@@ -2191,8 +2189,8 @@ struct net_device {
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc __rcu	*miniq_ingress;
+#ifdef CONFIG_NET_XGRESS
+	struct xtc_entry __rcu	*xtc_ingress;
 #endif
 	struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -2220,8 +2218,8 @@ struct net_device {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 #endif
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc __rcu	*miniq_egress;
+#ifdef CONFIG_NET_XGRESS
+	struct xtc_entry __rcu *xtc_egress;
 #endif
 #ifdef CONFIG_NETFILTER_EGRESS
 	struct nf_hook_entries __rcu *nf_hooks_egress;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9fcf534f2d92..a9ff7a1996e9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -955,7 +955,7 @@ struct sk_buff {
 	__u8			csum_level:2;
 	__u8			dst_pending_confirm:1;
 	__u8			mono_delivery_time:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	__u8			tc_skip_classify:1;
 	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
 #endif
@@ -983,7 +983,7 @@ struct sk_buff {
 	__u8			slow_gro:1;
 	__u8			csum_not_inet:1;
 
-#ifdef CONFIG_NET_SCHED
+#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
 	__u16			tc_index;	/* traffic control index */
 #endif
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d5517719af4e..bc5c1da2d30f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -693,7 +693,7 @@ int skb_do_redirect(struct sk_buff *);
 
 static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
 {
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	return skb->tc_at_ingress;
 #else
 	return false;
diff --git a/include/net/xtc.h b/include/net/xtc.h
new file mode 100644
index 000000000000..627dc18aa433
--- /dev/null
+++ b/include/net/xtc.h
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2022 Isovalent */
+#ifndef __NET_XTC_H
+#define __NET_XTC_H
+
+#include <linux/idr.h>
+#include <linux/bpf.h>
+
+#include <net/sch_generic.h>
+
+#define XTC_MAX_ENTRIES 30
+/* Adds 1 NULL entry. */
+#define XTC_MAX	(XTC_MAX_ENTRIES + 1)
+
+struct xtc_entry {
+	struct bpf_prog_array_item items[XTC_MAX] ____cacheline_aligned;
+	struct xtc_entry_pair *parent;
+};
+
+struct mini_Qdisc;
+
+struct xtc_entry_pair {
+	struct rcu_head		rcu;
+	struct idr		idr;
+	struct mini_Qdisc	*miniq;
+	struct xtc_entry	a;
+	struct xtc_entry	b;
+};
+
+static inline void xtc_set_ingress(struct sk_buff *skb, bool ingress)
+{
+#ifdef CONFIG_NET_XGRESS
+	skb->tc_at_ingress = ingress;
+#endif
+}
+
+#ifdef CONFIG_NET_XGRESS
+void xtc_inc(void);
+void xtc_dec(void);
+
+static inline void
+dev_xtc_entry_update(struct net_device *dev, struct xtc_entry *entry,
+		     bool ingress)
+{
+	ASSERT_RTNL();
+	if (ingress)
+		rcu_assign_pointer(dev->xtc_ingress, entry);
+	else
+		rcu_assign_pointer(dev->xtc_egress, entry);
+	synchronize_rcu();
+}
+
+static inline struct xtc_entry *dev_xtc_entry_peer(const struct xtc_entry *entry)
+{
+	if (entry == &entry->parent->a)
+		return &entry->parent->b;
+	else
+		return &entry->parent->a;
+}
+
+static inline struct xtc_entry *dev_xtc_entry_create(void)
+{
+	struct xtc_entry_pair *pair = kzalloc(sizeof(*pair), GFP_KERNEL);
+
+	if (pair) {
+		pair->a.parent = pair;
+		pair->b.parent = pair;
+		idr_init(&pair->idr);
+		return &pair->a;
+	}
+	return NULL;
+}
+
+static inline struct xtc_entry *dev_xtc_entry_fetch(struct net_device *dev,
+						    bool ingress, bool *created)
+{
+	struct xtc_entry *entry = ingress ?
+		rcu_dereference_rtnl(dev->xtc_ingress) :
+		rcu_dereference_rtnl(dev->xtc_egress);
+
+	*created = false;
+	if (!entry) {
+		entry = dev_xtc_entry_create();
+		if (!entry)
+			return NULL;
+		*created = true;
+	}
+	return entry;
+}
+
+static inline void dev_xtc_entry_clear(struct xtc_entry *entry)
+{
+	memset(entry->items, 0, sizeof(entry->items));
+}
+
+static inline int dev_xtc_entry_prio_new(struct xtc_entry *entry, u32 prio,
+					 struct bpf_prog *prog)
+{
+	int ret;
+
+	if (prio == 0)
+		prio = 1;
+	ret = idr_alloc_u32(&entry->parent->idr, prog, &prio, U32_MAX,
+			    GFP_KERNEL);
+	return ret < 0 ? ret : prio;
+}
+
+static inline void dev_xtc_entry_prio_set(struct xtc_entry *entry, u32 prio,
+					  struct bpf_prog *prog)
+{
+	idr_replace(&entry->parent->idr, prog, prio);
+}
+
+static inline void dev_xtc_entry_prio_del(struct xtc_entry *entry, u32 prio)
+{
+	idr_remove(&entry->parent->idr, prio);
+}
+
+static inline void dev_xtc_entry_free(struct xtc_entry *entry)
+{
+	idr_destroy(&entry->parent->idr);
+	kfree_rcu(entry->parent, rcu);
+}
+
+static inline u32 dev_xtc_entry_total(struct xtc_entry *entry)
+{
+	const struct bpf_prog_array_item *item;
+	const struct bpf_prog *prog;
+	u32 num = 0;
+
+	item = &entry->items[0];
+	while ((prog = READ_ONCE(item->prog))) {
+		num++;
+		item++;
+	}
+	return num;
+}
+
+static inline enum tc_action_base xtc_action_code(struct sk_buff *skb, int code)
+{
+	switch (code) {
+	case TC_PASS:
+		skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
+		fallthrough;
+	case TC_DROP:
+	case TC_REDIRECT:
+		return code;
+	case TC_NEXT:
+	default:
+		return TC_NEXT;
+	}
+}
+
+int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int xtc_prog_detach(const union bpf_attr *attr);
+int xtc_prog_query(const union bpf_attr *attr,
+		   union bpf_attr __user *uattr);
+void dev_xtc_uninstall(struct net_device *dev);
+#else
+static inline int xtc_prog_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int xtc_prog_detach(const union bpf_attr *attr)
+{
+	return -EINVAL;
+}
+
+static inline int xtc_prog_query(const union bpf_attr *attr,
+				 union bpf_attr __user *uattr)
+{
+	return -EINVAL;
+}
+
+static inline void dev_xtc_uninstall(struct net_device *dev)
+{
+}
+#endif /* CONFIG_NET_XGRESS */
+#endif /* __NET_XTC_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 51b9aa640ad2..de1f5546bcfe 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1025,6 +1025,8 @@ enum bpf_attach_type {
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
 	BPF_LSM_CGROUP,
+	BPF_NET_INGRESS,
+	BPF_NET_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1399,14 +1401,20 @@ union bpf_attr {
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
-		__u32		target_fd;	/* container object to attach to */
+		union {
+			__u32	target_fd;	/* container object to attach to */
+			__u32	target_ifindex; /* target ifindex */
+		};
 		__u32		attach_bpf_fd;	/* eBPF program to attach */
 		__u32		attach_type;
 		__u32		attach_flags;
-		__u32		replace_bpf_fd;	/* previously attached eBPF
+		union {
+			__u32	attach_priority;
+			__u32	replace_bpf_fd;	/* previously attached eBPF
 						 * program to replace if
 						 * BPF_F_REPLACE is used
 						 */
+		};
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
@@ -1452,7 +1460,10 @@ union bpf_attr {
 	} info;
 
 	struct { /* anonymous struct used by BPF_PROG_QUERY command */
-		__u32		target_fd;	/* container object to query */
+		union {
+			__u32	target_fd;	/* container object to query */
+			__u32	target_ifindex; /* target ifindex */
+		};
 		__u32		attach_type;
 		__u32		query_flags;
 		__u32		attach_flags;
@@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
 	};
 };
 
+/* (Simplified) user return codes for tc prog type.
+ * A valid tc program must return one of these defined values. All other
+ * return codes are reserved for future use. Must remain compatible with
+ * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
+ * return codes are mapped to TC_NEXT.
+ */
+enum tc_action_base {
+	TC_NEXT		= -1,
+	TC_PASS		= 0,
+	TC_DROP		= 2,
+	TC_REDIRECT	= 7,
+};
+
 struct bpf_xdp_sock {
 	__u32 queue_id;
 };
@@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
 	__be32	flow_label;
 };
 
+struct bpf_query_info {
+	__u32 prog_id;
+	__u32 prio;
+};
+
 struct bpf_func_info {
 	__u32	insn_off;
 	__u32	type_id;
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 2dfe1079f772..6a906ff93006 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -31,6 +31,7 @@ config BPF_SYSCALL
 	select TASKS_TRACE_RCU
 	select BINARY_PRINTF
 	select NET_SOCK_MSG if NET
+	select NET_XGRESS if NET
 	select PAGE_POOL if NET
 	default n
 	help
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..76c3f9d4e2f3 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
+obj-$(CONFIG_BPF_SYSCALL) += net.o
 endif
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
new file mode 100644
index 000000000000..ab9a9dee615b
--- /dev/null
+++ b/kernel/bpf/net.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Isovalent */
+
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/netdevice.h>
+
+#include <net/xtc.h>
+
+static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
+			     struct bpf_prog *nprog, u32 prio, u32 flags)
+{
+	struct bpf_prog_array_item *item, *tmp;
+	struct xtc_entry *entry, *peer;
+	struct bpf_prog *oprog;
+	bool created;
+	int i, j;
+
+	ASSERT_RTNL();
+
+	entry = dev_xtc_entry_fetch(dev, ingress, &created);
+	if (!entry)
+		return -ENOMEM;
+	for (i = 0; i < limit; i++) {
+		item = &entry->items[i];
+		oprog = item->prog;
+		if (!oprog)
+			break;
+		if (item->bpf_priority == prio) {
+			if (flags & BPF_F_REPLACE) {
+				/* Pairs with READ_ONCE() in xtc_run_progs(). */
+				WRITE_ONCE(item->prog, nprog);
+				bpf_prog_put(oprog);
+				dev_xtc_entry_prio_set(entry, prio, nprog);
+				return prio;
+			}
+			return -EBUSY;
+		}
+	}
+	if (dev_xtc_entry_total(entry) >= limit)
+		return -ENOSPC;
+	prio = dev_xtc_entry_prio_new(entry, prio, nprog);
+	if (prio < 0) {
+		if (created)
+			dev_xtc_entry_free(entry);
+		return -ENOMEM;
+	}
+	peer = dev_xtc_entry_peer(entry);
+	dev_xtc_entry_clear(peer);
+	for (i = 0, j = 0; i < limit; i++, j++) {
+		item = &entry->items[i];
+		tmp = &peer->items[j];
+		oprog = item->prog;
+		if (!oprog) {
+			if (i == j) {
+				tmp->prog = nprog;
+				tmp->bpf_priority = prio;
+			}
+			break;
+		} else if (item->bpf_priority < prio) {
+			tmp->prog = oprog;
+			tmp->bpf_priority = item->bpf_priority;
+		} else if (item->bpf_priority > prio) {
+			if (i == j) {
+				tmp->prog = nprog;
+				tmp->bpf_priority = prio;
+				tmp = &peer->items[++j];
+			}
+			tmp->prog = oprog;
+			tmp->bpf_priority = item->bpf_priority;
+		}
+	}
+	dev_xtc_entry_update(dev, peer, ingress);
+	if (ingress)
+		net_inc_ingress_queue();
+	else
+		net_inc_egress_queue();
+	xtc_inc();
+	return prio;
+}
+
+int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *nprog)
+{
+	struct net *net = current->nsproxy->net_ns;
+	bool ingress = attr->attach_type == BPF_NET_INGRESS;
+	struct net_device *dev;
+	int ret;
+
+	if (attr->attach_flags & ~BPF_F_REPLACE)
+		return -EINVAL;
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->target_ifindex);
+	if (!dev) {
+		rtnl_unlock();
+		return -EINVAL;
+	}
+	ret = __xtc_prog_attach(dev, ingress, XTC_MAX_ENTRIES, nprog,
+				attr->attach_priority, attr->attach_flags);
+	rtnl_unlock();
+	return ret;
+}
+
+static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32 limit,
+			     u32 prio)
+{
+	struct bpf_prog_array_item *item, *tmp;
+	struct bpf_prog *oprog, *fprog = NULL;
+	struct xtc_entry *entry, *peer;
+	int i, j;
+
+	ASSERT_RTNL();
+
+	entry = ingress ?
+		rcu_dereference_rtnl(dev->xtc_ingress) :
+		rcu_dereference_rtnl(dev->xtc_egress);
+	if (!entry)
+		return -ENOENT;
+	peer = dev_xtc_entry_peer(entry);
+	dev_xtc_entry_clear(peer);
+	for (i = 0, j = 0; i < limit; i++) {
+		item = &entry->items[i];
+		tmp = &peer->items[j];
+		oprog = item->prog;
+		if (!oprog)
+			break;
+		if (item->bpf_priority != prio) {
+			tmp->prog = oprog;
+			tmp->bpf_priority = item->bpf_priority;
+			j++;
+		} else {
+			fprog = oprog;
+		}
+	}
+	if (fprog) {
+		dev_xtc_entry_prio_del(peer, prio);
+		if (dev_xtc_entry_total(peer) == 0 && !entry->parent->miniq)
+			peer = NULL;
+		dev_xtc_entry_update(dev, peer, ingress);
+		bpf_prog_put(fprog);
+		if (!peer)
+			dev_xtc_entry_free(entry);
+		if (ingress)
+			net_dec_ingress_queue();
+		else
+			net_dec_egress_queue();
+		xtc_dec();
+		return 0;
+	}
+	return -ENOENT;
+}
+
+int xtc_prog_detach(const union bpf_attr *attr)
+{
+	struct net *net = current->nsproxy->net_ns;
+	bool ingress = attr->attach_type == BPF_NET_INGRESS;
+	struct net_device *dev;
+	int ret;
+
+	if (attr->attach_flags || !attr->attach_priority)
+		return -EINVAL;
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->target_ifindex);
+	if (!dev) {
+		rtnl_unlock();
+		return -EINVAL;
+	}
+	ret = __xtc_prog_detach(dev, ingress, XTC_MAX_ENTRIES,
+				attr->attach_priority);
+	rtnl_unlock();
+	return ret;
+}
+
+static void __xtc_prog_detach_all(struct net_device *dev, bool ingress, u32 limit)
+{
+	struct bpf_prog_array_item *item;
+	struct xtc_entry *entry;
+	struct bpf_prog *prog;
+	int i;
+
+	ASSERT_RTNL();
+
+	entry = ingress ?
+		rcu_dereference_rtnl(dev->xtc_ingress) :
+		rcu_dereference_rtnl(dev->xtc_egress);
+	if (!entry)
+		return;
+	dev_xtc_entry_update(dev, NULL, ingress);
+	for (i = 0; i < limit; i++) {
+		item = &entry->items[i];
+		prog = item->prog;
+		if (!prog)
+			break;
+		dev_xtc_entry_prio_del(entry, item->bpf_priority);
+		bpf_prog_put(prog);
+		if (ingress)
+			net_dec_ingress_queue();
+		else
+			net_dec_egress_queue();
+		xtc_dec();
+	}
+	dev_xtc_entry_free(entry);
+}
+
+void dev_xtc_uninstall(struct net_device *dev)
+{
+	__xtc_prog_detach_all(dev, true,  XTC_MAX_ENTRIES + 1);
+	__xtc_prog_detach_all(dev, false, XTC_MAX_ENTRIES + 1);
+}
+
+static int
+__xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
+		 struct net_device *dev, bool ingress, u32 limit)
+{
+	struct bpf_query_info info, __user *uinfo;
+	struct bpf_prog_array_item *item;
+	struct xtc_entry *entry;
+	struct bpf_prog *prog;
+	u32 i, flags = 0, cnt;
+	int ret = 0;
+
+	ASSERT_RTNL();
+
+	entry = ingress ?
+		rcu_dereference_rtnl(dev->xtc_ingress) :
+		rcu_dereference_rtnl(dev->xtc_egress);
+	if (!entry)
+		return -ENOENT;
+	cnt = dev_xtc_entry_total(entry);
+	if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
+		return -EFAULT;
+	if (copy_to_user(&uattr->query.prog_cnt, &cnt, sizeof(cnt)))
+		return -EFAULT;
+	uinfo = u64_to_user_ptr(attr->query.prog_ids);
+	if (attr->query.prog_cnt == 0 || !uinfo || !cnt)
+		/* return early if user requested only program count + flags */
+		return 0;
+	if (attr->query.prog_cnt < cnt) {
+		cnt = attr->query.prog_cnt;
+		ret = -ENOSPC;
+	}
+	for (i = 0; i < limit; i++) {
+		item = &entry->items[i];
+		prog = item->prog;
+		if (!prog)
+			break;
+		info.prog_id = prog->aux->id;
+		info.prio = item->bpf_priority;
+		if (copy_to_user(uinfo + i, &info, sizeof(info)))
+			return -EFAULT;
+		if (i + 1 == cnt)
+			break;
+	}
+	return ret;
+}
+
+int xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+	struct net *net = current->nsproxy->net_ns;
+	bool ingress = attr->query.attach_type == BPF_NET_INGRESS;
+	struct net_device *dev;
+	int ret;
+
+	if (attr->query.query_flags || attr->query.attach_flags)
+		return -EINVAL;
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->query.target_ifindex);
+	if (!dev) {
+		rtnl_unlock();
+		return -EINVAL;
+	}
+	ret = __xtc_prog_query(attr, uattr, dev, ingress, XTC_MAX_ENTRIES);
+	rtnl_unlock();
+	return ret;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7b373a5e861f..a0a670b964bb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -36,6 +36,8 @@
 #include <linux/memcontrol.h>
 #include <linux/trace_events.h>
 
+#include <net/xtc.h>
+
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
@@ -3448,6 +3450,9 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_XDP;
 	case BPF_LSM_CGROUP:
 		return BPF_PROG_TYPE_LSM;
+	case BPF_NET_INGRESS:
+	case BPF_NET_EGRESS:
+		return BPF_PROG_TYPE_SCHED_CLS;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -3466,18 +3471,15 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 
 	if (CHECK_ATTR(BPF_PROG_ATTACH))
 		return -EINVAL;
-
 	if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
 		return -EINVAL;
 
 	ptype = attach_type_to_prog_type(attr->attach_type);
 	if (ptype == BPF_PROG_TYPE_UNSPEC)
 		return -EINVAL;
-
 	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
-
 	if (bpf_prog_attach_check_attach_type(prog, attr->attach_type)) {
 		bpf_prog_put(prog);
 		return -EINVAL;
@@ -3508,16 +3510,18 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 
 		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = xtc_prog_attach(attr, prog);
+		break;
 	default:
 		ret = -EINVAL;
 	}
-
-	if (ret)
+	if (ret < 0)
 		bpf_prog_put(prog);
 	return ret;
 }
 
-#define BPF_PROG_DETACH_LAST_FIELD attach_type
+#define BPF_PROG_DETACH_LAST_FIELD replace_bpf_fd
 
 static int bpf_prog_detach(const union bpf_attr *attr)
 {
@@ -3527,6 +3531,9 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		return -EINVAL;
 
 	ptype = attach_type_to_prog_type(attr->attach_type);
+	if (ptype != BPF_PROG_TYPE_SCHED_CLS &&
+	    (attr->attach_flags || attr->replace_bpf_fd))
+		return -EINVAL;
 
 	switch (ptype) {
 	case BPF_PROG_TYPE_SK_MSG:
@@ -3545,6 +3552,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_SOCK_OPS:
 	case BPF_PROG_TYPE_LSM:
 		return cgroup_bpf_prog_detach(attr, ptype);
+	case BPF_PROG_TYPE_SCHED_CLS:
+		return xtc_prog_detach(attr);
 	default:
 		return -EINVAL;
 	}
@@ -3598,6 +3607,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_SK_MSG_VERDICT:
 	case BPF_SK_SKB_VERDICT:
 		return sock_map_bpf_prog_query(attr, uattr);
+	case BPF_NET_INGRESS:
+	case BPF_NET_EGRESS:
+		return xtc_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
 	}
diff --git a/net/Kconfig b/net/Kconfig
index 48c33c222199..b7a9cd174464 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -52,6 +52,11 @@ config NET_INGRESS
 config NET_EGRESS
 	bool
 
+config NET_XGRESS
+	select NET_INGRESS
+	select NET_EGRESS
+	bool
+
 config NET_REDIRECT
 	bool
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fa53830d0683..552b805c27dd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -107,6 +107,7 @@
 #include <net/pkt_cls.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
+#include <net/xtc.h>
 #include <linux/highmem.h>
 #include <linux/init.h>
 #include <linux/module.h>
@@ -154,7 +155,6 @@
 #include "dev.h"
 #include "net-sysfs.h"
 
-
 static DEFINE_SPINLOCK(ptype_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;	/* Taps */
@@ -3935,69 +3935,199 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 EXPORT_SYMBOL(dev_loopback_xmit);
 
 #ifdef CONFIG_NET_EGRESS
-static struct sk_buff *
-sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+static struct netdev_queue *
+netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
+{
+	int qm = skb_get_queue_mapping(skb);
+
+	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
+}
+
+static bool netdev_xmit_txqueue_skipped(void)
+{
+	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+}
+
+void netdev_xmit_skip_txqueue(bool skip)
+{
+	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
+}
+EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
+#endif /* CONFIG_NET_EGRESS */
+
+#ifdef CONFIG_NET_XGRESS
+static int tc_run(struct xtc_entry *entry, struct sk_buff *skb)
 {
+	int ret = TC_ACT_UNSPEC;
 #ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
-	struct tcf_result cl_res;
+	struct mini_Qdisc *miniq = rcu_dereference_bh(entry->parent->miniq);
+	struct tcf_result res;
 
 	if (!miniq)
-		return skb;
+		return ret;
 
-	/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
 	tc_skb_cb(skb)->mru = 0;
 	tc_skb_cb(skb)->post_ct = false;
-	mini_qdisc_bstats_cpu_update(miniq, skb);
 
-	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
+	mini_qdisc_bstats_cpu_update(miniq, skb);
+	ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
+	/* Only tcf related quirks below. */
+	switch (ret) {
+	case TC_ACT_SHOT:
+		mini_qdisc_qstats_cpu_drop(miniq);
+		break;
 	case TC_ACT_OK:
 	case TC_ACT_RECLASSIFY:
-		skb->tc_index = TC_H_MIN(cl_res.classid);
+		skb->tc_index = TC_H_MIN(res.classid);
 		break;
+	}
+#endif /* CONFIG_NET_CLS_ACT */
+	return ret;
+}
+
+static DEFINE_STATIC_KEY_FALSE(xtc_needed_key);
+
+void xtc_inc(void)
+{
+	static_branch_inc(&xtc_needed_key);
+}
+EXPORT_SYMBOL_GPL(xtc_inc);
+
+void xtc_dec(void)
+{
+	static_branch_dec(&xtc_needed_key);
+}
+EXPORT_SYMBOL_GPL(xtc_dec);
+
+static __always_inline enum tc_action_base
+xtc_run(const struct xtc_entry *entry, struct sk_buff *skb,
+	const bool needs_mac)
+{
+	const struct bpf_prog_array_item *item;
+	const struct bpf_prog *prog;
+	int ret = TC_NEXT;
+
+	if (needs_mac)
+		__skb_push(skb, skb->mac_len);
+	item = &entry->items[0];
+	while ((prog = READ_ONCE(item->prog))) {
+		bpf_compute_data_pointers(skb);
+		ret = bpf_prog_run(prog, skb);
+		if (ret != TC_NEXT)
+			break;
+		item++;
+	}
+	if (needs_mac)
+		__skb_pull(skb, skb->mac_len);
+	return xtc_action_code(skb, ret);
+}
+
+static __always_inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev, bool *another)
+{
+	struct xtc_entry *entry = rcu_dereference_bh(skb->dev->xtc_ingress);
+	int sch_ret;
+
+	if (!entry)
+		return skb;
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	qdisc_skb_cb(skb)->pkt_len = skb->len;
+	xtc_set_ingress(skb, true);
+
+	if (static_branch_unlikely(&xtc_needed_key)) {
+		sch_ret = xtc_run(entry, skb, true);
+		if (sch_ret != TC_ACT_UNSPEC)
+			goto ingress_verdict;
+	}
+	sch_ret = tc_run(entry, skb);
+ingress_verdict:
+	switch (sch_ret) {
+	case TC_ACT_REDIRECT:
+		/* skb_mac_header check was done by BPF, so we can safely
+		 * push the L2 header back before redirecting to another
+		 * netdev.
+		 */
+		__skb_push(skb, skb->mac_len);
+		if (skb_do_redirect(skb) == -EAGAIN) {
+			__skb_pull(skb, skb->mac_len);
+			*another = true;
+			break;
+		}
+		return NULL;
 	case TC_ACT_SHOT:
-		mini_qdisc_qstats_cpu_drop(miniq);
-		*ret = NET_XMIT_DROP;
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
 		return NULL;
+	/* used by tc_run */
 	case TC_ACT_STOLEN:
 	case TC_ACT_QUEUED:
 	case TC_ACT_TRAP:
-		*ret = NET_XMIT_SUCCESS;
 		consume_skb(skb);
+		fallthrough;
+	case TC_ACT_CONSUMED:
 		return NULL;
+	}
+
+	return skb;
+}
+
+static __always_inline struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+{
+	struct xtc_entry *entry = rcu_dereference_bh(dev->xtc_egress);
+	int sch_ret;
+
+	if (!entry)
+		return skb;
+
+	/* qdisc_skb_cb(skb)->pkt_len & xtc_set_ingress() was
+	 * already set by the caller.
+	 */
+	if (static_branch_unlikely(&xtc_needed_key)) {
+		sch_ret = xtc_run(entry, skb, false);
+		if (sch_ret != TC_ACT_UNSPEC)
+			goto egress_verdict;
+	}
+	sch_ret = tc_run(entry, skb);
+egress_verdict:
+	switch (sch_ret) {
 	case TC_ACT_REDIRECT:
+		*ret = NET_XMIT_SUCCESS;
 		/* No need to push/pop skb's mac_header here on egress! */
 		skb_do_redirect(skb);
+		return NULL;
+	case TC_ACT_SHOT:
+		*ret = NET_XMIT_DROP;
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
+		return NULL;
+	/* used by tc_run */
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+	case TC_ACT_TRAP:
 		*ret = NET_XMIT_SUCCESS;
 		return NULL;
-	default:
-		break;
 	}
-#endif /* CONFIG_NET_CLS_ACT */
 
 	return skb;
 }
-
-static struct netdev_queue *
-netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
-{
-	int qm = skb_get_queue_mapping(skb);
-
-	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
-}
-
-static bool netdev_xmit_txqueue_skipped(void)
+#else
+static __always_inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev, bool *another)
 {
-	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+	return skb;
 }
 
-void netdev_xmit_skip_txqueue(bool skip)
+static __always_inline struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
-	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
+	return skb;
 }
-EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
-#endif /* CONFIG_NET_EGRESS */
+#endif /* CONFIG_NET_XGRESS */
 
 #ifdef CONFIG_XPS
 static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
@@ -4181,9 +4311,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	skb_update_prio(skb);
 
 	qdisc_pkt_len_init(skb);
-#ifdef CONFIG_NET_CLS_ACT
-	skb->tc_at_ingress = 0;
-#endif
+	xtc_set_ingress(skb, false);
 #ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
 		if (nf_hook_egress_active()) {
@@ -5101,68 +5229,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-static inline struct sk_buff *
-sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
-		   struct net_device *orig_dev, bool *another)
-{
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
-	struct tcf_result cl_res;
-
-	/* If there's at least one ingress present somewhere (so
-	 * we get here via enabled static key), remaining devices
-	 * that are not configured with an ingress qdisc will bail
-	 * out here.
-	 */
-	if (!miniq)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	qdisc_skb_cb(skb)->pkt_len = skb->len;
-	tc_skb_cb(skb)->mru = 0;
-	tc_skb_cb(skb)->post_ct = false;
-	skb->tc_at_ingress = 1;
-	mini_qdisc_bstats_cpu_update(miniq, skb);
-
-	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
-	case TC_ACT_OK:
-	case TC_ACT_RECLASSIFY:
-		skb->tc_index = TC_H_MIN(cl_res.classid);
-		break;
-	case TC_ACT_SHOT:
-		mini_qdisc_qstats_cpu_drop(miniq);
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
-		return NULL;
-	case TC_ACT_STOLEN:
-	case TC_ACT_QUEUED:
-	case TC_ACT_TRAP:
-		consume_skb(skb);
-		return NULL;
-	case TC_ACT_REDIRECT:
-		/* skb_mac_header check was done by cls/act_bpf, so
-		 * we can safely push the L2 header back before
-		 * redirecting to another netdev
-		 */
-		__skb_push(skb, skb->mac_len);
-		if (skb_do_redirect(skb) == -EAGAIN) {
-			__skb_pull(skb, skb->mac_len);
-			*another = true;
-			break;
-		}
-		return NULL;
-	case TC_ACT_CONSUMED:
-		return NULL;
-	default:
-		break;
-	}
-#endif /* CONFIG_NET_CLS_ACT */
-	return skb;
-}
-
 /**
  *	netdev_is_rx_handler_busy - check if receive handler is registered
  *	@dev: device to check
@@ -10832,7 +10898,7 @@ void unregister_netdevice_many(struct list_head *head)
 
 		/* Shutdown queueing discipline. */
 		dev_shutdown(dev);
-
+		dev_xtc_uninstall(dev);
 		dev_xdp_uninstall(dev);
 
 		netdev_offload_xstats_disable_all(dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index bb0136e7a8e4..ac4bb016c5ee 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9132,7 +9132,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
 	__u8 value_reg = si->dst_reg;
 	__u8 skb_reg = si->src_reg;
 
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	/* If the tstamp_type is read,
 	 * the bpf prog is aware the tstamp could have delivery time.
 	 * Thus, read skb->tstamp as is if tstamp_type_access is true.
@@ -9166,7 +9166,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
 	__u8 value_reg = si->src_reg;
 	__u8 skb_reg = si->dst_reg;
 
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	/* If the tstamp_type is read,
 	 * the bpf prog is aware the tstamp could have delivery time.
 	 * Thus, write skb->tstamp as is if tstamp_type_access is true.
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 1e8ab4749c6c..c1b8f2e7d966 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -382,8 +382,7 @@ config NET_SCH_FQ_PIE
 config NET_SCH_INGRESS
 	tristate "Ingress/classifier-action Qdisc"
 	depends on NET_CLS_ACT
-	select NET_INGRESS
-	select NET_EGRESS
+	select NET_XGRESS
 	help
 	  Say Y here if you want to use classifiers for incoming and/or outgoing
 	  packets. This qdisc doesn't do anything else besides running classifiers,
@@ -753,6 +752,7 @@ config NET_EMATCH_IPT
 config NET_CLS_ACT
 	bool "Actions"
 	select NET_CLS
+	select NET_XGRESS
 	help
 	  Say Y here if you want to use traffic control actions. Actions
 	  get attached to classifiers and are invoked after a successful
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 84838128b9c5..3bd37ee898ce 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -13,6 +13,7 @@
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
+#include <net/xtc.h>
 
 struct ingress_sched_data {
 	struct tcf_block *block;
@@ -78,11 +79,19 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct xtc_entry *entry;
+	bool created;
 	int err;
 
 	net_inc_ingress_queue();
 
-	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
+	entry = dev_xtc_entry_fetch(dev, true, &created);
+	if (!entry)
+		return -ENOMEM;
+
+	mini_qdisc_pair_init(&q->miniqp, sch, &entry->parent->miniq);
+	if (created)
+		dev_xtc_entry_update(dev, entry, true);
 
 	q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	q->block_info.chain_head_change = clsact_chain_head_change;
@@ -93,15 +102,20 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 		return err;
 
 	mini_qdisc_pair_block_init(&q->miniqp, q->block);
-
 	return 0;
 }
 
 static void ingress_destroy(struct Qdisc *sch)
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct xtc_entry *entry = rtnl_dereference(dev->xtc_ingress);
 
 	tcf_block_put_ext(q->block, sch, &q->block_info);
+	if (entry && dev_xtc_entry_total(entry) == 0) {
+		dev_xtc_entry_update(dev, NULL, true);
+		dev_xtc_entry_free(entry);
+	}
 	net_dec_ingress_queue();
 }
 
@@ -217,12 +231,20 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct xtc_entry *entry;
+	bool created;
 	int err;
 
 	net_inc_ingress_queue();
 	net_inc_egress_queue();
 
-	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
+	entry = dev_xtc_entry_fetch(dev, true, &created);
+	if (!entry)
+		return -ENOMEM;
+
+	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &entry->parent->miniq);
+	if (created)
+		dev_xtc_entry_update(dev, entry, true);
 
 	q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	q->ingress_block_info.chain_head_change = clsact_chain_head_change;
@@ -235,7 +257,13 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 
 	mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
 
-	mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
+	entry = dev_xtc_entry_fetch(dev, false, &created);
+	if (!entry)
+		return -ENOMEM;
+
+	mini_qdisc_pair_init(&q->miniqp_egress, sch, &entry->parent->miniq);
+	if (created)
+		dev_xtc_entry_update(dev, entry, false);
 
 	q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
 	q->egress_block_info.chain_head_change = clsact_chain_head_change;
@@ -247,9 +275,21 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 static void clsact_destroy(struct Qdisc *sch)
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct xtc_entry *ingress_entry = rtnl_dereference(dev->xtc_ingress);
+	struct xtc_entry *egress_entry = rtnl_dereference(dev->xtc_egress);
 
 	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
+	if (egress_entry && dev_xtc_entry_total(egress_entry) == 0) {
+		dev_xtc_entry_update(dev, NULL, false);
+		dev_xtc_entry_free(egress_entry);
+	}
+
 	tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
+	if (ingress_entry && dev_xtc_entry_total(ingress_entry) == 0) {
+		dev_xtc_entry_update(dev, NULL, true);
+		dev_xtc_entry_free(ingress_entry);
+	}
 
 	net_dec_ingress_queue();
 	net_dec_egress_queue();
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 51b9aa640ad2..de1f5546bcfe 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1025,6 +1025,8 @@ enum bpf_attach_type {
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
 	BPF_LSM_CGROUP,
+	BPF_NET_INGRESS,
+	BPF_NET_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1399,14 +1401,20 @@ union bpf_attr {
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
-		__u32		target_fd;	/* container object to attach to */
+		union {
+			__u32	target_fd;	/* container object to attach to */
+			__u32	target_ifindex; /* target ifindex */
+		};
 		__u32		attach_bpf_fd;	/* eBPF program to attach */
 		__u32		attach_type;
 		__u32		attach_flags;
-		__u32		replace_bpf_fd;	/* previously attached eBPF
+		union {
+			__u32	attach_priority;
+			__u32	replace_bpf_fd;	/* previously attached eBPF
 						 * program to replace if
 						 * BPF_F_REPLACE is used
 						 */
+		};
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
@@ -1452,7 +1460,10 @@ union bpf_attr {
 	} info;
 
 	struct { /* anonymous struct used by BPF_PROG_QUERY command */
-		__u32		target_fd;	/* container object to query */
+		union {
+			__u32	target_fd;	/* container object to query */
+			__u32	target_ifindex; /* target ifindex */
+		};
 		__u32		attach_type;
 		__u32		query_flags;
 		__u32		attach_flags;
@@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
 	};
 };
 
+/* (Simplified) user return codes for tc prog type.
+ * A valid tc program must return one of these defined values. All other
+ * return codes are reserved for future use. Must remain compatible with
+ * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
+ * return codes are mapped to TC_NEXT.
+ */
+enum tc_action_base {
+	TC_NEXT		= -1,
+	TC_PASS		= 0,
+	TC_DROP		= 2,
+	TC_REDIRECT	= 7,
+};
+
 struct bpf_xdp_sock {
 	__u32 queue_id;
 };
@@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
 	__be32	flow_label;
 };
 
+struct bpf_query_info {
+	__u32 prog_id;
+	__u32 prio;
+};
+
 struct bpf_func_info {
 	__u32	insn_off;
 	__u32	type_id;
-- 
2.34.1



* [PATCH bpf-next 02/10] bpf: Implement BPF link handling for tc BPF programs
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
                     ` (2 more replies)
  2022-10-04 23:11 ` [PATCH bpf-next 03/10] bpf: Implement link update for tc BPF link programs Daniel Borkmann
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

This work adds BPF links for tc. As a recap, a BPF link represents the
attachment of a BPF program to a BPF hook point. The BPF link holds a
single reference to keep the BPF program alive. Moreover, hook points
do not reference a BPF link; only the application's fd or pinning does.
A BPF link holds metadata specific to the attachment and implements
operations for link creation, (atomic) BPF program update, detachment
and introspection.
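
As a rough user space sketch of what link creation looks like with the
uapi bits added here (prog_fd and ifindex are placeholders, error
handling is omitted, and libbpf convenience wrappers follow in a later
patch of this series):

  #include <unistd.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* Sketch: create a tc BPF link attaching prog_fd to the ingress
   * hook of ifindex. The returned fd owns the attachment; closing it
   * (without pinning) auto-detaches the program.
   */
  static int tc_bpf_link_create(int prog_fd, int ifindex, unsigned int prio)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.link_create.prog_fd        = prog_fd;
          attr.link_create.target_ifindex = ifindex;
          attr.link_create.attach_type    = BPF_NET_INGRESS;
          attr.link_create.tc.priority    = prio; /* 0 = auto-allocate */

          return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
  }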

The motivation for BPF links for tc BPF programs is multi-fold, for example:

- "It's especially important for applications that are deployed fleet-wide
   and that don't "control" hosts they are deployed to. If such application
   crashes and no one notices and does anything about that, BPF program will
   keep running draining resources or even just, say, dropping packets. We
   at FB had outages due to such permanent BPF attachment semantics. With
   fd-based BPF link we are getting a framework, which allows safe, auto-
   detachable behavior by default, unless application explicitly opts in by
   pinning the BPF link." [0]

-  From the Cilium side, the tc BPF programs we attach to host-facing
   veth devices and physical devices build the core datapath for
   Kubernetes Pods, and they implement forwarding, load-balancing,
   policy, EDT-management, etc, within BPF. Currently there is no
   concept of 'safe' ownership, e.g. we've recently experienced
   hard-to-debug issues in a user's staging environment where another
   Kubernetes application using tc BPF attached to the same prio/handle
   of cls_bpf and wiped all Cilium-based BPF programs from underneath
   it. The goal is to establish a clear/safe ownership model via links
   which cannot accidentally be overridden. [1]

BPF links for tc can co-exist with non-link attachments, and the
semantics are also in line with XDP links: BPF links cannot replace
other BPF links, BPF links cannot replace non-BPF links, non-BPF links
cannot replace BPF links and, lastly, only non-BPF links can replace
non-BPF links. In the case of Cilium, this solves the mentioned safe
ownership issue, as 3rd party applications would not be able to
accidentally wipe Cilium programs, even if they are not BPF link aware.

Earlier attempts [2] tried to integrate BPF links into the core tc
machinery in order to solve this for cls_bpf. That approach was
intrusive to the generic tc kernel API, with extensions specific only
to cls_bpf, and it remained suboptimal/complex since cls_bpf could
still be wiped from the qdisc. Locking a tc BPF program in place this
way ends up in layering hacks, given the two object models are vastly
different. We chose to first implement the fd-based tc BPF attach API
as a prerequisite, so that the BPF link implementation fits in
naturally, similar to other link types which are fd-based, and without
the need for changing core tc internal APIs.

BPF programs for tc can then be successively migrated from cls_bpf to
the new tc BPF link without needing to change the program's source
code, only the BPF loader's mechanics for attaching.

  [0] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com/
  [1] https://lpc.events/event/16/contributions/1353/
  [2] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com/

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/bpf.h            |   5 +-
 include/net/xtc.h              |  14 ++++
 include/uapi/linux/bpf.h       |   5 ++
 kernel/bpf/net.c               | 116 ++++++++++++++++++++++++++++++---
 kernel/bpf/syscall.c           |   3 +
 tools/include/uapi/linux/bpf.h |   5 ++
 6 files changed, 139 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 71e5f43db378..226a74f65704 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1473,7 +1473,10 @@ struct bpf_prog_array_item {
 	union {
 		struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
 		u64 bpf_cookie;
-		u32 bpf_priority;
+		struct {
+			u32 bpf_priority;
+			u32 bpf_id;
+		};
 	};
 };
 
diff --git a/include/net/xtc.h b/include/net/xtc.h
index 627dc18aa433..e4a8cee09490 100644
--- a/include/net/xtc.h
+++ b/include/net/xtc.h
@@ -27,6 +27,13 @@ struct xtc_entry_pair {
 	struct xtc_entry	b;
 };
 
+struct bpf_tc_link {
+	struct bpf_link link;
+	struct net_device *dev;
+	u32 priority;
+	u32 location;
+};
+
 static inline void xtc_set_ingress(struct sk_buff *skb, bool ingress)
 {
 #ifdef CONFIG_NET_XGRESS
@@ -155,6 +162,7 @@ int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 int xtc_prog_detach(const union bpf_attr *attr);
 int xtc_prog_query(const union bpf_attr *attr,
 		   union bpf_attr __user *uattr);
+int xtc_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 void dev_xtc_uninstall(struct net_device *dev);
 #else
 static inline int xtc_prog_attach(const union bpf_attr *attr,
@@ -174,6 +182,12 @@ static inline int xtc_prog_query(const union bpf_attr *attr,
 	return -EINVAL;
 }
 
+static inline int xtc_link_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
 static inline void dev_xtc_uninstall(struct net_device *dev)
 {
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index de1f5546bcfe..c006f561648e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1043,6 +1043,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_PERF_EVENT = 7,
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
+	BPF_LINK_TYPE_TC = 10,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -1541,6 +1542,9 @@ union bpf_attr {
 				 */
 				__u64		cookie;
 			} tracing;
+			struct {
+				__u32		priority;
+			} tc;
 		};
 	} link_create;
 
@@ -6830,6 +6834,7 @@ struct bpf_flow_keys {
 
 struct bpf_query_info {
 	__u32 prog_id;
+	__u32 link_id;
 	__u32 prio;
 };
 
diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
index ab9a9dee615b..22b7a9b05483 100644
--- a/kernel/bpf/net.c
+++ b/kernel/bpf/net.c
@@ -8,7 +8,7 @@
 #include <net/xtc.h>
 
 static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
-			     struct bpf_prog *nprog, u32 prio, u32 flags)
+			     u32 id, struct bpf_prog *nprog, u32 prio, u32 flags)
 {
 	struct bpf_prog_array_item *item, *tmp;
 	struct xtc_entry *entry, *peer;
@@ -27,10 +27,13 @@ static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
 		if (!oprog)
 			break;
 		if (item->bpf_priority == prio) {
-			if (flags & BPF_F_REPLACE) {
+			if (item->bpf_id == id &&
+			    (flags & BPF_F_REPLACE)) {
 				/* Pairs with READ_ONCE() in xtc_run_progs(). */
 				WRITE_ONCE(item->prog, nprog);
-				bpf_prog_put(oprog);
+				item->bpf_id = id;
+				if (!id)
+					bpf_prog_put(oprog);
 				dev_xtc_entry_prio_set(entry, prio, nprog);
 				return prio;
 			}
@@ -55,19 +58,23 @@ static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
 			if (i == j) {
 				tmp->prog = nprog;
 				tmp->bpf_priority = prio;
+				tmp->bpf_id = id;
 			}
 			break;
 		} else if (item->bpf_priority < prio) {
 			tmp->prog = oprog;
 			tmp->bpf_priority = item->bpf_priority;
+			tmp->bpf_id = item->bpf_id;
 		} else if (item->bpf_priority > prio) {
 			if (i == j) {
 				tmp->prog = nprog;
 				tmp->bpf_priority = prio;
+				tmp->bpf_id = id;
 				tmp = &peer->items[++j];
 			}
 			tmp->prog = oprog;
 			tmp->bpf_priority = item->bpf_priority;
+			tmp->bpf_id = item->bpf_id;
 		}
 	}
 	dev_xtc_entry_update(dev, peer, ingress);
@@ -94,14 +101,14 @@ int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *nprog)
 		rtnl_unlock();
 		return -EINVAL;
 	}
-	ret = __xtc_prog_attach(dev, ingress, XTC_MAX_ENTRIES, nprog,
+	ret = __xtc_prog_attach(dev, ingress, XTC_MAX_ENTRIES, 0, nprog,
 				attr->attach_priority, attr->attach_flags);
 	rtnl_unlock();
 	return ret;
 }
 
 static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32 limit,
-			     u32 prio)
+			     u32 id, u32 prio)
 {
 	struct bpf_prog_array_item *item, *tmp;
 	struct bpf_prog *oprog, *fprog = NULL;
@@ -126,8 +133,11 @@ static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32 limit,
 		if (item->bpf_priority != prio) {
 			tmp->prog = oprog;
 			tmp->bpf_priority = item->bpf_priority;
+			tmp->bpf_id = item->bpf_id;
 			j++;
 		} else {
+			if (item->bpf_id != id)
+				return -EBUSY;
 			fprog = oprog;
 		}
 	}
@@ -136,7 +146,8 @@ static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32 limit,
 		if (dev_xtc_entry_total(peer) == 0 && !entry->parent->miniq)
 			peer = NULL;
 		dev_xtc_entry_update(dev, peer, ingress);
-		bpf_prog_put(fprog);
+		if (!id)
+			bpf_prog_put(fprog);
 		if (!peer)
 			dev_xtc_entry_free(entry);
 		if (ingress)
@@ -164,7 +175,7 @@ int xtc_prog_detach(const union bpf_attr *attr)
 		rtnl_unlock();
 		return -EINVAL;
 	}
-	ret = __xtc_prog_detach(dev, ingress, XTC_MAX_ENTRIES,
+	ret = __xtc_prog_detach(dev, ingress, XTC_MAX_ENTRIES, 0,
 				attr->attach_priority);
 	rtnl_unlock();
 	return ret;
@@ -191,7 +202,8 @@ static void __xtc_prog_detach_all(struct net_device *dev, bool ingress, u32 limi
 		if (!prog)
 			break;
 		dev_xtc_entry_prio_del(entry, item->bpf_priority);
-		bpf_prog_put(prog);
+		if (!item->bpf_id)
+			bpf_prog_put(prog);
 		if (ingress)
 			net_dec_ingress_queue();
 		else
@@ -244,6 +256,7 @@ __xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
 		if (!prog)
 			break;
 		info.prog_id = prog->aux->id;
+		info.link_id = item->bpf_id;
 		info.prio = item->bpf_priority;
 		if (copy_to_user(uinfo + i, &info, sizeof(info)))
 			return -EFAULT;
@@ -272,3 +285,90 @@ int xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
 	rtnl_unlock();
 	return ret;
 }
+
+static int __xtc_link_attach(struct bpf_link *l, u32 id)
+{
+	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
+	int ret;
+
+	rtnl_lock();
+	ret = __xtc_prog_attach(link->dev, link->location == BPF_NET_INGRESS,
+				XTC_MAX_ENTRIES, id, l->prog, link->priority,
+				0);
+	if (ret > 0) {
+		link->priority = ret;
+		ret = 0;
+	}
+	rtnl_unlock();
+	return ret;
+}
+
+static void xtc_link_release(struct bpf_link *l)
+{
+	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
+
+	rtnl_lock();
+	if (link->dev) {
+		WARN_ON(__xtc_prog_detach(link->dev,
+					  link->location == BPF_NET_INGRESS,
+					  XTC_MAX_ENTRIES, l->id, link->priority));
+		link->dev = NULL;
+	}
+	rtnl_unlock();
+}
+
+static void xtc_link_dealloc(struct bpf_link *l)
+{
+	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
+
+	kfree(link);
+}
+
+static const struct bpf_link_ops bpf_tc_link_lops = {
+	.release	= xtc_link_release,
+	.dealloc	= xtc_link_dealloc,
+};
+
+int xtc_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_link_primer link_primer;
+	struct bpf_tc_link *link;
+	struct net_device *dev;
+	int fd, err;
+
+	if (attr->link_create.flags)
+		return -EINVAL;
+	dev = dev_get_by_index(net, attr->link_create.target_ifindex);
+	if (!dev)
+		return -EINVAL;
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link) {
+		err = -ENOMEM;
+		goto out_put;
+	}
+
+	bpf_link_init(&link->link, BPF_LINK_TYPE_TC, &bpf_tc_link_lops, prog);
+	link->priority = attr->link_create.tc.priority;
+	link->location = attr->link_create.attach_type;
+	link->dev = dev;
+
+	err = bpf_link_prime(&link->link, &link_primer);
+	if (err) {
+		kfree(link);
+		goto out_put;
+	}
+	err = __xtc_link_attach(&link->link, link_primer.id);
+	if (err) {
+		link->dev = NULL;
+		bpf_link_cleanup(&link_primer);
+		goto out_put;
+	}
+
+	fd = bpf_link_settle(&link_primer);
+	dev_put(dev);
+	return fd;
+out_put:
+	dev_put(dev);
+	return err;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a0a670b964bb..4456df481381 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4613,6 +4613,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	case BPF_PROG_TYPE_XDP:
 		ret = bpf_xdp_link_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = xtc_link_attach(attr, prog);
+		break;
 #endif
 	case BPF_PROG_TYPE_PERF_EVENT:
 	case BPF_PROG_TYPE_TRACEPOINT:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index de1f5546bcfe..c006f561648e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1043,6 +1043,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_PERF_EVENT = 7,
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
+	BPF_LINK_TYPE_TC = 10,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -1541,6 +1542,9 @@ union bpf_attr {
 				 */
 				__u64		cookie;
 			} tracing;
+			struct {
+				__u32		priority;
+			} tc;
 		};
 	} link_create;
 
@@ -6830,6 +6834,7 @@ struct bpf_flow_keys {
 
 struct bpf_query_info {
 	__u32 prog_id;
+	__u32 link_id;
 	__u32 prio;
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH bpf-next 03/10] bpf: Implement link update for tc BPF link programs
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
  2022-10-04 23:11 ` [PATCH bpf-next 02/10] bpf: Implement BPF link handling for " Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-04 23:11 ` [PATCH bpf-next 04/10] bpf: Implement link introspection " Daniel Borkmann
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

Add support for the LINK_UPDATE command for tc BPF links to allow for reliable
replacement of the underlying BPF program.
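
As a rough usage sketch from user space (illustrative only, not part of this
patch; skeleton and program names are made up, error handling trimmed):

  struct bpf_link *link = skel->links.tc_handler_in;
  int err;

  /* Atomically swap the program backing the tc BPF link. On failure
   * the previously attached program stays in place.
   */
  err = bpf_link__update_program(link, skel->progs.tc_handler_in_v2);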

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/net.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
index 22b7a9b05483..c50bcf656b3f 100644
--- a/kernel/bpf/net.c
+++ b/kernel/bpf/net.c
@@ -303,6 +303,39 @@ static int __xtc_link_attach(struct bpf_link *l, u32 id)
 	return ret;
 }
 
+static int xtc_link_update(struct bpf_link *l, struct bpf_prog *nprog,
+			   struct bpf_prog *oprog)
+{
+	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
+	int ret = 0;
+
+	rtnl_lock();
+	if (!link->dev) {
+		ret = -ENOLINK;
+		goto out;
+	}
+	if (oprog && l->prog != oprog) {
+		ret = -EPERM;
+		goto out;
+	}
+	oprog = l->prog;
+	if (oprog == nprog) {
+		bpf_prog_put(nprog);
+		goto out;
+	}
+	ret = __xtc_prog_attach(link->dev, link->location == BPF_NET_INGRESS,
+				XTC_MAX_ENTRIES, l->id, nprog, link->priority,
+				BPF_F_REPLACE);
+	if (ret == link->priority) {
+		oprog = xchg(&l->prog, nprog);
+		bpf_prog_put(oprog);
+		ret = 0;
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
 static void xtc_link_release(struct bpf_link *l)
 {
 	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
@@ -327,6 +360,7 @@ static void xtc_link_dealloc(struct bpf_link *l)
 static const struct bpf_link_ops bpf_tc_link_lops = {
 	.release	= xtc_link_release,
 	.dealloc	= xtc_link_dealloc,
+	.update_prog	= xtc_link_update,
 };
 
 int xtc_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH bpf-next 04/10] bpf: Implement link introspection for tc BPF link programs
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
                   ` (2 preceding siblings ...)
  2022-10-04 23:11 ` [PATCH bpf-next 03/10] bpf: Implement link update for tc BPF link programs Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-06 23:14   ` Martin KaFai Lau
  2022-10-04 23:11 ` [PATCH bpf-next 05/10] bpf: Implement link detach " Daniel Borkmann
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

Implement tc BPF link-specific show_fdinfo and fill_link_info handlers to emit
the ifindex, attach location and priority.
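
For illustration, the new fields can be read back via bpf_obj_get_info_by_fd()
as in the sketch below (not part of this patch; the link variable is assumed
to come from a prior attach, error handling trimmed):

  struct bpf_link_info info = {};
  __u32 len = sizeof(info);
  int err;

  err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &info, &len);
  if (!err)
          printf("ifindex %u attach_type %u prio %u\n",
                 info.tc.ifindex, info.tc.attach_type, info.tc.priority);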

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/uapi/linux/bpf.h       |  5 +++++
 kernel/bpf/net.c               | 36 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  5 +++++
 3 files changed, 46 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c006f561648e..f1b089170b78 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6309,6 +6309,11 @@ struct bpf_link_info {
 		struct {
 			__u32 ifindex;
 		} xdp;
+		struct {
+			__u32 ifindex;
+			__u32 attach_type;
+			__u32 priority;
+		} tc;
 	};
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
index c50bcf656b3f..a74b86bb60a9 100644
--- a/kernel/bpf/net.c
+++ b/kernel/bpf/net.c
@@ -357,10 +357,46 @@ static void xtc_link_dealloc(struct bpf_link *l)
 	kfree(link);
 }
 
+static void xtc_link_fdinfo(const struct bpf_link *l, struct seq_file *seq)
+{
+	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (link->dev)
+		ifindex = link->dev->ifindex;
+	rtnl_unlock();
+
+	seq_printf(seq, "ifindex:\t%u\n", ifindex);
+	seq_printf(seq, "attach_type:\t%u (%s)\n",
+		   link->location,
+		   link->location == BPF_NET_INGRESS ? "ingress" : "egress");
+	seq_printf(seq, "priority:\t%u\n", link->priority);
+}
+
+static int xtc_link_fill_info(const struct bpf_link *l,
+			      struct bpf_link_info *info)
+{
+	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (link->dev)
+		ifindex = link->dev->ifindex;
+	rtnl_unlock();
+
+	info->tc.ifindex = ifindex;
+	info->tc.attach_type = link->location;
+	info->tc.priority = link->priority;
+	return 0;
+}
+
 static const struct bpf_link_ops bpf_tc_link_lops = {
 	.release	= xtc_link_release,
 	.dealloc	= xtc_link_dealloc,
 	.update_prog	= xtc_link_update,
+	.show_fdinfo	= xtc_link_fdinfo,
+	.fill_link_info	= xtc_link_fill_info,
 };
 
 int xtc_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c006f561648e..f1b089170b78 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6309,6 +6309,11 @@ struct bpf_link_info {
 		struct {
 			__u32 ifindex;
 		} xdp;
+		struct {
+			__u32 ifindex;
+			__u32 attach_type;
+			__u32 priority;
+		} tc;
 	};
 } __attribute__((aligned(8)));
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH bpf-next 05/10] bpf: Implement link detach for tc BPF link programs
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
                   ` (3 preceding siblings ...)
  2022-10-04 23:11 ` [PATCH bpf-next 04/10] bpf: Implement link introspection " Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-06 23:24   ` Martin KaFai Lau
  2022-10-04 23:11 ` [PATCH bpf-next 06/10] libbpf: Change signature of bpf_prog_query Daniel Borkmann
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

Add support for the forced detach operation of a tc BPF link. This detaches
the link without destroying it. It has the same semantics as the auto-detach
of a BPF link when, for example, the underlying net device of a tc or XDP BPF
link is destroyed. In this case the BPF link remains a valid kernel object,
but becomes defunct since it is no longer attached anywhere. It still holds a
reference to the BPF program, though. This functionality allows users with
sufficient access rights to manually force-detach an attached tc BPF link
without killing the respective owner process and to then introspect/debug the
BPF assets. A similar LINK_DETACH operation exists for other BPF link types.
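
A minimal usage sketch via libbpf's bpf_link__detach() wrapper (illustrative
only; the link variable is assumed to come from an earlier attach):

  int err;

  /* Force-detach: the link stays a valid kernel object holding a
   * reference on the program, but is no longer attached to the device.
   */
  err = bpf_link__detach(link);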

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/net.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
index a74b86bb60a9..5650f62c1315 100644
--- a/kernel/bpf/net.c
+++ b/kernel/bpf/net.c
@@ -350,6 +350,12 @@ static void xtc_link_release(struct bpf_link *l)
 	rtnl_unlock();
 }
 
+static int xtc_link_detach(struct bpf_link *l)
+{
+	xtc_link_release(l);
+	return 0;
+}
+
 static void xtc_link_dealloc(struct bpf_link *l)
 {
 	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
@@ -393,6 +399,7 @@ static int xtc_link_fill_info(const struct bpf_link *l,
 
 static const struct bpf_link_ops bpf_tc_link_lops = {
 	.release	= xtc_link_release,
+	.detach		= xtc_link_detach,
 	.dealloc	= xtc_link_dealloc,
 	.update_prog	= xtc_link_update,
 	.show_fdinfo	= xtc_link_fdinfo,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH bpf-next 06/10] libbpf: Change signature of bpf_prog_query
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
                   ` (4 preceding siblings ...)
  2022-10-04 23:11 ` [PATCH bpf-next 05/10] bpf: Implement link detach " Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-04 23:11 ` [PATCH bpf-next 07/10] libbpf: Add extended attach/detach opts Daniel Borkmann
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

Minor signature change for the bpf_prog_query() API with no change in
behavior. An alternative option would have been to add a new libbpf
introspection API with a close to 1:1 implementation of bpf_prog_query()
but with a changed prog_ids pointer. Given the change is minor enough, we
went with the first option here.
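
With the void pointer, the tc attach points can pass an array of struct
bpf_query_info instead of plain program IDs, roughly as in the following
sketch (ifindex is assumed to be set up by the caller, error handling
trimmed):

  struct bpf_query_info progs[16] = {};
  __u32 attach_flags = 0, prog_cnt = ARRAY_SIZE(progs);
  int err;

  err = bpf_prog_query(ifindex, BPF_NET_INGRESS, 0, &attach_flags,
                       progs, &prog_cnt);
  /* On success, progs[0..prog_cnt-1] hold prog_id, link_id and prio. */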

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/bpf.c | 2 +-
 tools/lib/bpf/bpf.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 1d49a0352836..18b1e91cc469 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -846,7 +846,7 @@ int bpf_prog_query_opts(int target_fd,
 }
 
 int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
-		   __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt)
+		   __u32 *attach_flags, void *prog_ids, __u32 *prog_cnt)
 {
 	LIBBPF_OPTS(bpf_prog_query_opts, opts);
 	int ret;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9c50beabdd14..bef7a5282188 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -386,7 +386,7 @@ LIBBPF_API int bpf_prog_query_opts(int target_fd,
 				   struct bpf_prog_query_opts *opts);
 LIBBPF_API int bpf_prog_query(int target_fd, enum bpf_attach_type type,
 			      __u32 query_flags, __u32 *attach_flags,
-			      __u32 *prog_ids, __u32 *prog_cnt);
+			      void *prog_ids, __u32 *prog_cnt);
 
 LIBBPF_API int bpf_raw_tracepoint_open(const char *name, int prog_fd);
 LIBBPF_API int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH bpf-next 07/10] libbpf: Add extended attach/detach opts
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
                   ` (5 preceding siblings ...)
  2022-10-04 23:11 ` [PATCH bpf-next 06/10] libbpf: Change signature of bpf_prog_query Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-04 23:11 ` [PATCH bpf-next 08/10] libbpf: Add support for BPF tc link Daniel Borkmann
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

Extend the libbpf attach opts and add a new detach opts API so they can be
used to add/remove fd-based tc BPF programs. For concrete usage examples,
see the extensive selftests that have been developed as part of this series.
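
A short usage sketch (illustrative only; prog_fd and ifindex are assumed to
be set up by the caller, error handling trimmed):

  DECLARE_LIBBPF_OPTS(bpf_prog_attach_opts, opta, .attach_priority = 1);
  DECLARE_LIBBPF_OPTS(bpf_prog_detach_opts, optd, .attach_priority = 1);
  int err;

  /* On success, attach returns the (possibly auto-allocated) priority. */
  err = bpf_prog_attach_opts(prog_fd, ifindex, BPF_NET_INGRESS, &opta);

  /* Detach by priority; prog_fd can be left as 0 here. */
  err = bpf_prog_detach_opts(0, ifindex, BPF_NET_INGRESS, &optd);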

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/bpf.c      | 21 +++++++++++++++++++++
 tools/lib/bpf/bpf.h      | 17 +++++++++++++++--
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 18b1e91cc469..d1e338ac9a62 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -670,6 +670,27 @@ int bpf_prog_detach2(int prog_fd, int target_fd, enum bpf_attach_type type)
 	return libbpf_err_errno(ret);
 }
 
+int bpf_prog_detach_opts(int prog_fd, int target_fd,
+			 enum bpf_attach_type type,
+			 const struct bpf_prog_detach_opts *opts)
+{
+	const size_t attr_sz = offsetofend(union bpf_attr, replace_bpf_fd);
+	union bpf_attr attr;
+	int ret;
+
+	if (!OPTS_VALID(opts, bpf_prog_detach_opts))
+		return libbpf_err(-EINVAL);
+
+	memset(&attr, 0, attr_sz);
+	attr.target_fd	   = target_fd;
+	attr.attach_bpf_fd = prog_fd;
+	attr.attach_type   = type;
+	attr.attach_priority = OPTS_GET(opts, attach_priority, 0);
+
+	ret = sys_bpf(BPF_PROG_DETACH, &attr, attr_sz);
+	return libbpf_err_errno(ret);
+}
+
 int bpf_link_create(int prog_fd, int target_fd,
 		    enum bpf_attach_type attach_type,
 		    const struct bpf_link_create_opts *opts)
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index bef7a5282188..96de58fecdbc 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -286,8 +286,11 @@ LIBBPF_API int bpf_obj_get_opts(const char *pathname,
 
 struct bpf_prog_attach_opts {
 	size_t sz; /* size of this struct for forward/backward compatibility */
-	unsigned int flags;
-	int replace_prog_fd;
+	__u32 flags;
+	union {
+		int replace_prog_fd;
+		__u32 attach_priority;
+	};
 };
 #define bpf_prog_attach_opts__last_field replace_prog_fd
 
@@ -296,9 +299,19 @@ LIBBPF_API int bpf_prog_attach(int prog_fd, int attachable_fd,
 LIBBPF_API int bpf_prog_attach_opts(int prog_fd, int attachable_fd,
 				     enum bpf_attach_type type,
 				     const struct bpf_prog_attach_opts *opts);
+
+struct bpf_prog_detach_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+	__u32 attach_priority;
+};
+#define bpf_prog_detach_opts__last_field attach_priority
+
 LIBBPF_API int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
 LIBBPF_API int bpf_prog_detach2(int prog_fd, int attachable_fd,
 				enum bpf_attach_type type);
+LIBBPF_API int bpf_prog_detach_opts(int prog_fd, int target_fd,
+				    enum bpf_attach_type type,
+				    const struct bpf_prog_detach_opts *opts);
 
 union bpf_iter_link_info; /* defined in up-to-date linux/bpf.h */
 struct bpf_link_create_opts {
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index c1d6aa7c82b6..0c94b4862ebb 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -377,4 +377,5 @@ LIBBPF_1.1.0 {
 		user_ring_buffer__reserve;
 		user_ring_buffer__reserve_blocking;
 		user_ring_buffer__submit;
+		bpf_prog_detach_opts;
 } LIBBPF_1.0.0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH bpf-next 08/10] libbpf: Add support for BPF tc link
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
                   ` (6 preceding siblings ...)
  2022-10-04 23:11 ` [PATCH bpf-next 07/10] libbpf: Add extended attach/detach opts Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-04 23:11 ` [PATCH bpf-next 09/10] bpftool: Add support for tc fd-based attach types Daniel Borkmann
  2022-10-04 23:11 ` [PATCH bpf-next 10/10] bpf, selftests: Add various BPF tc link selftests Daniel Borkmann
  9 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

Implement tc BPF link support for libbpf. The bpf_program__attach_fd()
API has been refactored slightly in order to pass bpf_link_create_opts.
A new bpf_program__attach_tc() API has been added on top of this which
allows passing ifindex and priority parameters.

The new section names tc/ingress and tc/egress map to BPF_NET_INGRESS
and BPF_NET_EGRESS, respectively.
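
A rough end-to-end sketch (illustrative only; program/skeleton names and the
ifindex variable are placeholders, error handling trimmed):

  /* BPF side: */
  SEC("tc/ingress")
  int tc_handler(struct __sk_buff *skb)
  {
          return TC_NEXT;
  }

  /* Loader side, attaching to the given ifindex with auto-allocated prio: */
  struct bpf_link *link;

  link = bpf_program__attach_tc(skel->progs.tc_handler, ifindex, /* prio */ 0);
  if (!link)
          return -errno;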

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/bpf.c      |  4 ++++
 tools/lib/bpf/bpf.h      |  3 +++
 tools/lib/bpf/libbpf.c   | 31 ++++++++++++++++++++++++++-----
 tools/lib/bpf/libbpf.h   |  2 ++
 tools/lib/bpf/libbpf.map |  1 +
 5 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index d1e338ac9a62..f73fdecbb5f8 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -752,6 +752,10 @@ int bpf_link_create(int prog_fd, int target_fd,
 		if (!OPTS_ZEROED(opts, tracing))
 			return libbpf_err(-EINVAL);
 		break;
+	case BPF_NET_INGRESS:
+	case BPF_NET_EGRESS:
+		attr.link_create.tc.priority = OPTS_GET(opts, tc.priority, 0);
+		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 96de58fecdbc..937583421327 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -334,6 +334,9 @@ struct bpf_link_create_opts {
 		struct {
 			__u64 cookie;
 		} tracing;
+		struct {
+			__u32 priority;
+		} tc;
 	};
 	size_t :0;
 };
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 184ce1684dcd..6eb33e4324ad 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8474,6 +8474,8 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("kretsyscall+",		KPROBE, 0, SEC_NONE, attach_ksyscall),
 	SEC_DEF("usdt+",		KPROBE,	0, SEC_NONE, attach_usdt),
 	SEC_DEF("tc",			SCHED_CLS, 0, SEC_NONE),
+	SEC_DEF("tc/ingress",		SCHED_CLS, BPF_NET_INGRESS, SEC_ATTACHABLE_OPT),
+	SEC_DEF("tc/egress",		SCHED_CLS, BPF_NET_EGRESS, SEC_ATTACHABLE_OPT),
 	SEC_DEF("classifier",		SCHED_CLS, 0, SEC_NONE),
 	SEC_DEF("action",		SCHED_ACT, 0, SEC_NONE),
 	SEC_DEF("tracepoint+",		TRACEPOINT, 0, SEC_NONE, attach_tp),
@@ -11238,11 +11240,10 @@ static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_li
 }
 
 static struct bpf_link *
-bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
-		       const char *target_name)
+bpf_program__attach_fd_opts(const struct bpf_program *prog,
+			    const struct bpf_link_create_opts *opts,
+			    int target_fd, const char *target_name)
 {
-	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
-			    .target_btf_id = btf_id);
 	enum bpf_attach_type attach_type;
 	char errmsg[STRERR_BUFSIZE];
 	struct bpf_link *link;
@@ -11260,7 +11261,7 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
 	link->detach = &bpf_link__detach_fd;
 
 	attach_type = bpf_program__expected_attach_type(prog);
-	link_fd = bpf_link_create(prog_fd, target_fd, attach_type, &opts);
+	link_fd = bpf_link_create(prog_fd, target_fd, attach_type, opts);
 	if (link_fd < 0) {
 		link_fd = -errno;
 		free(link);
@@ -11273,6 +11274,16 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
 	return link;
 }
 
+static struct bpf_link *
+bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
+		       const char *target_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
+			    .target_btf_id = btf_id);
+
+	return bpf_program__attach_fd_opts(prog, &opts, target_fd, target_name);
+}
+
 struct bpf_link *
 bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd)
 {
@@ -11291,6 +11302,16 @@ struct bpf_link *bpf_program__attach_xdp(const struct bpf_program *prog, int ifi
 	return bpf_program__attach_fd(prog, ifindex, 0, "xdp");
 }
 
+struct bpf_link *bpf_program__attach_tc(const struct bpf_program *prog,
+					int ifindex, __u32 priority)
+{
+	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
+			    .tc.priority = priority);
+
+	/* target_fd/target_ifindex use the same field in LINK_CREATE */
+	return bpf_program__attach_fd_opts(prog, &opts, ifindex, "tc");
+}
+
 struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
 					      int target_fd,
 					      const char *attach_func_name)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index eee883f007f9..7e64cec9a1ba 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -645,6 +645,8 @@ bpf_program__attach_netns(const struct bpf_program *prog, int netns_fd);
 LIBBPF_API struct bpf_link *
 bpf_program__attach_xdp(const struct bpf_program *prog, int ifindex);
 LIBBPF_API struct bpf_link *
+bpf_program__attach_tc(const struct bpf_program *prog, int ifindex, __u32 priority);
+LIBBPF_API struct bpf_link *
 bpf_program__attach_freplace(const struct bpf_program *prog,
 			     int target_fd, const char *attach_func_name);
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 0c94b4862ebb..473ed71829c6 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -378,4 +378,5 @@ LIBBPF_1.1.0 {
 		user_ring_buffer__reserve_blocking;
 		user_ring_buffer__submit;
 		bpf_prog_detach_opts;
+		bpf_program__attach_tc;
 } LIBBPF_1.0.0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH bpf-next 09/10] bpftool: Add support for tc fd-based attach types
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
                   ` (7 preceding siblings ...)
  2022-10-04 23:11 ` [PATCH bpf-next 08/10] libbpf: Add support for BPF tc link Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-04 23:11 ` [PATCH bpf-next 10/10] bpf, selftests: Add various BPF tc link selftests Daniel Borkmann
  9 siblings, 0 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

Add support for dumping fd-based attach types via bpftool. This includes
both tc BPF link and attach ops programs. The dumped information contains
the attach location, function entry name, program ID, link ID when
applicable, as well as the attach priority.

Example with tc BPF link:

  # ./bpftool net
  xdp:

  tc:
  lo(1) bpf/ingress tc_handler_in id 189 link 40 prio 1
  lo(1) bpf/egress tc_handler_eg id 190 link 39 prio 1

  flow_dissector:

Example with tc BPF attach ops and also one instance of old-style cls_bpf:

  # ./bpftool net
  xdp:

  tc:
  lo(1) bpf/ingress tc_handler_in id 201 prio 1
  lo(1) clsact/ingress tc_handler_old:[203] id 203

  flow_dissector:

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/bpftool/net.c | 76 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 72 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index 526a332c48e6..06658978b092 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -74,6 +74,11 @@ static const char * const attach_type_strings[] = {
 	[NET_ATTACH_TYPE_XDP_OFFLOAD]	= "xdpoffload",
 };
 
+static const char * const attach_loc_strings[] = {
+	[BPF_NET_INGRESS]		= "bpf/ingress",
+	[BPF_NET_EGRESS]		= "bpf/egress",
+};
+
 const size_t net_attach_type_size = ARRAY_SIZE(attach_type_strings);
 
 static enum net_attach_type parse_attach_type(const char *str)
@@ -420,8 +425,69 @@ static int dump_filter_nlmsg(void *cookie, void *msg, struct nlattr **tb)
 			      filter_info->devname, filter_info->ifindex);
 }
 
-static int show_dev_tc_bpf(int sock, unsigned int nl_pid,
-			   struct ip_devname_ifindex *dev)
+static int __show_dev_tc_bpf_name(__u32 id, char *name, size_t len)
+{
+	struct bpf_prog_info info = {};
+	__u32 ilen = sizeof(info);
+	int fd, ret;
+
+	fd = bpf_prog_get_fd_by_id(id);
+	if (fd < 0)
+		return fd;
+	ret = bpf_obj_get_info_by_fd(fd, &info, &ilen);
+	if (ret < 0)
+		goto out;
+	ret = -ENOENT;
+	if (info.name) {
+		get_prog_full_name(&info, fd, name, len);
+		ret = 0;
+	}
+out:
+	close(fd);
+	return ret;
+}
+
+static void __show_dev_tc_bpf(const struct ip_devname_ifindex *dev,
+			      const enum bpf_attach_type loc)
+{
+	__u32 i, prog_cnt, attach_flags = 0;
+	char prog_name[MAX_PROG_FULL_NAME];
+	struct bpf_query_info progs[64];
+	int ret;
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	ret = bpf_prog_query(dev->ifindex, loc, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (ret)
+		return;
+	for (i = 0; i < prog_cnt; i++) {
+		NET_START_OBJECT;
+		NET_DUMP_STR("devname", "%s", dev->devname);
+		NET_DUMP_UINT("ifindex", "(%u)", dev->ifindex);
+		NET_DUMP_STR("kind", " %s", attach_loc_strings[loc]);
+		ret = __show_dev_tc_bpf_name(progs[i].prog_id,
+					     prog_name,
+					     sizeof(prog_name));
+		if (!ret)
+			NET_DUMP_STR("name", " %s", prog_name);
+		NET_DUMP_UINT("id", " id %u", progs[i].prog_id);
+		if (progs[i].link_id)
+			NET_DUMP_UINT("link", " link %u",
+				      progs[i].link_id);
+		NET_DUMP_UINT("prio", " prio %u", progs[i].prio);
+		NET_END_OBJECT_FINAL;
+	}
+}
+
+static void show_dev_tc_bpf(struct ip_devname_ifindex *dev)
+{
+	__show_dev_tc_bpf(dev, BPF_NET_INGRESS);
+	__show_dev_tc_bpf(dev, BPF_NET_EGRESS);
+}
+
+static int show_dev_tc_bpf_legacy(int sock, unsigned int nl_pid,
+				  struct ip_devname_ifindex *dev)
 {
 	struct bpf_filter_t filter_info;
 	struct bpf_tcinfo_t tcinfo;
@@ -686,8 +752,10 @@ static int do_show(int argc, char **argv)
 	if (!ret) {
 		NET_START_ARRAY("tc", "%s:\n");
 		for (i = 0; i < dev_array.used_len; i++) {
-			ret = show_dev_tc_bpf(sock, nl_pid,
-					      &dev_array.devices[i]);
+			show_dev_tc_bpf(&dev_array.devices[i]);
+
+			ret = show_dev_tc_bpf_legacy(sock, nl_pid,
+						     &dev_array.devices[i]);
 			if (ret)
 				break;
 		}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH bpf-next 10/10] bpf, selftests: Add various BPF tc link selftests
  2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
                   ` (8 preceding siblings ...)
  2022-10-04 23:11 ` [PATCH bpf-next 09/10] bpftool: Add support for tc fd-based attach types Daniel Borkmann
@ 2022-10-04 23:11 ` Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
  9 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-04 23:11 UTC (permalink / raw)
  To: bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, toke, joe, netdev, Daniel Borkmann

Add a big batch of selftests to extend test_progs with various tc link,
attach ops and old-style tc BPF attachment tests via libbpf APIs. Also test
multi-program attachments, including mixing the various attach options:

  # ./test_progs -t tc_link
  #179     tc_link_base:OK
  #180     tc_link_detach:OK
  #181     tc_link_mix:OK
  #182     tc_link_opts:OK
  #183     tc_link_run_base:OK
  #184     tc_link_run_chain:OK
  Summary: 6/0 PASSED, 0 SKIPPED, 0 FAILED

All new and existing test cases pass.

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 .../selftests/bpf/prog_tests/tc_link.c        | 756 ++++++++++++++++++
 .../selftests/bpf/progs/test_tc_link.c        |  43 +
 2 files changed, 799 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_link.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_link.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tc_link.c b/tools/testing/selftests/bpf/prog_tests/tc_link.c
new file mode 100644
index 000000000000..2dfd2874bbdd
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tc_link.c
@@ -0,0 +1,756 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Isovalent */
+
+#include <uapi/linux/if_link.h>
+#include <test_progs.h>
+
+#include "test_tc_link.skel.h"
+
+#define loopback	1
+#define ping_cmd	"ping -q -c1 -w1 127.0.0.1 > /dev/null"
+
+void serial_test_tc_link_base(void)
+{
+	struct test_tc_link *skel1 = NULL, *skel2 = NULL;
+	__u32 prog_fd1, prog_fd2, prog_fd3, prog_fd4;
+	__u32 id0 = 0, id1, id2, id3, id4, id5, id6, id7;
+	struct bpf_prog_info prog_info;
+	struct bpf_link_info link_info;
+	__u32 link_info_len = sizeof(link_info);
+	__u32 prog_info_len = sizeof(prog_info);
+	__u32 prog_cnt, attach_flags = 0;
+	struct bpf_query_info progs[4];
+	struct bpf_link *link;
+	int err;
+
+	skel1 = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel1, "skel_load"))
+		goto cleanup;
+	prog_fd1 = bpf_program__fd(skel1->progs.tc_handler_in);
+	prog_fd2 = bpf_program__fd(skel1->progs.tc_handler_eg);
+
+	skel2 = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel2, "skel_load"))
+		goto cleanup;
+	prog_fd3 = bpf_program__fd(skel2->progs.tc_handler_in);
+	prog_fd4 = bpf_program__fd(skel2->progs.tc_handler_eg);
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd1, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info1"))
+		goto cleanup;
+	id1 = prog_info.id;
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd2, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info2"))
+		goto cleanup;
+	id2 = prog_info.id;
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd3, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info3"))
+		goto cleanup;
+	id3 = prog_info.id;
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd4, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info4"))
+		goto cleanup;
+	id4 = prog_info.id;
+
+	/* Sanity check that we have distinct programs. */
+	ASSERT_NEQ(id1, id3, "prog_ids_1_3");
+	ASSERT_NEQ(id2, id4, "prog_ids_2_4");
+	ASSERT_NEQ(id1, id4, "prog_ids_1_4");
+
+	link = bpf_program__attach_tc(skel1->progs.tc_handler_in, loopback, 1);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel1->links.tc_handler_in = link;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info"))
+		goto cleanup;
+
+	/* Sanity check that attached ingress BPF link looks as expected. */
+	ASSERT_EQ(link_info.type, BPF_LINK_TYPE_TC, "link_type");
+	ASSERT_EQ(link_info.prog_id, id1, "link_prog_id");
+	ASSERT_EQ(link_info.tc.ifindex, loopback, "link_ifindex");
+	ASSERT_EQ(link_info.tc.attach_type, BPF_NET_INGRESS, "link_attach_type");
+	ASSERT_EQ(link_info.tc.priority, 1, "link_priority");
+	ASSERT_NEQ(link_info.id, id0, "link_id");
+	id5 = link_info.id;
+
+	/* Updating program under active ingress BPF link works as expected. */
+	err = bpf_link__update_program(link, skel2->progs.tc_handler_in);
+	if (!ASSERT_OK(err, "link_upd_invalid"))
+		goto cleanup;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info"))
+		goto cleanup;
+
+	ASSERT_EQ(link_info.id, id5, "link_id");
+	ASSERT_EQ(link_info.prog_id, id3, "link_prog_id");
+
+	link = bpf_program__attach_tc(skel1->progs.tc_handler_eg, loopback, 1);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel1->links.tc_handler_eg = link;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info"))
+		goto cleanup;
+
+	/* Sanity check that attached egress BPF link looks as expected. */
+	ASSERT_EQ(link_info.type, BPF_LINK_TYPE_TC, "link_type");
+	ASSERT_EQ(link_info.prog_id, id2, "link_prog_id");
+	ASSERT_EQ(link_info.tc.ifindex, loopback, "link_ifindex");
+	ASSERT_EQ(link_info.tc.attach_type, BPF_NET_EGRESS, "link_attach_type");
+	ASSERT_EQ(link_info.tc.priority, 1, "link_priority");
+	ASSERT_NEQ(link_info.id, id0, "link_id");
+	ASSERT_NEQ(link_info.id, id5, "link_id");
+	id6 = link_info.id;
+
+	/* Updating program under active egress BPF link works as expected. */
+	err = bpf_link__update_program(link, skel2->progs.tc_handler_eg);
+	if (!ASSERT_OK(err, "link_upd_invalid"))
+		goto cleanup;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info"))
+		goto cleanup;
+
+	ASSERT_EQ(link_info.id, id6, "link_id");
+	ASSERT_EQ(link_info.prog_id, id4, "link_prog_id");
+
+	/* BPF link is not allowed to replace another BPF link. */
+	link = bpf_program__attach_tc(skel2->progs.tc_handler_eg, loopback, 1);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+
+	/* BPF link can be attached with different prio to available slot however. */
+	link = bpf_program__attach_tc(skel2->progs.tc_handler_eg, loopback, 2);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+
+	/* Sanity check that 2nd attached egress BPF link looks as expected. */
+	ASSERT_EQ(link_info.type, BPF_LINK_TYPE_TC, "link_type");
+	ASSERT_EQ(link_info.prog_id, id4, "link_prog_id");
+	ASSERT_EQ(link_info.tc.ifindex, loopback, "link_ifindex");
+	ASSERT_EQ(link_info.tc.attach_type, BPF_NET_EGRESS, "link_attach_type");
+	ASSERT_EQ(link_info.tc.priority, 2, "link_priority");
+	ASSERT_NEQ(link_info.id, id6, "link_id");
+
+	/* We destroy link, and reattach with auto-allocated prio. */
+	bpf_link__destroy(link);
+
+	link = bpf_program__attach_tc(skel2->progs.tc_handler_eg, loopback, 0);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info"))
+		goto cleanup_link;
+
+	/* Sanity check that egress BPF link looks as expected and got prio 2. */
+	ASSERT_EQ(link_info.type, BPF_LINK_TYPE_TC, "link_type");
+	ASSERT_EQ(link_info.prog_id, id4, "link_prog_id");
+	ASSERT_EQ(link_info.tc.ifindex, loopback, "link_ifindex");
+	ASSERT_EQ(link_info.tc.attach_type, BPF_NET_EGRESS, "link_attach_type");
+	ASSERT_EQ(link_info.tc.priority, 2, "link_priority");
+	ASSERT_NEQ(link_info.id, id6, "link_id");
+	id7 = link_info.id;
+
+	/* Sanity check query API on what progs we have attached. */
+	prog_cnt = 0;
+	err = bpf_prog_query(loopback, BPF_NET_EGRESS, 0, &attach_flags,
+			     NULL, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_link;
+
+	ASSERT_EQ(prog_cnt, 2, "prog_cnt");
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_EGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_link;
+
+	ASSERT_EQ(prog_cnt, 2, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id4, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, id6, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 1, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, id4, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, id7, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 2, "prog[1]_prio");
+	ASSERT_EQ(progs[2].prog_id, 0, "prog[2]_id");
+	ASSERT_EQ(progs[2].link_id, 0, "prog[2]_link");
+	ASSERT_EQ(progs[2].prio, 0, "prog[2]_prio");
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_link;
+
+	ASSERT_EQ(prog_cnt, 1, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id3, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, id5, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 1, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, 0, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, 0, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 0, "prog[1]_prio");
+
+cleanup_link:
+	bpf_link__destroy(link);
+cleanup:
+	test_tc_link__destroy(skel1);
+	test_tc_link__destroy(skel2);
+}
+
+void serial_test_tc_link_detach(void)
+{
+	struct bpf_prog_info prog_info;
+	struct bpf_link_info link_info;
+	struct test_tc_link *skel;
+	__u32 prog_info_len = sizeof(prog_info);
+	__u32 link_info_len = sizeof(link_info);
+	__u32 prog_cnt, attach_flags = 0;
+	__u32 prog_fd, id, id2;
+	struct bpf_link *link;
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+	prog_fd = bpf_program__fd(skel->progs.tc_handler_in);
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info"))
+		goto cleanup;
+	id = prog_info.id;
+
+	link = bpf_program__attach_tc(skel->progs.tc_handler_in, loopback, 0);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc_handler_in = link;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info"))
+		goto cleanup;
+
+	/* Sanity check that attached ingress BPF link looks as expected. */
+	ASSERT_EQ(link_info.type, BPF_LINK_TYPE_TC, "link_type");
+	ASSERT_EQ(link_info.prog_id, id, "link_prog_id");
+	ASSERT_EQ(link_info.tc.ifindex, loopback, "link_ifindex");
+	ASSERT_EQ(link_info.tc.attach_type, BPF_NET_INGRESS, "link_attach_type");
+	ASSERT_EQ(link_info.tc.priority, 1, "link_priority");
+	id2 = link_info.id;
+
+	/* Sanity check query API that one prog is attached. */
+	prog_cnt = 0;
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     NULL, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(prog_cnt, 1, "prog_cnt");
+
+	err = bpf_link__detach(link);
+	if (!ASSERT_OK(err, "link_detach"))
+		goto cleanup;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info"))
+		goto cleanup;
+
+	/* Sanity check that defunct detached link looks as expected. */
+	ASSERT_EQ(link_info.type, BPF_LINK_TYPE_TC, "link_type");
+	ASSERT_EQ(link_info.prog_id, id, "link_prog_id");
+	ASSERT_EQ(link_info.tc.ifindex, 0, "link_ifindex");
+	ASSERT_EQ(link_info.tc.attach_type, BPF_NET_INGRESS, "link_attach_type");
+	ASSERT_EQ(link_info.tc.priority, 1, "link_priority");
+	ASSERT_EQ(link_info.id, id2, "link_id");
+
+	/* Sanity check query API that no prog is attached. */
+	prog_cnt = 0;
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     NULL, &prog_cnt);
+	ASSERT_EQ(err, -ENOENT, "prog_cnt");
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+void serial_test_tc_link_opts(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	DECLARE_LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	__u32 prog_fd1, prog_fd2, id1, id2;
+	struct bpf_prog_info prog_info;
+	struct test_tc_link *skel;
+	__u32 prog_info_len = sizeof(prog_info);
+	__u32 prog_cnt, attach_flags = 0;
+	struct bpf_query_info progs[4];
+	int err, prio;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+	prog_fd1 = bpf_program__fd(skel->progs.tc_handler_in);
+	prog_fd2 = bpf_program__fd(skel->progs.tc_handler_eg);
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd1, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info1"))
+		goto cleanup;
+	id1 = prog_info.id;
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd2, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info2"))
+		goto cleanup;
+	id2 = prog_info.id;
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+
+	/* Sanity check query API that nothing is attached. */
+	prog_cnt = 0;
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     NULL, &prog_cnt);
+	ASSERT_EQ(prog_cnt, 0, "prog_cnt");
+	ASSERT_EQ(err, -ENOENT, "prog_query");
+
+	prog_cnt = 0;
+	err = bpf_prog_query(loopback, BPF_NET_EGRESS, 0, &attach_flags,
+			     NULL, &prog_cnt);
+	ASSERT_EQ(prog_cnt, 0, "prog_cnt");
+	ASSERT_EQ(err, -ENOENT, "prog_query");
+
+	/* Sanity check that attaching with given prio works. */
+	opta.flags = 0;
+	opta.attach_priority = prio = 1;
+	err = bpf_prog_attach_opts(prog_fd1, loopback, BPF_NET_INGRESS, &opta);
+	if (!ASSERT_EQ(err, opta.attach_priority, "prog_attach"))
+		goto cleanup;
+
+	prog_cnt = 0;
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     NULL, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach;
+
+	ASSERT_EQ(prog_cnt, 1, "prog_cnt");
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach;
+
+	ASSERT_EQ(prog_cnt, 1, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id1, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, 0, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 1, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, 0, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, 0, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 0, "prog[1]_prio");
+
+	/* We cannot override unless we add replace flag. */
+	opta.flags = 0;
+	opta.attach_priority = 1;
+	err = bpf_prog_attach_opts(prog_fd2, loopback, BPF_NET_INGRESS, &opta);
+	if (!ASSERT_ERR(err, "prog_attach_fail"))
+		goto cleanup_detach;
+
+	opta.flags = BPF_F_REPLACE;
+	opta.attach_priority = 1;
+	err = bpf_prog_attach_opts(prog_fd2, loopback, BPF_NET_INGRESS, &opta);
+	if (!ASSERT_EQ(err, opta.attach_priority, "prog_replace"))
+		goto cleanup_detach;
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach;
+
+	ASSERT_EQ(prog_cnt, 1, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id2, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, 0, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 1, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, 0, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, 0, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 0, "prog[1]_prio");
+
+	/* Check auto-assignment for priority. */
+	opta.flags = 0;
+	opta.attach_priority = 0;
+	err = bpf_prog_attach_opts(prog_fd1, loopback, BPF_NET_INGRESS, &opta);
+	if (!ASSERT_EQ(err, 2, "prog_replace"))
+		goto cleanup_detach;
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach2;
+
+	ASSERT_EQ(prog_cnt, 2, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id2, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, 0, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 1, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, id1, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, 0, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 2, "prog[1]_prio");
+	ASSERT_EQ(progs[2].prog_id, 0, "prog[2]_id");
+	ASSERT_EQ(progs[2].link_id, 0, "prog[2]_link");
+	ASSERT_EQ(progs[2].prio, 0, "prog[2]_prio");
+
+	/* Remove the 1st program, so the 2nd becomes 1st in line. */
+	prio = 2;
+	optd.attach_priority = 1;
+	err = bpf_prog_detach_opts(0, loopback, BPF_NET_INGRESS, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup_detach;
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach;
+
+	ASSERT_EQ(prog_cnt, 1, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id1, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, 0, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 2, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, 0, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, 0, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 0, "prog[1]_prio");
+
+	/* Add back higher prio program, so 1st becomes 2nd in line.
+	 * Replace also works if nothing was attached at the given prio.
+	 */
+	opta.flags = BPF_F_REPLACE;
+	opta.attach_priority = 1;
+	err = bpf_prog_attach_opts(prog_fd2, loopback, BPF_NET_INGRESS, &opta);
+	if (!ASSERT_EQ(err, opta.attach_priority, "prog_replace"))
+		goto cleanup_detach;
+
+	prio = 1;
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach2;
+
+	ASSERT_EQ(prog_cnt, 2, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id2, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, 0, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 1, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, id1, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, 0, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 2, "prog[1]_prio");
+	ASSERT_EQ(progs[2].prog_id, 0, "prog[2]_id");
+	ASSERT_EQ(progs[2].link_id, 0, "prog[2]_link");
+	ASSERT_EQ(progs[2].prio, 0, "prog[2]_prio");
+
+	optd.attach_priority = 2;
+	err = bpf_prog_detach_opts(0, loopback, BPF_NET_INGRESS, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	optd.attach_priority = 1;
+	err = bpf_prog_detach_opts(0, loopback, BPF_NET_INGRESS, &optd);
+	ASSERT_OK(err, "prog_detach");
+
+	/* Expected to be empty again. */
+	prog_cnt = 0;
+	err = bpf_prog_query(loopback, BPF_NET_INGRESS, 0, &attach_flags,
+			     NULL, &prog_cnt);
+	ASSERT_EQ(prog_cnt, 0, "prog_cnt");
+	ASSERT_EQ(err, -ENOENT, "prog_query");
+	goto cleanup;
+
+cleanup_detach:
+	optd.attach_priority = prio;
+	err = bpf_prog_detach_opts(0, loopback, BPF_NET_INGRESS, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup;
+cleanup:
+	test_tc_link__destroy(skel);
+	return;
+cleanup_detach2:
+	optd.attach_priority = 2;
+	err = bpf_prog_detach_opts(0, loopback, BPF_NET_INGRESS, &optd);
+	ASSERT_OK(err, "prog_detach");
+	goto cleanup_detach;
+}
+
+void serial_test_tc_link_mix(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	DECLARE_LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	__u32 prog_fd1, prog_fd2, id1, id2, id3;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	struct bpf_prog_info prog_info;
+	struct bpf_link_info link_info;
+	__u32 link_info_len = sizeof(link_info);
+	__u32 prog_info_len = sizeof(prog_info);
+	__u32 prog_cnt, attach_flags = 0;
+	struct bpf_query_info progs[4];
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+	prog_fd1 = bpf_program__fd(skel->progs.tc_handler_in);
+	prog_fd2 = bpf_program__fd(skel->progs.tc_handler_eg);
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd1, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info1"))
+		goto cleanup;
+	id1 = prog_info.id;
+
+	memset(&prog_info, 0, sizeof(prog_info));
+	err = bpf_obj_get_info_by_fd(prog_fd2, &prog_info, &prog_info_len);
+	if (!ASSERT_OK(err, "fd_info2"))
+		goto cleanup;
+	id2 = prog_info.id;
+
+	ASSERT_NEQ(id1, id2, "prog_ids_1_2");
+
+	/* Sanity check that attaching with given prio works. */
+	opta.flags = 0;
+	opta.attach_priority = 42;
+	err = bpf_prog_attach_opts(prog_fd1, loopback, BPF_NET_EGRESS, &opta);
+	if (!ASSERT_EQ(err, opta.attach_priority, "prog_attach"))
+		goto cleanup;
+
+	prog_cnt = 0;
+	err = bpf_prog_query(loopback, BPF_NET_EGRESS, 0, &attach_flags,
+			     NULL, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach;
+
+	ASSERT_EQ(prog_cnt, 1, "prog_cnt");
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_EGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach;
+
+	ASSERT_EQ(prog_cnt, 1, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id1, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, 0, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 42, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, 0, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, 0, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 0, "prog[1]_prio");
+
+	/* Sanity check that attaching link with same prio will fail. */
+	link = bpf_program__attach_tc(skel->progs.tc_handler_eg, loopback, 42);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+
+	/* Different prio on unused slot works of course. */
+	link = bpf_program__attach_tc(skel->progs.tc_handler_eg, loopback, 0);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc_handler_eg = link;
+
+	memset(&link_info, 0, sizeof(link_info));
+	err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
+	if (!ASSERT_OK(err, "link_info"))
+		goto cleanup;
+
+	ASSERT_EQ(link_info.prog_id, id2, "link_prog_id");
+	id3 = link_info.id;
+
+	memset(progs, 0, sizeof(progs));
+	prog_cnt = ARRAY_SIZE(progs);
+	err = bpf_prog_query(loopback, BPF_NET_EGRESS, 0, &attach_flags,
+			     progs, &prog_cnt);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_detach;
+
+	ASSERT_EQ(prog_cnt, 2, "prog_cnt");
+	ASSERT_EQ(progs[0].prog_id, id2, "prog[0]_id");
+	ASSERT_EQ(progs[0].link_id, id3, "prog[0]_link");
+	ASSERT_EQ(progs[0].prio, 1, "prog[0]_prio");
+	ASSERT_EQ(progs[1].prog_id, id1, "prog[1]_id");
+	ASSERT_EQ(progs[1].link_id, 0, "prog[1]_link");
+	ASSERT_EQ(progs[1].prio, 42, "prog[1]_prio");
+	ASSERT_EQ(progs[2].prog_id, 0, "prog[2]_id");
+	ASSERT_EQ(progs[2].link_id, 0, "prog[2]_link");
+	ASSERT_EQ(progs[2].prio, 0, "prog[2]_prio");
+
+	/* Sanity check that attaching non-link with same prio as link will fail. */
+	opta.flags = BPF_F_REPLACE;
+	opta.attach_priority = 1;
+	err = bpf_prog_attach_opts(prog_fd1, loopback, BPF_NET_EGRESS, &opta);
+	if (!ASSERT_ERR(err, "prog_attach_should_fail"))
+		goto cleanup_detach;
+
+	opta.flags = 0;
+	opta.attach_priority = 1;
+	err = bpf_prog_attach_opts(prog_fd1, loopback, BPF_NET_EGRESS, &opta);
+	if (!ASSERT_ERR(err, "prog_attach_should_fail"))
+		goto cleanup_detach;
+
+cleanup_detach:
+	optd.attach_priority = 42;
+	err = bpf_prog_detach_opts(0, loopback, BPF_NET_EGRESS, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup;
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+void serial_test_tc_link_run_base(void)
+{
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	link = bpf_program__attach_tc(skel->progs.tc_handler_eg, loopback, 0);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+	skel->links.tc_handler_eg = link;
+
+	link = bpf_program__attach_tc(skel->progs.tc_handler_in, loopback, 0);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	CHECK_FAIL(system(ping_cmd));
+	ASSERT_EQ(skel->bss->run, 3, "run32_value");
+
+	bpf_link__destroy(link);
+	skel->bss->run = 0;
+
+	CHECK_FAIL(system(ping_cmd));
+	ASSERT_EQ(skel->bss->run, 2, "run32_value");
+cleanup:
+	test_tc_link__destroy(skel);
+}
+
+void tc_link_run_chain(int location, bool chain_tc_old)
+{
+	DECLARE_LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);
+	DECLARE_LIBBPF_OPTS(bpf_tc_hook, tc_hook, .ifindex = loopback);
+	DECLARE_LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	DECLARE_LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	bool hook_created = false, tc_attached = false;
+	__u32 prog_fd1, prog_fd2, prog_fd3;
+	struct test_tc_link *skel;
+	int err;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	prog_fd1 = bpf_program__fd(skel->progs.tc_handler_in);
+	prog_fd2 = bpf_program__fd(skel->progs.tc_handler_eg);
+	prog_fd3 = bpf_program__fd(skel->progs.tc_handler_old);
+
+	if (chain_tc_old) {
+		tc_hook.attach_point = location == BPF_NET_INGRESS ?
+				       BPF_TC_INGRESS : BPF_TC_EGRESS;
+		err = bpf_tc_hook_create(&tc_hook);
+		if (err == 0)
+			hook_created = true;
+		err = err == -EEXIST ? 0 : err;
+		if (!ASSERT_OK(err, "bpf_tc_hook_create"))
+			goto cleanup;
+
+		tc_opts.prog_fd = prog_fd3;
+		err = bpf_tc_attach(&tc_hook, &tc_opts);
+		if (!ASSERT_OK(err, "bpf_tc_attach"))
+			goto cleanup;
+		tc_attached = true;
+	}
+
+	opta.flags = 0;
+	opta.attach_priority = 1;
+	err = bpf_prog_attach_opts(prog_fd1, loopback, location, &opta);
+	if (!ASSERT_EQ(err, opta.attach_priority, "prog_attach"))
+		goto cleanup;
+
+	opta.flags = 0;
+	opta.attach_priority = 2;
+	err = bpf_prog_attach_opts(prog_fd2, loopback, location, &opta);
+	if (!ASSERT_EQ(err, opta.attach_priority, "prog_attach"))
+		goto cleanup_detach;
+
+	CHECK_FAIL(system(ping_cmd));
+	ASSERT_EQ(skel->bss->run, chain_tc_old ? 7 : 3, "run32_value");
+
+	skel->bss->run = 0;
+
+	optd.attach_priority = 2;
+	err = bpf_prog_detach_opts(0, loopback, location, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup_detach;
+
+	CHECK_FAIL(system(ping_cmd));
+	ASSERT_EQ(skel->bss->run, chain_tc_old ? 5 : 1, "run32_value");
+
+cleanup_detach:
+	optd.attach_priority = 1;
+	err = bpf_prog_detach_opts(0, loopback, location, &optd);
+	if (!ASSERT_OK(err, "prog_detach"))
+		goto cleanup;
+cleanup:
+	if (tc_attached) {
+		tc_opts.flags = tc_opts.prog_fd = tc_opts.prog_id = 0;
+		err = bpf_tc_detach(&tc_hook, &tc_opts);
+		ASSERT_OK(err, "bpf_tc_detach");
+	}
+	if (hook_created) {
+		tc_hook.attach_point = BPF_TC_INGRESS | BPF_TC_EGRESS;
+		bpf_tc_hook_destroy(&tc_hook);
+	}
+	test_tc_link__destroy(skel);
+}
+
+void serial_test_tc_link_run_chain(void)
+{
+	tc_link_run_chain(BPF_NET_INGRESS, false);
+	tc_link_run_chain(BPF_NET_EGRESS, false);
+
+	tc_link_run_chain(BPF_NET_INGRESS, true);
+	tc_link_run_chain(BPF_NET_EGRESS, true);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tc_link.c b/tools/testing/selftests/bpf/progs/test_tc_link.c
new file mode 100644
index 000000000000..648e504954eb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tc_link.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Isovalent */
+#include <linux/bpf.h>
+#include <linux/pkt_cls.h>
+
+#include <bpf/bpf_helpers.h>
+
+char LICENSE[] SEC("license") = "GPL";
+
+__u32 run;
+
+SEC("tc/ingress")
+int tc_handler_in(struct __sk_buff *skb)
+{
+#ifdef ENABLE_ATOMICS_TESTS
+	__sync_fetch_and_or(&run, 1);
+#else
+	run |= 1;
+#endif
+	return TC_NEXT;
+}
+
+SEC("tc/egress")
+int tc_handler_eg(struct __sk_buff *skb)
+{
+#ifdef ENABLE_ATOMICS_TESTS
+	__sync_fetch_and_or(&run, 2);
+#else
+	run |= 2;
+#endif
+	return TC_NEXT;
+}
+
+SEC("tc/egress")
+int tc_handler_old(struct __sk_buff *skb)
+{
+#ifdef ENABLE_ATOMICS_TESTS
+	__sync_fetch_and_or(&run, 4);
+#else
+	run |= 4;
+#endif
+	return TC_NEXT;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
@ 2022-10-05  0:55   ` sdf
  2022-10-05 10:50     ` Toke Høiland-Jørgensen
  2022-10-05 12:35     ` Daniel Borkmann
  2022-10-05 10:33   ` Toke Høiland-Jørgensen
                     ` (5 subsequent siblings)
  6 siblings, 2 replies; 62+ messages in thread
From: sdf @ 2022-10-05  0:55 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On 10/05, Daniel Borkmann wrote:
> This work refactors and adds a lightweight extension to the tc BPF ingress
> and egress data path side for allowing BPF programs via an fd-based  
> attach /
> detach API. The main goal behind this work which we also presented at LPC  
> [0]
> this year is to eventually add support for BPF links for tc BPF programs  
> in
> a second step, thus this prep work is required for the latter which allows
> for a model of safe ownership and program detachment. Given the vast rise
> in tc BPF users in cloud native / Kubernetes environments, this becomes
> necessary to avoid hard to debug incidents either through stale leftover
> programs or 3rd party applications stepping on each others toes. Further
> details for BPF link rationale in next patch.

> For the current tc framework, there is no change in behavior with this  
> change
> and neither does this change touch on tc core kernel APIs. The gist of  
> this
> patch is that the ingress and egress hook gets a lightweight, qdisc-less
> extension for BPF to attach its tc BPF programs, in other words, a minimal
> tc-layer entry point for BPF. As part of the feedback from LPC, there was
> a suggestion to provide a name for this infrastructure to more easily  
> differ
> between the classic cls_bpf attachment and the fd-based API. As for most,
> the XDP vs tc layer is already the default mental model for the pkt  
> processing
> pipeline. We refactored this with an xtc internal prefix aka 'express  
> traffic
> control' in order to avoid to deviate too far (and 'express' given its  
> more
> lightweight/faster entry point).

> For the ingress and egress xtc points, the device holds a cache-friendly  
> array
> with programs. Same as with classic tc, programs are attached with a prio  
> that
> can be specified or auto-allocated through an idr, and the program return  
> code
> determines whether to continue in the pipeline or to terminate processing.
> With TC_ACT_UNSPEC code, the processing continues (as the case today).  
> The goal
> was to have maximum compatibility to existing tc BPF programs, so they  
> don't
> need to be adapted. Compatibility to call into classic tcf_classify() is  
> also
> provided in order to allow successive migration or both to cleanly  
> co-exist
> where needed given its one logical layer. The fd-based API is behind a  
> static
> key, so that when unused the code is also not entered. The struct  
> xtc_entry's
> program array is currently static, but could be made dynamic if necessary  
> at
> a point in future. Desire has also been expressed for future work to adapt
> similar framework for XDP to allow multi-attach from in-kernel side, too.

> Tested with tc-testing selftest suite which all passes, as well as the tc  
> BPF
> tests from the BPF CI.

>    [0] https://lpc.events/event/16/contributions/1353/

> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>   MAINTAINERS                    |   4 +-
>   include/linux/bpf.h            |   1 +
>   include/linux/netdevice.h      |  14 +-
>   include/linux/skbuff.h         |   4 +-
>   include/net/sch_generic.h      |   2 +-
>   include/net/xtc.h              | 181 ++++++++++++++++++++++
>   include/uapi/linux/bpf.h       |  35 ++++-
>   kernel/bpf/Kconfig             |   1 +
>   kernel/bpf/Makefile            |   1 +
>   kernel/bpf/net.c               | 274 +++++++++++++++++++++++++++++++++
>   kernel/bpf/syscall.c           |  24 ++-
>   net/Kconfig                    |   5 +
>   net/core/dev.c                 | 262 +++++++++++++++++++------------
>   net/core/filter.c              |   4 +-
>   net/sched/Kconfig              |   4 +-
>   net/sched/sch_ingress.c        |  48 +++++-
>   tools/include/uapi/linux/bpf.h |  35 ++++-
>   17 files changed, 769 insertions(+), 130 deletions(-)
>   create mode 100644 include/net/xtc.h
>   create mode 100644 kernel/bpf/net.c

> diff --git a/MAINTAINERS b/MAINTAINERS
> index e55a4d47324c..bb63d8d000ea 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3850,13 +3850,15 @@ S:	Maintained
>   F:	kernel/trace/bpf_trace.c
>   F:	kernel/bpf/stackmap.c

> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (xtc & tc BPF, sock_addr)
>   M:	Martin KaFai Lau <martin.lau@linux.dev>
>   M:	Daniel Borkmann <daniel@iogearbox.net>
>   R:	John Fastabend <john.fastabend@gmail.com>
>   L:	bpf@vger.kernel.org
>   L:	netdev@vger.kernel.org
>   S:	Maintained
> +F:	include/net/xtc.h
> +F:	kernel/bpf/net.c
>   F:	net/core/filter.c
>   F:	net/sched/act_bpf.c
>   F:	net/sched/cls_bpf.c
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e7d46d16032..71e5f43db378 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1473,6 +1473,7 @@ struct bpf_prog_array_item {
>   	union {
>   		struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>   		u64 bpf_cookie;
> +		u32 bpf_priority;
>   	};
>   };

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eddf8ee270e7..43bbb2303e57 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1880,8 +1880,7 @@ enum netdev_ml_priv_type {
>    *
>    *	@rx_handler:		handler for received packets
>    *	@rx_handler_data: 	XXX: need comments on this one
> - *	@miniq_ingress:		ingress/clsact qdisc specific data for
> - *				ingress processing
> + *	@xtc_ingress:		BPF/clsact qdisc specific data for ingress processing
>    *	@ingress_queue:		XXX: need comments on this one
>    *	@nf_hooks_ingress:	netfilter hooks executed for ingress packets
>    *	@broadcast:		hw bcast address
> @@ -1902,8 +1901,7 @@ enum netdev_ml_priv_type {
>    *	@xps_maps:		all CPUs/RXQs maps for XPS device
>    *
>    *	@xps_maps:	XXX: need comments on this one
> - *	@miniq_egress:		clsact qdisc specific data for
> - *				egress processing
> + *	@xtc_egress:		BPF/clsact qdisc specific data for egress processing
>    *	@nf_hooks_egress:	netfilter hooks executed for egress packets
>    *	@qdisc_hash:		qdisc hash table
>    *	@watchdog_timeo:	Represents the timeout that is used by
> @@ -2191,8 +2189,8 @@ struct net_device {
>   	rx_handler_func_t __rcu	*rx_handler;
>   	void __rcu		*rx_handler_data;

> -#ifdef CONFIG_NET_CLS_ACT
> -	struct mini_Qdisc __rcu	*miniq_ingress;
> +#ifdef CONFIG_NET_XGRESS
> +	struct xtc_entry __rcu	*xtc_ingress;
>   #endif
>   	struct netdev_queue __rcu *ingress_queue;
>   #ifdef CONFIG_NETFILTER_INGRESS
> @@ -2220,8 +2218,8 @@ struct net_device {
>   #ifdef CONFIG_XPS
>   	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
>   #endif
> -#ifdef CONFIG_NET_CLS_ACT
> -	struct mini_Qdisc __rcu	*miniq_egress;
> +#ifdef CONFIG_NET_XGRESS
> +	struct xtc_entry __rcu *xtc_egress;
>   #endif
>   #ifdef CONFIG_NETFILTER_EGRESS
>   	struct nf_hook_entries __rcu *nf_hooks_egress;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 9fcf534f2d92..a9ff7a1996e9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -955,7 +955,7 @@ struct sk_buff {
>   	__u8			csum_level:2;
>   	__u8			dst_pending_confirm:1;
>   	__u8			mono_delivery_time:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>   	__u8			tc_skip_classify:1;
>   	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
>   #endif
> @@ -983,7 +983,7 @@ struct sk_buff {
>   	__u8			slow_gro:1;
>   	__u8			csum_not_inet:1;

> -#ifdef CONFIG_NET_SCHED
> +#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>   	__u16			tc_index;	/* traffic control index */
>   #endif

> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index d5517719af4e..bc5c1da2d30f 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -693,7 +693,7 @@ int skb_do_redirect(struct sk_buff *);

>   static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
>   {
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>   	return skb->tc_at_ingress;
>   #else
>   	return false;
> diff --git a/include/net/xtc.h b/include/net/xtc.h
> new file mode 100644
> index 000000000000..627dc18aa433
> --- /dev/null
> +++ b/include/net/xtc.h
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2022 Isovalent */
> +#ifndef __NET_XTC_H
> +#define __NET_XTC_H
> +
> +#include <linux/idr.h>
> +#include <linux/bpf.h>
> +
> +#include <net/sch_generic.h>
> +
> +#define XTC_MAX_ENTRIES 30
> +/* Adds 1 NULL entry. */
> +#define XTC_MAX	(XTC_MAX_ENTRIES + 1)
> +
> +struct xtc_entry {
> +	struct bpf_prog_array_item items[XTC_MAX] ____cacheline_aligned;
> +	struct xtc_entry_pair *parent;
> +};
> +
> +struct mini_Qdisc;
> +
> +struct xtc_entry_pair {
> +	struct rcu_head		rcu;
> +	struct idr		idr;
> +	struct mini_Qdisc	*miniq;
> +	struct xtc_entry	a;
> +	struct xtc_entry	b;
> +};
> +
> +static inline void xtc_set_ingress(struct sk_buff *skb, bool ingress)
> +{
> +#ifdef CONFIG_NET_XGRESS
> +	skb->tc_at_ingress = ingress;
> +#endif
> +}
> +
> +#ifdef CONFIG_NET_XGRESS
> +void xtc_inc(void);
> +void xtc_dec(void);
> +
> +static inline void
> +dev_xtc_entry_update(struct net_device *dev, struct xtc_entry *entry,
> +		     bool ingress)
> +{
> +	ASSERT_RTNL();
> +	if (ingress)
> +		rcu_assign_pointer(dev->xtc_ingress, entry);
> +	else
> +		rcu_assign_pointer(dev->xtc_egress, entry);
> +	synchronize_rcu();
> +}
> +
> +static inline struct xtc_entry *dev_xtc_entry_peer(const struct  
> xtc_entry *entry)
> +{
> +	if (entry == &entry->parent->a)
> +		return &entry->parent->b;
> +	else
> +		return &entry->parent->a;
> +}
> +
> +static inline struct xtc_entry *dev_xtc_entry_create(void)
> +{
> +	struct xtc_entry_pair *pair = kzalloc(sizeof(*pair), GFP_KERNEL);
> +
> +	if (pair) {
> +		pair->a.parent = pair;
> +		pair->b.parent = pair;
> +		idr_init(&pair->idr);
> +		return &pair->a;
> +	}
> +	return NULL;
> +}
> +
> +static inline struct xtc_entry *dev_xtc_entry_fetch(struct net_device  
> *dev,
> +						    bool ingress, bool *created)
> +{
> +	struct xtc_entry *entry = ingress ?
> +		rcu_dereference_rtnl(dev->xtc_ingress) :
> +		rcu_dereference_rtnl(dev->xtc_egress);
> +
> +	*created = false;
> +	if (!entry) {
> +		entry = dev_xtc_entry_create();
> +		if (!entry)
> +			return NULL;
> +		*created = true;
> +	}
> +	return entry;
> +}
> +
> +static inline void dev_xtc_entry_clear(struct xtc_entry *entry)
> +{
> +	memset(entry->items, 0, sizeof(entry->items));
> +}
> +
> +static inline int dev_xtc_entry_prio_new(struct xtc_entry *entry, u32  
> prio,
> +					 struct bpf_prog *prog)
> +{
> +	int ret;
> +
> +	if (prio == 0)
> +		prio = 1;
> +	ret = idr_alloc_u32(&entry->parent->idr, prog, &prio, U32_MAX,
> +			    GFP_KERNEL);
> +	return ret < 0 ? ret : prio;
> +}
> +
> +static inline void dev_xtc_entry_prio_set(struct xtc_entry *entry, u32  
> prio,
> +					  struct bpf_prog *prog)
> +{
> +	idr_replace(&entry->parent->idr, prog, prio);
> +}
> +
> +static inline void dev_xtc_entry_prio_del(struct xtc_entry *entry, u32  
> prio)
> +{
> +	idr_remove(&entry->parent->idr, prio);
> +}
> +
> +static inline void dev_xtc_entry_free(struct xtc_entry *entry)
> +{
> +	idr_destroy(&entry->parent->idr);
> +	kfree_rcu(entry->parent, rcu);
> +}
> +
> +static inline u32 dev_xtc_entry_total(struct xtc_entry *entry)
> +{
> +	const struct bpf_prog_array_item *item;
> +	const struct bpf_prog *prog;
> +	u32 num = 0;
> +
> +	item = &entry->items[0];
> +	while ((prog = READ_ONCE(item->prog))) {
> +		num++;
> +		item++;
> +	}
> +	return num;
> +}
> +
> +static inline enum tc_action_base xtc_action_code(struct sk_buff *skb,  
> int code)
> +{
> +	switch (code) {
> +	case TC_PASS:
> +		skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
> +		fallthrough;
> +	case TC_DROP:
> +	case TC_REDIRECT:
> +		return code;
> +	case TC_NEXT:
> +	default:
> +		return TC_NEXT;
> +	}
> +}
> +
> +int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int xtc_prog_detach(const union bpf_attr *attr);
> +int xtc_prog_query(const union bpf_attr *attr,
> +		   union bpf_attr __user *uattr);
> +void dev_xtc_uninstall(struct net_device *dev);
> +#else
> +static inline int xtc_prog_attach(const union bpf_attr *attr,
> +				  struct bpf_prog *prog)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int xtc_prog_detach(const union bpf_attr *attr)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int xtc_prog_query(const union bpf_attr *attr,
> +				 union bpf_attr __user *uattr)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline void dev_xtc_uninstall(struct net_device *dev)
> +{
> +}
> +#endif /* CONFIG_NET_XGRESS */
> +#endif /* __NET_XTC_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
>   	BPF_PERF_EVENT,
>   	BPF_TRACE_KPROBE_MULTI,
>   	BPF_LSM_CGROUP,
> +	BPF_NET_INGRESS,
> +	BPF_NET_EGRESS,
>   	__MAX_BPF_ATTACH_TYPE
>   };

> @@ -1399,14 +1401,20 @@ union bpf_attr {
>   	};

>   	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -		__u32		target_fd;	/* container object to attach to */
> +		union {
> +			__u32	target_fd;	/* container object to attach to */
> +			__u32	target_ifindex; /* target ifindex */
> +		};
>   		__u32		attach_bpf_fd;	/* eBPF program to attach */
>   		__u32		attach_type;
>   		__u32		attach_flags;
> -		__u32		replace_bpf_fd;	/* previously attached eBPF

[..]

> +		union {
> +			__u32	attach_priority;
> +			__u32	replace_bpf_fd;	/* previously attached eBPF
>   						 * program to replace if
>   						 * BPF_F_REPLACE is used
>   						 */
> +		};

The series looks exciting, haven't had a chance to look deeply, will try
to find some time this week.

We've chatted briefly about priority during the talk, let's maybe discuss
it here more?

I, as a user, still really have no clue about what priority to use.
We have this problem at tc, and we'll seemingly have the same problem
here? I guess it's even more relevant in k8s because internally at G we
can control the users.

Is it worth at least trying to provide some default bands / guidance?

For example, having SEC('tc/ingress') receive attach_priority=124 by
default? Maybe we can even have something like 'tc/ingress_first' get
attach_priority=1 and 'tc/ingress_last' with attach_priority=254?
(the names are arbitrary, we can do something better)

ingress_first/ingress_last can be used by some monitoring jobs. The rest
can use default 124. If somebody really needs a custom priority, then they
can manually use something around 124/2 if they need to trigger before the
'default' priority or 124+124/2 if they want to trigger after?
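
To illustrate, purely as a sketch (the section names and the concrete
numbers are made up, nothing of this exists in libbpf today; TC_NEXT is
the return code from the uapi header in patch 1):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  char LICENSE[] SEC("license") = "GPL";

  SEC("tc/ingress_first")                   /* would map to attach_priority = 1   */
  int monitor_early(struct __sk_buff *skb) { return TC_NEXT; }

  SEC("tc/ingress")                         /* would map to attach_priority = 124 */
  int main_handler(struct __sk_buff *skb)  { return TC_NEXT; }

  SEC("tc/ingress_last")                    /* would map to attach_priority = 254 */
  int monitor_late(struct __sk_buff *skb)  { return TC_NEXT; }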

Thoughts? Is it worth it? Do we care?

>   	};

>   	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
>   	} info;

>   	struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -		__u32		target_fd;	/* container object to query */
> +		union {
> +			__u32	target_fd;	/* container object to query */
> +			__u32	target_ifindex; /* target ifindex */
> +		};
>   		__u32		attach_type;
>   		__u32		query_flags;
>   		__u32		attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
>   	};
>   };

> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> +	TC_NEXT		= -1,
> +	TC_PASS		= 0,
> +	TC_DROP		= 2,
> +	TC_REDIRECT	= 7,
> +};
> +
>   struct bpf_xdp_sock {
>   	__u32 queue_id;
>   };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
>   	__be32	flow_label;
>   };

> +struct bpf_query_info {
> +	__u32 prog_id;
> +	__u32 prio;
> +};
> +
>   struct bpf_func_info {
>   	__u32	insn_off;
>   	__u32	type_id;
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 2dfe1079f772..6a906ff93006 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -31,6 +31,7 @@ config BPF_SYSCALL
>   	select TASKS_TRACE_RCU
>   	select BINARY_PRINTF
>   	select NET_SOCK_MSG if NET
> +	select NET_XGRESS if NET
>   	select PAGE_POOL if NET
>   	default n
>   	help
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 341c94f208f4..76c3f9d4e2f3 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -20,6 +20,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o
>   obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
>   obj-$(CONFIG_BPF_SYSCALL) += offload.o
>   obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
> +obj-$(CONFIG_BPF_SYSCALL) += net.o
>   endif
>   ifeq ($(CONFIG_PERF_EVENTS),y)
>   obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
> new file mode 100644
> index 000000000000..ab9a9dee615b
> --- /dev/null
> +++ b/kernel/bpf/net.c
> @@ -0,0 +1,274 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/xtc.h>
> +
> +static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32  
> limit,
> +			     struct bpf_prog *nprog, u32 prio, u32 flags)
> +{
> +	struct bpf_prog_array_item *item, *tmp;
> +	struct xtc_entry *entry, *peer;
> +	struct bpf_prog *oprog;
> +	bool created;
> +	int i, j;
> +
> +	ASSERT_RTNL();
> +
> +	entry = dev_xtc_entry_fetch(dev, ingress, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +	for (i = 0; i < limit; i++) {
> +		item = &entry->items[i];
> +		oprog = item->prog;
> +		if (!oprog)
> +			break;
> +		if (item->bpf_priority == prio) {
> +			if (flags & BPF_F_REPLACE) {
> +				/* Pairs with READ_ONCE() in xtc_run_progs(). */
> +				WRITE_ONCE(item->prog, nprog);
> +				bpf_prog_put(oprog);
> +				dev_xtc_entry_prio_set(entry, prio, nprog);
> +				return prio;
> +			}
> +			return -EBUSY;
> +		}
> +	}
> +	if (dev_xtc_entry_total(entry) >= limit)
> +		return -ENOSPC;
> +	prio = dev_xtc_entry_prio_new(entry, prio, nprog);
> +	if (prio < 0) {
> +		if (created)
> +			dev_xtc_entry_free(entry);
> +		return -ENOMEM;
> +	}
> +	peer = dev_xtc_entry_peer(entry);
> +	dev_xtc_entry_clear(peer);
> +	for (i = 0, j = 0; i < limit; i++, j++) {
> +		item = &entry->items[i];
> +		tmp = &peer->items[j];
> +		oprog = item->prog;
> +		if (!oprog) {
> +			if (i == j) {
> +				tmp->prog = nprog;
> +				tmp->bpf_priority = prio;
> +			}
> +			break;
> +		} else if (item->bpf_priority < prio) {
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +		} else if (item->bpf_priority > prio) {
> +			if (i == j) {
> +				tmp->prog = nprog;
> +				tmp->bpf_priority = prio;
> +				tmp = &peer->items[++j];
> +			}
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +		}
> +	}
> +	dev_xtc_entry_update(dev, peer, ingress);
> +	if (ingress)
> +		net_inc_ingress_queue();
> +	else
> +		net_inc_egress_queue();
> +	xtc_inc();
> +	return prio;
> +}
> +
> +int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *nprog)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	bool ingress = attr->attach_type == BPF_NET_INGRESS;
> +	struct net_device *dev;
> +	int ret;
> +
> +	if (attr->attach_flags & ~BPF_F_REPLACE)
> +		return -EINVAL;
> +	rtnl_lock();
> +	dev = __dev_get_by_index(net, attr->target_ifindex);
> +	if (!dev) {
> +		rtnl_unlock();
> +		return -EINVAL;
> +	}
> +	ret = __xtc_prog_attach(dev, ingress, XTC_MAX_ENTRIES, nprog,
> +				attr->attach_priority, attr->attach_flags);
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32  
> limit,
> +			     u32 prio)
> +{
> +	struct bpf_prog_array_item *item, *tmp;
> +	struct bpf_prog *oprog, *fprog = NULL;
> +	struct xtc_entry *entry, *peer;
> +	int i, j;
> +
> +	ASSERT_RTNL();
> +
> +	entry = ingress ?
> +		rcu_dereference_rtnl(dev->xtc_ingress) :
> +		rcu_dereference_rtnl(dev->xtc_egress);
> +	if (!entry)
> +		return -ENOENT;
> +	peer = dev_xtc_entry_peer(entry);
> +	dev_xtc_entry_clear(peer);
> +	for (i = 0, j = 0; i < limit; i++) {
> +		item = &entry->items[i];
> +		tmp = &peer->items[j];
> +		oprog = item->prog;
> +		if (!oprog)
> +			break;
> +		if (item->bpf_priority != prio) {
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +			j++;
> +		} else {
> +			fprog = oprog;
> +		}
> +	}
> +	if (fprog) {
> +		dev_xtc_entry_prio_del(peer, prio);
> +		if (dev_xtc_entry_total(peer) == 0 && !entry->parent->miniq)
> +			peer = NULL;
> +		dev_xtc_entry_update(dev, peer, ingress);
> +		bpf_prog_put(fprog);
> +		if (!peer)
> +			dev_xtc_entry_free(entry);
> +		if (ingress)
> +			net_dec_ingress_queue();
> +		else
> +			net_dec_egress_queue();
> +		xtc_dec();
> +		return 0;
> +	}
> +	return -ENOENT;
> +}
> +
> +int xtc_prog_detach(const union bpf_attr *attr)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	bool ingress = attr->attach_type == BPF_NET_INGRESS;
> +	struct net_device *dev;
> +	int ret;
> +
> +	if (attr->attach_flags || !attr->attach_priority)
> +		return -EINVAL;
> +	rtnl_lock();
> +	dev = __dev_get_by_index(net, attr->target_ifindex);
> +	if (!dev) {
> +		rtnl_unlock();
> +		return -EINVAL;
> +	}
> +	ret = __xtc_prog_detach(dev, ingress, XTC_MAX_ENTRIES,
> +				attr->attach_priority);
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static void __xtc_prog_detach_all(struct net_device *dev, bool ingress,  
> u32 limit)
> +{
> +	struct bpf_prog_array_item *item;
> +	struct xtc_entry *entry;
> +	struct bpf_prog *prog;
> +	int i;
> +
> +	ASSERT_RTNL();
> +
> +	entry = ingress ?
> +		rcu_dereference_rtnl(dev->xtc_ingress) :
> +		rcu_dereference_rtnl(dev->xtc_egress);
> +	if (!entry)
> +		return;
> +	dev_xtc_entry_update(dev, NULL, ingress);
> +	for (i = 0; i < limit; i++) {
> +		item = &entry->items[i];
> +		prog = item->prog;
> +		if (!prog)
> +			break;
> +		dev_xtc_entry_prio_del(entry, item->bpf_priority);
> +		bpf_prog_put(prog);
> +		if (ingress)
> +			net_dec_ingress_queue();
> +		else
> +			net_dec_egress_queue();
> +		xtc_dec();
> +	}
> +	dev_xtc_entry_free(entry);
> +}
> +
> +void dev_xtc_uninstall(struct net_device *dev)
> +{
> +	__xtc_prog_detach_all(dev, true,  XTC_MAX_ENTRIES + 1);
> +	__xtc_prog_detach_all(dev, false, XTC_MAX_ENTRIES + 1);
> +}
> +
> +static int
> +__xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user  
> *uattr,
> +		 struct net_device *dev, bool ingress, u32 limit)
> +{
> +	struct bpf_query_info info, __user *uinfo;
> +	struct bpf_prog_array_item *item;
> +	struct xtc_entry *entry;
> +	struct bpf_prog *prog;
> +	u32 i, flags = 0, cnt;
> +	int ret = 0;
> +
> +	ASSERT_RTNL();
> +
> +	entry = ingress ?
> +		rcu_dereference_rtnl(dev->xtc_ingress) :
> +		rcu_dereference_rtnl(dev->xtc_egress);
> +	if (!entry)
> +		return -ENOENT;
> +	cnt = dev_xtc_entry_total(entry);
> +	if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
> +		return -EFAULT;
> +	if (copy_to_user(&uattr->query.prog_cnt, &cnt, sizeof(cnt)))
> +		return -EFAULT;
> +	uinfo = u64_to_user_ptr(attr->query.prog_ids);
> +	if (attr->query.prog_cnt == 0 || !uinfo || !cnt)
> +		/* return early if user requested only program count + flags */
> +		return 0;
> +	if (attr->query.prog_cnt < cnt) {
> +		cnt = attr->query.prog_cnt;
> +		ret = -ENOSPC;
> +	}
> +	for (i = 0; i < limit; i++) {
> +		item = &entry->items[i];
> +		prog = item->prog;
> +		if (!prog)
> +			break;
> +		info.prog_id = prog->aux->id;
> +		info.prio = item->bpf_priority;
> +		if (copy_to_user(uinfo + i, &info, sizeof(info)))
> +			return -EFAULT;
> +		if (i + 1 == cnt)
> +			break;
> +	}
> +	return ret;
> +}
> +
> +int xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user  
> *uattr)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	bool ingress = attr->query.attach_type == BPF_NET_INGRESS;
> +	struct net_device *dev;
> +	int ret;
> +
> +	if (attr->query.query_flags || attr->query.attach_flags)
> +		return -EINVAL;
> +	rtnl_lock();
> +	dev = __dev_get_by_index(net, attr->query.target_ifindex);
> +	if (!dev) {
> +		rtnl_unlock();
> +		return -EINVAL;
> +	}
> +	ret = __xtc_prog_query(attr, uattr, dev, ingress, XTC_MAX_ENTRIES);
> +	rtnl_unlock();
> +	return ret;
> +}
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 7b373a5e861f..a0a670b964bb 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -36,6 +36,8 @@
>   #include <linux/memcontrol.h>
>   #include <linux/trace_events.h>

> +#include <net/xtc.h>
> +
>   #define IS_FD_ARRAY(map) ((map)->map_type ==  
> BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>   			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
>   			  (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> @@ -3448,6 +3450,9 @@ attach_type_to_prog_type(enum bpf_attach_type  
> attach_type)
>   		return BPF_PROG_TYPE_XDP;
>   	case BPF_LSM_CGROUP:
>   		return BPF_PROG_TYPE_LSM;
> +	case BPF_NET_INGRESS:
> +	case BPF_NET_EGRESS:
> +		return BPF_PROG_TYPE_SCHED_CLS;
>   	default:
>   		return BPF_PROG_TYPE_UNSPEC;
>   	}

[..]

> @@ -3466,18 +3471,15 @@ static int bpf_prog_attach(const union bpf_attr  
> *attr)

>   	if (CHECK_ATTR(BPF_PROG_ATTACH))
>   		return -EINVAL;
> -
>   	if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
>   		return -EINVAL;

>   	ptype = attach_type_to_prog_type(attr->attach_type);
>   	if (ptype == BPF_PROG_TYPE_UNSPEC)
>   		return -EINVAL;
> -
>   	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
>   	if (IS_ERR(prog))
>   		return PTR_ERR(prog);
> -
>   	if (bpf_prog_attach_check_attach_type(prog, attr->attach_type)) {
>   		bpf_prog_put(prog);
>   		return -EINVAL;

This whole chunk can probably be dropped?

> @@ -3508,16 +3510,18 @@ static int bpf_prog_attach(const union bpf_attr  
> *attr)

>   		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
>   		break;
> +	case BPF_PROG_TYPE_SCHED_CLS:
> +		ret = xtc_prog_attach(attr, prog);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   	}
> -
> -	if (ret)
> +	if (ret < 0)
>   		bpf_prog_put(prog);
>   	return ret;
>   }

> -#define BPF_PROG_DETACH_LAST_FIELD attach_type
> +#define BPF_PROG_DETACH_LAST_FIELD replace_bpf_fd

>   static int bpf_prog_detach(const union bpf_attr *attr)
>   {
> @@ -3527,6 +3531,9 @@ static int bpf_prog_detach(const union bpf_attr  
> *attr)
>   		return -EINVAL;

>   	ptype = attach_type_to_prog_type(attr->attach_type);
> +	if (ptype != BPF_PROG_TYPE_SCHED_CLS &&
> +	    (attr->attach_flags || attr->replace_bpf_fd))
> +		return -EINVAL;

>   	switch (ptype) {
>   	case BPF_PROG_TYPE_SK_MSG:
> @@ -3545,6 +3552,8 @@ static int bpf_prog_detach(const union bpf_attr  
> *attr)
>   	case BPF_PROG_TYPE_SOCK_OPS:
>   	case BPF_PROG_TYPE_LSM:
>   		return cgroup_bpf_prog_detach(attr, ptype);
> +	case BPF_PROG_TYPE_SCHED_CLS:
> +		return xtc_prog_detach(attr);
>   	default:
>   		return -EINVAL;
>   	}
> @@ -3598,6 +3607,9 @@ static int bpf_prog_query(const union bpf_attr  
> *attr,
>   	case BPF_SK_MSG_VERDICT:
>   	case BPF_SK_SKB_VERDICT:
>   		return sock_map_bpf_prog_query(attr, uattr);
> +	case BPF_NET_INGRESS:
> +	case BPF_NET_EGRESS:
> +		return xtc_prog_query(attr, uattr);
>   	default:
>   		return -EINVAL;
>   	}
> diff --git a/net/Kconfig b/net/Kconfig
> index 48c33c222199..b7a9cd174464 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -52,6 +52,11 @@ config NET_INGRESS
>   config NET_EGRESS
>   	bool

> +config NET_XGRESS
> +	select NET_INGRESS
> +	select NET_EGRESS
> +	bool
> +
>   config NET_REDIRECT
>   	bool

> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa53830d0683..552b805c27dd 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -107,6 +107,7 @@
>   #include <net/pkt_cls.h>
>   #include <net/checksum.h>
>   #include <net/xfrm.h>
> +#include <net/xtc.h>
>   #include <linux/highmem.h>
>   #include <linux/init.h>
>   #include <linux/module.h>
> @@ -154,7 +155,6 @@
>   #include "dev.h"
>   #include "net-sysfs.h"

> -
>   static DEFINE_SPINLOCK(ptype_lock);
>   struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>   struct list_head ptype_all __read_mostly;	/* Taps */
> @@ -3935,69 +3935,199 @@ int dev_loopback_xmit(struct net *net, struct  
> sock *sk, struct sk_buff *skb)
>   EXPORT_SYMBOL(dev_loopback_xmit);

>   #ifdef CONFIG_NET_EGRESS
> -static struct sk_buff *
> -sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +static struct netdev_queue *
> +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> +{
> +	int qm = skb_get_queue_mapping(skb);
> +
> +	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> +}
> +
> +static bool netdev_xmit_txqueue_skipped(void)
> +{
> +	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +}
> +
> +void netdev_xmit_skip_txqueue(bool skip)
> +{
> +	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +}
> +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> +#endif /* CONFIG_NET_EGRESS */
> +
> +#ifdef CONFIG_NET_XGRESS
> +static int tc_run(struct xtc_entry *entry, struct sk_buff *skb)
>   {
> +	int ret = TC_ACT_UNSPEC;
>   #ifdef CONFIG_NET_CLS_ACT
> -	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
> -	struct tcf_result cl_res;
> +	struct mini_Qdisc *miniq = rcu_dereference_bh(entry->parent->miniq);
> +	struct tcf_result res;

>   	if (!miniq)
> -		return skb;
> +		return ret;

> -	/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
>   	tc_skb_cb(skb)->mru = 0;
>   	tc_skb_cb(skb)->post_ct = false;
> -	mini_qdisc_bstats_cpu_update(miniq, skb);

> -	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res,  
> false)) {
> +	mini_qdisc_bstats_cpu_update(miniq, skb);
> +	ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
> +	/* Only tcf related quirks below. */
> +	switch (ret) {
> +	case TC_ACT_SHOT:
> +		mini_qdisc_qstats_cpu_drop(miniq);
> +		break;
>   	case TC_ACT_OK:
>   	case TC_ACT_RECLASSIFY:
> -		skb->tc_index = TC_H_MIN(cl_res.classid);
> +		skb->tc_index = TC_H_MIN(res.classid);
>   		break;
> +	}
> +#endif /* CONFIG_NET_CLS_ACT */
> +	return ret;
> +}
> +
> +static DEFINE_STATIC_KEY_FALSE(xtc_needed_key);
> +
> +void xtc_inc(void)
> +{
> +	static_branch_inc(&xtc_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(xtc_inc);
> +
> +void xtc_dec(void)
> +{
> +	static_branch_dec(&xtc_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(xtc_dec);
> +
> +static __always_inline enum tc_action_base
> +xtc_run(const struct xtc_entry *entry, struct sk_buff *skb,
> +	const bool needs_mac)
> +{
> +	const struct bpf_prog_array_item *item;
> +	const struct bpf_prog *prog;
> +	int ret = TC_NEXT;
> +
> +	if (needs_mac)
> +		__skb_push(skb, skb->mac_len);
> +	item = &entry->items[0];
> +	while ((prog = READ_ONCE(item->prog))) {
> +		bpf_compute_data_pointers(skb);
> +		ret = bpf_prog_run(prog, skb);
> +		if (ret != TC_NEXT)
> +			break;
> +		item++;
> +	}
> +	if (needs_mac)
> +		__skb_pull(skb, skb->mac_len);
> +	return xtc_action_code(skb, ret);
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev,  
> int *ret,
> +		   struct net_device *orig_dev, bool *another)
> +{
> +	struct xtc_entry *entry = rcu_dereference_bh(skb->dev->xtc_ingress);
> +	int sch_ret;
> +
> +	if (!entry)
> +		return skb;
> +	if (*pt_prev) {
> +		*ret = deliver_skb(skb, *pt_prev, orig_dev);
> +		*pt_prev = NULL;
> +	}
> +
> +	qdisc_skb_cb(skb)->pkt_len = skb->len;
> +	xtc_set_ingress(skb, true);
> +
> +	if (static_branch_unlikely(&xtc_needed_key)) {
> +		sch_ret = xtc_run(entry, skb, true);
> +		if (sch_ret != TC_ACT_UNSPEC)
> +			goto ingress_verdict;
> +	}
> +	sch_ret = tc_run(entry, skb);
> +ingress_verdict:
> +	switch (sch_ret) {
> +	case TC_ACT_REDIRECT:
> +		/* skb_mac_header check was done by BPF, so we can safely
> +		 * push the L2 header back before redirecting to another
> +		 * netdev.
> +		 */
> +		__skb_push(skb, skb->mac_len);
> +		if (skb_do_redirect(skb) == -EAGAIN) {
> +			__skb_pull(skb, skb->mac_len);
> +			*another = true;
> +			break;
> +		}
> +		return NULL;
>   	case TC_ACT_SHOT:
> -		mini_qdisc_qstats_cpu_drop(miniq);
> -		*ret = NET_XMIT_DROP;
> -		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
>   		return NULL;
> +	/* used by tc_run */
>   	case TC_ACT_STOLEN:
>   	case TC_ACT_QUEUED:
>   	case TC_ACT_TRAP:
> -		*ret = NET_XMIT_SUCCESS;
>   		consume_skb(skb);
> +		fallthrough;
> +	case TC_ACT_CONSUMED:
>   		return NULL;
> +	}
> +
> +	return skb;
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +{
> +	struct xtc_entry *entry = rcu_dereference_bh(dev->xtc_egress);
> +	int sch_ret;
> +
> +	if (!entry)
> +		return skb;
> +
> +	/* qdisc_skb_cb(skb)->pkt_len & xtc_set_ingress() was
> +	 * already set by the caller.
> +	 */
> +	if (static_branch_unlikely(&xtc_needed_key)) {
> +		sch_ret = xtc_run(entry, skb, false);
> +		if (sch_ret != TC_ACT_UNSPEC)
> +			goto egress_verdict;
> +	}
> +	sch_ret = tc_run(entry, skb);
> +egress_verdict:
> +	switch (sch_ret) {
>   	case TC_ACT_REDIRECT:
> +		*ret = NET_XMIT_SUCCESS;
>   		/* No need to push/pop skb's mac_header here on egress! */
>   		skb_do_redirect(skb);
> +		return NULL;
> +	case TC_ACT_SHOT:
> +		*ret = NET_XMIT_DROP;
> +		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +		return NULL;
> +	/* used by tc_run */
> +	case TC_ACT_STOLEN:
> +	case TC_ACT_QUEUED:
> +	case TC_ACT_TRAP:
>   		*ret = NET_XMIT_SUCCESS;
>   		return NULL;
> -	default:
> -		break;
>   	}
> -#endif /* CONFIG_NET_CLS_ACT */

>   	return skb;
>   }
> -
> -static struct netdev_queue *
> -netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> -{
> -	int qm = skb_get_queue_mapping(skb);
> -
> -	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> -}
> -
> -static bool netdev_xmit_txqueue_skipped(void)
> +#else
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev,  
> int *ret,
> +		   struct net_device *orig_dev, bool *another)
>   {
> -	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +	return skb;
>   }

> -void netdev_xmit_skip_txqueue(bool skip)
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>   {
> -	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +	return skb;
>   }
> -EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> -#endif /* CONFIG_NET_EGRESS */
> +#endif /* CONFIG_NET_XGRESS */

>   #ifdef CONFIG_XPS
>   static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff  
> *skb,
> @@ -4181,9 +4311,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct  
> net_device *sb_dev)
>   	skb_update_prio(skb);

>   	qdisc_pkt_len_init(skb);
> -#ifdef CONFIG_NET_CLS_ACT
> -	skb->tc_at_ingress = 0;
> -#endif
> +	xtc_set_ingress(skb, false);
>   #ifdef CONFIG_NET_EGRESS
>   	if (static_branch_unlikely(&egress_needed_key)) {
>   		if (nf_hook_egress_active()) {
> @@ -5101,68 +5229,6 @@ int (*br_fdb_test_addr_hook)(struct net_device  
> *dev,
>   EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
>   #endif

> -static inline struct sk_buff *
> -sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev,  
> int *ret,
> -		   struct net_device *orig_dev, bool *another)
> -{
> -#ifdef CONFIG_NET_CLS_ACT
> -	struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
> -	struct tcf_result cl_res;
> -
> -	/* If there's at least one ingress present somewhere (so
> -	 * we get here via enabled static key), remaining devices
> -	 * that are not configured with an ingress qdisc will bail
> -	 * out here.
> -	 */
> -	if (!miniq)
> -		return skb;
> -
> -	if (*pt_prev) {
> -		*ret = deliver_skb(skb, *pt_prev, orig_dev);
> -		*pt_prev = NULL;
> -	}
> -
> -	qdisc_skb_cb(skb)->pkt_len = skb->len;
> -	tc_skb_cb(skb)->mru = 0;
> -	tc_skb_cb(skb)->post_ct = false;
> -	skb->tc_at_ingress = 1;
> -	mini_qdisc_bstats_cpu_update(miniq, skb);
> -
> -	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res,  
> false)) {
> -	case TC_ACT_OK:
> -	case TC_ACT_RECLASSIFY:
> -		skb->tc_index = TC_H_MIN(cl_res.classid);
> -		break;
> -	case TC_ACT_SHOT:
> -		mini_qdisc_qstats_cpu_drop(miniq);
> -		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> -		return NULL;
> -	case TC_ACT_STOLEN:
> -	case TC_ACT_QUEUED:
> -	case TC_ACT_TRAP:
> -		consume_skb(skb);
> -		return NULL;
> -	case TC_ACT_REDIRECT:
> -		/* skb_mac_header check was done by cls/act_bpf, so
> -		 * we can safely push the L2 header back before
> -		 * redirecting to another netdev
> -		 */
> -		__skb_push(skb, skb->mac_len);
> -		if (skb_do_redirect(skb) == -EAGAIN) {
> -			__skb_pull(skb, skb->mac_len);
> -			*another = true;
> -			break;
> -		}
> -		return NULL;
> -	case TC_ACT_CONSUMED:
> -		return NULL;
> -	default:
> -		break;
> -	}
> -#endif /* CONFIG_NET_CLS_ACT */
> -	return skb;
> -}
> -
>   /**
>    *	netdev_is_rx_handler_busy - check if receive handler is registered
>    *	@dev: device to check
> @@ -10832,7 +10898,7 @@ void unregister_netdevice_many(struct list_head  
> *head)

>   		/* Shutdown queueing discipline. */
>   		dev_shutdown(dev);
> -
> +		dev_xtc_uninstall(dev);
>   		dev_xdp_uninstall(dev);

>   		netdev_offload_xstats_disable_all(dev);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bb0136e7a8e4..ac4bb016c5ee 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9132,7 +9132,7 @@ static struct bpf_insn  
> *bpf_convert_tstamp_read(const struct bpf_prog *prog,
>   	__u8 value_reg = si->dst_reg;
>   	__u8 skb_reg = si->src_reg;

> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>   	/* If the tstamp_type is read,
>   	 * the bpf prog is aware the tstamp could have delivery time.
>   	 * Thus, read skb->tstamp as is if tstamp_type_access is true.
> @@ -9166,7 +9166,7 @@ static struct bpf_insn  
> *bpf_convert_tstamp_write(const struct bpf_prog *prog,
>   	__u8 value_reg = si->src_reg;
>   	__u8 skb_reg = si->dst_reg;

> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>   	/* If the tstamp_type is read,
>   	 * the bpf prog is aware the tstamp could have delivery time.
>   	 * Thus, write skb->tstamp as is if tstamp_type_access is true.
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 1e8ab4749c6c..c1b8f2e7d966 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -382,8 +382,7 @@ config NET_SCH_FQ_PIE
>   config NET_SCH_INGRESS
>   	tristate "Ingress/classifier-action Qdisc"
>   	depends on NET_CLS_ACT
> -	select NET_INGRESS
> -	select NET_EGRESS
> +	select NET_XGRESS
>   	help
>   	  Say Y here if you want to use classifiers for incoming and/or outgoing
>   	  packets. This qdisc doesn't do anything else besides running  
> classifiers,
> @@ -753,6 +752,7 @@ config NET_EMATCH_IPT
>   config NET_CLS_ACT
>   	bool "Actions"
>   	select NET_CLS
> +	select NET_XGRESS
>   	help
>   	  Say Y here if you want to use traffic control actions. Actions
>   	  get attached to classifiers and are invoked after a successful
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..3bd37ee898ce 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -13,6 +13,7 @@
>   #include <net/netlink.h>
>   #include <net/pkt_sched.h>
>   #include <net/pkt_cls.h>
> +#include <net/xtc.h>

>   struct ingress_sched_data {
>   	struct tcf_block *block;
> @@ -78,11 +79,19 @@ static int ingress_init(struct Qdisc *sch, struct  
> nlattr *opt,
>   {
>   	struct ingress_sched_data *q = qdisc_priv(sch);
>   	struct net_device *dev = qdisc_dev(sch);
> +	struct xtc_entry *entry;
> +	bool created;
>   	int err;

>   	net_inc_ingress_queue();

> -	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> +	entry = dev_xtc_entry_fetch(dev, true, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +
> +	mini_qdisc_pair_init(&q->miniqp, sch, &entry->parent->miniq);
> +	if (created)
> +		dev_xtc_entry_update(dev, entry, true);

>   	q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>   	q->block_info.chain_head_change = clsact_chain_head_change;
> @@ -93,15 +102,20 @@ static int ingress_init(struct Qdisc *sch, struct  
> nlattr *opt,
>   		return err;

>   	mini_qdisc_pair_block_init(&q->miniqp, q->block);
> -
>   	return 0;
>   }

>   static void ingress_destroy(struct Qdisc *sch)
>   {
>   	struct ingress_sched_data *q = qdisc_priv(sch);
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct xtc_entry *entry = rtnl_dereference(dev->xtc_ingress);

>   	tcf_block_put_ext(q->block, sch, &q->block_info);
> +	if (entry && dev_xtc_entry_total(entry) == 0) {
> +		dev_xtc_entry_update(dev, NULL, true);
> +		dev_xtc_entry_free(entry);
> +	}
>   	net_dec_ingress_queue();
>   }

> @@ -217,12 +231,20 @@ static int clsact_init(struct Qdisc *sch, struct  
> nlattr *opt,
>   {
>   	struct clsact_sched_data *q = qdisc_priv(sch);
>   	struct net_device *dev = qdisc_dev(sch);
> +	struct xtc_entry *entry;
> +	bool created;
>   	int err;

>   	net_inc_ingress_queue();
>   	net_inc_egress_queue();

> -	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
> +	entry = dev_xtc_entry_fetch(dev, true, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +
> +	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &entry->parent->miniq);
> +	if (created)
> +		dev_xtc_entry_update(dev, entry, true);

>   	q->ingress_block_info.binder_type =  
> FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>   	q->ingress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -235,7 +257,13 @@ static int clsact_init(struct Qdisc *sch, struct  
> nlattr *opt,

>   	mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);

> -	mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
> +	entry = dev_xtc_entry_fetch(dev, false, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +
> +	mini_qdisc_pair_init(&q->miniqp_egress, sch, &entry->parent->miniq);
> +	if (created)
> +		dev_xtc_entry_update(dev, entry, false);

>   	q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
>   	q->egress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -247,9 +275,21 @@ static int clsact_init(struct Qdisc *sch, struct  
> nlattr *opt,
>   static void clsact_destroy(struct Qdisc *sch)
>   {
>   	struct clsact_sched_data *q = qdisc_priv(sch);
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct xtc_entry *ingress_entry = rtnl_dereference(dev->xtc_ingress);
> +	struct xtc_entry *egress_entry = rtnl_dereference(dev->xtc_egress);

>   	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
> +	if (egress_entry && dev_xtc_entry_total(egress_entry) == 0) {
> +		dev_xtc_entry_update(dev, NULL, false);
> +		dev_xtc_entry_free(egress_entry);
> +	}
> +
>   	tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
> +	if (ingress_entry && dev_xtc_entry_total(ingress_entry) == 0) {
> +		dev_xtc_entry_update(dev, NULL, true);
> +		dev_xtc_entry_free(ingress_entry);
> +	}

>   	net_dec_ingress_queue();
>   	net_dec_egress_queue();
> diff --git a/tools/include/uapi/linux/bpf.h  
> b/tools/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
>   	BPF_PERF_EVENT,
>   	BPF_TRACE_KPROBE_MULTI,
>   	BPF_LSM_CGROUP,
> +	BPF_NET_INGRESS,
> +	BPF_NET_EGRESS,
>   	__MAX_BPF_ATTACH_TYPE
>   };

> @@ -1399,14 +1401,20 @@ union bpf_attr {
>   	};

>   	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -		__u32		target_fd;	/* container object to attach to */
> +		union {
> +			__u32	target_fd;	/* container object to attach to */
> +			__u32	target_ifindex; /* target ifindex */
> +		};
>   		__u32		attach_bpf_fd;	/* eBPF program to attach */
>   		__u32		attach_type;
>   		__u32		attach_flags;
> -		__u32		replace_bpf_fd;	/* previously attached eBPF
> +		union {
> +			__u32	attach_priority;
> +			__u32	replace_bpf_fd;	/* previously attached eBPF
>   						 * program to replace if
>   						 * BPF_F_REPLACE is used
>   						 */
> +		};
>   	};

>   	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
>   	} info;

>   	struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -		__u32		target_fd;	/* container object to query */
> +		union {
> +			__u32	target_fd;	/* container object to query */
> +			__u32	target_ifindex; /* target ifindex */
> +		};
>   		__u32		attach_type;
>   		__u32		query_flags;
>   		__u32		attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
>   	};
>   };

> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> +	TC_NEXT		= -1,
> +	TC_PASS		= 0,
> +	TC_DROP		= 2,
> +	TC_REDIRECT	= 7,
> +};
> +
>   struct bpf_xdp_sock {
>   	__u32 queue_id;
>   };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
>   	__be32	flow_label;
>   };

> +struct bpf_query_info {
> +	__u32 prog_id;
> +	__u32 prio;
> +};
> +
>   struct bpf_func_info {
>   	__u32	insn_off;
>   	__u32	type_id;
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
  2022-10-05  0:55   ` sdf
@ 2022-10-05 10:33   ` Toke Høiland-Jørgensen
  2022-10-05 12:47     ` Daniel Borkmann
  2022-10-05 19:04   ` Jamal Hadi Salim
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 62+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-10-05 10:33 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, joe, netdev, Daniel Borkmann

Daniel Borkmann <daniel@iogearbox.net> writes:

> As part of the feedback from LPC, there was a suggestion to provide a
> name for this infrastructure to more easily differ between the classic
> cls_bpf attachment and the fd-based API. As for most, the XDP vs tc
> layer is already the default mental model for the pkt processing
> pipeline. We refactored this with an xtc internal prefix aka 'express
> traffic control' in order to avoid to deviate too far (and 'express'
> given its more lightweight/faster entry point).

Woohoo, bikeshed time! :)

I am OK with having a separate name for this, but can we please pick one
that doesn't sound like 'XDP' when you say it out loud? You really don't
have to mumble much for 'XDP' and 'XTC' to sound exactly alike; this is
bound to lead to confusion!

Alternatives, in the same vein:
- ltc (lightweight)
- etc (extended/express/ebpf/et cetera ;))
- tcx (keep the cool X, but put it at the end)

[...]

> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> +	TC_NEXT		= -1,
> +	TC_PASS		= 0,
> +	TC_DROP		= 2,
> +	TC_REDIRECT	= 7,
> +};

Looking at things like this, though, I wonder if having a separate name
(at least if it's too prominent) is not just going to be more confusing
than not? I.e., we go out of our way to make it compatible with existing
TC-BPF programs (which is a good thing!), so do we really need a
separate name? Couldn't it just be an implementation detail that "it's
faster now"?
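
FWIW, the values do line up 1:1 with the classic TC_ACT_* codes from
linux/pkt_cls.h; as a quick sketch (not part of the series), a
compile-time check along these lines should hold with the uapi header
from patch 1:

  #include <linux/bpf.h>       /* enum tc_action_base from this series */
  #include <linux/pkt_cls.h>

  _Static_assert(TC_NEXT     == TC_ACT_UNSPEC,   "");
  _Static_assert(TC_PASS     == TC_ACT_OK,       "");
  _Static_assert(TC_DROP     == TC_ACT_SHOT,     "");
  _Static_assert(TC_REDIRECT == TC_ACT_REDIRECT, "");

which is what makes the compatibility story work in the first place.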

Oh, and speaking of compatibility, should 'tc' (the iproute2 binary) be
taught how to display these new bpf_link attachments so that users can
see that they're there?

-Toke


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05  0:55   ` sdf
@ 2022-10-05 10:50     ` Toke Høiland-Jørgensen
  2022-10-05 14:48       ` Daniel Borkmann
  2022-10-05 12:35     ` Daniel Borkmann
  1 sibling, 1 reply; 62+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-10-05 10:50 UTC (permalink / raw)
  To: sdf, Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, joe, netdev

sdf@google.com writes:

>>   	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>> -		__u32		target_fd;	/* container object to attach to */
>> +		union {
>> +			__u32	target_fd;	/* container object to attach to */
>> +			__u32	target_ifindex; /* target ifindex */
>> +		};
>>   		__u32		attach_bpf_fd;	/* eBPF program to attach */
>>   		__u32		attach_type;
>>   		__u32		attach_flags;
>> -		__u32		replace_bpf_fd;	/* previously attached eBPF
>
> [..]
>
>> +		union {
>> +			__u32	attach_priority;
>> +			__u32	replace_bpf_fd;	/* previously attached eBPF
>>   						 * program to replace if
>>   						 * BPF_F_REPLACE is used
>>   						 */
>> +		};
>
> The series looks exciting, haven't had a chance to look deeply, will try
> to find some time this week.
>
> We've chatted briefly about priority during the talk, let's maybe discuss
> it here more?
>
> I, as a user, still really have no clue about what priority to use.
> We have this problem at tc, and we'll seemingly have the same problem
> here? I guess it's even more relevant in k8s because internally at G we
> can control the users.
>
> Is it worth at least trying to provide some default bands / guidance?
>
> For example, having SEC('tc/ingress') receive attach_priority=124 by
> default? Maybe we can even have something like 'tc/ingress_first' get
> attach_priority=1 and 'tc/ingress_last' with attach_priority=254?
> (the names are arbitrary, we can do something better)
>
> ingress_first/ingress_last can be used by some monitoring jobs. The rest
> can use default 124. If somebody really needs a custom priority, then they
> can manually use something around 124/2 if they need to trigger before the
> 'default' priority or 124+124/2 if they want to trigger after?
>
> Thoughts? Is it worth it? Do we care?

I think we should care :)

Having "better" defaults is probably a good idea (so not everything
just ends up at priority 1 by default). However, I think ultimately the
only robust solution is to make the priority override-able. Users are
going to want to combine BPF programs in ways that their authors didn't
anticipate, so the actual priority the programs run at should not be the
sole choice of the program author.

To use the example that Daniel presented at LPC: Running datadog and
cilium at the same time broke cilium because datadog took over the
prio-1 hook point. With the bpf_link API what would change is that (a)
it would be obvious that something breaks (that is good), and (b) it
would be datadog that breaks instead of cilium (because it can no longer
just take over the hook, it'll get an error instead). However, (b) means
that the user still hasn't gotten what they wanted: the ability to run
datadog and cilium at the same time. To do this, they will need to be
able to change the priorities of one or both applications.

I know cilium at least has a configuration option to change this
somewhere, but I don't think relying on every BPF-using application to
expose this (each in their own way) is a good solution. I think of
priorities more like daemon startup at boot: this is system policy,
decided by the equivalent of the init system (and in this analogy we are
currently at the 'rc.d' stage of init system design, with the hook
priorities).

One way to resolve this is to have a central daemon that implements the
policy and does all the program loading on behalf of the users. I think
multiple such daemons exist already in more or less public and/or
complete states. However, getting everyone to agree on one is also hard,
so maybe the kernel needs to expose a mechanism for doing the actual
overriding, and then whatever daemon people run can hook into that?

Not sure what that mechanism would be? A(nother) BPF hook for overriding
priority on load? An LSM hook that rewrites the system call? (can it
already do that?) Something else?

Oh, and in the case of TC there's the additional issue that execution
only chains to the next program if the current one returns TC_ACT_UNSPEC;
this should probably also be overridable somehow, for the same reasons...
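
To spell that out as I read patch 1 (sketch only, program names made up,
same headers as the selftest program in patch 10): only TC_NEXT lets the
next program in the per-device array run, anything else ends the run for
that skb:

  SEC("tc/ingress")
  int observer(struct __sk_buff *skb)
  {
  	return TC_NEXT;  /* chain continues, next program still runs */
  }

  SEC("tc/ingress")
  int policy(struct __sk_buff *skb)
  {
  	return TC_DROP;  /* chain stops here, skb is dropped */
  }

So if 'policy' is attached with a lower priority value than 'observer'
(i.e. runs first), the observer never sees the dropped packets.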

-Toke


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05  0:55   ` sdf
  2022-10-05 10:50     ` Toke Høiland-Jørgensen
@ 2022-10-05 12:35     ` Daniel Borkmann
  2022-10-05 17:56       ` sdf
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-05 12:35 UTC (permalink / raw)
  To: sdf
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On 10/5/22 2:55 AM, sdf@google.com wrote:
> On 10/05, Daniel Borkmann wrote:
[...]
> 
> The series looks exciting, haven't had a chance to look deeply, will try
> to find some time this week.

Great, thanks!

> We've chatted briefly about priority during the talk, let's maybe discuss
> it here more?
> 
> I, as a user, still really have no clue about what priority to use.
> We have this problem at tc, and we'll seemingly have the same problem
> here? I guess it's even more relevant in k8s because internally at G we
> can control the users.
> 
> Is it worth at least trying to provide some default bands / guidance?
> 
> For example, having SEC('tc/ingress') receive attach_priority=124 by
> default? Maybe we can even have something like 'tc/ingress_first' get
> attach_priority=1 and 'tc/ingress_last' with attach_priority=254?
> (the names are arbitrary, we can do something better)
> 
> ingress_first/ingress_last can be used by some monitoring jobs. The rest
> can use default 124. If somebody really needs a custom priority, then they
> can manually use something around 124/2 if they need to trigger before the
> 'default' priority or 124+124/2 if they want to trigger after?
> 
> Thoughts? Is it worth it? Do we care?

I think guidance is needed, yes, I can add a few paragraphs to the libbpf
header file where we also have the tc BPF link API. I had a brief discussion
around this also with developers from datadog as they also use the infra
via tc BPF. Overall, it's a hard problem, and I don't think there's a good
generic solution. The '*_last' is implied by prio=0, so that the kernel
auto-allocates it, and for libbpf we could add an API for it where the user
does not need to specify the prio explicitly. The 'appending' is reasonable
to me given that if an application explicitly requests to be added as first
(and e.g. enforces policy through tc BPF), but some other 3rd party
application prepends itself as first, it can bypass the former, which would
be too easy a way to shoot yourself in the foot. Overall, the issue in tc
land is that ordering matters: skb packet data could be mangled (e.g. IPs
NATed), skb fields can be mangled, and we can have redirect actions (dev A
vs. B); the only way I'd see this being possible is if the verifier could
somehow annotate the prog when it didn't observe any writes to the skb and
no redirect was in play. Then you've kind of replicated constraints similar
to tracing, where the attachment can say that ordering doesn't matter if
all the progs are in the same style. Otherwise, explicit cooperation is
needed, as is the case today with the rest of tc (or as Toke did in libxdp)
with multi-attach. In the specific case I mentioned at LPC, it can be made
to work given one of the two is only observing traffic at that layer: e.g.
it could get prepended if there is a guarantee that all return codes are
TC_ACT_UNSPEC so that there is no bypass and it then sees all traffic, or
appended to see only traffic which made it past the policy. So it all
depends on the applications installing programs, but solving it generically
is not possible given ordering and conflicting actions. So, imho, an
_append() API for libbpf can be added along with guidance for developers on
when to use _append() vs an explicit prio.
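
To make the prio semantics concrete, here is a minimal userspace sketch
against the fd-based attach API from this series (assumes the patched uapi
headers; the helper name tc_bpf_attach is made up for illustration, and
error handling is omitted):

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* Illustrative only: attach an already loaded BPF_PROG_TYPE_SCHED_CLS
   * program to the ingress hook of ifindex via BPF_PROG_ATTACH. A prio > 0
   * requests an explicit position; prio == 0 lets the kernel auto-allocate
   * one, which is the '*_last' / append behavior mentioned above.
   */
  static int tc_bpf_attach(int prog_fd, int ifindex, unsigned int prio)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.target_ifindex  = ifindex;         /* union with target_fd */
          attr.attach_bpf_fd   = prog_fd;
          attr.attach_type     = BPF_NET_INGRESS; /* new attach type */
          attr.attach_priority = prio;            /* union with replace_bpf_fd */

          return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
  }

Taking over an already used prio would additionally require BPF_F_REPLACE
in attach_flags, per the patch; otherwise the attach fails with -EBUSY.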

Thanks,
Daniel

>>       };
> 
>>       struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
>> @@ -1452,7 +1460,10 @@ union bpf_attr {
>>       } info;
> 
>>       struct { /* anonymous struct used by BPF_PROG_QUERY command */
>> -        __u32        target_fd;    /* container object to query */
>> +        union {
>> +            __u32    target_fd;    /* container object to query */
>> +            __u32    target_ifindex; /* target ifindex */

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05 10:33   ` Toke Høiland-Jørgensen
@ 2022-10-05 12:47     ` Daniel Borkmann
  2022-10-05 14:32       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-05 12:47 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, joe, netdev

On 10/5/22 12:33 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
> 
>> As part of the feedback from LPC, there was a suggestion to provide a
>> name for this infrastructure to more easily differentiate between the
>> classic cls_bpf attachment and the fd-based API. For most, the XDP vs tc
>> layer is already the default mental model for the pkt processing pipeline.
>> We refactored this with an xtc internal prefix aka 'express traffic
>> control' in order to avoid deviating too far (and 'express' given its
>> more lightweight/faster entry point).
> 
> Woohoo, bikeshed time! :)
> 
> I am OK with having a separate name for this, but can we please pick one
> that doesn't sound like 'XDP' when you say it out loud? You really don't
> have to mumble much for 'XDP' and 'XTC' to sound exactly alike; this is
> bound to lead to confusion!
> 
> Alternatives, in the same vein:
> - ltc (lightweight)
> - etc (extended/express/ebpf/et cetera ;))
> - tcx (keep the cool X, but put it at the end)

Hehe, yeah agree, I don't have a strong opinion, but tcx (or just sticking
with tc) is fully okay to me.

> [...]
> 
>> +/* (Simplified) user return codes for tc prog type.
>> + * A valid tc program must return one of these defined values. All other
>> + * return codes are reserved for future use. Must remain compatible with
>> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
>> + * return codes are mapped to TC_NEXT.
>> + */
>> +enum tc_action_base {
>> +	TC_NEXT		= -1,
>> +	TC_PASS		= 0,
>> +	TC_DROP		= 2,
>> +	TC_REDIRECT	= 7,
>> +};
> 
> Looking at things like this, though, I wonder if having a separate name
> (at least if it's too prominent) is not just going to be more confusing
> than not? I.e., we go out of our way to make it compatible with existing
> TC-BPF programs (which is a good thing!), so do we really need a
> separate name? Couldn't it just be an implementation detail that "it's
> faster now"?

Yep, faster is an implementation detail, and developers can stick to the
existing opcodes. I added this here given Andrii suggested adding the action
codes as an enum so they land in vmlinux BTF. My thinking was that if we go
this route, we could also make them more user friendly. This part is 100%
optional, but I was hoping it might lower the barrier a bit for new
developers given it makes clear which subset of actions BPF explicitly
supports, with less cryptic names.
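
As a concrete illustration, a minimal sketch of a tc BPF program written
against these simplified codes (assumes the enum from this patch is visible
via the uapi header or vmlinux.h; the equivalent TC_ACT_SHOT/TC_ACT_UNSPEC
values would behave identically):

  // SPDX-License-Identifier: GPL-2.0
  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  /* Sketch: drop everything that is not IPv4, let the rest continue to
   * the next program in the chain (TC_NEXT maps to TC_ACT_UNSPEC, TC_DROP
   * to TC_ACT_SHOT).
   */
  SEC("tc")
  int drop_non_ipv4(struct __sk_buff *skb)
  {
          if (skb->protocol != bpf_htons(ETH_P_IP))
                  return TC_DROP;
          return TC_NEXT;
  }

  char LICENSE[] SEC("license") = "GPL";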

> Oh, and speaking of compatibility should 'tc' (the iproute2 binary) be
> taught how to display these new bpf_link attachments so that users can
> see that they're there?

Sounds reasonable, I can follow-up with the iproute2 support as well.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05 12:47     ` Daniel Borkmann
@ 2022-10-05 14:32       ` Toke Høiland-Jørgensen
  2022-10-05 14:53         ` Daniel Borkmann
  0 siblings, 1 reply; 62+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-10-05 14:32 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, joe, netdev

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 10/5/22 12:33 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>> 
>>> As part of the feedback from LPC, there was a suggestion to provide a
>>> name for this infrastructure to more easily differentiate between the
>>> classic cls_bpf attachment and the fd-based API. For most, the XDP vs tc
>>> layer is already the default mental model for the pkt processing pipeline.
>>> We refactored this with an xtc internal prefix aka 'express traffic
>>> control' in order to avoid deviating too far (and 'express' given its
>>> more lightweight/faster entry point).
>> 
>> Woohoo, bikeshed time! :)
>> 
>> I am OK with having a separate name for this, but can we please pick one
>> that doesn't sound like 'XDP' when you say it out loud? You really don't
>> have to mumble much for 'XDP' and 'XTC' to sound exactly alike; this is
>> bound to lead to confusion!
>> 
>> Alternatives, in the same vein:
>> - ltc (lightweight)
>> - etc (extended/express/ebpf/et cetera ;))
>> - tcx (keep the cool X, but put it at the end)
>
> Hehe, yeah agree, I don't have a strong opinion, but tcx (or just sticking
> with tc) is fully okay to me.

Either is fine with me; I don't have any strong opinions either, other
than "not XTC" ;)

>> [...]
>> 
>>> +/* (Simplified) user return codes for tc prog type.
>>> + * A valid tc program must return one of these defined values. All other
>>> + * return codes are reserved for future use. Must remain compatible with
>>> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
>>> + * return codes are mapped to TC_NEXT.
>>> + */
>>> +enum tc_action_base {
>>> +	TC_NEXT		= -1,
>>> +	TC_PASS		= 0,
>>> +	TC_DROP		= 2,
>>> +	TC_REDIRECT	= 7,
>>> +};
>> 
>> Looking at things like this, though, I wonder if having a separate name
>> (at least if it's too prominent) is not just going to be more confusing
>> than not? I.e., we go out of our way to make it compatible with existing
>> TC-BPF programs (which is a good thing!), so do we really need a
>> separate name? Couldn't it just be an implementation detail that "it's
>> faster now"?
>
> Yep, faster is an implementation detail, and developers can stick to the
> existing opcodes. I added this here given Andrii suggested adding the action
> codes as an enum so they land in vmlinux BTF. My thinking was that if we go
> this route, we could also make them more user friendly. This part is 100%
> optional, but I was hoping it might lower the barrier a bit for new
> developers given it makes clear which subset of actions BPF explicitly
> supports, with less cryptic names.

Oh, I didn't mean that we shouldn't define these helpers; that's totally
fine, and probably useful. Just that when everything is named 'TC'
anyway, having a different name (like TCX) is maybe not that important
anyway?

>> Oh, and speaking of compatibility should 'tc' (the iproute2 binary) be
>> taught how to display these new bpf_link attachments so that users can
>> see that they're there?
>
> Sounds reasonable, I can follow-up with the iproute2 support as well.

Cool!

-Toke


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05 10:50     ` Toke Høiland-Jørgensen
@ 2022-10-05 14:48       ` Daniel Borkmann
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-05 14:48 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, sdf
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, joe, netdev

On 10/5/22 12:50 PM, Toke Høiland-Jørgensen wrote:
> sdf@google.com writes:
[...]
>>
>> The series looks exciting, haven't had a chance to look deeply, will try
>> to find some time this week.
>>
>> We've chatted briefly about priority during the talk, let's maybe discuss
>> it here more?
>>
>> I, as a user, still really have no clue about what priority to use.
>> We have this problem at tc, and we'll seemingly have the same problem
>> here? I guess it's even more relevant in k8s because internally at G we
>> can control the users.
>>
>> Is it worth at least trying to provide some default bands / guidance?
>>
>> For example, having SEC('tc/ingress') receive attach_priority=124 by
>> default? Maybe we can even have something like 'tc/ingress_first' get
>> attach_priority=1 and 'tc/ingress_last' with attach_priority=254?
>> (the names are arbitrary, we can do something better)
>>
>> ingress_first/ingress_last can be used by some monitoring jobs. The rest
>> can use default 124. If somebody really needs a custom priority, then they
>> can manually use something around 124/2 if they need to trigger before the
>> 'default' priority or 124+124/2 if they want to trigger after?
>>
>> Thoughts? Is it worth it? Do we care?
> 
> I think we should care :)
> 
> Having "better" defaults is probably a good idea (so not everything
> just ends up at priority 1 by default). However, I think ultimately the
> only robust solution is to make the priority override-able. Users are
> going to want to combine BPF programs in ways that their authors didn't
> anticipate, so the actual priority the programs run at should not be the
> sole choice of the program author.
> 
> To use the example that Daniel presented at LPC: Running datadog and
> cilium at the same time broke cilium because datadog took over the
> prio-1 hook point. With the bpf_link API what would change is that (a)
> it would be obvious that something breaks (that is good), and (b) it
> would be datadog that breaks instead of cilium (because it can no longer
> just take over the hook, it'll get an error instead). However, (b) means
> that the user still hasn't gotten what they wanted: the ability to run
> datadog and cilium at the same time. To do this, they will need to be
> able to change the priorities of one or both applications.

(Just for the record :) it was an oversight on the datadog agent's part and
it got fixed; somehow there was a corner-case race with device creation and
bpf attachment which led to this, but 100% it would make it obvious that
something breaks, which is already a good step forward - I just took this
solely as a real-world example that these things /can/ happen and are
/tricky/ to debug on top given the 'undefined' behavior resulting from
this; this can happen to anyone in general ofc. Both sides (cilium, dd)
are now configurable to interoperate cleanly through daemon config.)

> I know cilium at least has a configuration option to change this
> somewhere, but I don't think relying on every BPF-using application to
> expose this (each in their own way) is a good solution. I think of
> priorities more like daemon startup at boot: this is system policy,
> decided by the equivalent of the init system (and in this analogy we are
> currently at the 'rc.d' stage of init system design, with the hook
> priorities).
> 
> One way to resolve this is to have a central daemon that implements the
> policy and does all the program loading on behalf of the users. I think
> multiple such daemons exist already in more or less public and/or
> complete states. However, getting everyone to agree on one is also hard,
> so maybe the kernel needs to expose a mechanism for doing the actual
> overriding, and then whatever daemon people run can hook into that?

I think it's system policy but also user policy, kind of a mixed bag in the
end. Just take a policy bpf app vs an introspection bpf app as an example: a
user might want to see either all traffic (thus before the policy app), or
just the traffic that the policy let through (thus after the policy app).

> Not sure what that mechanism would be? A(nother) BPF hook for overriding
> priority on load? An LSM hook that rewrites the system call? (can it
> already do that?) Something else?

Yeah, it could be a means to achieve that: some kind of policy agent which
has awareness of the installed programs, their inter-dependencies and the
user intent, and then rewrites the prios dynamically.

> Oh, and also, in the case of TC there's also the additional issue that
> execution only chains to the next program if the current one returns
> TC_ACT_UNSPEC; this should probably also be overridable somehow, for the
> same reasons...

Same category as above, yes.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05 14:32       ` Toke Høiland-Jørgensen
@ 2022-10-05 14:53         ` Daniel Borkmann
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-05 14:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, bpf
  Cc: razor, ast, andrii, martin.lau, john.fastabend, joannelkoong,
	memxor, joe, netdev

On 10/5/22 4:32 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>> On 10/5/22 12:33 PM, Toke Høiland-Jørgensen wrote:
>>> Daniel Borkmann <daniel@iogearbox.net> writes:
[...]
>>>> +/* (Simplified) user return codes for tc prog type.
>>>> + * A valid tc program must return one of these defined values. All other
>>>> + * return codes are reserved for future use. Must remain compatible with
>>>> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
>>>> + * return codes are mapped to TC_NEXT.
>>>> + */
>>>> +enum tc_action_base {
>>>> +	TC_NEXT		= -1,
>>>> +	TC_PASS		= 0,
>>>> +	TC_DROP		= 2,
>>>> +	TC_REDIRECT	= 7,
>>>> +};
>>>
>>> Looking at things like this, though, I wonder if having a separate name
>>> (at least if it's too prominent) is not just going to be more confusing
>>> than not? I.e., we go out of our way to make it compatible with existing
>>> TC-BPF programs (which is a good thing!), so do we really need a
>>> separate name? Couldn't it just be an implementation detail that "it's
>>> faster now"?
>>
>> Yep, faster is an implementation detail, and developers can stick to the
>> existing opcodes. I added this here given Andrii suggested adding the action
>> codes as an enum so they land in vmlinux BTF. My thinking was that if we go
>> this route, we could also make them more user friendly. This part is 100%
>> optional, but I was hoping it might lower the barrier a bit for new
>> developers given it makes clear which subset of actions BPF explicitly
>> supports, with less cryptic names.
> 
> Oh, I didn't mean that we shouldn't define these helpers; that's totally
> fine, and probably useful. Just that when everything is named 'TC'
> anyway, having a different name (like TCX) is maybe not that important
> anyway?

I thought about this initially, but it also has nothing to do with tcx given
it can just as well be used with both old- and new-style attachments, so I
wanted to avoid potential confusion around this.

>>> Oh, and speaking of compatibility should 'tc' (the iproute2 binary) be
>>> taught how to display these new bpf_link attachments so that users can
>>> see that they're there?
>>
>> Sounds reasonable, I can follow-up with the iproute2 support as well.
> 
> Cool!

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05 12:35     ` Daniel Borkmann
@ 2022-10-05 17:56       ` sdf
  2022-10-05 18:21         ` Daniel Borkmann
  0 siblings, 1 reply; 62+ messages in thread
From: sdf @ 2022-10-05 17:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On 10/05, Daniel Borkmann wrote:
> On 10/5/22 2:55 AM, sdf@google.com wrote:
> > On 10/05, Daniel Borkmann wrote:
> [...]
> >
> > The series looks exciting, haven't had a chance to look deeply, will try
> > to find some time this week.

> Great, thanks!

> > We've chatted briefly about priority during the talk, let's maybe discuss
> > it here more?
> >
> > I, as a user, still really have no clue about what priority to use.
> > We have this problem at tc, and we'll seemingly have the same problem
> > here? I guess it's even more relevant in k8s because internally at G we
> > can control the users.
> >
> > Is it worth at least trying to provide some default bands / guidance?
> >
> > For example, having SEC('tc/ingress') receive attach_priority=124 by
> > default? Maybe we can even have something like 'tc/ingress_first' get
> > attach_priority=1 and 'tc/ingress_last' with attach_priority=254?
> > (the names are arbitrary, we can do something better)
> >
> > ingress_first/ingress_last can be used by some monitoring jobs. The rest
> > can use default 124. If somebody really needs a custom priority, then they
> > can manually use something around 124/2 if they need to trigger before the
> > 'default' priority or 124+124/2 if they want to trigger after?
> >
> > Thoughts? Is it worth it? Do we care?

> I think guidance is needed, yes, I can add a few paragraphs to the libbpf
> header file where we also have the tc BPF link API. I had a brief discussion
> around this also with developers from datadog as they also use the infra
> via tc BPF. Overall, it's a hard problem, and I don't think there's a good
> generic solution. The '*_last' is implied by prio=0, so that the kernel
> auto-allocates it, and for libbpf we could add an API for it where the user
> does not need to specify the prio explicitly. The 'appending' is reasonable
> to me given that if an application explicitly requests to be added as first
> (and e.g. enforces policy through tc BPF), but some other 3rd party
> application prepends itself as first, it can bypass the former, which would
> be too easy a way to shoot yourself in the foot. Overall, the issue in tc
> land is that ordering matters: skb packet data could be mangled (e.g. IPs
> NATed), skb fields can be mangled, and we can have redirect actions (dev A
> vs. B); the only way I'd see this being possible is if the verifier could
> somehow annotate the prog when it didn't observe any writes to the skb and
> no redirect was in play. Then you've kind of replicated constraints similar
> to tracing, where the attachment can say that ordering doesn't matter if
> all the progs are in the same style. Otherwise, explicit cooperation is
> needed, as is the case today with the rest of tc (or as Toke did in libxdp)
> with multi-attach. In the specific case I mentioned at LPC, it can be made
> to work given one of the two is only observing traffic at that layer: e.g.
> it could get prepended if there is a guarantee that all return codes are
> TC_ACT_UNSPEC so that there is no bypass and it then sees all traffic, or
> appended to see only traffic which made it past the policy. So it all
> depends on the applications installing programs, but solving it generically
> is not possible given ordering and conflicting actions. So, imho, an
> _append() API for libbpf can be added along with guidance for developers on
> when to use _append() vs an explicit prio.

Agreed, it's a hard problem to solve, especially from the kernel side.
Ideally, as Toke mentions on the side thread, there should be some kind
of system daemon or some other place where the ordering is described.
But let's start with at least some guidance on the current prio.

It might also be a good idea to narrow down the prio range to 0-65k for
now? Maybe in the future we'll have some special PRIO_MONITORING_BEFORE_ALL
and PRIO_MONITORING_AFTER_ALL that trigger regardless of TC_ACT_UNSPEC?
I agree with Toke that it's another problem with the current action-based
chains that's worth solving somehow (compared to, say, cgroup programs).

> Thanks,
> Daniel

> > >       };
> >
> > >       struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> > > @@ -1452,7 +1460,10 @@ union bpf_attr {
> > >       } info;
> >
> > >       struct { /* anonymous struct used by BPF_PROG_QUERY command */
> > > -        __u32        target_fd;    /* container object to query */
> > > +        union {
> > > +            __u32    target_fd;    /* container object to query */
> > > +            __u32    target_ifindex; /* target ifindex */

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05 17:56       ` sdf
@ 2022-10-05 18:21         ` Daniel Borkmann
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-05 18:21 UTC (permalink / raw)
  To: sdf
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On 10/5/22 7:56 PM, sdf@google.com wrote:
> On 10/05, Daniel Borkmann wrote:
>> On 10/5/22 2:55 AM, sdf@google.com wrote:
>> > On 10/05, Daniel Borkmann wrote:
>> [...]
>> >
>> > The series looks exciting, haven't had a chance to look deeply, will try
>> > to find some time this week.
> 
>> Great, thanks!
> 
>> > We've chatted briefly about priority during the talk, let's maybe discuss
>> > it here more?
>> >
>> > I, as a user, still really have no clue about what priority to use.
>> > We have this problem at tc, and we'll seemingly have the same problem
>> > here? I guess it's even more relevant in k8s because internally at G we
>> > can control the users.
>> >
>> > Is it worth at least trying to provide some default bands / guidance?
>> >
>> > For example, having SEC('tc/ingress') receive attach_priority=124 by
>> > default? Maybe we can even have something like 'tc/ingress_first' get
>> > attach_priority=1 and 'tc/ingress_last' with attach_priority=254?
>> > (the names are arbitrary, we can do something better)
>> >
>> > ingress_first/ingress_last can be used by some monitoring jobs. The rest
>> > can use default 124. If somebody really needs a custom priority, then they
>> > can manually use something around 124/2 if they need to trigger before the
>> > 'default' priority or 124+124/2 if they want to trigger after?
>> >
>> > Thoughts? Is it worth it? Do we care?
> 
>> I think guidance is needed, yes, I can add a few paragraphs to the libbpf
>> header file where we also have the tc BPF link API. I had a brief discussion
>> around this also with developers from datadog as they also use the infra
>> via tc BPF. Overall, it's a hard problem, and I don't think there's a good
>> generic solution. The '*_last' is implied by prio=0, so that the kernel
>> auto-allocates it, and for libbpf we could add an API for it where the user
>> does not need to specify the prio explicitly. The 'appending' is reasonable
>> to me given that if an application explicitly requests to be added as first
>> (and e.g. enforces policy through tc BPF), but some other 3rd party
>> application prepends itself as first, it can bypass the former, which would
>> be too easy a way to shoot yourself in the foot. Overall, the issue in tc
>> land is that ordering matters: skb packet data could be mangled (e.g. IPs
>> NATed), skb fields can be mangled, and we can have redirect actions (dev A
>> vs. B); the only way I'd see this being possible is if the verifier could
>> somehow annotate the prog when it didn't observe any writes to the skb and
>> no redirect was in play. Then you've kind of replicated constraints similar
>> to tracing, where the attachment can say that ordering doesn't matter if
>> all the progs are in the same style. Otherwise, explicit cooperation is
>> needed, as is the case today with the rest of tc (or as Toke did in libxdp)
>> with multi-attach. In the specific case I mentioned at LPC, it can be made
>> to work given one of the two is only observing traffic at that layer: e.g.
>> it could get prepended if there is a guarantee that all return codes are
>> TC_ACT_UNSPEC so that there is no bypass and it then sees all traffic, or
>> appended to see only traffic which made it past the policy. So it all
>> depends on the applications installing programs, but solving it generically
>> is not possible given ordering and conflicting actions. So, imho, an
>> _append() API for libbpf can be added along with guidance for developers on
>> when to use _append() vs an explicit prio.
> 
> Agreed, it's a hard problem to solve, especially from the kernel side.
> Ideally, as Toke mentions on the side thread, there should be some kind
> of system daemon or some other place where the ordering is described.
> But let's start with at least some guidance on the current prio.
> 
> It might also be a good idea to narrow down the prio range to 0-65k for
> now? Maybe in the future we'll have some special PRIO_MONITORING_BEFORE_ALL
> and PRIO_MONITORING_AFTER_ALL that trigger regardless of TC_ACT_UNSPEC?
> I agree with Toke that it's another problem with the current action-based
> chains that's worth solving somehow (compared to, say, cgroup programs).

Makes sense, I'll restrict the range so there's headroom for future
extensions; the mentioned 0-65k looks very reasonable to me.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
  2022-10-05  0:55   ` sdf
  2022-10-05 10:33   ` Toke Høiland-Jørgensen
@ 2022-10-05 19:04   ` Jamal Hadi Salim
  2022-10-06 20:49     ` Daniel Borkmann
  2022-10-06  0:22   ` Andrii Nakryiko
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 62+ messages in thread
From: Jamal Hadi Salim @ 2022-10-05 19:04 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev, Cong Wang, Jiri Pirko

Daniel,

+tc maintainers

So I perused the slides, very fascinating battle debugging that ;->

Let me see if I can summarize the issue of ownership..
It seems there were two users, each with root access, and one decided they
wanted to be prio 1 and basically deleted the other's programs and added
themselves to the top? And of course both want to be prio 1. Am I correct?
And this feature basically avoids this problem by virtue of fd ownership.

IIUC, this is an issue of resource contention. Both users who have
root access think they should be prio 1. Kubernetes has no controls for this?
For debugging, wouldn't listening to netlink events have caught this?
I may be misunderstanding - but if both users took advantage of this
feature, it seems the root cause is still unresolved, i.e. whoever gets there
first becomes the owner of the highest prio?

Other comments on just this patch (I will pay attention in detail later):
My two qualms:
1) Was bastardizing all things TC_ACT_XXX necessary?
Maybe you could create a #define somewhere visible which refers
to the TC_ACT_XXX?
Even these kinds of things seem puzzling:
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
TC_ACT_*,
2) Why is xtc_run() before tc_run()?
tc_run() existed before xtc_run() - which is the same argument
used when someone new shows up (e.g. when nftables did).

A probably lesser concern are things like dev_xtc_entry_fetch(),
which are bpf-specific but now live in net.

cheers,
jamal

On Tue, Oct 4, 2022 at 7:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This work refactors and adds a lightweight extension to the tc BPF ingress
> and egress data path side for allowing BPF programs via an fd-based attach /
> detach API. The main goal behind this work, which we also presented at LPC [0]
> this year, is to eventually add support for BPF links for tc BPF programs in
> a second step; this prep work is required for the latter, which allows
> for a model of safe ownership and program detachment. Given the vast rise
> in tc BPF users in cloud native / Kubernetes environments, this becomes
> necessary to avoid hard-to-debug incidents either through stale leftover
> programs or 3rd party applications stepping on each other's toes. Further
> details on the BPF link rationale are in the next patch.
> For the existing tc framework, there is no change in behavior, and this
> work does not touch on tc core kernel APIs. The gist of this patch is that
> the ingress and egress hooks get a lightweight, qdisc-less extension for
> BPF to attach its tc BPF programs, in other words, a minimal tc-layer entry
> point for BPF. As part of the feedback from LPC, there was a suggestion to
> provide a name for this infrastructure to more easily differentiate between
> the classic cls_bpf attachment and the fd-based API. For most, the XDP vs
> tc layer is already the default mental model for the pkt processing
> pipeline. We refactored this with an xtc internal prefix aka 'express
> traffic control' in order to avoid deviating too far (and 'express' given
> its more lightweight/faster entry point).



> For the ingress and egress xtc points, the device holds a cache-friendly array
> with programs. Same as with classic tc, programs are attached with a prio that
> can be specified or auto-allocated through an idr, and the program return code
> determines whether to continue in the pipeline or to terminate processing.
> With the TC_ACT_UNSPEC code, processing continues (as is the case today). The
> goal was to have maximum compatibility with existing tc BPF programs, so they
> don't need to be adapted. Compatibility for calling into classic tcf_classify()
> is also provided in order to allow successive migration, or for both to cleanly
> co-exist where needed, given it's one logical layer. The fd-based API is behind
> a static key, so that the code is not entered when unused. The struct
> xtc_entry's program array is currently static, but could be made dynamic if
> necessary at a point in the future. Desire has also been expressed for future
> work to adapt a similar framework for XDP to allow multi-attach from the
> in-kernel side, too.
>
> Tested with the tc-testing selftest suite, which fully passes, as well as the
> tc BPF tests from the BPF CI.
>
>   [0] https://lpc.events/event/16/contributions/1353/
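
To make the query side of the new API concrete as well, a minimal userspace
sketch against the uapi additions below (assumes the patched headers; struct
bpf_query_info and target_ifindex are introduced by this patch, error
handling trimmed):

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* Illustrative only: list prog id and prio of programs attached to the
   * ingress hook of ifindex. For this attach type, prog_ids points to an
   * array of struct bpf_query_info entries rather than plain ids.
   */
  static int tc_bpf_query(int ifindex)
  {
          struct bpf_query_info info[16] = {};
          union bpf_attr attr;
          unsigned int i;
          int err;

          memset(&attr, 0, sizeof(attr));
          attr.query.target_ifindex = ifindex;
          attr.query.attach_type    = BPF_NET_INGRESS;
          attr.query.prog_ids       = (__u64)(unsigned long)info;
          attr.query.prog_cnt       = 16;

          err = syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr));
          if (err < 0)
                  return err;
          for (i = 0; i < attr.query.prog_cnt && i < 16; i++)
                  printf("prio %u: prog id %u\n", info[i].prio, info[i].prog_id);
          return 0;
  }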
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/bpf.h            |   1 +
>  include/linux/netdevice.h      |  14 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/xtc.h              | 181 ++++++++++++++++++++++
>  include/uapi/linux/bpf.h       |  35 ++++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/net.c               | 274 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  24 ++-
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 262 +++++++++++++++++++------------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  48 +++++-
>  tools/include/uapi/linux/bpf.h |  35 ++++-
>  17 files changed, 769 insertions(+), 130 deletions(-)
>  create mode 100644 include/net/xtc.h
>  create mode 100644 kernel/bpf/net.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e55a4d47324c..bb63d8d000ea 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3850,13 +3850,15 @@ S:      Maintained
>  F:     kernel/trace/bpf_trace.c
>  F:     kernel/bpf/stackmap.c
>
> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (xtc & tc BPF, sock_addr)
>  M:     Martin KaFai Lau <martin.lau@linux.dev>
>  M:     Daniel Borkmann <daniel@iogearbox.net>
>  R:     John Fastabend <john.fastabend@gmail.com>
>  L:     bpf@vger.kernel.org
>  L:     netdev@vger.kernel.org
>  S:     Maintained
> +F:     include/net/xtc.h
> +F:     kernel/bpf/net.c
>  F:     net/core/filter.c
>  F:     net/sched/act_bpf.c
>  F:     net/sched/cls_bpf.c
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e7d46d16032..71e5f43db378 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1473,6 +1473,7 @@ struct bpf_prog_array_item {
>         union {
>                 struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>                 u64 bpf_cookie;
> +               u32 bpf_priority;
>         };
>  };
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eddf8ee270e7..43bbb2303e57 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1880,8 +1880,7 @@ enum netdev_ml_priv_type {
>   *
>   *     @rx_handler:            handler for received packets
>   *     @rx_handler_data:       XXX: need comments on this one
> - *     @miniq_ingress:         ingress/clsact qdisc specific data for
> - *                             ingress processing
> + *     @xtc_ingress:           BPF/clsact qdisc specific data for ingress processing
>   *     @ingress_queue:         XXX: need comments on this one
>   *     @nf_hooks_ingress:      netfilter hooks executed for ingress packets
>   *     @broadcast:             hw bcast address
> @@ -1902,8 +1901,7 @@ enum netdev_ml_priv_type {
>   *     @xps_maps:              all CPUs/RXQs maps for XPS device
>   *
>   *     @xps_maps:      XXX: need comments on this one
> - *     @miniq_egress:          clsact qdisc specific data for
> - *                             egress processing
> + *     @xtc_egress:            BPF/clsact qdisc specific data for egress processing
>   *     @nf_hooks_egress:       netfilter hooks executed for egress packets
>   *     @qdisc_hash:            qdisc hash table
>   *     @watchdog_timeo:        Represents the timeout that is used by
> @@ -2191,8 +2189,8 @@ struct net_device {
>         rx_handler_func_t __rcu *rx_handler;
>         void __rcu              *rx_handler_data;
>
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc __rcu *miniq_ingress;
> +#ifdef CONFIG_NET_XGRESS
> +       struct xtc_entry __rcu  *xtc_ingress;
>  #endif
>         struct netdev_queue __rcu *ingress_queue;
>  #ifdef CONFIG_NETFILTER_INGRESS
> @@ -2220,8 +2218,8 @@ struct net_device {
>  #ifdef CONFIG_XPS
>         struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
>  #endif
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc __rcu *miniq_egress;
> +#ifdef CONFIG_NET_XGRESS
> +       struct xtc_entry __rcu *xtc_egress;
>  #endif
>  #ifdef CONFIG_NETFILTER_EGRESS
>         struct nf_hook_entries __rcu *nf_hooks_egress;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 9fcf534f2d92..a9ff7a1996e9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -955,7 +955,7 @@ struct sk_buff {
>         __u8                    csum_level:2;
>         __u8                    dst_pending_confirm:1;
>         __u8                    mono_delivery_time:1;   /* See SKB_MONO_DELIVERY_TIME_MASK */
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         __u8                    tc_skip_classify:1;
>         __u8                    tc_at_ingress:1;        /* See TC_AT_INGRESS_MASK */
>  #endif
> @@ -983,7 +983,7 @@ struct sk_buff {
>         __u8                    slow_gro:1;
>         __u8                    csum_not_inet:1;
>
> -#ifdef CONFIG_NET_SCHED
> +#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>         __u16                   tc_index;       /* traffic control index */
>  #endif
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index d5517719af4e..bc5c1da2d30f 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -693,7 +693,7 @@ int skb_do_redirect(struct sk_buff *);
>
>  static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
>  {
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         return skb->tc_at_ingress;
>  #else
>         return false;
> diff --git a/include/net/xtc.h b/include/net/xtc.h
> new file mode 100644
> index 000000000000..627dc18aa433
> --- /dev/null
> +++ b/include/net/xtc.h
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2022 Isovalent */
> +#ifndef __NET_XTC_H
> +#define __NET_XTC_H
> +
> +#include <linux/idr.h>
> +#include <linux/bpf.h>
> +
> +#include <net/sch_generic.h>
> +
> +#define XTC_MAX_ENTRIES 30
> +/* Adds 1 NULL entry. */
> +#define XTC_MAX        (XTC_MAX_ENTRIES + 1)
> +
> +struct xtc_entry {
> +       struct bpf_prog_array_item items[XTC_MAX] ____cacheline_aligned;
> +       struct xtc_entry_pair *parent;
> +};
> +
> +struct mini_Qdisc;
> +
> +struct xtc_entry_pair {
> +       struct rcu_head         rcu;
> +       struct idr              idr;
> +       struct mini_Qdisc       *miniq;
> +       struct xtc_entry        a;
> +       struct xtc_entry        b;
> +};
> +
> +static inline void xtc_set_ingress(struct sk_buff *skb, bool ingress)
> +{
> +#ifdef CONFIG_NET_XGRESS
> +       skb->tc_at_ingress = ingress;
> +#endif
> +}
> +
> +#ifdef CONFIG_NET_XGRESS
> +void xtc_inc(void);
> +void xtc_dec(void);
> +
> +static inline void
> +dev_xtc_entry_update(struct net_device *dev, struct xtc_entry *entry,
> +                    bool ingress)
> +{
> +       ASSERT_RTNL();
> +       if (ingress)
> +               rcu_assign_pointer(dev->xtc_ingress, entry);
> +       else
> +               rcu_assign_pointer(dev->xtc_egress, entry);
> +       synchronize_rcu();
> +}
> +
> +static inline struct xtc_entry *dev_xtc_entry_peer(const struct xtc_entry *entry)
> +{
> +       if (entry == &entry->parent->a)
> +               return &entry->parent->b;
> +       else
> +               return &entry->parent->a;
> +}
> +
> +static inline struct xtc_entry *dev_xtc_entry_create(void)
> +{
> +       struct xtc_entry_pair *pair = kzalloc(sizeof(*pair), GFP_KERNEL);
> +
> +       if (pair) {
> +               pair->a.parent = pair;
> +               pair->b.parent = pair;
> +               idr_init(&pair->idr);
> +               return &pair->a;
> +       }
> +       return NULL;
> +}
> +
> +static inline struct xtc_entry *dev_xtc_entry_fetch(struct net_device *dev,
> +                                                   bool ingress, bool *created)
> +{
> +       struct xtc_entry *entry = ingress ?
> +               rcu_dereference_rtnl(dev->xtc_ingress) :
> +               rcu_dereference_rtnl(dev->xtc_egress);
> +
> +       *created = false;
> +       if (!entry) {
> +               entry = dev_xtc_entry_create();
> +               if (!entry)
> +                       return NULL;
> +               *created = true;
> +       }
> +       return entry;
> +}
> +
> +static inline void dev_xtc_entry_clear(struct xtc_entry *entry)
> +{
> +       memset(entry->items, 0, sizeof(entry->items));
> +}
> +
> +static inline int dev_xtc_entry_prio_new(struct xtc_entry *entry, u32 prio,
> +                                        struct bpf_prog *prog)
> +{
> +       int ret;
> +
> +       if (prio == 0)
> +               prio = 1;
> +       ret = idr_alloc_u32(&entry->parent->idr, prog, &prio, U32_MAX,
> +                           GFP_KERNEL);
> +       return ret < 0 ? ret : prio;
> +}
> +
> +static inline void dev_xtc_entry_prio_set(struct xtc_entry *entry, u32 prio,
> +                                         struct bpf_prog *prog)
> +{
> +       idr_replace(&entry->parent->idr, prog, prio);
> +}
> +
> +static inline void dev_xtc_entry_prio_del(struct xtc_entry *entry, u32 prio)
> +{
> +       idr_remove(&entry->parent->idr, prio);
> +}
> +
> +static inline void dev_xtc_entry_free(struct xtc_entry *entry)
> +{
> +       idr_destroy(&entry->parent->idr);
> +       kfree_rcu(entry->parent, rcu);
> +}
> +
> +static inline u32 dev_xtc_entry_total(struct xtc_entry *entry)
> +{
> +       const struct bpf_prog_array_item *item;
> +       const struct bpf_prog *prog;
> +       u32 num = 0;
> +
> +       item = &entry->items[0];
> +       while ((prog = READ_ONCE(item->prog))) {
> +               num++;
> +               item++;
> +       }
> +       return num;
> +}
> +
> +static inline enum tc_action_base xtc_action_code(struct sk_buff *skb, int code)
> +{
> +       switch (code) {
> +       case TC_PASS:
> +               skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
> +               fallthrough;
> +       case TC_DROP:
> +       case TC_REDIRECT:
> +               return code;
> +       case TC_NEXT:
> +       default:
> +               return TC_NEXT;
> +       }
> +}
> +
> +int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int xtc_prog_detach(const union bpf_attr *attr);
> +int xtc_prog_query(const union bpf_attr *attr,
> +                  union bpf_attr __user *uattr);
> +void dev_xtc_uninstall(struct net_device *dev);
> +#else
> +static inline int xtc_prog_attach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int xtc_prog_detach(const union bpf_attr *attr)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int xtc_prog_query(const union bpf_attr *attr,
> +                                union bpf_attr __user *uattr)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline void dev_xtc_uninstall(struct net_device *dev)
> +{
> +}
> +#endif /* CONFIG_NET_XGRESS */
> +#endif /* __NET_XTC_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
>         BPF_PERF_EVENT,
>         BPF_TRACE_KPROBE_MULTI,
>         BPF_LSM_CGROUP,
> +       BPF_NET_INGRESS,
> +       BPF_NET_EGRESS,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1399,14 +1401,20 @@ union bpf_attr {
>         };
>
>         struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -               __u32           target_fd;      /* container object to attach to */
> +               union {
> +                       __u32   target_fd;      /* container object to attach to */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };
>                 __u32           attach_bpf_fd;  /* eBPF program to attach */
>                 __u32           attach_type;
>                 __u32           attach_flags;
> -               __u32           replace_bpf_fd; /* previously attached eBPF
> +               union {
> +                       __u32   attach_priority;
> +                       __u32   replace_bpf_fd; /* previously attached eBPF
>                                                  * program to replace if
>                                                  * BPF_F_REPLACE is used
>                                                  */
> +               };
>         };
>
>         struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
>         } info;
>
>         struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -               __u32           target_fd;      /* container object to query */
> +               union {
> +                       __u32   target_fd;      /* container object to query */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };
>                 __u32           attach_type;
>                 __u32           query_flags;
>                 __u32           attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
>         };
>  };
>
> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> +       TC_NEXT         = -1,
> +       TC_PASS         = 0,
> +       TC_DROP         = 2,
> +       TC_REDIRECT     = 7,
> +};
> +
>  struct bpf_xdp_sock {
>         __u32 queue_id;
>  };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
>         __be32  flow_label;
>  };
>
> +struct bpf_query_info {
> +       __u32 prog_id;
> +       __u32 prio;
> +};
> +
>  struct bpf_func_info {
>         __u32   insn_off;
>         __u32   type_id;
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 2dfe1079f772..6a906ff93006 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -31,6 +31,7 @@ config BPF_SYSCALL
>         select TASKS_TRACE_RCU
>         select BINARY_PRINTF
>         select NET_SOCK_MSG if NET
> +       select NET_XGRESS if NET
>         select PAGE_POOL if NET
>         default n
>         help
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 341c94f208f4..76c3f9d4e2f3 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -20,6 +20,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o
>  obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
>  obj-$(CONFIG_BPF_SYSCALL) += offload.o
>  obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
> +obj-$(CONFIG_BPF_SYSCALL) += net.o
>  endif
>  ifeq ($(CONFIG_PERF_EVENTS),y)
>  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
> new file mode 100644
> index 000000000000..ab9a9dee615b
> --- /dev/null
> +++ b/kernel/bpf/net.c
> @@ -0,0 +1,274 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/xtc.h>
> +
> +static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
> +                            struct bpf_prog *nprog, u32 prio, u32 flags)
> +{
> +       struct bpf_prog_array_item *item, *tmp;
> +       struct xtc_entry *entry, *peer;
> +       struct bpf_prog *oprog;
> +       bool created;
> +       int i, j;
> +
> +       ASSERT_RTNL();
> +
> +       entry = dev_xtc_entry_fetch(dev, ingress, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       for (i = 0; i < limit; i++) {
> +               item = &entry->items[i];
> +               oprog = item->prog;
> +               if (!oprog)
> +                       break;
> +               if (item->bpf_priority == prio) {
> +                       if (flags & BPF_F_REPLACE) {
> +                               /* Pairs with READ_ONCE() in xtc_run_progs(). */
> +                               WRITE_ONCE(item->prog, nprog);
> +                               bpf_prog_put(oprog);
> +                               dev_xtc_entry_prio_set(entry, prio, nprog);
> +                               return prio;
> +                       }
> +                       return -EBUSY;
> +               }
> +       }
> +       if (dev_xtc_entry_total(entry) >= limit)
> +               return -ENOSPC;
> +       prio = dev_xtc_entry_prio_new(entry, prio, nprog);
> +       if (prio < 0) {
> +               if (created)
> +                       dev_xtc_entry_free(entry);
> +               return -ENOMEM;
> +       }
> +       peer = dev_xtc_entry_peer(entry);
> +       dev_xtc_entry_clear(peer);
> +       for (i = 0, j = 0; i < limit; i++, j++) {
> +               item = &entry->items[i];
> +               tmp = &peer->items[j];
> +               oprog = item->prog;
> +               if (!oprog) {
> +                       if (i == j) {
> +                               tmp->prog = nprog;
> +                               tmp->bpf_priority = prio;
> +                       }
> +                       break;
> +               } else if (item->bpf_priority < prio) {
> +                       tmp->prog = oprog;
> +                       tmp->bpf_priority = item->bpf_priority;
> +               } else if (item->bpf_priority > prio) {
> +                       if (i == j) {
> +                               tmp->prog = nprog;
> +                               tmp->bpf_priority = prio;
> +                               tmp = &peer->items[++j];
> +                       }
> +                       tmp->prog = oprog;
> +                       tmp->bpf_priority = item->bpf_priority;
> +               }
> +       }
> +       dev_xtc_entry_update(dev, peer, ingress);
> +       if (ingress)
> +               net_inc_ingress_queue();
> +       else
> +               net_inc_egress_queue();
> +       xtc_inc();
> +       return prio;
> +}
> +
> +int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *nprog)
> +{
> +       struct net *net = current->nsproxy->net_ns;
> +       bool ingress = attr->attach_type == BPF_NET_INGRESS;
> +       struct net_device *dev;
> +       int ret;
> +
> +       if (attr->attach_flags & ~BPF_F_REPLACE)
> +               return -EINVAL;
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->target_ifindex);
> +       if (!dev) {
> +               rtnl_unlock();
> +               return -EINVAL;
> +       }
> +       ret = __xtc_prog_attach(dev, ingress, XTC_MAX_ENTRIES, nprog,
> +                               attr->attach_priority, attr->attach_flags);
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32 limit,
> +                            u32 prio)
> +{
> +       struct bpf_prog_array_item *item, *tmp;
> +       struct bpf_prog *oprog, *fprog = NULL;
> +       struct xtc_entry *entry, *peer;
> +       int i, j;
> +
> +       ASSERT_RTNL();
> +
> +       entry = ingress ?
> +               rcu_dereference_rtnl(dev->xtc_ingress) :
> +               rcu_dereference_rtnl(dev->xtc_egress);
> +       if (!entry)
> +               return -ENOENT;
> +       peer = dev_xtc_entry_peer(entry);
> +       dev_xtc_entry_clear(peer);
> +       for (i = 0, j = 0; i < limit; i++) {
> +               item = &entry->items[i];
> +               tmp = &peer->items[j];
> +               oprog = item->prog;
> +               if (!oprog)
> +                       break;
> +               if (item->bpf_priority != prio) {
> +                       tmp->prog = oprog;
> +                       tmp->bpf_priority = item->bpf_priority;
> +                       j++;
> +               } else {
> +                       fprog = oprog;
> +               }
> +       }
> +       if (fprog) {
> +               dev_xtc_entry_prio_del(peer, prio);
> +               if (dev_xtc_entry_total(peer) == 0 && !entry->parent->miniq)
> +                       peer = NULL;
> +               dev_xtc_entry_update(dev, peer, ingress);
> +               bpf_prog_put(fprog);
> +               if (!peer)
> +                       dev_xtc_entry_free(entry);
> +               if (ingress)
> +                       net_dec_ingress_queue();
> +               else
> +                       net_dec_egress_queue();
> +               xtc_dec();
> +               return 0;
> +       }
> +       return -ENOENT;
> +}
> +
> +int xtc_prog_detach(const union bpf_attr *attr)
> +{
> +       struct net *net = current->nsproxy->net_ns;
> +       bool ingress = attr->attach_type == BPF_NET_INGRESS;
> +       struct net_device *dev;
> +       int ret;
> +
> +       if (attr->attach_flags || !attr->attach_priority)
> +               return -EINVAL;
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->target_ifindex);
> +       if (!dev) {
> +               rtnl_unlock();
> +               return -EINVAL;
> +       }
> +       ret = __xtc_prog_detach(dev, ingress, XTC_MAX_ENTRIES,
> +                               attr->attach_priority);
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static void __xtc_prog_detach_all(struct net_device *dev, bool ingress, u32 limit)
> +{
> +       struct bpf_prog_array_item *item;
> +       struct xtc_entry *entry;
> +       struct bpf_prog *prog;
> +       int i;
> +
> +       ASSERT_RTNL();
> +
> +       entry = ingress ?
> +               rcu_dereference_rtnl(dev->xtc_ingress) :
> +               rcu_dereference_rtnl(dev->xtc_egress);
> +       if (!entry)
> +               return;
> +       dev_xtc_entry_update(dev, NULL, ingress);
> +       for (i = 0; i < limit; i++) {
> +               item = &entry->items[i];
> +               prog = item->prog;
> +               if (!prog)
> +                       break;
> +               dev_xtc_entry_prio_del(entry, item->bpf_priority);
> +               bpf_prog_put(prog);
> +               if (ingress)
> +                       net_dec_ingress_queue();
> +               else
> +                       net_dec_egress_queue();
> +               xtc_dec();
> +       }
> +       dev_xtc_entry_free(entry);
> +}
> +
> +void dev_xtc_uninstall(struct net_device *dev)
> +{
> +       __xtc_prog_detach_all(dev, true,  XTC_MAX_ENTRIES + 1);
> +       __xtc_prog_detach_all(dev, false, XTC_MAX_ENTRIES + 1);
> +}
> +
> +static int
> +__xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> +                struct net_device *dev, bool ingress, u32 limit)
> +{
> +       struct bpf_query_info info, __user *uinfo;
> +       struct bpf_prog_array_item *item;
> +       struct xtc_entry *entry;
> +       struct bpf_prog *prog;
> +       u32 i, flags = 0, cnt;
> +       int ret = 0;
> +
> +       ASSERT_RTNL();
> +
> +       entry = ingress ?
> +               rcu_dereference_rtnl(dev->xtc_ingress) :
> +               rcu_dereference_rtnl(dev->xtc_egress);
> +       if (!entry)
> +               return -ENOENT;
> +       cnt = dev_xtc_entry_total(entry);
> +       if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
> +               return -EFAULT;
> +       if (copy_to_user(&uattr->query.prog_cnt, &cnt, sizeof(cnt)))
> +               return -EFAULT;
> +       uinfo = u64_to_user_ptr(attr->query.prog_ids);
> +       if (attr->query.prog_cnt == 0 || !uinfo || !cnt)
> +               /* return early if user requested only program count + flags */
> +               return 0;
> +       if (attr->query.prog_cnt < cnt) {
> +               cnt = attr->query.prog_cnt;
> +               ret = -ENOSPC;
> +       }
> +       for (i = 0; i < limit; i++) {
> +               item = &entry->items[i];
> +               prog = item->prog;
> +               if (!prog)
> +                       break;
> +               info.prog_id = prog->aux->id;
> +               info.prio = item->bpf_priority;
> +               if (copy_to_user(uinfo + i, &info, sizeof(info)))
> +                       return -EFAULT;
> +               if (i + 1 == cnt)
> +                       break;
> +       }
> +       return ret;
> +}
> +
> +int xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> +       struct net *net = current->nsproxy->net_ns;
> +       bool ingress = attr->query.attach_type == BPF_NET_INGRESS;
> +       struct net_device *dev;
> +       int ret;
> +
> +       if (attr->query.query_flags || attr->query.attach_flags)
> +               return -EINVAL;
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->query.target_ifindex);
> +       if (!dev) {
> +               rtnl_unlock();
> +               return -EINVAL;
> +       }
> +       ret = __xtc_prog_query(attr, uattr, dev, ingress, XTC_MAX_ENTRIES);
> +       rtnl_unlock();
> +       return ret;
> +}
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 7b373a5e861f..a0a670b964bb 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -36,6 +36,8 @@
>  #include <linux/memcontrol.h>
>  #include <linux/trace_events.h>
>
> +#include <net/xtc.h>
> +
>  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>                           (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
>                           (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> @@ -3448,6 +3450,9 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>                 return BPF_PROG_TYPE_XDP;
>         case BPF_LSM_CGROUP:
>                 return BPF_PROG_TYPE_LSM;
> +       case BPF_NET_INGRESS:
> +       case BPF_NET_EGRESS:
> +               return BPF_PROG_TYPE_SCHED_CLS;
>         default:
>                 return BPF_PROG_TYPE_UNSPEC;
>         }
> @@ -3466,18 +3471,15 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>
>         if (CHECK_ATTR(BPF_PROG_ATTACH))
>                 return -EINVAL;
> -
>         if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
>                 return -EINVAL;
>
>         ptype = attach_type_to_prog_type(attr->attach_type);
>         if (ptype == BPF_PROG_TYPE_UNSPEC)
>                 return -EINVAL;
> -
>         prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
>         if (IS_ERR(prog))
>                 return PTR_ERR(prog);
> -
>         if (bpf_prog_attach_check_attach_type(prog, attr->attach_type)) {
>                 bpf_prog_put(prog);
>                 return -EINVAL;
> @@ -3508,16 +3510,18 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>
>                 ret = cgroup_bpf_prog_attach(attr, ptype, prog);
>                 break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               ret = xtc_prog_attach(attr, prog);
> +               break;
>         default:
>                 ret = -EINVAL;
>         }
> -
> -       if (ret)
> +       if (ret < 0)
>                 bpf_prog_put(prog);
>         return ret;
>  }
>
> -#define BPF_PROG_DETACH_LAST_FIELD attach_type
> +#define BPF_PROG_DETACH_LAST_FIELD replace_bpf_fd
>
>  static int bpf_prog_detach(const union bpf_attr *attr)
>  {
> @@ -3527,6 +3531,9 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>                 return -EINVAL;
>
>         ptype = attach_type_to_prog_type(attr->attach_type);
> +       if (ptype != BPF_PROG_TYPE_SCHED_CLS &&
> +           (attr->attach_flags || attr->replace_bpf_fd))
> +               return -EINVAL;
>
>         switch (ptype) {
>         case BPF_PROG_TYPE_SK_MSG:
> @@ -3545,6 +3552,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>         case BPF_PROG_TYPE_SOCK_OPS:
>         case BPF_PROG_TYPE_LSM:
>                 return cgroup_bpf_prog_detach(attr, ptype);
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               return xtc_prog_detach(attr);
>         default:
>                 return -EINVAL;
>         }
> @@ -3598,6 +3607,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
>         case BPF_SK_MSG_VERDICT:
>         case BPF_SK_SKB_VERDICT:
>                 return sock_map_bpf_prog_query(attr, uattr);
> +       case BPF_NET_INGRESS:
> +       case BPF_NET_EGRESS:
> +               return xtc_prog_query(attr, uattr);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/net/Kconfig b/net/Kconfig
> index 48c33c222199..b7a9cd174464 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -52,6 +52,11 @@ config NET_INGRESS
>  config NET_EGRESS
>         bool
>
> +config NET_XGRESS
> +       select NET_INGRESS
> +       select NET_EGRESS
> +       bool
> +
>  config NET_REDIRECT
>         bool
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa53830d0683..552b805c27dd 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -107,6 +107,7 @@
>  #include <net/pkt_cls.h>
>  #include <net/checksum.h>
>  #include <net/xfrm.h>
> +#include <net/xtc.h>
>  #include <linux/highmem.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
> @@ -154,7 +155,6 @@
>  #include "dev.h"
>  #include "net-sysfs.h"
>
> -
>  static DEFINE_SPINLOCK(ptype_lock);
>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>  struct list_head ptype_all __read_mostly;      /* Taps */
> @@ -3935,69 +3935,199 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
>  EXPORT_SYMBOL(dev_loopback_xmit);
>
>  #ifdef CONFIG_NET_EGRESS
> -static struct sk_buff *
> -sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +static struct netdev_queue *
> +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> +{
> +       int qm = skb_get_queue_mapping(skb);
> +
> +       return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> +}
> +
> +static bool netdev_xmit_txqueue_skipped(void)
> +{
> +       return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +}
> +
> +void netdev_xmit_skip_txqueue(bool skip)
> +{
> +       __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +}
> +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> +#endif /* CONFIG_NET_EGRESS */
> +
> +#ifdef CONFIG_NET_XGRESS
> +static int tc_run(struct xtc_entry *entry, struct sk_buff *skb)
>  {
> +       int ret = TC_ACT_UNSPEC;
>  #ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
> -       struct tcf_result cl_res;
> +       struct mini_Qdisc *miniq = rcu_dereference_bh(entry->parent->miniq);
> +       struct tcf_result res;
>
>         if (!miniq)
> -               return skb;
> +               return ret;
>
> -       /* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
>         tc_skb_cb(skb)->mru = 0;
>         tc_skb_cb(skb)->post_ct = false;
> -       mini_qdisc_bstats_cpu_update(miniq, skb);
>
> -       switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> +       mini_qdisc_bstats_cpu_update(miniq, skb);
> +       ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
> +       /* Only tcf related quirks below. */
> +       switch (ret) {
> +       case TC_ACT_SHOT:
> +               mini_qdisc_qstats_cpu_drop(miniq);
> +               break;
>         case TC_ACT_OK:
>         case TC_ACT_RECLASSIFY:
> -               skb->tc_index = TC_H_MIN(cl_res.classid);
> +               skb->tc_index = TC_H_MIN(res.classid);
>                 break;
> +       }
> +#endif /* CONFIG_NET_CLS_ACT */
> +       return ret;
> +}
> +
> +static DEFINE_STATIC_KEY_FALSE(xtc_needed_key);
> +
> +void xtc_inc(void)
> +{
> +       static_branch_inc(&xtc_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(xtc_inc);
> +
> +void xtc_dec(void)
> +{
> +       static_branch_dec(&xtc_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(xtc_dec);
> +
> +static __always_inline enum tc_action_base
> +xtc_run(const struct xtc_entry *entry, struct sk_buff *skb,
> +       const bool needs_mac)
> +{
> +       const struct bpf_prog_array_item *item;
> +       const struct bpf_prog *prog;
> +       int ret = TC_NEXT;
> +
> +       if (needs_mac)
> +               __skb_push(skb, skb->mac_len);
> +       item = &entry->items[0];
> +       while ((prog = READ_ONCE(item->prog))) {
> +               bpf_compute_data_pointers(skb);
> +               ret = bpf_prog_run(prog, skb);
> +               if (ret != TC_NEXT)
> +                       break;
> +               item++;
> +       }
> +       if (needs_mac)
> +               __skb_pull(skb, skb->mac_len);
> +       return xtc_action_code(skb, ret);
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> +                  struct net_device *orig_dev, bool *another)
> +{
> +       struct xtc_entry *entry = rcu_dereference_bh(skb->dev->xtc_ingress);
> +       int sch_ret;
> +
> +       if (!entry)
> +               return skb;
> +       if (*pt_prev) {
> +               *ret = deliver_skb(skb, *pt_prev, orig_dev);
> +               *pt_prev = NULL;
> +       }
> +
> +       qdisc_skb_cb(skb)->pkt_len = skb->len;
> +       xtc_set_ingress(skb, true);
> +
> +       if (static_branch_unlikely(&xtc_needed_key)) {
> +               sch_ret = xtc_run(entry, skb, true);
> +               if (sch_ret != TC_ACT_UNSPEC)
> +                       goto ingress_verdict;
> +       }
> +       sch_ret = tc_run(entry, skb);
> +ingress_verdict:
> +       switch (sch_ret) {
> +       case TC_ACT_REDIRECT:
> +               /* skb_mac_header check was done by BPF, so we can safely
> +                * push the L2 header back before redirecting to another
> +                * netdev.
> +                */
> +               __skb_push(skb, skb->mac_len);
> +               if (skb_do_redirect(skb) == -EAGAIN) {
> +                       __skb_pull(skb, skb->mac_len);
> +                       *another = true;
> +                       break;
> +               }
> +               return NULL;
>         case TC_ACT_SHOT:
> -               mini_qdisc_qstats_cpu_drop(miniq);
> -               *ret = NET_XMIT_DROP;
> -               kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +               kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
>                 return NULL;
> +       /* used by tc_run */
>         case TC_ACT_STOLEN:
>         case TC_ACT_QUEUED:
>         case TC_ACT_TRAP:
> -               *ret = NET_XMIT_SUCCESS;
>                 consume_skb(skb);
> +               fallthrough;
> +       case TC_ACT_CONSUMED:
>                 return NULL;
> +       }
> +
> +       return skb;
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +{
> +       struct xtc_entry *entry = rcu_dereference_bh(dev->xtc_egress);
> +       int sch_ret;
> +
> +       if (!entry)
> +               return skb;
> +
> +       /* qdisc_skb_cb(skb)->pkt_len & xtc_set_ingress() was
> +        * already set by the caller.
> +        */
> +       if (static_branch_unlikely(&xtc_needed_key)) {
> +               sch_ret = xtc_run(entry, skb, false);
> +               if (sch_ret != TC_ACT_UNSPEC)
> +                       goto egress_verdict;
> +       }
> +       sch_ret = tc_run(entry, skb);
> +egress_verdict:
> +       switch (sch_ret) {
>         case TC_ACT_REDIRECT:
> +               *ret = NET_XMIT_SUCCESS;
>                 /* No need to push/pop skb's mac_header here on egress! */
>                 skb_do_redirect(skb);
> +               return NULL;
> +       case TC_ACT_SHOT:
> +               *ret = NET_XMIT_DROP;
> +               kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +               return NULL;
> +       /* used by tc_run */
> +       case TC_ACT_STOLEN:
> +       case TC_ACT_QUEUED:
> +       case TC_ACT_TRAP:
>                 *ret = NET_XMIT_SUCCESS;
>                 return NULL;
> -       default:
> -               break;
>         }
> -#endif /* CONFIG_NET_CLS_ACT */
>
>         return skb;
>  }
> -
> -static struct netdev_queue *
> -netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> -{
> -       int qm = skb_get_queue_mapping(skb);
> -
> -       return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> -}
> -
> -static bool netdev_xmit_txqueue_skipped(void)
> +#else
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> +                  struct net_device *orig_dev, bool *another)
>  {
> -       return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +       return skb;
>  }
>
> -void netdev_xmit_skip_txqueue(bool skip)
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  {
> -       __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +       return skb;
>  }
> -EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> -#endif /* CONFIG_NET_EGRESS */
> +#endif /* CONFIG_NET_XGRESS */
>
>  #ifdef CONFIG_XPS
>  static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
> @@ -4181,9 +4311,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>         skb_update_prio(skb);
>
>         qdisc_pkt_len_init(skb);
> -#ifdef CONFIG_NET_CLS_ACT
> -       skb->tc_at_ingress = 0;
> -#endif
> +       xtc_set_ingress(skb, false);
>  #ifdef CONFIG_NET_EGRESS
>         if (static_branch_unlikely(&egress_needed_key)) {
>                 if (nf_hook_egress_active()) {
> @@ -5101,68 +5229,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
>  EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
>  #endif
>
> -static inline struct sk_buff *
> -sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> -                  struct net_device *orig_dev, bool *another)
> -{
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
> -       struct tcf_result cl_res;
> -
> -       /* If there's at least one ingress present somewhere (so
> -        * we get here via enabled static key), remaining devices
> -        * that are not configured with an ingress qdisc will bail
> -        * out here.
> -        */
> -       if (!miniq)
> -               return skb;
> -
> -       if (*pt_prev) {
> -               *ret = deliver_skb(skb, *pt_prev, orig_dev);
> -               *pt_prev = NULL;
> -       }
> -
> -       qdisc_skb_cb(skb)->pkt_len = skb->len;
> -       tc_skb_cb(skb)->mru = 0;
> -       tc_skb_cb(skb)->post_ct = false;
> -       skb->tc_at_ingress = 1;
> -       mini_qdisc_bstats_cpu_update(miniq, skb);
> -
> -       switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> -       case TC_ACT_OK:
> -       case TC_ACT_RECLASSIFY:
> -               skb->tc_index = TC_H_MIN(cl_res.classid);
> -               break;
> -       case TC_ACT_SHOT:
> -               mini_qdisc_qstats_cpu_drop(miniq);
> -               kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> -               return NULL;
> -       case TC_ACT_STOLEN:
> -       case TC_ACT_QUEUED:
> -       case TC_ACT_TRAP:
> -               consume_skb(skb);
> -               return NULL;
> -       case TC_ACT_REDIRECT:
> -               /* skb_mac_header check was done by cls/act_bpf, so
> -                * we can safely push the L2 header back before
> -                * redirecting to another netdev
> -                */
> -               __skb_push(skb, skb->mac_len);
> -               if (skb_do_redirect(skb) == -EAGAIN) {
> -                       __skb_pull(skb, skb->mac_len);
> -                       *another = true;
> -                       break;
> -               }
> -               return NULL;
> -       case TC_ACT_CONSUMED:
> -               return NULL;
> -       default:
> -               break;
> -       }
> -#endif /* CONFIG_NET_CLS_ACT */
> -       return skb;
> -}
> -
>  /**
>   *     netdev_is_rx_handler_busy - check if receive handler is registered
>   *     @dev: device to check
> @@ -10832,7 +10898,7 @@ void unregister_netdevice_many(struct list_head *head)
>
>                 /* Shutdown queueing discipline. */
>                 dev_shutdown(dev);
> -
> +               dev_xtc_uninstall(dev);
>                 dev_xdp_uninstall(dev);
>
>                 netdev_offload_xstats_disable_all(dev);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bb0136e7a8e4..ac4bb016c5ee 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9132,7 +9132,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
>         __u8 value_reg = si->dst_reg;
>         __u8 skb_reg = si->src_reg;
>
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         /* If the tstamp_type is read,
>          * the bpf prog is aware the tstamp could have delivery time.
>          * Thus, read skb->tstamp as is if tstamp_type_access is true.
> @@ -9166,7 +9166,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
>         __u8 value_reg = si->src_reg;
>         __u8 skb_reg = si->dst_reg;
>
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         /* If the tstamp_type is read,
>          * the bpf prog is aware the tstamp could have delivery time.
>          * Thus, write skb->tstamp as is if tstamp_type_access is true.
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 1e8ab4749c6c..c1b8f2e7d966 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -382,8 +382,7 @@ config NET_SCH_FQ_PIE
>  config NET_SCH_INGRESS
>         tristate "Ingress/classifier-action Qdisc"
>         depends on NET_CLS_ACT
> -       select NET_INGRESS
> -       select NET_EGRESS
> +       select NET_XGRESS
>         help
>           Say Y here if you want to use classifiers for incoming and/or outgoing
>           packets. This qdisc doesn't do anything else besides running classifiers,
> @@ -753,6 +752,7 @@ config NET_EMATCH_IPT
>  config NET_CLS_ACT
>         bool "Actions"
>         select NET_CLS
> +       select NET_XGRESS
>         help
>           Say Y here if you want to use traffic control actions. Actions
>           get attached to classifiers and are invoked after a successful
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..3bd37ee898ce 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -13,6 +13,7 @@
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
>  #include <net/pkt_cls.h>
> +#include <net/xtc.h>
>
>  struct ingress_sched_data {
>         struct tcf_block *block;
> @@ -78,11 +79,19 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>  {
>         struct ingress_sched_data *q = qdisc_priv(sch);
>         struct net_device *dev = qdisc_dev(sch);
> +       struct xtc_entry *entry;
> +       bool created;
>         int err;
>
>         net_inc_ingress_queue();
>
> -       mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> +       entry = dev_xtc_entry_fetch(dev, true, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +
> +       mini_qdisc_pair_init(&q->miniqp, sch, &entry->parent->miniq);
> +       if (created)
> +               dev_xtc_entry_update(dev, entry, true);
>
>         q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>         q->block_info.chain_head_change = clsact_chain_head_change;
> @@ -93,15 +102,20 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>                 return err;
>
>         mini_qdisc_pair_block_init(&q->miniqp, q->block);
> -
>         return 0;
>  }
>
>  static void ingress_destroy(struct Qdisc *sch)
>  {
>         struct ingress_sched_data *q = qdisc_priv(sch);
> +       struct net_device *dev = qdisc_dev(sch);
> +       struct xtc_entry *entry = rtnl_dereference(dev->xtc_ingress);
>
>         tcf_block_put_ext(q->block, sch, &q->block_info);
> +       if (entry && dev_xtc_entry_total(entry) == 0) {
> +               dev_xtc_entry_update(dev, NULL, true);
> +               dev_xtc_entry_free(entry);
> +       }
>         net_dec_ingress_queue();
>  }
>
> @@ -217,12 +231,20 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  {
>         struct clsact_sched_data *q = qdisc_priv(sch);
>         struct net_device *dev = qdisc_dev(sch);
> +       struct xtc_entry *entry;
> +       bool created;
>         int err;
>
>         net_inc_ingress_queue();
>         net_inc_egress_queue();
>
> -       mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
> +       entry = dev_xtc_entry_fetch(dev, true, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +
> +       mini_qdisc_pair_init(&q->miniqp_ingress, sch, &entry->parent->miniq);
> +       if (created)
> +               dev_xtc_entry_update(dev, entry, true);
>
>         q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>         q->ingress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -235,7 +257,13 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>
>         mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
>
> -       mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
> +       entry = dev_xtc_entry_fetch(dev, false, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +
> +       mini_qdisc_pair_init(&q->miniqp_egress, sch, &entry->parent->miniq);
> +       if (created)
> +               dev_xtc_entry_update(dev, entry, false);
>
>         q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
>         q->egress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -247,9 +275,21 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  static void clsact_destroy(struct Qdisc *sch)
>  {
>         struct clsact_sched_data *q = qdisc_priv(sch);
> +       struct net_device *dev = qdisc_dev(sch);
> +       struct xtc_entry *ingress_entry = rtnl_dereference(dev->xtc_ingress);
> +       struct xtc_entry *egress_entry = rtnl_dereference(dev->xtc_egress);
>
>         tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
> +       if (egress_entry && dev_xtc_entry_total(egress_entry) == 0) {
> +               dev_xtc_entry_update(dev, NULL, false);
> +               dev_xtc_entry_free(egress_entry);
> +       }
> +
>         tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
> +       if (ingress_entry && dev_xtc_entry_total(ingress_entry) == 0) {
> +               dev_xtc_entry_update(dev, NULL, true);
> +               dev_xtc_entry_free(ingress_entry);
> +       }
>
>         net_dec_ingress_queue();
>         net_dec_egress_queue();
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
>         BPF_PERF_EVENT,
>         BPF_TRACE_KPROBE_MULTI,
>         BPF_LSM_CGROUP,
> +       BPF_NET_INGRESS,
> +       BPF_NET_EGRESS,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1399,14 +1401,20 @@ union bpf_attr {
>         };
>
>         struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -               __u32           target_fd;      /* container object to attach to */
> +               union {
> +                       __u32   target_fd;      /* container object to attach to */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };
>                 __u32           attach_bpf_fd;  /* eBPF program to attach */
>                 __u32           attach_type;
>                 __u32           attach_flags;
> -               __u32           replace_bpf_fd; /* previously attached eBPF
> +               union {
> +                       __u32   attach_priority;
> +                       __u32   replace_bpf_fd; /* previously attached eBPF
>                                                  * program to replace if
>                                                  * BPF_F_REPLACE is used
>                                                  */
> +               };
>         };
>
>         struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
>         } info;
>
>         struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -               __u32           target_fd;      /* container object to query */
> +               union {
> +                       __u32   target_fd;      /* container object to query */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };
>                 __u32           attach_type;
>                 __u32           query_flags;
>                 __u32           attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
>         };
>  };
>
> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> +       TC_NEXT         = -1,
> +       TC_PASS         = 0,
> +       TC_DROP         = 2,
> +       TC_REDIRECT     = 7,
> +};
> +
>  struct bpf_xdp_sock {
>         __u32 queue_id;
>  };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
>         __be32  flow_label;
>  };
>
> +struct bpf_query_info {
> +       __u32 prog_id;
> +       __u32 prio;
> +};
> +
>  struct bpf_func_info {
>         __u32   insn_off;
>         __u32   type_id;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
                     ` (2 preceding siblings ...)
  2022-10-05 19:04   ` Jamal Hadi Salim
@ 2022-10-06  0:22   ` Andrii Nakryiko
  2022-10-06  5:00   ` Alexei Starovoitov
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  0:22 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This work refactors and adds a lightweight extension to the tc BPF ingress
> and egress data path side for allowing BPF programs via an fd-based attach /
> detach API. The main goal behind this work, which we also presented at LPC [0]
> this year, is to eventually add support for BPF links for tc BPF programs in
> a second step; this prep work is required for the latter, which allows for a
> model of safe ownership and program detachment. Given the vast rise in tc BPF
> users in cloud native / Kubernetes environments, this becomes necessary to
> avoid hard-to-debug incidents either through stale leftover programs or 3rd
> party applications stepping on each other's toes. Further details on the BPF
> link rationale are in the next patch.
>
> For the current tc framework, there is no change in behavior, and this work
> does not touch on core tc kernel APIs. The gist of this patch is that the
> ingress and egress hooks get a lightweight, qdisc-less extension for BPF to
> attach its tc BPF programs, in other words, a minimal tc-layer entry point
> for BPF. As part of the feedback from LPC, there was a suggestion to provide
> a name for this infrastructure to more easily differentiate between the
> classic cls_bpf attachment and the fd-based API. For most, the XDP vs tc
> layering is already the default mental model for the packet processing
> pipeline. We refactored this with an xtc internal prefix aka 'express traffic
> control' in order to avoid deviating too far (and 'express' given its more
> lightweight/faster entry point).
>
> For the ingress and egress xtc points, the device holds a cache-friendly array
> with programs. Same as with classic tc, programs are attached with a prio that
> can be specified or auto-allocated through an idr, and the program return code
> determines whether to continue in the pipeline or to terminate processing.
> With the TC_ACT_UNSPEC code, processing continues (as is the case today). The
> goal was to have maximum compatibility with existing tc BPF programs, so they
> don't need to be adapted. Compatibility for calling into classic tcf_classify()
> is also provided in order to allow successive migration, or for both to cleanly
> co-exist where needed, given it is one logical layer. The fd-based API is
> behind a static key, so that when unused the code is not entered. The struct
> xtc_entry's program array is currently static, but could be made dynamic if
> necessary at a point in the future. Desire has also been expressed for future
> work to adapt a similar framework for XDP to allow multi-attach from the
> in-kernel side, too.
>
> Tested with the tc-testing selftest suite, which fully passes, as well as the
> tc BPF tests from the BPF CI.
>
>   [0] https://lpc.events/event/16/contributions/1353/
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/bpf.h            |   1 +
>  include/linux/netdevice.h      |  14 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/xtc.h              | 181 ++++++++++++++++++++++
>  include/uapi/linux/bpf.h       |  35 ++++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/net.c               | 274 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  24 ++-
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 262 +++++++++++++++++++------------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  48 +++++-
>  tools/include/uapi/linux/bpf.h |  35 ++++-
>  17 files changed, 769 insertions(+), 130 deletions(-)
>  create mode 100644 include/net/xtc.h
>  create mode 100644 kernel/bpf/net.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e55a4d47324c..bb63d8d000ea 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3850,13 +3850,15 @@ S:      Maintained
>  F:     kernel/trace/bpf_trace.c
>  F:     kernel/bpf/stackmap.c
>
> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (xtc & tc BPF, sock_addr)
>  M:     Martin KaFai Lau <martin.lau@linux.dev>
>  M:     Daniel Borkmann <daniel@iogearbox.net>
>  R:     John Fastabend <john.fastabend@gmail.com>
>  L:     bpf@vger.kernel.org
>  L:     netdev@vger.kernel.org
>  S:     Maintained
> +F:     include/net/xtc.h
> +F:     kernel/bpf/net.c
>  F:     net/core/filter.c
>  F:     net/sched/act_bpf.c
>  F:     net/sched/cls_bpf.c
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e7d46d16032..71e5f43db378 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1473,6 +1473,7 @@ struct bpf_prog_array_item {
>         union {
>                 struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>                 u64 bpf_cookie;
> +               u32 bpf_priority;

So this looks unfortunate and unnecessary. You are basically saying no
BPF cookie for this new TC/XTC/TCX thingy. But there is no need: we
already reserve 2 * 8 bytes for cgroup_storage, so make bpf_cookie and
bpf_prio co-exist with

union {
    struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
    struct {
        u64 bpf_cookie;
        u32 bpf_priority;
    };
}

or is there some problem with that?

>         };
>  };
>

[...]

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
>         BPF_PERF_EVENT,
>         BPF_TRACE_KPROBE_MULTI,
>         BPF_LSM_CGROUP,
> +       BPF_NET_INGRESS,
> +       BPF_NET_EGRESS,

I can do some bikeshedding as well :) Shouldn't these be TC/TCX-specific
attach types? Wouldn't BPF_[X]TC[X]_INGRESS/BPF_[X]TC[X]_EGRESS be more
appropriate? Because when you think about it, XDP is also NET, right? So
I find NET really meaning TC a bit confusing.

>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1399,14 +1401,20 @@ union bpf_attr {
>         };
>
>         struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -               __u32           target_fd;      /* container object to attach to */
> +               union {
> +                       __u32   target_fd;      /* container object to attach to */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };

this makes total sense (target can be FD or ifindex, we have that in
LINK_CREATE as well)

>                 __u32           attach_bpf_fd;  /* eBPF program to attach */
>                 __u32           attach_type;
>                 __u32           attach_flags;
> -               __u32           replace_bpf_fd; /* previously attached eBPF
> +               union {
> +                       __u32   attach_priority;
> +                       __u32   replace_bpf_fd; /* previously attached eBPF
>                                                  * program to replace if
>                                                  * BPF_F_REPLACE is used
>                                                  */
> +               };

But this union seems 1) unnecessary (we don't have to save those 4
bytes), and also 2) wouldn't it make sense to support replace_bpf_fd
with BPF_F_REPLACE (at a given prio, if I understand correctly)? It's
an equivalent situation to what we had in cgroup programs land before
we got bpf_link. So staying consistent makes sense, unless I missed
something?
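
(For illustration, a rough user-space sketch of the replace-at-prio
semantics, assuming the uapi additions from patch 01; the "only replace
if the expected old prog is still the one attached" check via
replace_bpf_fd is the hypothetical part, since that field currently
shares the union slot with attach_priority:)

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>	/* patched uapi header from this series */

/* sketch: swap out whatever program sits at @prio on @ifindex ingress */
static int tc_replace_at_prio(int ifindex, int new_fd, __u32 prio)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.target_ifindex  = ifindex;
	attr.attach_type     = BPF_NET_INGRESS;
	attr.attach_bpf_fd   = new_fd;
	attr.attach_flags    = BPF_F_REPLACE;
	attr.attach_priority = prio;	/* slot to replace */

	/* this series appears to return the effective prio (>= 0) on
	 * success and a negative error otherwise
	 */
	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}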

>         };
>
>         struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
>         } info;
>
>         struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -               __u32           target_fd;      /* container object to query */
> +               union {
> +                       __u32   target_fd;      /* container object to query */
> +                       __u32   target_ifindex; /* target ifindex */
> +               };
>                 __u32           attach_type;
>                 __u32           query_flags;
>                 __u32           attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
>         };
>  };
>
> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> +       TC_NEXT         = -1,
> +       TC_PASS         = 0,
> +       TC_DROP         = 2,
> +       TC_REDIRECT     = 7,
> +};
> +
>  struct bpf_xdp_sock {
>         __u32 queue_id;
>  };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
>         __be32  flow_label;
>  };
>
> +struct bpf_query_info {

this is something that's returned from the BPF_PROG_QUERY command, right?
Shouldn't it be called bpf_prog_query_info or something like that?
Just "query_info" is very generic, IMO, but if we are sure that there
will never be any other "QUERY" command, I guess it might be fine.

> +       __u32 prog_id;
> +       __u32 prio;
> +};
> +
>  struct bpf_func_info {
>         __u32   insn_off;
>         __u32   type_id;

[...]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 02/10] bpf: Implement BPF link handling for tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 02/10] bpf: Implement BPF link handling for " Daniel Borkmann
@ 2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-06 20:54     ` Daniel Borkmann
  2022-10-06 17:56   ` Martin KaFai Lau
  2022-10-06 20:10   ` Martin KaFai Lau
  2 siblings, 1 reply; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  3:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This work adds BPF links for tc. As a recap, a BPF link represents the attachment
> of a BPF program to a BPF hook point. The BPF link holds a single reference to
> keep the BPF program alive. Moreover, hook points do not reference a BPF link;
> only the application's fd or pinning does. A BPF link holds meta-data specific
> to the attachment and implements operations for link creation, (atomic) BPF
> program update, detachment and introspection.
>
> The motivation for BPF links for tc BPF programs is multi-fold, for example:
>
> - "It's especially important for applications that are deployed fleet-wide
>    and that don't "control" hosts they are deployed to. If such application
>    crashes and no one notices and does anything about that, BPF program will
>    keep running draining resources or even just, say, dropping packets. We
>    at FB had outages due to such permanent BPF attachment semantics. With
>    fd-based BPF link we are getting a framework, which allows safe, auto-
>    detachable behavior by default, unless application explicitly opts in by
>    pinning the BPF link." [0]
>
> -  From Cilium's side, the tc BPF programs we attach to host-facing veth devices
>    and phys devices build the core datapath for Kubernetes Pods, and they
>    implement forwarding, load-balancing, policy, EDT-management, etc, within
>    BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
>    experienced hard-to-debug issues in a user's staging environment where
>    another Kubernetes application using tc BPF attached to the same prio/handle
>    of cls_bpf, wiping all Cilium-based BPF programs from underneath it. The
>    goal is to establish a clear/safe ownership model via links which cannot
>    accidentally be overridden. [1]
>
> BPF links for tc can co-exist with non-link attachments, and the semantics are
> in line also with XDP links: BPF links cannot replace other BPF links, BPF
> links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
> lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
> would solve mentioned issue of safe ownership model as 3rd party applications
> would not be able to accidentally wipe Cilium programs, even if they are not
> BPF link aware.
>
> Earlier attempts [2] have tried to integrate BPF links into core tc machinery
> to solve cls_bpf, which has been intrusive to the generic tc kernel API with
> extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
> be wiped from the qdisc also. Locking a tc BPF program in place this way gets
> into layering hacks, given the two object models are vastly different. We
> chose to implement a prerequisite of the fd-based tc BPF attach API, so that
> the BPF link implementation fits in naturally, similar to other link types
> which are fd-based, and without the need for changing core tc internal APIs.
>
> BPF programs for tc can then be successively migrated from cls_bpf to the new
> tc BPF link without needing to change the program's source code, just the BPF
> loader mechanics for attaching.
>
>   [0] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com/
>   [1] https://lpc.events/event/16/contributions/1353/
>   [2] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com/
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---

have you considered supporting the BPF cookie from the outset? It should
be trivial if you remove the union from bpf_prog_array_item. If not, then
we should reject LINK_CREATE if bpf_cookie is non-zero.

>  include/linux/bpf.h            |   5 +-
>  include/net/xtc.h              |  14 ++++
>  include/uapi/linux/bpf.h       |   5 ++
>  kernel/bpf/net.c               | 116 ++++++++++++++++++++++++++++++---
>  kernel/bpf/syscall.c           |   3 +
>  tools/include/uapi/linux/bpf.h |   5 ++
>  6 files changed, 139 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 71e5f43db378..226a74f65704 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1473,7 +1473,10 @@ struct bpf_prog_array_item {
>         union {
>                 struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>                 u64 bpf_cookie;
> -               u32 bpf_priority;
> +               struct {
> +                       u32 bpf_priority;
> +                       u32 bpf_id;

this is link_id, is that right? should we name it as such?

> +               };
>         };
>  };
>

[...]

> diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
> index ab9a9dee615b..22b7a9b05483 100644
> --- a/kernel/bpf/net.c
> +++ b/kernel/bpf/net.c
> @@ -8,7 +8,7 @@
>  #include <net/xtc.h>
>
>  static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
> -                            struct bpf_prog *nprog, u32 prio, u32 flags)
> +                            u32 id, struct bpf_prog *nprog, u32 prio, u32 flags)

similarly here, id -> link_id or something like that; it's quite
confusing what kind of ID it is otherwise

>  {
>         struct bpf_prog_array_item *item, *tmp;
>         struct xtc_entry *entry, *peer;
> @@ -27,10 +27,13 @@ static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
>                 if (!oprog)
>                         break;
>                 if (item->bpf_priority == prio) {
> -                       if (flags & BPF_F_REPLACE) {
> +                       if (item->bpf_id == id &&
> +                           (flags & BPF_F_REPLACE)) {

[...]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 03/10] bpf: Implement link update for tc BPF link programs
  2022-10-04 23:11 ` [PATCH bpf-next 03/10] bpf: Implement link update for tc BPF link programs Daniel Borkmann
@ 2022-10-06  3:19   ` Andrii Nakryiko
  0 siblings, 0 replies; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  3:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Add support for LINK_UPDATE command for tc BPF link to allow for a reliable
> replacement of the underlying BPF program.
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  kernel/bpf/net.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
> index 22b7a9b05483..c50bcf656b3f 100644
> --- a/kernel/bpf/net.c
> +++ b/kernel/bpf/net.c
> @@ -303,6 +303,39 @@ static int __xtc_link_attach(struct bpf_link *l, u32 id)
>         return ret;
>  }
>
> +static int xtc_link_update(struct bpf_link *l, struct bpf_prog *nprog,
> +                          struct bpf_prog *oprog)
> +{
> +       struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
> +       int ret = 0;
> +
> +       rtnl_lock();
> +       if (!link->dev) {
> +               ret = -ENOLINK;
> +               goto out;
> +       }
> +       if (oprog && l->prog != oprog) {
> +               ret = -EPERM;
> +               goto out;
> +       }
> +       oprog = l->prog;
> +       if (oprog == nprog) {
> +               bpf_prog_put(nprog);
> +               goto out;
> +       }
> +       ret = __xtc_prog_attach(link->dev, link->location == BPF_NET_INGRESS,
> +                               XTC_MAX_ENTRIES, l->id, nprog, link->priority,
> +                               BPF_F_REPLACE);
> +       if (ret == link->priority) {

prog_attach returning the priority is quite confusing. I think it's
because we support specifying zero and letting the kernel pick the
priority, so we need to communicate it back, is that right? If yes, can
you please add a comment to xtc_prog_attach explaining this behavior?

and also, here if it's not an error then priority *has* to be equal to
link->priority, right? So:

if (ret < 0)
    goto out;

oprog = xchg(...)
bpf_prog_put(...)
ret = 0;

would be easier to follow; otherwise we are left wondering what
happens when ret > 0 && ret != link->priority. If you are worried about
bugs, BUG_ON/WARN_ON if ret != link->priority?


> +               oprog = xchg(&l->prog, nprog);
> +               bpf_prog_put(oprog);
> +               ret = 0;
> +       }
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
>  static void xtc_link_release(struct bpf_link *l)
>  {
>         struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
> @@ -327,6 +360,7 @@ static void xtc_link_dealloc(struct bpf_link *l)
>  static const struct bpf_link_ops bpf_tc_link_lops = {
>         .release        = xtc_link_release,
>         .dealloc        = xtc_link_dealloc,
> +       .update_prog    = xtc_link_update,
>  };
>
>  int xtc_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 04/10] bpf: Implement link introspection for tc BPF link programs
  2022-10-04 23:11 ` [PATCH bpf-next 04/10] bpf: Implement link introspection " Daniel Borkmann
@ 2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-06 23:14   ` Martin KaFai Lau
  1 sibling, 0 replies; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  3:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Implement tc BPF link specific show_fdinfo and link_info to emit ifindex,
> attach location and priority.
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---

LGTM

Acked-by: Andrii Nakryiko <andrii@kernel.org>

>  include/uapi/linux/bpf.h       |  5 +++++
>  kernel/bpf/net.c               | 36 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  5 +++++
>  3 files changed, 46 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c006f561648e..f1b089170b78 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6309,6 +6309,11 @@ struct bpf_link_info {
>                 struct {
>                         __u32 ifindex;
>                 } xdp;
> +               struct {
> +                       __u32 ifindex;
> +                       __u32 attach_type;
> +                       __u32 priority;
> +               } tc;
>         };
>  } __attribute__((aligned(8)));
>

[...]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 05/10] bpf: Implement link detach for tc BPF link programs
  2022-10-04 23:11 ` [PATCH bpf-next 05/10] bpf: Implement link detach " Daniel Borkmann
@ 2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-06 23:24   ` Martin KaFai Lau
  1 sibling, 0 replies; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  3:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Add support for a forced detach operation of the tc BPF link. This detaches the
> link but without destroying it. It has the same semantics as the auto-detaching
> of a BPF link due to e.g. the net device being destroyed for a tc or XDP BPF
> link. Meaning, in this case the BPF link is still a valid kernel object, but is
> defunct given it is not attached anywhere anymore. It still holds a reference
> to the BPF program, though. This functionality allows users with enough access
> rights to manually force-detach an attached tc BPF link without killing the
> respective owner process and to then introspect/debug the BPF assets. A similar
> LINK_DETACH exists also for other BPF link types.
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---

Acked-by: Andrii Nakryiko <andrii@kernel.org>


>  kernel/bpf/net.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
> index a74b86bb60a9..5650f62c1315 100644
> --- a/kernel/bpf/net.c
> +++ b/kernel/bpf/net.c
> @@ -350,6 +350,12 @@ static void xtc_link_release(struct bpf_link *l)
>         rtnl_unlock();
>  }
>
> +static int xtc_link_detach(struct bpf_link *l)
> +{
> +       xtc_link_release(l);
> +       return 0;
> +}
> +
>  static void xtc_link_dealloc(struct bpf_link *l)
>  {
>         struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
> @@ -393,6 +399,7 @@ static int xtc_link_fill_info(const struct bpf_link *l,
>
>  static const struct bpf_link_ops bpf_tc_link_lops = {
>         .release        = xtc_link_release,
> +       .detach         = xtc_link_detach,
>         .dealloc        = xtc_link_dealloc,
>         .update_prog    = xtc_link_update,
>         .show_fdinfo    = xtc_link_fdinfo,
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 06/10] libbpf: Change signature of bpf_prog_query
  2022-10-04 23:11 ` [PATCH bpf-next 06/10] libbpf: Change signature of bpf_prog_query Daniel Borkmann
@ 2022-10-06  3:19   ` Andrii Nakryiko
  0 siblings, 0 replies; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  3:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Minor signature change for the bpf_prog_query() API, no change in behavior.
> An alternative option would be to add a new libbpf introspection API
> with a close to 1:1 implementation of bpf_prog_query() but with a changed
> prog_ids pointer. Given the change is minor enough, we went for
> the first option here.
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/lib/bpf/bpf.c | 2 +-
>  tools/lib/bpf/bpf.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 1d49a0352836..18b1e91cc469 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -846,7 +846,7 @@ int bpf_prog_query_opts(int target_fd,
>  }
>
>  int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
> -                  __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt)
> +                  __u32 *attach_flags, void *prog_ids, __u32 *prog_cnt)
>  {
>         LIBBPF_OPTS(bpf_prog_query_opts, opts);
>         int ret;
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 9c50beabdd14..bef7a5282188 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -386,7 +386,7 @@ LIBBPF_API int bpf_prog_query_opts(int target_fd,
>                                    struct bpf_prog_query_opts *opts);
>  LIBBPF_API int bpf_prog_query(int target_fd, enum bpf_attach_type type,
>                               __u32 query_flags, __u32 *attach_flags,
> -                             __u32 *prog_ids, __u32 *prog_cnt);
> +                             void *prog_ids, __u32 *prog_cnt);

ugh, this is pretty nasty. Let's not do that. Have you thought about
re-using prog_attach_flags (we can add a union to name the field
differently) to return prios instead of adding struct bpf_query_info?
This would be consistent with other use cases that use the PROG_ATTACH
and PROG_QUERY approach.
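
(For context, a rough sketch of how a query against this series looks from
user space today, assuming the uapi additions from patch 01 — struct
bpf_query_info and target_ifindex — which is where the void * cast above
comes from:)

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>	/* patched uapi header from this series */

/* sketch: list (prog_id, prio) pairs attached on @ifindex ingress */
static int tc_query_ingress(int ifindex)
{
	struct bpf_query_info info[16];
	union bpf_attr attr;
	__u32 i;

	memset(info, 0, sizeof(info));
	memset(&attr, 0, sizeof(attr));
	attr.query.target_ifindex = ifindex;
	attr.query.attach_type    = BPF_NET_INGRESS;
	attr.query.prog_cnt       = 16;
	attr.query.prog_ids       = (__u64)(unsigned long)info;

	if (syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr)) < 0)
		return -1;
	/* the kernel writes the attached count back into prog_cnt */
	for (i = 0; i < attr.query.prog_cnt && i < 16; i++)
		printf("prog id %u prio %u\n", info[i].prog_id, info[i].prio);
	return 0;
}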


>
>  LIBBPF_API int bpf_raw_tracepoint_open(const char *name, int prog_fd);
>  LIBBPF_API int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf,
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 07/10] libbpf: Add extended attach/detach opts
  2022-10-04 23:11 ` [PATCH bpf-next 07/10] libbpf: Add extended attach/detach opts Daniel Borkmann
@ 2022-10-06  3:19   ` Andrii Nakryiko
  0 siblings, 0 replies; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  3:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Extend libbpf attach opts and add a new detach opts API so this can be used
> to add/remove fd-based tc BPF programs. For concrete usage examples, see the
> extensive selftests that have been developed as part of this series.
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/lib/bpf/bpf.c      | 21 +++++++++++++++++++++
>  tools/lib/bpf/bpf.h      | 17 +++++++++++++++--
>  tools/lib/bpf/libbpf.map |  1 +
>  3 files changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 18b1e91cc469..d1e338ac9a62 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -670,6 +670,27 @@ int bpf_prog_detach2(int prog_fd, int target_fd, enum bpf_attach_type type)
>         return libbpf_err_errno(ret);
>  }
>
> +int bpf_prog_detach_opts(int prog_fd, int target_fd,
> +                        enum bpf_attach_type type,
> +                        const struct bpf_prog_detach_opts *opts)
> +{
> +       const size_t attr_sz = offsetofend(union bpf_attr, replace_bpf_fd);
> +       union bpf_attr attr;
> +       int ret;
> +
> +       if (!OPTS_VALID(opts, bpf_prog_detach_opts))
> +               return libbpf_err(-EINVAL);
> +
> +       memset(&attr, 0, attr_sz);
> +       attr.target_fd     = target_fd;
> +       attr.attach_bpf_fd = prog_fd;
> +       attr.attach_type   = type;
> +       attr.attach_priority = OPTS_GET(opts, attach_priority, 0);
> +
> +       ret = sys_bpf(BPF_PROG_DETACH, &attr, attr_sz);
> +       return libbpf_err_errno(ret);
> +}
> +
>  int bpf_link_create(int prog_fd, int target_fd,
>                     enum bpf_attach_type attach_type,
>                     const struct bpf_link_create_opts *opts)
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index bef7a5282188..96de58fecdbc 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -286,8 +286,11 @@ LIBBPF_API int bpf_obj_get_opts(const char *pathname,
>
>  struct bpf_prog_attach_opts {
>         size_t sz; /* size of this struct for forward/backward compatibility */
> -       unsigned int flags;
> -       int replace_prog_fd;
> +       __u32 flags;
> +       union {
> +               int replace_prog_fd;
> +               __u32 attach_priority;
> +       };

just add a new field; unions are very confusing in API structures.
It's ok if some unused fields stay zero.
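
Something like this, presumably (a sketch of the suggestion, not what the
patch currently does):

struct bpf_prog_attach_opts {
	size_t sz; /* size of this struct for forward/backward compatibility */
	__u32 flags;
	int replace_prog_fd;
	__u32 attach_priority;	/* plain new field instead of a union member */
	size_t :0;
};
#define bpf_prog_attach_opts__last_field attach_priority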

>  };
>  #define bpf_prog_attach_opts__last_field replace_prog_fd
>
> @@ -296,9 +299,19 @@ LIBBPF_API int bpf_prog_attach(int prog_fd, int attachable_fd,
>  LIBBPF_API int bpf_prog_attach_opts(int prog_fd, int attachable_fd,
>                                      enum bpf_attach_type type,
>                                      const struct bpf_prog_attach_opts *opts);
> +
> +struct bpf_prog_detach_opts {
> +       size_t sz; /* size of this struct for forward/backward compatibility */
> +       __u32 attach_priority;

please add size_t: 0; at the end to ensure better zero-initialization
by the compiler
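
i.e. something along these lines (sketch):

struct bpf_prog_detach_opts {
	size_t sz; /* size of this struct for forward/backward compatibility */
	__u32 attach_priority;
	size_t :0; /* trailing padding, helps the compiler zero-initialize */
};
#define bpf_prog_detach_opts__last_field attach_priority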

> +};
> +#define bpf_prog_detach_opts__last_field attach_priority
> +
>  LIBBPF_API int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
>  LIBBPF_API int bpf_prog_detach2(int prog_fd, int attachable_fd,
>                                 enum bpf_attach_type type);
> +LIBBPF_API int bpf_prog_detach_opts(int prog_fd, int target_fd,
> +                                   enum bpf_attach_type type,

given we add a detach_opts API, is the type something that always makes
sense? If not, let's move it into opts.

> +                                   const struct bpf_prog_detach_opts *opts);
>
>  union bpf_iter_link_info; /* defined in up-to-date linux/bpf.h */
>  struct bpf_link_create_opts {
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index c1d6aa7c82b6..0c94b4862ebb 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -377,4 +377,5 @@ LIBBPF_1.1.0 {
>                 user_ring_buffer__reserve;
>                 user_ring_buffer__reserve_blocking;
>                 user_ring_buffer__submit;
> +               bpf_prog_detach_opts;

let's keep this sorted

>  } LIBBPF_1.0.0;

> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 08/10] libbpf: Add support for BPF tc link
  2022-10-04 23:11 ` [PATCH bpf-next 08/10] libbpf: Add support for BPF tc link Daniel Borkmann
@ 2022-10-06  3:19   ` Andrii Nakryiko
  0 siblings, 0 replies; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  3:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Implement tc BPF link support for libbpf. The bpf_program__attach_fd()
> API has been refactored slightly in order to pass bpf_link_create_opts.
> A new bpf_program__attach_tc() has been added on top of this which allows
> for passing ifindex and prio parameters.
>
> New sections are tc/ingress and tc/egress which map to BPF_NET_INGRESS
> and BPF_NET_EGRESS, respectively.
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/lib/bpf/bpf.c      |  4 ++++
>  tools/lib/bpf/bpf.h      |  3 +++
>  tools/lib/bpf/libbpf.c   | 31 ++++++++++++++++++++++++++-----
>  tools/lib/bpf/libbpf.h   |  2 ++
>  tools/lib/bpf/libbpf.map |  1 +
>  5 files changed, 36 insertions(+), 5 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index d1e338ac9a62..f73fdecbb5f8 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -752,6 +752,10 @@ int bpf_link_create(int prog_fd, int target_fd,

should we maybe rename target_fd into a more generic "target"?

>                 if (!OPTS_ZEROED(opts, tracing))
>                         return libbpf_err(-EINVAL);
>                 break;
> +       case BPF_NET_INGRESS:
> +       case BPF_NET_EGRESS:
> +               attr.link_create.tc.priority = OPTS_GET(opts, tc.priority, 0);
> +               break;
>         default:
>                 if (!OPTS_ZEROED(opts, flags))
>                         return libbpf_err(-EINVAL);
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 96de58fecdbc..937583421327 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -334,6 +334,9 @@ struct bpf_link_create_opts {
>                 struct {
>                         __u64 cookie;
>                 } tracing;
> +               struct {
> +                       __u32 priority;
> +               } tc;
>         };
>         size_t :0;
>  };
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 184ce1684dcd..6eb33e4324ad 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -8474,6 +8474,8 @@ static const struct bpf_sec_def section_defs[] = {
>         SEC_DEF("kretsyscall+",         KPROBE, 0, SEC_NONE, attach_ksyscall),
>         SEC_DEF("usdt+",                KPROBE, 0, SEC_NONE, attach_usdt),
>         SEC_DEF("tc",                   SCHED_CLS, 0, SEC_NONE),
> +       SEC_DEF("tc/ingress",           SCHED_CLS, BPF_NET_INGRESS, SEC_ATTACHABLE_OPT),
> +       SEC_DEF("tc/egress",            SCHED_CLS, BPF_NET_EGRESS, SEC_ATTACHABLE_OPT),

btw, we could optionally implement the ability to declaratively
specify the priority, so that you could do SEC("tc/ingress:10") or some
syntax like that, if that seems useful in practice. If you expect that
prio is going to be dynamic most of the time, then it might not make
sense to add the unnecessary parsing code.
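Hypothetical usage if such parsing were added (the ":10" suffix is only the idea
sketched above; TC_NEXT is the fall-through code from this series):

  SEC("tc/ingress:10")
  int tc_ingress_prog(struct __sk_buff *skb)
  {
          /* attached declaratively at priority 10 */
          return TC_NEXT;
  }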

>         SEC_DEF("classifier",           SCHED_CLS, 0, SEC_NONE),
>         SEC_DEF("action",               SCHED_ACT, 0, SEC_NONE),
>         SEC_DEF("tracepoint+",          TRACEPOINT, 0, SEC_NONE, attach_tp),
> @@ -11238,11 +11240,10 @@ static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_li
>  }
>
>  static struct bpf_link *
> -bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
> -                      const char *target_name)
> +bpf_program__attach_fd_opts(const struct bpf_program *prog,
> +                           const struct bpf_link_create_opts *opts,
> +                           int target_fd, const char *target_name)

let's move opts to be the last argument, or second to last before
"target_name", whichever makes more sense to you.

also, the fd part is a lie, and the double-underscore naming is bad here
as well because this is an internal helper. Let's rename this to something
like bpf_prog_create_link()?
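E.g. a sketch of the renamed internal helper, with opts moved towards the end:

  static struct bpf_link *
  bpf_prog_create_link(const struct bpf_program *prog, int target_fd,
                       const char *target_name,
                       const struct bpf_link_create_opts *opts);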

>  {
> -       DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
> -                           .target_btf_id = btf_id);
>         enum bpf_attach_type attach_type;
>         char errmsg[STRERR_BUFSIZE];
>         struct bpf_link *link;
> @@ -11260,7 +11261,7 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
>         link->detach = &bpf_link__detach_fd;
>
>         attach_type = bpf_program__expected_attach_type(prog);
> -       link_fd = bpf_link_create(prog_fd, target_fd, attach_type, &opts);
> +       link_fd = bpf_link_create(prog_fd, target_fd, attach_type, opts);
>         if (link_fd < 0) {
>                 link_fd = -errno;
>                 free(link);
> @@ -11273,6 +11274,16 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
>         return link;
>  }
>
> +static struct bpf_link *
> +bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
> +                      const char *target_name)

there seems to be only one use case where we have btf_id != 0, so I
think we should just use LIBBPF_OPTS() explicitly in that one case and,
for all other current uses of bpf_program__attach_fd(), just use the opts
variant and pass NULL.

> +{
> +       DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
> +                           .target_btf_id = btf_id);
> +
> +       return bpf_program__attach_fd_opts(prog, &opts, target_fd, target_name);
> +}
> +
>  struct bpf_link *
>  bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd)
>  {
> @@ -11291,6 +11302,16 @@ struct bpf_link *bpf_program__attach_xdp(const struct bpf_program *prog, int ifi
>         return bpf_program__attach_fd(prog, ifindex, 0, "xdp");
>  }
>
> +struct bpf_link *bpf_program__attach_tc(const struct bpf_program *prog,
> +                                       int ifindex, __u32 priority)
> +{
> +       DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
> +                           .tc.priority = priority);
> +

nit: please just use the shorter LIBBPF_OPTS in new code, it's nice and short
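I.e. simply:

  LIBBPF_OPTS(bpf_link_create_opts, opts, .tc.priority = priority);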

> +       /* target_fd/target_ifindex use the same field in LINK_CREATE */
> +       return bpf_program__attach_fd_opts(prog, &opts, ifindex, "tc");
> +}
> +
>  struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
>                                               int target_fd,
>                                               const char *attach_func_name)
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index eee883f007f9..7e64cec9a1ba 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -645,6 +645,8 @@ bpf_program__attach_netns(const struct bpf_program *prog, int netns_fd);
>  LIBBPF_API struct bpf_link *
>  bpf_program__attach_xdp(const struct bpf_program *prog, int ifindex);
>  LIBBPF_API struct bpf_link *
> +bpf_program__attach_tc(const struct bpf_program *prog, int ifindex, __u32 priority);
> +LIBBPF_API struct bpf_link *
>  bpf_program__attach_freplace(const struct bpf_program *prog,
>                              int target_fd, const char *attach_func_name);
>
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 0c94b4862ebb..473ed71829c6 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -378,4 +378,5 @@ LIBBPF_1.1.0 {
>                 user_ring_buffer__reserve_blocking;
>                 user_ring_buffer__submit;
>                 bpf_prog_detach_opts;
> +               bpf_program__attach_tc;

same about alphabetical order

>  } LIBBPF_1.0.0;

> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 10/10] bpf, selftests: Add various BPF tc link selftests
  2022-10-04 23:11 ` [PATCH bpf-next 10/10] bpf, selftests: Add various BPF tc link selftests Daniel Borkmann
@ 2022-10-06  3:19   ` Andrii Nakryiko
  0 siblings, 0 replies; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-06  3:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Add a big batch of selftest to extend test_progs with various tc link,
> attach ops and old-style tc BPF attachments via libbpf APIs. Also test
> multi-program attachments including mixing the various attach options:
>
>   # ./test_progs -t tc_link
>   #179     tc_link_base:OK
>   #180     tc_link_detach:OK
>   #181     tc_link_mix:OK
>   #182     tc_link_opts:OK
>   #183     tc_link_run_base:OK
>   #184     tc_link_run_chain:OK
>   Summary: 6/0 PASSED, 0 SKIPPED, 0 FAILED
>
> All new and existing test cases pass.
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---

A few small things.

First, please make sure to not use CHECK and CHECK_FAIL.

Second, it's kind of sad that we still need to check the
ENABLE_ATOMICS_TESTS guards. I'd either not do that at all, or I
wonder if it's cleaner to do it in one header and just re-#define
__sync_fetch_and_xxx to be no-ops. This will make compilation not
break. And then tests will just be failing at runtime, which is fine,
because they can be denylisted. WDYT?
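As a sketch of the idea (header placement and the exact macro set are assumed):

  /* shared test header: make atomics compile as no-ops when not enabled */
  #ifndef ENABLE_ATOMICS_TESTS
  #define __sync_fetch_and_add(ptr, val) ({ (void)(ptr); (void)(val); 0; })
  #define __sync_fetch_and_or(ptr, val)  ({ (void)(ptr); (void)(val); 0; })
  #endif

Compilation then keeps working and the affected tests just fail at runtime, where
they can be denylisted.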

>  .../selftests/bpf/prog_tests/tc_link.c        | 756 ++++++++++++++++++
>  .../selftests/bpf/progs/test_tc_link.c        |  43 +
>  2 files changed, 799 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_link.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_tc_link.c
>

[...]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
                     ` (3 preceding siblings ...)
  2022-10-06  0:22   ` Andrii Nakryiko
@ 2022-10-06  5:00   ` Alexei Starovoitov
  2022-10-06 14:40     ` Jamal Hadi Salim
  2022-10-06 21:29     ` Daniel Borkmann
  2022-10-06 20:15   ` Martin KaFai Lau
  2022-10-06 20:54   ` Martin KaFai Lau
  6 siblings, 2 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2022-10-06  5:00 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> +
> +static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
> +			     struct bpf_prog *nprog, u32 prio, u32 flags)
> +{
> +	struct bpf_prog_array_item *item, *tmp;
> +	struct xtc_entry *entry, *peer;
> +	struct bpf_prog *oprog;
> +	bool created;
> +	int i, j;
> +
> +	ASSERT_RTNL();
> +
> +	entry = dev_xtc_entry_fetch(dev, ingress, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +	for (i = 0; i < limit; i++) {
> +		item = &entry->items[i];
> +		oprog = item->prog;
> +		if (!oprog)
> +			break;
> +		if (item->bpf_priority == prio) {
> +			if (flags & BPF_F_REPLACE) {
> +				/* Pairs with READ_ONCE() in xtc_run_progs(). */
> +				WRITE_ONCE(item->prog, nprog);
> +				bpf_prog_put(oprog);
> +				dev_xtc_entry_prio_set(entry, prio, nprog);
> +				return prio;
> +			}
> +			return -EBUSY;
> +		}
> +	}
> +	if (dev_xtc_entry_total(entry) >= limit)
> +		return -ENOSPC;
> +	prio = dev_xtc_entry_prio_new(entry, prio, nprog);
> +	if (prio < 0) {
> +		if (created)
> +			dev_xtc_entry_free(entry);
> +		return -ENOMEM;
> +	}
> +	peer = dev_xtc_entry_peer(entry);
> +	dev_xtc_entry_clear(peer);
> +	for (i = 0, j = 0; i < limit; i++, j++) {
> +		item = &entry->items[i];
> +		tmp = &peer->items[j];
> +		oprog = item->prog;
> +		if (!oprog) {
> +			if (i == j) {
> +				tmp->prog = nprog;
> +				tmp->bpf_priority = prio;
> +			}
> +			break;
> +		} else if (item->bpf_priority < prio) {
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +		} else if (item->bpf_priority > prio) {
> +			if (i == j) {
> +				tmp->prog = nprog;
> +				tmp->bpf_priority = prio;
> +				tmp = &peer->items[++j];
> +			}
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +		}
> +	}
> +	dev_xtc_entry_update(dev, peer, ingress);
> +	if (ingress)
> +		net_inc_ingress_queue();
> +	else
> +		net_inc_egress_queue();
> +	xtc_inc();
> +	return prio;
> +}

...

> +static __always_inline enum tc_action_base
> +xtc_run(const struct xtc_entry *entry, struct sk_buff *skb,
> +	const bool needs_mac)
> +{
> +	const struct bpf_prog_array_item *item;
> +	const struct bpf_prog *prog;
> +	int ret = TC_NEXT;
> +
> +	if (needs_mac)
> +		__skb_push(skb, skb->mac_len);
> +	item = &entry->items[0];
> +	while ((prog = READ_ONCE(item->prog))) {
> +		bpf_compute_data_pointers(skb);
> +		ret = bpf_prog_run(prog, skb);
> +		if (ret != TC_NEXT)
> +			break;
> +		item++;
> +	}
> +	if (needs_mac)
> +		__skb_pull(skb, skb->mac_len);
> +	return xtc_action_code(skb, ret);
> +}

I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
is done because "that's how things were done in the past".
imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
copy-pasted that cumbersome and hard to use concept.
Let's throw away that baggage?
In a good set of cases the bpf prog inserter cares whether the prog is first or not.
Since the first prog returning anything but TC_NEXT will be final.
I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
is good enough in practice. Any complex scheme should probably be programmable
as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
to libxdp chaining, but we added a feature that allows a prog to jump over another
prog and continue the chain. Priority concept cannot express that.
Since we'd have to add some "policy program" anyway for use cases like this
let's keep things as simple as possible?
Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
in both tcx and xdp ?
Naturally "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
The owner of "run_me_first" will update its prog through bpf_link_update.
"run_me_anywhere" will add to the end of the chain.
In XDP for compatibility reasons "run_me_first" will be the default.
Since only one prog can be enqueued with such a flag, it will match the existing single prog behavior.
Well-behaving progs (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
I know it's far from covering plenty of cases that we've discussed for a long time,
but the prio concept isn't really covering them either.
We've struggled enough with single xdp prog, so certainly not advocating for that.
Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as simple.
Both simple versions have their pros and cons and don't cover everything,
but imo both are better than prio.
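As a sketch, the flag-based alternative could look roughly like this (flag names and
values are illustrative only, nothing of this exists yet):

  enum {
          /* single owner; attach fails with -EBUSY if the slot is taken,
           * the owner updates its prog via bpf_link_update()
           */
          BPF_F_RUN_FIRST    = (1U << 8),
          /* no ordering requirement; appended to the end of the chain */
          BPF_F_RUN_ANYWHERE = (1U << 9),
  };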

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-06  5:00   ` Alexei Starovoitov
@ 2022-10-06 14:40     ` Jamal Hadi Salim
  2022-10-06 23:29       ` Alexei Starovoitov
  2022-10-06 21:29     ` Daniel Borkmann
  1 sibling, 1 reply; 62+ messages in thread
From: Jamal Hadi Salim @ 2022-10-06 14:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, bpf, razor, ast, andrii, martin.lau,
	john.fastabend, joannelkoong, memxor, toke, joe, netdev

On Thu, Oct 6, 2022 at 1:01 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:

>
> I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
> is done because "that's how things were done in the past".
> imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
> copy-pasted that cumbersome and hard to use concept.
> Let's throw away that baggage?
> In good set of cases the bpf prog inserter cares whether the prog is first or not.
> Since the first prog returning anything but TC_NEXT will be final.
> I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
> is good enough in practice. Any complex scheme should probably be programmable
> as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
> to libxdp chaining, but we added a feature that allows a prog to jump over another
> prog and continue the chain. Priority concept cannot express that.
> Since we'd have to add some "policy program" anyway for use cases like this
> let's keep things as simple as possible?
> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
> And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
> in both tcx and xdp ?

You just described the features already offered by tc opcodes + priority.

This problem is solvable by some user space resource arbitration scheme.
Reading through the thread, a daemon of some sort will do. A daemon
which issues tokens that can be validated in the kernel (a Kerberos type
of approach) would be best, i.e. fds alone don't resolve this.

cheers,
jamal

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 02/10] bpf: Implement BPF link handling for tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 02/10] bpf: Implement BPF link handling for " Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
@ 2022-10-06 17:56   ` Martin KaFai Lau
  2022-10-06 20:10   ` Martin KaFai Lau
  2 siblings, 0 replies; 62+ messages in thread
From: Martin KaFai Lau @ 2022-10-06 17:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: razor, ast, andrii, john.fastabend, joannelkoong, memxor, toke,
	joe, netdev, bpf

On 10/4/22 4:11 PM, Daniel Borkmann wrote:

> @@ -191,7 +202,8 @@ static void __xtc_prog_detach_all(struct net_device *dev, bool ingress, u32 limi
>   		if (!prog)
>   			break;
>   		dev_xtc_entry_prio_del(entry, item->bpf_priority);
> -		bpf_prog_put(prog);
> +		if (!item->bpf_id)
> +			bpf_prog_put(prog);

Should the link->dev be set to NULL somewhere?

>   		if (ingress)
>   			net_dec_ingress_queue();
>   		else
> @@ -244,6 +256,7 @@ __xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
>   		if (!prog)
>   			break;
>   		info.prog_id = prog->aux->id;
> +		info.link_id = item->bpf_id;
>   		info.prio = item->bpf_priority;
>   		if (copy_to_user(uinfo + i, &info, sizeof(info)))
>   			return -EFAULT;
> @@ -272,3 +285,90 @@ int xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
>   	rtnl_unlock();
>   	return ret;
>   }
> +

[ ... ]

> +static void xtc_link_release(struct bpf_link *l)
> +{
> +	struct bpf_tc_link *link = container_of(l, struct bpf_tc_link, link);
> +
> +	rtnl_lock();
> +	if (link->dev) {
> +		WARN_ON(__xtc_prog_detach(link->dev,
> +					  link->location == BPF_NET_INGRESS,
> +					  XTC_MAX_ENTRIES, l->id, link->priority));
> +		link->dev = NULL;
> +	}
> +	rtnl_unlock();
> +}


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 02/10] bpf: Implement BPF link handling for tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 02/10] bpf: Implement BPF link handling for " Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
  2022-10-06 17:56   ` Martin KaFai Lau
@ 2022-10-06 20:10   ` Martin KaFai Lau
  2 siblings, 0 replies; 62+ messages in thread
From: Martin KaFai Lau @ 2022-10-06 20:10 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: razor, ast, andrii, john.fastabend, joannelkoong, memxor, toke,
	joe, netdev, bpf

On 10/4/22 4:11 PM, Daniel Borkmann wrote:
>   static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32 limit,
> -			     u32 prio)
> +			     u32 id, u32 prio)
>   {
>   	struct bpf_prog_array_item *item, *tmp;
>   	struct bpf_prog *oprog, *fprog = NULL;
> @@ -126,8 +133,11 @@ static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32 limit,
>   		if (item->bpf_priority != prio) {
>   			tmp->prog = oprog;
>   			tmp->bpf_priority = item->bpf_priority;
> +			tmp->bpf_id = item->bpf_id;
>   			j++;
>   		} else {
> +			if (item->bpf_id != id)
> +				return -EBUSY;

A nit.  Should this be -ENOENT?  I think the cgroup detach is also returning 
-ENOENT for the not found case.

btw, this case should only happen from the BPF_PROG_DETACH but not the 
BPF_LINK_DETACH?



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
                     ` (4 preceding siblings ...)
  2022-10-06  5:00   ` Alexei Starovoitov
@ 2022-10-06 20:15   ` Martin KaFai Lau
  2022-10-06 20:54   ` Martin KaFai Lau
  6 siblings, 0 replies; 62+ messages in thread
From: Martin KaFai Lau @ 2022-10-06 20:15 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: razor, ast, andrii, john.fastabend, joannelkoong, memxor, toke,
	joe, netdev, bpf

On 10/4/22 4:11 PM, Daniel Borkmann wrote:
>   static int bpf_prog_detach(const union bpf_attr *attr)
>   {
> @@ -3527,6 +3531,9 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>   		return -EINVAL;
>   
>   	ptype = attach_type_to_prog_type(attr->attach_type);
> +	if (ptype != BPF_PROG_TYPE_SCHED_CLS &&
> +	    (attr->attach_flags || attr->replace_bpf_fd))

It seems no ptype is using the attach_flags in detach. xtc_prog_detach() is also 
not using it.  Should it be checked regardless of the ptype instead?
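I.e. a sketch of the ptype-independent variant of the check:

  if (attr->attach_flags || attr->replace_bpf_fd)
          return -EINVAL;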

> +		return -EINVAL;
>   
>   	switch (ptype) {
>   	case BPF_PROG_TYPE_SK_MSG:
> @@ -3545,6 +3552,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>   	case BPF_PROG_TYPE_SOCK_OPS:
>   	case BPF_PROG_TYPE_LSM:
>   		return cgroup_bpf_prog_detach(attr, ptype);
> +	case BPF_PROG_TYPE_SCHED_CLS:
> +		return xtc_prog_detach(attr);
>   	default:
>   		return -EINVAL;
>   	}
> @@ -3598,6 +3607,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
>   	case BPF_SK_MSG_VERDICT:
>   	case BPF_SK_SKB_VERDICT:
>   		return sock_map_bpf_prog_query(attr, uattr);
> +	case BPF_NET_INGRESS:
> +	case BPF_NET_EGRESS:
> +		return xtc_prog_query(attr, uattr);
>   	default:
>   		return -EINVAL;
>   	}


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-05 19:04   ` Jamal Hadi Salim
@ 2022-10-06 20:49     ` Daniel Borkmann
  2022-10-07 15:36       ` Jamal Hadi Salim
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-06 20:49 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev, Cong Wang, Jiri Pirko

Hi Jamal,

On 10/5/22 9:04 PM, Jamal Hadi Salim wrote:
[...]
> Let me see if I can summarize the issue of ownership..
> It seems there were two users, each with root access, and one decided they want
> to be prio 1 and basically deleted the other's programs and added
> themselves to the top?
> And of course both want to be prio 1. Am I correct? And this feature
> basically avoids
> this problem by virtue of fd ownership.

Yes and no ;) In the specific example I gave there was an application bug that
led to this race of one evicting the other, so it was not intentional and also
not triggered on all the nodes in the cluster, but aside from the example, the
issue is a generic one for tc BPF users. It is not fd ownership but ownership of the
BPF link that solves this, as it does for other existing BPF infra, which is one
of the motivations outlined in patch 2 to align this for tc BPF, too.

> IIUC,  this is an issue of resource contention. Both users who have
> root access think they should be prio 1. Kubernetes has no controls for this?
> For debugging, wouldn't listening to netlink events have caught this?
> I may be misunderstanding - but if both users took advantage of this
> feature, it seems the root cause is still unresolved, i.e. whoever gets there first
> becomes the owner of the highest prio?

This is independent of K8s core; system applications for observability, runtime
enforcement, networking, etc. can be deployed as Pods via the kube-system namespace into
the cluster and live in the host netns. These are typically developed independently
by different groups of people. So it all depends on the use cases these applications
solve, e.g. if you try to deploy two /main/ CNI plugins which both want to provide
cluster networking, it won't fly and this is also generally understood by cluster
operators, but there can be other applications also attaching to tc BPF for more
specialized functions (f.e. observing traffic flows, setting EDT tstamp for subsequent
fq, etc) and interoperability can be provided to a certain degree with prio settings &
unspec combo to continue the pipeline. Netlink events would at best only allow us to
observe the rug being pulled from underneath us, but not prevent it, in contrast to tc
BPF links, and given the rise of BPF projects we see in the K8s space, it's becoming
more crucial to avoid an accidental outage just from deploying a new Pod into a
running cluster as the tc BPF layer becomes more occupied.

> Other comments on just this patch (I will pay attention in detail later):
> My two qualms:
> 1) Was bastardizing all things TC_ACT_XXX necessary?
> Maybe you could create #define somewhere visible which refers
> to the TC_ACT_XXX?

Optional as mentioned in the other thread. It was suggested having enums which
become visible via vmlinux BTF as opposed to defines, so my thought was to lower
the barrier for new developers by making the naming and supported subset more obvious,
similar/closer to the XDP case. I didn't want to pull in a new header, but I can move it
to pkt_cls.h.
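For illustration, the BTF-visible subset could look roughly like this (the exact
mapping to the TC_ACT_* values is assumed here):

  enum tc_action_base {
          TC_NEXT     = -1,              /* continue to the next prog */
          TC_PASS     = TC_ACT_OK,
          TC_DROP     = TC_ACT_SHOT,
          TC_REDIRECT = TC_ACT_REDIRECT,
  };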

> 2) Why is xtc_run before tc_run()?

It needs to be first in the list because it's the only hook point that has an
'ownership' model in the tc BPF layer. If it's first, we can unequivocally know its
owner and ensure it's never skipped/bypassed/removed by another BPF program, either
intentionally or due to user bugs/errors. If we put it after other hooks like cls_bpf,
we lose that guarantee because those hooks might 'steal', remove or alter the skb before
the BPF link ones are executed. The other option is to make this completely flexible, to
the point that Stan made, that is, tcf_classify() is just a callback from the array at
a fixed position and it's completely up to the user where to add it from this layer,
but we went with the former approach.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 02/10] bpf: Implement BPF link handling for tc BPF programs
  2022-10-06  3:19   ` Andrii Nakryiko
@ 2022-10-06 20:54     ` Daniel Borkmann
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-06 20:54 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On 10/6/22 5:19 AM, Andrii Nakryiko wrote:
> On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> This work adds BPF links for tc. As a recap, a BPF link represents the attachment
>> of a BPF program to a BPF hook point. The BPF link holds a single reference to
>> keep BPF program alive. Moreover, hook points do not reference a BPF link, only
>> the application's fd or pinning does. A BPF link holds meta-data specific to
>> attachment and implements operations for link creation, (atomic) BPF program
>> update, detachment and introspection.
>>
>> The motivation for BPF links for tc BPF programs is multi-fold, for example:
>>
>> - "It's especially important for applications that are deployed fleet-wide
>>     and that don't "control" hosts they are deployed to. If such application
>>     crashes and no one notices and does anything about that, BPF program will
>>     keep running draining resources or even just, say, dropping packets. We
>>     at FB had outages due to such permanent BPF attachment semantics. With
>>     fd-based BPF link we are getting a framework, which allows safe, auto-
>>     detachable behavior by default, unless application explicitly opts in by
>>     pinning the BPF link." [0]
>>
>> -  From Cilium-side the tc BPF programs we attach to host-facing veth devices
>>     and phys devices build the core datapath for Kubernetes Pods, and they
>>     implement forwarding, load-balancing, policy, EDT-management, etc, within
>>     BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
>>     experienced hard-to-debug issues in a user's staging environment where
>>     another Kubernetes application using tc BPF attached to the same prio/handle
>>     of cls_bpf, wiping all Cilium-based BPF programs from underneath it. The
>>     goal is to establish a clear/safe ownership model via links which cannot
>>     accidentally be overridden. [1]
>>
>> BPF links for tc can co-exist with non-link attachments, and the semantics are
>> in line also with XDP links: BPF links cannot replace other BPF links, BPF
>> links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
>> lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
>> would solve mentioned issue of safe ownership model as 3rd party applications
>> would not be able to accidentally wipe Cilium programs, even if they are not
>> BPF link aware.
>>
>> Earlier attempts [2] have tried to integrate BPF links into core tc machinery
>> to solve cls_bpf, which has been intrusive to the generic tc kernel API with
>> extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
>> be wiped from the qdisc also. Locking a tc BPF program in place this way, is
>> getting into layering hacks given the two object models are vastly different.
>> We chose to implement a prerequisite of the fd-based tc BPF attach API, so
>> that the BPF link implementation fits in naturally similar to other link types
>> which are fd-based and without the need for changing core tc internal APIs.
>>
>> BPF programs for tc can then be successively migrated from cls_bpf to the new
>> tc BPF link without needing to change the program's source code, just the BPF
>> loader mechanics for attaching.
>>
>>    [0] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com/
>>    [1] https://lpc.events/event/16/contributions/1353/
>>    [2] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com/
>>
>> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
> 
> have you considered supporting BPF cookie from the outset? It should
> be trivial if you remove union from bpf_prog_array_item. If not, then
> we should reject LINK_CREATE if bpf_cookie is non-zero.

Haven't considered it yet at this point, but we can add this in a subsequent step;
agreed, thus we should reject it upon create for now.

>>   include/linux/bpf.h            |   5 +-
>>   include/net/xtc.h              |  14 ++++
>>   include/uapi/linux/bpf.h       |   5 ++
>>   kernel/bpf/net.c               | 116 ++++++++++++++++++++++++++++++---
>>   kernel/bpf/syscall.c           |   3 +
>>   tools/include/uapi/linux/bpf.h |   5 ++
>>   6 files changed, 139 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 71e5f43db378..226a74f65704 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1473,7 +1473,10 @@ struct bpf_prog_array_item {
>>          union {
>>                  struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>>                  u64 bpf_cookie;
>> -               u32 bpf_priority;
>> +               struct {
>> +                       u32 bpf_priority;
>> +                       u32 bpf_id;
> 
> this is link_id, is that right? should we name it as such?

Ack, will rename. Thanks also for all your other suggestions in the various patches,
they all make sense to me & I will address them!

>> +               };
>>          };
>>   };
>>
> 
> [...]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
                     ` (5 preceding siblings ...)
  2022-10-06 20:15   ` Martin KaFai Lau
@ 2022-10-06 20:54   ` Martin KaFai Lau
  6 siblings, 0 replies; 62+ messages in thread
From: Martin KaFai Lau @ 2022-10-06 20:54 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: razor, ast, andrii, john.fastabend, joannelkoong, memxor, toke,
	joe, netdev, bpf

On 10/4/22 4:11 PM, Daniel Borkmann wrote:
> diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
> new file mode 100644
> index 000000000000..ab9a9dee615b
> --- /dev/null
> +++ b/kernel/bpf/net.c
> @@ -0,0 +1,274 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/xtc.h>
> +
> +static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
> +			     struct bpf_prog *nprog, u32 prio, u32 flags)
> +{
> +	struct bpf_prog_array_item *item, *tmp;
> +	struct xtc_entry *entry, *peer;
> +	struct bpf_prog *oprog;
> +	bool created;
> +	int i, j;
> +
> +	ASSERT_RTNL();
> +
> +	entry = dev_xtc_entry_fetch(dev, ingress, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +	for (i = 0; i < limit; i++) {
> +		item = &entry->items[i];
> +		oprog = item->prog;
> +		if (!oprog)
> +			break;
> +		if (item->bpf_priority == prio) {
> +			if (flags & BPF_F_REPLACE) {
> +				/* Pairs with READ_ONCE() in xtc_run_progs(). */
> +				WRITE_ONCE(item->prog, nprog);
> +				bpf_prog_put(oprog);
> +				dev_xtc_entry_prio_set(entry, prio, nprog);
> +				return prio;
> +			}
> +			return -EBUSY;
> +		}
> +	}
> +	if (dev_xtc_entry_total(entry) >= limit)
> +		return -ENOSPC;
> +	prio = dev_xtc_entry_prio_new(entry, prio, nprog);
> +	if (prio < 0) {
> +		if (created)
> +			dev_xtc_entry_free(entry);
> +		return -ENOMEM;
> +	}
> +	peer = dev_xtc_entry_peer(entry);
> +	dev_xtc_entry_clear(peer);
> +	for (i = 0, j = 0; i < limit; i++, j++) {
> +		item = &entry->items[i];
> +		tmp = &peer->items[j];
> +		oprog = item->prog;
> +		if (!oprog) {
> +			if (i == j) {
> +				tmp->prog = nprog;
> +				tmp->bpf_priority = prio;
> +			}
> +			break;
> +		} else if (item->bpf_priority < prio) {
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +		} else if (item->bpf_priority > prio) {
> +			if (i == j) {
> +				tmp->prog = nprog;
> +				tmp->bpf_priority = prio;
> +				tmp = &peer->items[++j];
> +			}
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +		}
> +	}
> +	dev_xtc_entry_update(dev, peer, ingress);
> +	if (ingress)
> +		net_inc_ingress_queue();
> +	else
> +		net_inc_egress_queue();
> +	xtc_inc();
> +	return prio;
> +}
> +
> +int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *nprog)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	bool ingress = attr->attach_type == BPF_NET_INGRESS;
> +	struct net_device *dev;
> +	int ret;
> +
> +	if (attr->attach_flags & ~BPF_F_REPLACE)
> +		return -EINVAL;

After looking at patch 3, I think it needs to check that the attach_priority is
non-zero when BPF_F_REPLACE is set.

Then the __xtc_prog_attach() should return -ENOENT for BPF_F_REPLACE when prio 
is not found instead of continuing to dev_xtc_entry_prio_new().

However, all of this could probably go away if the outcome of the prio discussion
is to avoid it.
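Roughly, as a sketch of the two suggested checks:

  /* in xtc_prog_attach(): reject a replace request without a prio */
  if ((attr->attach_flags & BPF_F_REPLACE) && !attr->attach_priority)
          return -EINVAL;

  /* in __xtc_prog_attach(): bail out once the prio scan finds no match */
  if (flags & BPF_F_REPLACE)
          return -ENOENT;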


> +	rtnl_lock();
> +	dev = __dev_get_by_index(net, attr->target_ifindex);
> +	if (!dev) {
> +		rtnl_unlock();
> +		return -EINVAL;
> +	}
> +	ret = __xtc_prog_attach(dev, ingress, XTC_MAX_ENTRIES, nprog,
> +				attr->attach_priority, attr->attach_flags);
> +	rtnl_unlock();
> +	return ret;
> +}
> +


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-06  5:00   ` Alexei Starovoitov
  2022-10-06 14:40     ` Jamal Hadi Salim
@ 2022-10-06 21:29     ` Daniel Borkmann
  2022-10-06 23:28       ` Alexei Starovoitov
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-06 21:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev

On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
[...]
> 
> I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
> is done because "that's how things were done in the past".
> imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
> copy-pasted that cumbersome and hard to use concept.
> Let's throw away that baggage?
> In good set of cases the bpf prog inserter cares whether the prog is first or not.
> Since the first prog returning anything but TC_NEXT will be final.
> I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
> is good enough in practice. Any complex scheme should probably be programmable
> as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
> to libxdp chaining, but we added a feature that allows a prog to jump over another
> prog and continue the chain. Priority concept cannot express that.
> Since we'd have to add some "policy program" anyway for use cases like this
> let's keep things as simple as possible?
> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
> And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
> in both tcx and xdp ?
> Naturally "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
> The owner of "run_me_first" will update its prog through bpf_link_update.
> "run_me_anywhere" will add to the end of the chain.
> In XDP for compatibility reasons "run_me_first" will be the default.
> Since only one prog can be enqueued with such flag it will match existing single prog behavior.
> Well-behaving progs (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
> I know it's far from covering plenty of cases that we've discussed for long time,
> but prio concept isn't really covering them either.
> We've struggled enough with single xdp prog, so certainly not advocating for that.
> Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as simple.
> Both simple versions have their pros and cons and don't cover everything,
> but imo both are better than prio.

Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two
use cases that should be covered (and actually we kind of do this in this set, too,
with the prios via prio=x vs prio=0). Given users will only be consuming the APIs
via libs like libbpf, this can also be abstracted this way w/o users having to be
aware of prios. Anyway, where it gets tricky would be when things depend on ordering,
e.g. you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which
would be sth you can build today via tc BPF: so the policy one acts as a prefilter for
various CIDR ranges that should be blocked no matter what, then monitoring to sample
what goes into the lb, then the lb itself which does snat/dnat, then monitoring to see what
the corresponding pkt looks like as it goes to the backend, and maybe encryption to e.g. send
the result to a wireguard dev, so it's encrypted from the lb node to the backend. For such an
example, you'd need prios as 'run_me_anywhere' doesn't guarantee order, so there's
a case for both scenarios (a concrete layout vs a loose one), and for the latter we could
start off with an internal prio around x (e.g. 16k), so there's room to attach in
front via a fixed prio, but also append to the end for 'don't care', and that could be
from a lib pov the default/main API whereas prio would be some kind of extended one.
Thoughts?
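A rough sketch of what that split could look like (the base constant and the
dev_xtc_entry_prio_max() helper are assumed here, not part of this series):

  #define XTC_PRIO_AUTO_BASE	16384

  if (!prio) /* 'don't care': land behind all fixed-prio programs */
          prio = max_t(u32, XTC_PRIO_AUTO_BASE, dev_xtc_entry_prio_max(entry) + 1);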

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 04/10] bpf: Implement link introspection for tc BPF link programs
  2022-10-04 23:11 ` [PATCH bpf-next 04/10] bpf: Implement link introspection " Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
@ 2022-10-06 23:14   ` Martin KaFai Lau
  1 sibling, 0 replies; 62+ messages in thread
From: Martin KaFai Lau @ 2022-10-06 23:14 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: razor, ast, andrii, john.fastabend, joannelkoong, memxor, toke,
	joe, netdev, bpf

On 10/4/22 4:11 PM, Daniel Borkmann wrote:
> Implement tc BPF link specific show_fdinfo and link_info to emit ifindex,
> attach location and priority.

Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 05/10] bpf: Implement link detach for tc BPF link programs
  2022-10-04 23:11 ` [PATCH bpf-next 05/10] bpf: Implement link detach " Daniel Borkmann
  2022-10-06  3:19   ` Andrii Nakryiko
@ 2022-10-06 23:24   ` Martin KaFai Lau
  1 sibling, 0 replies; 62+ messages in thread
From: Martin KaFai Lau @ 2022-10-06 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: razor, ast, andrii, john.fastabend, joannelkoong, memxor, toke,
	joe, netdev

On 10/4/22 4:11 PM, Daniel Borkmann wrote:
> Add support for forced detach operation of tc BPF link. This detaches the link
> but without destroying it. It has the same semantics as auto-detaching of BPF
> link due to e.g. net device being destroyed for tc or XDP BPF link. Meaning,
> in this case the BPF link is still a valid kernel object, but is defunct given
> it is not attached anywhere anymore. It still holds a reference to the BPF
> program, though. This functionality allows users with enough access rights to
> manually force-detach attached tc BPF link without killing respective owner
> process and to then introspect/debug the BPF assets. Similar LINK_DETACH exists
> also for other BPF link types.

Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-06 21:29     ` Daniel Borkmann
@ 2022-10-06 23:28       ` Alexei Starovoitov
  2022-10-07 13:26         ` Daniel Borkmann
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2022-10-06 23:28 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, Nikolay Aleksandrov, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Toke Høiland-Jørgensen,
	Joe Stringer, Network Development

On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> > On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> [...]
> >
> > I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
> > is done because "that's how things were done in the past".
> > imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
> > copy-pasted that cumbersome and hard to use concept.
> > Let's throw away that baggage?
> > In good set of cases the bpf prog inserter cares whether the prog is first or not.
> > Since the first prog returning anything but TC_NEXT will be final.
> > I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
> > is good enough in practice. Any complex scheme should probably be programmable
> > as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
> > to libxdp chaining, but we added a feature that allows a prog to jump over another
> > prog and continue the chain. Priority concept cannot express that.
> > Since we'd have to add some "policy program" anyway for use cases like this
> > let's keep things as simple as possible?
> > Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
> > And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
> > in both tcx and xdp ?
> > Naturally "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
> > The owner of "run_me_first" will update its prog through bpf_link_update.
> > "run_me_anywhere" will add to the end of the chain.
> > In XDP for compatibility reasons "run_me_first" will be the default.
> > Since only one prog can be enqueued with such flag it will match existing single prog behavior.
> > Well-behaving progs (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
> > I know it's far from covering plenty of cases that we've discussed for long time,
> > but prio concept isn't really covering them either.
> > We've struggled enough with single xdp prog, so certainly not advocating for that.
> > Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as simple.
> > Both simple versions have their pros and cons and don't cover everything,
> > but imo both are better than prio.
>
> Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two
> use cases that should be covered (and actually we kind of do this in this set, too,
> with the prios via prio=x vs prio=0). Given users will only be consuming the APIs
> via libs like libbpf, this can also be abstracted this way w/o users having to be
> aware of prios.

but the patchset tells a different story.
Prio gets exposed everywhere in the uapi, all the way to bpftool,
where it's right there for users to understand.
And that's the main problem with it.
Users don't want to and don't need to be aware of it,
but the uapi forces them to pick the priority.

> Anyway, where it gets tricky would be when things depend on ordering,
> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which
> would be sth you can build today via tc BPF: so policy one acts as a prefilter for
> various cidr ranges that should be blocked no matter what, then monitoring to sample
> what goes into the lb, then lb itself which does snat/dnat, then monitoring to see what
> the corresponding pkt looks that goes to backend, and maybe encryption to e.g. send
> the result to wireguard dev, so it's encrypted from lb node to backend.

That's all theory. Your cover letter example proves that in
real life different services pick the same priority.
They simply don't know any better.
prio is an unnecessary magic that apps _have_ to pick,
so they just copy-paste and everyone ends up using the same.

> For such
> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee order, so there's
> a case for both scenarios (concrete layout vs loose one), and for latter we could
> start off with and internal prio around x (e.g. 16k), so there's room to attach in
> front via fixed prio, but also append to end for 'don't care', and that could be
> from lib pov the default/main API whereas prio would be some kind of extended one.
> Thoughts?

If prio was not part of the uapi, but somehow kernel internal,
and there was a user space daemon, systemd, or another bpf prog,
module, whatever, that users would interface with, then
the proposed implementation of prio would totally make sense.
prio as uapi is not that.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-06 14:40     ` Jamal Hadi Salim
@ 2022-10-06 23:29       ` Alexei Starovoitov
  2022-10-07 15:43         ` Jamal Hadi Salim
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2022-10-06 23:29 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Daniel Borkmann, bpf, Nikolay Aleksandrov, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Toke Høiland-Jørgensen,
	Joe Stringer, Network Development

On Thu, Oct 6, 2022 at 7:41 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Oct 6, 2022 at 1:01 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>
> >
> > I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
> > is done because "that's how things were done in the past".
> > imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
> > copy-pasted that cumbersome and hard to use concept.
> > Let's throw away that baggage?
> > In good set of cases the bpf prog inserter cares whether the prog is first or not.
> > Since the first prog returning anything but TC_NEXT will be final.
> > I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
> > is good enough in practice. Any complex scheme should probably be programmable
> > as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
> > to libxdp chaining, but we added a feature that allows a prog to jump over another
> > prog and continue the chain. Priority concept cannot express that.
> > Since we'd have to add some "policy program" anyway for use cases like this
> > let's keep things as simple as possible?
> > Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
> > And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
> > in both tcx and xdp ?
>
> You just described the features already offered by tc opcodes + priority.

Ohh, right. All possible mechanisms were available in TC 20 years ago.
Moving on.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-06 23:28       ` Alexei Starovoitov
@ 2022-10-07 13:26         ` Daniel Borkmann
  2022-10-07 14:32           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-07 13:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Nikolay Aleksandrov, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Toke Høiland-Jørgensen,
	Joe Stringer, Network Development

On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>> [...]
>>>
>>> I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
>>> is done because "that's how things were done in the past".
>>> imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
>>> copy-pasted that cumbersome and hard to use concept.
>>> Let's throw away that baggage?
>>> In good set of cases the bpf prog inserter cares whether the prog is first or not.
>>> Since the first prog returning anything but TC_NEXT will be final.
>>> I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
>>> is good enough in practice. Any complex scheme should probably be programmable
>>> as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
>>> to libxdp chaining, but we added a feature that allows a prog to jump over another
>>> prog and continue the chain. Priority concept cannot express that.
>>> Since we'd have to add some "policy program" anyway for use cases like this
>>> let's keep things as simple as possible?
>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
>>> And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
>>> in both tcx and xdp ?
>>> Naturally "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
>>> The owner of "run_me_first" will update its prog through bpf_link_update.
>>> "run_me_anywhere" will add to the end of the chain.
>>> In XDP for compatibility reasons "run_me_first" will be the default.
>>> Since only one prog can be enqueued with such flag it will match existing single prog behavior.
> > > Well-behaving progs (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
>>> I know it's far from covering plenty of cases that we've discussed for long time,
>>> but prio concept isn't really covering them either.
>>> We've struggled enough with single xdp prog, so certainly not advocating for that.
>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as simple.
>>> Both simple versions have their pros and cons and don't cover everything,
>>> but imo both are better than prio.
>>
>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two
>> use cases that should be covered (and actually we kind of do this in this set, too,
>> with the prios via prio=x vs prio=0). Given users will only be consuming the APIs
>> via libs like libbpf, this can also be abstracted this way w/o users having to be
>> aware of prios.
> 
> but the patchset tells different story.
> Prio gets exposed everywhere in uapi all the way to bpftool
> when it's right there for users to understand.
> And that's the main problem with it.
> The user don't want to and don't need to be aware of it,
> but uapi forces them to pick the priority.
> 
>> Anyway, where it gets tricky would be when things depend on ordering,
>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which
>> would be sth you can build today via tc BPF: so policy one acts as a prefilter for
>> various cidr ranges that should be blocked no matter what, then monitoring to sample
>> what goes into the lb, then lb itself which does snat/dnat, then monitoring to see what
>> the corresponding pkt looks that goes to backend, and maybe encryption to e.g. send
>> the result to wireguard dev, so it's encrypted from lb node to backend.
> 
> That's all theory. Your cover letter example proves that in
> real life different service pick the same priority.
> They simply don't know any better.
> prio is an unnecessary magic that apps _have_ to pick,
> so they just copy-paste and everyone ends up using the same.
> 
>> For such
>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee order, so there's
>> a case for both scenarios (concrete layout vs loose one), and for latter we could
>> start off with an internal prio around x (e.g. 16k), so there's room to attach in
>> front via fixed prio, but also append to end for 'don't care', and that could be
>> from lib pov the default/main API whereas prio would be some kind of extended one.
>> Thoughts?
> 
> If prio was not part of uapi, like kernel internal somehow,
> and there was a user space daemon, systemd, or another bpf prog,
> module, whatever that users would interface to then
> the proposed implementation of prio would totally make sense.
> prio as uapi is not that.

A good analogy to this issue might be systemd's unit files: you specify dependencies
for your own <unit> file via 'Wants=<unitA>', and ordering via 'Before=<unitB>' and
'After=<unitC>', and they refer to other unit files. I think that is generally okay,
you don't deal with prio numbers, but rather with some kind of textual representation. However,
the user/operator will have to deal with dependencies/ordering one way or another; the
problem here is that we deal with the kernel, and the loader talks to the kernel directly, so it
has no awareness of what else is running or could be running, so apps need to deal
with it somehow (and they cannot without external help). Some kind of system daemon
(like systemd) also won't fly much given such applications as Pods are typically
shipped individually as container images, so really only the host /netns/ is shared in
such a case but nothing else (the base image itself can be alpine, ubuntu, etc, and it has
its own systemd instance, for example). Maybe BPF links could have a user defined
name, and you'd express dependencies via names, but then again the application/
loader deals with bpf(2) directly, only the kernel is the common denominator, and apps
themselves have no awareness of other components that run or might run in the
system which load bpf (unless they expose a config knob). You mentioned the 'xdp chainer'
at Meta, how do you express dependencies and ordering there? When you deploy a new
app for XDP to production, I presume you need to know exactly where it's running
and not just 'ordering doesn't matter, just append to the end', no? I guess we
generally agree on that, the question is just whether there are better options than prio
for the uapi to express ordering/dependencies. Do you use sth different in the mentioned 'xdp chainer'?
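Purely for illustration, name-based ordering could look something like this on the
opts side (all names here are made up, nothing like this exists in the series):

  struct bpf_tc_link_opts {
          size_t sz;
          const char *name;       /* e.g. "lb"         */
          const char *run_after;  /* e.g. "policy"     */
          const char *run_before; /* e.g. "encryption" */
          size_t :0;
  };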

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 13:26         ` Daniel Borkmann
@ 2022-10-07 14:32           ` Toke Høiland-Jørgensen
  2022-10-07 16:55             ` sdf
  0 siblings, 1 reply; 62+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-10-07 14:32 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: bpf, Nikolay Aleksandrov, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
>> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>>> [...]
>>>>
>>>> I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
>>>> is done because "that's how things were done in the past".
>>>> imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
>>>> copy-pasted that cumbersome and hard to use concept.
>>>> Let's throw away that baggage?
>>>> In good set of cases the bpf prog inserter cares whether the prog is first or not.
>>>> Since the first prog returning anything but TC_NEXT will be final.
>>>> I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
>>>> is good enough in practice. Any complex scheme should probably be programmable
>>>> as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
>>>> to libxdp chaining, but we added a feature that allows a prog to jump over another
>>>> prog and continue the chain. Priority concept cannot express that.
>>>> Since we'd have to add some "policy program" anyway for use cases like this
>>>> let's keep things as simple as possible?
>>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
>>>> And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
>>>> in both tcx and xdp ?
>>>> Naturally "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
>>>> The owner of "run_me_first" will update its prog through bpf_link_update.
>>>> "run_me_anywhere" will add to the end of the chain.
>>>> In XDP for compatibility reasons "run_me_first" will be the default.
>>>> Since only one prog can be enqueued with such flag it will match existing single prog behavior.
>>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
>>>> I know it's far from covering plenty of cases that we've discussed for long time,
>>>> but prio concept isn't really covering them either.
>>>> We've struggled enough with single xdp prog, so certainly not advocating for that.
>>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as simple.
>>>> Both simple versions have their pros and cons and don't cover everything,
>>>> but imo both are better than prio.
>>>
>>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two
>>> use cases that should be covered (and actually we kind of do this in this set, too,
>>> with the prios via prio=x vs prio=0). Given users will only be consuming the APIs
>>> via libs like libbpf, this can also be abstracted this way w/o users having to be
>>> aware of prios.
>> 
>> but the patchset tells different story.
>> Prio gets exposed everywhere in uapi all the way to bpftool
>> when it's right there for users to understand.
>> And that's the main problem with it.
>> The user don't want to and don't need to be aware of it,
>> but uapi forces them to pick the priority.
>> 
>>> Anyway, where it gets tricky would be when things depend on ordering,
>>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which
>>> would be sth you can build today via tc BPF: so policy one acts as a prefilter for
>>> various cidr ranges that should be blocked no matter what, then monitoring to sample
>>> what goes into the lb, then lb itself which does snat/dnat, then monitoring to see what
>>> the corresponding pkt looks that goes to backend, and maybe encryption to e.g. send
>>> the result to wireguard dev, so it's encrypted from lb node to backend.
>> 
>> That's all theory. Your cover letter example proves that in
>> real life different service pick the same priority.
>> They simply don't know any better.
>> prio is an unnecessary magic that apps _have_ to pick,
>> so they just copy-paste and everyone ends up using the same.
>> 
>>> For such
>>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee order, so there's
>>> a case for both scenarios (concrete layout vs loose one), and for latter we could
>>> start off with and internal prio around x (e.g. 16k), so there's room to attach in
>>> front via fixed prio, but also append to end for 'don't care', and that could be
>>> from lib pov the default/main API whereas prio would be some kind of extended one.
>>> Thoughts?
>> 
>> If prio was not part of uapi, like kernel internal somehow,
>> and there was a user space daemon, systemd, or another bpf prog,
>> module, whatever that users would interface to then
>> the proposed implementation of prio would totally make sense.
>> prio as uapi is not that.
>
> A good analogy to this issue might be systemd's unit files.. you specify dependencies
> for your own <unit> file via 'Wants=<unitA>', and ordering via 'Before=<unitB>' and
> 'After=<unitC>' and they refer to other unit files. I think that is generally okay,
> you don't deal with prio numbers, but rather some kind textual representation. However
> user/operator will have to deal with dependencies/ordering one way or another, the
> problem here is that we deal with kernel and loader talks to kernel directly so it
> has no awareness of what else is running or could be running, so apps needs to deal
> with it somehow (and it cannot without external help).

I was thinking a little about how this might work; i.e., how can the
kernel expose the required knobs to allow a system policy to be
implemented without program loading having to talk to anything other
than the syscall API?

How about we only expose prepend/append in the prog attach UAPI, and
then have a kernel function that does the sorting like:

int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct bpf_prog *new_prog, bool append)

where the default implementation just appends/prepends to the array in
progs depending on the value of 'append'.

And then use the __weak linking trick (or maybe struct_ops with a member
for TCX, another for XDP, etc?) to allow BPF to override the function
wholesale and implement whatever ordering it wants? I.e., allow it to
just shift around the order of progs in the 'progs' array whenever a
program is loaded/unloaded?

This way, a userspace daemon can implement any policy it wants by just
attaching to that hook, and keeping things like how to express
dependencies as a userspace concern?
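
Just to make it concrete, the default (non-overridden) version could be as dumb as
the below; this is only a sketch, assuming 'progs' is really an array of prog
pointers that the caller guarantees has room for one more entry:

  __weak int bpf_add_new_tcx_prog(struct bpf_prog **progs, size_t num_progs,
                                  struct bpf_prog *new_prog, bool append)
  {
          if (append) {
                  /* default policy: new program goes to the end */
                  progs[num_progs] = new_prog;
          } else {
                  /* prepend: shift existing programs down one slot */
                  memmove(&progs[1], &progs[0], num_progs * sizeof(*progs));
                  progs[0] = new_prog;
          }
          return 0;
  }

A BPF override of this function could then reorder the array however it likes
before returning.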

-Toke


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-06 20:49     ` Daniel Borkmann
@ 2022-10-07 15:36       ` Jamal Hadi Salim
  0 siblings, 0 replies; 62+ messages in thread
From: Jamal Hadi Salim @ 2022-10-07 15:36 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, razor, ast, andrii, martin.lau, john.fastabend,
	joannelkoong, memxor, toke, joe, netdev, Cong Wang, Jiri Pirko

On Thu, Oct 6, 2022 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Hi Jamal,
>
> On 10/5/22 9:04 PM, Jamal Hadi Salim wrote:
> [...]

>
> Yes and no ;) In the specific example I gave there was an application bug that
> led to this race of one evicting the other, so it was not intentional and also
> not triggered on all the nodes in the cluster, but aside from the example, the
> issue is generic one for tc BPF users. Not fd ownership, but ownership of BPF
> link solves this as it does similarly for other existing BPF infra which is one
> of the motivations as outlined in patch 2 to align this for tc BPF, too.

Makes sense. I can see how no one would evict you with this; but it is still
a race for whoever gets installed first, no? I.e. you still need an
arbitration scheme.
And if you have a good arbitration scheme you may not need the changes.

> > IIUC,  this is an issue of resource contention. Both users who have
> > root access think they should be prio 1. Kubernetes has no controls for this?
> > For debugging, wouldnt listening to netlink events have caught this?
> > I may be misunderstanding - but if both users took advantage of this
> > feature seems the root cause is still unresolved i.e  whoever gets there first
> > becomes the owner of the highest prio?
>
> This is independent of K8s core; system applications for observability, runtime
> enforcement, networking, etc can be deployed as Pods via kube-system namespace into
> the cluster and live in the host netns. These are typically developed independently
> by different groups of people. So it all depends on the use cases these applications
> solve, e.g. if you try to deploy two /main/ CNI plugins which both want to provide
> cluster networking, it won't fly and this is also generally understood by cluster
> operators, but there can be other applications also attaching to tc BPF for more
> specialized functions (f.e. observing traffic flows, setting EDT tstamp for subsequent
> fq, etc) and interoperability can be provided to a certain degree with prio settings &
> unspec combo to continue the pipeline. Netlink events would at best only allow to
> observe the rig being pulled from underneath us, but not prevent it compared to tc
> BPF links, and given the rise of BPF projects we see in K8s space, it's becoming
> more crucial to avoid accidental outage just from deploying a new Pod into a
> running cluster given tc BPF layer becomes more occupied.

I got it, I think: seems like the granularity of resource control is
much higher, then.
Most certainly you want protection against a wild-west approach where everyone
wants to have the highest priority.


> > Other comments on just this patch (I will pay attention in detail later):
> > My two qualms:
> > 1) Was bastardizing all things TC_ACT_XXX necessary?
> > Maybe you could create #define somewhere visible which refers
> > to the TC_ACT_XXX?
>
> Optional as mentioned in the other thread. It was suggested having enums which
> become visible via vmlinux BTF as opposed to defines, so my thought was to lower
> barrier for new developers by making the naming and supported subset more obvious
> similar/closer to XDP case. I didn't want to pull in new header, but I can move it
> to pkt_cls.h.
>

I don't think those values will ever change - but putting them in the
same location will make them easier to find.

> > 2) Why is xtc_run before tc_run()?
>
> It needs to be first in the list because its the only hook point that has an
> 'ownership' model in tc BPF layer. If its first we can unequivocally know its
> owner and ensure its never skipped/bypassed/removed by another BPF program either
> intentionally or due to users bugs/errors. If we put it after other hooks like cls_bpf
> we loose the statement because those hooks might 'steal', remove, alter the skb before
> the BPF link ones are executed.

I understand - it's a generic problem in shared systems which, from your
description, it seems Kubernetes takes to another level.

> Other option is to make this completely flexible, to
> the point that Stan made, that is, tcf_classify() is just callback from the array at
> a fixed position and it's completely up to the user where to add from this layer,
> but we went with former approach.

I am going to read the thread again. If you make it user definable where
tcf_classify() sits, as opposed to some privilege that you are first in the code
path because you already planted your flag, then we're all happy.
Let 1000 flowers bloom.

It's a contentious issue, Daniel. You are fixing it only for ebpf - to be precise,
only for new users of ebpf who migrate to the new interface and not for users who
are still using the existing hooks. I haven't looked closely, but would it not have
worked to pass the link info via some TLV to the current tc code? That feels like
it would be more compatible with older code, assuming the infra code in user space
can hide things: if someone doesn't specify their prio through something like
bpftool or tc, then a default of prio 0 gets sent and the kernel provides whatever
reserved space it uses today. And if they get clever they can specify a prio and it
is a race of who gets there first.
I think this idea of having some object for ownership is great and I am hoping
it can be extended in general for tc; but we are going to need more granularity
for access control other than just delete (or create); for example, would it make
sense that permissions to add or delete table/filter/map entries could be
controlled this way? I'd be willing to commit resources if this was going to be
done for tc in general.

That aside:
We don't have this problem when it comes to hardware offloading because such
systems have very strict admin control: there's typically a daemon in charge,
which by itself is naive in the sense that someone with root could go underneath
you and do things - hence my interest in not just ownership but also access control.

cheers,
jamal

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-06 23:29       ` Alexei Starovoitov
@ 2022-10-07 15:43         ` Jamal Hadi Salim
  0 siblings, 0 replies; 62+ messages in thread
From: Jamal Hadi Salim @ 2022-10-07 15:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, bpf, Nikolay Aleksandrov, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Toke Høiland-Jørgensen,
	Joe Stringer, Network Development

On Thu, Oct 6, 2022 at 7:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Oct 6, 2022 at 7:41 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:

[..]
> > You just described the features already offered by tc opcodes + priority.
>
> Ohh, right. All possible mechanisms were available in TC 20 years ago.
> Moving on.

Alexei, it is the open source world - you can reinvent bell bottom pants, the
wheel, etc. - just please don't mutilate or kill small animals along the way.

cheers,
jamal

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 14:32           ` Toke Høiland-Jørgensen
@ 2022-10-07 16:55             ` sdf
  2022-10-07 17:20               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 62+ messages in thread
From: sdf @ 2022-10-07 16:55 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, Alexei Starovoitov, bpf, Nikolay Aleksandrov,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	John Fastabend, Joanne Koong, Kumar Kartikeya Dwivedi,
	Joe Stringer, Network Development

On 10/07, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:

> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@iogearbox.net>  
> wrote:
> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> >>> [...]
> >>>>
> >>>> I cannot help but feel that prio logic copy-paste from old tc,  
> netfilter and friends
> >>>> is done because "that's how things were done in the past".
> >>>> imo it was a well intentioned mistake and all networking things (tc,  
> netfilter, etc)
> >>>> copy-pasted that cumbersome and hard to use concept.
> >>>> Let's throw away that baggage?
> >>>> In good set of cases the bpf prog inserter cares whether the prog is  
> first or not.
> >>>> Since the first prog returning anything but TC_NEXT will be final.
> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't care  
> about order'
> >>>> is good enough in practice. Any complex scheme should probably be  
> programmable
> >>>> as any policy should. For example in Meta we have 'xdp chainer'  
> logic that is similar
> >>>> to libxdp chaining, but we added a feature that allows a prog to  
> jump over another
> >>>> prog and continue the chain. Priority concept cannot express that.
> >>>> Since we'd have to add some "policy program" anyway for use cases  
> like this
> >>>> let's keep things as simple as possible?
> >>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
> >>>> And allow bpf progs chaining in the kernel with "run_me_first"  
> vs "run_me_anywhere"
> >>>> in both tcx and xdp ?
> >>>> Naturally "run_me_first" prog will be the only one. No need for  
> F_REPLACE flags, etc.
> >>>> The owner of "run_me_first" will update its prog through  
> bpf_link_update.
> >>>> "run_me_anywhere" will add to the end of the chain.
> >>>> In XDP for compatibility reasons "run_me_first" will be the default.
> >>>> Since only one prog can be enqueued with such flag it will match  
> existing single prog behavior.
> >>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs)  
> will use "run_me_anywhere".
> >>>> I know it's far from covering plenty of cases that we've discussed  
> for long time,
> >>>> but prio concept isn't really covering them either.
> >>>> We've struggled enough with single xdp prog, so certainly not  
> advocating for that.
> >>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".  
> Just as simple.
> >>>> Both simple versions have their pros and cons and don't cover  
> everything,
> >>>> but imo both are better than prio.
> >>>
> >>> Yeah, it's kind of tricky, imho. The 'run_me_first'  
> vs 'run_me_anywhere' are two
> >>> use cases that should be covered (and actually we kind of do this in  
> this set, too,
> >>> with the prios via prio=x vs prio=0). Given users will only be  
> consuming the APIs
> >>> via libs like libbpf, this can also be abstracted this way w/o users  
> having to be
> >>> aware of prios.
> >>
> >> but the patchset tells different story.
> >> Prio gets exposed everywhere in uapi all the way to bpftool
> >> when it's right there for users to understand.
> >> And that's the main problem with it.
> >> The user don't want to and don't need to be aware of it,
> >> but uapi forces them to pick the priority.
> >>
> >>> Anyway, where it gets tricky would be when things depend on ordering,
> >>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring,  
> encryption, which
> >>> would be sth you can build today via tc BPF: so policy one acts as a  
> prefilter for
> >>> various cidr ranges that should be blocked no matter what, then  
> monitoring to sample
> >>> what goes into the lb, then lb itself which does snat/dnat, then  
> monitoring to see what
> >>> the corresponding pkt looks that goes to backend, and maybe  
> encryption to e.g. send
> >>> the result to wireguard dev, so it's encrypted from lb node to  
> backend.
> >>
> >> That's all theory. Your cover letter example proves that in
> >> real life different service pick the same priority.
> >> They simply don't know any better.
> >> prio is an unnecessary magic that apps _have_ to pick,
> >> so they just copy-paste and everyone ends up using the same.
> >>
> >>> For such
> >>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee  
> order, so there's
> >>> a case for both scenarios (concrete layout vs loose one), and for  
> latter we could
> >>> start off with and internal prio around x (e.g. 16k), so there's room  
> to attach in
> >>> front via fixed prio, but also append to end for 'don't care', and  
> that could be
> >>> from lib pov the default/main API whereas prio would be some kind of  
> extended one.
> >>> Thoughts?
> >>
> >> If prio was not part of uapi, like kernel internal somehow,
> >> and there was a user space daemon, systemd, or another bpf prog,
> >> module, whatever that users would interface to then
> >> the proposed implementation of prio would totally make sense.
> >> prio as uapi is not that.
> >
> > A good analogy to this issue might be systemd's unit files.. you  
> specify dependencies
> > for your own <unit> file via 'Wants=<unitA>', and ordering  
> via 'Before=<unitB>' and
> > 'After=<unitC>' and they refer to other unit files. I think that is  
> generally okay,
> > you don't deal with prio numbers, but rather some kind textual  
> representation. However
> > user/operator will have to deal with dependencies/ordering one way or  
> another, the
> > problem here is that we deal with kernel and loader talks to kernel  
> directly so it
> > has no awareness of what else is running or could be running, so apps  
> needs to deal
> > with it somehow (and it cannot without external help).

> I was thinking a little about how this might work; i.e., how can the
> kernel expose the required knobs to allow a system policy to be
> implemented without program loading having to talk to anything other
> than the syscall API?

> How about we only expose prepend/append in the prog attach UAPI, and
> then have a kernel function that does the sorting like:

> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct  
> bpf_prog *new_prog, bool append)

> where the default implementation just appends/prepends to the array in
> progs depending on the value of 'appen'.

> And then use the __weak linking trick (or maybe struct_ops with a member
> for TXC, another for XDP, etc?) to allow BPF to override the function
> wholesale and implement whatever ordering it wants? I.e., allow it can
> to just shift around the order of progs in the 'progs' array whenever a
> program is loaded/unloaded?

> This way, a userspace daemon can implement any policy it wants by just
> attaching to that hook, and keeping things like how to express
> dependencies as a userspace concern?

What if we do the above, but instead of a simple global 'attach first/last',
the default API would be:

- attach before <target_fd>
- attach after <target_fd>
- attach before target_fd=-1 == first
- attach after target_fd=-1 == last

?

That might be flexible enough by default to allow users to
append/prepend to any existing program in the chain (say, for
monitoring). Flexible enough for some central daemons to do
systemd-style policy. And, with bpf_add_new_tcx_prog, flexible
enough to implement any policy?
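
To spell out those semantics, a purely illustrative sketch (the enum and the
attach_tcx() helper are made-up names, not anything from the series or libbpf):

  enum tcx_attach_pos { TCX_BEFORE, TCX_AFTER };

  /* hypothetical wrapper around the relative-attach UAPI sketched above */
  int attach_tcx(int ifindex, int prog_fd, enum tcx_attach_pos pos, int target_fd);

  static int attach_monitoring(int ifindex, int prog_fd, int lb_prog_fd)
  {
          /* run right after the lb program if we know its fd ... */
          if (lb_prog_fd >= 0)
                  return attach_tcx(ifindex, prog_fd, TCX_AFTER, lb_prog_fd);
          /* ... otherwise just append to the end of the chain */
          return attach_tcx(ifindex, prog_fd, TCX_AFTER, -1);
  }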

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 16:55             ` sdf
@ 2022-10-07 17:20               ` Toke Høiland-Jørgensen
  2022-10-07 18:11                 ` sdf
  2022-10-07 18:59                 ` Alexei Starovoitov
  0 siblings, 2 replies; 62+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-10-07 17:20 UTC (permalink / raw)
  To: sdf
  Cc: Daniel Borkmann, Alexei Starovoitov, bpf, Nikolay Aleksandrov,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	John Fastabend, Joanne Koong, Kumar Kartikeya Dwivedi,
	Joe Stringer, Network Development

sdf@google.com writes:

> On 10/07, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>
>> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
>> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@iogearbox.net>  
>> wrote:
>> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>> >>> [...]
>> >>>>
>> >>>> I cannot help but feel that prio logic copy-paste from old tc,  
>> netfilter and friends
>> >>>> is done because "that's how things were done in the past".
>> >>>> imo it was a well intentioned mistake and all networking things (tc,  
>> netfilter, etc)
>> >>>> copy-pasted that cumbersome and hard to use concept.
>> >>>> Let's throw away that baggage?
>> >>>> In good set of cases the bpf prog inserter cares whether the prog is  
>> first or not.
>> >>>> Since the first prog returning anything but TC_NEXT will be final.
>> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't care  
>> about order'
>> >>>> is good enough in practice. Any complex scheme should probably be  
>> programmable
>> >>>> as any policy should. For example in Meta we have 'xdp chainer'  
>> logic that is similar
>> >>>> to libxdp chaining, but we added a feature that allows a prog to  
>> jump over another
>> >>>> prog and continue the chain. Priority concept cannot express that.
>> >>>> Since we'd have to add some "policy program" anyway for use cases  
>> like this
>> >>>> let's keep things as simple as possible?
>> >>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
>> >>>> And allow bpf progs chaining in the kernel with "run_me_first"  
>> vs "run_me_anywhere"
>> >>>> in both tcx and xdp ?
>> >>>> Naturally "run_me_first" prog will be the only one. No need for  
>> F_REPLACE flags, etc.
>> >>>> The owner of "run_me_first" will update its prog through  
>> bpf_link_update.
>> >>>> "run_me_anywhere" will add to the end of the chain.
>> >>>> In XDP for compatibility reasons "run_me_first" will be the default.
>> >>>> Since only one prog can be enqueued with such flag it will match  
>> existing single prog behavior.
>> >>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs)  
>> will use "run_me_anywhere".
>> >>>> I know it's far from covering plenty of cases that we've discussed  
>> for long time,
>> >>>> but prio concept isn't really covering them either.
>> >>>> We've struggled enough with single xdp prog, so certainly not  
>> advocating for that.
>> >>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".  
>> Just as simple.
>> >>>> Both simple versions have their pros and cons and don't cover  
>> everything,
>> >>>> but imo both are better than prio.
>> >>>
>> >>> Yeah, it's kind of tricky, imho. The 'run_me_first'  
>> vs 'run_me_anywhere' are two
>> >>> use cases that should be covered (and actually we kind of do this in  
>> this set, too,
>> >>> with the prios via prio=x vs prio=0). Given users will only be  
>> consuming the APIs
>> >>> via libs like libbpf, this can also be abstracted this way w/o users  
>> having to be
>> >>> aware of prios.
>> >>
>> >> but the patchset tells different story.
>> >> Prio gets exposed everywhere in uapi all the way to bpftool
>> >> when it's right there for users to understand.
>> >> And that's the main problem with it.
>> >> The user don't want to and don't need to be aware of it,
>> >> but uapi forces them to pick the priority.
>> >>
>> >>> Anyway, where it gets tricky would be when things depend on ordering,
>> >>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring,  
>> encryption, which
>> >>> would be sth you can build today via tc BPF: so policy one acts as a  
>> prefilter for
>> >>> various cidr ranges that should be blocked no matter what, then  
>> monitoring to sample
>> >>> what goes into the lb, then lb itself which does snat/dnat, then  
>> monitoring to see what
>> >>> the corresponding pkt looks that goes to backend, and maybe  
>> encryption to e.g. send
>> >>> the result to wireguard dev, so it's encrypted from lb node to  
>> backend.
>> >>
>> >> That's all theory. Your cover letter example proves that in
>> >> real life different service pick the same priority.
>> >> They simply don't know any better.
>> >> prio is an unnecessary magic that apps _have_ to pick,
>> >> so they just copy-paste and everyone ends up using the same.
>> >>
>> >>> For such
>> >>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee  
>> order, so there's
>> >>> a case for both scenarios (concrete layout vs loose one), and for  
>> latter we could
>> >>> start off with and internal prio around x (e.g. 16k), so there's room  
>> to attach in
>> >>> front via fixed prio, but also append to end for 'don't care', and  
>> that could be
>> >>> from lib pov the default/main API whereas prio would be some kind of  
>> extended one.
>> >>> Thoughts?
>> >>
>> >> If prio was not part of uapi, like kernel internal somehow,
>> >> and there was a user space daemon, systemd, or another bpf prog,
>> >> module, whatever that users would interface to then
>> >> the proposed implementation of prio would totally make sense.
>> >> prio as uapi is not that.
>> >
>> > A good analogy to this issue might be systemd's unit files.. you  
>> specify dependencies
>> > for your own <unit> file via 'Wants=<unitA>', and ordering  
>> via 'Before=<unitB>' and
>> > 'After=<unitC>' and they refer to other unit files. I think that is  
>> generally okay,
>> > you don't deal with prio numbers, but rather some kind textual  
>> representation. However
>> > user/operator will have to deal with dependencies/ordering one way or  
>> another, the
>> > problem here is that we deal with kernel and loader talks to kernel  
>> directly so it
>> > has no awareness of what else is running or could be running, so apps  
>> needs to deal
>> > with it somehow (and it cannot without external help).
>
>> I was thinking a little about how this might work; i.e., how can the
>> kernel expose the required knobs to allow a system policy to be
>> implemented without program loading having to talk to anything other
>> than the syscall API?
>
>> How about we only expose prepend/append in the prog attach UAPI, and
>> then have a kernel function that does the sorting like:
>
>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct  
>> bpf_prog *new_prog, bool append)
>
>> where the default implementation just appends/prepends to the array in
>> progs depending on the value of 'appen'.
>
>> And then use the __weak linking trick (or maybe struct_ops with a member
>> for TXC, another for XDP, etc?) to allow BPF to override the function
>> wholesale and implement whatever ordering it wants? I.e., allow it can
>> to just shift around the order of progs in the 'progs' array whenever a
>> program is loaded/unloaded?
>
>> This way, a userspace daemon can implement any policy it wants by just
>> attaching to that hook, and keeping things like how to express
>> dependencies as a userspace concern?
>
> What if we do the above, but instead of simple global 'attach first/last',
> the default api would be:
>
> - attach before <target_fd>
> - attach after <target_fd>
> - attach before target_fd=-1 == first
> - attach after target_fd=-1 == last
>
> ?

Hmm, the problem with that is that applications don't generally have an
fd to another application's BPF programs; and obtaining them from an ID
is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
before target *ID*" instead, which could work I guess? But then the
problem becomes that it's racy: the ID you're targeting could get
detached before you attach, so you'll need to be prepared to check that
and retry; and I'm almost certain that applications won't test for this,
so it'll just lead to hard-to-debug heisenbugs. Or am I being too
pessimistic here?
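
I.e., something like the below is the dance every application would have to get
right (sketch only; attach_before_id() and lookup_target_prog_id() are stand-ins
for whatever the real operations would end up looking like):

  /* hypothetical: attach prog_fd before the prog with the given ID */
  int attach_before_id(int ifindex, int prog_fd, __u32 target_id);
  /* app-specific way of discovering the program to order against */
  __u32 lookup_target_prog_id(void);

  static int attach_with_retry(int ifindex, int prog_fd)
  {
          int err;

          do {
                  __u32 id = lookup_target_prog_id();

                  err = attach_before_id(ifindex, prog_fd, id);
                  /* -ENOENT: target got detached between lookup and attach */
          } while (err == -ENOENT);

          return err;
  }

And my worry is that most applications will simply skip that retry loop.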

-Toke


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 17:20               ` Toke Høiland-Jørgensen
@ 2022-10-07 18:11                 ` sdf
  2022-10-07 19:06                   ` Daniel Borkmann
  2022-10-07 18:59                 ` Alexei Starovoitov
  1 sibling, 1 reply; 62+ messages in thread
From: sdf @ 2022-10-07 18:11 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, Alexei Starovoitov, bpf, Nikolay Aleksandrov,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	John Fastabend, Joanne Koong, Kumar Kartikeya Dwivedi,
	Joe Stringer, Network Development

On 10/07, Toke Høiland-Jørgensen wrote:
> sdf@google.com writes:

> > On 10/07, Toke Høiland-Jørgensen wrote:
> >> Daniel Borkmann <daniel@iogearbox.net> writes:
> >
> >> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> >> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann  
> <daniel@iogearbox.net>
> >> wrote:
> >> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> >> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> >> >>> [...]
> >> >>>>
> >> >>>> I cannot help but feel that prio logic copy-paste from old tc,
> >> netfilter and friends
> >> >>>> is done because "that's how things were done in the past".
> >> >>>> imo it was a well intentioned mistake and all networking things  
> (tc,
> >> netfilter, etc)
> >> >>>> copy-pasted that cumbersome and hard to use concept.
> >> >>>> Let's throw away that baggage?
> >> >>>> In good set of cases the bpf prog inserter cares whether the prog  
> is
> >> first or not.
> >> >>>> Since the first prog returning anything but TC_NEXT will be final.
> >> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't  
> care
> >> about order'
> >> >>>> is good enough in practice. Any complex scheme should probably be
> >> programmable
> >> >>>> as any policy should. For example in Meta we have 'xdp chainer'
> >> logic that is similar
> >> >>>> to libxdp chaining, but we added a feature that allows a prog to
> >> jump over another
> >> >>>> prog and continue the chain. Priority concept cannot express that.
> >> >>>> Since we'd have to add some "policy program" anyway for use cases
> >> like this
> >> >>>> let's keep things as simple as possible?
> >> >>>> Then maybe we can adopt this "as-simple-as-possible" to XDP  
> hooks ?
> >> >>>> And allow bpf progs chaining in the kernel with "run_me_first"
> >> vs "run_me_anywhere"
> >> >>>> in both tcx and xdp ?
> >> >>>> Naturally "run_me_first" prog will be the only one. No need for
> >> F_REPLACE flags, etc.
> >> >>>> The owner of "run_me_first" will update its prog through
> >> bpf_link_update.
> >> >>>> "run_me_anywhere" will add to the end of the chain.
> >> >>>> In XDP for compatibility reasons "run_me_first" will be the  
> default.
> >> >>>> Since only one prog can be enqueued with such flag it will match
> >> existing single prog behavior.
> >> >>>> Well behaving progs will use (like xdp-tcpdump or monitoring  
> progs)
> >> will use "run_me_anywhere".
> >> >>>> I know it's far from covering plenty of cases that we've discussed
> >> for long time,
> >> >>>> but prio concept isn't really covering them either.
> >> >>>> We've struggled enough with single xdp prog, so certainly not
> >> advocating for that.
> >> >>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".
> >> Just as simple.
> >> >>>> Both simple versions have their pros and cons and don't cover
> >> everything,
> >> >>>> but imo both are better than prio.
> >> >>>
> >> >>> Yeah, it's kind of tricky, imho. The 'run_me_first'
> >> vs 'run_me_anywhere' are two
> >> >>> use cases that should be covered (and actually we kind of do this  
> in
> >> this set, too,
> >> >>> with the prios via prio=x vs prio=0). Given users will only be
> >> consuming the APIs
> >> >>> via libs like libbpf, this can also be abstracted this way w/o  
> users
> >> having to be
> >> >>> aware of prios.
> >> >>
> >> >> but the patchset tells different story.
> >> >> Prio gets exposed everywhere in uapi all the way to bpftool
> >> >> when it's right there for users to understand.
> >> >> And that's the main problem with it.
> >> >> The user don't want to and don't need to be aware of it,
> >> >> but uapi forces them to pick the priority.
> >> >>
> >> >>> Anyway, where it gets tricky would be when things depend on  
> ordering,
> >> >>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring,
> >> encryption, which
> >> >>> would be sth you can build today via tc BPF: so policy one acts as  
> a
> >> prefilter for
> >> >>> various cidr ranges that should be blocked no matter what, then
> >> monitoring to sample
> >> >>> what goes into the lb, then lb itself which does snat/dnat, then
> >> monitoring to see what
> >> >>> the corresponding pkt looks that goes to backend, and maybe
> >> encryption to e.g. send
> >> >>> the result to wireguard dev, so it's encrypted from lb node to
> >> backend.
> >> >>
> >> >> That's all theory. Your cover letter example proves that in
> >> >> real life different service pick the same priority.
> >> >> They simply don't know any better.
> >> >> prio is an unnecessary magic that apps _have_ to pick,
> >> >> so they just copy-paste and everyone ends up using the same.
> >> >>
> >> >>> For such
> >> >>> example, you'd need prios as the 'run_me_anywhere' doesn't  
> guarantee
> >> order, so there's
> >> >>> a case for both scenarios (concrete layout vs loose one), and for
> >> latter we could
> >> >>> start off with and internal prio around x (e.g. 16k), so there's  
> room
> >> to attach in
> >> >>> front via fixed prio, but also append to end for 'don't care', and
> >> that could be
> >> >>> from lib pov the default/main API whereas prio would be some kind  
> of
> >> extended one.
> >> >>> Thoughts?
> >> >>
> >> >> If prio was not part of uapi, like kernel internal somehow,
> >> >> and there was a user space daemon, systemd, or another bpf prog,
> >> >> module, whatever that users would interface to then
> >> >> the proposed implementation of prio would totally make sense.
> >> >> prio as uapi is not that.
> >> >
> >> > A good analogy to this issue might be systemd's unit files.. you
> >> specify dependencies
> >> > for your own <unit> file via 'Wants=<unitA>', and ordering
> >> via 'Before=<unitB>' and
> >> > 'After=<unitC>' and they refer to other unit files. I think that is
> >> generally okay,
> >> > you don't deal with prio numbers, but rather some kind textual
> >> representation. However
> >> > user/operator will have to deal with dependencies/ordering one way or
> >> another, the
> >> > problem here is that we deal with kernel and loader talks to kernel
> >> directly so it
> >> > has no awareness of what else is running or could be running, so apps
> >> needs to deal
> >> > with it somehow (and it cannot without external help).
> >
> >> I was thinking a little about how this might work; i.e., how can the
> >> kernel expose the required knobs to allow a system policy to be
> >> implemented without program loading having to talk to anything other
> >> than the syscall API?
> >
> >> How about we only expose prepend/append in the prog attach UAPI, and
> >> then have a kernel function that does the sorting like:
> >
> >> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs,  
> struct
> >> bpf_prog *new_prog, bool append)
> >
> >> where the default implementation just appends/prepends to the array in
> >> progs depending on the value of 'appen'.
> >
> >> And then use the __weak linking trick (or maybe struct_ops with a  
> member
> >> for TXC, another for XDP, etc?) to allow BPF to override the function
> >> wholesale and implement whatever ordering it wants? I.e., allow it can
> >> to just shift around the order of progs in the 'progs' array whenever a
> >> program is loaded/unloaded?
> >
> >> This way, a userspace daemon can implement any policy it wants by just
> >> attaching to that hook, and keeping things like how to express
> >> dependencies as a userspace concern?
> >
> > What if we do the above, but instead of simple global 'attach  
> first/last',
> > the default api would be:
> >
> > - attach before <target_fd>
> > - attach after <target_fd>
> > - attach before target_fd=-1 == first
> > - attach after target_fd=-1 == last
> >
> > ?

> Hmm, the problem with that is that applications don't generally have an
> fd to another application's BPF programs; and obtaining them from an ID
> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> before target *ID*" instead, which could work I guess? But then the
> problem becomes that it's racy: the ID you're targeting could get
> detached before you attach, so you'll need to be prepared to check that
> and retry; and I'm almost certain that applications won't test for this,
> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> pessimistic here?

Yeah, agreed, id would work better. I guess I'm mostly coming here
from the bpftool/observability perspective where it seems handy to
be able to stick into any place in the chain for debugging?

Not sure if we need to care about raciness here. The same thing applies
for things like 'list all programs and dump their info' and all other
similar read-modify-write operations?

But, I guess, most users will still do 'attach target id = -1' aka
'attach last' which probably makes this flexibility unnecessary?
OTOH, the users that still want it (bpftool/observability) might use it.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 17:20               ` Toke Høiland-Jørgensen
  2022-10-07 18:11                 ` sdf
@ 2022-10-07 18:59                 ` Alexei Starovoitov
  2022-10-07 19:37                   ` Daniel Borkmann
  1 sibling, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2022-10-07 18:59 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, Daniel Borkmann, bpf, Nikolay Aleksandrov,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	John Fastabend, Joanne Koong, Kumar Kartikeya Dwivedi,
	Joe Stringer, Network Development

On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> sdf@google.com writes:
>
> > On 10/07, Toke Høiland-Jørgensen wrote:
> >> Daniel Borkmann <daniel@iogearbox.net> writes:
> >
> >> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> >> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@iogearbox.net>
> >> wrote:
> >> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> >> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> >> >>> [...]
> >> >>>>
> >> >>>> I cannot help but feel that prio logic copy-paste from old tc,
> >> netfilter and friends
> >> >>>> is done because "that's how things were done in the past".
> >> >>>> imo it was a well intentioned mistake and all networking things (tc,
> >> netfilter, etc)
> >> >>>> copy-pasted that cumbersome and hard to use concept.
> >> >>>> Let's throw away that baggage?
> >> >>>> In good set of cases the bpf prog inserter cares whether the prog is
> >> first or not.
> >> >>>> Since the first prog returning anything but TC_NEXT will be final.
> >> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't care
> >> about order'
> >> >>>> is good enough in practice. Any complex scheme should probably be
> >> programmable
> >> >>>> as any policy should. For example in Meta we have 'xdp chainer'
> >> logic that is similar
> >> >>>> to libxdp chaining, but we added a feature that allows a prog to
> >> jump over another
> >> >>>> prog and continue the chain. Priority concept cannot express that.
> >> >>>> Since we'd have to add some "policy program" anyway for use cases
> >> like this
> >> >>>> let's keep things as simple as possible?
> >> >>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
> >> >>>> And allow bpf progs chaining in the kernel with "run_me_first"
> >> vs "run_me_anywhere"
> >> >>>> in both tcx and xdp ?
> >> >>>> Naturally "run_me_first" prog will be the only one. No need for
> >> F_REPLACE flags, etc.
> >> >>>> The owner of "run_me_first" will update its prog through
> >> bpf_link_update.
> >> >>>> "run_me_anywhere" will add to the end of the chain.
> >> >>>> In XDP for compatibility reasons "run_me_first" will be the default.
> >> >>>> Since only one prog can be enqueued with such flag it will match
> >> existing single prog behavior.
> >> >>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs)
> >> will use "run_me_anywhere".
> >> >>>> I know it's far from covering plenty of cases that we've discussed
> >> for long time,
> >> >>>> but prio concept isn't really covering them either.
> >> >>>> We've struggled enough with single xdp prog, so certainly not
> >> advocating for that.
> >> >>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".
> >> Just as simple.
> >> >>>> Both simple versions have their pros and cons and don't cover
> >> everything,
> >> >>>> but imo both are better than prio.
> >> >>>
> >> >>> Yeah, it's kind of tricky, imho. The 'run_me_first'
> >> vs 'run_me_anywhere' are two
> >> >>> use cases that should be covered (and actually we kind of do this in
> >> this set, too,
> >> >>> with the prios via prio=x vs prio=0). Given users will only be
> >> consuming the APIs
> >> >>> via libs like libbpf, this can also be abstracted this way w/o users
> >> having to be
> >> >>> aware of prios.
> >> >>
> >> >> but the patchset tells different story.
> >> >> Prio gets exposed everywhere in uapi all the way to bpftool
> >> >> when it's right there for users to understand.
> >> >> And that's the main problem with it.
> >> >> The user don't want to and don't need to be aware of it,
> >> >> but uapi forces them to pick the priority.
> >> >>
> >> >>> Anyway, where it gets tricky would be when things depend on ordering,
> >> >>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring,
> >> encryption, which
> >> >>> would be sth you can build today via tc BPF: so policy one acts as a
> >> prefilter for
> >> >>> various cidr ranges that should be blocked no matter what, then
> >> monitoring to sample
> >> >>> what goes into the lb, then lb itself which does snat/dnat, then
> >> monitoring to see what
> >> >>> the corresponding pkt looks that goes to backend, and maybe
> >> encryption to e.g. send
> >> >>> the result to wireguard dev, so it's encrypted from lb node to
> >> backend.
> >> >>
> >> >> That's all theory. Your cover letter example proves that in
> >> >> real life different service pick the same priority.
> >> >> They simply don't know any better.
> >> >> prio is an unnecessary magic that apps _have_ to pick,
> >> >> so they just copy-paste and everyone ends up using the same.
> >> >>
> >> >>> For such
> >> >>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee
> >> order, so there's
> >> >>> a case for both scenarios (concrete layout vs loose one), and for
> >> latter we could
> >> >>> start off with and internal prio around x (e.g. 16k), so there's room
> >> to attach in
> >> >>> front via fixed prio, but also append to end for 'don't care', and
> >> that could be
> >> >>> from lib pov the default/main API whereas prio would be some kind of
> >> extended one.
> >> >>> Thoughts?
> >> >>
> >> >> If prio was not part of uapi, like kernel internal somehow,
> >> >> and there was a user space daemon, systemd, or another bpf prog,
> >> >> module, whatever that users would interface to then
> >> >> the proposed implementation of prio would totally make sense.
> >> >> prio as uapi is not that.
> >> >
> >> > A good analogy to this issue might be systemd's unit files.. you
> >> specify dependencies
> >> > for your own <unit> file via 'Wants=<unitA>', and ordering
> >> via 'Before=<unitB>' and
> >> > 'After=<unitC>' and they refer to other unit files. I think that is
> >> generally okay,
> >> > you don't deal with prio numbers, but rather some kind textual
> >> representation. However
> >> > user/operator will have to deal with dependencies/ordering one way or
> >> another, the
> >> > problem here is that we deal with kernel and loader talks to kernel
> >> directly so it
> >> > has no awareness of what else is running or could be running, so apps
> >> needs to deal
> >> > with it somehow (and it cannot without external help).
> >
> >> I was thinking a little about how this might work; i.e., how can the
> >> kernel expose the required knobs to allow a system policy to be
> >> implemented without program loading having to talk to anything other
> >> than the syscall API?
> >
> >> How about we only expose prepend/append in the prog attach UAPI, and
> >> then have a kernel function that does the sorting like:
> >
> >> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> >> bpf_prog *new_prog, bool append)
> >
> >> where the default implementation just appends/prepends to the array in
> >> progs depending on the value of 'appen'.
> >
> >> And then use the __weak linking trick (or maybe struct_ops with a member
> >> for TXC, another for XDP, etc?) to allow BPF to override the function
> >> wholesale and implement whatever ordering it wants? I.e., allow it can
> >> to just shift around the order of progs in the 'progs' array whenever a
> >> program is loaded/unloaded?
> >
> >> This way, a userspace daemon can implement any policy it wants by just
> >> attaching to that hook, and keeping things like how to express
> >> dependencies as a userspace concern?
> >
> > What if we do the above, but instead of simple global 'attach first/last',
> > the default api would be:
> >
> > - attach before <target_fd>
> > - attach after <target_fd>
> > - attach before target_fd=-1 == first
> > - attach after target_fd=-1 == last
> >
> > ?
>
> Hmm, the problem with that is that applications don't generally have an
> fd to another application's BPF programs; and obtaining them from an ID
> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> before target *ID*" instead, which could work I guess? But then the
> problem becomes that it's racy: the ID you're targeting could get
> detached before you attach, so you'll need to be prepared to check that
> and retry; and I'm almost certain that applications won't test for this,
> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> pessimistic here?

I like Stan's proposal and don't see any issue with FD.
It's good to gate specific sequencing with cap_sys_admin.
Also for consistency the FD is better than ID.

I also like the systemd analogy with Before=, After=.
systemd has a ton more ways to specify deps between Units,
but none of them have absolute numbers (which is what priority is).
The only bit I'd tweak in Stan's proposal is:
- attach before <target_fd>
- attach after <target_fd>
- attach before target_fd=0 == first
- attach after target_fd=0 == last

The attach operation needs to be CAP_NET_ADMIN.
Just like we do for BPF_PROG_TYPE_CGROUP_SKB.

And we can do the same logic for XDP attaching.
Eventually we can add a __weak "orchestrator prog",
but it would need to not only order progs but also
interpret enum tc_action_base return codes at run-time
between progs.
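
Roughly, the run semantics we're talking about look like the below (sketch only,
using the TC_NEXT name from earlier in the thread; the real enum may differ):

  static int tcx_run(struct bpf_prog **progs, int cnt, struct sk_buff *skb)
  {
          int i, ret = TC_NEXT;

          for (i = 0; i < cnt; i++) {
                  ret = bpf_prog_run(progs[i], skb);
                  if (ret != TC_NEXT)
                          break;  /* first non-NEXT verdict is final */
          }
          return ret;
  }

An "orchestrator prog" would get to decide both the order of 'progs' and what to
do with the intermediate return codes, instead of the fixed 'stop at the first
verdict' rule above.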

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 18:11                 ` sdf
@ 2022-10-07 19:06                   ` Daniel Borkmann
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-07 19:06 UTC (permalink / raw)
  To: sdf, Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, bpf, Nikolay Aleksandrov, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

On 10/7/22 8:11 PM, sdf@google.com wrote:
> On 10/07, Toke Høiland-Jørgensen wrote:
>> sdf@google.com writes:
[...]
>> >> I was thinking a little about how this might work; i.e., how can the
>> >> kernel expose the required knobs to allow a system policy to be
>> >> implemented without program loading having to talk to anything other
>> >> than the syscall API?
>> >
>> >> How about we only expose prepend/append in the prog attach UAPI, and
>> >> then have a kernel function that does the sorting like:
>> >
>> >> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
>> >> bpf_prog *new_prog, bool append)
>> >
>> >> where the default implementation just appends/prepends to the array in
>> >> progs depending on the value of 'appen'.
>> >
>> >> And then use the __weak linking trick (or maybe struct_ops with a member
>> >> for TXC, another for XDP, etc?) to allow BPF to override the function
>> >> wholesale and implement whatever ordering it wants? I.e., allow it can
>> >> to just shift around the order of progs in the 'progs' array whenever a
>> >> program is loaded/unloaded?
>> >
>> >> This way, a userspace daemon can implement any policy it wants by just
>> >> attaching to that hook, and keeping things like how to express
>> >> dependencies as a userspace concern?
>> >
>> > What if we do the above, but instead of simple global 'attach first/last',
>> > the default api would be:
>> >
>> > - attach before <target_fd>
>> > - attach after <target_fd>
>> > - attach before target_fd=-1 == first
>> > - attach after target_fd=-1 == last
>> >
>> > ?
> 
>> Hmm, the problem with that is that applications don't generally have an
>> fd to another application's BPF programs; and obtaining them from an ID
>> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
>> before target *ID*" instead, which could work I guess? But then the
>> problem becomes that it's racy: the ID you're targeting could get
>> detached before you attach, so you'll need to be prepared to check that
>> and retry; and I'm almost certain that applications won't test for this,
>> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
>> pessimistic here?
> 
> Yeah, agreed, id would work better. I guess I'm mostly coming here
> from the bpftool/observability perspective where it seems handy to
> being able to stick into any place in the chain for debugging?
> 
> Not sure if we need to care about raciness here. The same thing applies
> for things like 'list all programs and dump their info' and all other
> similar rmw operations?

For such a case you have the issue that the kernel has the source of truth and
some kind of agent would have its own view when it wants to do dependency
injection, so it needs to keep syncing with the kernel's view somehow (given
there can be multiple entities changing it), and then the question is also one
of understanding context, i.e. 'what is target fd/id xyz'. The latter you have
as well when you, say, tell a system daemon that it needs to install a struct_ops
program to manage these things (see remark wrt containers where only the host
netns is shared) - now you have moved that 'prio 1 conflict' situation to a meta
level where the orchestration progs fight over each other wrt who's first.

> But, I guess, most users will still do 'attach target id = -1' aka
> 'attach last' which probably makes this flexibility unnecessary?
> OTOH, the users that still want it (bpftool/observability) might use it.

To me it sounds reasonable to have the append mode as the default mode/API,
and an advanced option to say 'I want to run as 2nd prog, but if something
is already attached as 2nd prog, shift all the others +1 in the array', which
would relate to your point above, Stan, of being able to stick into any
place in the chain.
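
Such a positional insert would then roughly be (sketch only, assuming a plain
array of prog pointers with spare room at the tail):

  static void tcx_insert_at(struct bpf_prog **progs, int cnt, int pos,
                            struct bpf_prog *new_prog)
  {
          if (pos > cnt)
                  pos = cnt;                      /* clamp to plain append */
          /* shift everything from 'pos' onwards by +1 ... */
          memmove(&progs[pos + 1], &progs[pos], (cnt - pos) * sizeof(*progs));
          /* ... and drop the new program into the freed slot */
          progs[pos] = new_prog;
  }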

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 18:59                 ` Alexei Starovoitov
@ 2022-10-07 19:37                   ` Daniel Borkmann
  2022-10-07 22:45                     ` sdf
  2022-10-07 23:34                     ` Alexei Starovoitov
  0 siblings, 2 replies; 62+ messages in thread
From: Daniel Borkmann @ 2022-10-07 19:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, bpf, Nikolay Aleksandrov, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
[...]
>>>> I was thinking a little about how this might work; i.e., how can the
>>>> kernel expose the required knobs to allow a system policy to be
>>>> implemented without program loading having to talk to anything other
>>>> than the syscall API?
>>>
>>>> How about we only expose prepend/append in the prog attach UAPI, and
>>>> then have a kernel function that does the sorting like:
>>>
>>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
>>>> bpf_prog *new_prog, bool append)
>>>
>>>> where the default implementation just appends/prepends to the array in
>>>> progs depending on the value of 'appen'.
>>>
>>>> And then use the __weak linking trick (or maybe struct_ops with a member
>>>> for TXC, another for XDP, etc?) to allow BPF to override the function
>>>> wholesale and implement whatever ordering it wants? I.e., allow it can
>>>> to just shift around the order of progs in the 'progs' array whenever a
>>>> program is loaded/unloaded?
>>>
>>>> This way, a userspace daemon can implement any policy it wants by just
>>>> attaching to that hook, and keeping things like how to express
>>>> dependencies as a userspace concern?
>>>
>>> What if we do the above, but instead of simple global 'attach first/last',
>>> the default api would be:
>>>
>>> - attach before <target_fd>
>>> - attach after <target_fd>
>>> - attach before target_fd=-1 == first
>>> - attach after target_fd=-1 == last
>>>
>>> ?
>>
>> Hmm, the problem with that is that applications don't generally have an
>> fd to another application's BPF programs; and obtaining them from an ID
>> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
>> before target *ID*" instead, which could work I guess? But then the
>> problem becomes that it's racy: the ID you're targeting could get
>> detached before you attach, so you'll need to be prepared to check that
>> and retry; and I'm almost certain that applications won't test for this,
>> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
>> pessimistic here?
> 
> I like Stan's proposal and don't see any issue with FD.
> It's good to gate specific sequencing with cap_sys_admin.
> Also for consistency the FD is better than ID.
> 
> I also like systemd analogy with Before=, After=.
> systemd has a ton more ways to specify deps between Units,
> but none of them have absolute numbers (which is what priority is).
> The only bit I'd tweak in Stan's proposal is:
> - attach before <target_fd>
> - attach after <target_fd>
> - attach before target_fd=0 == first
> - attach after target_fd=0 == last

I think the before(), after() could work, but for the target_fd I have my doubts
that it will be practical. Maybe let's walk through a concrete real example: app_a
and app_b are shipped via container_a and container_b, respectively. Both want to
install tc BPF, and we (operator/user) want to say that the prog from app_b should
only be inserted after the one from app_a, never run before it; if no prog_a is
installed, we ofc just run prog_b, but if prog_a is inserted, it must come before
prog_b given the latter can only run after the former. How would we get to one
another's target fd? One could use the 0, but not if more programs sit before/after.
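
Just to make the shape of that proposal explicit (the struct below is purely
illustrative, none of these names exist in the uapi), app_b would effectively
have to submit something like:

/* Hypothetical request shape for the discussed before()/after() attach API. */
struct tc_attach_req {
        int  prog_fd;      /* prog_b's own fd - app_b has this */
        int  relative_fd;  /* the anchor prog (prog_a) - app_b has no way to
                            * obtain this fd from inside its own container */
        bool after;        /* attach after (true) / before (false) the anchor */
};

and the missing piece is exactly relative_fd, which only some privileged
entity able to resolve ids to fds could fill in on app_b's behalf.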

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 19:37                   ` Daniel Borkmann
@ 2022-10-07 22:45                     ` sdf
  2022-10-07 23:41                       ` Alexei Starovoitov
  2022-10-07 23:34                     ` Alexei Starovoitov
  1 sibling, 1 reply; 62+ messages in thread
From: sdf @ 2022-10-07 22:45 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen, bpf,
	Nikolay Aleksandrov, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

On 10/07, Daniel Borkmann wrote:
> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> [...]
> > > > > I was thinking a little about how this might work; i.e., how can  
> the
> > > > > kernel expose the required knobs to allow a system policy to be
> > > > > implemented without program loading having to talk to anything  
> other
> > > > > than the syscall API?
> > > >
> > > > > How about we only expose prepend/append in the prog attach UAPI,  
> and
> > > > > then have a kernel function that does the sorting like:
> > > >
> > > > > int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t  
> num_progs, struct
> > > > > bpf_prog *new_prog, bool append)
> > > >
> > > > > where the default implementation just appends/prepends to the  
> array in
> > > > > progs depending on the value of 'appen'.
> > > >
> > > > > And then use the __weak linking trick (or maybe struct_ops with a  
> member
> > > > > for TXC, another for XDP, etc?) to allow BPF to override the  
> function
> > > > > wholesale and implement whatever ordering it wants? I.e., allow  
> it can
> > > > > to just shift around the order of progs in the 'progs' array  
> whenever a
> > > > > program is loaded/unloaded?
> > > >
> > > > > This way, a userspace daemon can implement any policy it wants by  
> just
> > > > > attaching to that hook, and keeping things like how to express
> > > > > dependencies as a userspace concern?
> > > >
> > > > What if we do the above, but instead of simple global 'attach  
> first/last',
> > > > the default api would be:
> > > >
> > > > - attach before <target_fd>
> > > > - attach after <target_fd>
> > > > - attach before target_fd=-1 == first
> > > > - attach after target_fd=-1 == last
> > > >
> > > > ?
> > >
> > > Hmm, the problem with that is that applications don't generally have  
> an
> > > fd to another application's BPF programs; and obtaining them from an  
> ID
> > > is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> > > before target *ID*" instead, which could work I guess? But then the
> > > problem becomes that it's racy: the ID you're targeting could get
> > > detached before you attach, so you'll need to be prepared to check  
> that
> > > and retry; and I'm almost certain that applications won't test for  
> this,
> > > so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> > > pessimistic here?
> >
> > I like Stan's proposal and don't see any issue with FD.
> > It's good to gate specific sequencing with cap_sys_admin.
> > Also for consistency the FD is better than ID.
> >
> > I also like systemd analogy with Before=, After=.
> > systemd has a ton more ways to specify deps between Units,
> > but none of them have absolute numbers (which is what priority is).
> > The only bit I'd tweak in Stan's proposal is:
> > - attach before <target_fd>
> > - attach after <target_fd>
> > - attach before target_fd=0 == first
> > - attach after target_fd=0 == last

> I think the before(), after() could work, but the target_fd I have my  
> doubts
> that it will be practical. Maybe lets walk through a concrete real  
> example. app_a
> and app_b shipped via container_a resp container_b. Both want to install  
> tc BPF
> and we (operator/user) want to say that prog from app_b should only be  
> inserted
> after the one from app_a, never run before; if no prog_a is installed, we  
> ofc just
> run prog_b, but if prog_a is inserted, it must be before prog_b given the  
> latter
> can only run after the former. How would we get to one anothers target  
> fd? One
> could use the 0, but not if more programs sit before/after.

This fd/id has to be definitely abstracted by the loader. With the
program, we would ship some metadata like 'run_after:prog_a' for
prog_b (where prog_a might be literal function name maybe?).
However, this also depends on 'run_before:prog_b' in prog_a (in
case it happens to be started after prog_b) :-/

So yeah, some central place might still be needed; in this case, Toke's
suggestion on overriding this via bpf seems like the most flexible one.

Or maybe libbpf can consult some /etc/bpf.init.d/ directory for those?
Not sure if it's too much for libbpf or it's better done by the higher
levels? I guess we can rely on the program names and then all we really
need is some place to say 'prog X happens before Y' and for the loaders
to interpret that.
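
Purely as an illustration of that last point, the 'some place' could be as
small as a name-keyed table that the loader (or an agent) consults before
picking an attach position; everything below is hypothetical, nothing like
it exists in libbpf today:

/* Hypothetical ordering metadata a loader could consult. */
struct prog_order_rule {
        const char *prog;       /* program name as reported in prog info */
        const char *run_after;  /* if that prog is attached, go after it */
};

static const struct prog_order_rule rules[] = {
        { .prog = "prog_b", .run_after = "prog_a" },
};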

> To me it sounds reasonable to have the append mode as default mode/API,
> and an advanced option to say 'I want to run as 2nd prog, but if something
> is already attached as 2nd prog, shift all the others +1 in the array'  
> which
> would relate to your above point, Stan, of being able to stick into any
> place in the chain.

Replying to your other email here:

I'd still prefer, from the user side, to be able to stick my prog into
any place for debugging. But your suggestion to shift others by +1 works
for me.
(although, not sure, for example, what happens if I want to shift right the
program that's at position 65k; aka already last?)

IMO, having explicit before/after+target is slightly better usability-wise
than juggling priorities, but I'm fine with either way.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 19:37                   ` Daniel Borkmann
  2022-10-07 22:45                     ` sdf
@ 2022-10-07 23:34                     ` Alexei Starovoitov
  2022-10-08 11:38                       ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2022-10-07 23:34 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Toke Høiland-Jørgensen, Stanislav Fomichev, bpf,
	Nikolay Aleksandrov, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> [...]
> >>>> I was thinking a little about how this might work; i.e., how can the
> >>>> kernel expose the required knobs to allow a system policy to be
> >>>> implemented without program loading having to talk to anything other
> >>>> than the syscall API?
> >>>
> >>>> How about we only expose prepend/append in the prog attach UAPI, and
> >>>> then have a kernel function that does the sorting like:
> >>>
> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> >>>> bpf_prog *new_prog, bool append)
> >>>
> >>>> where the default implementation just appends/prepends to the array in
> >>>> progs depending on the value of 'appen'.
> >>>
> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
> >>>> to just shift around the order of progs in the 'progs' array whenever a
> >>>> program is loaded/unloaded?
> >>>
> >>>> This way, a userspace daemon can implement any policy it wants by just
> >>>> attaching to that hook, and keeping things like how to express
> >>>> dependencies as a userspace concern?
> >>>
> >>> What if we do the above, but instead of simple global 'attach first/last',
> >>> the default api would be:
> >>>
> >>> - attach before <target_fd>
> >>> - attach after <target_fd>
> >>> - attach before target_fd=-1 == first
> >>> - attach after target_fd=-1 == last
> >>>
> >>> ?
> >>
> >> Hmm, the problem with that is that applications don't generally have an
> >> fd to another application's BPF programs; and obtaining them from an ID
> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> >> before target *ID*" instead, which could work I guess? But then the
> >> problem becomes that it's racy: the ID you're targeting could get
> >> detached before you attach, so you'll need to be prepared to check that
> >> and retry; and I'm almost certain that applications won't test for this,
> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> >> pessimistic here?
> >
> > I like Stan's proposal and don't see any issue with FD.
> > It's good to gate specific sequencing with cap_sys_admin.
> > Also for consistency the FD is better than ID.
> >
> > I also like systemd analogy with Before=, After=.
> > systemd has a ton more ways to specify deps between Units,
> > but none of them have absolute numbers (which is what priority is).
> > The only bit I'd tweak in Stan's proposal is:
> > - attach before <target_fd>
> > - attach after <target_fd>
> > - attach before target_fd=0 == first
> > - attach after target_fd=0 == last
>
> I think the before(), after() could work, but the target_fd I have my doubts
> that it will be practical. Maybe lets walk through a concrete real example. app_a
> and app_b shipped via container_a resp container_b. Both want to install tc BPF
> and we (operator/user) want to say that prog from app_b should only be inserted
> after the one from app_a, never run before; if no prog_a is installed, we ofc just
> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
> can only run after the former. How would we get to one anothers target fd? One
> could use the 0, but not if more programs sit before/after.

I read your desired use case several times and probably still didn't get it.
Sounds like prog_b can just do after(fd=0) to become last.
And prog_a can do before(fd=0).
Whichever the order of attaching (a or b) these two will always
be in a->b order.
Are you saying that there should be no progs between them?
Sure, the daemon could iterate the hook progs, discover prog_id,
get its FD and do before(prog_fd).
The use case sounds hypothetical.
Since the first (and any) prog returning !TC_NEXT will abort
the chain, we'd need a __weak nop orchestrator prog to interpret
the retval for anything to be useful.
With cgroup-skb we did fancy none/override/multi and what for?
As far as I can see everyone is using 'multi' and all progs are run.
If we did only 'multi' for cgroup it would be just as fine
and we would have avoided all the complexity in the kernel.
Hence I'm advocating for the simplest approach for tcx and xdp.
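
(To spell out the retval point with a sketch: the execution model discussed
here is essentially the loop below - progs run in array order and the first
one not returning TC_NEXT ends the chain with its verdict - so whatever sits
first had better know what the retvals of the others mean. Names are taken
from this thread, not from a finalized uapi.)

/* Illustrative sketch, not actual kernel code. */
static int tcx_run(struct bpf_prog **progs, int cnt, struct sk_buff *skb)
{
        int i, ret = TC_NEXT;

        for (i = 0; i < cnt; i++) {
                ret = bpf_prog_run(progs[i], skb);
                if (ret != TC_NEXT)
                        break;  /* first non-TC_NEXT verdict aborts the chain */
        }
        return ret;
}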

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 22:45                     ` sdf
@ 2022-10-07 23:41                       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2022-10-07 23:41 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Daniel Borkmann, Toke Høiland-Jørgensen, bpf,
	Nikolay Aleksandrov, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

On Fri, Oct 7, 2022 at 3:45 PM <sdf@google.com> wrote:
>
> On 10/07, Daniel Borkmann wrote:
> > On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> > > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > [...]
> > > > > > I was thinking a little about how this might work; i.e., how can
> > the
> > > > > > kernel expose the required knobs to allow a system policy to be
> > > > > > implemented without program loading having to talk to anything
> > other
> > > > > > than the syscall API?
> > > > >
> > > > > > How about we only expose prepend/append in the prog attach UAPI,
> > and
> > > > > > then have a kernel function that does the sorting like:
> > > > >
> > > > > > int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t
> > num_progs, struct
> > > > > > bpf_prog *new_prog, bool append)
> > > > >
> > > > > > where the default implementation just appends/prepends to the
> > array in
> > > > > > progs depending on the value of 'appen'.
> > > > >
> > > > > > And then use the __weak linking trick (or maybe struct_ops with a
> > member
> > > > > > for TXC, another for XDP, etc?) to allow BPF to override the
> > function
> > > > > > wholesale and implement whatever ordering it wants? I.e., allow
> > it can
> > > > > > to just shift around the order of progs in the 'progs' array
> > whenever a
> > > > > > program is loaded/unloaded?
> > > > >
> > > > > > This way, a userspace daemon can implement any policy it wants by
> > just
> > > > > > attaching to that hook, and keeping things like how to express
> > > > > > dependencies as a userspace concern?
> > > > >
> > > > > What if we do the above, but instead of simple global 'attach
> > first/last',
> > > > > the default api would be:
> > > > >
> > > > > - attach before <target_fd>
> > > > > - attach after <target_fd>
> > > > > - attach before target_fd=-1 == first
> > > > > - attach after target_fd=-1 == last
> > > > >
> > > > > ?
> > > >
> > > > Hmm, the problem with that is that applications don't generally have
> > an
> > > > fd to another application's BPF programs; and obtaining them from an
> > ID
> > > > is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> > > > before target *ID*" instead, which could work I guess? But then the
> > > > problem becomes that it's racy: the ID you're targeting could get
> > > > detached before you attach, so you'll need to be prepared to check
> > that
> > > > and retry; and I'm almost certain that applications won't test for
> > this,
> > > > so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> > > > pessimistic here?
> > >
> > > I like Stan's proposal and don't see any issue with FD.
> > > It's good to gate specific sequencing with cap_sys_admin.
> > > Also for consistency the FD is better than ID.
> > >
> > > I also like systemd analogy with Before=, After=.
> > > systemd has a ton more ways to specify deps between Units,
> > > but none of them have absolute numbers (which is what priority is).
> > > The only bit I'd tweak in Stan's proposal is:
> > > - attach before <target_fd>
> > > - attach after <target_fd>
> > > - attach before target_fd=0 == first
> > > - attach after target_fd=0 == last
>
> > I think the before(), after() could work, but the target_fd I have my
> > doubts
> > that it will be practical. Maybe lets walk through a concrete real
> > example. app_a
> > and app_b shipped via container_a resp container_b. Both want to install
> > tc BPF
> > and we (operator/user) want to say that prog from app_b should only be
> > inserted
> > after the one from app_a, never run before; if no prog_a is installed, we
> > ofc just
> > run prog_b, but if prog_a is inserted, it must be before prog_b given the
> > latter
> > can only run after the former. How would we get to one anothers target
> > fd? One
> > could use the 0, but not if more programs sit before/after.
>
> This fd/id has to be definitely abstracted by the loader. With the
> program, we would ship some metadata like 'run_after:prog_a' for
> prog_b (where prog_a might be literal function name maybe?).
> However, this also depends on 'run_before:prog_b' in prog_a (in
> case it happens to be started after prog_b) :-/

Let's not overload libbpf with that.
I don't see any of that being used.
If a real use case comes up we'll do that at that time.

> So yeah, some central place might still be needed; in this case, Toke's
> suggestion on overriding this via bpf seems like the most flexible one.
>
> Or maybe libbpf can consult some /etc/bpf.init.d/ directory for those?
> Not sure if it's too much for libbpf or it's better done by the higher
> levels? I guess we can rely on the program names and then all we really
> need is some place to say 'prog X happens before Y' and for the loaders
> to interpret that.

It's getting into bikeshedding territory.
We made this mistake with xdp.
No one could convince anyone of anything and we got stuck with
a single prog.

> > To me it sounds reasonable to have the append mode as default mode/API,
> > and an advanced option to say 'I want to run as 2nd prog, but if something
> > is already attached as 2nd prog, shift all the others +1 in the array'
> > which
> > would relate to your above point, Stan, of being able to stick into any
> > place in the chain.
>
> Replying to your other email here:
>
> I'd still prefer, from the user side, to be able to stick my prog into
> any place for debugging. But you suggestion to shift others for +1 works
> for me.
> (although, not sure, for example, what happens if I want to shift right the
> program that's at position 65k; aka already last?)

65k progs attached to a single hook?!
At that point it won't really matter how before() and after()
are implemented.
Copy of the whole array is the simplest implementation that would
work just fine.

I guess I wasn't clear that the absolute position in the array
is not going to be returned to the user space.
The user space could grab IDs of all progs attached
in the existing order. But that order is valid only at that
very second. Another prog might get inserted anywhere a second later.
Same thing we do for cgroups.
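
For reference, a rough sketch of that query-then-resolve step from user
space (assuming the bpf_prog_query() change from patch 6 where the target
is the ifindex; the attach type is left as a parameter since its final name
is part of what this series defines, and resolving id -> fd needs
CAP_SYS_ADMIN):

#include <unistd.h>
#include <bpf/bpf.h>

/* Sketch only: snapshot the progs attached at a tc hook. The ordering is
 * only valid at the instant of the query; progs can come and go afterwards.
 */
static void dump_hook(int ifindex, enum bpf_attach_type type)
{
        __u32 ids[16], cnt = 16, attach_flags = 0, i;
        int fd;

        if (bpf_prog_query(ifindex, type, 0, &attach_flags, ids, &cnt))
                return;
        for (i = 0; i < cnt; i++) {
                fd = bpf_prog_get_fd_by_id(ids[i]);
                if (fd < 0)
                        continue;  /* already gone - exactly the race above */
                /* ... print info or use fd as a before()/after() anchor ... */
                close(fd);
        }
}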

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-07 23:34                     ` Alexei Starovoitov
@ 2022-10-08 11:38                       ` Toke Høiland-Jørgensen
  2022-10-08 20:38                         ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-10-08 11:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Stanislav Fomichev, bpf, Nikolay Aleksandrov, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
>> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> [...]
>> >>>> I was thinking a little about how this might work; i.e., how can the
>> >>>> kernel expose the required knobs to allow a system policy to be
>> >>>> implemented without program loading having to talk to anything other
>> >>>> than the syscall API?
>> >>>
>> >>>> How about we only expose prepend/append in the prog attach UAPI, and
>> >>>> then have a kernel function that does the sorting like:
>> >>>
>> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
>> >>>> bpf_prog *new_prog, bool append)
>> >>>
>> >>>> where the default implementation just appends/prepends to the array in
>> >>>> progs depending on the value of 'appen'.
>> >>>
>> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
>> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
>> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
>> >>>> to just shift around the order of progs in the 'progs' array whenever a
>> >>>> program is loaded/unloaded?
>> >>>
>> >>>> This way, a userspace daemon can implement any policy it wants by just
>> >>>> attaching to that hook, and keeping things like how to express
>> >>>> dependencies as a userspace concern?
>> >>>
>> >>> What if we do the above, but instead of simple global 'attach first/last',
>> >>> the default api would be:
>> >>>
>> >>> - attach before <target_fd>
>> >>> - attach after <target_fd>
>> >>> - attach before target_fd=-1 == first
>> >>> - attach after target_fd=-1 == last
>> >>>
>> >>> ?
>> >>
>> >> Hmm, the problem with that is that applications don't generally have an
>> >> fd to another application's BPF programs; and obtaining them from an ID
>> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
>> >> before target *ID*" instead, which could work I guess? But then the
>> >> problem becomes that it's racy: the ID you're targeting could get
>> >> detached before you attach, so you'll need to be prepared to check that
>> >> and retry; and I'm almost certain that applications won't test for this,
>> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
>> >> pessimistic here?
>> >
>> > I like Stan's proposal and don't see any issue with FD.
>> > It's good to gate specific sequencing with cap_sys_admin.
>> > Also for consistency the FD is better than ID.
>> >
>> > I also like systemd analogy with Before=, After=.
>> > systemd has a ton more ways to specify deps between Units,
>> > but none of them have absolute numbers (which is what priority is).
>> > The only bit I'd tweak in Stan's proposal is:
>> > - attach before <target_fd>
>> > - attach after <target_fd>
>> > - attach before target_fd=0 == first
>> > - attach after target_fd=0 == last
>>
>> I think the before(), after() could work, but the target_fd I have my doubts
>> that it will be practical. Maybe lets walk through a concrete real example. app_a
>> and app_b shipped via container_a resp container_b. Both want to install tc BPF
>> and we (operator/user) want to say that prog from app_b should only be inserted
>> after the one from app_a, never run before; if no prog_a is installed, we ofc just
>> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
>> can only run after the former. How would we get to one anothers target fd? One
>> could use the 0, but not if more programs sit before/after.
>
> I read your desired use case several times and probably still didn't get it.
> Sounds like prog_b can just do after(fd=0) to become last.
> And prog_a can do before(fd=0).
> Whichever the order of attaching (a or b) these two will always
> be in a->b order.

I agree that it's probably not feasible to have programs coordinate
among themselves except for "install me last/first" type
semantics.

I.e., the "before/after target_fd" is useful for a single application
that wants to install two programs in a certain order. Or for bpftool
for manual/debugging work.

System-wide policy (which includes "two containers both using BPF") is
going to need some kind of policy agent/daemon anyway. And the in-kernel
function override is the only feasible way to do that.

> Since the first and any prog returning !TC_NEXT will abort
> the chain we'd need __weak nop orchestrator prog to interpret
> retval for anything to be useful.

If we also want the orchestrator to interpret return codes, that
probably implies generating a BPF program that does the dispatching,
right? (since the attachment is per-interface we can't reuse the same
one). So maybe we do need to go the route of the (overridable) usermode
helper that gets all the program FDs and generates a BPF dispatcher
program? Or can we do this with a __weak function that emits bytecode
inside the kernel without being unsafe?

Anyway, I'm OK with deferring the orchestrator mechanism and going with
Stanislav's proposal as an initial API.

-Toke


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-08 11:38                       ` Toke Høiland-Jørgensen
@ 2022-10-08 20:38                         ` Alexei Starovoitov
  2022-10-13 18:30                           ` Andrii Nakryiko
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2022-10-08 20:38 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, Stanislav Fomichev, bpf, Nikolay Aleksandrov,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	John Fastabend, Joanne Koong, Kumar Kartikeya Dwivedi,
	Joe Stringer, Network Development

On Sat, Oct 08, 2022 at 01:38:54PM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>
> >> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> >> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> [...]
> >> >>>> I was thinking a little about how this might work; i.e., how can the
> >> >>>> kernel expose the required knobs to allow a system policy to be
> >> >>>> implemented without program loading having to talk to anything other
> >> >>>> than the syscall API?
> >> >>>
> >> >>>> How about we only expose prepend/append in the prog attach UAPI, and
> >> >>>> then have a kernel function that does the sorting like:
> >> >>>
> >> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> >> >>>> bpf_prog *new_prog, bool append)
> >> >>>
> >> >>>> where the default implementation just appends/prepends to the array in
> >> >>>> progs depending on the value of 'appen'.
> >> >>>
> >> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
> >> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
> >> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
> >> >>>> to just shift around the order of progs in the 'progs' array whenever a
> >> >>>> program is loaded/unloaded?
> >> >>>
> >> >>>> This way, a userspace daemon can implement any policy it wants by just
> >> >>>> attaching to that hook, and keeping things like how to express
> >> >>>> dependencies as a userspace concern?
> >> >>>
> >> >>> What if we do the above, but instead of simple global 'attach first/last',
> >> >>> the default api would be:
> >> >>>
> >> >>> - attach before <target_fd>
> >> >>> - attach after <target_fd>
> >> >>> - attach before target_fd=-1 == first
> >> >>> - attach after target_fd=-1 == last
> >> >>>
> >> >>> ?
> >> >>
> >> >> Hmm, the problem with that is that applications don't generally have an
> >> >> fd to another application's BPF programs; and obtaining them from an ID
> >> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> >> >> before target *ID*" instead, which could work I guess? But then the
> >> >> problem becomes that it's racy: the ID you're targeting could get
> >> >> detached before you attach, so you'll need to be prepared to check that
> >> >> and retry; and I'm almost certain that applications won't test for this,
> >> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> >> >> pessimistic here?
> >> >
> >> > I like Stan's proposal and don't see any issue with FD.
> >> > It's good to gate specific sequencing with cap_sys_admin.
> >> > Also for consistency the FD is better than ID.
> >> >
> >> > I also like systemd analogy with Before=, After=.
> >> > systemd has a ton more ways to specify deps between Units,
> >> > but none of them have absolute numbers (which is what priority is).
> >> > The only bit I'd tweak in Stan's proposal is:
> >> > - attach before <target_fd>
> >> > - attach after <target_fd>
> >> > - attach before target_fd=0 == first
> >> > - attach after target_fd=0 == last
> >>
> >> I think the before(), after() could work, but the target_fd I have my doubts
> >> that it will be practical. Maybe lets walk through a concrete real example. app_a
> >> and app_b shipped via container_a resp container_b. Both want to install tc BPF
> >> and we (operator/user) want to say that prog from app_b should only be inserted
> >> after the one from app_a, never run before; if no prog_a is installed, we ofc just
> >> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
> >> can only run after the former. How would we get to one anothers target fd? One
> >> could use the 0, but not if more programs sit before/after.
> >
> > I read your desired use case several times and probably still didn't get it.
> > Sounds like prog_b can just do after(fd=0) to become last.
> > And prog_a can do before(fd=0).
> > Whichever the order of attaching (a or b) these two will always
> > be in a->b order.
> 
> I agree that it's probably not feasible to have programs themselves
> coordinate between themselves except for "install me last/first" type
> semantics.
> 
> I.e., the "before/after target_fd" is useful for a single application
> that wants to install two programs in a certain order. Or for bpftool
> for manual/debugging work.

yep

> System-wide policy (which includes "two containers both using BPF") is
> going to need some kind of policy agent/daemon anyway. And the in-kernel
> function override is the only feasible way to do that.

yep

> > Since the first and any prog returning !TC_NEXT will abort
> > the chain we'd need __weak nop orchestrator prog to interpret
> > retval for anything to be useful.
> 
> If we also want the orchestrator to interpret return codes, that
> probably implies generating a BPF program that does the dispatching,
> right? (since the attachment is per-interface we can't reuse the same
> one). So maybe we do need to go the route of the (overridable) usermode
> helper that gets all the program FDs and generates a BPF dispatcher
> program? Or can we do this with a __weak function that emits bytecode
> inside the kernel without being unsafe?

hid-bpf, cgroup-rstat, netfilter-bpf are facing a similar issue.
The __weak override with one prog is certainly limiting.
And every case needs a different demux.
I think we need to generalize the xdp dispatcher to address this.
For example, for the case:
__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
                                     struct cgroup *parent, int cpu)
{
}

we can say that the 1st argument to the nop function will be used as
the 'demuxing entity'.
Sort of like if we had added a 'prog' pointer to 'struct cgroup',
but instead of burning 8 bytes in every struct cgroup we can generate
'dispatcher asm' only for specific pointers.
In the case of hid-bpf that pointer will be a pointer to the hid device and
the demux will be done based on the device. It can be an integer too.
The subsystem that defines the __weak func can pick whatever int or pointer
as the first argument, and the dispatcher routine will generate code:
if (arg1 == constA) progA(arg1, arg2, ...);
else if (arg1 == constB) progB(arg1, arg2, ...);
...
else nop();

This way the 'nop' property of __weak is preserved until user space
passes (constA, progA) tuple to the kernel to generate dispatcher
for that __weak hook.

> Anyway, I'm OK with deferring the orchestrator mechanism and going with
> Stanislav's proposal as an initial API.

Great. Looks like we're converging :) Hope Daniel is ok with this direction.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-08 20:38                         ` Alexei Starovoitov
@ 2022-10-13 18:30                           ` Andrii Nakryiko
  2022-10-14 15:38                             ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Andrii Nakryiko @ 2022-10-13 18:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, Daniel Borkmann,
	Stanislav Fomichev, bpf, Nikolay Aleksandrov, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

On Sat, Oct 8, 2022 at 1:38 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Oct 08, 2022 at 01:38:54PM +0200, Toke Høiland-Jørgensen wrote:
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >>
> > >> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> > >> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >> [...]
> > >> >>>> I was thinking a little about how this might work; i.e., how can the
> > >> >>>> kernel expose the required knobs to allow a system policy to be
> > >> >>>> implemented without program loading having to talk to anything other
> > >> >>>> than the syscall API?
> > >> >>>
> > >> >>>> How about we only expose prepend/append in the prog attach UAPI, and
> > >> >>>> then have a kernel function that does the sorting like:
> > >> >>>
> > >> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> > >> >>>> bpf_prog *new_prog, bool append)
> > >> >>>
> > >> >>>> where the default implementation just appends/prepends to the array in
> > >> >>>> progs depending on the value of 'appen'.
> > >> >>>
> > >> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
> > >> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
> > >> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
> > >> >>>> to just shift around the order of progs in the 'progs' array whenever a
> > >> >>>> program is loaded/unloaded?
> > >> >>>
> > >> >>>> This way, a userspace daemon can implement any policy it wants by just
> > >> >>>> attaching to that hook, and keeping things like how to express
> > >> >>>> dependencies as a userspace concern?
> > >> >>>
> > >> >>> What if we do the above, but instead of simple global 'attach first/last',
> > >> >>> the default api would be:
> > >> >>>
> > >> >>> - attach before <target_fd>
> > >> >>> - attach after <target_fd>
> > >> >>> - attach before target_fd=-1 == first
> > >> >>> - attach after target_fd=-1 == last
> > >> >>>
> > >> >>> ?
> > >> >>
> > >> >> Hmm, the problem with that is that applications don't generally have an
> > >> >> fd to another application's BPF programs; and obtaining them from an ID
> > >> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> > >> >> before target *ID*" instead, which could work I guess? But then the
> > >> >> problem becomes that it's racy: the ID you're targeting could get
> > >> >> detached before you attach, so you'll need to be prepared to check that
> > >> >> and retry; and I'm almost certain that applications won't test for this,
> > >> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> > >> >> pessimistic here?
> > >> >
> > >> > I like Stan's proposal and don't see any issue with FD.
> > >> > It's good to gate specific sequencing with cap_sys_admin.
> > >> > Also for consistency the FD is better than ID.
> > >> >
> > >> > I also like systemd analogy with Before=, After=.
> > >> > systemd has a ton more ways to specify deps between Units,
> > >> > but none of them have absolute numbers (which is what priority is).
> > >> > The only bit I'd tweak in Stan's proposal is:
> > >> > - attach before <target_fd>
> > >> > - attach after <target_fd>
> > >> > - attach before target_fd=0 == first
> > >> > - attach after target_fd=0 == last
> > >>
> > >> I think the before(), after() could work, but the target_fd I have my doubts
> > >> that it will be practical. Maybe lets walk through a concrete real example. app_a
> > >> and app_b shipped via container_a resp container_b. Both want to install tc BPF
> > >> and we (operator/user) want to say that prog from app_b should only be inserted
> > >> after the one from app_a, never run before; if no prog_a is installed, we ofc just
> > >> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
> > >> can only run after the former. How would we get to one anothers target fd? One
> > >> could use the 0, but not if more programs sit before/after.
> > >
> > > I read your desired use case several times and probably still didn't get it.
> > > Sounds like prog_b can just do after(fd=0) to become last.
> > > And prog_a can do before(fd=0).
> > > Whichever the order of attaching (a or b) these two will always
> > > be in a->b order.
> >
> > I agree that it's probably not feasible to have programs themselves
> > coordinate between themselves except for "install me last/first" type
> > semantics.
> >
> > I.e., the "before/after target_fd" is useful for a single application
> > that wants to install two programs in a certain order. Or for bpftool
> > for manual/debugging work.
>
> yep
>
> > System-wide policy (which includes "two containers both using BPF") is
> > going to need some kind of policy agent/daemon anyway. And the in-kernel
> > function override is the only feasible way to do that.
>
> yep
>
> > > Since the first and any prog returning !TC_NEXT will abort
> > > the chain we'd need __weak nop orchestrator prog to interpret
> > > retval for anything to be useful.
> >
> > If we also want the orchestrator to interpret return codes, that
> > probably implies generating a BPF program that does the dispatching,
> > right? (since the attachment is per-interface we can't reuse the same
> > one). So maybe we do need to go the route of the (overridable) usermode
> > helper that gets all the program FDs and generates a BPF dispatcher
> > program? Or can we do this with a __weak function that emits bytecode
> > inside the kernel without being unsafe?
>
> hid-bpf, cgroup-rstat, netfilter-bpf are facing similar issue.
> The __weak override with one prog is certainly limiting.
> And every case needs different demux.
> I think we need to generalize xdp dispatcher to address this.
> For example, for the case:
> __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
>                                      struct cgroup *parent, int cpu)
> {
> }
>
> we can say that 1st argument to nop function will be used as
> 'demuxing entity'.
> Sort of like if we had added a 'prog' pointer to 'struct cgroup',
> but instead of burning 8 byte in every struct cgroup we can generate
> 'dispatcher asm' only for specific pointers.
> In case of fuse-bpf that pointer will be a pointer to hid device and
> demux will be done based on device. It can be an integer too.
> The subsystem that defines __weak func can pick whatever int or pointer
> as a first argument and dispatcher routine will generate code:
> if (arg1 == constA) progA(arg1, arg2, ...);
> else if (arg1 == constB) progB(arg1, arg2, ...);
> ...
> else nop();
>
> This way the 'nop' property of __weak is preserved until user space
> passes (constA, progA) tuple to the kernel to generate dispatcher
> for that __weak hook.
>
> > Anyway, I'm OK with deferring the orchestrator mechanism and going with
> > Stanislav's proposal as an initial API.
>
> Great. Looks like we're converging :) Hope Daniel is ok with this direction.

No one proposed a slight variation on what Daniel was proposing with
prios that might work just as well. So for completeness, what if
instead of specifying 0 or explicit prio, we allow specifying either:
  - explicit prio, and if that prio is taken -- fail
  - min_prio, and kernel will find smallest untaken prio >= min_prio;
we can also define that min_prio=-1 means append as the very last one.

So if someone needs to be the very first -- explicitly request prio=1.
If one wants to be last: prio=0, min_prio=-1. If we want to observe, we
can do something like min_prio=50 to leave a bunch of slots free for
some other programs for which exact order matters.
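
A minimal sketch of that allocation rule, just to pin down the semantics
(kernel-side flavour, input validation and locking omitted; 'taken' is the
sorted list of prios already in use):

static int pick_prio(const int *taken, int cnt, int prio, int min_prio)
{
        int i, cand;

        if (prio > 0) {
                for (i = 0; i < cnt; i++)
                        if (taken[i] == prio)
                                return -EBUSY;  /* explicit prio collision */
                return prio;
        }
        if (min_prio == -1)                     /* append as the very last one */
                return cnt ? taken[cnt - 1] + 1 : 1;
        cand = min_prio;
        for (i = 0; i < cnt; i++)               /* taken[] sorted ascending */
                if (taken[i] == cand)
                        cand++;
        return cand;
}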

This whole before/after FD interface seems a bit hypothetical as well,
tbh. If it's multiple programs of the same application, then just
taking a few slots (either explicitly with prio or just best-effort
min_prio) is just fine, no need to deal with FDs. If there is no
coordination between apps, I'm not sure how you'd know that you want
to be before or after some other program's FD? How do you identify
what program it is, by its name?

It seems more pragmatic that Cilium takes the very first slot (or a
bunch of slots) at startup to control exact location. And if that
fails, then fail startup or (given enough permissions) force-detach
existing link and install your own.

Just an idea for completeness, don't have much of a horse in this race.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-13 18:30                           ` Andrii Nakryiko
@ 2022-10-14 15:38                             ` Alexei Starovoitov
  2022-10-27  9:01                               ` Daniel Xu
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2022-10-14 15:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Daniel Borkmann,
	Stanislav Fomichev, bpf, Nikolay Aleksandrov, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, John Fastabend, Joanne Koong,
	Kumar Kartikeya Dwivedi, Joe Stringer, Network Development

On Thu, Oct 13, 2022 at 11:30 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sat, Oct 8, 2022 at 1:38 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sat, Oct 08, 2022 at 01:38:54PM +0200, Toke Høiland-Jørgensen wrote:
> > > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> > >
> > > > On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >>
> > > >> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> > > >> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > >> [...]
> > > >> >>>> I was thinking a little about how this might work; i.e., how can the
> > > >> >>>> kernel expose the required knobs to allow a system policy to be
> > > >> >>>> implemented without program loading having to talk to anything other
> > > >> >>>> than the syscall API?
> > > >> >>>
> > > >> >>>> How about we only expose prepend/append in the prog attach UAPI, and
> > > >> >>>> then have a kernel function that does the sorting like:
> > > >> >>>
> > > >> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> > > >> >>>> bpf_prog *new_prog, bool append)
> > > >> >>>
> > > >> >>>> where the default implementation just appends/prepends to the array in
> > > >> >>>> progs depending on the value of 'appen'.
> > > >> >>>
> > > >> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
> > > >> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
> > > >> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
> > > >> >>>> to just shift around the order of progs in the 'progs' array whenever a
> > > >> >>>> program is loaded/unloaded?
> > > >> >>>
> > > >> >>>> This way, a userspace daemon can implement any policy it wants by just
> > > >> >>>> attaching to that hook, and keeping things like how to express
> > > >> >>>> dependencies as a userspace concern?
> > > >> >>>
> > > >> >>> What if we do the above, but instead of simple global 'attach first/last',
> > > >> >>> the default api would be:
> > > >> >>>
> > > >> >>> - attach before <target_fd>
> > > >> >>> - attach after <target_fd>
> > > >> >>> - attach before target_fd=-1 == first
> > > >> >>> - attach after target_fd=-1 == last
> > > >> >>>
> > > >> >>> ?
> > > >> >>
> > > >> >> Hmm, the problem with that is that applications don't generally have an
> > > >> >> fd to another application's BPF programs; and obtaining them from an ID
> > > >> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> > > >> >> before target *ID*" instead, which could work I guess? But then the
> > > >> >> problem becomes that it's racy: the ID you're targeting could get
> > > >> >> detached before you attach, so you'll need to be prepared to check that
> > > >> >> and retry; and I'm almost certain that applications won't test for this,
> > > >> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> > > >> >> pessimistic here?
> > > >> >
> > > >> > I like Stan's proposal and don't see any issue with FD.
> > > >> > It's good to gate specific sequencing with cap_sys_admin.
> > > >> > Also for consistency the FD is better than ID.
> > > >> >
> > > >> > I also like systemd analogy with Before=, After=.
> > > >> > systemd has a ton more ways to specify deps between Units,
> > > >> > but none of them have absolute numbers (which is what priority is).
> > > >> > The only bit I'd tweak in Stan's proposal is:
> > > >> > - attach before <target_fd>
> > > >> > - attach after <target_fd>
> > > >> > - attach before target_fd=0 == first
> > > >> > - attach after target_fd=0 == last
> > > >>
> > > >> I think the before(), after() could work, but the target_fd I have my doubts
> > > >> that it will be practical. Maybe lets walk through a concrete real example. app_a
> > > >> and app_b shipped via container_a resp container_b. Both want to install tc BPF
> > > >> and we (operator/user) want to say that prog from app_b should only be inserted
> > > >> after the one from app_a, never run before; if no prog_a is installed, we ofc just
> > > >> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
> > > >> can only run after the former. How would we get to one anothers target fd? One
> > > >> could use the 0, but not if more programs sit before/after.
> > > >
> > > > I read your desired use case several times and probably still didn't get it.
> > > > Sounds like prog_b can just do after(fd=0) to become last.
> > > > And prog_a can do before(fd=0).
> > > > Whichever the order of attaching (a or b) these two will always
> > > > be in a->b order.
> > >
> > > I agree that it's probably not feasible to have programs themselves
> > > coordinate between themselves except for "install me last/first" type
> > > semantics.
> > >
> > > I.e., the "before/after target_fd" is useful for a single application
> > > that wants to install two programs in a certain order. Or for bpftool
> > > for manual/debugging work.
> >
> > yep
> >
> > > System-wide policy (which includes "two containers both using BPF") is
> > > going to need some kind of policy agent/daemon anyway. And the in-kernel
> > > function override is the only feasible way to do that.
> >
> > yep
> >
> > > > Since the first and any prog returning !TC_NEXT will abort
> > > > the chain we'd need __weak nop orchestrator prog to interpret
> > > > retval for anything to be useful.
> > >
> > > If we also want the orchestrator to interpret return codes, that
> > > probably implies generating a BPF program that does the dispatching,
> > > right? (since the attachment is per-interface we can't reuse the same
> > > one). So maybe we do need to go the route of the (overridable) usermode
> > > helper that gets all the program FDs and generates a BPF dispatcher
> > > program? Or can we do this with a __weak function that emits bytecode
> > > inside the kernel without being unsafe?
> >
> > hid-bpf, cgroup-rstat, netfilter-bpf are facing similar issue.
> > The __weak override with one prog is certainly limiting.
> > And every case needs different demux.
> > I think we need to generalize xdp dispatcher to address this.
> > For example, for the case:
> > __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
> >                                      struct cgroup *parent, int cpu)
> > {
> > }
> >
> > we can say that 1st argument to nop function will be used as
> > 'demuxing entity'.
> > Sort of like if we had added a 'prog' pointer to 'struct cgroup',
> > but instead of burning 8 byte in every struct cgroup we can generate
> > 'dispatcher asm' only for specific pointers.
> > In case of fuse-bpf that pointer will be a pointer to hid device and
> > demux will be done based on device. It can be an integer too.
> > The subsystem that defines __weak func can pick whatever int or pointer
> > as a first argument and dispatcher routine will generate code:
> > if (arg1 == constA) progA(arg1, arg2, ...);
> > else if (arg1 == constB) progB(arg1, arg2, ...);
> > ...
> > else nop();
> >
> > This way the 'nop' property of __weak is preserved until user space
> > passes (constA, progA) tuple to the kernel to generate dispatcher
> > for that __weak hook.
> >
> > > Anyway, I'm OK with deferring the orchestrator mechanism and going with
> > > Stanislav's proposal as an initial API.
> >
> > Great. Looks like we're converging :) Hope Daniel is ok with this direction.
>
> No one proposed a slight variation on what Daniel was proposing with
> prios that might work just as well. So for completeness, what if
> instead of specifying 0 or explicit prio, we allow specifying either:
>   - explicit prio, and if that prio is taken -- fail
>   - min_prio, and kernel will find smallest untaken prio >= min_prio;
> we can also define that min_prio=-1 means append as the very last one.
>
> So if someone needs to be the very first -- explicitly request prio=1.
> If wants to be last: prio=0, min_prio=-1. If we want to observe, we
> can do something like min_prio=50 to leave a bunch of slots free for
> some other programs for which exact order matters.

Daniel was suggesting more or less the same thing.
My point is that prio is an unnecessary concept and the uapi
will be stuck with it, including the query interface
and bpftool printing it.

> This whole before/after FD interface seems a bit hypothetical as well,
> tbh.

The fd approach is not better. It's not more flexible.
That was not the point.
The point is that fd does not add stuff to uapi that
bpftool has to print and later the user has to somehow interpret.
prio is that magic number that users would have to understand,
but for them it's meaningless. The users want to see the order
of progs on query and select the order on attach.

> If it's multiple programs of the same application, then just
> taking a few slots (either explicitly with prio or just best-effort
> min_prio) is just fine, no need to deal with FDs. If there is no
> coordination betweens apps, I'm not sure how you'd know that you want
> to be before or after some other program's FD? How do you identify
> what program it is, by it's name?
>
> It seems more pragmatic that Cilium takes the very first slot (or a
> bunch of slots) at startup to control exact location. And if that
> fails, then fail startup or (given enough permissions) force-detach
> existing link and install your own.
>
> Just an idea for completeness, don't have much of a horse in this race.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
  2022-10-14 15:38                             ` Alexei Starovoitov
@ 2022-10-27  9:01                               ` Daniel Xu
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Xu @ 2022-10-27  9:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Toke Høiland-Jørgensen,
	Daniel Borkmann, Stanislav Fomichev, bpf, Nikolay Aleksandrov,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	John Fastabend, Joanne Koong, Kumar Kartikeya Dwivedi,
	Joe Stringer, Network Development

Hi Alexei,

On Fri, Oct 14, 2022 at 08:38:52AM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 13, 2022 at 11:30 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >

[...]

> > No one proposed a slight variation on what Daniel was proposing with
> > prios that might work just as well. So for completeness, what if
> > instead of specifying 0 or explicit prio, we allow specifying either:
> >   - explicit prio, and if that prio is taken -- fail
> >   - min_prio, and kernel will find smallest untaken prio >= min_prio;
> > we can also define that min_prio=-1 means append as the very last one.
> >
> > So if someone needs to be the very first -- explicitly request prio=1.
> > If wants to be last: prio=0, min_prio=-1. If we want to observe, we
> > can do something like min_prio=50 to leave a bunch of slots free for
> > some other programs for which exact order matters.
> 
> Daniel, was suggesting more or less the same thing.
> My point is that prio is an unnecessary concept and uapi
> will be stuck with it. Including query interface
> and bpftool printing it.

I apologize if I'm piling onto the bikeshedding, but I've been working a
lot more with TC bpf lately so I thought I'd offer some thoughts.

I quite like the intent of this patchset; it'll help simplify using TC bpf
greatly. I also think what Andrii is suggesting makes a lot of sense. My
biggest gripe with TC priorities is that:

1. "Priority" is a rather arbitrary concept and hard to come up with
values for.

2. The default replace-on-collision semantic (IIRC) is error prone as
evidenced by this patch's motivation.

My suggestion here is to rename "priority" -> "position". Maybe it's
just me but I think "priority" is too vague of a concept whereas a
0-indexed "position" is rather unambiguous.

> 
> > This whole before/after FD interface seems a bit hypothetical as well,
> > tbh.
> 
> The fd approach is not better. It's not more flexible.
> That was not the point.
> The point is that fd does not add stuff to uapi that
> bpftool has to print and later the user has to somehow interpret.
> prio is that magic number that users would have to understand,
> but for them it's meaningless. The users want to see the order
> of progs on query and select the order on attach.

While I appreciate how the FD based approach leaves less confusing
values for bpftool to dump, I see a small semantic ambiguity with it:

For example, say we start with a single prog, A. Then add B as
"after-A".  Then add C as "before-B". It's unclear what'll happen.
Either you invalidate B's guarantees or you return an error. If you
invalidate, that's unfortunate. If you error, how does userspace retry?
You'd have to express all the existing relationships to the user
through bpftool or something. Whereas with Andrii's proposal it's
unambiguous.

[...]

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2022-10-27  9:02 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
2022-10-05  0:55   ` sdf
2022-10-05 10:50     ` Toke Høiland-Jørgensen
2022-10-05 14:48       ` Daniel Borkmann
2022-10-05 12:35     ` Daniel Borkmann
2022-10-05 17:56       ` sdf
2022-10-05 18:21         ` Daniel Borkmann
2022-10-05 10:33   ` Toke Høiland-Jørgensen
2022-10-05 12:47     ` Daniel Borkmann
2022-10-05 14:32       ` Toke Høiland-Jørgensen
2022-10-05 14:53         ` Daniel Borkmann
2022-10-05 19:04   ` Jamal Hadi Salim
2022-10-06 20:49     ` Daniel Borkmann
2022-10-07 15:36       ` Jamal Hadi Salim
2022-10-06  0:22   ` Andrii Nakryiko
2022-10-06  5:00   ` Alexei Starovoitov
2022-10-06 14:40     ` Jamal Hadi Salim
2022-10-06 23:29       ` Alexei Starovoitov
2022-10-07 15:43         ` Jamal Hadi Salim
2022-10-06 21:29     ` Daniel Borkmann
2022-10-06 23:28       ` Alexei Starovoitov
2022-10-07 13:26         ` Daniel Borkmann
2022-10-07 14:32           ` Toke Høiland-Jørgensen
2022-10-07 16:55             ` sdf
2022-10-07 17:20               ` Toke Høiland-Jørgensen
2022-10-07 18:11                 ` sdf
2022-10-07 19:06                   ` Daniel Borkmann
2022-10-07 18:59                 ` Alexei Starovoitov
2022-10-07 19:37                   ` Daniel Borkmann
2022-10-07 22:45                     ` sdf
2022-10-07 23:41                       ` Alexei Starovoitov
2022-10-07 23:34                     ` Alexei Starovoitov
2022-10-08 11:38                       ` Toke Høiland-Jørgensen
2022-10-08 20:38                         ` Alexei Starovoitov
2022-10-13 18:30                           ` Andrii Nakryiko
2022-10-14 15:38                             ` Alexei Starovoitov
2022-10-27  9:01                               ` Daniel Xu
2022-10-06 20:15   ` Martin KaFai Lau
2022-10-06 20:54   ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 02/10] bpf: Implement BPF link handling for " Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-06 20:54     ` Daniel Borkmann
2022-10-06 17:56   ` Martin KaFai Lau
2022-10-06 20:10   ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 03/10] bpf: Implement link update for tc BPF link programs Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 04/10] bpf: Implement link introspection " Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-06 23:14   ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 05/10] bpf: Implement link detach " Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-06 23:24   ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 06/10] libbpf: Change signature of bpf_prog_query Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 07/10] libbpf: Add extended attach/detach opts Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 08/10] libbpf: Add support for BPF tc link Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 09/10] bpftool: Add support for tc fd-based attach types Daniel Borkmann
2022-10-04 23:11 ` [PATCH bpf-next 10/10] bpf, selftests: Add various BPF tc link selftests Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
