* [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
@ 2021-05-28 19:59 Kumar Kartikeya Dwivedi
  2021-05-28 19:59 ` [PATCH RFC bpf-next 1/7] net: sched: refactor cls_bpf creation code Kumar Kartikeya Dwivedi
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-05-28 19:59 UTC (permalink / raw)
  To: bpf
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev

This is the first RFC version.

This adds a bpf_link path to create TC filters tied to the cls_bpf
classifier, and introduces fd based ownership for such TC filters. Netlink
cannot delete or
replace such filters, but the bpf_link is severed on indirect destruction of the
filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
that filters remain attached beyond process lifetime, the usual bpf_link fd
pinning approach can be used.
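
As an illustration, a minimal sketch of the pinning approach with libbpf
(assuming the attach helper added in patch 6 and an already prepared
hook/opts; the bpffs pin path is arbitrary):

  struct bpf_link *link;

  link = bpf_program__attach_tc(prog, &hook, &opts);
  if (libbpf_get_error(link))
          return -1;

  /* The filter now outlives this process until the pin is removed. */
  if (bpf_link__pin(link, "/sys/fs/bpf/tc_link"))
          return -1;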

The individual patches contain more details and comments, but the overall kernel
API and libbpf helper mirror the semantics of the netlink based TC-BPF API
merged recently. This means that we start by always setting direct action mode,
protocol to ETH_P_ALL, chain_index as 0, etc. If more options are needed, they
can easily be exposed through the bpf_link API in the future.

Patch 1 refactors cls_bpf change function to extract two helpers that will be
reused in bpf_link creation.

Patch 2 exports some bpf_link management functions to modules. This is needed
because our bpf_link object is tied to the cls_bpf_prog object. Tying it to
tcf_proto would be weird, because the update path has to replace the
offloaded bpf prog, which happens using internal cls_bpf helpers; it would
in general mean more code to abstract over an operation that is unlikely to
be implemented for other filter types.

Patch 3 adds the main bpf_link API. A function in cls_api takes care of
obtaining block reference, creating the filter object, and then calls the
bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after
setting up the internal structures. An optimization is made to not keep around
resources for extended actions, which is explained in a code comment as it wasn't
immediately obvious.

Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports
replacing the bpf_prog, we can skip tc filter's change path by reusing the
filter object but swapping its bpf_prog. This takes care of replacing the
offloaded prog as well (if that fails, the update is aborted). So far,
tcf_classify could do a normal load (possibly torn) as cls_bpf_prog->filter
would never be modified concurrently. This is no longer true, and to avoid
penalizing the classify hot path, we also cannot impose serialization around
its load. Hence the load is changed to READ_ONCE, so that the pointer value
is always consistent. Since the call happens in an RCU critical section, the
lifetime of the prog is guaranteed for the duration of the call.

Patches 5 and 6 take care of updating the userspace bits and add a bpf_link
returning function to libbpf.

Patch 7 adds a selftest that exercises all possible problematic interactions
that I could think of.

Design:

This is where our bpf_link object sits in the object hierarchy:

                                                                            ┌─────┐
                                                                            │     │
                                                                            │ BPF │
                                                                            │prog │
                                                                            │     │
                                                                            └──▲──┘
                                                      ┌───────┐                │
                                                      │       │         ┌──────┴───────┐
                                                      │  mod  ├─────────► cls_bpf_prog │
┌────────────────┐                                    │cls_bpf│         └────┬───▲─────┘
│    tcf_block   │                                    │       │              │   │
└────────┬───────┘                                    └───▲───┘              │   │
         │          ┌─────────────┐                       │                ┌─▼───┴──┐
         └──────────►  tcf_chain  │                       │                │bpf_link│
                    └───────┬─────┘                       │                └────────┘
                            │          ┌─────────────┐    │
                            └──────────►  tcf_proto  ├────┘
                                       └─────────────┘

The bpf_link is detached on destruction of the cls_bpf_prog. Doing it this way
allows us to implement update in a lightweight manner: instead of recreating
the filter, we can simply replace the BPF prog attached to the cls_bpf_prog.

The other way to do it would be to link the bpf_link to tcf_proto, but there
are numerous downsides to this:

1. All filter types would have to embed the pointer, even though only cls_bpf
makes use of it.
2. This probably won't make sense to be extended to other filter types anyway.
3. We aren't able to optimize the update case without adding another bpf_link
specific update operation to tcf_proto ops.

The downside of tying this to the module is having to export bpf_link
management functions and introducing a tcf_proto op. Hopefully the cost of
another op func pointer is acceptable, as there is only one ops struct per
module.

This first version is to collect feedback on the approach and get ideas if there
is a better way to do this.

Kumar Kartikeya Dwivedi (7):
  net: sched: refactor cls_bpf creation code
  bpf: export bpf_link functions for modules
  net: sched: add bpf_link API for bpf classifier
  net: sched: add lightweight update path for cls_bpf
  tools: bpf.h: sync with kernel sources
  libbpf: add bpf_link based TC-BPF management API
  libbpf: add selftest for bpf_link based TC-BPF management API

 include/linux/bpf_types.h                     |   3 +
 include/net/pkt_cls.h                         |  13 +
 include/net/sch_generic.h                     |   6 +-
 include/uapi/linux/bpf.h                      |  15 +
 kernel/bpf/syscall.c                          |  14 +-
 net/sched/cls_api.c                           | 138 ++++++-
 net/sched/cls_bpf.c                           | 386 ++++++++++++++++--
 tools/include/uapi/linux/bpf.h                |  15 +
 tools/lib/bpf/bpf.c                           |   5 +
 tools/lib/bpf/bpf.h                           |   8 +-
 tools/lib/bpf/libbpf.c                        |  59 ++-
 tools/lib/bpf/libbpf.h                        |  17 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/lib/bpf/netlink.c                       |   5 +-
 tools/lib/bpf/netlink.h                       |   8 +
 .../selftests/bpf/prog_tests/tc_bpf_link.c    | 285 +++++++++++++
 16 files changed, 934 insertions(+), 44 deletions(-)
 create mode 100644 tools/lib/bpf/netlink.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c

-- 
2.31.1



* [PATCH RFC bpf-next 1/7] net: sched: refactor cls_bpf creation code
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
@ 2021-05-28 19:59 ` Kumar Kartikeya Dwivedi
  2021-05-28 19:59 ` [PATCH RFC bpf-next 2/7] bpf: export bpf_link functions for modules Kumar Kartikeya Dwivedi
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-05-28 19:59 UTC (permalink / raw)
  To: bpf
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev

Move parts of the code that are independent and need to be reused into
their own helpers. These will be needed for adding a bpf_link creation
path in a subsequent patch.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 net/sched/cls_bpf.c | 84 ++++++++++++++++++++++++++++-----------------
 1 file changed, 53 insertions(+), 31 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 6e3e63db0e01..360b97ab8646 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -455,6 +455,57 @@ static int cls_bpf_set_parms(struct net *net, struct tcf_proto *tp,
 	return 0;
 }
 
+static int __cls_bpf_alloc_idr(struct cls_bpf_head *head, u32 handle,
+			       struct cls_bpf_prog *prog,
+			       struct cls_bpf_prog *oldprog)
+{
+	int ret = 0;
+
+	if (oldprog) {
+		if (handle && oldprog->handle != handle)
+			return -EINVAL;
+	}
+
+	if (handle == 0) {
+		handle = 1;
+		ret = idr_alloc_u32(&head->handle_idr, prog, &handle, INT_MAX,
+				    GFP_KERNEL);
+	} else if (!oldprog) {
+		ret = idr_alloc_u32(&head->handle_idr, prog, &handle, handle,
+				    GFP_KERNEL);
+	}
+
+	prog->handle = handle;
+	return ret;
+}
+
+static int __cls_bpf_change(struct cls_bpf_head *head, struct tcf_proto *tp,
+			    struct cls_bpf_prog *prog,
+			    struct cls_bpf_prog *oldprog,
+			    struct netlink_ext_ack *extack)
+{
+	int ret;
+
+	ret = cls_bpf_offload(tp, prog, oldprog, extack);
+	if (ret)
+		return ret;
+
+	if (!tc_in_hw(prog->gen_flags))
+		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
+
+	if (oldprog) {
+		idr_replace(&head->handle_idr, prog, prog->handle);
+		list_replace_rcu(&oldprog->link, &prog->link);
+		tcf_unbind_filter(tp, &oldprog->res);
+		tcf_exts_get_net(&oldprog->exts);
+		tcf_queue_work(&oldprog->rwork, cls_bpf_delete_prog_work);
+	} else {
+		list_add_rcu(&prog->link, &head->plist);
+	}
+
+	return 0;
+}
+
 static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 			  struct tcf_proto *tp, unsigned long base,
 			  u32 handle, struct nlattr **tca,
@@ -483,48 +534,19 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	if (ret < 0)
 		goto errout;
 
-	if (oldprog) {
-		if (handle && oldprog->handle != handle) {
-			ret = -EINVAL;
-			goto errout;
-		}
-	}
-
-	if (handle == 0) {
-		handle = 1;
-		ret = idr_alloc_u32(&head->handle_idr, prog, &handle,
-				    INT_MAX, GFP_KERNEL);
-	} else if (!oldprog) {
-		ret = idr_alloc_u32(&head->handle_idr, prog, &handle,
-				    handle, GFP_KERNEL);
-	}
-
+	ret = __cls_bpf_alloc_idr(head, handle, prog, oldprog);
 	if (ret)
 		goto errout;
-	prog->handle = handle;
 
 	ret = cls_bpf_set_parms(net, tp, prog, base, tb, tca[TCA_RATE], ovr,
 				extack);
 	if (ret < 0)
 		goto errout_idr;
 
-	ret = cls_bpf_offload(tp, prog, oldprog, extack);
+	ret = __cls_bpf_change(head, tp, prog, oldprog, extack);
 	if (ret)
 		goto errout_parms;
 
-	if (!tc_in_hw(prog->gen_flags))
-		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
-
-	if (oldprog) {
-		idr_replace(&head->handle_idr, prog, handle);
-		list_replace_rcu(&oldprog->link, &prog->link);
-		tcf_unbind_filter(tp, &oldprog->res);
-		tcf_exts_get_net(&oldprog->exts);
-		tcf_queue_work(&oldprog->rwork, cls_bpf_delete_prog_work);
-	} else {
-		list_add_rcu(&prog->link, &head->plist);
-	}
-
 	*arg = prog;
 	return 0;
 
-- 
2.31.1



* [PATCH RFC bpf-next 2/7] bpf: export bpf_link functions for modules
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
  2021-05-28 19:59 ` [PATCH RFC bpf-next 1/7] net: sched: refactor cls_bpf creation code Kumar Kartikeya Dwivedi
@ 2021-05-28 19:59 ` Kumar Kartikeya Dwivedi
  2021-05-28 19:59 ` [PATCH RFC bpf-next 3/7] net: sched: add bpf_link API for bpf classifier Kumar Kartikeya Dwivedi
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-05-28 19:59 UTC (permalink / raw)
  To: bpf
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev

These are required in a subsequent patch to implement the bpf_link
command for cls_bpf. Since the bpf_link object is tied to the
cls_bpf_prog object, it has to be initialized and managed from inside
the module.
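
For reference, the in-module usage pattern is roughly the sketch below,
mirroring what the cls_bpf patch does later in this series (error paths
abbreviated; my_link_ops is a placeholder for the module's bpf_link_ops):

  struct bpf_link_primer primer;
  int err;

  bpf_link_init(&link->link, BPF_LINK_TYPE_TC, &my_link_ops, prog);

  err = bpf_link_prime(&link->link, &primer);
  if (err < 0)
          goto err_free;  /* link not visible yet, plain kfree is fine */

  /* ... set up internal structures; on failure, bpf_link_cleanup() ... */

  return bpf_link_settle(&primer);  /* installs and returns the fd */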

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/syscall.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 50457019da27..e5934b748ced 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2338,6 +2338,7 @@ void bpf_link_init(struct bpf_link *link, enum bpf_link_type type,
 	link->ops = ops;
 	link->prog = prog;
 }
+EXPORT_SYMBOL_GPL(bpf_link_init);
 
 static void bpf_link_free_id(int id)
 {
@@ -2363,6 +2364,7 @@ void bpf_link_cleanup(struct bpf_link_primer *primer)
 	fput(primer->file);
 	put_unused_fd(primer->fd);
 }
+EXPORT_SYMBOL_GPL(bpf_link_cleanup);
 
 void bpf_link_inc(struct bpf_link *link)
 {
@@ -2510,6 +2512,7 @@ int bpf_link_prime(struct bpf_link *link, struct bpf_link_primer *primer)
 	primer->id = id;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(bpf_link_prime);
 
 int bpf_link_settle(struct bpf_link_primer *primer)
 {
@@ -2522,6 +2525,7 @@ int bpf_link_settle(struct bpf_link_primer *primer)
 	/* pass through installed FD */
 	return primer->fd;
 }
+EXPORT_SYMBOL_GPL(bpf_link_settle);
 
 int bpf_link_new_fd(struct bpf_link *link)
 {
-- 
2.31.1



* [PATCH RFC bpf-next 3/7] net: sched: add bpf_link API for bpf classifier
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
  2021-05-28 19:59 ` [PATCH RFC bpf-next 1/7] net: sched: refactor cls_bpf creation code Kumar Kartikeya Dwivedi
  2021-05-28 19:59 ` [PATCH RFC bpf-next 2/7] bpf: export bpf_link functions for modules Kumar Kartikeya Dwivedi
@ 2021-05-28 19:59 ` Kumar Kartikeya Dwivedi
  2021-06-02 20:56   ` Andrii Nakryiko
  2021-05-28 19:59 ` [PATCH RFC bpf-next 4/7] net: sched: add lightweight update path for cls_bpf Kumar Kartikeya Dwivedi
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-05-28 19:59 UTC (permalink / raw)
  To: bpf
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev

This commit introduces a bpf_link based kernel API for creating tc
filters and using the cls_bpf classifier. Only a subset of what netlink
API offers is supported, things like TCA_BPF_POLICE, TCA_RATE and
embedded actions are unsupported.

The kernel API and the libbpf wrapper added in a subsequent patch are
more opinionated and mirror the semantics of low level netlink based
TC-BPF API, i.e. always setting direct action mode, always setting
protocol to ETH_P_ALL, and only exposing handle and priority as the
variables the user can control. We add an additional gen_flags parameter
though to allow for offloading use cases. It would be trivial to extend
the current API to support specifying other attributes in the future,
but for now I'm sticking to how we want to push usage.

The semantics around bpf_link support are as follows:

A user can create a classifier attached to a filter using the bpf_link
API, after which changing it and deleting it only happens through the
bpf_link API. It is not possible to bind the bpf_link to an existing
filter; any such attempt fails with EEXIST. Hence EEXIST can be returned
in two cases: when a bpf_link owned filter already exists, or when a
netlink owned filter already exists.

Removing bpf_link owned filter from netlink returns EPERM, denoting that
netlink is locked out from filter manipulation when bpf_link is
involved.

Whenever a filter is detached due to chain removal, or qdisc tear down,
or net_device shutdown, the bpf_link becomes automatically detached.

In this way, the netlink API and bpf_link creation path are exclusive
and don't stomp over one another. Filters created using the bpf_link API
cannot be replaced by the netlink API, and filters created by the netlink API
are never replaced by bpf_link. Netlink also cannot detach bpf_link filters.

We serialize all changes under rtnl_lock as cls_bpf doesn't support the
unlocked classifier API.
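
For illustration, creating the link through the raw syscall looks roughly
like the sketch below; the libbpf wrapper added later in the series hides
these details (error handling omitted):

  union bpf_attr attr = {};

  attr.link_create.prog_fd = prog_fd;
  attr.link_create.target_ifindex = ifindex;
  attr.link_create.attach_type = BPF_TC;
  attr.link_create.tc.parent = TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_INGRESS);
  attr.link_create.tc.handle = 1;
  attr.link_create.tc.priority = 1;

  link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));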

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf_types.h |   3 +
 include/net/pkt_cls.h     |  13 ++
 include/net/sch_generic.h |   6 +-
 include/uapi/linux/bpf.h  |  15 +++
 kernel/bpf/syscall.c      |  10 +-
 net/sched/cls_api.c       | 138 ++++++++++++++++++++-
 net/sched/cls_bpf.c       | 247 +++++++++++++++++++++++++++++++++++++-
 7 files changed, 426 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index a9db1eae6796..b1aaf7680917 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
 #ifdef CONFIG_NET
 BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
 #endif
+#if IS_ENABLED(CONFIG_NET_CLS_BPF)
+BPF_LINK_TYPE(BPF_LINK_TYPE_TC, tc)
+#endif
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 255e4f4b521f..c36c5d79db6b 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -2,6 +2,7 @@
 #ifndef __NET_PKT_CLS_H
 #define __NET_PKT_CLS_H
 
+#include <linux/bpf.h>
 #include <linux/pkt_cls.h>
 #include <linux/workqueue.h>
 #include <net/sch_generic.h>
@@ -45,6 +46,9 @@ bool tcf_queue_work(struct rcu_work *rwork, work_func_t func);
 struct tcf_chain *tcf_chain_get_by_act(struct tcf_block *block,
 				       u32 chain_index);
 void tcf_chain_put_by_act(struct tcf_chain *chain);
+void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
+			       struct tcf_proto *tp, bool rtnl_held,
+			       struct netlink_ext_ack *extack);
 struct tcf_chain *tcf_get_next_chain(struct tcf_block *block,
 				     struct tcf_chain *chain);
 struct tcf_proto *tcf_get_next_proto(struct tcf_chain *chain,
@@ -1004,4 +1008,13 @@ struct tc_fifo_qopt_offload {
 	};
 };
 
+#if IS_ENABLED(CONFIG_NET_CLS_BPF)
+int bpf_tc_link_attach(union bpf_attr *attr, struct bpf_prog *prog);
+#else
+static inline int bpf_tc_link_attach(union bpf_attr *attr, struct bpf_prog *prog)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 #endif
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14491fb..bacd70bfc5ed 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -341,7 +341,11 @@ struct tcf_proto_ops {
 	int			(*tmplt_dump)(struct sk_buff *skb,
 					      struct net *net,
 					      void *tmplt_priv);
-
+#if IS_ENABLED(CONFIG_NET_CLS_BPF)
+	int			(*bpf_link_change)(struct net *net, struct tcf_proto *tp,
+						   struct bpf_prog *filter, void **arg, u32 handle,
+						   u32 gen_flags);
+#endif
 	struct module		*owner;
 	int			flags;
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2c1ba70abbf1..a3488463d145 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -994,6 +994,7 @@ enum bpf_attach_type {
 	BPF_SK_LOOKUP,
 	BPF_XDP,
 	BPF_SK_SKB_VERDICT,
+	BPF_TC,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1007,6 +1008,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_ITER = 4,
 	BPF_LINK_TYPE_NETNS = 5,
 	BPF_LINK_TYPE_XDP = 6,
+	BPF_LINK_TYPE_TC = 7,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -1447,6 +1449,12 @@ union bpf_attr {
 				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
 				__u32		iter_info_len;	/* iter_info length */
 			};
+			struct { /* used by BPF_TC */
+				__u32 parent;
+				__u32 handle;
+				__u32 gen_flags;
+				__u16 priority;
+			} tc;
 		};
 	} link_create;
 
@@ -5519,6 +5527,13 @@ struct bpf_link_info {
 		struct {
 			__u32 ifindex;
 		} xdp;
+		struct {
+			__u32 ifindex;
+			__u32 parent;
+			__u32 handle;
+			__u32 gen_flags;
+			__u16 priority;
+		} tc;
 	};
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e5934b748ced..ce7c00ea135c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
  */
+#include <net/pkt_cls.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
@@ -3027,6 +3028,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_SK_LOOKUP;
 	case BPF_XDP:
 		return BPF_PROG_TYPE_XDP;
+	case BPF_TC:
+		return BPF_PROG_TYPE_SCHED_CLS;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -4085,7 +4088,7 @@ static int tracing_bpf_link_attach(const union bpf_attr *attr, bpfptr_t uattr,
 	return -EINVAL;
 }
 
-#define BPF_LINK_CREATE_LAST_FIELD link_create.iter_info_len
+#define BPF_LINK_CREATE_LAST_FIELD link_create.tc.priority
 static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 {
 	enum bpf_prog_type ptype;
@@ -4136,6 +4139,11 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	case BPF_PROG_TYPE_XDP:
 		ret = bpf_xdp_link_attach(attr, prog);
 		break;
+#endif
+#if IS_ENABLED(CONFIG_NET_CLS_BPF)
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = bpf_tc_link_attach(attr, prog);
+		break;
 #endif
 	default:
 		ret = -EINVAL;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 75e3a288a7c8..1fd804fe55bf 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -9,6 +9,7 @@
  * Eduardo J. Blanco <ejbs@netlabs.com.uy> :990222: kmod support
  */
 
+#include <linux/bpf.h>
 #include <linux/module.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
@@ -1720,9 +1721,9 @@ static struct tcf_proto *tcf_chain_tp_insert_unique(struct tcf_chain *chain,
 	return tp_new;
 }
 
-static void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
-				      struct tcf_proto *tp, bool rtnl_held,
-				      struct netlink_ext_ack *extack)
+void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
+			       struct tcf_proto *tp, bool rtnl_held,
+			       struct netlink_ext_ack *extack)
 {
 	struct tcf_chain_info chain_info;
 	struct tcf_proto *tp_iter;
@@ -1760,6 +1761,7 @@ static void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
 
 	tcf_proto_put(tp, rtnl_held, extack);
 }
+EXPORT_SYMBOL_GPL(tcf_chain_tp_delete_empty);
 
 static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
 					   struct tcf_chain_info *chain_info,
@@ -3917,3 +3919,133 @@ static int __init tc_filter_init(void)
 }
 
 subsys_initcall(tc_filter_init);
+
+#if IS_ENABLED(CONFIG_NET_CLS_BPF)
+
+int bpf_tc_link_attach(union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct net *net = current->nsproxy->net_ns;
+	u32 chain_index, prio, protocol, parent;
+	struct tcf_chain_info chain_info;
+	struct tcf_block *block;
+	struct tcf_chain *chain;
+	struct tcf_proto *tp;
+	int err, tp_created;
+	unsigned long cl;
+	struct Qdisc *q;
+	void *fh;
+
+	/* Caller already checks bpf_capable */
+	if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (attr->link_create.flags ||
+	    !attr->link_create.target_ifindex ||
+	    !tc_flags_valid(attr->link_create.tc.gen_flags))
+		return -EINVAL;
+
+replay:
+	parent = attr->link_create.tc.parent;
+	prio = attr->link_create.tc.priority;
+	protocol = htons(ETH_P_ALL);
+	chain_index = 0;
+	tp_created = 0;
+	prio <<= 16;
+	cl = 0;
+
+	/* Address this when cls_bpf switches to RTNL_FLAG_DOIT_UNLOCKED */
+	rtnl_lock();
+
+	block = tcf_block_find(net, &q, &parent, &cl,
+			       attr->link_create.target_ifindex, parent, NULL);
+	if (IS_ERR(block)) {
+		err = PTR_ERR(block);
+		goto out_unlock;
+	}
+	block->classid = parent;
+
+	chain = tcf_chain_get(block, chain_index, true);
+	if (!chain) {
+		err = -ENOMEM;
+		goto out_block;
+	}
+
+	mutex_lock(&chain->filter_chain_lock);
+
+	tp = tcf_chain_tp_find(chain, &chain_info, protocol,
+			       prio ?: TC_H_MAKE(0x80000000U, 0U),
+			       !prio);
+	if (IS_ERR(tp)) {
+		err = PTR_ERR(tp);
+		goto out_chain_unlock;
+	}
+
+	if (!tp) {
+		struct tcf_proto *tp_new = NULL;
+
+		if (chain->flushing) {
+			err = -EAGAIN;
+			goto out_chain_unlock;
+		}
+
+		if (!prio)
+			prio = tcf_auto_prio(tcf_chain_tp_prev(chain,
+							       &chain_info));
+
+		mutex_unlock(&chain->filter_chain_lock);
+
+		tp_new = tcf_proto_create("bpf", protocol, prio, chain, true,
+					  NULL);
+		if (IS_ERR(tp_new)) {
+			err = PTR_ERR(tp_new);
+			goto out_chain;
+		}
+
+		tp_created = 1;
+		tp = tcf_chain_tp_insert_unique(chain, tp_new, protocol, prio,
+						true);
+		if (IS_ERR(tp)) {
+			err = PTR_ERR(tp);
+			goto out_chain;
+		}
+	} else {
+		mutex_unlock(&chain->filter_chain_lock);
+	}
+
+	fh = tp->ops->get(tp, attr->link_create.tc.handle);
+
+	if (!tp->ops->bpf_link_change)
+		err = -EDEADLK;
+	else
+		err = tp->ops->bpf_link_change(net, tp, prog, &fh,
+					       attr->link_create.tc.handle,
+					       attr->link_create.tc.gen_flags);
+	if (err >= 0 && q)
+		q->flags &= ~TCQ_F_CAN_BYPASS;
+
+out:
+	if (err < 0 && tp_created)
+		tcf_chain_tp_delete_empty(chain, tp, true, NULL);
+out_chain:
+	if (chain) {
+		if (!IS_ERR_OR_NULL(tp))
+			tcf_proto_put(tp, true, NULL);
+		/* Chain reference only kept for tp creation
+		 * to pair with tcf_chain_put from tcf_proto_destroy
+		 */
+		if (!tp_created)
+			tcf_chain_put(chain);
+	}
+out_block:
+	tcf_block_release(q, block, true);
+out_unlock:
+	rtnl_unlock();
+	if (err == -EAGAIN)
+		goto replay;
+	return err;
+out_chain_unlock:
+	mutex_unlock(&chain->filter_chain_lock);
+	goto out;
+}
+
+#endif
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 360b97ab8646..57d6dedb389a 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -34,6 +34,11 @@ struct cls_bpf_head {
 	struct rcu_head rcu;
 };
 
+struct cls_bpf_link {
+	struct bpf_link link;
+	struct cls_bpf_prog *prog;
+};
+
 struct cls_bpf_prog {
 	struct bpf_prog *filter;
 	struct list_head link;
@@ -48,6 +53,7 @@ struct cls_bpf_prog {
 	const char *bpf_name;
 	struct tcf_proto *tp;
 	struct rcu_work rwork;
+	struct cls_bpf_link *bpf_link;
 };
 
 static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
@@ -289,6 +295,8 @@ static void __cls_bpf_delete(struct tcf_proto *tp, struct cls_bpf_prog *prog,
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
 
+	if (prog->bpf_link)
+		prog->bpf_link->prog = NULL;
 	idr_remove(&head->handle_idr, prog->handle);
 	cls_bpf_stop_offload(tp, prog, extack);
 	list_del_rcu(&prog->link);
@@ -303,8 +311,13 @@ static int cls_bpf_delete(struct tcf_proto *tp, void *arg, bool *last,
 			  bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
+	struct cls_bpf_prog *prog = arg;
+
+	/* Cannot remove bpf_link owned filter using netlink */
+	if (prog->bpf_link)
+		return -EPERM;
 
-	__cls_bpf_delete(tp, arg, extack);
+	__cls_bpf_delete(tp, prog, extack);
 	*last = list_empty(&head->plist);
 	return 0;
 }
@@ -494,6 +507,11 @@ static int __cls_bpf_change(struct cls_bpf_head *head, struct tcf_proto *tp,
 		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
 	if (oldprog) {
+		/* Since netlink and bpf_link cannot replace a bpf_link
+		 * attached filter, this should never be true.
+		 */
+		WARN_ON(oldprog->bpf_link);
+
 		idr_replace(&head->handle_idr, prog, prog->handle);
 		list_replace_rcu(&oldprog->link, &prog->link);
 		tcf_unbind_filter(tp, &oldprog->res);
@@ -521,6 +539,10 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	if (tca[TCA_OPTIONS] == NULL)
 		return -EINVAL;
 
+	/* Can't touch bpf_link filter */
+	if (oldprog && oldprog->bpf_link)
+		return -EPERM;
+
 	ret = nla_parse_nested_deprecated(tb, TCA_BPF_MAX, tca[TCA_OPTIONS],
 					  bpf_policy, NULL);
 	if (ret < 0)
@@ -716,6 +738,228 @@ static int cls_bpf_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb
 	return 0;
 }
 
+static void cls_bpf_link_release(struct bpf_link *link)
+{
+	struct cls_bpf_link *cls_link;
+	struct cls_bpf_prog *prog;
+	struct cls_bpf_head *head;
+
+	rtnl_lock();
+
+	cls_link = container_of(link, struct cls_bpf_link, link);
+	prog = cls_link->prog;
+
+	if (prog) {
+		head = rtnl_dereference(prog->tp->root);
+		/* Deletion of the filter will unset cls_link->prog */
+		__cls_bpf_delete(prog->tp, prog, NULL);
+		if (list_empty(&head->plist))
+			tcf_chain_tp_delete_empty(prog->tp->chain, prog->tp,
+						  true, NULL);
+	}
+
+	rtnl_unlock();
+}
+
+static void cls_bpf_link_dealloc(struct bpf_link *link)
+{
+	struct cls_bpf_link *cls_link;
+
+	cls_link = container_of(link, struct cls_bpf_link, link);
+	kfree(cls_link);
+}
+
+static int cls_bpf_link_detach(struct bpf_link *link)
+{
+	cls_bpf_link_release(link);
+	return 0;
+}
+
+static void __bpf_fill_link_info(struct cls_bpf_link *link,
+				 struct bpf_link_info *info)
+{
+	struct tcf_block *block;
+	struct tcf_proto *tp;
+	struct Qdisc *q;
+
+	ASSERT_RTNL();
+
+	if (WARN_ON(!link->prog))
+		return;
+
+	tp = link->prog->tp;
+	block = tp->chain->block;
+	q = block->q;
+
+	info->tc.ifindex = q ? qdisc_dev(q)->ifindex : TCM_IFINDEX_MAGIC_BLOCK;
+	info->tc.parent = block->classid;
+	info->tc.handle = link->prog->handle;
+	info->tc.priority = tp->prio >> 16;
+	info->tc.gen_flags = link->prog->gen_flags;
+}
+
+#ifdef CONFIG_PROC_FS
+
+static void cls_bpf_link_show_fdinfo(const struct bpf_link *link,
+				     struct seq_file *seq)
+{
+	struct cls_bpf_link *cls_link;
+	struct bpf_link_info info = {};
+
+	rtnl_lock();
+
+	cls_link = container_of(link, struct cls_bpf_link, link);
+	if (!cls_link->prog)
+		goto out;
+
+	__bpf_fill_link_info(cls_link, &info);
+
+	seq_printf(seq,
+		   "ifindex:\t%u\n"
+		   "parent:\t%u\n"
+		   "handle:\t%u\n"
+		   "priority:\t%u\n"
+		   "gen_flags:\t%u\n",
+		   info.tc.ifindex, info.tc.parent,
+		   info.tc.handle, (u32)info.tc.priority,
+		   info.tc.gen_flags);
+
+out:
+	rtnl_unlock();
+}
+
+#endif
+
+static int cls_bpf_link_fill_link_info(const struct bpf_link *link,
+				       struct bpf_link_info *info)
+{
+	struct cls_bpf_link *cls_link;
+	int ret = 0;
+
+	rtnl_lock();
+
+	cls_link = container_of(link, struct cls_bpf_link, link);
+	if (!cls_link->prog) {
+		ret = -ENOLINK;
+		goto out;
+	}
+
+	__bpf_fill_link_info(cls_link, info);
+
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static const struct bpf_link_ops cls_bpf_link_ops = {
+	.release = cls_bpf_link_release,
+	.dealloc = cls_bpf_link_dealloc,
+	.detach = cls_bpf_link_detach,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = cls_bpf_link_show_fdinfo,
+#endif
+	.fill_link_info = cls_bpf_link_fill_link_info,
+};
+
+static inline char *cls_bpf_link_name(u32 prog_id, const char *name)
+{
+	char *str = kmalloc(CLS_BPF_NAME_LEN, GFP_KERNEL);
+
+	if (str)
+		snprintf(str, CLS_BPF_NAME_LEN, "%s:[%u]", name, prog_id);
+
+	return str;
+}
+
+static int cls_bpf_link_change(struct net *net, struct tcf_proto *tp,
+			       struct bpf_prog *filter, void **arg,
+			       u32 handle, u32 gen_flags)
+{
+	struct cls_bpf_head *head = rtnl_dereference(tp->root);
+	struct cls_bpf_prog *oldprog = *arg, *prog;
+	struct bpf_link_primer primer;
+	struct cls_bpf_link *link;
+	int ret;
+
+	if (gen_flags & ~CLS_BPF_SUPPORTED_GEN_FLAGS)
+		return -EINVAL;
+
+	if (oldprog)
+		return -EEXIST;
+
+	prog = kzalloc(sizeof(*prog), GFP_KERNEL);
+	if (!prog)
+		return -ENOMEM;
+
+	link = kzalloc(sizeof(*link), GFP_KERNEL);
+	if (!link) {
+		ret = -ENOMEM;
+		goto err_prog;
+	}
+
+	bpf_link_init(&link->link, BPF_LINK_TYPE_TC, &cls_bpf_link_ops,
+		      filter);
+
+	ret = bpf_link_prime(&link->link, &primer);
+	if (ret < 0)
+		goto err_link;
+
+	/* We don't init exts to save on memory, but we still need to store the
+	 * net_ns pointer, as during delete whether the deletion work will be
+	 * queued or executed inline depends on the refcount of net_ns. In
+	 * __cls_bpf_delete the reference is taken to keep the action IDR alive
+	 * (which we don't require), but its maybe_get_net also allows us to
+	 * detect whether we are being invoked in netns destruction path or not.
+	 * In the former case deletion will have to be done synchronously.
+	 *
+	 * Leaving it NULL would prevent us from doing deletion work
+	 * asynchronously, so set it here.
+	 *
+	 * On the tcf_classify side, exts->actions are not touched for
+	 * exts_integrated progs, so we should be good.
+	 */
+	prog->exts.net = net;
+
+	ret = __cls_bpf_alloc_idr(head, handle, prog, oldprog);
+	if (ret < 0)
+		goto err_primer;
+
+	prog->exts_integrated = true;
+	prog->bpf_link = link;
+	prog->filter = filter;
+	prog->tp = tp;
+	link->prog = prog;
+
+	prog->bpf_name = cls_bpf_link_name(filter->aux->id, filter->aux->name);
+	if (!prog->bpf_name) {
+		ret = -ENOMEM;
+		goto err_idr;
+	}
+
+	ret = __cls_bpf_change(head, tp, prog, oldprog, NULL);
+	if (ret < 0)
+		goto err_name;
+
+	bpf_prog_inc(filter);
+
+	if (filter->dst_needed)
+		tcf_block_netif_keep_dst(tp->chain->block);
+
+	return bpf_link_settle(&primer);
+
+err_name:
+	kfree(prog->bpf_name);
+err_idr:
+	idr_remove(&head->handle_idr, prog->handle);
+err_primer:
+	bpf_link_cleanup(&primer);
+err_link:
+	kfree(link);
+err_prog:
+	kfree(prog);
+	return ret;
+}
+
 static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
 	.kind		=	"bpf",
 	.owner		=	THIS_MODULE,
@@ -729,6 +973,7 @@ static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
 	.reoffload	=	cls_bpf_reoffload,
 	.dump		=	cls_bpf_dump,
 	.bind_class	=	cls_bpf_bind_class,
+	.bpf_link_change =	cls_bpf_link_change,
 };
 
 static int __init cls_bpf_init_mod(void)
-- 
2.31.1



* [PATCH RFC bpf-next 4/7] net: sched: add lightweight update path for cls_bpf
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
                   ` (2 preceding siblings ...)
  2021-05-28 19:59 ` [PATCH RFC bpf-next 3/7] net: sched: add bpf_link API for bpf classifier Kumar Kartikeya Dwivedi
@ 2021-05-28 19:59 ` Kumar Kartikeya Dwivedi
  2021-05-28 19:59 ` [PATCH RFC bpf-next 5/7] tools: bpf.h: sync with kernel sources Kumar Kartikeya Dwivedi
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-05-28 19:59 UTC (permalink / raw)
  To: bpf
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev

This is used by BPF_LINK_UPDATE to replace the attached SCHED_CLS bpf prog,
effectively changing the classifier implementation for a given filter
owned by a bpf_link.

Note that READ_ONCE suffices in this case as the ordering for loads from
the filter is implicitly provided by the data dependency on the BPF prog
pointer.

On the writer side we can just use a relaxed WRITE_ONCE store to make
sure one or the other value is visible to a reader in cls_bpf_classify.
Lifetime is managed using RCU, so the bpf_prog_put path waits until
readers are done with old_prog.

All other parties accessing the BPF prog are under RTNL protection, so
no changes are needed there.
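
In short, the reader/writer pairing introduced here is:

  /* reader, inside an RCU read section in cls_bpf_classify(): */
  filter_res = BPF_PROG_RUN(READ_ONCE(prog->filter), skb);

  /* writer, under rtnl_lock() in cls_bpf_link_update(): */
  WRITE_ONCE(prog->filter, new_prog);
  bpf_prog_put(old_prog);  /* prog freeing waits for RCU readers */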

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 net/sched/cls_bpf.c | 55 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 53 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 57d6dedb389a..459a139bcfbe 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -9,6 +9,7 @@
  * (C) 2013 Daniel Borkmann <dborkman@redhat.com>
  */
 
+#include <linux/atomic.h>
 #include <linux/module.h>
 #include <linux/types.h>
 #include <linux/skbuff.h>
@@ -104,11 +105,11 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 			/* It is safe to push/pull even if skb_shared() */
 			__skb_push(skb, skb->mac_len);
 			bpf_compute_data_pointers(skb);
-			filter_res = BPF_PROG_RUN(prog->filter, skb);
+			filter_res = BPF_PROG_RUN(READ_ONCE(prog->filter), skb);
 			__skb_pull(skb, skb->mac_len);
 		} else {
 			bpf_compute_data_pointers(skb);
-			filter_res = BPF_PROG_RUN(prog->filter, skb);
+			filter_res = BPF_PROG_RUN(READ_ONCE(prog->filter), skb);
 		}
 
 		if (prog->exts_integrated) {
@@ -775,6 +776,55 @@ static int cls_bpf_link_detach(struct bpf_link *link)
 	return 0;
 }
 
+static int cls_bpf_link_update(struct bpf_link *link, struct bpf_prog *new_prog,
+			       struct bpf_prog *old_prog)
+{
+	struct cls_bpf_link *cls_link;
+	struct cls_bpf_prog cls_prog;
+	struct cls_bpf_prog *prog;
+	int ret;
+
+	rtnl_lock();
+
+	cls_link = container_of(link, struct cls_bpf_link, link);
+	if (!cls_link->prog) {
+		ret = -ENOLINK;
+		goto out;
+	}
+
+	prog = cls_link->prog;
+
+	/* BPF_F_REPLACEing? */
+	if (old_prog && prog->filter != old_prog) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	old_prog = prog->filter;
+
+	if (new_prog == old_prog) {
+		ret = 0;
+		goto out;
+	}
+
+	cls_prog = *prog;
+	cls_prog.filter = new_prog;
+
+	ret = cls_bpf_offload(prog->tp, &cls_prog, prog, NULL);
+	if (ret < 0)
+		goto out;
+
+	WRITE_ONCE(prog->filter, new_prog);
+
+	bpf_prog_inc(new_prog);
+	/* release our reference */
+	bpf_prog_put(old_prog);
+
+out:
+	rtnl_unlock();
+	return ret;
+}
+
 static void __bpf_fill_link_info(struct cls_bpf_link *link,
 				 struct bpf_link_info *info)
 {
@@ -859,6 +909,7 @@ static const struct bpf_link_ops cls_bpf_link_ops = {
 	.show_fdinfo = cls_bpf_link_show_fdinfo,
 #endif
 	.fill_link_info = cls_bpf_link_fill_link_info,
+	.update_prog = cls_bpf_link_update,
 };
 
 static inline char *cls_bpf_link_name(u32 prog_id, const char *name)
-- 
2.31.1



* [PATCH RFC bpf-next 5/7] tools: bpf.h: sync with kernel sources
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
                   ` (3 preceding siblings ...)
  2021-05-28 19:59 ` [PATCH RFC bpf-next 4/7] net: sched: add lightweight update path for cls_bpf Kumar Kartikeya Dwivedi
@ 2021-05-28 19:59 ` Kumar Kartikeya Dwivedi
  2021-05-28 19:59 ` [PATCH RFC bpf-next 6/7] libbpf: add bpf_link based TC-BPF management API Kumar Kartikeya Dwivedi
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-05-28 19:59 UTC (permalink / raw)
  To: bpf
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev

This will be used to expose bpf_link based libbpf API to users.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 tools/include/uapi/linux/bpf.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2c1ba70abbf1..a3488463d145 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -994,6 +994,7 @@ enum bpf_attach_type {
 	BPF_SK_LOOKUP,
 	BPF_XDP,
 	BPF_SK_SKB_VERDICT,
+	BPF_TC,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1007,6 +1008,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_ITER = 4,
 	BPF_LINK_TYPE_NETNS = 5,
 	BPF_LINK_TYPE_XDP = 6,
+	BPF_LINK_TYPE_TC = 7,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -1447,6 +1449,12 @@ union bpf_attr {
 				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
 				__u32		iter_info_len;	/* iter_info length */
 			};
+			struct { /* used by BPF_TC */
+				__u32 parent;
+				__u32 handle;
+				__u32 gen_flags;
+				__u16 priority;
+			} tc;
 		};
 	} link_create;
 
@@ -5519,6 +5527,13 @@ struct bpf_link_info {
 		struct {
 			__u32 ifindex;
 		} xdp;
+		struct {
+			__u32 ifindex;
+			__u32 parent;
+			__u32 handle;
+			__u32 gen_flags;
+			__u16 priority;
+		} tc;
 	};
 } __attribute__((aligned(8)));
 
-- 
2.31.1



* [PATCH RFC bpf-next 6/7] libbpf: add bpf_link based TC-BPF management API
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
                   ` (4 preceding siblings ...)
  2021-05-28 19:59 ` [PATCH RFC bpf-next 5/7] tools: bpf.h: sync with kernel sources Kumar Kartikeya Dwivedi
@ 2021-05-28 19:59 ` Kumar Kartikeya Dwivedi
  2021-06-02 21:03   ` Andrii Nakryiko
  2021-05-28 19:59 ` [PATCH RFC bpf-next 7/7] libbpf: add selftest for " Kumar Kartikeya Dwivedi
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-05-28 19:59 UTC (permalink / raw)
  To: bpf
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev

This adds a userspace TC-BPF management API based on bpf_link.
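
A minimal usage sketch (assuming the program is already loaded and
ifindex refers to a device with a clsact qdisc already set up):

  DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = ifindex,
                      .attach_point = BPF_TC_INGRESS);
  DECLARE_LIBBPF_OPTS(bpf_tc_link_opts, opts, .handle = 1, .priority = 1);
  struct bpf_link *link;

  link = bpf_program__attach_tc(prog, &hook, &opts);
  if (libbpf_get_error(link))
          return -1;

  /* ... */

  bpf_link__destroy(link);  /* detaches the filter */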

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 tools/lib/bpf/bpf.c      |  5 ++++
 tools/lib/bpf/bpf.h      |  8 +++++-
 tools/lib/bpf/libbpf.c   | 59 ++++++++++++++++++++++++++++++++++++++--
 tools/lib/bpf/libbpf.h   | 17 ++++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 tools/lib/bpf/netlink.c  |  5 ++--
 tools/lib/bpf/netlink.h  |  8 ++++++
 7 files changed, 98 insertions(+), 5 deletions(-)
 create mode 100644 tools/lib/bpf/netlink.h

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 86dcac44f32f..ab2e2e9ccc5e 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -28,6 +28,7 @@
 #include <asm/unistd.h>
 #include <errno.h>
 #include <linux/bpf.h>
+#include <arpa/inet.h>
 #include "bpf.h"
 #include "libbpf.h"
 #include "libbpf_internal.h"
@@ -692,6 +693,10 @@ int bpf_link_create(int prog_fd, int target_fd,
 	attr.link_create.target_fd = target_fd;
 	attr.link_create.attach_type = attach_type;
 	attr.link_create.flags = OPTS_GET(opts, flags, 0);
+	attr.link_create.tc.parent = OPTS_GET(opts, tc.parent, 0);
+	attr.link_create.tc.handle = OPTS_GET(opts, tc.handle, 0);
+	attr.link_create.tc.priority = OPTS_GET(opts, tc.priority, 0);
+	attr.link_create.tc.gen_flags = OPTS_GET(opts, tc.gen_flags, 0);
 
 	if (iter_info_len) {
 		attr.link_create.iter_info =
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 4f758f8f50cd..f2178309e9ea 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -177,8 +177,14 @@ struct bpf_link_create_opts {
 	union bpf_iter_link_info *iter_info;
 	__u32 iter_info_len;
 	__u32 target_btf_id;
+	struct {
+		__u32 parent;
+		__u32 handle;
+		__u32 priority;
+		__u32 gen_flags;
+	} tc;
 };
-#define bpf_link_create_opts__last_field target_btf_id
+#define bpf_link_create_opts__last_field tc.gen_flags
 
 LIBBPF_API int bpf_link_create(int prog_fd, int target_fd,
 			       enum bpf_attach_type attach_type,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1c4e20e75237..7809536980b1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -55,6 +55,7 @@
 #include "libbpf_internal.h"
 #include "hashmap.h"
 #include "bpf_gen_internal.h"
+#include "netlink.h"
 
 #ifndef BPF_FS_MAGIC
 #define BPF_FS_MAGIC		0xcafe4a11
@@ -7185,7 +7186,7 @@ static int bpf_object__collect_relos(struct bpf_object *obj)
 
 	for (i = 0; i < obj->nr_programs; i++) {
 		struct bpf_program *p = &obj->programs[i];
-		
+
 		if (!p->nr_reloc)
 			continue;
 
@@ -10005,7 +10006,7 @@ struct bpf_link {
 int bpf_link__update_program(struct bpf_link *link, struct bpf_program *prog)
 {
 	int ret;
-	
+
 	ret = bpf_link_update(bpf_link__fd(link), bpf_program__fd(prog), NULL);
 	return libbpf_err_errno(ret);
 }
@@ -10613,6 +10614,60 @@ struct bpf_link *bpf_program__attach_xdp(struct bpf_program *prog, int ifindex)
 	return bpf_program__attach_fd(prog, ifindex, 0, "xdp");
 }
 
+struct bpf_link *bpf_program__attach_tc(struct bpf_program *prog,
+					const struct bpf_tc_hook *hook,
+					const struct bpf_tc_link_opts *opts)
+{
+	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, lopts, 0);
+	char errmsg[STRERR_BUFSIZE];
+	int prog_fd, link_fd, ret;
+	struct bpf_link *link;
+	__u32 parent;
+
+	if (!hook || !OPTS_VALID(hook, bpf_tc_hook) ||
+	    !OPTS_VALID(opts, bpf_tc_link_opts))
+		return ERR_PTR(-EINVAL);
+
+	if (OPTS_GET(hook, ifindex, 0) <= 0 ||
+	    OPTS_GET(opts, priority, 0) > UINT16_MAX)
+		return ERR_PTR(-EINVAL);
+
+	parent = OPTS_GET(hook, parent, 0);
+
+	ret = tc_get_tcm_parent(OPTS_GET(hook, attach_point, 0),
+				&parent);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	lopts.tc.parent = parent;
+	lopts.tc.handle = OPTS_GET(opts, handle, 0);
+	lopts.tc.priority = OPTS_GET(opts, priority, 0);
+	lopts.tc.gen_flags = OPTS_GET(opts, gen_flags, 0);
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		pr_warn("prog '%s': can't attach before loaded\n", prog->name);
+		return ERR_PTR(-EINVAL);
+	}
+
+	link = calloc(1, sizeof(*link));
+	if (!link)
+		return ERR_PTR(-ENOMEM);
+	link->detach = &bpf_link__detach_fd;
+
+	link_fd = bpf_link_create(prog_fd, OPTS_GET(hook, ifindex, 0), BPF_TC, &lopts);
+	if (link_fd < 0) {
+		link_fd = -errno;
+		free(link);
+		pr_warn("prog '%s': failed to attach tc filter: %s\n",
+			prog->name, libbpf_strerror_r(link_fd, errmsg, sizeof(errmsg)));
+		return ERR_PTR(link_fd);
+	}
+	link->fd = link_fd;
+
+	return link;
+}
+
 struct bpf_link *bpf_program__attach_freplace(struct bpf_program *prog,
 					      int target_fd,
 					      const char *attach_func_name)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 6e61342ba56c..284a446c6513 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -282,6 +282,23 @@ LIBBPF_API struct bpf_link *
 bpf_program__attach_iter(struct bpf_program *prog,
 			 const struct bpf_iter_attach_opts *opts);
 
+/* TC bpf_link related API */
+struct bpf_tc_hook;
+
+struct bpf_tc_link_opts {
+	size_t sz;
+	__u32 handle;
+	__u32 priority;
+	__u32 gen_flags;
+	size_t :0;
+};
+#define bpf_tc_link_opts__last_field gen_flags
+
+LIBBPF_API struct bpf_link *
+bpf_program__attach_tc(struct bpf_program *prog,
+		       const struct bpf_tc_hook *hook,
+		       const struct bpf_tc_link_opts *opts);
+
 struct bpf_insn;
 
 /*
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index bbe99b1db1a9..ef570a840aca 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -373,5 +373,6 @@ LIBBPF_0.4.0 {
 
 LIBBPF_0.5.0 {
 	global:
+		bpf_program__attach_tc;
 		libbpf_set_strict_mode;
 } LIBBPF_0.4.0;
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index d743c8721aa7..b7ac36fc9c1a 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -17,6 +17,7 @@
 #include "libbpf.h"
 #include "libbpf_internal.h"
 #include "nlattr.h"
+#include "netlink.h"
 
 #ifndef SOL_NETLINK
 #define SOL_NETLINK 270
@@ -405,8 +406,8 @@ static int attach_point_to_config(struct bpf_tc_hook *hook,
 	}
 }
 
-static int tc_get_tcm_parent(enum bpf_tc_attach_point attach_point,
-			     __u32 *parent)
+int tc_get_tcm_parent(enum bpf_tc_attach_point attach_point,
+		      __u32 *parent)
 {
 	switch (attach_point) {
 	case BPF_TC_INGRESS:
diff --git a/tools/lib/bpf/netlink.h b/tools/lib/bpf/netlink.h
new file mode 100644
index 000000000000..c89133d56eb4
--- /dev/null
+++ b/tools/lib/bpf/netlink.h
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+#pragma once
+
+#include <linux/types.h>
+#include "libbpf.h"
+
+int tc_get_tcm_parent(enum bpf_tc_attach_point attach_point,
+		      __u32 *parent);
-- 
2.31.1



* [PATCH RFC bpf-next 7/7] libbpf: add selftest for bpf_link based TC-BPF management API
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
                   ` (5 preceding siblings ...)
  2021-05-28 19:59 ` [PATCH RFC bpf-next 6/7] libbpf: add bpf_link based TC-BPF management API Kumar Kartikeya Dwivedi
@ 2021-05-28 19:59 ` Kumar Kartikeya Dwivedi
  2021-06-02 21:09 ` [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Andrii Nakryiko
  2021-06-06 23:37 ` Cong Wang
  8 siblings, 0 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-05-28 19:59 UTC (permalink / raw)
  To: bpf
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev

This covers basic attach/detach/update, and tests interaction with the
netlink API. It also exercises the bpf_link_info and fdinfo codepaths.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/tc_bpf_link.c    | 285 ++++++++++++++++++
 1 file changed, 285 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c b/tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c
new file mode 100644
index 000000000000..beaf06e0557c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c
@@ -0,0 +1,285 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <linux/pkt_cls.h>
+
+#include "test_tc_bpf.skel.h"
+
+#define LO_IFINDEX 1
+
+static int test_tc_bpf_link_basic(struct bpf_tc_hook *hook,
+				  struct bpf_program *prog)
+{
+	DECLARE_LIBBPF_OPTS(bpf_tc_link_opts, opts, .handle = 1, .priority = 1);
+	DECLARE_LIBBPF_OPTS(bpf_tc_opts, qopts, .handle = 1, .priority = 1);
+	struct bpf_prog_info info = {};
+	__u32 info_len = sizeof(info);
+	struct bpf_link *link, *invl;
+	int ret;
+
+	link = bpf_program__attach_tc(prog, hook, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_tc"))
+		return PTR_ERR(link);
+
+	ret = bpf_obj_get_info_by_fd(bpf_program__fd(prog), &info, &info_len);
+	if (!ASSERT_OK(ret, "bpf_obj_get_info_by_fd"))
+		goto end;
+
+	ret = bpf_tc_query(hook, &qopts);
+	if (!ASSERT_OK(ret, "bpf_tc_query"))
+		goto end;
+
+	if (!ASSERT_EQ(qopts.prog_id, info.id, "prog_id match"))
+		goto end;
+
+	opts.gen_flags = ~0u;
+	invl = bpf_program__attach_tc(prog, hook, &opts);
+	if (!ASSERT_ERR_PTR(invl, "bpf_program__attach_tc with invalid flags")) {
+		bpf_link__destroy(invl);
+		ret = -EINVAL;
+	}
+
+end:
+	bpf_link__destroy(link);
+	return ret;
+}
+
+static int test_tc_bpf_link_netlink_interaction(struct bpf_tc_hook *hook,
+						struct bpf_program *prog)
+{
+	DECLARE_LIBBPF_OPTS(bpf_link_update_opts, lopts,
+			    .old_prog_fd = bpf_program__fd(prog));
+	DECLARE_LIBBPF_OPTS(bpf_tc_link_opts, opts, .handle = 1, .priority = 1);
+	DECLARE_LIBBPF_OPTS(bpf_tc_opts, nopts, .handle = 1, .priority = 1);
+	DECLARE_LIBBPF_OPTS(bpf_tc_opts, dopts, .handle = 1, .priority = 1);
+	struct bpf_link *link;
+	int ret;
+
+	/* We need to test the following cases:
+	 *	1. BPF link owned filter cannot be replaced by netlink
+	 *	2. Netlink owned filter cannot be replaced by BPF link
+	 *	3. Netlink cannot do targeted delete of BPF link owned filter
+	 *	4. Filter is actually deleted (with chain cleanup)
+	 *	   We actually (ab)use the kernel behavior of returning EINVAL when
+	 *	   target chain doesn't exist on tc_get_tfilter (which maps to
+	 *	   bpf_tc_query) here, to know if the chain was really cleaned
+	 *	   up on tcf_proto destruction. Our setup is so that there is
+	 *	   only one reference to the chain.
+	 *
+	 *	   So on query, chain ? (filter ?: ENOENT) : EINVAL
+	 */
+
+	link = bpf_program__attach_tc(prog, hook, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_tc"))
+		return PTR_ERR(link);
+
+	nopts.prog_fd = bpf_program__fd(prog);
+	ret = bpf_tc_attach(hook, &nopts);
+	if (!ASSERT_EQ(ret, -EEXIST, "bpf_tc_attach without replace"))
+		goto end;
+
+	nopts.flags = BPF_TC_F_REPLACE;
+	ret = bpf_tc_attach(hook, &nopts);
+	if (!ASSERT_EQ(ret, -EPERM, "bpf_tc_attach with replace"))
+		goto end;
+
+	ret = bpf_tc_detach(hook, &dopts);
+	if (!ASSERT_EQ(ret, -EPERM, "bpf_tc_detach"))
+		goto end;
+
+	lopts.flags = BPF_F_REPLACE;
+	ret = bpf_link_update(bpf_link__fd(link), bpf_program__fd(prog),
+			      &lopts);
+	ASSERT_OK(ret, "bpf_link_update");
+	ret = ret < 0 ? -errno : ret;
+
+end:
+	bpf_link__destroy(link);
+	if (!ret && !ASSERT_EQ(bpf_tc_query(hook, &dopts), -EINVAL,
+			       "chain empty delete"))
+		ret = -EINVAL;
+	return ret;
+}
+
+static int test_tc_bpf_link_update_ways(struct bpf_tc_hook *hook,
+					struct bpf_program *prog)
+{
+	DECLARE_LIBBPF_OPTS(bpf_tc_link_opts, opts, .handle = 1, .priority = 1);
+	DECLARE_LIBBPF_OPTS(bpf_link_update_opts, uopts, 0);
+	struct test_tc_bpf *skel;
+	struct bpf_link *link;
+	int ret;
+
+	skel = test_tc_bpf__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_tc_bpf__open_and_load"))
+		return PTR_ERR(skel);
+
+	link = bpf_program__attach_tc(prog, hook, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_tc")) {
+		ret = PTR_ERR(link);
+		goto end;
+	}
+
+	ret = bpf_link_update(bpf_link__fd(link), bpf_program__fd(prog),
+			      &uopts);
+	if (!ASSERT_OK(ret, "bpf_link_update no old prog"))
+		goto end;
+
+	uopts.old_prog_fd = bpf_program__fd(prog);
+	ret = bpf_link_update(bpf_link__fd(link), bpf_program__fd(prog),
+			      &uopts);
+	if (!ASSERT_TRUE(ret < 0 && errno == EINVAL,
+			 "bpf_link_update with old prog without BPF_F_REPLACE")) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	uopts.flags = BPF_F_REPLACE;
+	ret = bpf_link_update(bpf_link__fd(link), bpf_program__fd(prog),
+			      &uopts);
+	if (!ASSERT_OK(ret, "bpf_link_update with old prog with BPF_F_REPLACE"))
+		goto end;
+
+	uopts.old_prog_fd = bpf_program__fd(skel->progs.cls);
+	ret = bpf_link_update(bpf_link__fd(link), bpf_program__fd(prog),
+			      &uopts);
+	if (!ASSERT_TRUE(ret < 0 && errno == EINVAL,
+			 "bpf_link_update with wrong old prog")) {
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+
+end:
+	test_tc_bpf__destroy(skel);
+	return ret;
+}
+
+static int test_tc_bpf_link_info_api(struct bpf_tc_hook *hook,
+				     struct bpf_program *prog)
+{
+	DECLARE_LIBBPF_OPTS(bpf_tc_link_opts, opts, .handle = 1, .priority = 1);
+	__u32 ifindex, parent, handle, gen_flags, priority;
+	char buf[4096], path[256], *begin;
+	struct bpf_link_info info = {};
+	__u32 info_len = sizeof(info);
+	struct bpf_link *link;
+	int ret, fdinfo;
+
+	link = bpf_program__attach_tc(prog, hook, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_tc"))
+		return PTR_ERR(link);
+
+	ret = bpf_obj_get_info_by_fd(bpf_link__fd(link), &info, &info_len);
+	if (!ASSERT_OK(ret, "bpf_obj_get_info_by_fd"))
+		goto end;
+
+	ret = snprintf(path, sizeof(path), "/proc/self/fdinfo/%d",
+		       bpf_link__fd(link));
+	if (!ASSERT_TRUE(!ret || ret < sizeof(path), "snprintf pathname"))
+		goto end;
+
+	fdinfo = open(path, O_RDONLY);
+	if (!ASSERT_GT(fdinfo, -1, "open fdinfo"))
+		goto end;
+
+	ret = read(fdinfo, buf, sizeof(buf));
+	if (!ASSERT_GT(ret, 0, "read fdinfo")) {
+		ret = -EINVAL;
+		goto end_file;
+	}
+
+	begin = strstr(buf, "ifindex");
+	if (!ASSERT_OK_PTR(begin, "find beginning of fdinfo info")) {
+		ret = -EINVAL;
+		goto end_file;
+	}
+
+	ret = sscanf(begin, "ifindex:\t%u\n"
+			    "parent:\t%u\n"
+			    "handle:\t%u\n"
+			    "priority:\t%u\n"
+			    "gen_flags:\t%u\n",
+			    &ifindex, &parent, &handle, &priority, &gen_flags);
+	if (!ASSERT_EQ(ret, 5, "sscanf fdinfo")) {
+		ret = -EINVAL;
+		goto end_file;
+	}
+
+	ret = -EINVAL;
+
+#define X(a, b, c) (!ASSERT_EQ(a, b, #a " == " #b) || !ASSERT_EQ(b, c, #b " == " #c))
+	if (X(info.tc.ifindex, ifindex, 1) ||
+	    X(info.tc.parent, parent,
+	      TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_EGRESS)) ||
+	    X(info.tc.handle, handle, 1) ||
+	    X(info.tc.gen_flags, gen_flags, TCA_CLS_FLAGS_NOT_IN_HW) ||
+	    X(info.tc.priority, priority, 1))
+#undef X
+		goto end_file;
+
+	ret = 0;
+
+end_file:
+	close(fdinfo);
+end:
+	bpf_link__destroy(link);
+	return ret;
+}
+
+void test_tc_bpf_link(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
+			    .attach_point = BPF_TC_INGRESS);
+	struct test_tc_bpf *skel = NULL;
+	bool hook_created = false;
+	int ret;
+
+	skel = test_tc_bpf__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_tc_bpf__open_and_load"))
+		return;
+
+	ret = bpf_tc_hook_create(&hook);
+	if (ret == 0)
+		hook_created = true;
+
+	ret = ret == -EEXIST ? 0 : ret;
+	if (!ASSERT_OK(ret, "bpf_tc_hook_create(BPF_TC_INGRESS)"))
+		goto end;
+
+	ret = test_tc_bpf_link_basic(&hook, skel->progs.cls);
+	if (!ASSERT_OK(ret, "test_tc_bpf_link_basic"))
+		goto end;
+
+	bpf_tc_hook_destroy(&hook);
+
+	hook.attach_point = BPF_TC_EGRESS;
+	ret = test_tc_bpf_link_basic(&hook, skel->progs.cls);
+	if (!ASSERT_OK(ret, "test_tc_bpf_link_basic"))
+		goto end;
+
+	bpf_tc_hook_destroy(&hook);
+
+	ret = test_tc_bpf_link_netlink_interaction(&hook, skel->progs.cls);
+	if (!ASSERT_OK(ret, "test_tc_bpf_link_netlink_interaction"))
+		goto end;
+
+	bpf_tc_hook_destroy(&hook);
+
+	ret = test_tc_bpf_link_update_ways(&hook, skel->progs.cls);
+	if (!ASSERT_OK(ret, "test_tc_bpf_link_update_ways"))
+		goto end;
+
+	bpf_tc_hook_destroy(&hook);
+
+	ret = test_tc_bpf_link_info_api(&hook, skel->progs.cls);
+	if (!ASSERT_OK(ret, "test_tc_bpf_link_info_api"))
+		goto end;
+
+end:
+	if (hook_created) {
+		hook.attach_point = BPF_TC_INGRESS | BPF_TC_EGRESS;
+		bpf_tc_hook_destroy(&hook);
+	}
+	test_tc_bpf__destroy(skel);
+}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 3/7] net: sched: add bpf_link API for bpf classifier
  2021-05-28 19:59 ` [PATCH RFC bpf-next 3/7] net: sched: add bpf_link API for bpf classifier Kumar Kartikeya Dwivedi
@ 2021-06-02 20:56   ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2021-06-02 20:56 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Networking

On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> This commit introduces a bpf_link based kernel API for creating tc
> filters and using the cls_bpf classifier. Only a subset of what the
> netlink API offers is supported; things like TCA_BPF_POLICE, TCA_RATE,
> and embedded actions are unsupported.
>
> The kernel API and the libbpf wrapper added in a subsequent patch are
> more opinionated and mirror the semantics of low level netlink based
> TC-BPF API, i.e. always setting direct action mode, always setting
> protocol to ETH_P_ALL, and only exposing handle and priority as the
> variables the user can control. We add an additional gen_flags parameter
> though to allow for offloading use cases. It would be trivial to extend
> the current API to support specifying other attributes in the future,
> but for now I'm sticking to how we want to push usage.
>
> The semantics around bpf_link support are as follows:
>
> A user can create a classifier attached to a filter using the bpf_link
> API, after which changing it and deleting it only happens through the
> bpf_link API. It is not possible to bind the bpf_link to an existing
> filter, and any such attempt will fail with EEXIST. Hence EEXIST can
> be returned in two cases: when an existing bpf_link owned filter
> exists, or when an existing netlink owned filter exists.
>
> Removing bpf_link owned filter from netlink returns EPERM, denoting that
> netlink is locked out from filter manipulation when bpf_link is
> involved.
>
> Whenever a filter is detached due to chain removal, or qdisc tear down,
> or net_device shutdown, the bpf_link becomes automatically detached.
>
> In this way, the netlink API and bpf_link creation path are exclusive
> and don't stomp over one another. Filters created using bpf_link API
> cannot be replaced by netlink API, and filters created by netlink API are
> never replaced by bpf_link. Netlink also cannot detach bpf_link filters.
>
> We serialize all changes over rtnl_lock, as the cls_bpf API doesn't support the
> unlocked classifier API.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf_types.h |   3 +
>  include/net/pkt_cls.h     |  13 ++
>  include/net/sch_generic.h |   6 +-
>  include/uapi/linux/bpf.h  |  15 +++
>  kernel/bpf/syscall.c      |  10 +-
>  net/sched/cls_api.c       | 138 ++++++++++++++++++++-
>  net/sched/cls_bpf.c       | 247 +++++++++++++++++++++++++++++++++++++-
>  7 files changed, 426 insertions(+), 6 deletions(-)
>

[...]

> +static int cls_bpf_link_change(struct net *net, struct tcf_proto *tp,
> +                              struct bpf_prog *filter, void **arg,
> +                              u32 handle, u32 gen_flags)
> +{
> +       struct cls_bpf_head *head = rtnl_dereference(tp->root);
> +       struct cls_bpf_prog *oldprog = *arg, *prog;
> +       struct bpf_link_primer primer;
> +       struct cls_bpf_link *link;
> +       int ret;
> +
> +       if (gen_flags & ~CLS_BPF_SUPPORTED_GEN_FLAGS)
> +               return -EINVAL;
> +
> +       if (oldprog)
> +               return -EEXIST;
> +
> +       prog = kzalloc(sizeof(*prog), GFP_KERNEL);
> +       if (!prog)
> +               return -ENOMEM;
> +
> +       link = kzalloc(sizeof(*link), GFP_KERNEL);
> +       if (!link) {
> +               ret = -ENOMEM;
> +               goto err_prog;
> +       }
> +
> +       bpf_link_init(&link->link, BPF_LINK_TYPE_TC, &cls_bpf_link_ops,
> +                     filter);
> +
> +       ret = bpf_link_prime(&link->link, &primer);
> +       if (ret < 0)
> +               goto err_link;
> +
> +       /* We don't init exts to save on memory, but we still need to store the
> +        * net_ns pointer, as during delete whether the deletion work will be
> +        * queued or executed inline depends on the refcount of net_ns. In
> +        * __cls_bpf_delete the reference is taken to keep the action IDR alive
> +        * (which we don't require), but its maybe_get_net also allows us to
> +        * detect whether we are being invoked in netns destruction path or not.
> +        * In the former case deletion will have to be done synchronously.
> +        *
> +        * Leaving it NULL would prevent us from doing deletion work
> +        * asynchronously, so set it here.
> +        *
> +        * On the tcf_classify side, exts->actions are not touched for
> +        * exts_integrated progs, so we should be good.
> +        */
> +       prog->exts.net = net;
> +
> +       ret = __cls_bpf_alloc_idr(head, handle, prog, oldprog);
> +       if (ret < 0)
> +               goto err_primer;
> +
> +       prog->exts_integrated = true;
> +       prog->bpf_link = link;
> +       prog->filter = filter;
> +       prog->tp = tp;
> +       link->prog = prog;
> +
> +       prog->bpf_name = cls_bpf_link_name(filter->aux->id, filter->aux->name);
> +       if (!prog->bpf_name) {
> +               ret = -ENOMEM;
> +               goto err_idr;
> +       }
> +
> +       ret = __cls_bpf_change(head, tp, prog, oldprog, NULL);
> +       if (ret < 0)
> +               goto err_name;
> +
> +       bpf_prog_inc(filter);
> +
> +       if (filter->dst_needed)
> +               tcf_block_netif_keep_dst(tp->chain->block);
> +
> +       return bpf_link_settle(&primer);
> +
> +err_name:
> +       kfree(prog->bpf_name);
> +err_idr:
> +       idr_remove(&head->handle_idr, prog->handle);
> +err_primer:
> +       bpf_link_cleanup(&primer);

once you prime the link, you can't kfree() it; you should only call
bpf_link_cleanup(), which will eventually handle freeing it. If you
look at other places doing bpf_link, they set link = NULL after
bpf_link_cleanup() to avoid freeing it directly.
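
I.e., roughly something like this (just a sketch against this hunk,
not tested):

	err_name:
		kfree(prog->bpf_name);
	err_idr:
		idr_remove(&head->handle_idr, prog->handle);
	err_primer:
		bpf_link_cleanup(&primer);
		/* bpf_link_cleanup() eventually frees the primed link
		 * through its release path, so forget our pointer
		 * instead of falling through to the direct kfree().
		 */
		link = NULL;
	err_link:
		kfree(link); /* no-op once the link has been primed */
	err_prog:
		kfree(prog);
		return ret;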

> +err_link:
> +       kfree(link);
> +err_prog:
> +       kfree(prog);
> +       return ret;
> +}
> +
>  static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
>         .kind           =       "bpf",
>         .owner          =       THIS_MODULE,
> @@ -729,6 +973,7 @@ static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
>         .reoffload      =       cls_bpf_reoffload,
>         .dump           =       cls_bpf_dump,
>         .bind_class     =       cls_bpf_bind_class,
> +       .bpf_link_change =      cls_bpf_link_change,
>  };
>
>  static int __init cls_bpf_init_mod(void)
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 6/7] libbpf: add bpf_link based TC-BPF management API
  2021-05-28 19:59 ` [PATCH RFC bpf-next 6/7] libbpf: add bpf_link based TC-BPF management API Kumar Kartikeya Dwivedi
@ 2021-06-02 21:03   ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2021-06-02 21:03 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Networking

On Fri, May 28, 2021 at 1:01 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> This adds userspace TC-BPF management API based on bpf_link.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  tools/lib/bpf/bpf.c      |  5 ++++
>  tools/lib/bpf/bpf.h      |  8 +++++-
>  tools/lib/bpf/libbpf.c   | 59 ++++++++++++++++++++++++++++++++++++++--
>  tools/lib/bpf/libbpf.h   | 17 ++++++++++++
>  tools/lib/bpf/libbpf.map |  1 +
>  tools/lib/bpf/netlink.c  |  5 ++--
>  tools/lib/bpf/netlink.h  |  8 ++++++
>  7 files changed, 98 insertions(+), 5 deletions(-)
>  create mode 100644 tools/lib/bpf/netlink.h
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 86dcac44f32f..ab2e2e9ccc5e 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -28,6 +28,7 @@
>  #include <asm/unistd.h>
>  #include <errno.h>
>  #include <linux/bpf.h>
> +#include <arpa/inet.h>
>  #include "bpf.h"
>  #include "libbpf.h"
>  #include "libbpf_internal.h"
> @@ -692,6 +693,10 @@ int bpf_link_create(int prog_fd, int target_fd,
>         attr.link_create.target_fd = target_fd;
>         attr.link_create.attach_type = attach_type;
>         attr.link_create.flags = OPTS_GET(opts, flags, 0);
> +       attr.link_create.tc.parent = OPTS_GET(opts, tc.parent, 0);
> +       attr.link_create.tc.handle = OPTS_GET(opts, tc.handle, 0);
> +       attr.link_create.tc.priority = OPTS_GET(opts, tc.priority, 0);
> +       attr.link_create.tc.gen_flags = OPTS_GET(opts, tc.gen_flags, 0);

you can't fill out link_create.tc unless you know it's a TC bpf_link
that is being created, right? link_create.tc is part of a union, so you
might be overwriting bpf_iter link data, for instance.
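
I.e., only set those fields under the right attach type, something like
this sketch (the attach type name is purely illustrative, I didn't
cross-check it against the patch):

	/* link_create.tc is a union member, so only write it when we
	 * know a TC link is being created; otherwise we'd clobber other
	 * link types' data (e.g. bpf_iter's iter_info)
	 */
	if (attach_type == BPF_TC /* illustrative name */) {
		attr.link_create.tc.parent = OPTS_GET(opts, tc.parent, 0);
		attr.link_create.tc.handle = OPTS_GET(opts, tc.handle, 0);
		attr.link_create.tc.priority = OPTS_GET(opts, tc.priority, 0);
		attr.link_create.tc.gen_flags = OPTS_GET(opts, tc.gen_flags, 0);
	}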


>
>         if (iter_info_len) {
>                 attr.link_create.iter_info =

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
                   ` (6 preceding siblings ...)
  2021-05-28 19:59 ` [PATCH RFC bpf-next 7/7] libbpf: add selftest for " Kumar Kartikeya Dwivedi
@ 2021-06-02 21:09 ` Andrii Nakryiko
  2021-06-02 21:45   ` Kumar Kartikeya Dwivedi
  2021-06-06 23:37 ` Cong Wang
  8 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2021-06-02 21:09 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Networking

On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> This is the first RFC version.
>
> This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> introduces fd based ownership for such TC filters. Netlink cannot delete or
> replace such filters, but the bpf_link is severed on indirect destruction of the
> filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> that filters remain attached beyond process lifetime, the usual bpf_link fd
> pinning approach can be used.
>
> The individual patches contain more details and comments, but the overall kernel
> API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> merged recently. This means that we start by always setting direct action mode,
> protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> options in the future, they can be easily exposed through the bpf_link API in
> the future.
>
> Patch 1 refactors cls_bpf change function to extract two helpers that will be
> reused in bpf_link creation.
>
> Patch 2 exports some bpf_link management functions to modules. This is needed
> because our bpf_link object is tied to the cls_bpf_prog object. Tying it to
> tcf_proto would be weird, because the update path has to replace offloaded bpf
> prog, which happens using internal cls_bpf helpers, and would in general be more
> code to abstract over an operation that is unlikely to be implemented for other
> filter types.
>
> Patch 3 adds the main bpf_link API. A function in cls_api takes care of
> obtaining block reference, creating the filter object, and then calls the
> bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after
> setting up the internal structures. An optimization is made to not keep around
> resources for extended actions, which is explained in a code comment as it wasn't
> immediately obvious.
>
> Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports
> replacing the bpf_prog, we can skip tc filter's change path by reusing the
> filter object but swapping its bpf_prog. This takes care of replacing the
> offloaded prog as well (if that fails, update is aborted). So far however,
> tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter
> would never be modified concurrently. This is no longer true, and to not
> penalize the classify hot path, we also cannot impose serialization around
> its load. Hence the load is changed to READ_ONCE, so that the pointer value is
> always consistent. Due to invocation in a RCU critical section, the lifetime of
> the prog is guaranteed for the duration of the call.
>
> Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning
> function to libbpf.
>
> Patch 7 adds a selftest that exercises all possible problematic interactions
> that I could think of.
>
> Design:
>
> This is where in the object hierarchy our bpf_link object is attached.
>
>                                                                             ┌─────┐
>                                                                             │     │
>                                                                             │ BPF │
>                                                                             program
>                                                                             │     │
>                                                                             └──▲──┘
>                                                       ┌───────┐                │
>                                                       │       │         ┌──────┴───────┐
>                                                       │  mod  ├─────────► cls_bpf_prog │
> ┌────────────────┐                                    │cls_bpf│         └────┬───▲─────┘
> │    tcf_block   │                                    │       │              │   │
> └────────┬───────┘                                    └───▲───┘              │   │
>          │          ┌─────────────┐                       │                ┌─▼───┴──┐
>          └──────────►  tcf_chain  │                       │                │bpf_link│
>                     └───────┬─────┘                       │                └────────┘
>                             │          ┌─────────────┐    │
>                             └──────────►  tcf_proto  ├────┘
>                                        └─────────────┘
>
> The bpf_link is detached on destruction of the cls_bpf_prog.  Doing it this way
> allows us to implement update in a lightweight manner without having to recreate
> a new filter, where we can just replace the BPF prog attached to cls_bpf_prog.
>
> The other way to do it would be to link the bpf_link to tcf_proto, there are
> numerous downsides to this:
>
> 1. All filters have to embed the pointer even though they won't be using it when
> cls_bpf is compiled in.
> 2. This probably won't make sense to be extended to other filter types anyway.
> 3. We aren't able to optimize the update case without adding another bpf_link
> specific update operation to tcf_proto ops.
>
> The downside with tying this to the module is having to export bpf_link
> management functions and introducing a tcf_proto op. Hopefully the cost of
> another operation func pointer is not big enough (as there is only one ops
> struct per module).
>
> This first version is to collect feedback on the approach and get ideas if there
> is a better way to do this.

Bpf_link-based TC API is a long time coming, so it's great to see
someone finally working on this. Thanks!

I briefly skimmed through the patch set and noticed a few generic
bpf_link problems. But I think the main feedback will come from Cilium
folks and others that rely heavily on TC APIs. I wonder if there is an
opportunity to simplify the API further, given we have a chance to
start fresh here. I don't think we are constrained to follow the legacy
TC API exactly.

The problem is that your patch set was marked as spam by Google, so I
suspect a bunch of folks haven't gotten it. I suggest re-sending it
again but trimming down the CC list, leaving only bpf@vger,
netdev@vger, and BPF maintainers CC'ed directly.

>
> Kumar Kartikeya Dwivedi (7):
>   net: sched: refactor cls_bpf creation code
>   bpf: export bpf_link functions for modules
>   net: sched: add bpf_link API for bpf classifier
>   net: sched: add lightweight update path for cls_bpf
>   tools: bpf.h: sync with kernel sources
>   libbpf: add bpf_link based TC-BPF management API
>   libbpf: add selftest for bpf_link based TC-BPF management API
>
>  include/linux/bpf_types.h                     |   3 +
>  include/net/pkt_cls.h                         |  13 +
>  include/net/sch_generic.h                     |   6 +-
>  include/uapi/linux/bpf.h                      |  15 +
>  kernel/bpf/syscall.c                          |  14 +-
>  net/sched/cls_api.c                           | 138 ++++++-
>  net/sched/cls_bpf.c                           | 386 ++++++++++++++++--
>  tools/include/uapi/linux/bpf.h                |  15 +
>  tools/lib/bpf/bpf.c                           |   5 +
>  tools/lib/bpf/bpf.h                           |   8 +-
>  tools/lib/bpf/libbpf.c                        |  59 ++-
>  tools/lib/bpf/libbpf.h                        |  17 +
>  tools/lib/bpf/libbpf.map                      |   1 +
>  tools/lib/bpf/netlink.c                       |   5 +-
>  tools/lib/bpf/netlink.h                       |   8 +
>  .../selftests/bpf/prog_tests/tc_bpf_link.c    | 285 +++++++++++++
>  16 files changed, 934 insertions(+), 44 deletions(-)
>  create mode 100644 tools/lib/bpf/netlink.h
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-02 21:09 ` [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Andrii Nakryiko
@ 2021-06-02 21:45   ` Kumar Kartikeya Dwivedi
  2021-06-02 23:50     ` Alexei Starovoitov
  0 siblings, 1 reply; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-02 21:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Networking

On Thu, Jun 03, 2021 at 02:39:15AM IST, Andrii Nakryiko wrote:
> On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > This is the first RFC version.
> >
> > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> > introduces fd based ownership for such TC filters. Netlink cannot delete or
> > replace such filters, but the bpf_link is severed on indirect destruction of the
> > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > pinning approach can be used.
> >
> > The individual patches contain more details and comments, but the overall kernel
> > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> > merged recently. This means that we start by always setting direct action mode,
> > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> > options in the future, they can be easily exposed through the bpf_link API in
> > the future.
> >
> > Patch 1 refactors cls_bpf change function to extract two helpers that will be
> > reused in bpf_link creation.
> >
> > Patch 2 exports some bpf_link management functions to modules. This is needed
> > because our bpf_link object is tied to the cls_bpf_prog object. Tying it to
> > tcf_proto would be weird, because the update path has to replace offloaded bpf
> > prog, which happens using internal cls_bpf helpers, and would in general be more
> > code to abstract over an operation that is unlikely to be implemented for other
> > filter types.
> >
> > Patch 3 adds the main bpf_link API. A function in cls_api takes care of
> > obtaining block reference, creating the filter object, and then calls the
> > bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after
> > setting up the internal structures. An optimization is made to not keep around
> > resources for extended actions, which is explained in a code comment as it wasn't
> > immediately obvious.
> >
> > Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports
> > replacing the bpf_prog, we can skip tc filter's change path by reusing the
> > filter object but swapping its bpf_prog. This takes care of replacing the
> > offloaded prog as well (if that fails, update is aborted). So far however,
> > tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter
> > would never be modified concurrently. This is no longer true, and to not
> > penalize the classify hot path, we also cannot impose serialization around
> > its load. Hence the load is changed to READ_ONCE, so that the pointer value is
> > always consistent. Due to invocation in a RCU critical section, the lifetime of
> > the prog is guaranteed for the duration of the call.
> >
> > Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning
> > function to libbpf.
> >
> > Patch 7 adds a selftest that exercises all possible problematic interactions
> > that I could think of.
> >
> > Design:
> >
> > This is where in the object hierarchy our bpf_link object is attached.
> >
> >                                                                             ┌─────┐
> >                                                                             │     │
> >                                                                             │ BPF │
> >                                                                             program
> >                                                                             │     │
> >                                                                             └──▲──┘
> >                                                       ┌───────┐                │
> >                                                       │       │         ┌──────┴───────┐
> >                                                       │  mod  ├─────────► cls_bpf_prog │
> > ┌────────────────┐                                    │cls_bpf│         └────┬───▲─────┘
> > │    tcf_block   │                                    │       │              │   │
> > └────────┬───────┘                                    └───▲───┘              │   │
> >          │          ┌─────────────┐                       │                ┌─▼───┴──┐
> >          └──────────►  tcf_chain  │                       │                │bpf_link│
> >                     └───────┬─────┘                       │                └────────┘
> >                             │          ┌─────────────┐    │
> >                             └──────────►  tcf_proto  ├────┘
> >                                        └─────────────┘
> >
> > The bpf_link is detached on destruction of the cls_bpf_prog.  Doing it this way
> > allows us to implement update in a lightweight manner without having to recreate
> > a new filter, where we can just replace the BPF prog attached to cls_bpf_prog.
> >
> > The other way to do it would be to link the bpf_link to tcf_proto, there are
> > numerous downsides to this:
> >
> > 1. All filters have to embed the pointer even though they won't be using it when
> > cls_bpf is compiled in.
> > 2. This probably won't make sense to be extended to other filter types anyway.
> > 3. We aren't able to optimize the update case without adding another bpf_link
> > specific update operation to tcf_proto ops.
> >
> > The downside with tying this to the module is having to export bpf_link
> > management functions and introducing a tcf_proto op. Hopefully the cost of
> > another operation func pointer is not big enough (as there is only one ops
> > struct per module).
> >
> > This first version is to collect feedback on the approach and get ideas if there
> > is a better way to do this.
>
> Bpf_link-based TC API is a long time coming, so it's great to see
> someone finally working on this. Thanks!
>
> I briefly skimmed through the patch set, noticed a few generic
> bpf_link problems. But I think main feedback will come from Cilium

Thanks for the review. I'll fix both of these in the resend (I also
have a couple of private reports from the kernel test robot).

> folks and others that heavily rely on TC APIs. I wonder if there is an
> opportunity to simplify the API further given we have a new
> opportunity here. I don't think we are constrained to follow legacy TC
> API exactly.
>

I tried to keep it simple by going with the defaults we agreed upon for
the new netlink based libbpf API, always setting direct action mode.
It's still in a position to be extended in the future to allow full TC
filter setup like netlink does, if someone ever needs that.

As for the implementation, I did notice that there has been discussion around
this (though I could only find [0]) but I think doing it the way this patch does
is more flexible, as you can attach the bpf filter to an arbitrary parent/class,
not just ingress and egress, and it can coexist with a conventional TC setup.

[0]: https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html
     ("Note that there is ongoing work ...")

> The problem is that your patch set was marked as spam by Google, so I
> suspect a bunch of folks haven't gotten it. I suggest re-sending it
> again but trimming down the CC list, leaving only bpf@vger,
> netdev@vger, and BPF maintainers CC'ed directly.
>

Thanks for the heads up, I'll resend tomorrow.

--
Kartikeya

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-02 21:45   ` Kumar Kartikeya Dwivedi
@ 2021-06-02 23:50     ` Alexei Starovoitov
  2021-06-04  6:43       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2021-06-02 23:50 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, Networking

On Thu, Jun 03, 2021 at 03:15:13AM +0530, Kumar Kartikeya Dwivedi wrote:
> 
> > The problem is that your patch set was marked as spam by Google, so I
> > suspect a bunch of folks haven't gotten it. I suggest re-sending it
> > again but trimming down the CC list, leaving only bpf@vger,
> > netdev@vger, and BPF maintainers CC'ed directly.
> >
> 
> Thanks for the heads up, I'll resend tomorrow.

fyi I see this thread in my inbox, but, sadly, not the patches.
So I'm guessing based on the cover letter and hoping that the following
is true: link_fd is returned by the BPF_LINK_CREATE command. If
anything is missing in struct link_create, the patches are adding it
there. target_ifindex and flags are reused; attach_type indicates
ingress vs egress.
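
Judging by the libbpf hunk quoted earlier in the thread, the
link_create union member would look roughly like this (layout assumed
from the userspace side, since I don't have the patches):

	/* inside union bpf_attr's link_create member */
	struct {
		__u32 parent;
		__u32 handle;
		__u32 priority;
		__u32 gen_flags;
	} tc;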

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-02 23:50     ` Alexei Starovoitov
@ 2021-06-04  6:43       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-04  6:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Cong Wang, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, Networking

On Thu, Jun 03, 2021 at 05:20:58AM IST, Alexei Starovoitov wrote:
> On Thu, Jun 03, 2021 at 03:15:13AM +0530, Kumar Kartikeya Dwivedi wrote:
> >
> > > The problem is that your patch set was marked as spam by Google, so I
> > > suspect a bunch of folks haven't gotten it. I suggest re-sending it
> > > again but trimming down the CC list, leaving only bpf@vger,
> > > netdev@vger, and BPF maintainers CC'ed directly.
> > >
> >
> > Thanks for the heads up, I'll resend tomorrow.
>
> fyi I see this thread in my inbox, but, sadly, not the patches.
> So guessing based on cover letter and hoping that the following is true:
> link_fd is returned by BPF_LINK_CREATE command.
> If anything is missing in struct link_create the patches are adding it there.
> target_ifindex, flags are reused. attach_type indicates ingress vs egress.

Everything is true except the attach_type part. I don't hook directly into
sch_handle_{ingress,egress}. It's a normal TC filter, and if one wants
to hook into ingress or egress, they attach it to the clsact qdisc. The
lifetime, however, is decided by the link fd.
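
As a minimal userspace sketch of that flow (mirroring the selftest in
this series):

	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = ifindex,
			    .attach_point = BPF_TC_INGRESS);
	DECLARE_LIBBPF_OPTS(bpf_tc_link_opts, opts, .handle = 1,
			    .priority = 1);
	struct bpf_link *link;
	int err;

	/* creates the clsact qdisc if it doesn't already exist */
	err = bpf_tc_hook_create(&hook);
	if (err && err != -EEXIST)
		return err;

	/* the filter's lifetime is now tied to the returned link fd;
	 * close it to detach, or pin it to outlive the process
	 */
	link = bpf_program__attach_tc(prog, &hook, &opts);
	err = libbpf_get_error(link);
	if (err)
		return err;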

The new version is here:
https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

--
Kartikeya

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
                   ` (7 preceding siblings ...)
  2021-06-02 21:09 ` [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Andrii Nakryiko
@ 2021-06-06 23:37 ` Cong Wang
  2021-06-07  3:37   ` Kumar Kartikeya Dwivedi
  8 siblings, 1 reply; 45+ messages in thread
From: Cong Wang @ 2021-06-06 23:37 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> This is the first RFC version.
>
> This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> introduces fd based ownership for such TC filters. Netlink cannot delete or
> replace such filters, but the bpf_link is severed on indirect destruction of the
> filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> that filters remain attached beyond process lifetime, the usual bpf_link fd
> pinning approach can be used.

I have some trouble understanding this. So... why is the TC filter so
special here that it deserves such special treatment?

The reason I ask is that none of the other bpf links actually create
any object; they merely attach a bpf program to an existing object. For
example, the netns bpf_link does not create a netns, and the cgroup
bpf_link does not create a cgroup either. So why is the TC filter so
lucky to be the first one that requires creating an object?

Is it because there is no fd associated with any TC object? A TC
object, like all other netlink-based objects, is not fs based, hence
does not have an fd. Or maybe you don't need an fd at all? At least the
xdp bpf_link is associated with a netdev, which does not have an fd
either.

>
> The individual patches contain more details and comments, but the overall kernel
> API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> merged recently. This means that we start by always setting direct action mode,
> protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> options in the future, they can be easily exposed through the bpf_link API in
> the future.

As you already see, this fits really oddly into the TC infrastructure:
TC qdisc/filter/action form a whole subsystem, and here you are trying
to punch a hole in the middle. ;) This usually indicates that we are
going in the wrong direction; maybe your case is an exception, but I
can't find anything to justify it in your cover letter.

Even if you really want to go down this path (I still doubt it), you
probably want to explore whether there is any generic way to associate
a TC object with an fd, because we have the TC bpf action and will have
a TC bpf qdisc too; I don't see how cls_bpf is more special than them.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-06 23:37 ` Cong Wang
@ 2021-06-07  3:37   ` Kumar Kartikeya Dwivedi
  2021-06-07  5:18     ` Cong Wang
  0 siblings, 1 reply; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-07  3:37 UTC (permalink / raw)
  To: Cong Wang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote:
> On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > This is the first RFC version.
> >
> > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> > introduces fd based ownership for such TC filters. Netlink cannot delete or
> > replace such filters, but the bpf_link is severed on indirect destruction of the
> > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > pinning approach can be used.
>
> I have some troubles understanding this. So... why TC filter is so special
> here that it deserves such a special treatment?
>

So the motivation behind this was automatic cleanup of filters
installed by some program. Usually, from the userspace side, you need a
bunch of things (handle, priority, protocol, chain_index, etc.) to be
able to delete a filter without stepping on others' toes. Also, there
is no guarantee that the filter hasn't been replaced, so you need to
check some other way (either the tag or prog_id, but these are also not
perfect).
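
For contrast, here is a sketch of what cleanup looks like with the
recently merged netlink-based libbpf API, where the deleting process
has to carry all of this state itself (my_prog_id stands for whatever
ID the installing process recorded):

	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = ifindex,
			    .attach_point = BPF_TC_INGRESS);
	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .handle = 1, .priority = 1);
	int err;

	/* check that the filter at (handle, priority) is still ours;
	 * prog_id can still race with a concurrent replace
	 */
	err = bpf_tc_query(&hook, &opts);
	if (err || opts.prog_id != my_prog_id)
		return err ?: -ESTALE;

	/* detach requires prog_fd/prog_id/flags to be unset */
	opts.prog_fd = opts.prog_id = opts.flags = 0;
	return bpf_tc_detach(&hook, &opts);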

bpf_link provides isolation from netlink and fd-based lifetime
management. As for why it needs special treatment (by which I guess you
mean why it _creates_ an object instead of simply attaching to one),
see below:

> The reason why I ask is that none of other bpf links actually create any
> object, they merely attach bpf program to an existing object. For example,
> netns bpf_link does not create an netns, cgroup bpf_link does not create
> a cgroup either. So, why TC filter is so lucky to be the first one requires
> creating an object?
>

They are created behind the scenes, but are also fairly isolated from
netlink (i.e. they can only be introspected, so not hidden in that
sense, but are effectively locked against replace/delete).

The problem with not creating a filter during attach is that a typical
'attach point' for TC exists in the form of a tcf_proto. If we take
priority (protocol is fixed) out of the equation, it becomes possible
to attach just a single BPF prog, but that seems like a needless
limitation when TC already supports a list of filters at each 'attach
point'.

My point is that the created object (the tcf_proto under the 'chain' object) is
the attach point, and since there can be so many, keeping them around at all
times doesn't make sense, so the refcounted attach locations are created as
needed.  Both netlink and bpf_link owned filters can be attached under the same
location, with a different ownership story in userspace.

> Is it because there is no fd associated with any TC object?  Or what?
> TC object, like all other netlink stuffs, is not fs based, hence does not
> have an fd. Or maybe you don't need an fd at all? Since at least xdp
> bpf_link is associated with a netdev which does not have an fd either.
>
> >
> > The individual patches contain more details and comments, but the overall kernel
> > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> > merged recently. This means that we start by always setting direct action mode,
> > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> > options in the future, they can be easily exposed through the bpf_link API in
> > the future.
>
> As you already see, this fits really oddly into TC infrastructure, because
> TC qdisc/filter/action are a whole subsystem, here you are trying to punch
> a hole in the middle. ;) This usually indicates that we are going in a wrong
> direction, maybe your case is an exception, but I can't find anything to justify
> it in your cover letter.
>

I don't see why I'm punching a hole. The qdisc, chain, protocol, and
priority form the 'attach location'; the handle is just an ID. Maybe we
could skip all this and just create a static hook for attaching a
single BPF program that doesn't require creating a filter, but someday
someone would have to reimplement chaining of programs again (like
libxdp does).

> Even if you really want to go down this path (I still double), you probably
> want to explore whether there is any generic way to associate a TC object
> with an fd, because we have TC bpf action and we will have TC bpf qdisc
> too, I don't see any bpf_cls is more special than them.
>

I think TC bpf actions are not relevant going forward (due to cls_bpf's direct
action mode), but I could be wrong. I say so because even a proposed API to
attach these from libbpf was dropped, as arguably cls_bpf does it better,
and people shouldn't be using integrated actions going forward.

The TC bpf qdisc might be, but that can be a different attach type (say
BPF_SCHED), which, if exposed through bpf_link, will again have the
target_ifindex as its attach point, not some fd, and it would still be
possible to use this API to attach to an eBPF qdisc.

What do you suggest? I am open to reworking this in a different way if there are
any better ideas.

> Thanks.

--
Kartikeya

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-07  3:37   ` Kumar Kartikeya Dwivedi
@ 2021-06-07  5:18     ` Cong Wang
  2021-06-07  6:07       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 45+ messages in thread
From: Cong Wang @ 2021-06-07  5:18 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote:
> > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > This is the first RFC version.
> > >
> > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> > > introduces fd based ownership for such TC filters. Netlink cannot delete or
> > > replace such filters, but the bpf_link is severed on indirect destruction of the
> > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > > pinning approach can be used.
> >
> > I have some troubles understanding this. So... why TC filter is so special
> > here that it deserves such a special treatment?
> >
>
> So the motivation behind this was automatic cleanup of filters installed by some
> program. Usually from the userspace side you need a bunch of things (handle,
> priority, protocol, chain_index, etc.) to be able to delete a filter without
> stepping on others' toes. Also, there is no gurantee that filter hasn't been
> replaced, so you need to check some other way (either tag or prog_id, but these
> are also not perfect).
>
> bpf_link provides isolation from netlink and fd-based lifetime management. As
> for why it needs special treatment (by which I guess you mean why it _creates_
> an object instead of simply attaching to one, see below):

Are you saying the TC filter is not independent, IOW that it has to
rely on the TC qdisc to exist? This is true, and is of course different
from netns/cgroup. This is perhaps not hard to solve: TC actions are
already independent, and we can perhaps convert TC filters too (the
biggest blocker is compatibility).

Or do you just need an ephemeral representation of a TC filter which only exists
for a process? If so, see below.

>
> > The reason why I ask is that none of other bpf links actually create any
> > object, they merely attach bpf program to an existing object. For example,
> > netns bpf_link does not create an netns, cgroup bpf_link does not create
> > a cgroup either. So, why TC filter is so lucky to be the first one requires
> > creating an object?
> >
>
> They are created behind the scenes, but are also fairly isolated from netlink
> (i.e.  can only be introspected, so not hidden in that sense, but are
> effectively locked for replace/delete).
>
> The problem would be (of not creating a filter during attach) is that a typical
> 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol
> is fixed) out of the equation, it becomes possible to attach just a single BPF
> prog, but that seems like a needless limitation when TC already supports list of
> filters at each 'attach point'.
>
> My point is that the created object (the tcf_proto under the 'chain' object) is
> the attach point, and since there can be so many, keeping them around at all
> times doesn't make sense, so the refcounted attach locations are created as
> needed.  Both netlink and bpf_link owned filters can be attached under the same
> location, with different ownership story in userspace.

I do not understand "created behind the scenes". These are all created
independently of bpf_link, right? For example, we create a cgroup with
mkdir, then open it and pass the fd to bpf_link. Clearly, the cgroup is
not created by bpf_link or any bpf syscall.

The only thing different is the fd, or more accurately, an identifier
to locate these objects. For example, an ifindex can also be used to
locate a netdev. We can certainly locate a TC filter with (prio, proto,
handle), but this has to be passed via netlink. So if you need some
locator, I think we can introduce a kernel API which takes all the
necessary parameters to locate a TC filter and returns it to you. For a
quick example, like this:

struct tcf_proto *tcf_get_proto(struct net *net, int ifindex,
                                int parent, char* kind, int handle...);

(Note, it can grab a refcnt in case of being deleted by others.)

With this, you can simply call it in bpf_link and attach the bpf prog
to the tcf_proto (of course, only cls_bpf succeeds here).
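
Then the BPF_LINK_CREATE path could look roughly like this sketch
(every name below besides tcf_get_proto is hypothetical, just to make
the shape concrete):

	struct tcf_proto *tp;
	int err;

	tp = tcf_get_proto(net, attr->link_create.target_ifindex,
			   attr->link_create.tc.parent, "bpf",
			   attr->link_create.tc.handle);
	if (IS_ERR(tp))
		return PTR_ERR(tp);

	/* only cls_bpf would implement this op; the link would drop
	 * the tcf_proto reference when it is detached/closed
	 */
	err = tcf_proto_bpf_link_attach(tp, prog, &link);
	if (err)
		tcf_put_proto(tp);
	return err;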

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-07  5:18     ` Cong Wang
@ 2021-06-07  6:07       ` Kumar Kartikeya Dwivedi
  2021-06-08  2:00         ` Cong Wang
  0 siblings, 1 reply; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-07  6:07 UTC (permalink / raw)
  To: Cong Wang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote:
> On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote:
> > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > This is the first RFC version.
> > > >
> > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> > > > introduces fd based ownership for such TC filters. Netlink cannot delete or
> > > > replace such filters, but the bpf_link is severed on indirect destruction of the
> > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > > > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > > > pinning approach can be used.
> > >
> > > I have some troubles understanding this. So... why TC filter is so special
> > > here that it deserves such a special treatment?
> > >
> >
> > So the motivation behind this was automatic cleanup of filters installed by some
> > program. Usually from the userspace side you need a bunch of things (handle,
> > priority, protocol, chain_index, etc.) to be able to delete a filter without
> > stepping on others' toes. Also, there is no gurantee that filter hasn't been
> > replaced, so you need to check some other way (either tag or prog_id, but these
> > are also not perfect).
> >
> > bpf_link provides isolation from netlink and fd-based lifetime management. As
> > for why it needs special treatment (by which I guess you mean why it _creates_
> > an object instead of simply attaching to one, see below):
>
> Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc
> to exist. This is true, and is of course different with netns/cgroup.
> This is perhaps
> not hard to solve, because TC actions are already independent, we can perhaps
> convert TC filters too (the biggest blocker is compatibility).
>

True, but that would mean you need some way to create a detached TC
filter, correct? Can you give some idea of how the setup would look
from the userspace side?

IIUC you mean

RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID

then bpf_link comes in and creates the binding to the qdisc, parent, prio,
chain, handle ... ?

> Or do you just need an ephemeral representation of a TC filter which only exists
> for a process? If so, see below.
>
> >
> > > The reason why I ask is that none of other bpf links actually create any
> > > object, they merely attach bpf program to an existing object. For example,
> > > netns bpf_link does not create an netns, cgroup bpf_link does not create
> > > a cgroup either. So, why TC filter is so lucky to be the first one requires
> > > creating an object?
> > >
> >
> > They are created behind the scenes, but are also fairly isolated from netlink
> > (i.e.  can only be introspected, so not hidden in that sense, but are
> > effectively locked for replace/delete).
> >
> > The problem would be (of not creating a filter during attach) is that a typical
> > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol
> > is fixed) out of the equation, it becomes possible to attach just a single BPF
> > prog, but that seems like a needless limitation when TC already supports list of
> > filters at each 'attach point'.
> >
> > My point is that the created object (the tcf_proto under the 'chain' object) is
> > the attach point, and since there can be so many, keeping them around at all
> > times doesn't make sense, so the refcounted attach locations are created as
> > needed.  Both netlink and bpf_link owned filters can be attached under the same
> > location, with different ownership story in userspace.
>
> I do not understand "created behind the scenes". These are all created
> independent of bpf_link, right? For example, we create a cgroup with
> mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not
> created by bpf_link or any bpf syscall.

Sorry, that must be confusing. I was only referring to what this patch
does. Indeed, as far as the implementation is concerned, this has no
precedent.

>
> The only thing different is fd, or more accurately, an identifier to locate
> these objects. For example, ifindex can also be used to locate a netdev.
> We can certainly locate a TC filter with (prio,proto,handle) but this has to
> be passed via netlink. So if you need some locator, I think we can
> introduce a kernel API which takes all necessary parameters to locate
> a TC filter and return it to you. For a quick example, like this:
>
> struct tcf_proto *tcf_get_proto(struct net *net, int ifindex,
>                                 int parent, char* kind, int handle...);
>

I think this already exists in some form, i.e. you can just ignore it
if the filter handle from tp->ops->get doesn't exist, reusing the
existing code that walks from qdisc/block -> chain -> tcf_proto during
creation.

> (Note, it can grab a refcnt in case of being deleted by others.)
>
> With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto
> (of course, only cls_bpf succeeds here).
>

So IIUC, you are proposing to first create a filter normally using netlink, then
attach it using bpf_link to the proper parent? I.e. your main point of
contention is to not create the filter from bpf_link, but instead take
a filter and attach it to a parent, with the bpf_link representing this
attachment?

But then the created filter stays with a refcount of 1 until
RTM_DELTFILTER, i.e. the lifetime of the attachment may be managed by
bpf_link (in that we can detach the filter from its parent), but the
filter itself will not be cleaned up. One of the goals of tying the TC
filter to an fd was to bind the lifetime of the filter itself along
with the attachment. Separating the two doesn't seem to buy anything
here; you always create a filter to attach it somewhere.

With actions, things are different: you may create one action but bind
it to multiple filters, so actions existing as their own thing makes
sense. A single action can serve multiple filters and save on memory.

You could argue that even with filters this is true, as you may want to attach
the same filter to multiple qdiscs, but we already have a facility to do that
(shared tcf_block with block->q == NULL). However, that is not as flexible as
what you are proposing.

It may be odd from the kernel side, but to userspace the (parent, prio,
handle) tuple (we don't let the user choose anything else for now) is
itself the attach point; how bpf_link manages the attachment internally
isn't really that interesting. It does so now by creating an object
that represents a certain hook, then binding the BPF prog to it. I
consider this mostly an implementation detail. What you are really
attaching to is the qdisc/block, which is the resource analogous to a
cgroup fd, netns fd, or ifindex, and 'where' is described by the other
attributes.

> Thanks.

--
Kartikeya

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-07  6:07       ` Kumar Kartikeya Dwivedi
@ 2021-06-08  2:00         ` Cong Wang
  2021-06-08  7:19           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 45+ messages in thread
From: Cong Wang @ 2021-06-08  2:00 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Sun, Jun 6, 2021 at 11:08 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote:
> > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote:
> > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
> > > > <memxor@gmail.com> wrote:
> > > > >
> > > > > This is the first RFC version.
> > > > >
> > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or
> > > > > replace such filters, but the bpf_link is severed on indirect destruction of the
> > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > > > > pinning approach can be used.
> > > >
> > > > I have some troubles understanding this. So... why TC filter is so special
> > > > here that it deserves such a special treatment?
> > > >
> > >
> > > So the motivation behind this was automatic cleanup of filters installed by some
> > > program. Usually from the userspace side you need a bunch of things (handle,
> > > priority, protocol, chain_index, etc.) to be able to delete a filter without
> > > stepping on others' toes. Also, there is no gurantee that filter hasn't been
> > > replaced, so you need to check some other way (either tag or prog_id, but these
> > > are also not perfect).
> > >
> > > bpf_link provides isolation from netlink and fd-based lifetime management. As
> > > for why it needs special treatment (by which I guess you mean why it _creates_
> > > an object instead of simply attaching to one, see below):
> >
> > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc
> > to exist. This is true, and is of course different with netns/cgroup.
> > This is perhaps
> > not hard to solve, because TC actions are already independent, we can perhaps
> > convert TC filters too (the biggest blocker is compatibility).
> >
>
> True, but that would mean you need some way to create a detached TC filter, correct?
> Can you give some ideas on how the setup would look like from userspace side?
>
> IIUC you mean
>
> RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID
>
> then bpf_link comes in and creates the binding to the qdisc, parent, prio,
> chain, handle ... ?

Roughly yes, except creation is still done by netlink, not bpf_link. It
is pretty similar to unbound TC actions.

>
> > Or do you just need an ephemeral representation of a TC filter which only exists
> > for a process? If so, see below.
> >
> > >
> > > > The reason why I ask is that none of other bpf links actually create any
> > > > object, they merely attach bpf program to an existing object. For example,
> > > > netns bpf_link does not create an netns, cgroup bpf_link does not create
> > > > a cgroup either. So, why TC filter is so lucky to be the first one requires
> > > > creating an object?
> > > >
> > >
> > > They are created behind the scenes, but are also fairly isolated from netlink
> > > (i.e.  can only be introspected, so not hidden in that sense, but are
> > > effectively locked for replace/delete).
> > >
> > > The problem (of not creating a filter during attach) is that a typical
> > > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol
> > > is fixed) out of the equation, it becomes possible to attach just a single BPF
> > > prog, but that seems like a needless limitation when TC already supports list of
> > > filters at each 'attach point'.
> > >
> > > My point is that the created object (the tcf_proto under the 'chain' object) is
> > > the attach point, and since there can be so many, keeping them around at all
> > > times doesn't make sense, so the refcounted attach locations are created as
> > > needed.  Both netlink and bpf_link owned filters can be attached under the same
> > > location, with different ownership story in userspace.
> >
> > I do not understand "created behind the scenes". These are all created
> > independent of bpf_link, right? For example, we create a cgroup with
> > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not
> > created by bpf_link or any bpf syscall.
>
> Sorry, that must be confusing. I was only referring to what this patch does.
> Indeed, as far as the implementation is concerned, this has no precedent.
>
> >
> > The only thing different is fd, or more accurately, an identifier to locate
> > these objects. For example, ifindex can also be used to locate a netdev.
> > We can certainly locate a TC filter with (prio,proto,handle) but this has to
> > be passed via netlink. So if you need some locator, I think we can
> > introduce a kernel API which takes all necessary parameters to locate
> > a TC filter and return it to you. For a quick example, like this:
> >
> > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex,
> >                                 int parent, char* kind, int handle...);
> >
>
> I think this already exists in some way, i.e. you can reuse the existing code
> that walks from qdisc/block -> chain -> tcf_proto during creation, and just
> ignore the case where the filter handle from tp->ops->get doesn't exist.

Right, except currently it requires a few APIs to reach TC filters (first the
netdev, then the qdisc, finally the filters). So, I think providing one API
could at least address your "stepping on others' toes" concern?

>
> > (Note, it can grab a refcnt in case of being deleted by others.)
> >
> > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto
> > (of course, only cls_bpf succeeds here).
> >
>
> So IIUC, you are proposing to first create a filter normally using netlink, then
> attach it using bpf_link to the proper parent? I.e. your main contention point
> is to not create filter from bpf_link, instead take a filter and attach it to a
> parent with bpf_link representing this attachment?

Yes. I don't see a reason why we would want to create it from bpf_link.

>
> But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e.
> the lifetime of the attachment may be managed by bpf_link (in that we can detach
> the filter from parent) but the filter itself will not be cleaned up. One of the
> goals of tying TC filter to fd was to bind lifetime of filter itself, along with
> attachment. Separating both doesn't seem to buy anything here. You always create
> a filter to attach somewhere.

This is really odd, for two reasons:

1) Why does netdev not have such a problem? bpf_xdp_link_attach() uses an
ifindex to locate a netdev, without creating it or cleaning it up either. So,
why do we never want to bind a netdev to an fd? IOW, what makes TC filters'
lifetime so different from a netdev's?

2) All existing bpf_link targets, except netdev, are fs based, hence an fd
makes sense for them naturally. TC filters, or any other netlink based things,
are not even related to fs, hence an fd does not make sense here, just like we
never bind a netdev to an fd.

>
> With actions, things are different, you may create one action but bind it to
> multiple filters, so actions existing as their own thing makes sense. A single
> action can serve multiple filters, and save on memory.
>
> You could argue that even with filters this is true, as you may want to attach
> the same filter to multiple qdiscs, but we already have a facility to do that
> (shared tcf_block with block->q == NULL). However that is not as flexible as
> what you are proposing.

True. I think making TC filters as standalone as TC actions is the right
direction, if it helps you too.

>
> It may be odd from the kernel side but to userspace a parent, prio, handle (we
> don't let the user choose anything else for now) is itself the attach point; how
> bpf_link manages the attachment internally isn't really that interesting. It
> does so now by way of creating an object that represents a certain hook, then
> binding the BPF prog to it. I consider this mostly an implementation detail.
> What you are really attaching to is the qdisc/block, which is the resource
> analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other
> attributes.

How do you establish the analogy here? cgroup and netns are fs based, so having
an fd is natural. An ifindex is not an fd; it is a locator for a netdev. Plus,
the current bpf_link code does not create any of them.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-08  2:00         ` Cong Wang
@ 2021-06-08  7:19           ` Kumar Kartikeya Dwivedi
  2021-06-08 15:39             ` Alexei Starovoitov
  2021-06-11  2:00             ` Cong Wang
  0 siblings, 2 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-08  7:19 UTC (permalink / raw)
  To: Cong Wang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Tue, Jun 08, 2021 at 07:30:40AM IST, Cong Wang wrote:
> On Sun, Jun 6, 2021 at 11:08 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote:
> > > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote:
> > > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
> > > > > <memxor@gmail.com> wrote:
> > > > > >
> > > > > > This is the first RFC version.
> > > > > >
> > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> > > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or
> > > > > > replace such filters, but the bpf_link is severed on indirect destruction of the
> > > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > > > > > pinning approach can be used.
> > > > >
> > > > > I have some trouble understanding this. So... why is the TC filter so
> > > > > special here that it deserves such special treatment?
> > > > >
> > > >
> > > > So the motivation behind this was automatic cleanup of filters installed by some
> > > > program. Usually from the userspace side you need a bunch of things (handle,
> > > > priority, protocol, chain_index, etc.) to be able to delete a filter without
> > > > stepping on others' toes. Also, there is no guarantee that the filter hasn't been
> > > > replaced, so you need to check some other way (either tag or prog_id, but these
> > > > are also not perfect).
> > > >
> > > > bpf_link provides isolation from netlink and fd-based lifetime management. As
> > > > for why it needs special treatment (by which I guess you mean why it _creates_
> > > > an object instead of simply attaching to one, see below):
> > >
> > > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc
> > > to exist. This is true, and is of course different from netns/cgroup. This is
> > > perhaps not hard to solve, because TC actions are already independent; we can
> > > perhaps convert TC filters too (the biggest blocker is compatibility).
> > >
> >
> > True, but that would mean you need some way to create a detached TC filter, correct?
> > Can you give some ideas on how the setup would look like from userspace side?
> >
> > IIUC you mean
> >
> > RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID
> >
> > then bpf_link comes in and creates the binding to the qdisc, parent, prio,
> > chain, handle ... ?
>
> Roughly yes, except creation is still done by netlink, not bpf_link. It is
> pretty much similar to those unbound TC actions.
>

Right, thanks for explaining. I will try to work on this and see if it works out.

> >
> > > Or do you just need an ephemeral representation of a TC filter which only exists
> > > for a process? If so, see below.
> > >
> > > >
> > > > > The reason why I ask is that none of the other bpf links actually create any
> > > > > object; they merely attach a bpf program to an existing object. For example,
> > > > > a netns bpf_link does not create a netns, and a cgroup bpf_link does not
> > > > > create a cgroup either. So, why is the TC filter so lucky to be the first one
> > > > > that requires creating an object?
> > > > >
> > > >
> > > > They are created behind the scenes, but are also fairly isolated from netlink
> > > > (i.e.  can only be introspected, so not hidden in that sense, but are
> > > > effectively locked for replace/delete).
> > > >
> > > > The problem (of not creating a filter during attach) is that a typical
> > > > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol
> > > > is fixed) out of the equation, it becomes possible to attach just a single BPF
> > > > prog, but that seems like a needless limitation when TC already supports list of
> > > > filters at each 'attach point'.
> > > >
> > > > My point is that the created object (the tcf_proto under the 'chain' object) is
> > > > the attach point, and since there can be so many, keeping them around at all
> > > > times doesn't make sense, so the refcounted attach locations are created as
> > > > needed.  Both netlink and bpf_link owned filters can be attached under the same
> > > > location, with different ownership story in userspace.
> > >
> > > I do not understand "created behind the scenes". These are all created
> > > independent of bpf_link, right? For example, we create a cgroup with
> > > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not
> > > created by bpf_link or any bpf syscall.
> >
> > Sorry, that must be confusing. I was only referring to what this patch does.
> > Indeed, as far as the implementation is concerned, this has no precedent.
> >
> > >
> > > The only thing different is fd, or more accurately, an identifier to locate
> > > these objects. For example, ifindex can also be used to locate a netdev.
> > > We can certainly locate a TC filter with (prio,proto,handle) but this has to
> > > be passed via netlink. So if you need some locator, I think we can
> > > introduce a kernel API which takes all necessary parameters to locate
> > > a TC filter and return it to you. For a quick example, like this:
> > >
> > > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex,
> > >                                 int parent, char* kind, int handle...);
> > >
> >
> > I think this already exists in some way, i.e. you can reuse the existing code
> > that walks from qdisc/block -> chain -> tcf_proto during creation, and just
> > ignore the case where the filter handle from tp->ops->get doesn't exist.
>
> Right, except currently it requires a few APIs to reach TC filters (first the
> netdev, then the qdisc, finally the filters). So, I think providing one API
> could at least address your "stepping on others' toes" concern?
>
> >
> > > (Note, it can grab a refcnt in case of being deleted by others.)
> > >
> > > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto
> > > (of course, only cls_bpf succeeds here).
> > >
> >
> > So IIUC, you are proposing to first create a filter normally using netlink, then
> > attach it using bpf_link to the proper parent? I.e. your main contention point
> > is to not create filter from bpf_link, instead take a filter and attach it to a
> > parent with bpf_link representing this attachment?
>
> Yes. I don't see a reason why we would want to create it from bpf_link.
>
> >
> > But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e.
> > the lifetime of the attachment may be managed by bpf_link (in that we can detach
> > the filter from parent) but the filter itself will not be cleaned up. One of the
> > goals of tying TC filter to fd was to bind lifetime of filter itself, along with
> > attachment. Separating both doesn't seem to buy anything here. You always create
> > a filter to attach somewhere.
>
> This is really odd, for two reasons:
>
> 1) Why does netdev not have such a problem? bpf_xdp_link_attach() uses an
> ifindex to locate a netdev, without creating it or cleaning it up either. So,
> why do we never want to bind a netdev to an fd? IOW, what makes TC filters'
> lifetime so different from a netdev's?
>

I think I tried to explain the difference, but I may have failed.

netdev does not have this problem because netdev is to XDP prog what qdisc is to
a SCHED_CLS prog. The filter is merely a way to hook into the qdisc. So we bind
the attachment's lifetime to the filter's lifetime, which in turn is controlled
by the bpf_link fd. When the filter is gone, the attachment to the qdisc is gone.

So we're not really creating a qdisc here, we're just tying the filter (which in
the current semantics exists only while attached) to the bpf_link. The filter is
the attachment, so tying its lifetime to bpf_link makes sense. When you destroy
the bpf_link, the filter goes away too, which means classification at that
hook (parent/class) in the qdisc stops working. This is why creating the filter
from the bpf_link made sense to me.
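
In userspace terms the flow looks like this (a sketch; the attach helper name
is hypothetical, only bpf_link__pin()/bpf_link__destroy() are existing libbpf
APIs):

	struct bpf_link *link;

	/* creates the filter and attaches it in one step (hypothetical name) */
	link = bpf_program__attach_tc_cls(prog, ifindex);
	if (libbpf_get_error(link))
		return -1;

	/* optional: pin so the filter survives process exit */
	bpf_link__pin(link, "/sys/fs/bpf/my_tc_link");

	/* otherwise, destroying the link also removes the filter */
	bpf_link__destroy(link);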

I hope you can see where I was going with this now.  Introducing a new kind of
method to attach to qdisc didn't seem wise to me, given all the infrastructure
already exists.

> 2) All existing bpf_link targets, except netdev, are fs based, hence an fd
> makes sense for them naturally. TC filters, or any other netlink based things,
> are not even related to fs, hence an fd does not make sense here, just like we
> never bind a netdev to an fd.
>

Yes, none of them create any objects. It is only a side effect of the current
semantics that you are able to control the filter's lifetime using the bpf_link,
as filter creation is also accompanied by its attachment to the qdisc.

Your unbound filter idea just separates the two. One will still end up creating
a cls_bpf_prog object internally in the kernel, just that it will now be
refcounted and linked into multiple tcf_proto objects (based on how many
bpf_links are attached).

An additional responsibility of user space is now to clean up these unbound
filters when it is done using them (either right after making a bpf_link
attachment so that they are removed on bpf_link destruction, or later), because
they don't sit under any chain etc., so a full flush of filters won't remove
them.

> >
> > With actions, things are different, you may create one action but bind it to
> > multiple filters, so actions existing as their own thing makes sense. A single
> > action can serve multiple filters, and save on memory.
> >
> > You could argue that even with filters this is true, as you may want to attach
> > the same filter to multiple qdiscs, but we already have a facility to do that
> > (shared tcf_block with block->q == NULL). However that is not as flexible as
> > what you are proposing.
>
> True. I think making TC filters as standalone as TC actions is the right
> direction, if it helps you too.
>
> >
> > It may be odd from the kernel side but to userspace a parent, prio, handle (we
> > don't let user choose anything else for now) is itself the attach point, how
> > bpf_link manages the attachment internally isn't really that interesting. It
> > does so now by way of creating an object that represents a certain hook, then
> > binding the BPF prog to it. I consider this mostly an implementation detail.
> > What you are really attaching to is the qdisc/block, which is the resource
> > analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other
> > attributes.
>
> How do you establish the analogy here? cgroup and netns are fs based, so
> having an fd is natural. An ifindex is not an fd; it is a locator for a
> netdev. Plus, the current bpf_link code does not create any of them.
>
> Thanks.

--
Kartikeya

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-08  7:19           ` Kumar Kartikeya Dwivedi
@ 2021-06-08 15:39             ` Alexei Starovoitov
  2021-06-11  2:10               ` Cong Wang
  2021-06-11  2:00             ` Cong Wang
  1 sibling, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2021-06-08 15:39 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Cong Wang, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Jamal Hadi Salim, Vlad Buslov,
	Jiri Pirko, David S. Miller, Jakub Kicinski, Joe Stringer,
	Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> > 2) All existing bpf_link targets, except netdev, are fs based, hence an fd makes
> > sense for them naturally. TC filters, or any other netlink based

fs analogy is not applicable.
bpf_link-s for tracing and xdp have nothing to do with file systems.

> > things, are not even
> > related to fs, hence an fd does not make sense here, just like we never bind
> > a netdev to an fd.
> >
>
> Yes, none of them create any objects. It is only a side effect of the current
> semantics that you are able to control the filter's lifetime using the
> bpf_link, as filter creation is also accompanied by its attachment to the
> qdisc.

I think it makes sense to create these objects as part of establishing bpf_link.
ingress qdisc is a fake qdisc anyway.
If we could go back in time I would argue that its existence doesn't
need to be shown in iproute2. It's an object that serves no purpose
other than attaching filters to it. It doesn't do any queuing unlike
real qdiscs.
It's an artifact of old choices. Old doesn't mean good.
The kernel is full of such quirks and oddities. New api-s shouldn't
blindly follow them.
tc qdisc add dev eth0 clsact
is a useless command with nop effect.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-08  7:19           ` Kumar Kartikeya Dwivedi
  2021-06-08 15:39             ` Alexei Starovoitov
@ 2021-06-11  2:00             ` Cong Wang
  2021-06-13  2:53               ` Kumar Kartikeya Dwivedi
  2021-06-13  3:08               ` Kumar Kartikeya Dwivedi
  1 sibling, 2 replies; 45+ messages in thread
From: Cong Wang @ 2021-06-11  2:00 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> So we're not really creating a qdisc here, we're just tying the filter (which in
> the current semantics exists only while attached) to the bpf_link. The filter is
> the attachment, so tying its lifetime to bpf_link makes sense. When you destroy
> the bpf_link, the filter goes away too, which means classification at that
> hook (parent/class) in the qdisc stops working. This is why creating the filter
> from the bpf_link made sense to me.

I see why you are creating TC filters now, because you are trying to
force the lifetime of a bpf target to align with the bpf program itself.
The deeper reason seems to be that a cls_bpf filter looks so small
that it appears to you that it has nothing but a bpf_prog, right?

I offer two different views here:

1. If you view a TC filter as an instance, like a netdev/qdisc/action, they
are no different from this perspective. Maybe the fact that a TC filter
resides in a qdisc makes a slight difference here, but like I mentioned, it
actually makes sense to let TC filters be standalone; qdiscs just have to
bind with them, like how we bind TC filters with standalone TC actions.
These are all updated independently, despite some of them residing in
another. There should not be an exceptional TC filter which can not
be updated via `tc filter` command.

2. For cls_bpf specifically, it is also an instance, like all other TC filters.
You can update it in the same way: tc filter change [...] The only difference
is that a bpf program can attach to such an instance. So you can view the bpf
program attached to cls_bpf as a property of it. From this point of view,
this is no different from XDP and a netdev, where an XDP program
attached to a netdev is also a property of the netdev. A netdev can still
function without XDP. Same for cls_bpf, it can just be a nop that returns
TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus,
the lifetime of a bpf program can be separated from the target it
attaches to, like all other bpf_link targets. bpf_link is just a
supplement to `tc filter change cls_bpf`, not a replacement for it.
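
In kernel-side terms, the classify path would then look roughly like this (an
illustrative sketch only, not actual cls_bpf code; prog here is the
cls_bpf_prog instance):

	struct bpf_prog *filter = READ_ONCE(prog->filter);

	if (!filter)
		return TC_ACT_SHOT; /* default verdict while nothing is attached */
	/* otherwise run the attached program, as today */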

This is actually simpler; you do not need to worry about whether the netdev
is destroyed when you detach the XDP bpf_link anyway, and the same goes for
cls_bpf filters. Likewise, TC filters don't need to worry about any
associated bpf_links.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-08 15:39             ` Alexei Starovoitov
@ 2021-06-11  2:10               ` Cong Wang
  0 siblings, 0 replies; 45+ messages in thread
From: Cong Wang @ 2021-06-11  2:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Jamal Hadi Salim,
	Vlad Buslov, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Joe Stringer, Quentin Monnet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Tue, Jun 8, 2021 at 8:39 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> I think it makes sense to create these objects as part of establishing bpf_link.
> ingress qdisc is a fake qdisc anyway.
> If we could go back in time I would argue that its existence doesn't
> need to be shown in iproute2. It's an object that serves no purpose
> other than attaching filters to it. It doesn't do any queuing unlike
> real qdiscs.
> It's an artifact of old choices. Old doesn't mean good.
> The kernel is full of such quirks and oddities. New api-s shouldn't
> blindly follow them.
> tc qdisc add dev eth0 clsact
> is a useless command with nop effect.

Sounds like you just need a new bpf attach point outside of TC,
probably inside __dev_queue_xmit(). You don't need to create
any object, probably just need to attach it to a netdev.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-11  2:00             ` Cong Wang
@ 2021-06-13  2:53               ` Kumar Kartikeya Dwivedi
  2021-06-13 20:27                 ` Jamal Hadi Salim
  2021-06-15  4:33                 ` Cong Wang
  2021-06-13  3:08               ` Kumar Kartikeya Dwivedi
  1 sibling, 2 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-13  2:53 UTC (permalink / raw)
  To: Cong Wang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote:
> On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > So we're not really creating a qdisc here, we're just tying the filter (which in
> > the current semantics exists only while attached) to the bpf_link. The filter is
> > the attachment, so tying its lifetime to bpf_link makes sense. When you destroy
> > the bpf_link, the filter goes away too, which means classification at that
> > hook (parent/class) in the qdisc stops working. This is why creating the filter
> > from the bpf_link made sense to me.
>
> I see why you are creating TC filters now, because you are trying to
> force the lifetime of a bpf target to align with the bpf program itself.
> The deeper reason seems to be that a cls_bpf filter looks so small
> that it appears to you that it has nothing but a bpf_prog, right?
>

Yes, pretty much.

> I offer two different views here:
>
> 1. If you view a TC filter as an instance, like a netdev/qdisc/action, they
> are no different from this perspective. Maybe the fact that a TC filter
> resides in a qdisc makes a slight difference here, but like I mentioned, it
> actually makes sense to let TC filters be standalone; qdiscs just have to
> bind with them, like how we bind TC filters with standalone TC actions.

You propose something different below IIUC, but I explained why I'm wary of
these unbound filters. They seem to add a step to classifier setup for no real
benefit to the user (who just gets one more object to keep track of and clean
up along with the link when done).

I understand that the filter is very much an object of its own and why keeping
them unbound makes sense, but for the user there is no real benefit to this
scheme (some things, like the classid attribute, are contextual in that they
make sense to be set based on which parent we're attaching to).

> These are all updated independently, despite some of them residing in
> another. There should not be an exceptional TC filter which can not
> be updated via `tc filter` command.

I see, but I'm mirroring what was done for XDP bpf_link.

Besides, flush still works; it's only that manipulating a filter managed by
bpf_link is not allowed, which sounds reasonable to me. Given we're bringing in
new ownership semantics here which didn't exist before with netlink, it doesn't
make sense to allow netlink to simply invalidate a filter installed by some
other program.

You wouldn't do something like that for a cooperating setup; we're just
enforcing that using -EPERM (bpf_link is not allowed to replace netlink
installed filters either, so it goes both ways).

>
> 2. For cls_bpf specifically, it is also an instance, like all other TC filters.
> You can update it in the same way: tc filter change [...] The only difference
> is that a bpf program can attach to such an instance. So you can view the bpf
> program attached to cls_bpf as a property of it. From this point of view,
> this is no different from XDP and a netdev, where an XDP program
> attached to a netdev is also a property of the netdev. A netdev can still
> function without XDP. Same for cls_bpf, it can just be a nop that returns
> TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus,
> the lifetime of a bpf program can be separated from the target it
> attaches to, like all other bpf_link targets. bpf_link is just a
> supplement to `tc filter change cls_bpf`, not a replacement for it.
>

So this is different now, as in the filter is attached as usual but bpf_link
represents attachment of bpf prog to the filter itself, not the filter to the
qdisc.

To me it seems apart from not having to create filter, this would pretty much be
equivalent to where I hook the bpf_link right now?

TBF, this split doesn't really seem to be bringing anything to the table (except
maybe preserving netlink as the only way to manipulate filter properties) and
keeping filters as separate objects. I can understand your position but for the
user it's just more and more objects to keep track of with no proper
ownership/cleanup semantics.

Though considering it for cls_bpf in particular, there are mainly four things
you would want to `tc filter change`:

* Integrated actions
  These are not allowed anyway; we force-enable direct action mode, and I don't
  plan on opening up actions for this if it gets accepted. Anything missing
  we'll try to make it work in eBPF (act_ct etc.)

* classid
  cls_bpf has a good alternative: manipulating __sk_buff::tc_classid from the
  program instead (a sketch follows below)

* skip_hw/skip_sw
  Not supported for now, but can be done using flags in BPF_LINK_UPDATE

* BPF program
  Already works using BPF_LINK_UPDATE
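
  E.g., with the existing libbpf wrapper (sketch; link_fd and new_prog_fd are
  assumed to already exist):

	/* atomically swap the BPF program behind the existing link */
	int err = bpf_link_update(link_fd, new_prog_fd, NULL);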

So bpf_link isn't really prohibitive in any way.
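
For the classid point above, a direct-action program can pick the class per
packet, e.g. (a minimal sketch; the section name and classid are arbitrary):

	#include <linux/bpf.h>
	#include <linux/pkt_cls.h>
	#include <linux/pkt_sched.h>
	#include <bpf/bpf_helpers.h>

	SEC("classifier")
	int cls_main(struct __sk_buff *skb)
	{
		/* per-packet equivalent of the classid attribute */
		skb->tc_classid = TC_H_MAKE(1 << 16, 10); /* class 1:10 */
		return TC_ACT_OK;
	}

	char LICENSE[] SEC("license") = "GPL";

In direct action mode (which this API always enables), the return code is the
verdict and the class is taken from skb->tc_classid.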

Doing it your way also complicates cleanup of the filter (in case we don't want
to leave it attached), because it is hard to know who closes the link_fd last.
Closing it earlier would break the link for existing users; not doing it would
leave around an unused object (which can accumulate if we use auto allocation
of filter priority). Counting existing links is racy.

This is better done in the kernel than worked around in userspace, as part of
attachment.

> This is actually simpler; you do not need to worry about whether the netdev
> is destroyed when you detach the XDP bpf_link anyway, and the same goes for
> cls_bpf filters. Likewise, TC filters don't need to worry about any
> associated bpf_links.
>
> Thanks.

--
Kartikeya

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-11  2:00             ` Cong Wang
  2021-06-13  2:53               ` Kumar Kartikeya Dwivedi
@ 2021-06-13  3:08               ` Kumar Kartikeya Dwivedi
  1 sibling, 0 replies; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-13  3:08 UTC (permalink / raw)
  To: Cong Wang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote:
> I see why you are creating TC filters now, because you are trying to
> force the lifetime of a bpf target to align with the bpf program itself.
> The deeper reason seems to be that a cls_bpf filter looks so small
> that it appears to you that it has nothing but a bpf_prog, right?
>

Just to clarify on this further, BPF program still has its own lifetime, link
takes a reference, and the filter still takes a reference on it (since it
assumes ownership, so it was easier that way).

When releasing the bpf_link if the prog pointer is set, we also detach the TC
filter (which releases its reference on the prog). The link on destruction
releases its reference. So the rest of refcount will depend on userspace
holding/pinning the fd or not.

--
Kartikeya

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-13  2:53               ` Kumar Kartikeya Dwivedi
@ 2021-06-13 20:27                 ` Jamal Hadi Salim
  2021-06-13 20:34                   ` Kumar Kartikeya Dwivedi
  2021-06-15  4:33                 ` Cong Wang
  1 sibling, 1 reply; 45+ messages in thread
From: Jamal Hadi Salim @ 2021-06-13 20:27 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Cong Wang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

Hi,

Sorry - but I haven't kept up with the discussion, so it is possible I may be
misunderstanding some things you mention in passing below (for example, that
you only support da mode, or that the classid can be handled differently, etc.).
XDP may not be the best model to follow, since some things that exist
in the tc architecture (for example, the ability to have multiple programs)
seem to have been plumbed in later (mostly because the original design intent
for XDP was to make it simple, let deployment follow, and add more
features over time).

Integrating tc into libbpf is a definite bonus that gives us a unified
programmatic interface and a single loading mechanism - but it wasn't clear
why we lose some features that tc provides; we have them today with the
current tc based loading scheme. I certainly use the non-da scheme because
over time it became clear that complex programs (not necessarily large in
code size) are a challenge with ebpf, and using existing tc actions is
valuable.
Also, multiple priorities are important for the same reason - you
can work around them in your single ebpf program, but sooner or later you
will run out of "tricks".

We do have this monthly tc meetup every second Monday of the month.
Unfortunately it is short notice since the next one is Monday 12pm
eastern time. Maybe you can show up, and a high-bandwidth discussion
(aka voice) would help?

cheers,
jamal

On 2021-06-12 10:53 p.m., Kumar Kartikeya Dwivedi wrote:
> On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote:
>> On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi
>> <memxor@gmail.com> wrote:
>>>
>>> So we're not really creating a qdisc here, we're just tying the filter (which in
>>> the current semantics exists only while attached) to the bpf_link. The filter is
>>> the attachment, so tying its lifetime to bpf_link makes sense. When you destroy
>>> the bpf_link, the filter goes away too, which means classification at that
>>> hook (parent/class) in the qdisc stops working. This is why creating the filter
>>> from the bpf_link made sense to me.
>>
>> I see why you are creating TC filters now, because you are trying to
>> force the lifetime of a bpf target to align with the bpf program itself.
>> The deeper reason seems to be that a cls_bpf filter looks so small
>> that it appears to you that it has nothing but a bpf_prog, right?
>>
> 
> Yes, pretty much.
> 
>> I offer two different views here:
>>
>> 1. If you view a TC filter as an instance, like a netdev/qdisc/action, they
>> are no different from this perspective. Maybe the fact that a TC filter
>> resides in a qdisc makes a slight difference here, but like I mentioned, it
>> actually makes sense to let TC filters be standalone; qdiscs just have to
>> bind with them, like how we bind TC filters with standalone TC actions.
> 
> You propose something different below IIUC, but I explained why I'm wary of
> these unbound filters. They seem to add a step to classifier setup for no real
> benefit to the user (who just gets one more object to keep track of and clean
> up along with the link when done).
> 
> I understand that the filter is very much an object of its own and why keeping
> them unbound makes sense, but for the user there is no real benefit to this
> scheme (some things, like the classid attribute, are contextual in that they
> make sense to be set based on which parent we're attaching to).
> 
>> These are all updated independently, despite some of them residing in
>> another. There should not be an exceptional TC filter which can not
>> be updated via `tc filter` command.
> 
> I see, but I'm mirroring what was done for XDP bpf_link.
> 
> Besides, flush still works; it's only that manipulating a filter managed by
> bpf_link is not allowed, which sounds reasonable to me. Given we're bringing in
> new ownership semantics here which didn't exist before with netlink, it doesn't
> make sense to allow netlink to simply invalidate a filter installed by some
> other program.
> 
> You wouldn't do something like that for a cooperating setup; we're just
> enforcing that using -EPERM (bpf_link is not allowed to replace netlink
> installed filters either, so it goes both ways).
> 
>>
>> 2. For cls_bpf specifically, it is also an instance, like all other TC filters.
>> You can update it in the same way: tc filter change [...] The only difference
>> is that a bpf program can attach to such an instance. So you can view the bpf
>> program attached to cls_bpf as a property of it. From this point of view,
>> this is no different from XDP and a netdev, where an XDP program
>> attached to a netdev is also a property of the netdev. A netdev can still
>> function without XDP. Same for cls_bpf, it can just be a nop that returns
>> TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus,
>> the lifetime of a bpf program can be separated from the target it
>> attaches to, like all other bpf_link targets. bpf_link is just a
>> supplement to `tc filter change cls_bpf`, not a replacement for it.
>>
> 
> So this is different now, as in the filter is attached as usual but bpf_link
> represents attachment of bpf prog to the filter itself, not the filter to the
> qdisc.
> 
> To me it seems apart from not having to create filter, this would pretty much be
> equivalent to where I hook the bpf_link right now?
> 
> TBF, this split doesn't really seem to be bringing anything to the table (except
> maybe preserving netlink as the only way to manipulate filter properties) and
> keeping filters as separate objects. I can understand your position but for the
> user it's just more and more objects to keep track of with no proper
> ownership/cleanup semantics.
> 
> Though considering it for cls_bpf in particular, there are mainly four things
> you would want to `tc filter change`:
> 
> * Integrated actions
>   These are not allowed anyway; we force-enable direct action mode, and I don't
>   plan on opening up actions for this if it gets accepted. Anything missing
>    we'll try to make it work in eBPF (act_ct etc.)
> 
> * classid
>   cls_bpf has a good alternative: manipulating __sk_buff::tc_classid from the
>   program instead
> 
> * skip_hw/skip_sw
>    Not supported for now, but can be done using flags in BPF_LINK_UPDATE
> 
> * BPF program
>    Already works using BPF_LINK_UPDATE
> 
> So bpf_link isn't really prohibitive in any way.
> 
> Doing it your way also complicates cleanup of the filter (in case we don't want
> to leave it attached), because it is hard to know who closes the link_fd last.
> Closing it earlier would break the link for existing users; not doing it would
> leave around an unused object (which can accumulate if we use auto allocation
> of filter priority). Counting existing links is racy.
> 
> This is better done in the kernel than worked around in userspace, as part of
> attachment.
> 
>> This is actually simpler; you do not need to worry about whether the netdev
>> is destroyed when you detach the XDP bpf_link anyway, and the same goes for
>> cls_bpf filters. Likewise, TC filters don't need to worry about any
>> associated bpf_links.
>>
>> Thanks.
> 
> --
> Kartikeya
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-13 20:27                 ` Jamal Hadi Salim
@ 2021-06-13 20:34                   ` Kumar Kartikeya Dwivedi
  2021-06-13 21:10                     ` Jamal Hadi Salim
  0 siblings, 1 reply; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-13 20:34 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Cong Wang, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote:
> Hi,
>
> Sorry - but I haven't kept up with the discussion, so it is possible I may be
> misunderstanding some things you mention in passing below (for example, that
> you only support da mode, or that the classid can be handled differently, etc.).
> XDP may not be the best model to follow, since some things that exist
> in the tc architecture (for example, the ability to have multiple programs)
> seem to have been plumbed in later (mostly because the original design intent
> for XDP was to make it simple, let deployment follow, and add more
> features over time).
>
> Integrating tc into libbpf is a definite bonus that gives us a unified
> programmatic interface and a single loading mechanism - but it wasn't clear
> why we lose some features that tc provides; we have them today with the
> current tc based loading scheme. I certainly use the non-da scheme because
> over time it became clear that complex programs (not necessarily large in
> code size) are a challenge with ebpf, and using existing tc actions is
> valuable.
> Also, multiple priorities are important for the same reason - you
> can work around them in your single ebpf program, but sooner or later you
> will run out of "tricks".
>

Right. I'm also just posting so that the use cases I care about are clear, and
why they are not being fulfilled in some other way. How to do it is of course up
to the TC and BPF maintainers, which is why I'm still waiting on feedback from
you, Cong, and others before posting the next version.

> We do have this monthly tc meetup every second Monday of the month.
> Unfortunately it is short notice since the next one is Monday 12pm
> eastern time. Maybe you can show up, and a high-bandwidth discussion
> (aka voice) would help?
>

That would be best, please let me know how to join tomorrow. There are a few
other things I was working on that I also want to discuss with this.

> cheers,
> jamal
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-13 20:34                   ` Kumar Kartikeya Dwivedi
@ 2021-06-13 21:10                     ` Jamal Hadi Salim
  2021-06-14 13:03                       ` Marcelo Ricardo Leitner
  2021-06-15 23:07                       ` Daniel Borkmann
  0 siblings, 2 replies; 45+ messages in thread
From: Jamal Hadi Salim @ 2021-06-13 21:10 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Cong Wang, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote:
> On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote:

> 
> Right. I'm also just posting so that the use cases I care about are clear, and
> why they are not being fulfilled in some other way. How to do it is of course up
> to the TC and BPF maintainers, which is why I'm still waiting on feedback from
> you, Cong, and others before posting the next version.
> 

I look at it from the perspective that if I can run something with the
existing tc loading mechanism, then I should be able to do the same
with the new (libbpf) scheme.

>> We do have this monthly tc meetup every second Monday of the month.
>> Unfortunately it is short notice since the next one is Monday 12pm
>> eastern time. Maybe you can show up, and a high-bandwidth discussion
>> (aka voice) would help?
>>
> 
> That would be best, please let me know how to join tomorrow. There are a few
> other things I was working on that I also want to discuss with this.
> 

That would be great - thanks for your understanding.
+Cc Marcelo (who is the keeper of the meetup)
in case the link changes...

cheers,
jamal

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-13 21:10                     ` Jamal Hadi Salim
@ 2021-06-14 13:03                       ` Marcelo Ricardo Leitner
  2021-06-15 23:07                       ` Daniel Borkmann
  1 sibling, 0 replies; 45+ messages in thread
From: Marcelo Ricardo Leitner @ 2021-06-14 13:03 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Kumar Kartikeya Dwivedi, Cong Wang, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Sun, Jun 13, 2021 at 05:10:14PM -0400, Jamal Hadi Salim wrote:
> On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote:
> > On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote:
> > > We do have this monthly tc meetup every second Monday of the month.
> > > Unfortunately it is short notice since the next one is Monday 12pm
> > > eastern time. Maybe you can show up, and a high-bandwidth discussion
> > > (aka voice) would help?
> > >
> >
> > That would be best, please let me know how to join tomorrow. There are a few
> > other things I was working on that I also want to discuss with this.
> >
>
> That would be great - thanks for your understanding.
> +Cc Marcelo (who is the keeper of the meetup)
> in case the link may change..

We have 2 URLs for today: the official one [1] and a test one [2].
We will be testing a new video conferencing system today, and depending
on how it goes, we will be on one or the other. I'll try to always be
present in the official one [1] to point people towards the testing
one [2] in case we're there.

Also, we have an agenda doc [3]. I can't openly share it with the public,
but if you send a request for access, I'll grant it.

1.https://meet.kernel.social/tc-meetup
2.https://www.airmeet.com/e/2494c770-cc8c-11eb-830b-e787c099d9c3
3.https://docs.google.com/document/d/1uUm_o7lR9jCAH0bqZ1dyscXZbIF4GN3mh1FwwIuePcM/edit#

  Marcelo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-13  2:53               ` Kumar Kartikeya Dwivedi
  2021-06-13 20:27                 ` Jamal Hadi Salim
@ 2021-06-15  4:33                 ` Cong Wang
  2021-06-15 11:54                   ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 45+ messages in thread
From: Cong Wang @ 2021-06-15  4:33 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers

On Sat, Jun 12, 2021 at 7:54 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote:
> > On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > So we're not really creating a qdisc here, we're just tying the filter (which in
> > > the current semantics exists only while attached) to the bpf_link. The filter is
> > > the attachment, so tying its lifetime to bpf_link makes sense. When you destroy
> > > the bpf_link, the filter goes away too, which means classification at that
> > > hook (parent/class) in the qdisc stops working. This is why creating the filter
> > > from the bpf_link made sense to me.
> >
> > I see why you are creating TC filters now, because you are trying to
> > force the lifetime of a bpf target to align with the bpf program itself.
> > The deeper reason seems to be that a cls_bpf filter looks so small
> > that it appears to you that it has nothing but a bpf_prog, right?
> >
>
> Yes, pretty much.

OK. Just in case of any misunderstanding: a cls_bpf filter has more than
just a bpf prog; it inherits all the other generic attributes too, e.g. TC
proto/prio, from the TC infra. If you can agree on this, then it is no
different from netdev/cgroup/netns bpf_links.
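
For reference, the object behind a cls_bpf filter looks roughly like this
(abridged from net/sched/cls_bpf.c; the exact field list varies by kernel
version):

	struct cls_bpf_prog {
		struct bpf_prog *filter;	/* the attached bpf prog */
		struct list_head link;
		struct tcf_result res;		/* classid */
		bool exts_integrated;		/* direct-action mode */
		struct tcf_exts exts;		/* TC actions */
		u32 handle;
		const char *bpf_name;
		struct tcf_proto *tp;		/* generic part: proto/prio etc. */
		/* ... offload and rcu bookkeeping ... */
	};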

>
> > I offer two different views here:
> >
> > 1. If you view a TC filter as an instance, like a netdev/qdisc/action, they
> > are no different from this perspective. Maybe the fact that a TC filter
> > resides in a qdisc makes a slight difference here, but like I mentioned, it
> > actually makes sense to let TC filters be standalone; qdiscs just have to
> > bind with them, like how we bind TC filters with standalone TC actions.
>
> You propose something different below IIUC, but I explained why I'm wary of
> these unbound filters. They seem to add a step to classifier setup for no real
> benefit to the user (who just gets one more object to keep track of and clean
> up along with the link when done).

I am not even sure if unbound filters help your case at all; making them
unbound merely changes their residence, not their ownership. You are trying
to pass the ownership from TC to bpf_link, which is what I am against.

>
> I understand that the filter is very much an object of its own and why keeping
> them unbound makes sense, but for the user there is no real benefit to this
> scheme (some things, like the classid attribute, are contextual in that they
> make sense to be set based on which parent we're attaching to).
>
> > These are all updated independently, despite some of them residing in
> > another. There should not be an exceptional TC filter which can not
> > be updated via `tc filter` command.
>
> I see, but I'm mirroring what was done for XDP bpf_link.

Really? Does XDP bpf_link create a netdev or remove it? I see neither. It
merely looks up the netdev by attr->link_create.target_ifindex in
bpf_xdp_link_attach(). Where does the "mirroring" come from?

>
> Besides, flush still works; it's only that manipulating a filter managed by
> bpf_link is not allowed, which sounds reasonable to me. Given we're bringing in
> new ownership semantics here which didn't exist before with netlink, it doesn't
> make sense to allow netlink to simply invalidate a filter installed by some
> other program.
>
> You wouldn't do something like that for a cooperating setup; we're just
> enforcing that using -EPERM (bpf_link is not allowed to replace netlink
> installed filters either, so it goes both ways).

I think our argument was never about who manages it; it is about who owns
it. By creating a TC filter from bpf_link, managed exclusively by bpf_link,
the ownership pretty much goes to bpf_link.

>
> >
> > 2. For cls_bpf specifically, it is also an instance, like all other TC filters.
> > You can update it in the same way: tc filter change [...] The only difference
> > is that a bpf program can attach to such an instance. So you can view the bpf
> > program attached to cls_bpf as a property of it. From this point of view,
> > this is no different from XDP and a netdev, where an XDP program
> > attached to a netdev is also a property of the netdev. A netdev can still
> > function without XDP. Same for cls_bpf, it can just be a nop that returns
> > TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus,
> > the lifetime of a bpf program can be separated from the target it
> > attaches to, like all other bpf_link targets. bpf_link is just a
> > supplement to `tc filter change cls_bpf`, not a replacement for it.
> >
>
> So this is different now, as in the filter is attached as usual but bpf_link
> represents attachment of bpf prog to the filter itself, not the filter to the
> qdisc.

Yes, I think this is the right view of cls_bpf. It contains more than just
a bpf prog, its generic part (struct tcf_proto) contains other attributes
of this filter inherited from TC infra. And of course, TC actions can be
inherited too (for non-DA).

>
> To me it seems apart from not having to create filter, this would pretty much be
> equivalent to where I hook the bpf_link right now?
>
> TBF, this split doesn't really seem to be bringing anything to the table (except
> maybe preserving netlink as the only way to manipulate filter properties) and
> keeping filters as separate objects. I can understand your position but for the
> user it's just more and more objects to keep track of with no proper
> ownership/cleanup semantics.
>
> Though considering it for cls_bpf in particular, there are mainly four things
> you would want to `tc filter change`:
>
> * Integrated actions
>   These are not allowed anyway; we force-enable direct action mode, and I don't
>   plan on opening up actions for this if it gets accepted. Anything missing
>   we'll try to make it work in eBPF (act_ct etc.)
>
> * classid
>   cls_bpf has a good alternative: manipulating __sk_buff::tc_classid from the
>   program instead
>
> * skip_hw/skip_sw
>   Not supported for now, but can be done using flags in BPF_LINK_UPDATE
>
> * BPF program
>   Already works using BPF_LINK_UPDATE

Our argument was never about which pieces of cls_bpf should be updated by TC
or bpf_link. It is always about ownership. TC should own TC filters; even the
name tells so. You are trying to create TC filters with bpf_link which are not
even owned by TC. And more importantly, you are not doing the same for other
bpf_link targets, singling out TC filters for no valid reason.

>
> So bpf_link isn't really prohibitive in any way.
>
> Doing it your way also complicates cleanup of the filter (in case we don't want
> to leave it attached), because it is hard to know who closes the link_fd last.
> Closing it earlier would break the link for existing users; not doing it would
> leave around an unused object (which can accumulate if we use auto allocation
> of filter priority). Counting existing links is racy.
>
> This is better done in the kernel than worked around in userspace, as part of
> attachment.

I am not proposing anything for your case; I am only explaining why creating
TC filters exclusively via bpf_link does not make sense to me.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-15  4:33                 ` Cong Wang
@ 2021-06-15 11:54                   ` Toke Høiland-Jørgensen
  2021-06-15 23:44                     ` Daniel Borkmann
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-06-15 11:54 UTC (permalink / raw)
  To: Cong Wang, Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Jamal Hadi Salim, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Linux Kernel Network Developers

Cong Wang <xiyou.wangcong@gmail.com> writes:

>> > I offer two different views here:
>> >
>> > 1. If you view a TC filter as an instance as a netdev/qdisc/action, they
>> > are no different from this perspective. Maybe the fact that a TC filter
>> > resides in a qdisc makes a slight difference here, but like I mentioned, it
>> > actually makes sense to let TC filters be standalone, qdisc's just have to
>> > bind with them, like how we bind TC filters with standalone TC actions.
>>
>> You propose something different below IIUC, but I explained why I'm wary of
>> these unbound filters. They seem to add a step to classifier setup for no real
>> benefit to the user (except keeping track of one more object and cleaning it
>> up with the link when done).
>
> I am not even sure if unbound filters help your case at all; making them
> unbound merely changes their residence, not their ownership. You are trying
> to pass the ownership from TC to bpf_link, which is what I am against.

So what do you propose instead?

bpf_link is solving a specific problem: ensuring automatic cleanup of
kernel resources held by a userspace application with a BPF component.
Not all applications work this way, but for the ones that do it's very
useful. But if the TC filter stays around after bpf_link detaches, that
kinda defeats the point of the automatic cleanup.

So I don't really see any way around transferring ownership somehow.
Unless you have some other idea that I'm missing?

-Toke


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-13 21:10                     ` Jamal Hadi Salim
  2021-06-14 13:03                       ` Marcelo Ricardo Leitner
@ 2021-06-15 23:07                       ` Daniel Borkmann
  2021-06-16 14:40                         ` Jamal Hadi Salim
  1 sibling, 1 reply; 45+ messages in thread
From: Daniel Borkmann @ 2021-06-15 23:07 UTC (permalink / raw)
  To: Jamal Hadi Salim, Kumar Kartikeya Dwivedi
  Cc: Cong Wang, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 6/13/21 11:10 PM, Jamal Hadi Salim wrote:
> On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote:
>> On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote:
[...]
>> Right, also I'm just posting so that the use cases I care about are clear, and
>> why they are not being fulfilled in some other way. How to do it is of course up
>> to TC and BPF maintainers, which is why I'm still waiting on feedback from you,
>> Cong and others before posting the next version.
> 
> I look at it from the perspective that if i can run something with
> existing tc loading mechanism then i should be able to do the same
> with the new (libbpf) scheme.

The intention is not to provide a full-blown tc library (that could be the subject
of a libtc or such), but rather to only have libbpf abstract the tc-related API that
is most /relevant/ for BPF program development and /efficient/ in terms of execution
in the fast path, while at the same time providing a good user experience from the
API itself.

That is, simple to use and straightforward to explain to folks with otherwise zero
experience of tc. The current implementation does all that, and from experience with
large BPF programs managed via cls_bpf, that is all that is actually needed from the
tc layer perspective. The ability to have multiple programs (incl. priorities) is in
the existing libbpf API as well.

Best,
Daniel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-15 11:54                   ` Toke Høiland-Jørgensen
@ 2021-06-15 23:44                     ` Daniel Borkmann
  2021-06-16 12:03                       ` Toke Høiland-Jørgensen
  2021-06-16 15:33                       ` Jamal Hadi Salim
  0 siblings, 2 replies; 45+ messages in thread
From: Daniel Borkmann @ 2021-06-15 23:44 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Cong Wang, Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Jamal Hadi Salim, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Linux Kernel Network Developers

On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote:
> Cong Wang <xiyou.wangcong@gmail.com> writes:
[...]
>>>> I offer two different views here:
>>>>
>>>> 1. If you view a TC filter as an instance like a netdev/qdisc/action, they
>>>> are no different from this perspective. Maybe the fact that a TC filter
>>>> resides in a qdisc makes a slight difference here, but like I mentioned, it
>>>> actually makes sense to let TC filters be standalone; qdiscs just have to
>>>> bind with them, like how we bind TC filters with standalone TC actions.
>>>
>>> You propose something different below IIUC, but I explained why I'm wary of
>>> these unbound filters. They seem to add a step to classifier setup for no real
>>> benefit to the user (except keeping track of one more object and cleaning it
>>> up with the link when done).
>>
>> I am not even sure if unbound filters help your case at all, making
>> them unbound merely changes their residence, not ownership.
>> You are trying to pass the ownership from TC to bpf_link, which
>> is what I am against.
> 
> So what do you propose instead?
> 
> bpf_link is solving a specific problem: ensuring automatic cleanup of
> kernel resources held by a userspace application with a BPF component.
> Not all applications work this way, but for the ones that do it's very
> useful. But if the TC filter stays around after bpf_link detaches, that
> kinda defeats the point of the automatic cleanup.
> 
> So I don't really see any way around transferring ownership somehow.
> Unless you have some other idea that I'm missing?

Just to keep on brainstorming here, I wanted to bring back Alexei's earlier quote:

   > I think it makes sense to create these objects as part of establishing bpf_link.
   > ingress qdisc is a fake qdisc anyway.
   > If we could go back in time I would argue that its existence doesn't
   > need to be shown in iproute2. It's an object that serves no purpose
   > other than attaching filters to it. It doesn't do any queuing unlike
   > real qdiscs.
   > It's an artifact of old choices. Old doesn't mean good.
   > The kernel is full of such quirks and oddities. New api-s shouldn't
   > blindly follow them.
   > tc qdisc add dev eth0 clsact
   > is a useless command with nop effect.

The whole bpf_link in this context feels somewhat awkward because the two are
different worlds: one accessible via netlink with its own lifetime etc, the other
one tied to fds and the bpf syscall. Back in the day we did the cls_bpf integration
since it felt the most natural at that time, and it had support for both the ingress
and egress side, along with the direct action support which was added later to have
a proper fast path for BPF. One thing that I personally never liked is that later
on tc sadly became a complex, quirky dumping ground for all the nic hw offloads (I
guess mainly driven from the ovs side) for which I have a hard time convincing myself
that this is used at scale in production. Stuff like af699626ee26, just to pick one,
which annoyingly also adds to the fast path given distros will just compile in most
of these things (like NET_TC_SKB_EXT)... what if such a bpf_link object is not tied
at all to cls_bpf or the clsact qdisc, and instead would implement tcf_classify_
{egress,ingress}() as-is in that sense, similar to the bpf_lsm hooks? Meaning,
you could run existing tc BPF progs without any modifications and without extra
overhead (no need to walk the clsact qdisc and then again into the cls_bpf
one). These tc BPF programs would be managed only from bpf() via the tc bpf_link api,
and are otherwise invisible to the classic tc command (though they could be dumped
there as well for the sake of visibility, though bpftool would be fitting too). However,
if there is something attached from the classic tc side, it would also go into the old
style tcf_classify_ingress() implementation and walk whatever is there so that nothing
existing breaks (same as when no bpf_link is present, so that there is no extra
overhead). This would also allow for a migration path of multi prog from cls_bpf to
this new implementation. Details still tbd, but I would much rather like such an
approach than the currently discussed one, and it would also fit better given we don't
run into this current mismatch of both worlds.
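
If it helps the brainstorming, a rough pseudo-code sketch of that idea
(every name below is invented, details still tbd as said):

/* Pseudo-code sketch, not actual kernel code: progs attached via a tc
 * bpf_link hang off the netdev and run at the tcf_classify_ingress()
 * level, before (and independent of) any clsact/cls_bpf walk. Caller
 * holds the RCU read lock, as in sch_handle_ingress() today. */
static int run_tc_link_progs(struct sk_buff *skb)
{
	struct tc_link_prog *p;		/* invented per-netdev list entry */
	int ret = TC_ACT_UNSPEC;

	list_for_each_entry_rcu(p, &skb->dev->tc_ingress_links, node) {
		ret = bpf_prog_run(p->prog, skb);
		if (ret != TC_ACT_UNSPEC)
			return ret;	/* terminal verdict, e.g. SHOT/REDIRECT */
	}
	/* TC_ACT_UNSPEC (or an empty list) falls through to the classic
	 * tcf_classify_ingress() walk, so netlink-attached filters and the
	 * no-bpf_link case keep working without extra overhead. */
	return ret;
}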

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-15 23:44                     ` Daniel Borkmann
@ 2021-06-16 12:03                       ` Toke Høiland-Jørgensen
  2021-06-16 15:33                       ` Jamal Hadi Salim
  1 sibling, 0 replies; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-06-16 12:03 UTC (permalink / raw)
  To: Daniel Borkmann, Cong Wang, Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Jamal Hadi Salim, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Linux Kernel Network Developers

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote:
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
> [...]
>>>>> I offer two different views here:
>>>>>
>>>>> 1. If you view a TC filter as an instance like a netdev/qdisc/action, they
>>>>> are no different from this perspective. Maybe the fact that a TC filter
>>>>> resides in a qdisc makes a slight difference here, but like I mentioned, it
>>>>> actually makes sense to let TC filters be standalone; qdiscs just have to
>>>>> bind with them, like how we bind TC filters with standalone TC actions.
>>>>
>>>> You propose something different below IIUC, but I explained why I'm wary of
>>>> these unbound filters. They seem to add a step to classifier setup for no real
>>>> benefit to the user (except keeping track of one more object and cleaning it
>>>> up with the link when done).
>>>
>>> I am not even sure if unbound filters help your case at all, making
>>> them unbound merely changes their residence, not ownership.
>>> You are trying to pass the ownership from TC to bpf_link, which
>>> is what I am against.
>> 
>> So what do you propose instead?
>> 
>> bpf_link is solving a specific problem: ensuring automatic cleanup of
>> kernel resources held by a userspace application with a BPF component.
>> Not all applications work this way, but for the ones that do it's very
>> useful. But if the TC filter stays around after bpf_link detaches, that
>> kinda defeats the point of the automatic cleanup.
>> 
>> So I don't really see any way around transferring ownership somehow.
>> Unless you have some other idea that I'm missing?
>
> Just to keep on brainstorming here, I wanted to bring back Alexei's earlier quote:
>
>    > I think it makes sense to create these objects as part of establishing bpf_link.
>    > ingress qdisc is a fake qdisc anyway.
>    > If we could go back in time I would argue that its existence doesn't
>    > need to be shown in iproute2. It's an object that serves no purpose
>    > other than attaching filters to it. It doesn't do any queuing unlike
>    > real qdiscs.
>    > It's an artifact of old choices. Old doesn't mean good.
>    > The kernel is full of such quirks and oddities. New api-s shouldn't
>    > blindly follow them.
>    > tc qdisc add dev eth0 clsact
>    > is a useless command with nop effect.
>
> The whole bpf_link in this context feels somewhat awkward because both are two
> different worlds, one accessible via netlink with its own lifetime etc, the other
> one tied to fds and bpf syscall. Back in the days we did the cls_bpf integration
> since it felt the most natural at that time and it had support for both the ingress
> and egress side, along with the direct action support which was added later to have
> a proper fast path for BPF. One thing that I personally never liked is that later
> on tc sadly became a complex, quirky dumping ground for all the nic hw offloads (I
> guess mainly driven from ovs side) for which I have a hard time convincing myself
> that this is used at scale in production. Stuff like af699626ee26 just to pick one
> which annoyingly also adds to the fast path given distros will just compile in most
> of these things (like NET_TC_SKB_EXT)... what if such bpf_link object is not tied
> at all to cls_bpf or cls_act qdisc, and instead would implement the tcf_classify_
> {egress,ingress}() as-is in that sense, similar like the bpf_lsm hooks. Meaning,
> you could run existing tc BPF prog without any modifications and without additional
> extra overhead (no need to walk the clsact qdisc and then again into the cls_bpf
> one). These tc BPF programs would be managed only from bpf() via tc bpf_link api,
> and are otherwise not bothering to classic tc command (though they could be dumped
> there as well for sake of visibility, though bpftool would be fitting too). However,
> if there is something attached from classic tc side, it would also go into the old
> style tcf_classify_ingress() implementation and walk whatever is there so that nothing
> existing breaks (same as when no bpf_link would be present so that there is no extra
> overhead). This would also allow for a migration path of multi prog from cls_bpf to
> this new implementation. Details still tbd, but I would much rather like such an
> approach than the current discussed one, and it would also fit better given we don't
> run into this current mismatch of both worlds.

So this would entail adding a separate list of BPF programs and running
through those at the start of sch_handle_{egress,ingress}(), I suppose?
And that list of filters would only contain bpf_link-attached BPF
programs, sorted by priority like TC filters? And return codes of
TC_ACT_OK or TC_ACT_RECLASSIFY would continue through to
tcf_classify_{egress,ingress}()?

I suppose that could work; we could even stick the second filter list in
struct mini_Qdisc and have clsact and bpf_link cooperate on managing
that, no? That way it would also be easy to dump the BPF filters via
netlink: I do think that will be the least surprising thing to do (so
people can at least see there's something there with existing tools).

The overhead would be a single extra branch when only one of clsact or
bpf_link is in use (to check if the other list of filters is set);
that's probably acceptable at this level...
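
Something along these lines, I guess (miniq->link_progs and
run_link_progs() are made-up names, just to illustrate the placement):

/* Sketch of the dispatch inside sch_handle_{egress,ingress}(). */
static inline int classify_with_links(struct sk_buff *skb,
				      struct mini_Qdisc *miniq,
				      struct tcf_result *res)
{
	int ret;

	if (miniq->link_progs) {	/* bpf_link-attached progs, by prio */
		ret = run_link_progs(miniq->link_progs, skb);
		if (ret != TC_ACT_OK && ret != TC_ACT_RECLASSIFY)
			return ret;	/* terminal verdict, skip tc filters */
	}
	/* The check above is the single extra branch paid when only one of
	 * clsact or bpf_link is in use. */
	return tcf_classify(skb, miniq->filter_list, res, false);
}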

-Toke


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-15 23:07                       ` Daniel Borkmann
@ 2021-06-16 14:40                         ` Jamal Hadi Salim
  2021-06-16 15:32                           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 45+ messages in thread
From: Jamal Hadi Salim @ 2021-06-16 14:40 UTC (permalink / raw)
  To: Daniel Borkmann, Kumar Kartikeya Dwivedi
  Cc: Cong Wang, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 2021-06-15 7:07 p.m., Daniel Borkmann wrote:
> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote:

[..]

>>
>> I look at it from the perspective that if i can run something with
>> existing tc loading mechanism then i should be able to do the same
>> with the new (libbpf) scheme.
> 
> The intention is not to provide a full-blown tc library (that could be 
> subject to a
> libtc or such), but rather to only have libbpf abstract the tc related 
> API that is
> most /relevant/ for BPF program development and /efficient/ in terms of 
> execution in
> fast-path while at the same time providing a good user experience from 
> the API itself.
> 
> That is, simple to use and straight forward to explain to folks with 
> otherwise zero
> experience of tc. The current implementation does all that, and from 
> experience with
> large BPF programs managed via cls_bpf that is all that is actually 
> needed from tc
> layer perspective. The ability to have multi programs (incl. priorities) 
> is in the
> existing libbpf API as well.
> 

Which is a fair statement, but if you take away things that work fine
with the current iproute2 loading, I have no motivation to migrate at all.
It's like the saying about "throwing out the baby with the bathwater".
I want my baby.

In particular, here's a list from Kartikeya's implementation:

1) Direct action mode only
2) Protocol ETH_P_ALL only
3) Only at chain 0
4) No block support

I think he said priority is supported, but it was also originally on that
list. When we discussed this at the meetup, it didn't seem these cost anything
in terms of code complexity or usability of the API.

1) We use non-DA mode, so I can't live without that (and frankly ebpf
has challenges with adding complex code blocks).

2) We also use different protocols when I need to
(yes, you can do the filtering in the bpf code - but why impose that
if the cost of adding it is low? And of course it is cheaper to do
the check outside of ebpf).

3) We use chains other than chain 0.

4) So far we don't use block support, but my recent experience
in a deployment shows that we need to group netdevices more often than
I thought necessary. So if I could express one map shared by
multiple netdevices, it should cut down the user space complexity.

cheers,
jamal

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-16 14:40                         ` Jamal Hadi Salim
@ 2021-06-16 15:32                           ` Kumar Kartikeya Dwivedi
  2021-06-16 16:00                             ` Daniel Borkmann
  0 siblings, 1 reply; 45+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-06-16 15:32 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Daniel Borkmann, Cong Wang, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote:
> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote:
> > On 6/13/21 11:10 PM, Jamal Hadi Salim wrote:
>
> [..]
>
> > >
> > > I look at it from the perspective that if i can run something with
> > > existing tc loading mechanism then i should be able to do the same
> > > with the new (libbpf) scheme.
> >
> > The intention is not to provide a full-blown tc library (that could be
> > subject to a
> > libtc or such), but rather to only have libbpf abstract the tc related
> > API that is
> > most /relevant/ for BPF program development and /efficient/ in terms of
> > execution in
> > fast-path while at the same time providing a good user experience from
> > the API itself.
> >
> > That is, simple to use and straight forward to explain to folks with
> > otherwise zero
> > experience of tc. The current implementation does all that, and from
> > experience with
> > large BPF programs managed via cls_bpf that is all that is actually
> > needed from tc
> > layer perspective. The ability to have multi programs (incl. priorities)
> > is in the
> > existing libbpf API as well.
> >
>
> Which is a fair statement, but if you take away things that work fine
> with current iproute2 loading I have no motivation to migrate at all.
> Its like that saying of "throwing out the baby with the bathwater".
> I want my baby.
>
> In particular, here's a list from Kartikeya's implementation:
>
> 1) Direct action mode only
> 2) Protocol ETH_P_ALL only
> 3) Only at chain 0
> 4) No block support
>

Block is supported; you just need to set TCM_IFINDEX_MAGIC_BLOCK as the ifindex
and the parent as the block index. There isn't anything more to it than that from
the libbpf side (just specify the BPF_TC_CUSTOM enum).

What I meant was that hook_create doesn't support specifying the ingress/egress
block when creating clsact, but that typically isn't a problem because qdiscs
for shared blocks would be set up together prior to the attachment anyway.
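
I.e., roughly like this with the current netlink-based libbpf API (a
sketch with error handling elided; it assumes the qdiscs owning the
shared block were already created):

#include <linux/rtnetlink.h>	/* TCM_IFINDEX_MAGIC_BLOCK */
#include <bpf/libbpf.h>

/* Attach a tc BPF prog to a shared block; block_index is the block
 * number the qdiscs were created with (e.g. ingress_block 1). */
static int attach_to_block(struct bpf_program *prog, __u32 block_index)
{
	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
			    .ifindex = TCM_IFINDEX_MAGIC_BLOCK,
			    .attach_point = BPF_TC_CUSTOM,
			    .parent = block_index);
	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts,
			    .prog_fd = bpf_program__fd(prog));

	return bpf_tc_attach(&hook, &opts);
}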

> I think he said priority is supported but was also originally on that
> list.
> When we discussed at the meetup it didnt seem these cost anything
> in terms of code complexity or usability of the API.
>
> 1) We use non-DA mode, so i cant live without that (and frankly ebpf
> has challenges adding complex code blocks).
>
> 2) We also use different protocols when i need to
> (yes, you can do the filtering in the bpf code - but why impose that
> if the cost of adding it is simple? and of course it is cheaper to do
> the check outside of ebpf)
> 3) We use chains outside of zero
>
> 4) So far we dont use block support but certainly my recent experiences
> in a deployment shows that we need to group netdevices more often than
> i thought was necessary. So if i could express one map shared by
> multiple netdevices it should cut down the user space complexity.
>
> cheers,
> jamal

--
Kartikeya

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-15 23:44                     ` Daniel Borkmann
  2021-06-16 12:03                       ` Toke Høiland-Jørgensen
@ 2021-06-16 15:33                       ` Jamal Hadi Salim
  1 sibling, 0 replies; 45+ messages in thread
From: Jamal Hadi Salim @ 2021-06-16 15:33 UTC (permalink / raw)
  To: Daniel Borkmann, Toke Høiland-Jørgensen, Cong Wang,
	Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh, Vlad Buslov,
	Jiri Pirko, David S. Miller, Jakub Kicinski, Joe Stringer,
	Quentin Monnet, Jesper Dangaard Brouer,
	Linux Kernel Network Developers

On 2021-06-15 7:44 p.m., Daniel Borkmann wrote:
> On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote:
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
> [...]


> 
> Just to keep on brainstorming here, I wanted to bring back Alexei's 
> earlier quote:
> 
>    > I think it makes sense to create these objects as part of 
> establishing bpf_link.
>    > ingress qdisc is a fake qdisc anyway.
>    > If we could go back in time I would argue that its existence doesn't
>    > need to be shown in iproute2. It's an object that serves no purpose
>    > other than attaching filters to it. It doesn't do any queuing unlike
>    > real qdiscs.
>    > It's an artifact of old choices. Old doesn't mean good.
>    > The kernel is full of such quirks and oddities. New api-s shouldn't
>    > blindly follow them.
>    > tc qdisc add dev eth0 clsact
>    > is a useless command with nop effect.
> 

I am not sure what Alexei's statement about old vs good was getting at.
You have to have hooks/locations to stick things. Does it matter what
you call that hook?

> The whole bpf_link in this context feels somewhat awkward because both 
> are two
> different worlds, one accessible via netlink with its own lifetime etc, 
> the other
> one tied to fds and bpf syscall. Back in the days we did the cls_bpf 
> integration
> since it felt the most natural at that time and it had support for both 
> the ingress
> and egress side, along with the direct action support which was added 
> later to have
> a proper fast path for BPF. One thing that I personally never liked is 
> that later
> on tc sadly became a complex, quirky dumping ground for all the nic hw 
> offloads (I
> guess mainly driven from ovs side) for which I have a hard time 
> convincing myself
> that this is used at scale in production. Stuff like af699626ee26 just 
> to pick one
> which annoyingly also adds to the fast path given distros will just 
> compile in most
> of these things (like NET_TC_SKB_EXT)... what if such bpf_link object is 
> not tied
> at all to cls_bpf or cls_act qdisc, and instead would implement the 
> tcf_classify_
> {egress,ingress}() as-is in that sense, similar like the bpf_lsm hooks. 


The choice is between a generic architecture and appliance-style
only-what-you-need code (via ebpf).
I don't disagree that at times patches go in at the expense of kernel
datapath complexity or cost. Unfortunately, sometimes this is because
there's not sufficient review time - but that's a different topic.
We try to impose a rule which states that any hardware offload has
to have a kernel/software twin. Often that helps contain things.


> Meaning,
> you could run existing tc BPF prog without any modifications and without 
> additional
> extra overhead (no need to walk the clsact qdisc and then again into the 
> cls_bpf
> one). These tc BPF programs would be managed only from bpf() via tc 
> bpf_link api,
> and are otherwise not bothering to classic tc command (though they could 
> be dumped
> there as well for sake of visibility, though bpftool would be fitting 
> too). However,
> if there is something attached from classic tc side, it would also go 
> into the old
> style tcf_classify_ingress() implementation and walk whatever is there 
> so that nothing
> existing breaks (same as when no bpf_link would be present so that there 
> is no extra
> overhead). This would also allow for a migration path of multi prog from 
> cls_bpf to
> this new implementation. Details still tbd, but I would much rather like 
> such an
> approach than the current discussed one, and it would also fit better 
> given we don't
> run into this current mismatch of both worlds.
>

The danger is in totally divorcing from tc when you have special
cases just for ebpf/tc, i.e. this is no different from the
hardware offloads making you unhappy. The ability to use existing
tools (user space tc in this case) to inter-work with both is
very useful.


From the discussion on the control aspect with Kartikeya, I
understood that we need some "transient state" which needs to get
created and stored somewhere before being applied to tc (for example,
creating the filters and all necessary artifacts first, then calling
internally into the cls api).
It seems to me that the "transient state" belongs to bpf. And I understood
from Kartikeya that this was his design intent as well (which seems sane to me).


cheers,
jamal

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-16 15:32                           ` Kumar Kartikeya Dwivedi
@ 2021-06-16 16:00                             ` Daniel Borkmann
  2021-06-18 11:40                               ` Jamal Hadi Salim
  0 siblings, 1 reply; 45+ messages in thread
From: Daniel Borkmann @ 2021-06-16 16:00 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Jamal Hadi Salim
  Cc: Cong Wang, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote:
>> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote:
>>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote:
>>
>> [..]
>>
>>>> I look at it from the perspective that if i can run something with
>>>> existing tc loading mechanism then i should be able to do the same
>>>> with the new (libbpf) scheme.
>>>
>>> The intention is not to provide a full-blown tc library (that could be
>>> subject to a
>>> libtc or such), but rather to only have libbpf abstract the tc related
>>> API that is
>>> most /relevant/ for BPF program development and /efficient/ in terms of
>>> execution in
>>> fast-path while at the same time providing a good user experience from
>>> the API itself.
>>>
>>> That is, simple to use and straight forward to explain to folks with
>>> otherwise zero
>>> experience of tc. The current implementation does all that, and from
>>> experience with
>>> large BPF programs managed via cls_bpf that is all that is actually
>>> needed from tc
>>> layer perspective. The ability to have multi programs (incl. priorities)
>>> is in the
>>> existing libbpf API as well.
>>
>> Which is a fair statement, but if you take away things that work fine
>> with current iproute2 loading I have no motivation to migrate at all.
>> Its like that saying of "throwing out the baby with the bathwater".
>> I want my baby.
>>
>> In particular, here's a list from Kartikeya's implementation:
>>
>> 1) Direct action mode only

(More below.)

>> 2) Protocol ETH_P_ALL only

The issue I see with this one is that it's not very valuable or useful from a BPF
point of view. Meaning, this kind of check can be, and typically is, implemented in
the BPF program anyway. For example, with direct packet access you are initially
parsing the eth header anyway (and from there have logic for the various eth
protos).

That protocol option is maybe more useful when you have classic tc with a cls+act
style pipeline where you want a quick skip of classifiers to avoid reparsing the
packet. Given you can do everything inside the BPF program already, it adds more
confusion than value for a simple libbpf [tc/BPF] API.
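
I.e. the usual prologue such a program has anyway; a minimal sketch with
placeholder handle_v4()/handle_v6() bodies:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

/* Placeholder handlers, only here to make the sketch self-contained. */
static __always_inline int handle_v4(struct __sk_buff *skb) { return TC_ACT_OK; }
static __always_inline int handle_v6(struct __sk_buff *skb) { return TC_ACT_OK; }

SEC("classifier")
int tc_main(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;

	if ((void *)(eth + 1) > data_end)
		return TC_ACT_OK;

	/* Proto dispatch in the program itself; this is what makes a
	 * netlink-level protocol filter mostly redundant. */
	switch (bpf_ntohs(eth->h_proto)) {
	case ETH_P_IP:
		return handle_v4(skb);
	case ETH_P_IPV6:
		return handle_v6(skb);
	default:
		return TC_ACT_OK;
	}
}

char LICENSE[] SEC("license") = "GPL";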

>> 3) Only at chain 0
>> 4) No block support
> 
> Block is supported, you just need to set TCM_IFINDEX_MAGIC_BLOCK as ifindex and
> parent as the block index. There isn't anything more to it than that from libbpf
> side (just specify BPF_TC_CUSTOM enum).
> 
> What I meant was that hook_create doesn't support specifying the ingress/egress
> block when creating clsact, but that typically isn't a problem because qdiscs
> for shared blocks would be set up together prior to the attachment anyway.
> 
>> I think he said priority is supported but was also originally on that
>> list.
>> When we discussed at the meetup it didnt seem these cost anything
>> in terms of code complexity or usability of the API.
>>
>> 1) We use non-DA mode, so i cant live without that (and frankly ebpf
>> has challenges adding complex code blocks).

Could you elaborate on that or provide code examples? Since the introduction of
direct action mode I've never used anything else again, and we do have complex
BPF code blocks that we need to handle as well. It would be good if you could
provide more details on the things you ran into; maybe they can be solved?

>> 2) We also use different protocols when i need to
>> (yes, you can do the filtering in the bpf code - but why impose that
>> if the cost of adding it is simple? and of course it is cheaper to do
>> the check outside of ebpf)
>> 3) We use chains outside of zero
>>
>> 4) So far we dont use block support but certainly my recent experiences
>> in a deployment shows that we need to group netdevices more often than
>> i thought was necessary. So if i could express one map shared by
>> multiple netdevices it should cut down the user space complexity.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-16 16:00                             ` Daniel Borkmann
@ 2021-06-18 11:40                               ` Jamal Hadi Salim
  2021-06-18 14:38                                 ` Alexei Starovoitov
  2021-06-18 22:42                                 ` Daniel Borkmann
  0 siblings, 2 replies; 45+ messages in thread
From: Jamal Hadi Salim @ 2021-06-18 11:40 UTC (permalink / raw)
  To: Daniel Borkmann, Kumar Kartikeya Dwivedi
  Cc: Cong Wang, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 2021-06-16 12:00 p.m., Daniel Borkmann wrote:
> On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote:
>> On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote:
>>> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote:
>>>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote:

[..]

>>>
>>> In particular, here's a list from Kartikeya's implementation:
>>>
>>> 1) Direct action mode only
> 
> (More below.)
> 
>>> 2) Protocol ETH_P_ALL only
> 
> The issue I see with this one is that it's not very valuable or useful 
> from a BPF
> point of view. Meaning, this kind of check can and typically is 
> implemented from
> BPF program anyway. For example, when you have direct packet access 
> initially
> parsing the eth header anyway (and from there having logic for the 
> various eth
> protos).

In that case, make it optional to specify the proto and default it to
ETH_P_ALL. As far as I can see, this flexibility doesn't
complicate usability or add code complexity to the interfaces.

> 
> That protocol option is maybe more useful when you have classic tc with 
> cls+act
> style pipeline where you want a quick skip of classifiers to avoid 
> reparsing the
> packet. Given you can do everything inside the BPF program already it 
> adds more
> confusion than value for a simple libbpf [tc/BPF] API.
> 

There's no point in repeating the operation of identifying
the protocol type, which can be (or has already been) identified by the
code calling (into ebpf). If all I am interested in is IPv4, then
my ebpf parser can be simplified if I am sure I can assume it
is an IPv4 packet.


[..]

>>> 1) We use non-DA mode, so i cant live without that (and frankly ebpf
>>> has challenges adding complex code blocks).
> 
> Could you elaborate on that or provide code examples? Since introduction 
> of the
> direct action mode I've never used anything else again, and we do have 
> complex
> BPF code blocks that we need to handle as well. Would be good if you 
> could provide
> more details on things you ran into, maybe they can be solved?
> 

The main issue is code complexity in ebpf and not so much instruction
count (which is complicated once you have bounded loops).
Earlier, I tried to post on the ebpf list but got no response.
I have moved on since. I would like to engage you at some point - and
you are right, there may be some clever tricks to achieve the goals
we had. The challenge is in keeping up with the bag of tricks to make
the verifier happy.
Being able to run non-da mode and, for example, attach an action such
as the policer (and others) has pragmatic uses. It would be quite
complex to implement the policer within all-in-one-appliance
da-mode ebpf code.
One approach is to add more helpers to invoke such code directly
from ebpf - but we have some restrictions; the deployment is RHEL 8.3
based and we have to live with the kernel features supported there,
i.e. a kernel upgrade is a no-no. Given all these TC features have
existed (and been stable) for 100 years, it makes a lot of sense to use them.

We are going to present some of the challenges we faced in a subset
of our work on an approach to replacing iptables at netdev 0x15
(hopefully we get accepted).

cheers,
jamal

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-18 11:40                               ` Jamal Hadi Salim
@ 2021-06-18 14:38                                 ` Alexei Starovoitov
  2021-06-18 14:50                                   ` Jamal Hadi Salim
  2021-06-18 22:42                                 ` Daniel Borkmann
  1 sibling, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2021-06-18 14:38 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Daniel Borkmann, Kumar Kartikeya Dwivedi, Cong Wang, bpf,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> We are going to present some of the challenges we faced in a subset
> of our work in an approach to replace iptables at netdev 0x15
> (hopefully we get accepted).

Jamal,
please stop using the netdev@vger mailing list to promote a conference
that does NOT represent the netdev kernel community.
Slides shown at that conference are a non-event as far as this discussion goes.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-18 14:38                                 ` Alexei Starovoitov
@ 2021-06-18 14:50                                   ` Jamal Hadi Salim
  2021-06-18 16:23                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 45+ messages in thread
From: Jamal Hadi Salim @ 2021-06-18 14:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Kumar Kartikeya Dwivedi, Cong Wang, bpf,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 2021-06-18 10:38 a.m., Alexei Starovoitov wrote:
> On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>> We are going to present some of the challenges we faced in a subset
>> of our work in an approach to replace iptables at netdev 0x15
>> (hopefully we get accepted).
> 
> Jamal,
> please stop using netdev@vger mailing list to promote a conference
> that does NOT represent the netdev kernel community.
 >
> Slides shown at that conference is a non-event as far as this discussion goes.

Alexei,
Tame the aggression, would you please?
You have no right to make claims as to who represents the community.
Absolutely none. So get off that high horse.

I only mentioned the slides because, when done, they will be a good spot
that captures the issues. As I mentioned, I actually did
send some email (some CCed to you) but got no response.
I don't mind having a discussion, but you have to be willing to
listen as well.


cheers,
jamal



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-18 14:50                                   ` Jamal Hadi Salim
@ 2021-06-18 16:23                                     ` Alexei Starovoitov
  2021-06-18 16:41                                       ` Jamal Hadi Salim
  0 siblings, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2021-06-18 16:23 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Daniel Borkmann, Kumar Kartikeya Dwivedi, Cong Wang, bpf,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On Fri, Jun 18, 2021 at 7:50 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2021-06-18 10:38 a.m., Alexei Starovoitov wrote:
> > On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >>
> >> We are going to present some of the challenges we faced in a subset
> >> of our work in an approach to replace iptables at netdev 0x15
> >> (hopefully we get accepted).
> >
> > Jamal,
> > please stop using netdev@vger mailing list to promote a conference
> > that does NOT represent the netdev kernel community.
>  >
> > Slides shown at that conference is a non-event as far as this discussion goes.
>
> Alexei,
> Tame the aggression, would you please?
> You have no right to make claims as to who represents the community.
> Absolutely none. So get off that high horse.
>
> I only mentioned the slides because it will be a good spot when
> done which captures the issues. As i mentioned in i actually did
> send some email (some Cced to you) but got no response.
> I dont mind having a discussion but you have to be willing to
> listen as well.

You've side-tracked a technical discussion to promote your own conference.
That's not acceptable. Please use other forums for marketing.
This mailing list is for technical discussions.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-18 16:23                                     ` Alexei Starovoitov
@ 2021-06-18 16:41                                       ` Jamal Hadi Salim
  0 siblings, 0 replies; 45+ messages in thread
From: Jamal Hadi Salim @ 2021-06-18 16:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Kumar Kartikeya Dwivedi, Cong Wang, bpf,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Vlad Buslov, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 2021-06-18 12:23 p.m., Alexei Starovoitov wrote:
> On Fri, Jun 18, 2021 at 7:50 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>

[..]
>> Alexei,
>> Tame the aggression, would you please?
>> You have no right to make claims as to who represents the community.
>> Absolutely none. So get off that high horse.
>>
>> I only mentioned the slides because it will be a good spot when
>> done which captures the issues. As i mentioned in i actually did
>> send some email (some Cced to you) but got no response.
>> I dont mind having a discussion but you have to be willing to
>> listen as well.
> 
> You've side tracked technical discussion to promote your own conference.
> That's not acceptable. Please use other forums for marketing.
 >
> This mailing list is for technical discussions.

I just made a statement in passing and you took it off on a tangent.
If you are so righteous, why didn't you just stick
to making technical comments?
Stop making bold statements and then playing the victim.

cheers,
jamal

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-18 11:40                               ` Jamal Hadi Salim
  2021-06-18 14:38                                 ` Alexei Starovoitov
@ 2021-06-18 22:42                                 ` Daniel Borkmann
  2021-06-21 13:55                                   ` Jamal Hadi Salim
  1 sibling, 1 reply; 45+ messages in thread
From: Daniel Borkmann @ 2021-06-18 22:42 UTC (permalink / raw)
  To: Jamal Hadi Salim, Kumar Kartikeya Dwivedi
  Cc: Cong Wang, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 6/18/21 1:40 PM, Jamal Hadi Salim wrote:
> On 2021-06-16 12:00 p.m., Daniel Borkmann wrote:
>> On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote:
>>> On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote:
>>>> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote:
>>>>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote:
> 
> [..]
> 
>>>> In particular, here's a list from Kartikeya's implementation:
>>>>
>>>> 1) Direct action mode only
>>
>> (More below.)
>>
>>>> 2) Protocol ETH_P_ALL only
>>
>> The issue I see with this one is that it's not very valuable or useful from a BPF
>> point of view. Meaning, this kind of check can and typically is implemented from
>> BPF program anyway. For example, when you have direct packet access initially
>> parsing the eth header anyway (and from there having logic for the various eth
>> protos).
> 
> In that case make it optional to specify proto and default it to
> ETH_P_ALL. As far as i can see this flexibility doesnt
> complicate usability or add code complexity to the interfaces.

From a user interface PoV it's odd since you need to go and parse that anyway; at
least, the programs typically start out with a switch/case either reading
skb->protocol or getting it via eth->h_proto. But then once you extend that same
program to also cover IPv6, with ETH_P_ALL you don't need to do anything in the
loader application, whereas otherwise you'd additionally have to remember to
downgrade ETH_P_IP to ETH_P_ALL and rebuild the loader to get v6 traffic. But even
if you were to split things in the main/entry program to separate v4/v6 processing
into two different ones, I expect this to be faster via tail calls (given the
direct absolute jump) instead of walking a list of tcf_proto objects, comparing
the tp->protocol and going into a different cls_bpf instance.
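
Sketch of the tail call variant (slot wiring from the loader side
elided; slot numbering is arbitrary):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u32);
} jmp_table SEC(".maps");

enum { SLOT_V4 = 0, SLOT_V6 = 1 };	/* filled in by the loader */

SEC("classifier")
int entry_prog(struct __sk_buff *skb)
{
	/* Direct absolute jump into the per-proto program on match. */
	switch (bpf_ntohs(skb->protocol)) {
	case ETH_P_IP:
		bpf_tail_call(skb, &jmp_table, SLOT_V4);
		break;
	case ETH_P_IPV6:
		bpf_tail_call(skb, &jmp_table, SLOT_V6);
		break;
	}
	return TC_ACT_OK;	/* no prog in slot or other proto */
}

char LICENSE[] SEC("license") = "GPL";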

[...]
>> Could you elaborate on that or provide code examples? Since introduction of the
>> direct action mode I've never used anything else again, and we do have complex
>> BPF code blocks that we need to handle as well. Would be good if you could provide
>> more details on things you ran into, maybe they can be solved?
> 
> Main issue is code complexity in ebpf and not so much instruction
> count (which is complicated once you have bounded loops).
> Earlier, I tried to post on the ebpf list but i got no response.
> I moved on since. I would like to engage you at some point - and
> you are right there may be some clever tricks to achieve the goals
> we had. The challenge is in keeping up with the bag of tricks to make
> the verifier happy.
> Being able to run non-da mode and for example attach an action such
> as the policer (and others) has pragmatic uses. It would be quite complex to implement the policer within an all-in-one-appliance
> da-mode ebpf code.

It may be more tricky but not impossible either; in recent years some (imho) very
interesting and exciting use cases have been implemented and talked about, e.g. [0-2],
and with the recent linker work there could also be a [e.g. in-kernel] collection of
library code that can be pulled in by others, aside from using them as BPF selftests,
as one option. The gain you have with the flexibility [as you know] is that it allows
easy integration/orchestration into user space applications and is thus suitable for
more dynamic envs than old-style actions. The issue I have with the latter is
that they're not scalable enough from a SW datapath / tc fast-path perspective, given
you then need to fall back to old-style list processing of cls+act combinations, which
is also not covered / in scope for the libbpf API in terms of their setup; additionally,
not all of the BPF features can be used this way either, so it'll be very
hard for users to debug why their BPF programs don't work as they're expected to.

But aside from those blockers, the case for this clean-slate tc BPF API is that
we have a unique chance to overcome the cmdline usability struggles, and make it as
straightforward as possible for a new generation of users.

   [0] https://linuxplumbersconf.org/event/7/contributions/677/
   [1] https://linuxplumbersconf.org/event/2/contributions/121/
   [2] https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
  2021-06-18 22:42                                 ` Daniel Borkmann
@ 2021-06-21 13:55                                   ` Jamal Hadi Salim
  0 siblings, 0 replies; 45+ messages in thread
From: Jamal Hadi Salim @ 2021-06-21 13:55 UTC (permalink / raw)
  To: Daniel Borkmann, Kumar Kartikeya Dwivedi
  Cc: Cong Wang, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Vlad Buslov, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Joe Stringer, Quentin Monnet,
	Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Linux Kernel Network Developers, Marcelo Ricardo Leitner

On 2021-06-18 6:42 p.m., Daniel Borkmann wrote:
> On 6/18/21 1:40 PM, Jamal Hadi Salim wrote:

[..]
>  From a user interface PoV it's odd since you need to go and parse that 
> anyway, at
> least the programs typically start out with a switch/case on either 
> reading the
> skb->protocol or getting it via eth->h_proto. But then once you extend 
> that same
> program to also cover IPv6, you don't need to do anything with the 
> ETH_P_ALL
> from the loader application, but now you'd also need to additionally 
> remember to
> downgrade ETH_P_IP to ETH_P_ALL and rebuild the loader to get v6 
> traffic. But even
> if you were to split things in the main/entry program to separate v4/v6 
> processing
> into two different ones, I expect this to be faster via tail calls 
> (given direct
> absolute jump) instead of walking a list of tcf_proto objects, comparing 
> the
> tp->protocol and going into a different cls_bpf instance.
> 

Good point on being more future-proof with ETH_P_ALL.
Note: in our case we were only interested in IPv4, and I don't see that
changing for the specific prog we have. From a compute perspective, all
I am saving by not using ETH_P_ALL is one if statement (checking if
the proto is IPv4). If you feel strongly about it, we can change our code.
My worry now is that if we used this approach, then likely someone else
in the wild used something similar.

I think it boils down again to: if it doesn't confuse the API or add
extra complexity, why not allow it and default to ETH_P_ALL?

On your comment about a bpf-based proto comparison being faster - the
issue is that the tp proto check always happens regardless, and ebpf,
depending on your program, may not fit all your code. For example, I may
actually decide to have separate programs for v6 and v4 if I wanted
to with the current mechanism - at different tc ruleset prios, just
so as to work around code/complexity issues.

BTW: the tail call limit of 32 provides an upper bound which affects
the depth of (generic) parsing.
Does it make sense to allow increasing the limit (maybe per-boot)?
The fact that things run on the stack may be restricting.


> It may be more tricky but not impossible either, in recent years some 
> (imho) very
> interesting and exciting use cases have been implemented and talked 
> about e.g. [0-2],
> and with the recent linker work there could also be a [e.g. in-kernel] 
> collection with
> library code that can be pulled in by others aside from using them as 
> BPF selftests
> as one option. The gain you have with the flexibility [as you know] is 
> that it allows
> easy integration/orchestration into user space applications and thus 
> suitable for
> more dynamic envs as with old-style actions. The issue I have with the 
> latter is
> that they're not scalable enough from a SW datapath / tc fast-path 
> perspective given
> you then need to fallback to old-style list processing of cls+act 
> combinations which
> is also not covered / in scope for the libbpf API in terms of their 
> setup, and
> additionally not all of the BPF features can be used this way either, so 
> it'll be very
> hard for users to debug why their BPF programs don't work as they're 
> expected to.
> 
> But also aside from those blockers, the case with this clean slate tc 
> BPF API is that
> we have a unique chance to overcome the cmdline usability struggles, and 
> make it as
> straight forward as possible for new generation of users.
> 
>    [0] https://linuxplumbersconf.org/event/7/contributions/677/
>    [1] https://linuxplumbersconf.org/event/2/contributions/121/
>    [2] 
> https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF 

I took a quick glance at the refs.

IIUC, your message is "do more with less", i.e. restrict choices now
so we can focus on optimizing for speed. Here's my experience.
We have two pragmatic challenges:

1) In a deployment, like some enterprise-class data centers, we are
often limited by the kernel and often even by the distro we are on. You
can't just upgrade to the latest and greatest without risking voiding
the distro vendor's support contract. Big shops with a lot of geniuses
like FB and Google don't have these problems of course - but the majority
out there do.

So even our little program must use supported interfaces to be accepted
(e.g. you can't expect support on RHEL 8.3 for an XDP issue without using
the supplied XDP lib).

So building in support for using existing infra is useful.

2) Challenges with ebpf code space and code complexity: depending
on the complexity, even a program with fewer than 4K instructions may be
rejected by the verifier. IOW, I just can't add all the features
I need _even if I wanted to_.

For this reason, working cooperatively with other existing kernel
and user infra makes sense (ref [2] is doing that, for example).
You don't want to rewrite the kernel using ebpf; extending the kernel
with ebpf makes sense. And of course I don't want to lose performance,
but there may sometimes be a trade-off where a little loss in
performance is justified by the gain of a feature
(the non-da example applies).

Perhaps adding more helpers to interface with the actions and classifiers
is one way forward.

cheers,
jamal

PS: I didn't understand the kernel linker point with the BPF selftests.
Pointer?

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread

Thread overview: 45+ messages
2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 1/7] net: sched: refactor cls_bpf creation code Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 2/7] bpf: export bpf_link functions for modules Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 3/7] net: sched: add bpf_link API for bpf classifier Kumar Kartikeya Dwivedi
2021-06-02 20:56   ` Andrii Nakryiko
2021-05-28 19:59 ` [PATCH RFC bpf-next 4/7] net: sched: add lightweight update path for cls_bpf Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 5/7] tools: bpf.h: sync with kernel sources Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 6/7] libbpf: add bpf_link based TC-BPF management API Kumar Kartikeya Dwivedi
2021-06-02 21:03   ` Andrii Nakryiko
2021-05-28 19:59 ` [PATCH RFC bpf-next 7/7] libbpf: add selftest for " Kumar Kartikeya Dwivedi
2021-06-02 21:09 ` [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Andrii Nakryiko
2021-06-02 21:45   ` Kumar Kartikeya Dwivedi
2021-06-02 23:50     ` Alexei Starovoitov
2021-06-04  6:43       ` Kumar Kartikeya Dwivedi
2021-06-06 23:37 ` Cong Wang
2021-06-07  3:37   ` Kumar Kartikeya Dwivedi
2021-06-07  5:18     ` Cong Wang
2021-06-07  6:07       ` Kumar Kartikeya Dwivedi
2021-06-08  2:00         ` Cong Wang
2021-06-08  7:19           ` Kumar Kartikeya Dwivedi
2021-06-08 15:39             ` Alexei Starovoitov
2021-06-11  2:10               ` Cong Wang
2021-06-11  2:00             ` Cong Wang
2021-06-13  2:53               ` Kumar Kartikeya Dwivedi
2021-06-13 20:27                 ` Jamal Hadi Salim
2021-06-13 20:34                   ` Kumar Kartikeya Dwivedi
2021-06-13 21:10                     ` Jamal Hadi Salim
2021-06-14 13:03                       ` Marcelo Ricardo Leitner
2021-06-15 23:07                       ` Daniel Borkmann
2021-06-16 14:40                         ` Jamal Hadi Salim
2021-06-16 15:32                           ` Kumar Kartikeya Dwivedi
2021-06-16 16:00                             ` Daniel Borkmann
2021-06-18 11:40                               ` Jamal Hadi Salim
2021-06-18 14:38                                 ` Alexei Starovoitov
2021-06-18 14:50                                   ` Jamal Hadi Salim
2021-06-18 16:23                                     ` Alexei Starovoitov
2021-06-18 16:41                                       ` Jamal Hadi Salim
2021-06-18 22:42                                 ` Daniel Borkmann
2021-06-21 13:55                                   ` Jamal Hadi Salim
2021-06-15  4:33                 ` Cong Wang
2021-06-15 11:54                   ` Toke Høiland-Jørgensen
2021-06-15 23:44                     ` Daniel Borkmann
2021-06-16 12:03                       ` Toke Høiland-Jørgensen
2021-06-16 15:33                       ` Jamal Hadi Salim
2021-06-13  3:08               ` Kumar Kartikeya Dwivedi
