* [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc
@ 2022-06-02  4:10 Cong Wang
  2022-06-02  4:10 ` [RFC Patch v5 1/5] net: introduce skb_rbtree_walk_safe() Cong Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Cong Wang @ 2022-06-02  4:10 UTC (permalink / raw)
  To: netdev
  Cc: bpf, Cong Wang, Toke Høiland-Jørgensen,
	Jamal Hadi Salim, Jiri Pirko

From: Cong Wang <cong.wang@bytedance.com>

This *incomplete* patchset introduces a programmable Qdisc with eBPF.

There are a few use cases:

1. Allow customizing Qdisc's in an easier way, so that people don't
   have to write a complete Qdisc kernel module just to experiment
   with a new queuing theory.

2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
   runs before enqueue, so it is impossible to adjust those "tokens"
   after packets get dropped in enqueue. With an eBPF Qdisc, this is
   easily solved with a map shared between clsact and sch_bpf (see
   the sketch after this list).

3. Replace qevents, as the user now gains much more control over the
   skb and queues.

4. Provide a new way to reuse TC filters. Currently TC relies on filter
   chain and block to reuse the TC filters, but they are too complicated
   to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
   TC filters on _any_ Qdisc (even on a different netdev) to do the
   classification.

5. Potentially pave a way for ingress to queue packets, although the
   current implementation is still egress-only.

6. Possibly pave a way for handling the TCP protocol in TC, as the rbtree
   is already used by TCP to handle retransmissions.

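To make use case 2 a little more concrete, below is a rough and untested
sketch of how clsact and sch_bpf could share the token state. The map
layouts, the "sched_qdisc" section name and the program-side declarations
of the new helpers (bpf_skb_map_push() comes from patch 3/5) are only
assumptions for illustration, not part of this series:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Token state shared between clsact and sch_bpf. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __s64);	/* remaining tokens, in bytes */
} tokens SEC(".maps");

/* Priority queue backing the qdisc (BPF_MAP_TYPE_SKBMAP, patch 3/5). */
struct {
	__uint(type, BPF_MAP_TYPE_SKBMAP);
	__uint(max_entries, 1024);
	__uint(key_size, sizeof(__u64));
	__uint(value_size, 0);
} pq SEC(".maps");

/* clsact egress: charge the packet before it reaches sch_bpf. */
SEC("tc")
int edt_charge(struct __sk_buff *skb)
{
	__u32 zero = 0;
	__s64 *t = bpf_map_lookup_elem(&tokens, &zero);

	if (t)
		__sync_fetch_and_add(t, -(__s64)skb->len);
	return TC_ACT_OK;
}

/* sch_bpf enqueue: give the tokens back whenever we have to drop.
 * The "sched_qdisc" section name is a placeholder.
 */
SEC("sched_qdisc")
int enqueue(struct sch_bpf_ctx *ctx)
{
	struct __sk_buff *skb = ctx->skb;
	__u32 zero = 0;
	__s64 *t;

	if (bpf_skb_map_push(&pq, skb, bpf_ktime_get_ns()) == 0)
		return SCH_BPF_QUEUED;

	t = bpf_map_lookup_elem(&tokens, &zero);
	if (t)
		__sync_fetch_and_add(t, skb->len);	/* undo the clsact charge */
	return SCH_BPF_DROP;
}
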
The goal here is to make this Qdisc as programmable as possible,
that is, to replace as many existing Qdisc's as we can, whether in
tree or out of tree. This is why I gave up on PIFO, which has serious
limitations on programmability.

Here is a summary of design decisions I made:

1. Avoid eBPF struct_ops, as it would be really hard to program
   a Qdisc with this approach: literally all of struct Qdisc_ops
   and struct Qdisc_class_ops would have to be implemented, which is
   almost as hard as programming a Qdisc kernel module.

2. Introduce skb map, which will allow other eBPF programs to store skb's
   too.

   a) As eBPF maps are not directly visible to the kernel, we have to
   dump the stats via eBPF map API's instead of netlink.

   b) User space is not allowed to read the entire packet; only __sk_buff
   itself is readable, because we don't have such a use case yet and it
   would require a different API to read the data, as map values have a
   fixed length.

   c) Two eBPF helpers are introduced for skb map operations:
   bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
   not allowed.

   d) Multi-queue support is implemented via map-in-map, in a similar
   push/pop fashion (see the sketch after this list).

   e) Use the netdevice notifier to reset the packets inside skb map upon
   NETDEV_DOWN event.

3. Integrate with existing TC infra. For example, if the user doesn't want
   to implement her own filters (e.g. a flow dissector), she should be able
   to re-use the existing TC filters. Another helper bpf_skb_tc_classify() is
   introduced for this purpose.
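
To illustrate 2.d) above, here is a rough sketch of the intended
multi-queue layout (untested; the declaration syntax for the new map
types and the program-side helper signatures are assumptions):

/* One skb map per flow/queue... */
struct skb_pq {
	__uint(type, BPF_MAP_TYPE_SKBMAP);
	__uint(max_entries, 1024);
	__uint(key_size, sizeof(__u64));
	__uint(value_size, 0);
};

/* ...stored inside a flow map (map-in-map), ordered by a per-flow rank. */
struct {
	__uint(type, BPF_MAP_TYPE_FLOWMAP);
	__uint(max_entries, 16);
	__uint(key_size, sizeof(__u64));
	__uint(value_size, sizeof(__u32));
	__array(values, struct skb_pq);
} flows SEC(".maps");

/* Dequeue side: take the flow stored at @flow_rank out of the flow map,
 * then take the skb stored at @skb_rank out of that flow.
 */
static void *dequeue_from_flow(__u64 flow_rank, __u64 skb_rank)
{
	void *flow = (void *)bpf_flow_map_pop(&flows, flow_rank);

	if (!flow)
		return NULL;
	return (void *)bpf_skb_map_pop(flow, skb_rank);
}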

Any high-level feedback is welcome. Please kindly do not review any coding
details until the RFC tag is removed.

TODO:
1. actually test it
2. write a document for this Qdisc
3. add test cases and sample code

Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
    implement flow map as map-in-map
    rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
    clean up eBPF qdisc program context

v4: get rid of PIFO, use rbtree directly

v3: move priority queue from sch_bpf to skb map
    introduce skb map and its helpers
    introduce bpf_skb_classify()
    use netdevice notifier to reset skb's
    Rebase on latest bpf-next

v2: Rebase on latest net-next
    Make the code more complete (but still incomplete)

Cong Wang (5):
  net: introduce skb_rbtree_walk_safe()
  bpf: move map in map declarations to bpf.h
  bpf: introduce skb map and flow map
  net_sched: introduce eBPF based Qdisc
  net_sched: introduce helper bpf_skb_tc_classify()

 include/linux/bpf.h            |   6 +
 include/linux/bpf_types.h      |   4 +
 include/linux/skbuff.h         |   9 +-
 include/uapi/linux/bpf.h       |  23 ++
 include/uapi/linux/pkt_sched.h |  17 ++
 kernel/bpf/arraymap.c          |   2 -
 kernel/bpf/hashtab.c           |   1 -
 kernel/bpf/map_in_map.c        |   2 -
 kernel/bpf/map_in_map.h        |  19 --
 kernel/bpf/verifier.c          |  10 +
 net/core/Makefile              |   1 +
 net/core/filter.c              |  39 +++
 net/core/skb_map.c             | 520 +++++++++++++++++++++++++++++++++
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/cls_api.c            |  69 +++++
 net/sched/sch_bpf.c            | 485 ++++++++++++++++++++++++++++++
 17 files changed, 1198 insertions(+), 25 deletions(-)
 delete mode 100644 kernel/bpf/map_in_map.h
 create mode 100644 net/core/skb_map.c
 create mode 100644 net/sched/sch_bpf.c

-- 
2.34.1



* [RFC Patch v5 1/5] net: introduce skb_rbtree_walk_safe()
  2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
@ 2022-06-02  4:10 ` Cong Wang
  2022-06-02  4:10 ` [RFC Patch v5 2/5] bpf: move map in map declarations to bpf.h Cong Wang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2022-06-02  4:10 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

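skb_rbtree_walk_safe() is the rbtree counterpart of skb_queue_walk_safe():
it looks up the next node before the loop body runs, so the current skb
can be unlinked or freed while walking. A minimal usage sketch (the drop
predicate is hypothetical):

	struct sk_buff *skb, *tmp;

	skb_rbtree_walk_safe(skb, tmp, &root) {
		if (should_drop(skb)) {		/* hypothetical predicate */
			rb_erase(&skb->rbnode, &root);
			kfree_skb(skb);
		}
	}
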
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/skbuff.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index da96f0d3e753..857fd813c1bc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3929,6 +3929,11 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, unsigned int len)
 		for (skb = skb_rb_first(root); skb != NULL;			\
 		     skb = skb_rb_next(skb))
 
+#define skb_rbtree_walk_safe(skb, tmp, root)					\
+		for (skb = skb_rb_first(root);					\
+		     tmp = skb ? skb_rb_next(skb) : NULL, (skb != NULL);	\
+		     skb = tmp)
+
 #define skb_rbtree_walk_from(skb)						\
 		for (; skb != NULL;						\
 		     skb = skb_rb_next(skb))
-- 
2.34.1



* [RFC Patch v5 2/5] bpf: move map in map declarations to bpf.h
  2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
  2022-06-02  4:10 ` [RFC Patch v5 1/5] net: introduce skb_rbtree_walk_safe() Cong Wang
@ 2022-06-02  4:10 ` Cong Wang
  2022-06-02  4:10 ` [RFC Patch v5 3/5] bpf: introduce skb map and flow map Cong Wang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2022-06-02  4:10 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf.h     |  6 ++++++
 kernel/bpf/arraymap.c   |  2 --
 kernel/bpf/hashtab.c    |  1 -
 kernel/bpf/map_in_map.c |  2 --
 kernel/bpf/map_in_map.h | 19 -------------------
 5 files changed, 6 insertions(+), 24 deletions(-)
 delete mode 100644 kernel/bpf/map_in_map.h

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8e6092d0ea95..cf04ddce2c2d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -341,6 +341,12 @@ int map_check_no_btf(const struct bpf_map *map,
 
 bool bpf_map_meta_equal(const struct bpf_map *meta0,
 			const struct bpf_map *meta1);
+struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd);
+void bpf_map_meta_free(struct bpf_map *map_meta);
+void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
+			 int ufd);
+void bpf_map_fd_put_ptr(void *ptr);
+u32 bpf_map_fd_sys_lookup_elem(void *ptr);
 
 extern const struct bpf_map_ops bpf_map_offload_ops;
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index fe40d3b9458f..65ba21e4b707 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -13,8 +13,6 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/btf_ids.h>
 
-#include "map_in_map.h"
-
 #define ARRAY_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK | \
 	 BPF_F_PRESERVE_ELEMS | BPF_F_INNER_MAP)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 17fb69c0e0dc..f0a2464c1669 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -13,7 +13,6 @@
 #include <linux/btf_ids.h>
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
-#include "map_in_map.h"
 
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 135205d0d560..8a537f6b2abd 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -5,8 +5,6 @@
 #include <linux/bpf.h>
 #include <linux/btf.h>
 
-#include "map_in_map.h"
-
 struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 {
 	struct bpf_map *inner_map, *inner_map_meta;
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
deleted file mode 100644
index bcb7534afb3c..000000000000
--- a/kernel/bpf/map_in_map.h
+++ /dev/null
@@ -1,19 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/* Copyright (c) 2017 Facebook
- */
-#ifndef __MAP_IN_MAP_H__
-#define __MAP_IN_MAP_H__
-
-#include <linux/types.h>
-
-struct file;
-struct bpf_map;
-
-struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd);
-void bpf_map_meta_free(struct bpf_map *map_meta);
-void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
-			 int ufd);
-void bpf_map_fd_put_ptr(void *ptr);
-u32 bpf_map_fd_sys_lookup_elem(void *ptr);
-
-#endif
-- 
2.34.1



* [RFC Patch v5 3/5] bpf: introduce skb map and flow map
  2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
  2022-06-02  4:10 ` [RFC Patch v5 1/5] net: introduce skb_rbtree_walk_safe() Cong Wang
  2022-06-02  4:10 ` [RFC Patch v5 2/5] bpf: move map in map declarations to bpf.h Cong Wang
@ 2022-06-02  4:10 ` Cong Wang
  2022-06-02  4:10 ` [RFC Patch v5 4/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2022-06-02  4:10 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang, Cong Wang

Introduce two maps: one for storing skbs and one for storing skb flows;
the latter is implemented as a map-in-map. The APIs for the two maps are
similar, except that one takes an skb pointer and the other takes an skb
map pointer.

Here are the APIs for the skb map:

1. Insert an skb into the skb map at position @key:
bpf_skb_map_push(&map, skb, key);

2. Remove the skb at position @key from the skb map:
skb = bpf_skb_map_pop(&map, key);

3. Peek an skb by @key:
skb = bpf_map_lookup_elem(&map, &key);

4. Drop the skb at position @key:
bpf_map_delete_elem(&map, &key);

5. Iterate all the skbs in the map in order:
bpf_for_each_map_elem(&skb_map, skb_callback, &callback_ctx, 0);

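For 5., the callback follows the usual bpf_for_each_map_elem() convention;
the exact program-side signature for this new map type is an assumption
here, derived from how the kernel invokes it (return 0 to continue, 1 to
stop):

static __u64 skb_callback(struct bpf_map *map, __u64 *key,
			  struct sk_buff *skb, void *ctx)
{
	__u64 *visited = ctx;

	/* *key is the rank the skb was pushed with. */
	(*visited)++;
	return 0;
}
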
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf.h       |   4 +
 include/linux/bpf_types.h |   2 +
 include/linux/skbuff.h    |   4 +-
 include/uapi/linux/bpf.h  |   6 +
 kernel/bpf/verifier.c     |  10 +
 net/core/Makefile         |   1 +
 net/core/skb_map.c        | 520 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 546 insertions(+), 1 deletion(-)
 create mode 100644 net/core/skb_map.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cf04ddce2c2d..43fbb45b6fc2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2274,6 +2274,10 @@ extern const struct bpf_func_proto bpf_loop_proto;
 extern const struct bpf_func_proto bpf_strncmp_proto;
 extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
 extern const struct bpf_func_proto bpf_kptr_xchg_proto;
+extern const struct bpf_func_proto bpf_skb_map_push_proto;
+extern const struct bpf_func_proto bpf_skb_map_pop_proto;
+extern const struct bpf_func_proto bpf_flow_map_push_proto;
+extern const struct bpf_func_proto bpf_flow_map_pop_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b9112b80171..b1276f9f9d26 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -110,6 +110,8 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, dev_map_hash_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SKBMAP, skb_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_FLOWMAP, flow_map_ops)
 #if defined(CONFIG_XDP_SOCKETS)
 BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 #endif
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 857fd813c1bc..fea71b4a0b9d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1017,7 +1017,9 @@ struct sk_buff {
 				unsigned long		dev_scratch;
 			};
 		};
-		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
+		struct rb_node		rbnode; /* used in eBPF skb map, netem, ip4 defrag, and tcp
+						 * stack
+						 */
 		struct list_head	list;
 		struct llist_node	ll_node;
 	};
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f4009dbdf62d..cd9cff0df639 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -909,6 +909,8 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_SKBMAP,
+	BPF_MAP_TYPE_FLOWMAP,
 };
 
 /* Note that tracing related programs such as
@@ -5455,6 +5457,10 @@ union bpf_attr {
 	FN(dynptr_read),		\
 	FN(dynptr_write),		\
 	FN(dynptr_data),		\
+	FN(skb_map_push),		\
+	FN(skb_map_pop),		\
+	FN(flow_map_push),		\
+	FN(flow_map_pop),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index aedac2ac02b9..bc4cdb4a5176 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6264,6 +6264,16 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_map_push_elem)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_SKBMAP:
+		if (func_id != BPF_FUNC_skb_map_push &&
+		    func_id != BPF_FUNC_skb_map_pop)
+			goto error;
+		break;
+	case BPF_MAP_TYPE_FLOWMAP:
+		if (func_id != BPF_FUNC_flow_map_push &&
+		    func_id != BPF_FUNC_flow_map_pop)
+			goto error;
+		break;
 	default:
 		break;
 	}
diff --git a/net/core/Makefile b/net/core/Makefile
index a8e4f737692b..183f75e02b28 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -38,4 +38,5 @@ obj-$(CONFIG_FAILOVER) += failover.o
 obj-$(CONFIG_NET_SOCK_MSG) += skmsg.o
 obj-$(CONFIG_BPF_SYSCALL) += sock_map.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += skb_map.o
 obj-$(CONFIG_OF)	+= of_net.o
diff --git a/net/core/skb_map.c b/net/core/skb_map.c
new file mode 100644
index 000000000000..1c4ef29de558
--- /dev/null
+++ b/net/core/skb_map.c
@@ -0,0 +1,520 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * skb_map.c: eBPF skb map based on RB tree
+ *
+ * Copyright (C) 2022, ByteDance, Cong Wang <cong.wang@bytedance.com>
+ */
+#include <linux/bpf.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/capability.h>
+#include <linux/rbtree.h>
+#include <linux/btf_ids.h>
+#include <linux/filter.h>
+#include <net/sch_generic.h>
+
+#define SKB_MAP_CREATE_FLAG_MASK \
+	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
+
+struct bpf_skb_map {
+	struct bpf_map map;
+	struct rb_root root;
+	raw_spinlock_t lock;
+	struct rb_node node;
+	u64 rank;
+	struct list_head list;
+	atomic_t count;
+};
+
+struct skb_map_cb {
+	struct qdisc_skb_cb qdisc_cb;
+	u64 rank;
+};
+
+static struct skb_map_cb *skb_map_cb(const struct sk_buff *skb)
+{
+	struct skb_map_cb *cb = (struct skb_map_cb *)skb->cb;
+
+	BUILD_BUG_ON(sizeof(*cb) > sizeof_field(struct sk_buff, cb));
+	return cb;
+}
+
+static DEFINE_SPINLOCK(skb_map_lock);
+static LIST_HEAD(skb_map_list);
+
+static struct bpf_skb_map *bpf_skb_map(struct bpf_map *map)
+{
+	return container_of(map, struct bpf_skb_map, map);
+}
+
+#define SKB_MAP_MAX_SZ 2048
+
+/* Called from syscall */
+static int skb_map_alloc_check(union bpf_attr *attr)
+{
+	if (!bpf_capable())
+		return -EPERM;
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 8 ||
+	    attr->value_size != 0 ||
+	    attr->map_flags & ~SKB_MAP_CREATE_FLAG_MASK ||
+	    !bpf_map_flags_access_ok(attr->map_flags))
+		return -EINVAL;
+
+	if (attr->value_size > KMALLOC_MAX_SIZE)
+		/* if value_size is bigger, the user space won't be able to
+		 * access the elements.
+		 */
+		return -E2BIG;
+
+	if (attr->max_entries > SKB_MAP_MAX_SZ)
+		return -E2BIG;
+
+	return 0;
+}
+
+static struct bpf_map *skb_map_alloc(union bpf_attr *attr)
+{
+	int numa_node = bpf_map_attr_numa_node(attr);
+	struct bpf_skb_map *rb;
+
+	rb = bpf_map_area_alloc(sizeof(*rb), numa_node);
+	if (!rb)
+		return ERR_PTR(-ENOMEM);
+
+	memset(rb, 0, sizeof(*rb));
+	bpf_map_init_from_attr(&rb->map, attr);
+	raw_spin_lock_init(&rb->lock);
+	rb->root = RB_ROOT;
+	atomic_set(&rb->count, 0);
+	spin_lock(&skb_map_lock);
+	list_add_tail_rcu(&rb->list, &skb_map_list);
+	spin_unlock(&skb_map_lock);
+	return &rb->map;
+}
+
+static void skb_map_free(struct bpf_map *map)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+
+	spin_lock(&skb_map_lock);
+	list_del_rcu(&rb->list);
+	spin_unlock(&skb_map_lock);
+	skb_rbtree_purge(&rb->root);
+	bpf_map_area_free(rb);
+}
+
+static struct sk_buff *skb_rb_find(struct rb_root *root, u64 rank)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct sk_buff *skb1;
+
+	while (*p) {
+		parent = *p;
+		skb1 = rb_to_skb(parent);
+		if (rank < skb_map_cb(skb1)->rank)
+			p = &parent->rb_left;
+		else if (rank > skb_map_cb(skb1)->rank)
+			p = &parent->rb_right;
+		else
+			return skb1;
+	}
+	return NULL;
+}
+
+/* Called from syscall */
+static void *skb_map_lookup_elem_sys(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+
+/* Called from eBPF program */
+static void *skb_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	u64 rank = *(u64 *) key;
+
+	return skb_rb_find(&rb->root, rank);
+}
+
+/* Called from syscall or from eBPF program */
+static int skb_map_update_elem(struct bpf_map *map, void *key, void *value,
+			       u64 flags)
+{
+	return -ENOTSUPP;
+}
+
+/* Called from syscall or from eBPF program */
+static int skb_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	u64 rank = *(u64 *) key;
+	struct sk_buff *skb;
+
+	skb = skb_rb_find(&rb->root, rank);
+	if (!skb)
+		return -ENOENT;
+	rb_erase(&skb->rbnode, &rb->root);
+	consume_skb(skb);
+	return 0;
+}
+
+/* Called from syscall */
+static int skb_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	struct sk_buff *skb;
+	u64 rank;
+
+	if (!key) {
+		skb = skb_rb_first(&rb->root);
+		if (!skb)
+			return -ENOENT;
+		goto found;
+	}
+	rank = *(u64 *) key;
+	skb = skb_rb_find(&rb->root, rank);
+	if (!skb)
+		return -ENOENT;
+	skb = skb_rb_next(skb);
+	if (!skb)
+		return 0;
+found:
+	*(u64 *) next_key = skb_map_cb(skb)->rank;
+	return 0;
+}
+
+static int bpf_for_each_skb_map(struct bpf_map *map, bpf_callback_t callback_fn,
+				void *callback_ctx, u64 flags)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	struct sk_buff *skb, *tmp;
+	u32 num_elems = 0;
+	u64 ret = 0;
+	u64 key;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	skb_rbtree_walk_safe(skb, tmp, &rb->root) {
+		num_elems++;
+		key = skb_map_cb(skb)->rank;
+		ret = callback_fn((u64)(long)map, key, (u64)(long)skb,
+				  (u64)(long)callback_ctx, 0);
+		/* return value: 0 - continue, 1 - stop and return */
+		if (ret)
+			break;
+	}
+
+	return num_elems;
+}
+
+BTF_ID_LIST_SINGLE(skb_map_btf_ids, struct, bpf_skb_map)
+const struct bpf_map_ops skb_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc_check = skb_map_alloc_check,
+	.map_alloc = skb_map_alloc,
+	.map_free = skb_map_free,
+	.map_lookup_elem_sys_only = skb_map_lookup_elem_sys,
+	.map_lookup_elem = skb_map_lookup_elem,
+	.map_update_elem = skb_map_update_elem,
+	.map_delete_elem = skb_map_delete_elem,
+	.map_get_next_key = skb_map_get_next_key,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_for_each_skb_map,
+	.map_btf_id = &skb_map_btf_ids[0],
+};
+
+static void skb_rb_push(struct rb_root *root, struct sk_buff *skb)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct sk_buff *skb1;
+
+	while (*p) {
+		parent = *p;
+		skb1 = rb_to_skb(parent);
+		if (skb_map_cb(skb)->rank < skb_map_cb(skb1)->rank)
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+	}
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, root);
+}
+
+BPF_CALL_2(bpf_skb_map_pop, struct bpf_map *, map, u64, key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	struct sk_buff *skb;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rb->lock, flags);
+	skb = skb_map_lookup_elem(map, &key);
+	if (!skb) {
+		raw_spin_unlock_irqrestore(&rb->lock, flags);
+		return (unsigned long)NULL;
+	}
+	rb_erase(&skb->rbnode, &rb->root);
+	raw_spin_unlock_irqrestore(&rb->lock, flags);
+	consume_skb(skb);
+	atomic_dec(&rb->count);
+	return (unsigned long)skb;
+}
+
+const struct bpf_func_proto bpf_skb_map_pop_proto = {
+	.func		= bpf_skb_map_pop,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_skb_map_push, struct bpf_map *, map, struct sk_buff *, skb, u64, key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	unsigned long flags;
+
+	if (atomic_inc_return(&rb->count) > rb->map.max_entries)
+		return -ENOBUFS;
+	skb = skb_get(skb);
+	skb_map_cb(skb)->rank = key;
+	raw_spin_lock_irqsave(&rb->lock, flags);
+	skb_rb_push(&rb->root, skb);
+	raw_spin_unlock_irqrestore(&rb->lock, flags);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_skb_map_push_proto = {
+	.func		= bpf_skb_map_push,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_CTX,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static struct bpf_map *flow_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_map *map, *inner_map_meta;
+
+	inner_map_meta = bpf_map_meta_alloc(attr->inner_map_fd);
+	if (IS_ERR(inner_map_meta))
+		return inner_map_meta;
+
+	map = skb_map_alloc(attr);
+	if (IS_ERR(map)) {
+		bpf_map_meta_free(inner_map_meta);
+		return map;
+	}
+
+	map->inner_map_meta = inner_map_meta;
+	return map;
+}
+
+#define rb_to_map(rb) rb_entry_safe(rb, struct bpf_skb_map, node)
+
+static void bpf_skb_map_purge(struct rb_root *root)
+{
+	struct rb_node *p = rb_first(root);
+
+	while (p) {
+		struct bpf_skb_map *map = rb_to_map(p);
+
+		p = rb_next(p);
+		rb_erase(&map->node, root);
+		skb_map_free(&map->map);
+	}
+}
+
+static void flow_map_free(struct bpf_map *map)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+
+	bpf_map_meta_free(map->inner_map_meta);
+	bpf_skb_map_purge(&rb->root);
+	bpf_map_area_free(rb);
+}
+
+static struct bpf_map *map_rb_find(struct rb_root *root, u64 rank)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct bpf_skb_map *map1;
+
+	while (*p) {
+		parent = *p;
+		map1 = rb_to_map(parent);
+		if (rank < map1->rank)
+			p = &parent->rb_left;
+		else if (rank > map1->rank)
+			p = &parent->rb_right;
+		else
+			return &map1->map;
+	}
+	return NULL;
+}
+
+/* Called from eBPF program */
+static void *flow_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	u64 rank = *(u64 *) key;
+
+	return map_rb_find(&rb->root, rank);
+}
+
+/* Called from syscall or from eBPF program */
+static int flow_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	struct bpf_skb_map *node;
+	u64 rank = *(u64 *) key;
+	struct bpf_map *target;
+
+	target = map_rb_find(&rb->root, rank);
+	if (!target)
+		return -ENOENT;
+	node = bpf_skb_map(target);
+	rb_erase(&node->node, &rb->root);
+	skb_map_free(target);
+	return 0;
+}
+
+static int flow_map_alloc_check(union bpf_attr *attr)
+{
+	if (attr->value_size != sizeof(u32))
+		return -EINVAL;
+	return skb_map_alloc_check(attr);
+}
+
+/* Called from syscall */
+static int flow_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	return -ENOTSUPP; /* TODO */
+}
+
+const struct bpf_map_ops flow_map_ops = {
+	.map_alloc_check = flow_map_alloc_check,
+	.map_alloc = flow_map_alloc,
+	.map_free = flow_map_free,
+	.map_get_next_key = flow_map_get_next_key,
+	.map_lookup_elem = flow_map_lookup_elem,
+	.map_delete_elem = flow_map_delete_elem,
+	.map_check_btf = map_check_no_btf,
+	.map_btf_id = &skb_map_btf_ids[0],
+};
+
+BPF_CALL_2(bpf_flow_map_pop, struct bpf_map *, map, u64, key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	struct bpf_map *target;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rb->lock, flags);
+	target = map_rb_find(&rb->root, key);
+	if (!target) {
+		raw_spin_unlock_irqrestore(&rb->lock, flags);
+		return (unsigned long)NULL;
+	}
+	rb_erase(&bpf_skb_map(target)->node, &rb->root);
+	raw_spin_unlock_irqrestore(&rb->lock, flags);
+	atomic_dec(&rb->count);
+	return (unsigned long)target;
+}
+
+const struct bpf_func_proto bpf_flow_map_pop_proto = {
+	.func		= bpf_flow_map_pop,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+static void map_rb_push(struct rb_root *root, struct bpf_map *map)
+{
+	struct rb_node **p = &root->rb_node;
+	struct bpf_skb_map *smap = bpf_skb_map(map);
+	struct rb_node *parent = NULL;
+	struct bpf_skb_map *map1;
+
+	while (*p) {
+		parent = *p;
+		map1 = rb_to_map(parent);
+		if (smap->rank < map1->rank)
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+	}
+	rb_link_node(&smap->node, parent, p);
+	rb_insert_color(&smap->node, root);
+}
+
+BPF_CALL_3(bpf_flow_map_push, struct bpf_map *, map, struct bpf_map *, value, u64, key)
+{
+	struct bpf_skb_map *rb = bpf_skb_map(map);
+	unsigned long irq_flags;
+
+	if (atomic_inc_return(&rb->count) > rb->map.max_entries)
+		return -ENOBUFS;
+	bpf_skb_map(value)->rank = key;
+	raw_spin_lock_irqsave(&rb->lock, irq_flags);
+	map_rb_push(&rb->root, value);
+	raw_spin_unlock_irqrestore(&rb->lock, irq_flags);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_flow_map_push_proto = {
+	.func		= bpf_flow_map_push,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_CTX,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static void skb_map_flush(struct bpf_skb_map *rb, struct net_device *dev)
+{
+	struct rb_node *p = rb_first(&rb->root);
+
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+		p = rb_next(p);
+		if (skb->dev == dev) {
+			rb_erase(&skb->rbnode, &rb->root);
+			kfree_skb(skb);
+		}
+	}
+}
+
+static int skb_map_notification(struct notifier_block *notifier,
+				ulong event, void *ptr)
+{
+	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
+	struct bpf_skb_map *rb;
+
+	switch (event) {
+	case NETDEV_DOWN:
+		rcu_read_lock();
+		list_for_each_entry_rcu(rb, &skb_map_list, list)
+			skb_map_flush(rb, netdev);
+		rcu_read_unlock();
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block skb_map_notifier = {
+	.notifier_call = skb_map_notification,
+};
+
+static int __init skb_map_init(void)
+{
+	return register_netdevice_notifier(&skb_map_notifier);
+}
+
+subsys_initcall(skb_map_init);
-- 
2.34.1



* [RFC Patch v5 4/5] net_sched: introduce eBPF based Qdisc
  2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (2 preceding siblings ...)
  2022-06-02  4:10 ` [RFC Patch v5 3/5] bpf: introduce skb map and flow map Cong Wang
@ 2022-06-02  4:10 ` Cong Wang
  2022-06-03 23:14   ` Cong Wang
  2022-06-02  4:10 ` [RFC Patch v5 5/5] net_sched: introduce helper bpf_skb_tc_classify() Cong Wang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2022-06-02  4:10 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang, Cong Wang

Introduce a new Qdisc which is completely managed by an eBPF program
of type BPF_PROG_TYPE_SCHED_QDISC. It accepts two eBPF programs of the
same type: one for enqueue and the other for dequeue.

And it interacts with the Qdisc layer in two ways:
1) It relies on the Qdisc watchdog to handle throttling;
2) It can pass skb enqueue/dequeue down to child classes

The context of this eBPF program is different, as shown below:

┌─────────────────┬───────────────┬────────────────────────────────────┐
│                 │               │                                    │
│   prog          │  input        │             output                 │
│                 │               │                                    │
├─────────────────┼───────────────┼────────────────────────────────────┤
│                 │               │                                    │
│                 │ ctx->skb      │      SCH_BPF_THROTTLE: ctx->delay  │
│                 │               │                                    │
│   enqueue       │ ctx->classid  │      SCH_BPF_QUEUED: None          │
│                 │               │                                    │
│                 │               │      SCH_BPF_DROP: None            │
│                 │               │                                    │
│                 │               │      SCH_BPF_CN: None              │
│                 │               │                                    │
│                 │               │      SCH_BPF_PASS: ctx->classid    │
├─────────────────┼───────────────┼────────────────────────────────────┤
│                 │               │                                    │
│                 │               │      SCH_BPF_THROTTLE: ctx->delay  │
│                 │               │                                    │
│   dequeue       │ ctx->classid  │      SCH_BPF_DEQUEUED: ctx->skb    │
│                 │               │                                    │
│                 │               │      SCH_BPF_DROP: None            │
│                 │               │                                    │
│                 │               │      SCH_BPF_PASS: ctx->classid    │
│                 │               │                                    │
└─────────────────┴───────────────┴────────────────────────────────────┘

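For illustration, a minimal strict-FIFO pair of programs for this Qdisc
could look roughly like the sketch below. It is untested; the
"sched_qdisc/..." section names, the program-side return type of
bpf_skb_map_pop() and the head/tail bookkeeping are assumptions made for
the example (the cursor updates as written are not race-free):

struct {
	__uint(type, BPF_MAP_TYPE_SKBMAP);
	__uint(max_entries, 1024);
	__uint(key_size, sizeof(__u64));
	__uint(value_size, 0);
} fifo SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 2);		/* [0] = head rank, [1] = tail rank */
	__type(key, __u32);
	__type(value, __u64);
} cursor SEC(".maps");

SEC("sched_qdisc/enqueue")
int fifo_enqueue(struct sch_bpf_ctx *ctx)
{
	__u32 tail_idx = 1;
	__u64 *tail = bpf_map_lookup_elem(&cursor, &tail_idx);

	if (!tail)
		return SCH_BPF_DROP;
	if (bpf_skb_map_push(&fifo, ctx->skb, *tail))
		return SCH_BPF_DROP;		/* queue full */
	__sync_fetch_and_add(tail, 1);
	return SCH_BPF_QUEUED;
}

SEC("sched_qdisc/dequeue")
int fifo_dequeue(struct sch_bpf_ctx *ctx)
{
	__u32 head_idx = 0;
	__u64 *head = bpf_map_lookup_elem(&cursor, &head_idx);
	void *skb;

	if (!head)
		return SCH_BPF_DROP;
	skb = (void *)bpf_skb_map_pop(&fifo, *head);
	if (!skb) {
		ctx->delay = 1000 * 1000;	/* empty: retry in ~1ms */
		return SCH_BPF_THROTTLE;
	}
	__sync_fetch_and_add(head, 1);
	ctx->skb = skb;
	return SCH_BPF_DEQUEUED;
}
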
Because eBPF maps are not directly visible to this Qdisc, we have to
rely on user-space dumping to retrieve the stats; this is not
implemented yet as I haven't found the right API to do it.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf.h            |   4 -
 include/linux/bpf_types.h      |   2 +
 include/uapi/linux/bpf.h       |  16 ++
 include/uapi/linux/pkt_sched.h |  17 ++
 net/core/filter.c              |  34 +++
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/sch_bpf.c            | 485 +++++++++++++++++++++++++++++++++
 8 files changed, 570 insertions(+), 4 deletions(-)
 create mode 100644 net/sched/sch_bpf.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 43fbb45b6fc2..cf04ddce2c2d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2274,10 +2274,6 @@ extern const struct bpf_func_proto bpf_loop_proto;
 extern const struct bpf_func_proto bpf_strncmp_proto;
 extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
 extern const struct bpf_func_proto bpf_kptr_xchg_proto;
-extern const struct bpf_func_proto bpf_skb_map_push_proto;
-extern const struct bpf_func_proto bpf_skb_map_pop_proto;
-extern const struct bpf_func_proto bpf_flow_map_push_proto;
-extern const struct bpf_func_proto bpf_flow_map_pop_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index b1276f9f9d26..1baeea8771d2 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_CLS, tc_cls_act,
 	      struct __sk_buff, struct sk_buff)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act,
 	      struct __sk_buff, struct sk_buff)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_QDISC, tc_qdisc,
+	      struct __sk_buff, struct sk_buff)
 BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp,
 	      struct xdp_md, struct xdp_buff)
 #ifdef CONFIG_CGROUP_BPF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index cd9cff0df639..148ec0c4e643 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -954,6 +954,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_SCHED_QDISC,
 };
 
 enum bpf_attach_type {
@@ -6765,4 +6766,19 @@ struct bpf_core_relo {
 	enum bpf_core_relo_kind kind;
 };
 
+struct sch_bpf_ctx {
+	struct __sk_buff *skb;
+	__u32 classid;
+	__u64 delay;
+};
+
+enum {
+	SCH_BPF_QUEUED,
+	SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
+	SCH_BPF_DROP,
+	SCH_BPF_CN,
+	SCH_BPF_THROTTLE,
+	SCH_BPF_PASS,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index f292b467b27f..b51eb712517a 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1267,4 +1267,21 @@ enum {
 
 #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
 
+#define TCA_SCH_BPF_FLAG_DIRECT _BITUL(0)
+enum {
+	TCA_SCH_BPF_UNSPEC,
+	TCA_SCH_BPF_FLAGS,		/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_ENQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_TAG,	/* data */
+	TCA_SCH_BPF_DEQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_DEQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_TAG,	/* data */
+	__TCA_SCH_BPF_MAX,
+};
+
+#define TCA_SCH_BPF_MAX (__TCA_SCH_BPF_MAX - 1)
+
 #endif
diff --git a/net/core/filter.c b/net/core/filter.c
index 5af58eb48587..1205298a17ca 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7813,6 +7813,28 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+const struct bpf_func_proto bpf_skb_map_push_proto __weak;
+const struct bpf_func_proto bpf_skb_map_pop_proto __weak;
+const struct bpf_func_proto bpf_flow_map_push_proto __weak;
+const struct bpf_func_proto bpf_flow_map_pop_proto __weak;
+
+static const struct bpf_func_proto *
+tc_qdisc_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_map_push:
+		return &bpf_skb_map_push_proto;
+	case BPF_FUNC_skb_map_pop:
+		return &bpf_skb_map_pop_proto;
+	case BPF_FUNC_flow_map_push:
+		return &bpf_flow_map_push_proto;
+	case BPF_FUNC_flow_map_pop:
+		return &bpf_flow_map_pop_proto;
+	default:
+		return tc_cls_act_func_proto(func_id, prog);
+	}
+}
+
 static const struct bpf_func_proto *
 xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -10476,6 +10498,18 @@ const struct bpf_prog_ops tc_cls_act_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops tc_qdisc_verifier_ops = {
+	.get_func_proto		= tc_qdisc_func_proto,
+	.is_valid_access	= tc_cls_act_is_valid_access,
+	.convert_ctx_access	= tc_cls_act_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+	.gen_ld_abs		= bpf_gen_ld_abs,
+};
+
+const struct bpf_prog_ops tc_qdisc_prog_ops = {
+	.test_run		= bpf_prog_test_run_skb,
+};
+
 const struct bpf_verifier_ops xdp_verifier_ops = {
 	.get_func_proto		= xdp_func_proto,
 	.is_valid_access	= xdp_is_valid_access,
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 1e8ab4749c6c..19f68aac79b1 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -439,6 +439,21 @@ config NET_SCH_ETS
 
 	  If unsure, say N.
 
+config NET_SCH_BPF
+	tristate "eBPF based programmable queue discipline"
+	help
+	  This eBPF based queue discipline offers a way to program your
+	  own packet scheduling algorithm. This is a classful qdisc which
+	  also allows you to decide the hierarchy.
+
+	  Say Y here if you want to use the eBPF based programmable queue
+	  discipline.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_bpf.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index dd14ef413fda..9ef0d579f5ff 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -65,6 +65,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE)	+= sch_fq_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF)	+= sch_bpf.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
new file mode 100644
index 000000000000..e9eba23ef4d1
--- /dev/null
+++ b/net/sched/sch_bpf.c
@@ -0,0 +1,485 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Programmable Qdisc with eBPF
+ *
+ * Copyright (C) 2022, ByteDance, Cong Wang <cong.wang@bytedance.com>
+ */
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/slab.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+#define ACT_BPF_NAME_LEN	256
+
+struct sch_bpf_prog {
+	struct bpf_prog *prog;
+	const char *name;
+};
+
+struct sch_bpf_class {
+	struct Qdisc_class_common common;
+	struct Qdisc *qdisc;
+
+	unsigned int drops;
+	unsigned int overlimits;
+	struct gnet_stats_basic_sync bstats;
+};
+
+struct sch_bpf_qdisc {
+	struct tcf_proto __rcu *filter_list; /* optional external classifier */
+	struct tcf_block *block;
+	struct Qdisc_class_hash clhash;
+	struct sch_bpf_prog enqueue_prog;
+	struct sch_bpf_prog dequeue_prog;
+
+	struct qdisc_watchdog watchdog;
+};
+
+static int sch_bpf_dump_prog(const struct sch_bpf_prog *prog, struct sk_buff *skb,
+			     int name, int id, int tag)
+{
+	struct nlattr *nla;
+
+	if (prog->name &&
+	    nla_put_string(skb, name, prog->name))
+		return -EMSGSIZE;
+
+	if (nla_put_u32(skb, id, prog->prog->aux->id))
+		return -EMSGSIZE;
+
+	nla = nla_reserve(skb, tag, sizeof(prog->prog->tag));
+	if (!nla)
+		return -EMSGSIZE;
+
+	memcpy(nla_data(nla), prog->prog->tag, nla_len(nla));
+	return 0;
+}
+
+static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct nlattr *opts;
+	u32 bpf_flags = 0;
+
+	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (bpf_flags && nla_put_u32(skb, TCA_SCH_BPF_FLAGS, bpf_flags))
+		goto nla_put_failure;
+
+	if (sch_bpf_dump_prog(&q->enqueue_prog, skb, TCA_SCH_BPF_ENQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_ENQUEUE_PROG_ID, TCA_SCH_BPF_ENQUEUE_PROG_TAG))
+		goto nla_put_failure;
+	if (sch_bpf_dump_prog(&q->dequeue_prog, skb, TCA_SCH_BPF_DEQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_DEQUEUE_PROG_ID, TCA_SCH_BPF_DEQUEUE_PROG_TAG))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	return -1;
+}
+
+static int sch_bpf_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	return 0;
+}
+
+static struct sch_bpf_class *sch_bpf_find(struct Qdisc *sch, u32 classid)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct Qdisc_class_common *clc;
+
+	clc = qdisc_class_find(&q->clhash, classid);
+	if (!clc)
+		return NULL;
+	return container_of(clc, struct sch_bpf_class, common);
+}
+
+static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			   struct sk_buff **to_free)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	unsigned int len = qdisc_pkt_len(skb);
+	struct sch_bpf_ctx ctx = {};
+	struct sch_bpf_class *cl;
+	int res = NET_XMIT_SUCCESS;
+	struct bpf_prog *enqueue;
+	s64 now;
+
+	enqueue = rcu_dereference(q->enqueue_prog.prog);
+	bpf_compute_data_pointers(skb);
+	ctx.skb = (struct __sk_buff *)skb;
+	ctx.classid = sch->handle;
+	res = bpf_prog_run(enqueue, &ctx);
+	switch (res) {
+	case SCH_BPF_THROTTLE:
+		now = ktime_get_ns();
+		qdisc_watchdog_schedule_ns(&q->watchdog, now + ctx.delay);
+		qdisc_qstats_overlimit(sch);
+		fallthrough;
+	case SCH_BPF_QUEUED:
+		return NET_XMIT_SUCCESS;
+	case SCH_BPF_CN:
+		return NET_XMIT_CN;
+	case SCH_BPF_PASS:
+		break;
+	default:
+		__qdisc_drop(skb, to_free);
+		return NET_XMIT_DROP;
+	}
+
+	cl = sch_bpf_find(sch, ctx.classid);
+	if (!cl || !cl->qdisc) {
+		if (res & __NET_XMIT_BYPASS)
+			qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		return res;
+	}
+
+	res = qdisc_enqueue(skb, cl->qdisc, to_free);
+	if (res != NET_XMIT_SUCCESS) {
+		if (net_xmit_drop_count(res)) {
+			qdisc_qstats_drop(sch);
+			cl->drops++;
+		}
+		return res;
+	}
+
+	sch->qstats.backlog += len;
+	sch->q.qlen++;
+	return res;
+}
+
+static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct sk_buff *ret = NULL;
+	struct sch_bpf_ctx ctx = {};
+	struct bpf_prog *dequeue;
+	struct sch_bpf_class *cl;
+	s64 now;
+	int res;
+
+	dequeue = rcu_dereference(q->dequeue_prog.prog);
+	ctx.classid = sch->handle;
+	res = bpf_prog_run(dequeue, &ctx);
+	switch (res) {
+	case SCH_BPF_DEQUEUED:
+		ret = (struct sk_buff *)ctx.skb;
+		break;
+	case SCH_BPF_THROTTLE:
+		now = ktime_get_ns();
+		qdisc_watchdog_schedule_ns(&q->watchdog, now + ctx.delay);
+		qdisc_qstats_overlimit(sch);
+		cl->overlimits++;
+		return NULL;
+	case SCH_BPF_PASS:
+		cl = sch_bpf_find(sch, ctx.classid);
+		if (!cl || !cl->qdisc)
+			return NULL;
+		ret = qdisc_dequeue_peeked(cl->qdisc);
+		if (ret) {
+			qdisc_bstats_update(sch, ret);
+			qdisc_qstats_backlog_dec(sch, ret);
+			sch->q.qlen--;
+		}
+	}
+
+	return ret;
+}
+
+static struct Qdisc *sch_bpf_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	return cl->qdisc;
+}
+
+static int sch_bpf_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
+			 struct Qdisc **old, struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	if (new)
+		*old = qdisc_replace(sch, new, &cl->qdisc);
+	return 0;
+}
+
+static unsigned long sch_bpf_bind(struct Qdisc *sch, unsigned long parent,
+				  u32 classid)
+{
+	return 0;
+}
+
+static void sch_bpf_unbind(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static unsigned long sch_bpf_search(struct Qdisc *sch, u32 handle)
+{
+	return (unsigned long)sch_bpf_find(sch, handle);
+}
+
+static struct tcf_block *sch_bpf_tcf_block(struct Qdisc *sch, unsigned long cl,
+					   struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return q->block;
+}
+
+static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = {
+	[TCA_SCH_BPF_FLAGS]		= { .type = NLA_U32 },
+	[TCA_SCH_BPF_ENQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_ENQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+	[TCA_SCH_BPF_DEQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_DEQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+};
+
+static int bpf_init_prog(struct nlattr *fd, struct nlattr *name, struct sch_bpf_prog *prog)
+{
+	char *prog_name = NULL;
+	struct bpf_prog *fp;
+	u32 bpf_fd;
+
+	if (!fd)
+		return -EINVAL;
+	bpf_fd = nla_get_u32(fd);
+
+	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_SCHED_QDISC);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	if (name) {
+		prog_name = nla_memdup(name, GFP_KERNEL);
+		if (!prog_name) {
+			bpf_prog_put(fp);
+			return -ENOMEM;
+		}
+	}
+
+	prog->name = prog_name;
+	prog->prog = fp;
+	return 0;
+}
+
+static void bpf_cleanup_prog(struct sch_bpf_prog *prog)
+{
+	if (prog->prog)
+		bpf_prog_put(prog->prog);
+	kfree(prog->name);
+}
+
+static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_SCH_BPF_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested_deprecated(tb, TCA_SCH_BPF_MAX, opt,
+					  sch_bpf_policy, NULL);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_SCH_BPF_FLAGS]) {
+		u32 bpf_flags = nla_get_u32(tb[TCA_SCH_BPF_FLAGS]);
+
+		if (bpf_flags & ~TCA_SCH_BPF_FLAG_DIRECT)
+			return -EINVAL;
+	}
+
+	err = bpf_init_prog(tb[TCA_SCH_BPF_ENQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog);
+	if (err)
+		return err;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_DEQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog);
+	return err;
+}
+
+static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	int err;
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+	if (opt) {
+		err = sch_bpf_change(sch, opt, extack);
+		if (err)
+			return err;
+	}
+
+	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
+	if (err)
+		return err;
+
+	return qdisc_class_hash_init(&q->clhash);
+}
+
+static void sch_bpf_reset(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static void sch_bpf_destroy(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+	tcf_block_put(q->block);
+	qdisc_class_hash_destroy(&q->clhash);
+	bpf_cleanup_prog(&q->enqueue_prog);
+	bpf_cleanup_prog(&q->dequeue_prog);
+}
+
+static int sch_bpf_change_class(struct Qdisc *sch, u32 classid,
+				u32 parentid, struct nlattr **tca,
+				unsigned long *arg,
+				struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)*arg;
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	if (!cl) {
+		cl = kzalloc(sizeof(*cl), GFP_KERNEL);
+		if (!cl)
+			return -ENOBUFS;
+		qdisc_class_hash_insert(&q->clhash, &cl->common);
+	}
+
+	qdisc_class_hash_grow(sch, &q->clhash);
+	*arg = (unsigned long)cl;
+	return 0;
+}
+
+static int sch_bpf_delete(struct Qdisc *sch, unsigned long arg,
+			  struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	qdisc_class_hash_remove(&q->clhash, &cl->common);
+	if (cl->qdisc)
+		qdisc_put(cl->qdisc);
+	return 0;
+}
+
+static int sch_bpf_dump_class(struct Qdisc *sch, unsigned long arg,
+			      struct sk_buff *skb, struct tcmsg *tcm)
+{
+	return 0;
+}
+
+static int
+sch_bpf_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct gnet_stats_queue qs = {
+		.drops = cl->drops,
+		.overlimits = cl->overlimits,
+	};
+	__u32 qlen = 0;
+
+	if (cl->qdisc)
+		qdisc_qstats_qlen_backlog(cl->qdisc, &qlen, &qs.backlog);
+	else
+		qlen = 0;
+
+	if (gnet_stats_copy_basic(d, NULL, &cl->bstats, true) < 0 ||
+	    gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0)
+		return -1;
+	return 0;
+}
+
+static void sch_bpf_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			if (arg->count < arg->skip) {
+				arg->count++;
+				continue;
+			}
+			if (arg->fn(sch, (unsigned long)cl, arg) < 0) {
+				arg->stop = 1;
+				return;
+			}
+			arg->count++;
+		}
+	}
+}
+
+static const struct Qdisc_class_ops sch_bpf_class_ops = {
+	.graft		=	sch_bpf_graft,
+	.leaf		=	sch_bpf_leaf,
+	.find		=	sch_bpf_search,
+	.change		=	sch_bpf_change_class,
+	.delete		=	sch_bpf_delete,
+	.tcf_block	=	sch_bpf_tcf_block,
+	.bind_tcf	=	sch_bpf_bind,
+	.unbind_tcf	=	sch_bpf_unbind,
+	.dump		=	sch_bpf_dump_class,
+	.dump_stats	=	sch_bpf_dump_class_stats,
+	.walk		=	sch_bpf_walk,
+};
+
+static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
+	.cl_ops		=	&sch_bpf_class_ops,
+	.id		=	"bpf",
+	.priv_size	=	sizeof(struct sch_bpf_qdisc),
+	.enqueue	=	sch_bpf_enqueue,
+	.dequeue	=	sch_bpf_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	sch_bpf_init,
+	.reset		=	sch_bpf_reset,
+	.destroy	=	sch_bpf_destroy,
+	.change		=	sch_bpf_change,
+	.dump		=	sch_bpf_dump,
+	.dump_stats	=	sch_bpf_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init sch_bpf_mod_init(void)
+{
+	return register_qdisc(&sch_bpf_qdisc_ops);
+}
+
+static void __exit sch_bpf_mod_exit(void)
+{
+	unregister_qdisc(&sch_bpf_qdisc_ops);
+}
+
+module_init(sch_bpf_mod_init)
+module_exit(sch_bpf_mod_exit)
+MODULE_AUTHOR("Cong Wang");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("eBPF queue discipline");
-- 
2.34.1



* [RFC Patch v5 5/5] net_sched: introduce helper bpf_skb_tc_classify()
  2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (3 preceding siblings ...)
  2022-06-02  4:10 ` [RFC Patch v5 4/5] net_sched: introduce eBPF based Qdisc Cong Wang
@ 2022-06-02  4:10 ` Cong Wang
  2022-06-24 16:52 ` [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Toke Høiland-Jørgensen
  2022-06-24 20:51 ` sdf
  6 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2022-06-02  4:10 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

Introduce an eBPF helper function bpf_skb_tc_classify() to reuse existing
TC filters on *any* Qdisc to classify the skb.
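
For example, the enqueue program of sch_bpf could delegate classification
to the TC filters already attached to an existing qdisc, possibly on a
different netdev. A rough, untested sketch (the ifindex/handle values and
the "sched_qdisc" section name are arbitrary placeholders):

SEC("sched_qdisc")
int enqueue(struct sch_bpf_ctx *ctx)
{
	/* Run the filters attached to qdisc 1: on ifindex 4 and use the
	 * returned class id to pick a child class of this qdisc.
	 */
	__u32 classid = bpf_skb_tc_classify(ctx->skb, 4, 0x00010000);

	if (!classid)
		return SCH_BPF_DROP;
	ctx->classid = classid;
	return SCH_BPF_PASS;
}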

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c        |  5 +++
 net/sched/cls_api.c      | 69 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 75 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 148ec0c4e643..ad65859abbd5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5462,6 +5462,7 @@ union bpf_attr {
 	FN(skb_map_pop),		\
 	FN(flow_map_push),		\
 	FN(flow_map_pop),		\
+	FN(skb_tc_classify),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 1205298a17ca..8bd8cf5d5d20 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7817,6 +7817,7 @@ const struct bpf_func_proto bpf_skb_map_push_proto __weak;
 const struct bpf_func_proto bpf_skb_map_pop_proto __weak;
 const struct bpf_func_proto bpf_flow_map_push_proto __weak;
 const struct bpf_func_proto bpf_flow_map_pop_proto __weak;
+const struct bpf_func_proto bpf_skb_tc_classify_proto __weak;
 
 static const struct bpf_func_proto *
 tc_qdisc_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
@@ -7830,6 +7831,10 @@ tc_qdisc_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_flow_map_push_proto;
 	case BPF_FUNC_flow_map_pop:
 		return &bpf_flow_map_pop_proto;
+#ifdef CONFIG_NET_CLS_ACT
+	case BPF_FUNC_skb_tc_classify:
+		return &bpf_skb_tc_classify_proto;
+#endif
 	default:
 		return tc_cls_act_func_proto(func_id, prog);
 	}
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 9bb4d3dcc994..86a78265bc31 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -22,6 +22,7 @@
 #include <linux/idr.h>
 #include <linux/jhash.h>
 #include <linux/rculist.h>
+#include <linux/filter.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/netlink.h>
@@ -1654,6 +1655,74 @@ int tcf_classify(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(tcf_classify);
 
+#ifdef CONFIG_BPF_SYSCALL
+BPF_CALL_3(bpf_skb_tc_classify, struct sk_buff *, skb, int, ifindex, u32, handle)
+{
+	struct net *net = dev_net(skb->dev);
+	const struct Qdisc_class_ops *cops;
+	struct tcf_result res = {};
+	struct tcf_block *block;
+	struct tcf_chain *chain;
+	struct net_device *dev;
+	unsigned long cl = 0;
+	struct Qdisc *q;
+	int result;
+
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, ifindex);
+	if (!dev)
+		goto out;
+	q = qdisc_lookup_rcu(dev, handle);
+	if (!q)
+		goto out;
+
+	cops = q->ops->cl_ops;
+	if (!cops)
+		goto out;
+	if (!cops->tcf_block)
+		goto out;
+	if (TC_H_MIN(handle)) {
+		cl = cops->find(q, handle);
+		if (cl == 0)
+			goto out;
+	}
+	block = cops->tcf_block(q, cl, NULL);
+	if (!block)
+		goto out;
+
+	for (chain = tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain = tcf_get_next_chain(block, chain)) {
+		struct tcf_proto *tp;
+
+		result = tcf_classify(skb, NULL, tp, &res, false);
+		if (result  >= 0) {
+			switch (result) {
+			case TC_ACT_QUEUED:
+			case TC_ACT_STOLEN:
+			case TC_ACT_TRAP:
+				fallthrough;
+			case TC_ACT_SHOT:
+				rcu_read_unlock();
+				return 0;
+			}
+		}
+	}
+out:
+	rcu_read_unlock();
+	return res.class;
+}
+
+const struct bpf_func_proto bpf_skb_tc_classify_proto = {
+	.func		= bpf_skb_tc_classify,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+#endif
+
 struct tcf_chain_info {
 	struct tcf_proto __rcu **pprev;
 	struct tcf_proto __rcu *next;
-- 
2.34.1



* Re: [RFC Patch v5 4/5] net_sched: introduce eBPF based Qdisc
  2022-06-02  4:10 ` [RFC Patch v5 4/5] net_sched: introduce eBPF based Qdisc Cong Wang
@ 2022-06-03 23:14   ` Cong Wang
  0 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2022-06-03 23:14 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang

On Wed, Jun 01, 2022 at 09:10:27PM -0700, Cong Wang wrote:
> Because eBPF maps are not directly visible to this Qdisc, we have to
> rely on user-space dumping to retrieve the stats; this is not
> implemented yet as I haven't found the right API to do it.
> 

I think an iterator for skb map could do this job, so I just implemented
it here:
https://github.com/congwang/linux/commit/aed8536a94836bf6df69a379c43b03e50bf4f813
which will be in the next update.

Thanks.


* Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc
  2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (4 preceding siblings ...)
  2022-06-02  4:10 ` [RFC Patch v5 5/5] net_sched: introduce helper bpf_skb_tc_classify() Cong Wang
@ 2022-06-24 16:52 ` Toke Høiland-Jørgensen
  2022-06-24 17:35   ` Dave Taht
  2022-06-24 20:51 ` sdf
  6 siblings, 1 reply; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-06-24 16:52 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: bpf, Cong Wang, Jamal Hadi Salim, Jiri Pirko

Cong Wang <xiyou.wangcong@gmail.com> writes:

> From: Cong Wang <cong.wang@bytedance.com>
>
> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

Sorry for the delay in looking at this; a few comments below:

[...]

> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, whether in
> tree or out of tree. This is why I gave up on PIFO, which has serious
> limitations on programmability.

Could you elaborate on the limitations of the PIFO? It looks to me like
the SKB map you're proposing is very close to a PIFO (it's a priority
queue that SKBs can be inserted into at an arbitrary position)?

> Here is a summary of design decisions I made:
>
> 1. Avoid eBPF struct_ops, as it would be really hard to program
>    a Qdisc with this approach: literally all of struct Qdisc_ops
>    and struct Qdisc_class_ops would have to be implemented, which is
>    almost as hard as programming a Qdisc kernel module.

I agree that implementing the full Qdisc_class_ops as BPF functions is
going to be prohibitive; let's use this opportunity to define a simpler
API!

> 2. Introduce skb map, which will allow other eBPF programs to store skb's
>    too.
>
>    a) As eBPF maps are not directly visible to the kernel, we have to
>    dump the stats via eBPF map API's instead of netlink.

Why not do a hybrid thing where the kernel side of the qdisc keeps some
basic stats (packet counter for number of packets queued/dropped for
instance) and define a separate BPF callback that can return more stats
to the kernel for translation into netlink (e.g., "how many packets are
currently in the queue")? And then if a particular implementation wants
to do more custom stats, they can use BPF APIs for that?
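
Something along these lines, perhaps; every name below is made up, it is just
meant to show the shape of such a stats callback:

#include <linux/types.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Counters the enqueue/dequeue programs would maintain themselves. */
__u64 queued_pkts;
__u64 queued_bytes;

/* Hypothetical stats struct and attach point: the qdisc calls this program
 * and copies the result into its netlink dump.
 */
struct sch_bpf_qstats {
	__u64 qlen;
	__u64 backlog;
};

SEC("qdisc/stats")
int stats_cb(struct sch_bpf_qstats *st)
{
	st->qlen = queued_pkts;
	st->backlog = queued_bytes;
	return 0;
}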

>    b) The user-space is not allowed to read the entire packets, only
>    __sk_buff itself is readable, because we don't have such a use case
>    yet and it would require a different API to read the data, as map
>    values have fixed length.

I agree there's not any need to make the packet contents directly
accessible to userspace via reading the map.

>    c) Two eBPF helpers are introduced for skb map operations:
>    bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
>    not allowed.

So with kptr support in the map this could conceivably be done via the
existing map_push() etc helpers?

>    d) Multi-queue support is implemented via map-in-map, in a similar
>    push/pop fasion.

Not sure I understand what you mean by this? Will the qdisc
automatically attach itself to all HWQs of an interface (like sch_mq
does)? Surely it will be up to the BPF program whether it'll use
map-in-map or something else?

>    e) Use the netdevice notifier to reset the packets inside skb map upon
>    NETDEV_DOWN event.

Is it really necessary to pull out SKBs from under the BPF program? If
the qdisc is removed and the BPF program goes away, so will the map and
all SKBs stored in it?

> 3. Integrate with existing TC infra. For example, if the user doesn't want
>    to implement her own filters (e.g. a flow dissector), she should be able
>    to re-use the existing TC filters. Another helper bpf_skb_tc_classify() is
>    introduced for this purpose.

This seems a bit convoluted? If a BPF program wants to reuse an existing
BPF TC filter it can just do that at the code level. As for the
in-kernel filters, are there really any of them worth reusing from a BPF
qdisc other than the flow dissector? In which case, why not just expose
that instead of taking all the TC filter infrastructure along with it?


Bringing in some text from patch 4 as well to comment on it all in one go:

> Introduce a new Qdisc which is completely managed by eBPF program
> of type BPF_PROG_TYPE_SCHED_QDISC. It accepts two eBPF programs of the
> same type, but one for enqueue and the other for dequeue.
> 
> And it interacts with Qdisc layer in two ways:
> 1) It relies on Qdisc watchdog to handle throttling;

Having a way for BPF to schedule the qdisc watchdog wakeup is probably
needed. For the XDP queueing side I'm planning to use BPF timers for
(the equivalent of) this, but since we already have a watchdog mechanism
on the qdisc side maybe just exposing that is fine...

> 2) It could pass the skb enqueue/dequeue down to child classes

I'm not sure I think it's a good idea to keep the qdisc hierarchy. If
someone wants to build a classification hierarchy they could just do
this directly in BPF by whichever mechanism they want? I.e., it can just
instantiate multiple skb maps in a single program to do classful
queueing and select the right one however it wants?

Keeping the qdisc itself as simple as, basically:

enqueue: pass skb to BPF program to do with as it will (enqueue into a
map, drop, etc)

dequeue: execute BPF program, transmit pkt if it returns one

-Toke


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc
  2022-06-24 16:52 ` [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Toke Høiland-Jørgensen
@ 2022-06-24 17:35   ` Dave Taht
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Taht @ 2022-06-24 17:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Cong Wang, Linux Kernel Network Developers, bpf, Cong Wang,
	Jamal Hadi Salim, Jiri Pirko

In my world, it's inbound shaping that's the biggest problem, and my
dream has long been to somehow have a way to do:

tc qdisc add dev eth0 ingress cake bandwidth 800mbit

and have that work across multiple cores, without act_mirred and
without any filter magic. Getting there via eBPF as in
https://github.com/rchac/LibreQoS is working for ISPs (plea: anyone
got time to make ipv6 work there?), but running out of cpu for inbound
shaping on the current act_mirred path we use is what's killing us.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc
  2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (5 preceding siblings ...)
  2022-06-24 16:52 ` [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Toke Høiland-Jørgensen
@ 2022-06-24 20:51 ` sdf
  6 siblings, 0 replies; 11+ messages in thread
From: sdf @ 2022-06-24 20:51 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, bpf, Cong Wang, Toke Høiland-Jørgensen,
	Jamal Hadi Salim, Jiri Pirko

On 06/01, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>

> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

> There are a few use cases:

> 1. Allow customizing Qdisc's in an easier way. So that people don't
>     have to write a complete Qdisc kernel module just to experiment
>     some new queuing theory.

> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>     is before enqueue, it is impossible to adjust those "tokens" after
>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>     be solved with a shared map between clsact and sch_bpf.

> 3. Replace qevents, as now the user gains much more control over the
>     skb and queues.

> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>     chain and block to reuse the TC filters, but they are too complicated
>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>     TC filters on _any_ Qdisc (even on a different netdev) to do the
>     classification.

> 5. Potentially pave a way for ingress to queue packets, although
>     current implementation is still only for egress.

> 6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
>     is already used by TCP to handle TCP retransmission.

> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, no matter
> in tree or out of tree. This is why I give up on PIFO which has
> serious limitations on the programmablity.

> Here is a summary of design decisions I made:

> 1. Avoid eBPF struct_ops, as it would be really hard to program
>     a Qdisc with this approach, literally all the struct Qdisc_ops
>     and struct Qdisc_class_ops are needed to implement. This is almost
>     as hard as programming a Qdisc kernel module.

> 2. Introduce skb map, which will allow other eBPF programs to store skb's
>     too.

>     a) As eBPF maps are not directly visible to the kernel, we have to
>     dump the stats via eBPF map API's instead of netlink.

>     b) The user-space is not allowed to read the entire packets, only  
> __sk_buff
>     itself is readable, because we don't have such a use case yet and it  
> would
>     require a different API to read the data, as map values have fixed  
> length.

>     c) Two eBPF helpers are introduced for skb map operations:
>     bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
>     not allowed.

>     d) Multi-queue support is implemented via map-in-map, in a similar
>     push/pop fasion.

>     e) Use the netdevice notifier to reset the packets inside skb map upon
>     NETDEV_DOWN event.

> 3. Integrate with existing TC infra. For example, if the user doesn't want
>     to implement her own filters (e.g. a flow dissector), she should be  
> able
>     to re-use the existing TC filters. Another helper  
> bpf_skb_tc_classify() is
>     introduced for this purpose.

> Any high-level feedback is welcome. Please kindly do not review any coding
> details until RFC tag is removed.

> TODO:
> 1. actually test it

Can you try to implement an existing qdisc using your new mechanism?
For BPF-CC, Martin showcased how dctcp/cubic can be reimplemented;
I feel like this patch series (even as an RFC) should also have a good example
to show that the bpf qdisc is on par and can be used to at least implement
existing policies. fq/fq_codel/cake are good candidates.
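
To make the ask concrete, a pfifo-style enqueue/dequeue pair on top of the
RFC's skb map could look roughly like this; every name not taken from the
cover letter (map type, section names, helper signatures, verdict codes) is a
placeholder guess, only meant to show the shape:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Placeholder verdict codes; the real ones would come from the sch_bpf UAPI. */
enum { SCH_BPF_OK = 0, SCH_BPF_DROP = 2 };

struct {
	__uint(type, BPF_MAP_TYPE_SKB_MAP);	/* skb map type added by the RFC */
	__uint(max_entries, 1000);
} fifo SEC(".maps");

/* Hand-rolled declarations: these helpers only exist with the RFC applied. */
static long (*bpf_skb_map_push)(void *map, struct __sk_buff *skb, __u64 rank) =
	(void *)BPF_FUNC_skb_map_push;
static long (*bpf_skb_map_pop)(void *map, struct __sk_buff *skb) =
	(void *)BPF_FUNC_skb_map_pop;

__u64 seq;	/* FIFO == a priority queue keyed on arrival order */

SEC("qdisc/enqueue")
int fifo_enqueue(struct __sk_buff *skb)
{
	if (bpf_skb_map_push(&fifo, skb, __sync_fetch_and_add(&seq, 1)))
		return SCH_BPF_DROP;	/* map full: tail-drop */
	return SCH_BPF_OK;
}

SEC("qdisc/dequeue")
int fifo_dequeue(struct __sk_buff *skb)
{
	/* Pop the lowest rank, i.e. the oldest skb; how the popped skb is
	 * handed back to the qdisc is defined by the RFC, so this only
	 * reports whether anything was queued.
	 */
	return bpf_skb_map_pop(&fifo, skb) == 0 ? SCH_BPF_OK : SCH_BPF_DROP;
}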

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc
@ 2022-09-22  7:46 杨培灏
  0 siblings, 0 replies; 11+ messages in thread
From: 杨培灏 @ 2022-09-22  7:46 UTC (permalink / raw)
  To: sdf, YrYj3LPaHV7thgJW
  Cc: bpf, cong.wang, jhs, jiri, netdev, toke, xiyou.wangcong

On 06/24, sdf wrote:

> I feel like this patch series (even as an RFC) should also have a good example
> to show that the bpf qdisc is on par and can be used to at least implement
> existing policies.

I recently built an example of the simplest `pfifo` qdisc based on Cong Wang's patches, see
https://github.com/Forsworns/sch_bpf/
if you are interested. The sample is a little large and involves modifications to libbpf,
so I didn't attach it as a patch here.

I also ran a micro-benchmark with iperf3, see
https://gist.github.com/Forsworns/71b70ed67d2ac53a4316bac862ce7d0f

I know it's unfair to evaluate sch_bpf against `pfifo`, since sch_bpf is based on an rbtree.
But I think it is a good starting point for other classful and more complicated qdiscs.
`pfifo` is simple enough as a start :)

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-09-22  8:03 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 1/5] net: introduce skb_rbtree_walk_safe() Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 2/5] bpf: move map in map declarations to bpf.h Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 3/5] bpf: introduce skb map and flow map Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 4/5] net_sched: introduce eBPF based Qdisc Cong Wang
2022-06-03 23:14   ` Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 5/5] net_sched: introduce helper bpf_skb_tc_classify() Cong Wang
2022-06-24 16:52 ` [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Toke Høiland-Jørgensen
2022-06-24 17:35   ` Dave Taht
2022-06-24 20:51 ` sdf
2022-09-22  7:46 杨培灏
