bpf.vger.kernel.org archive mirror
* [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc
@ 2022-10-05 17:17 Cong Wang
  2022-10-05 17:17 ` [RFC Patch v6 1/5] bpf: Introduce rbtree map Cong Wang
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Cong Wang @ 2022-10-05 17:17 UTC (permalink / raw)
  To: netdev; +Cc: yangpeihao, toke, jhs, jiri, bpf, sdf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

This *incomplete* patchset introduces a programmable Qdisc with eBPF.

There are a few use cases:

1. Allow customizing Qdisc's in an easier way, so that people don't
   have to write a complete Qdisc kernel module just to experiment
   with a new queueing algorithm.

2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which runs
   before enqueue, so it is impossible to adjust those "tokens" after
   packets get dropped in enqueue. With an eBPF Qdisc, this is easily
   solved with a map shared between clsact and sch_bpf (see the sketch
   after this list).

3. Replace qevents, as now the user gains much more control over the
   skb and queues.

4. Provide a new way to reuse TC filters. Currently TC relies on filter
   chain and block to reuse the TC filters, but they are too complicated
   to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
   TC filters on _any_ Qdisc (even on a different netdev) to do the
   classification.

5. Potentially pave a way for ingress to queue packets, although
   current implementation is still only for egress.

6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
   is already used by TCP to handle TCP retransmission.

7. Potentially pave a way for implementing an eBPF based CPU scheduler,
   because we already have task kptrs and CFS is clearly based on an
   rbtree too.
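
As an illustration of use case 2, here is a minimal sketch of the
shared-map idea; the map name, layout and pinning below are purely
illustrative and not part of this patchset:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical token bucket shared between the clsact EDT program and
 * the sch_bpf enqueue program (both open the same pinned map).
 */
struct edt_state {
	__u64 tokens;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct edt_state);
	__uint(pinning, LIBBPF_PIN_BY_NAME);
} edt_tokens SEC(".maps");

/* Called from the sch_bpf enqueue program when it decides to drop: give
 * back the tokens that clsact already charged for this packet, which is
 * impossible today because clsact runs before enqueue.
 */
static __always_inline void edt_refund(__u64 wire_len)
{
	__u32 key = 0;
	struct edt_state *st = bpf_map_lookup_elem(&edt_tokens, &key);

	if (st)
		__sync_fetch_and_add(&st->tokens, wire_len);
}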

The goal here is to make this Qdisc as programmable as possible,
that is, to replace as many existing Qdisc's as we can, whether in
tree or out of tree. This is why I gave up on PIFO, which has serious
limitations on programmability. More importantly, with an rbtree it is
easy to implement the PIFO logic too.

Here is a summary of design decisions I made:

1. Avoid eBPF struct_ops, as it would be really hard to program
   a Qdisc with this approach: literally all of struct Qdisc_ops
   and struct Qdisc_class_ops would have to be implemented, which is
   almost as hard as writing a Qdisc kernel module.

2. Introduce a generic rbtree map, which supports kptr, so that skb's
   can be stored inside.

   a) As eBPF maps are not directly visible to the Qdisc layer, we have to
   dump the stats via the eBPF map APIs instead of netlink (see the
   user-space sketch after this list).

   b) User space is not allowed to read the entire packet; only the __sk_buff
   itself is readable, because we don't have such a use case yet and reading
   the packet data would require a different API, as map values have a fixed
   length.

   c) Multi-queue support is implemented via map-in-map, in a similar
   push/pop fashion, also based on rbtree.

   d) Use the netdevice notifier to reset the packets inside skb map upon
   NETDEV_DOWN event. (TODO: need to recognize kptr type)

3. Integrate with existing TC infra. For example, if the user doesn't want
   to implement her own filters (e.g. a flow dissector), she should be able
   to re-use the existing TC filters. Another helper bpf_skb_tc_classify() is
   introduced for this purpose.
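
To make 2.a) concrete, here is a user-space sketch; it assumes the stats
map fd has already been obtained via libbpf, and the key/value layout is
illustrative, not something this patchset mandates:

#include <stdio.h>
#include <bpf/bpf.h>

/* Walk a map with the regular map syscalls instead of netlink. The u64
 * key and the stats layout are illustrative; the real layout is whatever
 * the Qdisc's eBPF program defines.
 */
struct q_stats {
	__u64 packets;
	__u64 bytes;
};

static void dump_qdisc_stats(int map_fd)
{
	__u64 key, next_key;
	struct q_stats val;
	void *prev = NULL;

	while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
		if (bpf_map_lookup_elem(map_fd, &next_key, &val) == 0)
			printf("key %llu: %llu pkts, %llu bytes\n",
			       (unsigned long long)next_key,
			       (unsigned long long)val.packets,
			       (unsigned long long)val.bytes);
		key = next_key;
		prev = &key;
	}
}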

Any high-level feedback is welcome. Please kindly ignore any coding details
until RFC tag is removed.

TODO:
1. actually test it
2. write a document for this Qdisc
3. add test cases and sample code

Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
v6: switch to kptr based approach

v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
    implement flow map as map-in-map
    rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
    clean up eBPF qdisc program context

v4: get rid of PIFO, use rbtree directly

v3: move priority queue from sch_bpf to skb map
    introduce skb map and its helpers
    introduce bpf_skb_classify()
    use netdevice notifier to reset skb's
    Rebase on latest bpf-next

v2: Rebase on latest net-next
    Make the code more complete (but still incomplete)

Cong Wang (5):
  bpf: Introduce rbtree map
  bpf: Add map in map support to rbtree
  net_sched: introduce eBPF based Qdisc
  net_sched: Add kfuncs for storing skb
  net_sched: Introduce helper bpf_skb_tc_classify()

 include/linux/bpf.h            |   4 +
 include/linux/bpf_types.h      |   4 +
 include/uapi/linux/bpf.h       |  19 ++
 include/uapi/linux/pkt_sched.h |  17 +
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/rbtree.c            | 603 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |   7 +
 net/core/filter.c              |  27 ++
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/cls_api.c            |  69 ++++
 net/sched/sch_bpf.c            | 564 ++++++++++++++++++++++++++++++
 12 files changed, 1331 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/rbtree.c
 create mode 100644 net/sched/sch_bpf.c

-- 
2.34.1



* [RFC Patch v6 1/5] bpf: Introduce rbtree map
  2022-10-05 17:17 [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
@ 2022-10-05 17:17 ` Cong Wang
  2022-10-11 15:19   ` Kumar Kartikeya Dwivedi
  2022-10-11 17:05   ` Dave Marchevsky
  2022-10-05 17:17 ` [RFC Patch v6 2/5] bpf: Add map in map support to rbtree Cong Wang
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 9+ messages in thread
From: Cong Wang @ 2022-10-05 17:17 UTC (permalink / raw)
  To: netdev; +Cc: yangpeihao, toke, jhs, jiri, bpf, sdf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

Introduce a generic rbtree map, ordered by memcmp() on the key, which
supports the usual map operations:

Insert:
bpf_map_update_elem(&map, &key, &val, flags);

Delete a specific key-val pair:
bpf_map_delete_elem(&map, &key);

Pop the minimum one:
bpf_map_pop_elem(&map, &val);

Lookup:
val = bpf_map_lookup_elem(&map, &key);

Iterate:
bpf_for_each_map_elem(&map, callback_fn, callback_ctx, flags);
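
For illustration, a minimal BPF-side sketch of how the map could be
declared and driven; the key/value layout, and the assumption that
bpf_map_pop_elem() is usable from program context for this map type, are
mine, not something this patch mandates:

/* BPF_MAP_TYPE_RBTREE comes from the patched <linux/bpf.h>. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Elements are ordered by memcmp() on the 8-byte key. */
struct {
	__uint(type, BPF_MAP_TYPE_RBTREE);
	__uint(max_entries, 1024);
	__type(key, __u64);
	__type(value, __u64);
} rank_tree SEC(".maps");

SEC("tc")
int rbtree_demo(struct __sk_buff *skb)
{
	__u64 key = bpf_ktime_get_ns();	/* rank: e.g. a timestamp */
	__u64 val = skb->len;

	/* insert, keyed by rank */
	bpf_map_update_elem(&rank_tree, &key, &val, BPF_ANY);

	/* pop the element with the smallest rank */
	if (bpf_map_pop_elem(&rank_tree, &val) == 0)
		bpf_printk("min element has len %llu", val);
	return 0;
}

char _license[] SEC("license") = "GPL";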

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/Makefile       |   2 +-
 kernel/bpf/rbtree.c       | 445 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 448 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/rbtree.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2c6a4f2562a7..c53ba6de1613 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -127,6 +127,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE, rbtree_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 51b9aa640ad2..9492cd3af701 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -935,6 +935,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
+	BPF_MAP_TYPE_RBTREE,
 };
 
 /* Note that tracing related programs such as
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..e60249258c74 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -7,7 +7,7 @@ endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o rbtree.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
new file mode 100644
index 000000000000..f1a9b1c40b8b
--- /dev/null
+++ b/kernel/bpf/rbtree.c
@@ -0,0 +1,445 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * rbtree.c: eBPF rbtree map
+ *
+ * Copyright (C) 2022, ByteDance, Cong Wang <cong.wang@bytedance.com>
+ */
+#include <linux/bpf.h>
+#include <linux/slab.h>
+#include <linux/capability.h>
+#include <linux/rbtree.h>
+#include <linux/btf_ids.h>
+#include <linux/bpf_mem_alloc.h>
+#include <linux/math.h>
+#include <linux/seq_file.h>
+
+#define RBTREE_CREATE_FLAG_MASK \
+	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
+
+/* each rbtree element is struct rbtree_elem + key + value */
+struct rbtree_elem {
+	struct rb_node rbnode;
+	char key[] __aligned(8);
+};
+
+struct rbtree_map {
+	struct bpf_map map;
+	struct bpf_mem_alloc ma;
+	raw_spinlock_t lock;
+	struct rb_root root;
+	atomic_t nr_entries;
+};
+
+#define rb_to_elem(rb) rb_entry_safe(rb, struct rbtree_elem, rbnode)
+#define elem_rb_first(root) rb_to_elem(rb_first(root))
+#define elem_rb_last(root)  rb_to_elem(rb_last(root))
+#define elem_rb_next(e)   rb_to_elem(rb_next(&(e)->rbnode))
+#define rbtree_walk_safe(e, tmp, root)					\
+		for (e = elem_rb_first(root);				\
+		     tmp = e ? elem_rb_next(e) : NULL, (e != NULL);	\
+		     e = tmp)
+
+static struct rbtree_map *rbtree_map(struct bpf_map *map)
+{
+	return container_of(map, struct rbtree_map, map);
+}
+
+/* Called from syscall */
+static int rbtree_map_alloc_check(union bpf_attr *attr)
+{
+	if (!bpf_capable())
+		return -EPERM;
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 ||
+	    attr->map_flags & ~RBTREE_CREATE_FLAG_MASK ||
+	    !bpf_map_flags_access_ok(attr->map_flags))
+		return -EINVAL;
+
+	if (attr->value_size > KMALLOC_MAX_SIZE)
+		/* if value_size is bigger, the user space won't be able to
+		 * access the elements.
+		 */
+		return -E2BIG;
+
+	return 0;
+}
+
+static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
+{
+	int numa_node = bpf_map_attr_numa_node(attr);
+	struct rbtree_map *rb;
+	u32 elem_size;
+	int err;
+
+	rb = bpf_map_area_alloc(sizeof(*rb), numa_node);
+	if (!rb)
+		return ERR_PTR(-ENOMEM);
+
+	memset(rb, 0, sizeof(*rb));
+	bpf_map_init_from_attr(&rb->map, attr);
+	raw_spin_lock_init(&rb->lock);
+	rb->root = RB_ROOT;
+	atomic_set(&rb->nr_entries, 0);
+
+	elem_size = sizeof(struct rbtree_elem) +
+			  round_up(rb->map.key_size, 8);
+	elem_size += round_up(rb->map.value_size, 8);
+	err = bpf_mem_alloc_init(&rb->ma, elem_size, false);
+	if (err) {
+		bpf_map_area_free(rb);
+		return ERR_PTR(err);
+	}
+	return &rb->map;
+}
+
+static void check_and_free_fields(struct rbtree_map *rb,
+				  struct rbtree_elem *elem)
+{
+	void *map_value = elem->key + round_up(rb->map.key_size, 8);
+
+	if (map_value_has_kptrs(&rb->map))
+		bpf_map_free_kptrs(&rb->map, map_value);
+}
+
+static void rbtree_map_purge(struct bpf_map *map)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	struct rbtree_elem *e, *tmp;
+
+	rbtree_walk_safe(e, tmp, &rb->root) {
+		rb_erase(&e->rbnode, &rb->root);
+		check_and_free_fields(rb, e);
+		bpf_mem_cache_free(&rb->ma, e);
+	}
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void rbtree_map_free(struct bpf_map *map)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rb->lock, flags);
+	rbtree_map_purge(map);
+	raw_spin_unlock_irqrestore(&rb->lock, flags);
+	bpf_mem_alloc_destroy(&rb->ma);
+	bpf_map_area_free(rb);
+}
+
+static struct rbtree_elem *bpf_rbtree_find(struct rb_root *root, void *key, int size)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct rbtree_elem *e;
+
+	while (*p) {
+		int ret;
+
+		parent = *p;
+		e = rb_to_elem(parent);
+		ret = memcmp(key, e->key, size);
+		if (ret < 0)
+			p = &parent->rb_left;
+		else if (ret > 0)
+			p = &parent->rb_right;
+		else
+			return e;
+	}
+	return NULL;
+}
+
+/* Called from eBPF program or syscall */
+static void *rbtree_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	struct rbtree_elem *e;
+
+	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
+	if (!e)
+		return NULL;
+	return e->key + round_up(rb->map.key_size, 8);
+}
+
+static int check_flags(struct rbtree_elem *old, u64 map_flags)
+{
+	if (old && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST)
+		/* elem already exists */
+		return -EEXIST;
+
+	if (!old && (map_flags & ~BPF_F_LOCK) == BPF_EXIST)
+		/* elem doesn't exist, cannot update it */
+		return -ENOENT;
+
+	return 0;
+}
+
+static void rbtree_map_insert(struct rbtree_map *rb, struct rbtree_elem *e)
+{
+	struct rb_root *root = &rb->root;
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct rbtree_elem *e1;
+
+	while (*p) {
+		parent = *p;
+		e1 = rb_to_elem(parent);
+		if (memcmp(e->key, e1->key, rb->map.key_size) < 0)
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+	}
+	rb_link_node(&e->rbnode, parent, p);
+	rb_insert_color(&e->rbnode, root);
+}
+
+/* Called from syscall or from eBPF program */
+static int rbtree_map_update_elem(struct bpf_map *map, void *key, void *value,
+			       u64 map_flags)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	void *val = rbtree_map_lookup_elem(map, key);
+	int ret;
+
+	ret = check_flags(val, map_flags);
+	if (ret)
+		return ret;
+
+	if (!val) {
+		struct rbtree_elem *e_new;
+		unsigned long flags;
+
+		e_new = bpf_mem_cache_alloc(&rb->ma);
+		if (!e_new)
+			return -ENOMEM;
+		val = e_new->key + round_up(rb->map.key_size, 8);
+		check_and_init_map_value(&rb->map, val);
+		memcpy(e_new->key, key, rb->map.key_size);
+		raw_spin_lock_irqsave(&rb->lock, flags);
+		rbtree_map_insert(rb, e_new);
+		raw_spin_unlock_irqrestore(&rb->lock, flags);
+		atomic_inc(&rb->nr_entries);
+	}
+
+	if (map_flags & BPF_F_LOCK)
+		copy_map_value_locked(map, val, value, false);
+	else
+		copy_map_value(map, val, value);
+	return 0;
+}
+
+/* Called from syscall or from eBPF program */
+static int rbtree_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	struct rbtree_elem *e;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rb->lock, flags);
+	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
+	if (!e) {
+		raw_spin_unlock_irqrestore(&rb->lock, flags);
+		return -ENOENT;
+	}
+	rb_erase(&e->rbnode, &rb->root);
+	raw_spin_unlock_irqrestore(&rb->lock, flags);
+	check_and_free_fields(rb, e);
+	bpf_mem_cache_free(&rb->ma, e);
+	atomic_dec(&rb->nr_entries);
+	return 0;
+}
+
+/* Called from syscall or from eBPF program */
+static int rbtree_map_pop_elem(struct bpf_map *map, void *value)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	struct rbtree_elem *e = elem_rb_first(&rb->root);
+	unsigned long flags;
+	void *val;
+
+	if (!e)
+		return -ENOENT;
+	raw_spin_lock_irqsave(&rb->lock, flags);
+	rb_erase(&e->rbnode, &rb->root);
+	raw_spin_unlock_irqrestore(&rb->lock, flags);
+	val = e->key + round_up(rb->map.key_size, 8);
+	copy_map_value(map, value, val);
+	check_and_free_fields(rb, e);
+	bpf_mem_cache_free(&rb->ma, e);
+	atomic_dec(&rb->nr_entries);
+	return 0;
+}
+
+/* Called from syscall */
+static int rbtree_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	struct rbtree_elem *e;
+
+	if (!key) {
+		e = elem_rb_first(&rb->root);
+		if (!e)
+			return -ENOENT;
+		goto found;
+	}
+	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
+	if (!e)
+		return -ENOENT;
+	e = elem_rb_next(e);
+	if (!e)
+		return 0;
+found:
+	memcpy(next_key, e->key, map->key_size);
+	return 0;
+}
+
+static int bpf_for_each_rbtree_map(struct bpf_map *map,
+				   bpf_callback_t callback_fn,
+				   void *callback_ctx, u64 flags)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	struct rbtree_elem *e, *tmp;
+	void *key, *value;
+	u32 num_elems = 0;
+	u64 ret = 0;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	rbtree_walk_safe(e, tmp, &rb->root) {
+		num_elems++;
+		key = e->key;
+		value = key + round_up(rb->map.key_size, 8);
+		ret = callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value,
+				  (u64)(long)callback_ctx, 0);
+		/* return value: 0 - continue, 1 - stop and return */
+		if (ret)
+			break;
+	}
+
+	return num_elems;
+}
+
+struct rbtree_map_seq_info {
+	struct bpf_map *map;
+	struct rbtree_map *rb;
+};
+
+static void *rbtree_map_seq_find_next(struct rbtree_map_seq_info *info,
+				      struct rbtree_elem *prev_elem)
+{
+	const struct rbtree_map *rb = info->rb;
+	struct rbtree_elem *elem;
+
+	/* try to find the next elem after the previous one */
+	if (prev_elem) {
+		elem = elem_rb_next(prev_elem);
+		if (elem)
+			return elem;
+		return NULL;
+	}
+
+	return elem_rb_first(&rb->root);
+}
+
+static void *rbtree_map_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct rbtree_map_seq_info *info = seq->private;
+
+	if (*pos == 0)
+		++*pos;
+
+	/* pairs with rbtree_map_seq_stop */
+	rcu_read_lock();
+	return rbtree_map_seq_find_next(info, NULL);
+}
+
+static void *rbtree_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct rbtree_map_seq_info *info = seq->private;
+
+	++*pos;
+	return rbtree_map_seq_find_next(info, v);
+}
+
+static int rbtree_map_seq_show(struct seq_file *seq, void *v)
+{
+	struct rbtree_map_seq_info *info = seq->private;
+	struct bpf_iter__bpf_map_elem ctx = {};
+	struct rbtree_elem *elem = v;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, !elem);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.map = info->map;
+	if (elem) {
+		ctx.key = elem->key;
+		ctx.value = elem->key + round_up(info->map->key_size, 8);
+	}
+
+	return bpf_iter_run_prog(prog, &ctx);
+}
+
+static void rbtree_map_seq_stop(struct seq_file *seq, void *v)
+{
+	if (!v)
+		(void)rbtree_map_seq_show(seq, NULL);
+
+	/* pairs with rbtree_map_seq_start */
+	rcu_read_unlock();
+}
+
+static const struct seq_operations rbtree_map_seq_ops = {
+	.start	= rbtree_map_seq_start,
+	.next	= rbtree_map_seq_next,
+	.stop	= rbtree_map_seq_stop,
+	.show	= rbtree_map_seq_show,
+};
+
+static int rbtree_map_init_seq_private(void *priv_data,
+				       struct bpf_iter_aux_info *aux)
+{
+	struct rbtree_map_seq_info *info = priv_data;
+
+	bpf_map_inc_with_uref(aux->map);
+	info->map = aux->map;
+	info->rb = rbtree_map(info->map);
+	return 0;
+}
+
+static void rbtree_map_fini_seq_private(void *priv_data)
+{
+	struct rbtree_map_seq_info *info = priv_data;
+
+	bpf_map_put_with_uref(info->map);
+}
+
+static const struct bpf_iter_seq_info rbtree_map_iter_seq_info = {
+	.seq_ops		= &rbtree_map_seq_ops,
+	.init_seq_private	= rbtree_map_init_seq_private,
+	.fini_seq_private	= rbtree_map_fini_seq_private,
+	.seq_priv_size		= sizeof(struct rbtree_map_seq_info),
+};
+
+BTF_ID_LIST_SINGLE(rbtree_map_btf_ids, struct, rbtree_map)
+const struct bpf_map_ops rbtree_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc_check = rbtree_map_alloc_check,
+	.map_alloc = rbtree_map_alloc,
+	.map_free = rbtree_map_free,
+	.map_lookup_elem = rbtree_map_lookup_elem,
+	.map_update_elem = rbtree_map_update_elem,
+	.map_delete_elem = rbtree_map_delete_elem,
+	.map_pop_elem = rbtree_map_pop_elem,
+	.map_get_next_key = rbtree_map_get_next_key,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_for_each_rbtree_map,
+	.map_btf_id = &rbtree_map_btf_ids[0],
+	.iter_seq_info = &rbtree_map_iter_seq_info,
+};
+
-- 
2.34.1



* [RFC Patch v6 2/5] bpf: Add map in map support to rbtree
  2022-10-05 17:17 [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
  2022-10-05 17:17 ` [RFC Patch v6 1/5] bpf: Introduce rbtree map Cong Wang
@ 2022-10-05 17:17 ` Cong Wang
  2022-10-05 17:17 ` [RFC Patch v6 3/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2022-10-05 17:17 UTC (permalink / raw)
  To: netdev; +Cc: yangpeihao, toke, jhs, jiri, bpf, sdf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>
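
Add a BPF_MAP_TYPE_RBTREE_OF_MAPS variant so that an rbtree map can hold
inner maps, which the cover letter uses for push/pop style multi-queue
support.

For illustration, a user-space sketch of creating such a map with
libbpf's bpf_map_create(); names and sizes are illustrative, the patched
<linux/bpf.h> is assumed for the new map type values, and note that
rbtree_map_in_map_alloc_check() requires value_size == sizeof(u32):

#include <bpf/bpf.h>

/* Create an outer rbtree-of-maps keyed by u64; values are u32 map fds. */
static int create_rbtree_of_maps(void)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts);
	int inner_fd, outer_fd;

	/* Inner map: an ordinary rbtree map used as the template. */
	inner_fd = bpf_map_create(BPF_MAP_TYPE_RBTREE, "inner_rb",
				  sizeof(__u64), sizeof(__u64), 1024, NULL);
	if (inner_fd < 0)
		return inner_fd;

	opts.inner_map_fd = inner_fd;
	/* Outer map: value_size must be sizeof(u32), i.e. a map fd. The
	 * inner fd only serves as the template and can be closed after
	 * the outer map has been created.
	 */
	outer_fd = bpf_map_create(BPF_MAP_TYPE_RBTREE_OF_MAPS, "outer_rb",
				  sizeof(__u64), sizeof(__u32), 16, &opts);
	return outer_fd;
}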

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf.h       |   4 +
 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/rbtree.c       | 158 ++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c      |   7 ++
 5 files changed, 171 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9e7d46d16032..d4d85df1e8ea 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1913,6 +1913,10 @@ int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
 int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
 				void *key, void *value, u64 map_flags);
 int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
+int bpf_fd_rbtree_map_update_elem(struct bpf_map *map, struct file *map_file,
+				  void *key, void *value, u64 map_flags);
+int bpf_fd_rbtree_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
+int bpf_fd_rbtree_map_pop_elem(struct bpf_map *map, void *value);
 
 int bpf_get_file_flag(int flags);
 int bpf_check_uarg_tail_zero(bpfptr_t uaddr, size_t expected_size,
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c53ba6de1613..d1ef13b08e28 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -128,6 +128,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE, rbtree_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE_OF_MAPS, rbtree_map_in_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9492cd3af701..994a3e42a4fa 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -936,6 +936,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_RBTREE,
+	BPF_MAP_TYPE_RBTREE_OF_MAPS,
 };
 
 /* Note that tracing related programs such as
diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
index f1a9b1c40b8b..43d3d4193ce4 100644
--- a/kernel/bpf/rbtree.c
+++ b/kernel/bpf/rbtree.c
@@ -12,6 +12,7 @@
 #include <linux/bpf_mem_alloc.h>
 #include <linux/math.h>
 #include <linux/seq_file.h>
+#include "map_in_map.h"
 
 #define RBTREE_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
@@ -443,3 +444,160 @@ const struct bpf_map_ops rbtree_map_ops = {
 	.iter_seq_info = &rbtree_map_iter_seq_info,
 };
 
+static struct bpf_map *rbtree_map_in_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_map *map, *inner_map_meta;
+
+	inner_map_meta = bpf_map_meta_alloc(attr->inner_map_fd);
+	if (IS_ERR(inner_map_meta))
+		return inner_map_meta;
+
+	map = rbtree_map_alloc(attr);
+	if (IS_ERR(map)) {
+		bpf_map_meta_free(inner_map_meta);
+		return map;
+	}
+
+	map->inner_map_meta = inner_map_meta;
+	return map;
+}
+
+static void *fd_rbtree_map_get_ptr(const struct bpf_map *map, struct rbtree_elem *e)
+{
+	return *(void **)(e->key + roundup(map->key_size, 8));
+}
+
+static void rbtree_map_in_map_purge(struct bpf_map *map)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	struct rbtree_elem *e, *tmp;
+
+	rbtree_walk_safe(e, tmp, &rb->root) {
+		void *ptr = fd_rbtree_map_get_ptr(map, e);
+
+		map->ops->map_fd_put_ptr(ptr);
+	}
+}
+
+static void rbtree_map_in_map_free(struct bpf_map *map)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+
+	bpf_map_meta_free(map->inner_map_meta);
+	rbtree_map_in_map_purge(map);
+	bpf_map_area_free(rb);
+}
+
+/* Called from eBPF program */
+static void *rbtree_map_in_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_map **inner_map = rbtree_map_lookup_elem(map, key);
+
+	if (!inner_map)
+		return NULL;
+
+	return READ_ONCE(*inner_map);
+}
+
+static int rbtree_map_in_map_alloc_check(union bpf_attr *attr)
+{
+	if (attr->value_size != sizeof(u32))
+		return -EINVAL;
+	return rbtree_map_alloc_check(attr);
+}
+
+/* Called from eBPF program */
+static int rbtree_map_in_map_pop_elem(struct bpf_map *map, void *value)
+{
+	struct rbtree_map *rb = rbtree_map(map);
+	struct rbtree_elem *e = elem_rb_first(&rb->root);
+	struct bpf_map **inner_map;
+	unsigned long flags;
+
+	if (!e)
+		return -ENOENT;
+	raw_spin_lock_irqsave(&rb->lock, flags);
+	rb_erase(&e->rbnode, &rb->root);
+	raw_spin_unlock_irqrestore(&rb->lock, flags);
+	inner_map = fd_rbtree_map_get_ptr(map, e);
+	*(void **)value = *inner_map;
+	bpf_mem_cache_free(&rb->ma, e);
+	atomic_dec(&rb->nr_entries);
+	return 0;
+}
+
+/* only called from syscall */
+int bpf_fd_rbtree_map_pop_elem(struct bpf_map *map, void *value)
+{
+	struct bpf_map *ptr;
+	int ret = 0;
+
+	if (!map->ops->map_fd_sys_lookup_elem)
+		return -ENOTSUPP;
+
+	rcu_read_lock();
+	ret = rbtree_map_in_map_pop_elem(map, &ptr);
+	if (!ret)
+		*(u32 *)value = map->ops->map_fd_sys_lookup_elem(ptr);
+	else
+		ret = -ENOENT;
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/* only called from syscall */
+int bpf_fd_rbtree_map_lookup_elem(struct bpf_map *map, void *key, u32 *value)
+{
+	void **ptr;
+	int ret = 0;
+
+	if (!map->ops->map_fd_sys_lookup_elem)
+		return -ENOTSUPP;
+
+	rcu_read_lock();
+	ptr = rbtree_map_lookup_elem(map, key);
+	if (ptr)
+		*value = map->ops->map_fd_sys_lookup_elem(READ_ONCE(*ptr));
+	else
+		ret = -ENOENT;
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/* only called from syscall */
+int bpf_fd_rbtree_map_update_elem(struct bpf_map *map, struct file *map_file,
+				  void *key, void *value, u64 map_flags)
+{
+	void *ptr;
+	int ret;
+	u32 ufd = *(u32 *)value;
+
+	ptr = map->ops->map_fd_get_ptr(map, map_file, ufd);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	ret = rbtree_map_update_elem(map, key, &ptr, map_flags);
+	if (ret)
+		map->ops->map_fd_put_ptr(ptr);
+
+	return ret;
+}
+
+const struct bpf_map_ops rbtree_map_in_map_ops = {
+	.map_alloc_check = rbtree_map_in_map_alloc_check,
+	.map_alloc = rbtree_map_in_map_alloc,
+	.map_free = rbtree_map_in_map_free,
+	.map_get_next_key = rbtree_map_get_next_key,
+	.map_lookup_elem = rbtree_map_in_map_lookup_elem,
+	.map_update_elem = rbtree_map_update_elem,
+	.map_pop_elem = rbtree_map_in_map_pop_elem,
+	.map_delete_elem = rbtree_map_delete_elem,
+	.map_fd_get_ptr = bpf_map_fd_get_ptr,
+	.map_fd_put_ptr = bpf_map_fd_put_ptr,
+	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
+	.map_check_btf = map_check_no_btf,
+	.map_btf_id = &rbtree_map_btf_ids[0],
+};
+
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7b373a5e861f..1b968dc38500 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -213,6 +213,11 @@ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key,
 		err = bpf_fd_htab_map_update_elem(map, f.file, key, value,
 						  flags);
 		rcu_read_unlock();
+	} else if (map->map_type == BPF_MAP_TYPE_RBTREE_OF_MAPS) {
+		rcu_read_lock();
+		err = bpf_fd_rbtree_map_update_elem(map, f.file, key, value,
+						    flags);
+		rcu_read_unlock();
 	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
 		/* rcu_read_lock() is not needed */
 		err = bpf_fd_reuseport_array_update_elem(map, key, value,
@@ -1832,6 +1837,8 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
 	    map->map_type == BPF_MAP_TYPE_STACK) {
 		err = map->ops->map_pop_elem(map, value);
+	} else if (map->map_type == BPF_MAP_TYPE_RBTREE_OF_MAPS) {
+		bpf_fd_rbtree_map_pop_elem(map, value);
 	} else if (map->map_type == BPF_MAP_TYPE_HASH ||
 		   map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_HASH ||
-- 
2.34.1



* [RFC Patch v6 3/5] net_sched: introduce eBPF based Qdisc
  2022-10-05 17:17 [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
  2022-10-05 17:17 ` [RFC Patch v6 1/5] bpf: Introduce rbtree map Cong Wang
  2022-10-05 17:17 ` [RFC Patch v6 2/5] bpf: Add map in map support to rbtree Cong Wang
@ 2022-10-05 17:17 ` Cong Wang
  2022-10-05 17:17 ` [RFC Patch v6 4/5] net_sched: Add kfuncs for storing skb Cong Wang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2022-10-05 17:17 UTC (permalink / raw)
  To: netdev; +Cc: yangpeihao, toke, jhs, jiri, bpf, sdf, Cong Wang, Cong Wang

Introduce a new Qdisc which is completely managed by eBPF programs
of type BPF_PROG_TYPE_QDISC. It accepts two eBPF programs of the same
type, one for enqueue and the other for dequeue.

It interacts with the Qdisc layer in two ways:
1) it relies on the Qdisc watchdog to handle throttling;
2) it can pass the skb enqueue/dequeue down to child classes.

The context seen by these programs differs between enqueue and dequeue,
as shown below:

┌──────────┬───────────────┬─────────────────────────────────┐
│          │               │                                 │
│ prog     │  input        │          output                 │
│          │               │                                 │
├──────────┼───────────────┼─────────────────────────────────┤
│          │ ctx->skb      │   SCH_BPF_THROTTLE: ctx->delay  │
│          │               │                                 │
│ enqueue  │ ctx->classid  │   SCH_BPF_QUEUED: None          │
│          │               │                                 │
│          │               │   SCH_BPF_DROP: None            │
│          │               │                                 │
│          │               │   SCH_BPF_CN: None              │
│          │               │                                 │
│          │               │   SCH_BPF_PASS: ctx->classid    │
├──────────┼───────────────┼─────────────────────────────────┤
│          │               │   SCH_BPF_THROTTLE: ctx->delay  │
│          │               │                                 │
│ dequeue  │ ctx->classid  │   SCH_BPF_DEQUEUED: ctx->skb    │
│          │               │                                 │
│          │               │   SCH_BPF_DROP: None            │
│          │               │                                 │
│          │               │   SCH_BPF_PASS: ctx->classid    │
└──────────┴───────────────┴─────────────────────────────────┘
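
For illustration, a minimal enqueue/dequeue pair that only classifies and
defers the actual queueing to child classes; the section names are an
assumption, since this RFC does not define libbpf conventions for
BPF_PROG_TYPE_QDISC yet:

/* Assumes the patched <linux/bpf.h>, which provides struct sch_bpf_ctx
 * and the SCH_BPF_* return codes added by this patch.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("qdisc/enqueue")
int enqueue_prog(struct sch_bpf_ctx *ctx)
{
	/* Hand the skb to class 1:1; its child qdisc does the queueing. */
	ctx->classid = 0x10001;
	return SCH_BPF_PASS;
}

SEC("qdisc/dequeue")
int dequeue_prog(struct sch_bpf_ctx *ctx)
{
	/* Ask the kernel to dequeue from class 1:1's child qdisc. */
	ctx->classid = 0x10001;
	return SCH_BPF_PASS;
}

char _license[] SEC("license") = "GPL";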

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf_types.h      |   2 +
 include/uapi/linux/bpf.h       |  16 ++
 include/uapi/linux/pkt_sched.h |  17 ++
 net/core/filter.c              |  12 +
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/sch_bpf.c            | 485 +++++++++++++++++++++++++++++++++
 7 files changed, 548 insertions(+)
 create mode 100644 net/sched/sch_bpf.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d1ef13b08e28..4e375abe0f03 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_CLS, tc_cls_act,
 	      struct __sk_buff, struct sk_buff)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act,
 	      struct __sk_buff, struct sk_buff)
+BPF_PROG_TYPE(BPF_PROG_TYPE_QDISC, tc_qdisc,
+	      struct __sk_buff, struct sk_buff)
 BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp,
 	      struct xdp_md, struct xdp_buff)
 #ifdef CONFIG_CGROUP_BPF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 994a3e42a4fa..c21fd1f189bc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -980,6 +980,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_QDISC,
 };
 
 enum bpf_attach_type {
@@ -6984,4 +6985,19 @@ struct bpf_core_relo {
 	enum bpf_core_relo_kind kind;
 };
 
+struct sch_bpf_ctx {
+	struct __sk_buff *skb;
+	__u32 classid;
+	__u64 delay;
+};
+
+enum {
+	SCH_BPF_QUEUED,
+	SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
+	SCH_BPF_DROP,
+	SCH_BPF_CN,
+	SCH_BPF_THROTTLE,
+	SCH_BPF_PASS,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 000eec106856..229af4cc54f6 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1278,4 +1278,21 @@ enum {
 
 #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
 
+#define TCA_SCH_BPF_FLAG_DIRECT _BITUL(0)
+enum {
+	TCA_SCH_BPF_UNSPEC,
+	TCA_SCH_BPF_FLAGS,		/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_ENQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_TAG,	/* data */
+	TCA_SCH_BPF_DEQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_DEQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_TAG,	/* data */
+	__TCA_SCH_BPF_MAX,
+};
+
+#define TCA_SCH_BPF_MAX (__TCA_SCH_BPF_MAX - 1)
+
 #endif
diff --git a/net/core/filter.c b/net/core/filter.c
index bb0136e7a8e4..7a271b77a2cc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10655,6 +10655,18 @@ const struct bpf_prog_ops tc_cls_act_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops tc_qdisc_verifier_ops = {
+	.get_func_proto		= tc_cls_act_func_proto,
+	.is_valid_access	= tc_cls_act_is_valid_access,
+	.convert_ctx_access	= tc_cls_act_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+	.gen_ld_abs		= bpf_gen_ld_abs,
+};
+
+const struct bpf_prog_ops tc_qdisc_prog_ops = {
+	.test_run		= bpf_prog_test_run_skb,
+};
+
 const struct bpf_verifier_ops xdp_verifier_ops = {
 	.get_func_proto		= xdp_func_proto,
 	.is_valid_access	= xdp_is_valid_access,
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 1e8ab4749c6c..19f68aac79b1 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -439,6 +439,21 @@ config NET_SCH_ETS
 
 	  If unsure, say N.
 
+config NET_SCH_BPF
+	tristate "eBPF based programmable queue discipline"
+	help
+	  This eBPF based queue discipline offers a way to program your
+	  own packet scheduling algorithm. This is a classful qdisc which
+	  also allows you to decide the hierarchy.
+
+	  Say Y here if you want to use the eBPF based programmable queue
+	  discipline.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_bpf.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index dd14ef413fda..9ef0d579f5ff 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -65,6 +65,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE)	+= sch_fq_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF)	+= sch_bpf.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
new file mode 100644
index 000000000000..2998d576708d
--- /dev/null
+++ b/net/sched/sch_bpf.c
@@ -0,0 +1,485 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Programmable Qdisc with eBPF
+ *
+ * Copyright (C) 2022, ByteDance, Cong Wang <cong.wang@bytedance.com>
+ */
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/slab.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+#define ACT_BPF_NAME_LEN	256
+
+struct sch_bpf_prog {
+	struct bpf_prog *prog;
+	const char *name;
+};
+
+struct sch_bpf_class {
+	struct Qdisc_class_common common;
+	struct Qdisc *qdisc;
+
+	unsigned int drops;
+	unsigned int overlimits;
+	struct gnet_stats_basic_sync bstats;
+};
+
+struct sch_bpf_qdisc {
+	struct tcf_proto __rcu *filter_list; /* optional external classifier */
+	struct tcf_block *block;
+	struct Qdisc_class_hash clhash;
+	struct sch_bpf_prog enqueue_prog;
+	struct sch_bpf_prog dequeue_prog;
+
+	struct qdisc_watchdog watchdog;
+};
+
+static int sch_bpf_dump_prog(const struct sch_bpf_prog *prog, struct sk_buff *skb,
+			     int name, int id, int tag)
+{
+	struct nlattr *nla;
+
+	if (prog->name &&
+	    nla_put_string(skb, name, prog->name))
+		return -EMSGSIZE;
+
+	if (nla_put_u32(skb, id, prog->prog->aux->id))
+		return -EMSGSIZE;
+
+	nla = nla_reserve(skb, tag, sizeof(prog->prog->tag));
+	if (!nla)
+		return -EMSGSIZE;
+
+	memcpy(nla_data(nla), prog->prog->tag, nla_len(nla));
+	return 0;
+}
+
+static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct nlattr *opts;
+	u32 bpf_flags = 0;
+
+	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (bpf_flags && nla_put_u32(skb, TCA_SCH_BPF_FLAGS, bpf_flags))
+		goto nla_put_failure;
+
+	if (sch_bpf_dump_prog(&q->enqueue_prog, skb, TCA_SCH_BPF_ENQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_ENQUEUE_PROG_ID, TCA_SCH_BPF_ENQUEUE_PROG_TAG))
+		goto nla_put_failure;
+	if (sch_bpf_dump_prog(&q->dequeue_prog, skb, TCA_SCH_BPF_DEQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_DEQUEUE_PROG_ID, TCA_SCH_BPF_DEQUEUE_PROG_TAG))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	return -1;
+}
+
+static int sch_bpf_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	return 0;
+}
+
+static struct sch_bpf_class *sch_bpf_find(struct Qdisc *sch, u32 classid)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct Qdisc_class_common *clc;
+
+	clc = qdisc_class_find(&q->clhash, classid);
+	if (!clc)
+		return NULL;
+	return container_of(clc, struct sch_bpf_class, common);
+}
+
+static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			   struct sk_buff **to_free)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	unsigned int len = qdisc_pkt_len(skb);
+	struct sch_bpf_ctx ctx = {};
+	struct sch_bpf_class *cl;
+	int res = NET_XMIT_SUCCESS;
+	struct bpf_prog *enqueue;
+	s64 now;
+
+	enqueue = rcu_dereference(q->enqueue_prog.prog);
+	bpf_compute_data_pointers(skb);
+	ctx.skb = (struct __sk_buff *)skb;
+	ctx.classid = sch->handle;
+	res = bpf_prog_run(enqueue, &ctx);
+	switch (res) {
+	case SCH_BPF_THROTTLE:
+		now = ktime_get_ns();
+		qdisc_watchdog_schedule_ns(&q->watchdog, now + ctx.delay);
+		qdisc_qstats_overlimit(sch);
+		fallthrough;
+	case SCH_BPF_QUEUED:
+		return NET_XMIT_SUCCESS;
+	case SCH_BPF_CN:
+		return NET_XMIT_CN;
+	case SCH_BPF_PASS:
+		break;
+	default:
+		__qdisc_drop(skb, to_free);
+		return NET_XMIT_DROP;
+	}
+
+	cl = sch_bpf_find(sch, ctx.classid);
+	if (!cl || !cl->qdisc) {
+		if (res & __NET_XMIT_BYPASS)
+			qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		return res;
+	}
+
+	res = qdisc_enqueue(skb, cl->qdisc, to_free);
+	if (res != NET_XMIT_SUCCESS) {
+		if (net_xmit_drop_count(res)) {
+			qdisc_qstats_drop(sch);
+			cl->drops++;
+		}
+		return res;
+	}
+
+	sch->qstats.backlog += len;
+	sch->q.qlen++;
+	return res;
+}
+
+static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct sk_buff *ret = NULL;
+	struct sch_bpf_ctx ctx = {};
+	struct bpf_prog *dequeue;
+	struct sch_bpf_class *cl;
+	s64 now;
+	int res;
+
+	dequeue = rcu_dereference(q->dequeue_prog.prog);
+	ctx.classid = sch->handle;
+	res = bpf_prog_run(dequeue, &ctx);
+	switch (res) {
+	case SCH_BPF_DEQUEUED:
+		ret = (struct sk_buff *)ctx.skb;
+		break;
+	case SCH_BPF_THROTTLE:
+		now = ktime_get_ns();
+		qdisc_watchdog_schedule_ns(&q->watchdog, now + ctx.delay);
+		qdisc_qstats_overlimit(sch);
+		cl->overlimits++;
+		return NULL;
+	case SCH_BPF_PASS:
+		cl = sch_bpf_find(sch, ctx.classid);
+		if (!cl || !cl->qdisc)
+			return NULL;
+		ret = qdisc_dequeue_peeked(cl->qdisc);
+		if (ret) {
+			qdisc_bstats_update(sch, ret);
+			qdisc_qstats_backlog_dec(sch, ret);
+			sch->q.qlen--;
+		}
+	}
+
+	return ret;
+}
+
+static struct Qdisc *sch_bpf_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	return cl->qdisc;
+}
+
+static int sch_bpf_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
+			 struct Qdisc **old, struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	if (new)
+		*old = qdisc_replace(sch, new, &cl->qdisc);
+	return 0;
+}
+
+static unsigned long sch_bpf_bind(struct Qdisc *sch, unsigned long parent,
+				  u32 classid)
+{
+	return 0;
+}
+
+static void sch_bpf_unbind(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static unsigned long sch_bpf_search(struct Qdisc *sch, u32 handle)
+{
+	return (unsigned long)sch_bpf_find(sch, handle);
+}
+
+static struct tcf_block *sch_bpf_tcf_block(struct Qdisc *sch, unsigned long cl,
+					   struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return q->block;
+}
+
+static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = {
+	[TCA_SCH_BPF_FLAGS]		= { .type = NLA_U32 },
+	[TCA_SCH_BPF_ENQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_ENQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+	[TCA_SCH_BPF_DEQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_DEQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+};
+
+static int bpf_init_prog(struct nlattr *fd, struct nlattr *name, struct sch_bpf_prog *prog)
+{
+	char *prog_name = NULL;
+	struct bpf_prog *fp;
+	u32 bpf_fd;
+
+	if (!fd)
+		return -EINVAL;
+	bpf_fd = nla_get_u32(fd);
+
+	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_QDISC);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	if (name) {
+		prog_name = nla_memdup(name, GFP_KERNEL);
+		if (!prog_name) {
+			bpf_prog_put(fp);
+			return -ENOMEM;
+		}
+	}
+
+	prog->name = prog_name;
+	prog->prog = fp;
+	return 0;
+}
+
+static void bpf_cleanup_prog(struct sch_bpf_prog *prog)
+{
+	if (prog->prog)
+		bpf_prog_put(prog->prog);
+	kfree(prog->name);
+}
+
+static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_SCH_BPF_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested_deprecated(tb, TCA_SCH_BPF_MAX, opt,
+					  sch_bpf_policy, NULL);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_SCH_BPF_FLAGS]) {
+		u32 bpf_flags = nla_get_u32(tb[TCA_SCH_BPF_FLAGS]);
+
+		if (bpf_flags & ~TCA_SCH_BPF_FLAG_DIRECT)
+			return -EINVAL;
+	}
+
+	err = bpf_init_prog(tb[TCA_SCH_BPF_ENQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog);
+	if (err)
+		return err;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_DEQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog);
+	return err;
+}
+
+static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	int err;
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+	if (opt) {
+		err = sch_bpf_change(sch, opt, extack);
+		if (err)
+			return err;
+	}
+
+	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
+	if (err)
+		return err;
+
+	return qdisc_class_hash_init(&q->clhash);
+}
+
+static void sch_bpf_reset(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static void sch_bpf_destroy(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+	tcf_block_put(q->block);
+	qdisc_class_hash_destroy(&q->clhash);
+	bpf_cleanup_prog(&q->enqueue_prog);
+	bpf_cleanup_prog(&q->dequeue_prog);
+}
+
+static int sch_bpf_change_class(struct Qdisc *sch, u32 classid,
+				u32 parentid, struct nlattr **tca,
+				unsigned long *arg,
+				struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)*arg;
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	if (!cl) {
+		cl = kzalloc(sizeof(*cl), GFP_KERNEL);
+		if (!cl)
+			return -ENOBUFS;
+		qdisc_class_hash_insert(&q->clhash, &cl->common);
+	}
+
+	qdisc_class_hash_grow(sch, &q->clhash);
+	*arg = (unsigned long)cl;
+	return 0;
+}
+
+static int sch_bpf_delete(struct Qdisc *sch, unsigned long arg,
+			  struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	qdisc_class_hash_remove(&q->clhash, &cl->common);
+	if (cl->qdisc)
+		qdisc_put(cl->qdisc);
+	return 0;
+}
+
+static int sch_bpf_dump_class(struct Qdisc *sch, unsigned long arg,
+			      struct sk_buff *skb, struct tcmsg *tcm)
+{
+	return 0;
+}
+
+static int
+sch_bpf_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct gnet_stats_queue qs = {
+		.drops = cl->drops,
+		.overlimits = cl->overlimits,
+	};
+	__u32 qlen = 0;
+
+	if (cl->qdisc)
+		qdisc_qstats_qlen_backlog(cl->qdisc, &qlen, &qs.backlog);
+	else
+		qlen = 0;
+
+	if (gnet_stats_copy_basic(d, NULL, &cl->bstats, true) < 0 ||
+	    gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0)
+		return -1;
+	return 0;
+}
+
+static void sch_bpf_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			if (arg->count < arg->skip) {
+				arg->count++;
+				continue;
+			}
+			if (arg->fn(sch, (unsigned long)cl, arg) < 0) {
+				arg->stop = 1;
+				return;
+			}
+			arg->count++;
+		}
+	}
+}
+
+static const struct Qdisc_class_ops sch_bpf_class_ops = {
+	.graft		=	sch_bpf_graft,
+	.leaf		=	sch_bpf_leaf,
+	.find		=	sch_bpf_search,
+	.change		=	sch_bpf_change_class,
+	.delete		=	sch_bpf_delete,
+	.tcf_block	=	sch_bpf_tcf_block,
+	.bind_tcf	=	sch_bpf_bind,
+	.unbind_tcf	=	sch_bpf_unbind,
+	.dump		=	sch_bpf_dump_class,
+	.dump_stats	=	sch_bpf_dump_class_stats,
+	.walk		=	sch_bpf_walk,
+};
+
+static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
+	.cl_ops		=	&sch_bpf_class_ops,
+	.id		=	"bpf",
+	.priv_size	=	sizeof(struct sch_bpf_qdisc),
+	.enqueue	=	sch_bpf_enqueue,
+	.dequeue	=	sch_bpf_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	sch_bpf_init,
+	.reset		=	sch_bpf_reset,
+	.destroy	=	sch_bpf_destroy,
+	.change		=	sch_bpf_change,
+	.dump		=	sch_bpf_dump,
+	.dump_stats	=	sch_bpf_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init sch_bpf_mod_init(void)
+{
+	return register_qdisc(&sch_bpf_qdisc_ops);
+}
+
+static void __exit sch_bpf_mod_exit(void)
+{
+	unregister_qdisc(&sch_bpf_qdisc_ops);
+}
+
+module_init(sch_bpf_mod_init)
+module_exit(sch_bpf_mod_exit)
+MODULE_AUTHOR("Cong Wang");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("eBPF queue discipline");
-- 
2.34.1



* [RFC Patch v6 4/5] net_sched: Add kfuncs for storing skb
  2022-10-05 17:17 [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (2 preceding siblings ...)
  2022-10-05 17:17 ` [RFC Patch v6 3/5] net_sched: introduce eBPF based Qdisc Cong Wang
@ 2022-10-05 17:17 ` Cong Wang
  2022-10-05 17:17 ` [RFC Patch v6 5/5] net_sched: Introduce helper bpf_skb_tc_classify() Cong Wang
  2022-10-07 17:27 ` [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc sdf
  5 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2022-10-05 17:17 UTC (permalink / raw)
  To: netdev; +Cc: yangpeihao, toke, jhs, jiri, bpf, sdf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>
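
Add kfuncs so that an eBPF Qdisc program can hold references on skb's and
store them in maps as referenced kptrs: bpf_skb_acquire(),
bpf_skb_kptr_get() and bpf_skb_release(), with bpf_skb_release() also
registered as the kptr destructor.

For illustration, a sketch of stashing an skb into an rbtree map slot
from the enqueue path; the __kptr_ref annotation, the hand-written kfunc
declarations and the way a trusted struct sk_buff pointer is obtained
from the Qdisc context are assumptions layered on top of this RFC, not
something this patch defines:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* kfuncs added by this patch, declared by hand. */
extern struct sk_buff *bpf_skb_acquire(struct sk_buff *p) __ksym;
extern void bpf_skb_release(struct sk_buff *p) __ksym;

struct skb_slot {
	struct sk_buff __kptr_ref *skb;	/* referenced kptr */
};

struct {
	__uint(type, BPF_MAP_TYPE_RBTREE);
	__uint(max_entries, 1024);
	__type(key, __u64);
	__type(value, struct skb_slot);
} skb_tree SEC(".maps");

/* Take a reference on @skb and move it into the slot keyed by @rank;
 * release whatever skb was stored there before.
 */
static __always_inline int stash_skb(struct sk_buff *skb, __u64 rank)
{
	struct skb_slot zero = {}, *slot;
	struct sk_buff *old;

	bpf_map_update_elem(&skb_tree, &rank, &zero, BPF_NOEXIST);
	slot = bpf_map_lookup_elem(&skb_tree, &rank);
	if (!slot)
		return -1;

	skb = bpf_skb_acquire(skb);
	old = bpf_kptr_xchg(&slot->skb, skb);
	if (old)
		bpf_skb_release(old);
	return 0;
}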

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 net/sched/sch_bpf.c | 81 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
index 2998d576708d..6850eb8bb574 100644
--- a/net/sched/sch_bpf.c
+++ b/net/sched/sch_bpf.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
+#include <linux/btf_ids.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
@@ -468,9 +469,87 @@ static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
 	.owner		=	THIS_MODULE,
 };
 
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/**
+ * bpf_skb_acquire - Acquire a reference to an skb. An skb acquired by this
+ * kfunc which is not stored in a map as a kptr, must be released by calling
+ * bpf_skb_release().
+ * @p: The skb on which a reference is being acquired.
+ */
+__used noinline
+struct sk_buff *bpf_skb_acquire(struct sk_buff *p)
+{
+	return skb_get(p);
+}
+
+/**
+ * bpf_skb_kptr_get - Acquire a reference on a struct sk_buff kptr. An skb
+ * kptr acquired by this kfunc which is not subsequently stored in a map, must
+ * be released by calling bpf_skb_release().
+ * @pp: A pointer to an skb kptr on which a reference is being acquired.
+ */
+__used noinline
+struct sk_buff *bpf_skb_kptr_get(struct sk_buff **pp)
+{
+	struct sk_buff *p;
+
+	rcu_read_lock();
+	p = READ_ONCE(*pp);
+	if (p && !refcount_inc_not_zero(&p->users))
+		p = NULL;
+	rcu_read_unlock();
+
+	return p;
+}
+
+/**
+ * bpf_skb_release - Release the reference acquired on a struct sk_buff *.
+ * @p: The skb on which a reference is being released.
+ */
+__used noinline void bpf_skb_release(struct sk_buff *p)
+{
+	consume_skb(p);
+}
+
+__diag_pop();
+
+BTF_SET8_START(skb_kfunc_btf_ids)
+BTF_ID_FLAGS(func, bpf_skb_acquire, KF_ACQUIRE)
+BTF_ID_FLAGS(func, bpf_skb_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE | KF_TRUSTED_ARGS)
+BTF_SET8_END(skb_kfunc_btf_ids)
+
+static const struct btf_kfunc_id_set skb_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &skb_kfunc_btf_ids,
+};
+
+BTF_ID_LIST(skb_kfunc_dtor_ids)
+BTF_ID(struct, sk_buff)
+BTF_ID(func, bpf_skb_release)
+
 static int __init sch_bpf_mod_init(void)
 {
-	return register_qdisc(&sch_bpf_qdisc_ops);
+	int ret;
+	const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
+		{
+			.btf_id       = skb_kfunc_dtor_ids[0],
+			.kfunc_btf_id = skb_kfunc_dtor_ids[1]
+		},
+	};
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_QDISC, &skb_kfunc_set);
+	if (ret)
+		return ret;
+	ret = register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
+					  ARRAY_SIZE(skb_kfunc_dtors),
+					  THIS_MODULE);
+	if (ret == 0)
+		return register_qdisc(&sch_bpf_qdisc_ops);
+	return ret;
 }
 
 static void __exit sch_bpf_mod_exit(void)
-- 
2.34.1



* [RFC Patch v6 5/5] net_sched: Introduce helper bpf_skb_tc_classify()
  2022-10-05 17:17 [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (3 preceding siblings ...)
  2022-10-05 17:17 ` [RFC Patch v6 4/5] net_sched: Add kfuncs for storing skb Cong Wang
@ 2022-10-05 17:17 ` Cong Wang
  2022-10-07 17:27 ` [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc sdf
  5 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2022-10-05 17:17 UTC (permalink / raw)
  To: netdev; +Cc: yangpeihao, toke, jhs, jiri, bpf, sdf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

Introduce an eBPF helper function bpf_skb_tc_classify() to reuse existing
TC filters on *any* Qdisc to classify the skb.
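
For illustration, a sketch of calling the helper from a sch_bpf enqueue
program; the BPF-side prototype is an assumption derived from the kernel
signature (skb, ifindex, qdisc handle, returning the matched classid or
0), as this RFC does not add a uapi doc comment for the helper yet:

/* Assumes the patched <linux/bpf.h> for struct sch_bpf_ctx, SCH_BPF_*
 * and BPF_FUNC_skb_tc_classify; the helper declaration below is written
 * by hand.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

static __u32 (*bpf_skb_tc_classify)(void *skb, int ifindex, __u32 handle) =
	(void *)BPF_FUNC_skb_tc_classify;

SEC("qdisc/enqueue")
int enqueue_classify(struct sch_bpf_ctx *ctx)
{
	/* Run the TC filters attached to qdisc 1: on ifindex 2 and use
	 * whatever classid they return.
	 */
	__u32 classid = bpf_skb_tc_classify(ctx->skb, 2, 0x10000);

	if (!classid)
		return SCH_BPF_DROP;
	ctx->classid = classid;
	return SCH_BPF_PASS;
}

char _license[] SEC("license") = "GPL";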

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c        | 17 +++++++++-
 net/sched/cls_api.c      | 69 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c21fd1f189bc..7ed04736c4e4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5650,6 +5650,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
 	FN(user_ringbuf_drain),		\
+	FN(skb_tc_classify),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 7a271b77a2cc..d1ed60114794 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7926,6 +7926,21 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+const struct bpf_func_proto bpf_skb_tc_classify_proto __weak;
+
+static const struct bpf_func_proto *
+tc_qdisc_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+#ifdef CONFIG_NET_CLS_ACT
+	case BPF_FUNC_skb_tc_classify:
+		return &bpf_skb_tc_classify_proto;
+#endif
+	default:
+		return tc_cls_act_func_proto(func_id, prog);
+	}
+}
+
 static const struct bpf_func_proto *
 xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -10656,7 +10671,7 @@ const struct bpf_prog_ops tc_cls_act_prog_ops = {
 };
 
 const struct bpf_verifier_ops tc_qdisc_verifier_ops = {
-	.get_func_proto		= tc_cls_act_func_proto,
+	.get_func_proto		= tc_qdisc_func_proto,
 	.is_valid_access	= tc_cls_act_is_valid_access,
 	.convert_ctx_access	= tc_cls_act_convert_ctx_access,
 	.gen_prologue		= tc_cls_act_prologue,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 50566db45949..64470a8947b1 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -22,6 +22,7 @@
 #include <linux/idr.h>
 #include <linux/jhash.h>
 #include <linux/rculist.h>
+#include <linux/filter.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/netlink.h>
@@ -1655,6 +1656,74 @@ int tcf_classify(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(tcf_classify);
 
+#ifdef CONFIG_BPF_SYSCALL
+BPF_CALL_3(bpf_skb_tc_classify, struct sk_buff *, skb, int, ifindex, u32, handle)
+{
+	struct net *net = dev_net(skb->dev);
+	const struct Qdisc_class_ops *cops;
+	struct tcf_result res = {};
+	struct tcf_block *block;
+	struct tcf_chain *chain;
+	struct net_device *dev;
+	unsigned long cl = 0;
+	struct Qdisc *q;
+	int result;
+
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, ifindex);
+	if (!dev)
+		goto out;
+	q = qdisc_lookup_rcu(dev, handle);
+	if (!q)
+		goto out;
+
+	cops = q->ops->cl_ops;
+	if (!cops)
+		goto out;
+	if (!cops->tcf_block)
+		goto out;
+	if (TC_H_MIN(handle)) {
+		cl = cops->find(q, handle);
+		if (cl == 0)
+			goto out;
+	}
+	block = cops->tcf_block(q, cl, NULL);
+	if (!block)
+		goto out;
+
+	for (chain = tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain = tcf_get_next_chain(block, chain)) {
+		struct tcf_proto *tp;
+
+		result = tcf_classify(skb, NULL, tp, &res, false);
+		if (result  >= 0) {
+			switch (result) {
+			case TC_ACT_QUEUED:
+			case TC_ACT_STOLEN:
+			case TC_ACT_TRAP:
+				fallthrough;
+			case TC_ACT_SHOT:
+				rcu_read_unlock();
+				return 0;
+			}
+		}
+	}
+out:
+	rcu_read_unlock();
+	return res.class;
+}
+
+const struct bpf_func_proto bpf_skb_tc_classify_proto = {
+	.func		= bpf_skb_tc_classify,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+#endif
+
 struct tcf_chain_info {
 	struct tcf_proto __rcu **pprev;
 	struct tcf_proto __rcu *next;
-- 
2.34.1



* Re: [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc
  2022-10-05 17:17 [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
                   ` (4 preceding siblings ...)
  2022-10-05 17:17 ` [RFC Patch v6 5/5] net_sched: Introduce helper bpf_skb_tc_classify() Cong Wang
@ 2022-10-07 17:27 ` sdf
  5 siblings, 0 replies; 9+ messages in thread
From: sdf @ 2022-10-07 17:27 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, yangpeihao, toke, jhs, jiri, bpf, Cong Wang

On 10/05, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>

> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

> There are a few use cases:

> 1. Allow customizing Qdisc's in an easier way. So that people don't
>     have to write a complete Qdisc kernel module just to experiment
>     some new queuing theory.

> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>     is before enqueue, it is impossible to adjust those "tokens" after
>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>     be solved with a shared map between clsact and sch_bpf.

> 3. Replace qevents, as now the user gains much more control over the
>     skb and queues.

> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>     chain and block to reuse the TC filters, but they are too complicated
>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>     TC filters on _any_ Qdisc (even on a different netdev) to do the
>     classification.

> 5. Potentially pave a way for ingress to queue packets, although
>     current implementation is still only for egress.

> 6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
>     is already used by TCP to handle TCP retransmission.

> 7. Potentially pave a way for implementing eBPF based CPU scheduler,
>     because we already have task kptr and CFS is clearly based on rbtree
>     too.

> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, no matter
> in tree or out of tree. This is why I give up on PIFO which has
> serious limitations on the programmablity. More importantly, with
> rbtree, it is easy to implement PIFO logic too.

> Here is a summary of design decisions I made:

[..]

> 1. Avoid eBPF struct_ops, as it would be really hard to program
>     a Qdisc with this approach, literally all the struct Qdisc_ops
>     and struct Qdisc_class_ops are needed to implement. This is almost
>     as hard as programming a Qdisc kernel module.

Can you expand more on what's exactly 'hard'? We need to override
enqueue/dequeue mostly and that's it?

Does it make sense to at least do, say, sch_struct_ops? Maybe we can
have some higher-level struct_ops with enqueue/dequeue only for
bpf to override?
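
For concreteness, the minimal override surface being suggested could look
roughly like the sketch below, reusing the enqueue/dequeue signatures from
struct Qdisc_ops; the sch_bpf_ops name and the idea of a built-in shim are
hypothetical, nothing like this exists in the tree today.

  #include <net/sch_generic.h>

  /* Hypothetical kernel-side sketch: a struct_ops type that only asks BPF
   * to provide enqueue/dequeue, with the rest of Qdisc_ops filled in by a
   * thin built-in shim. */
  struct sch_bpf_ops {
          int              (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
                                      struct sk_buff **to_free);
          struct sk_buff  *(*dequeue)(struct Qdisc *sch);
  };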

> 2. Introduce a generic rbtree map, which supports kptr, so that skb's
>     can be stored inside.

There have been several rbtree proposals. Do we really need an rbtree
here (ignoring performance)? Can we demonstrate these new qdisc
capabilities with a hashtable?

>     a) As eBPF maps are not directly visible to the kernel, we have to
>     dump the stats via eBPF map API's instead of netlink.

>     b) The user-space is not allowed to read the entire packets, only  
> __sk_buff
>     itself is readable, because we don't have such a use case yet and it  
> would
>     require a different API to read the data, as map values have fixed  
> length.

[..]

>     c) Multi-queue support is implemented via map-in-map, in a similar
>     push/pop fasion based on rbtree too.

I took a brief look at sch_bpf.c and I don't see how that would work.
Can you expand more on that (+locking)?
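
For reference, one plausible BPF-side shape for the map-in-map wiring, sketched
with hash-of-maps as the outer map since patch 2/5 is not quoted here;
everything except BPF_MAP_TYPE_RBTREE is a guess.

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Sketch only, not taken from sch_bpf.c: one inner rbtree map per queue,
   * wrapped in a hash-of-maps outer map. BPF_MAP_TYPE_RBTREE comes from
   * patch 1/5; the key/value layout here is a guess. */
  struct rbtree_queue {
          __uint(type, BPF_MAP_TYPE_RBTREE);
          __uint(max_entries, 1024);
          __uint(key_size, sizeof(__u64));      /* e.g. rank/timestamp */
          __uint(value_size, sizeof(__u64));
  } q0 SEC(".maps");

  struct {
          __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
          __uint(max_entries, 16);              /* one rbtree per queue */
          __uint(key_size, sizeof(__u32));
          __array(values, struct rbtree_queue);
  } queues SEC(".maps");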

Also, in sch_bpf.c, you care about classid, why? Isn't the whole purpose
of the series to expand tc capabilities? Why are we still dragging
this classid?


>     d) Use the netdevice notifier to reset the packets inside skb map upon
>     NETDEV_DOWN event. (TODO: need to recognize kptr type)

[..]

> 3. Integrate with existing TC infra. For example, if the user doesn't want
>     to implement her own filters (e.g. a flow dissector), she should be  
> able
>     to re-use the existing TC filters. Another helper  
> bpf_skb_tc_classify() is
>     introduced for this purpose.

Supposedly, the idea is that bpf can implement all existing policies +
more, so why are we trying to re-use existing filters?

> Any high-level feedback is welcome. Please kindly ignore any coding  
> details
> until RFC tag is removed.

> TODO:
> 1. actually test it
> 2. write a document for this Qdisc

[..]

> 3. add test cases and sample code

I'd start with that; otherwise, it's not clear what the current
capabilities are.

> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: Jiri Pirko <jiri@resnulli.us>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
> v6: switch to kptr based approach

> v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
>      implement flow map as map-in-map
>      rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
>      clean up eBPF qdisc program context

> v4: get rid of PIFO, use rbtree directly

> v3: move priority queue from sch_bpf to skb map
>      introduce skb map and its helpers
>      introduce bpf_skb_classify()
>      use netdevice notifier to reset skb's
>      Rebase on latest bpf-next

> v2: Rebase on latest net-next
>      Make the code more complete (but still incomplete)

> Cong Wang (5):
>    bpf: Introduce rbtree map
>    bpf: Add map in map support to rbtree
>    net_sched: introduce eBPF based Qdisc
>    net_sched: Add kfuncs for storing skb
>    net_sched: Introduce helper bpf_skb_tc_classify()

>   include/linux/bpf.h            |   4 +
>   include/linux/bpf_types.h      |   4 +
>   include/uapi/linux/bpf.h       |  19 ++
>   include/uapi/linux/pkt_sched.h |  17 +
>   kernel/bpf/Makefile            |   2 +-
>   kernel/bpf/rbtree.c            | 603 +++++++++++++++++++++++++++++++++
>   kernel/bpf/syscall.c           |   7 +
>   net/core/filter.c              |  27 ++
>   net/sched/Kconfig              |  15 +
>   net/sched/Makefile             |   1 +
>   net/sched/cls_api.c            |  69 ++++
>   net/sched/sch_bpf.c            | 564 ++++++++++++++++++++++++++++++
>   12 files changed, 1331 insertions(+), 1 deletion(-)
>   create mode 100644 kernel/bpf/rbtree.c
>   create mode 100644 net/sched/sch_bpf.c

> --
> 2.34.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC Patch v6 1/5] bpf: Introduce rbtree map
  2022-10-05 17:17 ` [RFC Patch v6 1/5] bpf: Introduce rbtree map Cong Wang
@ 2022-10-11 15:19   ` Kumar Kartikeya Dwivedi
  2022-10-11 17:05   ` Dave Marchevsky
  1 sibling, 0 replies; 9+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-10-11 15:19 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, yangpeihao, toke, jhs, jiri, bpf, sdf, Cong Wang, davemarchevsky

On Wed, 5 Oct 2022 at 22:53, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> From: Cong Wang <cong.wang@bytedance.com>
>
> Insert:
> bpf_map_update(&map, &key, &val, flag);
>
> Delete a specific key-val pair:
> bpf_map_delete_elem(&map, &key);
>
> Pop the minimum one:
> bpf_map_pop(&map, &val);
>
> Lookup:
> val = bpf_map_lookup_elem(&map, &key);
>
> Iterator:
> bpf_for_each_map_elem(&map, callback, key, val);
>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

Instead of a dedicated BPF map and using a kptr inside the map value, we
should probably lift Dave's series [0] adding the rbtree, and allow
linking sk_buff ctx directly into it. It would require recognising the
rb_node in sk_buff (or __sk_buff shadow struct) as a valid bpf_rb_node
similar to those in user allocated types. Overall it would be a much
better approach IMO and avoid having different rbtree implementations.
We would probably follow a similar approach for xdp_frame as well.

It can also be a union of bpf_rb_node, bpf_list_node, etc. Since the
type can only be in one collection at a time, it would allow it to
be linked into different types of structures without wasting any
space.

[0]: https://lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
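
Purely as an illustration of the union idea, with placeholder definitions
standing in for the verifier-recognized node types proposed in [0]; none of
these layouts or names are real.

  /* Illustrative only. The union captures the point that one object sits
   * in at most one collection at a time, so the node types can share space. */
  struct bpf_rb_node   { unsigned long __opaque[4]; };
  struct bpf_list_node { unsigned long __opaque[3]; };

  struct skb_node {                             /* hypothetical sk_buff shadow */
          union {
                  struct bpf_rb_node   rb;      /* queued in an rbtree */
                  struct bpf_list_node list;    /* queued in a list */
          };
          unsigned long long tstamp_ns;         /* example ordering key */
  };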

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC Patch v6 1/5] bpf: Introduce rbtree map
  2022-10-05 17:17 ` [RFC Patch v6 1/5] bpf: Introduce rbtree map Cong Wang
  2022-10-11 15:19   ` Kumar Kartikeya Dwivedi
@ 2022-10-11 17:05   ` Dave Marchevsky
  1 sibling, 0 replies; 9+ messages in thread
From: Dave Marchevsky @ 2022-10-11 17:05 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: yangpeihao, toke, jhs, jiri, bpf, sdf, Cong Wang,
	Kumar Kartikeya Dwivedi

Hi Cong,

On 10/5/22 1:17 PM, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Insert:
> bpf_map_update(&map, &key, &val, flag);
> 
> Delete a specific key-val pair:
> bpf_map_delete_elem(&map, &key);
> 
> Pop the minimum one:
> bpf_map_pop(&map, &val);
> 
> Lookup:
> val = bpf_map_lookup_elem(&map, &key);
> 
> Iterator:
> bpf_for_each_map_elem(&map, callback, key, val);
> 
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  include/linux/bpf_types.h |   1 +
>  include/uapi/linux/bpf.h  |   1 +
>  kernel/bpf/Makefile       |   2 +-
>  kernel/bpf/rbtree.c       | 445 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 448 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/rbtree.c
> 

I've also been working on an rbtree wrapper [0], with the goal of supporting
scheduler-like usecases as you mentioned in your cover letter. This morphed into
a larger discussion around "next-generation BPF datastructures" with Kumar,
Alexei, and others. Many of the points we reached consensus on are relevant to
the rbtree patches in this series; I'll try to link context where appropriate.

For next-gen BPF datastructures, we'd like to move away from exposing them as
BPF_MAPs with standard bpf helpers to access / manipulate. Instead, they should
be exposed as kptrs and manipulated by unstable kfuncs. kptr representing the
datastructure can live inside map_value of existing BPF_MAP - e.g. inside
ARRAY map_value as a global value. This means that next-gen datastructures can
be added - and their API changed post-addition - without UAPI changes. For
rbtree this is particularly desired as a few other folks also want a BPF rbtree
and we'd like to make sure as many usecases as possible are supported.
Discussion around this started in the initial rbtree RFC [1] (specifically patch 5's
thread), and Alexei mentioned it during his LPC 2022 talk (slides [2], video [3];
slides 22-end are particularly relevant).

This next-gen style also allows us to move locks and locking behavior out of
helpers and into the BPF program itself, with the verifier checking that the
correct lock is held when the datastructure is manipulated. This discussion started
in [1] and continued in Kumar's [4] and my [0], both of which implement such
behavior. David Vernet's LWN article [5] touches on this in "Plans for the
future" section as well.
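
As a rough illustration of that shape, the sketch below keeps the tree root and
its lock in a map value and calls a hypothetical kfunc under the lock; none of
the rbtree type or kfunc names below exist at this point, they are placeholders
only.

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Placeholder types and kfunc so the sketch is self-contained. */
  struct bpf_rb_root { __u64 __opaque[2]; };
  struct bpf_rb_node { __u64 __opaque[4]; };
  extern int bpf_rbtree_add(struct bpf_rb_root *root,
                            struct bpf_rb_node *node) __ksym;

  struct tree_val {
          struct bpf_spin_lock lock;    /* lock owned by the program */
          struct bpf_rb_root root;      /* datastructure lives in a map value */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, struct tree_val);
  } tree_map SEC(".maps");

  SEC("tc")
  int add_example(struct __sk_buff *skb)
  {
          struct bpf_rb_node *n = NULL; /* allocation is out of scope here */
          __u32 zero = 0;
          struct tree_val *v;

          v = bpf_map_lookup_elem(&tree_map, &zero);
          if (!v || !n)
                  return 0;

          bpf_spin_lock(&v->lock);
          bpf_rbtree_add(&v->root, n);  /* verifier would require v->lock here */
          bpf_spin_unlock(&v->lock);
          return 0;
  }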

Kumar's recently-submitted evolution of [4], [6], implements the above as
well as other groundwork necessary for next-gen datastructures and a next-gen
linked list implementation. I've been modifying my rbtree implementation to
meet next-gen datastructure expectations - specifically kptr and kfunc, as it's
still using BPF_MAP/helpers like this series - and should be able to submit it
soon.

I'd like to make sure my impl works for you as well, so there are some questions
about approach below. I didn't focus too much on coding details, as requested
in the cover letter.

  [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
  [1]: lore.kernel.org/bpf/20220722183438.3319790-1-davemarchevsky@fb.com
  [2]: lpc.events/event/16/contributions/1346/attachments/1021/1966/bpf_LPC_2022.pdf
  [3]: https://www.youtube.com/watch?v=andvNRUAAs0&t=91s
  [4]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com
  [5]: lwn.net/Articles/909095/
  [6]: lore.kernel.org/bpf/20221011012240.3149-1-memxor@gmail.com

> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 2c6a4f2562a7..c53ba6de1613 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -127,6 +127,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE, rbtree_map_ops)
>  
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 51b9aa640ad2..9492cd3af701 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -935,6 +935,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_TASK_STORAGE,
>  	BPF_MAP_TYPE_BLOOM_FILTER,
>  	BPF_MAP_TYPE_USER_RINGBUF,
> +	BPF_MAP_TYPE_RBTREE,
>  };
>  
>  /* Note that tracing related programs such as
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 341c94f208f4..e60249258c74 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -7,7 +7,7 @@ endif
>  CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>  
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o rbtree.o
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>  obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
> diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
> new file mode 100644
> index 000000000000..f1a9b1c40b8b
> --- /dev/null
> +++ b/kernel/bpf/rbtree.c
> @@ -0,0 +1,445 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * rbtree.c: eBPF rbtree map
> + *
> + * Copyright (C) 2022, ByteDance, Cong Wang <cong.wang@bytedance.com>
> + */
> +#include <linux/bpf.h>
> +#include <linux/slab.h>
> +#include <linux/capability.h>
> +#include <linux/rbtree.h>
> +#include <linux/btf_ids.h>
> +#include <linux/bpf_mem_alloc.h>
> +#include <linux/math.h>
> +#include <linux/seq_file.h>
> +
> +#define RBTREE_CREATE_FLAG_MASK \
> +	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
> +
> +/* each rbtree element is struct rbtree_elem + key + value */
> +struct rbtree_elem {
> +	struct rb_node rbnode;
> +	char key[] __aligned(8);

My impl takes a different approach to the key, instead allowing the user to do

  struct node_data {
	  struct rb_node node;
	  __u32 key;
	  __u32 val;
  };

in their BPF program, similar to normal kernel style. This has the downside
of requiring custom comparator callbacks for any operation requiring traversal,
which prevents common lookup/update/delete helpers from being used. But next-gen
kptr / kfunc datastructure style won't use common helpers anyways. Will
something like this work for your usecase?
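
For illustration, the comparator-callback style this implies looks roughly like
the sketch below, reusing the node_data layout above; the callback name and how
it would be passed to the tree operations are not final.

  /* Sketch: ordering comes from a program-supplied comparator on the
   * node_data above, rather than a memcmp() on a fixed-size key. */
  static bool node_less(const struct node_data *a, const struct node_data *b)
  {
          return a->key < b->key;       /* program-defined ordering */
  }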

> +};
> +
> +struct rbtree_map {
> +	struct bpf_map map;
> +	struct bpf_mem_alloc ma;
> +	raw_spinlock_t lock;
> +	struct rb_root root;

In my RFCv1, Alexei suggested having the underlying rb_root be the "_cached"
version as it's likely that _cached behavior will be better for most usecases
by default. Do you need the raw rb_root?
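
For reference, a sketch of what switching to the _cached variant would look
like; rb_root_cached keeps a pointer to the leftmost node, which is exactly the
pop-minimum path a priority-queue style qdisc hits on every dequeue.

  #include <linux/rbtree.h>

  /* Sketch: with rb_root_cached, peeking or popping the minimum element
   * does not walk down the tree, because the leftmost node is cached. */
  struct rbtree_map_cached {
          struct rb_root_cached root;   /* instead of struct rb_root */
  };

  static struct rb_node *peek_min(struct rbtree_map_cached *rb)
  {
          return rb_first_cached(&rb->root);    /* O(1) */
  }

  /* Insert/erase would use the _cached variants to keep the cache valid:
   *   rb_insert_color_cached(&e->rbnode, &rb->root, leftmost);
   *   rb_erase_cached(&e->rbnode, &rb->root);
   */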


> +	atomic_t nr_entries;

Is nr_entries used anywhere? I could only find incr/decr in this series.

> +};
> +
> +#define rb_to_elem(rb) rb_entry_safe(rb, struct rbtree_elem, rbnode)
> +#define elem_rb_first(root) rb_to_elem(rb_first(root))
> +#define elem_rb_last(root)  rb_to_elem(rb_last(root))
> +#define elem_rb_next(e)   rb_to_elem(rb_next(&(e)->rbnode))
> +#define rbtree_walk_safe(e, tmp, root)					\
> +		for (e = elem_rb_first(root);				\
> +		     tmp = e ? elem_rb_next(e) : NULL, (e != NULL);	\
> +		     e = tmp)
> +
> +static struct rbtree_map *rbtree_map(struct bpf_map *map)
> +{
> +	return container_of(map, struct rbtree_map, map);
> +}
> +
> +/* Called from syscall */
> +static int rbtree_map_alloc_check(union bpf_attr *attr)
> +{
> +	if (!bpf_capable())
> +		return -EPERM;
> +
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 ||
> +	    attr->map_flags & ~RBTREE_CREATE_FLAG_MASK ||
> +	    !bpf_map_flags_access_ok(attr->map_flags))
> +		return -EINVAL;
> +
> +	if (attr->value_size > KMALLOC_MAX_SIZE)
> +		/* if value_size is bigger, the user space won't be able to
> +		 * access the elements.
> +		 */
> +		return -E2BIG;
> +
> +	return 0;
> +}
> +
> +static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
> +{
> +	int numa_node = bpf_map_attr_numa_node(attr);
> +	struct rbtree_map *rb;
> +	u32 elem_size;
> +	int err;
> +
> +	rb = bpf_map_area_alloc(sizeof(*rb), numa_node);
> +	if (!rb)
> +		return ERR_PTR(-ENOMEM);
> +
> +	memset(rb, 0, sizeof(*rb));
> +	bpf_map_init_from_attr(&rb->map, attr);
> +	raw_spin_lock_init(&rb->lock);
> +	rb->root = RB_ROOT;
> +	atomic_set(&rb->nr_entries, 0);
> +
> +	elem_size = sizeof(struct rbtree_elem) +
> +			  round_up(rb->map.key_size, 8);
> +	elem_size += round_up(rb->map.value_size, 8);

Will your usecase's rbtree values always have the same size?

> +	err = bpf_mem_alloc_init(&rb->ma, elem_size, false);

Although my most recently-sent rbtree patchset doesn't use the allocator Alexei
added, next version will.

> +	if (err) {
> +		bpf_map_area_free(rb);
> +		return ERR_PTR(err);
> +	}
> +	return &rb->map;
> +}
> +
> +static void check_and_free_fields(struct rbtree_map *rb,
> +				  struct rbtree_elem *elem)
> +{
> +	void *map_value = elem->key + round_up(rb->map.key_size, 8);
> +
> +	if (map_value_has_kptrs(&rb->map))
> +		bpf_map_free_kptrs(&rb->map, map_value);
> +}
> +
> +static void rbtree_map_purge(struct bpf_map *map)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e, *tmp;
> +
> +	rbtree_walk_safe(e, tmp, &rb->root) {
> +		rb_erase(&e->rbnode, &rb->root);
> +		check_and_free_fields(rb, e);
> +		bpf_mem_cache_free(&rb->ma, e);
> +	}
> +}
> +
> +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
> +static void rbtree_map_free(struct bpf_map *map)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&rb->lock, flags);
> +	rbtree_map_purge(map);
> +	raw_spin_unlock_irqrestore(&rb->lock, flags);
> +	bpf_mem_alloc_destroy(&rb->ma);
> +	bpf_map_area_free(rb);
> +}
> +
> +static struct rbtree_elem *bpf_rbtree_find(struct rb_root *root, void *key, int size)
> +{
> +	struct rb_node **p = &root->rb_node;
> +	struct rb_node *parent = NULL;
> +	struct rbtree_elem *e;
> +
> +	while (*p) {
> +		int ret;
> +
> +		parent = *p;
> +		e = rb_to_elem(parent);
> +		ret = memcmp(key, e->key, size);
> +		if (ret < 0)
> +			p = &parent->rb_left;
> +		else if (ret > 0)
> +			p = &parent->rb_right;
> +		else
> +			return e;
> +	}
> +	return NULL;
> +}
> +
> +/* Called from eBPF program or syscall */
> +static void *rbtree_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e;
> +
> +	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
> +	if (!e)
> +		return NULL;
> +	return e->key + round_up(rb->map.key_size, 8);
> +}
> +
> +static int check_flags(struct rbtree_elem *old, u64 map_flags)
> +{
> +	if (old && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST)
> +		/* elem already exists */
> +		return -EEXIST;
> +
> +	if (!old && (map_flags & ~BPF_F_LOCK) == BPF_EXIST)
> +		/* elem doesn't exist, cannot update it */
> +		return -ENOENT;
> +
> +	return 0;
> +}
> +
> +static void rbtree_map_insert(struct rbtree_map *rb, struct rbtree_elem *e)
> +{
> +	struct rb_root *root = &rb->root;
> +	struct rb_node **p = &root->rb_node;
> +	struct rb_node *parent = NULL;
> +	struct rbtree_elem *e1;
> +
> +	while (*p) {
> +		parent = *p;
> +		e1 = rb_to_elem(parent);
> +		if (memcmp(e->key, e1->key, rb->map.key_size) < 0)
> +			p = &parent->rb_left;
> +		else
> +			p = &parent->rb_right;
> +	}
> +	rb_link_node(&e->rbnode, parent, p);
> +	rb_insert_color(&e->rbnode, root);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int rbtree_map_update_elem(struct bpf_map *map, void *key, void *value,
> +			       u64 map_flags)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	void *val = rbtree_map_lookup_elem(map, key);
> +	int ret;
> +
> +	ret = check_flags(val, map_flags);
> +	if (ret)
> +		return ret;
> +
> +	if (!val) {
> +		struct rbtree_elem *e_new;
> +		unsigned long flags;
> +
> +		e_new = bpf_mem_cache_alloc(&rb->ma);
> +		if (!e_new)
> +			return -ENOMEM;
> +		val = e_new->key + round_up(rb->map.key_size, 8);
> +		check_and_init_map_value(&rb->map, val);
> +		memcpy(e_new->key, key, rb->map.key_size);
> +		raw_spin_lock_irqsave(&rb->lock, flags);
> +		rbtree_map_insert(rb, e_new);
> +		raw_spin_unlock_irqrestore(&rb->lock, flags);
> +		atomic_inc(&rb->nr_entries);
> +	}
> +
> +	if (map_flags & BPF_F_LOCK)
> +		copy_map_value_locked(map, val, value, false);
> +	else
> +		copy_map_value(map, val, value);
> +	return 0;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int rbtree_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e;
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&rb->lock, flags);
> +	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
> +	if (!e) {
> +		raw_spin_unlock_irqrestore(&rb->lock, flags);
> +		return -ENOENT;
> +	}
> +	rb_erase(&e->rbnode, &rb->root);
> +	raw_spin_unlock_irqrestore(&rb->lock, flags);
> +	check_and_free_fields(rb, e);
> +	bpf_mem_cache_free(&rb->ma, e);
> +	atomic_dec(&rb->nr_entries);
> +	return 0;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int rbtree_map_pop_elem(struct bpf_map *map, void *value)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e = elem_rb_first(&rb->root);
> +	unsigned long flags;
> +	void *val;
> +
> +	if (!e)
> +		return -ENOENT;
> +	raw_spin_lock_irqsave(&rb->lock, flags);
> +	rb_erase(&e->rbnode, &rb->root);
> +	raw_spin_unlock_irqrestore(&rb->lock, flags);
> +	val = e->key + round_up(rb->map.key_size, 8);
> +	copy_map_value(map, value, val);
> +	check_and_free_fields(rb, e);
> +	bpf_mem_cache_free(&rb->ma, e);
> +	atomic_dec(&rb->nr_entries);
> +	return 0;
> +}
> +
> +/* Called from syscall */
> +static int rbtree_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e;
> +
> +	if (!key) {
> +		e = elem_rb_first(&rb->root);
> +		if (!e)
> +			return -ENOENT;
> +		goto found;
> +	}
> +	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
> +	if (!e)
> +		return -ENOENT;
> +	e = elem_rb_next(e);
> +	if (!e)
> +		return 0;
> +found:
> +	memcpy(next_key, e->key, map->key_size);
> +	return 0;
> +}
> +
> +static int bpf_for_each_rbtree_map(struct bpf_map *map,
> +				   bpf_callback_t callback_fn,
> +				   void *callback_ctx, u64 flags)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e, *tmp;
> +	void *key, *value;
> +	u32 num_elems = 0;
> +	u64 ret = 0;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	rbtree_walk_safe(e, tmp, &rb->root) {
> +		num_elems++;
> +		key = e->key;
> +		value = key + round_up(rb->map.key_size, 8);
> +		ret = callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value,
> +				  (u64)(long)callback_ctx, 0);
> +		/* return value: 0 - continue, 1 - stop and return */
> +		if (ret)
> +			break;
> +	}
> +
> +	return num_elems;
> +}
> +
> +struct rbtree_map_seq_info {
> +	struct bpf_map *map;
> +	struct rbtree_map *rb;
> +};
> +
> +static void *rbtree_map_seq_find_next(struct rbtree_map_seq_info *info,
> +				      struct rbtree_elem *prev_elem)
> +{
> +	const struct rbtree_map *rb = info->rb;
> +	struct rbtree_elem *elem;
> +
> +	/* try to find next elem in the same bucket */
> +	if (prev_elem) {
> +		elem = elem_rb_next(prev_elem);
> +		if (elem)
> +			return elem;
> +		return NULL;
> +	}
> +
> +	return elem_rb_first(&rb->root);
> +}
> +
> +static void *rbtree_map_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct rbtree_map_seq_info *info = seq->private;
> +
> +	if (*pos == 0)
> +		++*pos;
> +
> +	/* pairs with rbtree_map_seq_stop */
> +	rcu_read_lock();
> +	return rbtree_map_seq_find_next(info, NULL);
> +}
> +
> +static void *rbtree_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct rbtree_map_seq_info *info = seq->private;
> +
> +	++*pos;
> +	return rbtree_map_seq_find_next(info, v);
> +}
> +
> +static int rbtree_map_seq_show(struct seq_file *seq, void *v)
> +{
> +	struct rbtree_map_seq_info *info = seq->private;
> +	struct bpf_iter__bpf_map_elem ctx = {};
> +	struct rbtree_elem *elem = v;
> +	struct bpf_iter_meta meta;
> +	struct bpf_prog *prog;
> +
> +	meta.seq = seq;
> +	prog = bpf_iter_get_info(&meta, !elem);
> +	if (!prog)
> +		return 0;
> +
> +	ctx.meta = &meta;
> +	ctx.map = info->map;
> +	if (elem) {
> +		ctx.key = elem->key;
> +		ctx.value = elem->key + round_up(info->map->key_size, 8);
> +	}
> +
> +	return bpf_iter_run_prog(prog, &ctx);
> +}
> +
> +static void rbtree_map_seq_stop(struct seq_file *seq, void *v)
> +{
> +	if (!v)
> +		(void)rbtree_map_seq_show(seq, NULL);
> +
> +	/* pairs with rbtree_map_seq_start */
> +	rcu_read_unlock();
> +}
> +
> +static const struct seq_operations rbtree_map_seq_ops = {
> +	.start	= rbtree_map_seq_start,
> +	.next	= rbtree_map_seq_next,
> +	.stop	= rbtree_map_seq_stop,
> +	.show	= rbtree_map_seq_show,
> +};
> +
> +static int rbtree_map_init_seq_private(void *priv_data,
> +				       struct bpf_iter_aux_info *aux)
> +{
> +	struct rbtree_map_seq_info *info = priv_data;
> +
> +	bpf_map_inc_with_uref(aux->map);
> +	info->map = aux->map;
> +	info->rb = rbtree_map(info->map);
> +	return 0;
> +}
> +
> +static void rbtree_map_fini_seq_private(void *priv_data)
> +{
> +	struct rbtree_map_seq_info *info = priv_data;
> +
> +	bpf_map_put_with_uref(info->map);
> +}
> +
> +static const struct bpf_iter_seq_info rbtree_map_iter_seq_info = {
> +	.seq_ops		= &rbtree_map_seq_ops,
> +	.init_seq_private	= rbtree_map_init_seq_private,
> +	.fini_seq_private	= rbtree_map_fini_seq_private,
> +	.seq_priv_size		= sizeof(struct rbtree_map_seq_info),
> +};
> +
> +BTF_ID_LIST_SINGLE(rbtree_map_btf_ids, struct, rbtree_map)
> +const struct bpf_map_ops rbtree_map_ops = {
> +	.map_meta_equal = bpf_map_meta_equal,
> +	.map_alloc_check = rbtree_map_alloc_check,
> +	.map_alloc = rbtree_map_alloc,
> +	.map_free = rbtree_map_free,
> +	.map_lookup_elem = rbtree_map_lookup_elem,
> +	.map_update_elem = rbtree_map_update_elem,
> +	.map_delete_elem = rbtree_map_delete_elem,
> +	.map_pop_elem = rbtree_map_pop_elem,
> +	.map_get_next_key = rbtree_map_get_next_key,
> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
> +	.map_for_each_callback = bpf_for_each_rbtree_map,
> +	.map_btf_id = &rbtree_map_btf_ids[0],
> +	.iter_seq_info = &rbtree_map_iter_seq_info,
> +};
> +

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-10-11 17:05 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-05 17:17 [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
2022-10-05 17:17 ` [RFC Patch v6 1/5] bpf: Introduce rbtree map Cong Wang
2022-10-11 15:19   ` Kumar Kartikeya Dwivedi
2022-10-11 17:05   ` Dave Marchevsky
2022-10-05 17:17 ` [RFC Patch v6 2/5] bpf: Add map in map support to rbtree Cong Wang
2022-10-05 17:17 ` [RFC Patch v6 3/5] net_sched: introduce eBPF based Qdisc Cong Wang
2022-10-05 17:17 ` [RFC Patch v6 4/5] net_sched: Add kfuncs for storing skb Cong Wang
2022-10-05 17:17 ` [RFC Patch v6 5/5] net_sched: Introduce helper bpf_skb_tc_classify() Cong Wang
2022-10-07 17:27 ` [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc sdf
