* [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map
@ 2022-07-22 18:34 Dave Marchevsky
  2022-07-22 18:34 ` [RFC PATCH bpf-next 01/11] bpf: Pull repeated reg access bounds check into helper fn Dave Marchevsky
                   ` (13 more replies)
  0 siblings, 14 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

Introduce bpf_rbtree map data structure. As the name implies, rbtree map
allows bpf programs to use red-black trees similarly to kernel code.
Programs interact with rbtree maps in a much more open-coded way than
with classic map implementations. Some example code to demonstrate:

  node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
  if (!node)
    return 0;

  node->one = calls;
  node->two = 6;
  bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));

  ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
  if (!ret) {
    bpf_rbtree_free_node(&rbtree, node);
    goto unlock_ret;
  }

unlock_ret:
  bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
  return 0;


This series is in a heavy RFC state, with some added verifier semantics
needing improvement before they can be considered safe. I am sending it
early to gather feedback on the approach:

  * Does the API seem reasonable and might it be useful for others?

  * Do new verifier semantics added in this series make logical sense?
    Are there any glaring safety holes aside from those called out in
    individual patches?

Please see individual patches for more in-depth explanation. A quick
summary of patches follows:


Patches 1-3 extend verifier and BTF searching logic in minor ways to
prepare for rbtree implementation patch.
  bpf: Pull repeated reg access bounds check into helper fn
  bpf: Add verifier support for custom callback return range
  bpf: Add rb_node_off to bpf_map


Patch 4 adds basic rbtree map implementation.
  bpf: Add rbtree map

Note that the 'complete' implementation requires concepts and changes
introduced in later patches in the series. The series is currently
arranged this way to ease RFC review.


Patches 5-7 add a spinlock to the rbtree map, with some differing
semantics from existing verifier spinlock handling.
  bpf: Add bpf_spin_lock member to rbtree
  bpf: Add bpf_rbtree_{lock,unlock} helpers
  bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}

Notably, rbtree's bpf_spin_lock must be held while manipulating the tree
via helpers, while existing verifier spinlock logic prevents any helper
calls while a lock is held. In the current state this is worked around by
not having the verifier treat rbtree's lock specially in any way. This
needs to be improved before leaving RFC state, as it's unsafe.


Patch 8 adds the concept of non-owning references, firming up the
semantics of helpers that return a pointer to a node which is owned by
a rbtree. See patch 4's summary for additional discussion of node
ownership.
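
As a sketch (reusing the node type from the example above; 'key', 'sum'
and the 'cmp' comparator are hypothetical), bpf_rbtree_find returns a
pointer to a node that remains owned by the tree, so the program may
read through it but does not need to release it:

  struct node_data *found;

  found = (struct node_data *)bpf_rbtree_find(&rbtree, &key, cmp);
  if (found)
    sum += found->two;  /* non-owning ref: no free_node/add required */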


Patch 9 adds a 'conditional release' concept: helpers which release a
resource but may fail to do so, in which case the verifier must enforce
that the BPF program handles the failure appropriately, namely by freeing
the resource another way. The example at the top of this letter shows the
expected handling: if bpf_rbtree_add fails, the node is freed via
bpf_rbtree_free_node.


Patch 10 adds 'iter' type flags which teach the verifier to understand
open-coded iteration of a data structure. Specifically, with such flags
the verifier can understand that this loop eventually ends:

  struct node_data *iter = (struct node_data *)bpf_rbtree_first(&rbtree);

  while (iter) {
    node_ct++;
    iter = (struct node_data *)bpf_rbtree_next(&rbtree, iter);
  }

NOTE: Patch 10's logic is currently very unsafe and it's unclear whether
there's a safe path forward that isn't too complex. It's the most RFC-ey
of all the patches.


Patch 11 adds tests. This is the best place to start to see BPF programs
using the rbtree map as intended.


This series is based on top of the "bpf: Cleanup check_refcount_ok"
patch, which was submitted separately [0] and is therefore not included
here. That patch is likely to be applied before this series leaves RFC
state, at which point this will simply be rebased onto newer
bpf-next/master.

  [0]: lore.kernel.org/bpf/20220719215536.2787530-1-davemarchevsky@fb.com/

Dave Marchevsky (11):
  bpf: Pull repeated reg access bounds check into helper fn
  bpf: Add verifier support for custom callback return range
  bpf: Add rb_node_off to bpf_map
  bpf: Add rbtree map
  bpf: Add bpf_spin_lock member to rbtree
  bpf: Add bpf_rbtree_{lock,unlock} helpers
  bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}
  bpf: Add OBJ_NON_OWNING_REF type flag
  bpf: Add CONDITIONAL_RELEASE type flag
  bpf: Introduce PTR_ITER and PTR_ITER_END type flags
  selftests/bpf: Add rbtree map tests

 include/linux/bpf.h                           |  13 +
 include/linux/bpf_types.h                     |   1 +
 include/linux/bpf_verifier.h                  |   2 +
 include/linux/btf.h                           |   1 +
 include/uapi/linux/bpf.h                      | 121 ++++++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/btf.c                              |  21 +
 kernel/bpf/helpers.c                          |  42 +-
 kernel/bpf/rbtree.c                           | 401 ++++++++++++++++++
 kernel/bpf/syscall.c                          |   3 +
 kernel/bpf/verifier.c                         | 382 +++++++++++++++--
 tools/include/uapi/linux/bpf.h                | 121 ++++++
 .../selftests/bpf/prog_tests/rbtree_map.c     | 164 +++++++
 .../testing/selftests/bpf/progs/rbtree_map.c  | 108 +++++
 .../selftests/bpf/progs/rbtree_map_fail.c     | 236 +++++++++++
 .../bpf/progs/rbtree_map_load_fail.c          |  24 ++
 16 files changed, 1605 insertions(+), 37 deletions(-)
 create mode 100644 kernel/bpf/rbtree.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_fail.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c

-- 
2.30.2



* [RFC PATCH bpf-next 01/11] bpf: Pull repeated reg access bounds check into helper fn
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-07-22 18:34 ` [RFC PATCH bpf-next 02/11] bpf: Add verifier support for custom callback return range Dave Marchevsky
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

The verifier's check_map_access function has some duplicated bounds
checking arithmetic to determine whether a map element access might
touch a protected field (bpf_spin_lock and bpf_timer). Pull this logic
out into a new reg_access_may_touch_field helper function to improve
the code's readability.
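
To illustrate the factored-out check (a standalone sketch, not kernel
code): an access [x1, x2) overlaps a field [y1, y2) iff
x1 < y2 && y1 < x2. For example, an 8-byte access at offset 4 touches a
4-byte bpf_spin_lock at offset 8, since 4 < 12 && 8 < 12:

  static int ranges_overlap(unsigned int x1, unsigned int x2,
                            unsigned int y1, unsigned int y2)
  {
          return x1 < y2 && y1 < x2;
  }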

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 kernel/bpf/verifier.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8ceccc420db3..c5fbaa9f025a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3783,6 +3783,19 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 	return 0;
 }
 
+/* if any part of struct field can be touched by
+ * load/store reject this program.
+ * To check that [x1, x2) overlaps with [y1, y2)
+ * it is sufficient to check x1 < y2 && y1 < x2.
+ */
+static bool reg_access_may_touch_field(const struct bpf_reg_state *reg,
+				       u32 access_off, size_t access_sz,
+				       u32 field_off, size_t field_sz)
+{
+	return reg->smin_value + access_off < field_off + field_sz &&
+			field_off < reg->umax_value + access_off + access_sz;
+}
+
 /* check read/write into a map element with possible variable offset */
 static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			    int off, int size, bool zero_size_allowed,
@@ -3800,15 +3813,9 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 		return err;
 
 	if (map_value_has_spin_lock(map)) {
-		u32 lock = map->spin_lock_off;
+		u32 t = map->spin_lock_off;
 
-		/* if any part of struct bpf_spin_lock can be touched by
-		 * load/store reject this program.
-		 * To check that [x1, x2) overlaps with [y1, y2)
-		 * it is sufficient to check x1 < y2 && y1 < x2.
-		 */
-		if (reg->smin_value + off < lock + sizeof(struct bpf_spin_lock) &&
-		     lock < reg->umax_value + off + size) {
+		if (reg_access_may_touch_field(reg, off, size, t, sizeof(struct bpf_spin_lock))) {
 			verbose(env, "bpf_spin_lock cannot be accessed directly by load/store\n");
 			return -EACCES;
 		}
@@ -3816,8 +3823,7 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 	if (map_value_has_timer(map)) {
 		u32 t = map->timer_off;
 
-		if (reg->smin_value + off < t + sizeof(struct bpf_timer) &&
-		     t < reg->umax_value + off + size) {
+		if (reg_access_may_touch_field(reg, off, size, t, sizeof(struct bpf_timer))) {
 			verbose(env, "bpf_timer cannot be accessed directly by load/store\n");
 			return -EACCES;
 		}
-- 
2.30.2



* [RFC PATCH bpf-next 02/11] bpf: Add verifier support for custom callback return range
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
  2022-07-22 18:34 ` [RFC PATCH bpf-next 01/11] bpf: Pull repeated reg access bounds check into helper fn Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-07-22 18:34 ` [RFC PATCH bpf-next 03/11] bpf: Add rb_node_off to bpf_map Dave Marchevsky
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

Verifier logic to confirm that a callback function returns 0 or 1 was
added in commit 69c087ba6225b ("bpf: Add bpf_for_each_map_elem() helper").
At the time, callback return value was only used to continue or stop
iteration.

In order to support callbacks with a broader return value range, such as
those added later in this series, add a callback_ret_range field to
bpf_func_state. Verifier helpers which set in_callback_fn will also set
the new field, which the verifier will later use to check return value
bounds.

Default to tnum_range(0, 1) instead of using tnum_unknown as a sentinel
value, as the latter would prevent the valid range (0, U64_MAX) from
being used.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf_verifier.h | 1 +
 kernel/bpf/verifier.c        | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 2e3bad8640dc..9c017575c034 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -237,6 +237,7 @@ struct bpf_func_state {
 	 */
 	u32 async_entry_cnt;
 	bool in_callback_fn;
+	struct tnum callback_ret_range;
 	bool in_async_callback_fn;
 
 	/* The following fields should be last. See copy_func_state() */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c5fbaa9f025a..1f50becce141 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1745,6 +1745,7 @@ static void init_func_state(struct bpf_verifier_env *env,
 	state->callsite = callsite;
 	state->frameno = frameno;
 	state->subprogno = subprogno;
+	state->callback_ret_range = tnum_range(0, 1);
 	init_reg_state(env, state);
 	mark_verifier_state_scratched(env);
 }
@@ -6880,6 +6881,7 @@ static int set_find_vma_callback_state(struct bpf_verifier_env *env,
 	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
 	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
 	callee->in_callback_fn = true;
+	callee->callback_ret_range = tnum_range(0, 1);
 	return 0;
 }
 
@@ -6907,7 +6909,7 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 	caller = state->frame[state->curframe];
 	if (callee->in_callback_fn) {
 		/* enforce R0 return value range [0, 1]. */
-		struct tnum range = tnum_range(0, 1);
+		struct tnum range = callee->callback_ret_range;
 
 		if (r0->type != SCALAR_VALUE) {
 			verbose(env, "R0 not a scalar value\n");
-- 
2.30.2



* [RFC PATCH bpf-next 03/11] bpf: Add rb_node_off to bpf_map
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
  2022-07-22 18:34 ` [RFC PATCH bpf-next 01/11] bpf: Pull repeated reg access bounds check into helper fn Dave Marchevsky
  2022-07-22 18:34 ` [RFC PATCH bpf-next 02/11] bpf: Add verifier support for custom callback return range Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-08-01 22:19   ` Alexei Starovoitov
  2022-07-22 18:34 ` [RFC PATCH bpf-next 04/11] bpf: Add rbtree map Dave Marchevsky
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

Similarly to other protected fields which might be in a map value -
bpf_spin_lock and bpf_timer - keep track of the existence and offset of
struct rb_node within the map value struct. This will allow later
patches in this series to prevent writes to the rb_node field.

Allowing BPF programs to modify the node struct's rb_node field would
allow parent and child node pointers to be changed, which could corrupt
an otherwise-valid rbtree.
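
As a sketch of the kind of map value this tracks (struct and field names
are illustrative, not taken from this series' selftests), a later patch
expects struct rb_node at offset 0 of the value type:

  struct node_data {
          struct rb_node node;  /* located via btf_find_rb_node(); writes rejected */
          __u64 key;
          __u64 payload;
  };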

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h  |  6 ++++++
 include/linux/btf.h  |  1 +
 kernel/bpf/btf.c     | 21 +++++++++++++++++++++
 kernel/bpf/syscall.c |  3 +++
 4 files changed, 31 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a97751d845c9..eb8c550db0e2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -214,6 +214,7 @@ struct bpf_map {
 	int spin_lock_off; /* >=0 valid offset, <0 error */
 	struct bpf_map_value_off *kptr_off_tab;
 	int timer_off; /* >=0 valid offset, <0 error */
+	int rb_node_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
 	u32 btf_key_type_id;
@@ -263,6 +264,11 @@ static inline bool map_value_has_kptrs(const struct bpf_map *map)
 	return !IS_ERR_OR_NULL(map->kptr_off_tab);
 }
 
+static inline bool map_value_has_rb_node(const struct bpf_map *map)
+{
+	return map->rb_node_off >= 0;
+}
+
 static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 {
 	if (unlikely(map_value_has_spin_lock(map)))
diff --git a/include/linux/btf.h b/include/linux/btf.h
index cdb376d53238..1d8b1bcf0396 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -152,6 +152,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 int btf_find_timer(const struct btf *btf, const struct btf_type *t);
+int btf_find_rb_node(const struct btf *btf, const struct btf_type *t);
 struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 					  const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 7ac971ea98d1..3a7096da7f20 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3195,6 +3195,7 @@ enum btf_field_type {
 	BTF_FIELD_SPIN_LOCK,
 	BTF_FIELD_TIMER,
 	BTF_FIELD_KPTR,
+	BTF_FIELD_RB_NODE,
 };
 
 enum {
@@ -3282,6 +3283,7 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
 		switch (field_type) {
 		case BTF_FIELD_SPIN_LOCK:
 		case BTF_FIELD_TIMER:
+		case BTF_FIELD_RB_NODE:
 			ret = btf_find_struct(btf, member_type, off, sz,
 					      idx < info_cnt ? &info[idx] : &tmp);
 			if (ret < 0)
@@ -3332,6 +3334,7 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
 		switch (field_type) {
 		case BTF_FIELD_SPIN_LOCK:
 		case BTF_FIELD_TIMER:
+		case BTF_FIELD_RB_NODE:
 			ret = btf_find_struct(btf, var_type, off, sz,
 					      idx < info_cnt ? &info[idx] : &tmp);
 			if (ret < 0)
@@ -3374,6 +3377,11 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
 		sz = sizeof(struct bpf_timer);
 		align = __alignof__(struct bpf_timer);
 		break;
+	case BTF_FIELD_RB_NODE:
+		name = "rb_node";
+		sz = sizeof(struct rb_node);
+		align = __alignof__(struct rb_node);
+		break;
 	case BTF_FIELD_KPTR:
 		name = NULL;
 		sz = sizeof(u64);
@@ -3420,6 +3428,19 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t)
 	return info.off;
 }
 
+int btf_find_rb_node(const struct btf *btf, const struct btf_type *t)
+{
+	struct btf_field_info info;
+	int ret;
+
+	ret = btf_find_field(btf, t, BTF_FIELD_RB_NODE, &info, 1);
+	if (ret < 0)
+		return ret;
+	if (!ret)
+		return -ENOENT;
+	return info.off;
+}
+
 struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 					  const struct btf_type *t)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 83c7136c5788..3947fbd137af 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1052,6 +1052,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 		}
 	}
 
+	map->rb_node_off = btf_find_rb_node(btf, value_type);
+
 	if (map->ops->map_check_btf) {
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
 		if (ret < 0)
@@ -1115,6 +1117,7 @@ static int map_create(union bpf_attr *attr)
 
 	map->spin_lock_off = -EINVAL;
 	map->timer_off = -EINVAL;
+	map->rb_node_off = -EINVAL;
 	if (attr->btf_key_type_id || attr->btf_value_type_id ||
 	    /* Even the map's value is a kernel's struct,
 	     * the bpf_prog.o must have BTF to begin with
-- 
2.30.2



* [RFC PATCH bpf-next 04/11] bpf: Add rbtree map
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (2 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 03/11] bpf: Add rb_node_off to bpf_map Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-08-01 21:49   ` Alexei Starovoitov
  2022-07-22 18:34 ` [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree Dave Marchevsky
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

Introduce a new map type, bpf_rbtree_map, allowing BPF programs to
create and manipulate rbtree data structures. bpf_rbtree_map differs
from 'classic' BPF map patterns in a few important ways:

  * The map does not allocate its own elements. Instead, BPF programs
    must call the bpf_rbtree_alloc_node helper to allocate a map elem -
    referred to as a 'node' from now on - and bpf_rbtree_add to add it
    to the rbtree. This means that rbtree maps can grow dynamically, do
    not preallocate, and that 'max_entries' has no meaning for rbtree
    maps.
    * Separating allocation and insertion allows alloc_node call to
      occur in contexts where it's safe to allocate.

  * It's possible to remove a node from a rbtree map with
    bpf_rbtree_remove helper.
    * Goal here is to allow a node to be removed from one rbtree and
      added to another
      [ NOTE: This functionality is still in progress ]

Helpers are added to manipulate nodes and trees:
  * bpf_rbtree_{alloc,free}_node: Allocate / free node structs
  * bpf_rbtree_{add,remove}: Add / remove nodes from rbtree maps
    * A comparator function is passed to bpf_rbtree_add in order to
      find the correct place to add the node.
  * bpf_rbtree_find: Find a node matching some condition in the rbtree
    * A comparator function is passed in order to determine whether a
      node matches what's being searched for.

bpf_rbtree_add is very similar to the 'map_push_elem' builtin, but since
the verifier needs special logic to set up the comparator callback, a new
helper is added. The same is true for bpf_rbtree_find and the
'map_lookup_elem' builtin.
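
As a sketch of the comparator callbacks (using the signatures described
above; the node type and field names are illustrative), a BPF program
might define:

  struct node_data {
          struct rb_node node;  /* at offset 0, so the casts below are valid */
          __u64 key;
  };

  /* 'less' callback for bpf_rbtree_add: true if a sorts before b */
  static bool less(struct rb_node *a, const struct rb_node *b)
  {
          return ((struct node_data *)a)->key <
                 ((const struct node_data *)b)->key;
  }

  /* comparator for bpf_rbtree_find: negative / zero / positive */
  static int cmp(const void *key, const struct rb_node *n)
  {
          __u64 want = *(const __u64 *)key;
          __u64 have = ((const struct node_data *)n)->key;

          return want < have ? -1 : (want > have ? 1 : 0);
  }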

In order to safely support separate allocation / insertion and passing
nodes between rbtrees, some invariants must hold:

  * A node that is not in a rbtree map must either be free'd or added to
    a rbtree map before the program terminates
    * Nodes are in this state when returned from bpf_rbtree_alloc_node
      or bpf_rbtree_remove.

If a node is in a rbtree map it is 'owned' by the map; otherwise it is
owned by the BPF program which holds a reference to it. The owner is
responsible for the lifetime of the node.

This matches existing acquire / release semantics well. alloc_node and
remove operations 'acquire' a node, while add and free_node operations
'release' it. The verifier enforces that acquired nodes are released
before the program terminates.
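
Concretely (a minimal sketch; 'less' is a comparator callback as shown in
the cover letter), every acquired node must reach one of the release
points before the program exits:

  node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data)); /* acquire */
  if (!node)
          return 0;

  if (!bpf_rbtree_add(&rbtree, node, less))     /* release: now owned by map */
          bpf_rbtree_free_node(&rbtree, node);  /* release: freed instead */
  return 0;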

Some other implementation details:

  * The value type of an rbtree map is expected to be a struct which
    contains 'struct rb_node' at offset 0.
  * BPF programs may not modify the node struct's rb_node field.
    * Otherwise the tree could be left in corrupted state
  * Rbtree map is value-only. Keys have no meaning
  * Since the value type is not known until the rbtree map is
    instantiated, helper protos have input and return type
    'struct rb_node *' which verifier replaces with actual
    map value type.
  * [ TODO: Existing logic prevents any writes to PTR_TO_BTF_ID. This is
    broadly turned off in this patch and replaced with "no writes to
    struct rb_node if the PTR_TO_BTF_ID struct has one". This is a hack
    and needs to be replaced. ]

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |  62 ++++++++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/helpers.c           |  15 ++
 kernel/bpf/rbtree.c            | 256 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          | 118 ++++++++++++++-
 tools/include/uapi/linux/bpf.h |  62 ++++++++
 7 files changed, 511 insertions(+), 5 deletions(-)
 create mode 100644 kernel/bpf/rbtree.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b9112b80171..78e9b5253983 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -126,6 +126,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE, rbtree_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ffcbf79a556b..4688ce88caf4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -909,6 +909,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_RBTREE,
 };
 
 /* Note that tracing related programs such as
@@ -5328,6 +5329,62 @@ union bpf_attr {
  *		**-EACCES** if the SYN cookie is not valid.
  *
  *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
+ *
+ * void *bpf_rbtree_alloc_node(struct bpf_map *map, u32 sz)
+ *	Description
+ *		Allocate a node of size *sz* bytes for use in rbtree *map*.
+ *
+ *		*sz* must be >= sizeof(struct rb_node)
+ *	Return
+ *		A pointer to the allocated node if successful, otherwise NULL.
+ *
+ *		The verifier considers the type of the returned pointer to be
+ *		the BTF id of *map*'s value type.
+ *
+ * void *bpf_rbtree_add(struct bpf_map *map, void *node, void *cb)
+ *	Description
+ *		Add *node* to rbtree *map* using *cb* comparator callback to
+ *		walk the rbtree.
+ *
+ *		*cb* must take (struct rb_node \*, const struct rb_node \*) as
+ *		input and return a bool signifying whether the first rb_node's
+ *		struct is less than the second.
+ *
+ *	Return
+ *		If success, returns a pointer to the added node in the rbtree.
+ *
+ *		If add fails, returns NULL
+ *
+ * long bpf_rbtree_find(struct bpf_map *map, void *key, void *cb)
+ *	Description
+ *		Find *key* in rbtree *map* using *cb* comparator callback to walk the
+ *		rbtree.
+ *
+ *		*cb* must take (const void \*key, const struct rb_node \*n) as
+ *		input and return an int. If *cb* determines *n* to match *key*, *cb* must
+ *		return 0. If larger, a positive int, and a negative int if smaller.
+ *
+ *		*key* does not need to be a rbtree node struct.
+ *
+ *	Return
+ *		Ptr to rbtree node if found, otherwise NULL.
+ *
+ * void *bpf_rbtree_remove(struct bpf_map *map, void *elem)
+ *	Description
+ *		Remove *elem* from rbtree *map*.
+ *
+ *	Return
+ *		If success, returns a pointer to the removed node.
+ *
+ *		If remove fails, returns NULL
+ *
+ * long bpf_rbtree_free_node(struct bpf_map *map, void *elem)
+ *	Description
+ *		Free rb_node that isn't associated w/ a tree.
+ *
+ *	Return
+ *		0
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5538,6 +5595,11 @@ union bpf_attr {
 	FN(tcp_raw_gen_syncookie_ipv6),	\
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
+	FN(rbtree_alloc_node),		\
+	FN(rbtree_add),		\
+	FN(rbtree_find),		\
+	FN(rbtree_remove),		\
+	FN(rbtree_free_node),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..00eedab3ad53 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -7,7 +7,7 @@ endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o rbtree.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a1c84d256f83..35eb66d11bf6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1582,6 +1582,11 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
 const struct bpf_func_proto bpf_probe_read_kernel_proto __weak;
 const struct bpf_func_proto bpf_probe_read_kernel_str_proto __weak;
 const struct bpf_func_proto bpf_task_pt_regs_proto __weak;
+const struct bpf_func_proto bpf_rbtree_alloc_node_proto __weak;
+const struct bpf_func_proto bpf_rbtree_add_proto __weak;
+const struct bpf_func_proto bpf_rbtree_find_proto __weak;
+const struct bpf_func_proto bpf_rbtree_remove_proto __weak;
+const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
 
 const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
@@ -1671,6 +1676,16 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_timer_cancel_proto;
 	case BPF_FUNC_kptr_xchg:
 		return &bpf_kptr_xchg_proto;
+	case BPF_FUNC_rbtree_alloc_node:
+		return &bpf_rbtree_alloc_node_proto;
+	case BPF_FUNC_rbtree_add:
+		return &bpf_rbtree_add_proto;
+	case BPF_FUNC_rbtree_find:
+		return &bpf_rbtree_find_proto;
+	case BPF_FUNC_rbtree_remove:
+		return &bpf_rbtree_remove_proto;
+	case BPF_FUNC_rbtree_free_node:
+		return &bpf_rbtree_free_node_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
new file mode 100644
index 000000000000..250d62210804
--- /dev/null
+++ b/kernel/bpf/rbtree.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#include <linux/filter.h>
+
+struct bpf_rbtree {
+	struct bpf_map map;
+	struct rb_root_cached root;
+};
+
+BTF_ID_LIST_SINGLE(bpf_rbtree_btf_ids, struct, rb_node);
+
+static int rbtree_map_alloc_check(union bpf_attr *attr)
+{
+	if (attr->max_entries || !attr->btf_value_type_id)
+		return -EINVAL;
+
+	return 0;
+}
+
+static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_rbtree *tree;
+	int numa_node;
+
+	if (!bpf_capable())
+		return ERR_PTR(-EPERM);
+
+	if (attr->value_size == 0)
+		return ERR_PTR(-EINVAL);
+
+	numa_node = bpf_map_attr_numa_node(attr);
+	tree = bpf_map_area_alloc(sizeof(*tree), numa_node);
+	if (!tree)
+		return ERR_PTR(-ENOMEM);
+
+	tree->root = RB_ROOT_CACHED;
+	bpf_map_init_from_attr(&tree->map, attr);
+	return &tree->map;
+}
+
+static struct rb_node *rbtree_map_alloc_node(struct bpf_map *map, size_t sz)
+{
+	struct rb_node *node;
+
+	node = bpf_map_kmalloc_node(map, sz, GFP_KERNEL, map->numa_node);
+	if (!node)
+		return NULL;
+	RB_CLEAR_NODE(node);
+	return node;
+}
+
+BPF_CALL_2(bpf_rbtree_alloc_node, struct bpf_map *, map, u32, sz)
+{
+	struct rb_node *node;
+
+	if (map->map_type != BPF_MAP_TYPE_RBTREE)
+		return (u64)NULL;
+
+	if (sz < sizeof(*node))
+		return (u64)NULL;
+
+	node = rbtree_map_alloc_node(map, (size_t)sz);
+	if (!node)
+		return (u64)NULL;
+
+	return (u64)node;
+}
+
+const struct bpf_func_proto bpf_rbtree_alloc_node_proto = {
+	.func = bpf_rbtree_alloc_node,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
+	.ret_btf_id = &bpf_rbtree_btf_ids[0],
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_CONST_ALLOC_SIZE_OR_ZERO,
+};
+
+BPF_CALL_3(bpf_rbtree_add, struct bpf_map *, map, void *, value, void *, cb)
+{
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+	struct rb_node *node = (struct rb_node *)value;
+
+	if (WARN_ON_ONCE(!RB_EMPTY_NODE(node)))
+		return (u64)NULL;
+
+	rb_add_cached(node, &tree->root, (bool (*)(struct rb_node *, const struct rb_node *))cb);
+	return (u64)node;
+}
+
+const struct bpf_func_proto bpf_rbtree_add_proto = {
+	.func = bpf_rbtree_add,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID | OBJ_RELEASE,
+	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
+	.arg3_type = ARG_PTR_TO_FUNC,
+};
+
+BPF_CALL_3(bpf_rbtree_find, struct bpf_map *, map, void *, key, void *, cb)
+{
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+
+	return (u64)rb_find(key, &tree->root.rb_root,
+			    (int (*)(const void *key,
+				     const struct rb_node *))cb);
+}
+
+const struct bpf_func_proto bpf_rbtree_find_proto = {
+	.func = bpf_rbtree_find,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
+	.ret_btf_id = &bpf_rbtree_btf_ids[0],
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_ANYTHING,
+	.arg3_type = ARG_PTR_TO_FUNC,
+};
+
+/* Like rbtree_postorder_for_each_entry_safe, but 'pos' and 'n' are
+ * 'rb_node *', so field name of rb_node within containing struct is not
+ * needed.
+ *
+ * Since bpf_rb_tree's node always has 'struct rb_node' at offset 0 it's
+ * not necessary to know field name or type of node struct
+ */
+#define bpf_rbtree_postorder_for_each_entry_safe(pos, n, root) \
+	for (pos = rb_first_postorder(root); \
+	     pos && ({ n = rb_next_postorder(pos); 1; }); \
+	     pos = n)
+
+static void rbtree_map_free(struct bpf_map *map)
+{
+	struct rb_node *pos, *n;
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+
+	bpf_rbtree_postorder_for_each_entry_safe(pos, n, &tree->root.rb_root)
+		kfree(pos);
+	bpf_map_area_free(tree);
+}
+
+static int rbtree_map_check_btf(const struct bpf_map *map,
+				const struct btf *btf,
+				const struct btf_type *key_type,
+				const struct btf_type *value_type)
+{
+	if (!map_value_has_rb_node(map))
+		return -EINVAL;
+
+	if (map->rb_node_off > 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int rbtree_map_push_elem(struct bpf_map *map, void *value, u64 flags)
+{
+	/* Use bpf_rbtree_add helper instead
+	 */
+	return -ENOTSUPP;
+}
+
+static int rbtree_map_pop_elem(struct bpf_map *map, void *value)
+{
+	return -ENOTSUPP;
+}
+
+static int rbtree_map_peek_elem(struct bpf_map *map, void *value)
+{
+	return -ENOTSUPP;
+}
+
+static void *rbtree_map_lookup_elem(struct bpf_map *map, void *value)
+{
+	/* Use bpf_rbtree_find helper instead
+	 */
+	return ERR_PTR(-ENOTSUPP);
+}
+
+static int rbtree_map_update_elem(struct bpf_map *map, void *key, void *value,
+				  u64 flags)
+{
+	return -ENOTSUPP;
+}
+
+static int rbtree_map_delete_elem(struct bpf_map *map, void *value)
+{
+	return -ENOTSUPP;
+}
+
+BPF_CALL_2(bpf_rbtree_remove, struct bpf_map *, map, void *, value)
+{
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+	struct rb_node *node = (struct rb_node *)value;
+
+	if (WARN_ON_ONCE(RB_EMPTY_NODE(node)))
+		return (u64)NULL;
+
+	rb_erase_cached(node, &tree->root);
+	RB_CLEAR_NODE(node);
+	return (u64)node;
+}
+
+const struct bpf_func_proto bpf_rbtree_remove_proto = {
+	.func = bpf_rbtree_remove,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
+	.ret_btf_id = &bpf_rbtree_btf_ids[0],
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID,
+	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
+};
+
+BPF_CALL_2(bpf_rbtree_free_node, struct bpf_map *, map, void *, value)
+{
+	struct rb_node *node = (struct rb_node *)value;
+
+	WARN_ON_ONCE(!RB_EMPTY_NODE(node));
+	kfree(node);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_rbtree_free_node_proto = {
+	.func = bpf_rbtree_free_node,
+	.gpl_only = true,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID | OBJ_RELEASE,
+	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
+};
+
+static int rbtree_map_get_next_key(struct bpf_map *map, void *key,
+				   void *next_key)
+{
+	return -ENOTSUPP;
+}
+
+BTF_ID_LIST_SINGLE(bpf_rbtree_map_btf_ids, struct, bpf_rbtree)
+const struct bpf_map_ops rbtree_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc_check = rbtree_map_alloc_check,
+	.map_alloc = rbtree_map_alloc,
+	.map_free = rbtree_map_free,
+	.map_get_next_key = rbtree_map_get_next_key,
+	.map_push_elem = rbtree_map_push_elem,
+	.map_peek_elem = rbtree_map_peek_elem,
+	.map_pop_elem = rbtree_map_pop_elem,
+	.map_lookup_elem = rbtree_map_lookup_elem,
+	.map_update_elem = rbtree_map_update_elem,
+	.map_delete_elem = rbtree_map_delete_elem,
+	.map_check_btf = rbtree_map_check_btf,
+	.map_btf_id = &bpf_rbtree_map_btf_ids[0],
+};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1f50becce141..535f673882cd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -481,7 +481,9 @@ static bool is_acquire_function(enum bpf_func_id func_id,
 	    func_id == BPF_FUNC_sk_lookup_udp ||
 	    func_id == BPF_FUNC_skc_lookup_tcp ||
 	    func_id == BPF_FUNC_ringbuf_reserve ||
-	    func_id == BPF_FUNC_kptr_xchg)
+	    func_id == BPF_FUNC_kptr_xchg ||
+	    func_id == BPF_FUNC_rbtree_alloc_node ||
+	    func_id == BPF_FUNC_rbtree_remove)
 		return true;
 
 	if (func_id == BPF_FUNC_map_lookup_elem &&
@@ -531,6 +533,20 @@ static bool is_cmpxchg_insn(const struct bpf_insn *insn)
 	       insn->imm == BPF_CMPXCHG;
 }
 
+static bool function_manipulates_rbtree_node(enum bpf_func_id func_id)
+{
+	return func_id == BPF_FUNC_rbtree_add ||
+		func_id == BPF_FUNC_rbtree_remove ||
+		func_id == BPF_FUNC_rbtree_free_node;
+}
+
+static bool function_returns_rbtree_node(enum bpf_func_id func_id)
+{
+	return func_id == BPF_FUNC_rbtree_alloc_node ||
+		func_id == BPF_FUNC_rbtree_add ||
+		func_id == BPF_FUNC_rbtree_remove;
+}
+
 /* string representation of 'enum bpf_reg_type'
  *
  * Note that reg_type_str() can not appear more than once in a single verbose()
@@ -3784,6 +3800,13 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 	return 0;
 }
 
+static bool access_may_touch_field(u32 access_off, size_t access_sz,
+				   u32 field_off, size_t field_sz)
+{
+	return access_off < field_off + field_sz &&
+		field_off < access_off + access_sz;
+}
+
 /* if any part of struct field can be touched by
  * load/store reject this program.
  * To check that [x1, x2) overlaps with [y1, y2)
@@ -4490,7 +4513,7 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 	const char *tname = btf_name_by_offset(reg->btf, t->name_off);
 	enum bpf_type_flag flag = 0;
 	u32 btf_id;
-	int ret;
+	int ret, rb_node_off;
 
 	if (off < 0) {
 		verbose(env,
@@ -4527,8 +4550,13 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 						  off, size, atype, &btf_id, &flag);
 	} else {
 		if (atype != BPF_READ) {
-			verbose(env, "only read is supported\n");
-			return -EACCES;
+			rb_node_off = btf_find_rb_node(reg->btf, t);
+			if (rb_node_off < 0 ||
+			    access_may_touch_field(off, size, rb_node_off,
+						   sizeof(struct rb_node))) {
+				verbose(env, "only read is supported\n");
+				return -EACCES;
+			}
 		}
 
 		ret = btf_struct_access(&env->log, reg->btf, t, off, size,
@@ -5764,6 +5792,17 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
 		if (meta->func_id == BPF_FUNC_kptr_xchg) {
 			if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno))
 				return -EACCES;
+		} else if (function_manipulates_rbtree_node(meta->func_id)) {
+			if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
+						  meta->map_ptr->btf,
+						  meta->map_ptr->btf_value_type_id,
+						  strict_type_match)) {
+				verbose(env, "rbtree: R%d is of type %s but %s is expected\n",
+					regno, kernel_type_name(reg->btf, reg->btf_id),
+					kernel_type_name(meta->map_ptr->btf,
+							 meta->map_ptr->btf_value_type_id));
+				return -EACCES;
+			}
 		} else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
 						 btf_vmlinux, *arg_btf_id,
 						 strict_type_match)) {
@@ -6369,10 +6408,17 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_map_pop_elem:
 		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
+		    map->map_type != BPF_MAP_TYPE_RBTREE &&
 		    map->map_type != BPF_MAP_TYPE_STACK)
 			goto error;
 		break;
 	case BPF_FUNC_map_peek_elem:
+		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
+		    map->map_type != BPF_MAP_TYPE_STACK &&
+		    map->map_type != BPF_MAP_TYPE_RBTREE &&
+		    map->map_type != BPF_MAP_TYPE_BLOOM_FILTER)
+			goto error;
+		break;
 	case BPF_FUNC_map_push_elem:
 		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
 		    map->map_type != BPF_MAP_TYPE_STACK &&
@@ -6828,6 +6874,57 @@ static int set_loop_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
+					 struct bpf_func_state *caller,
+					 struct bpf_func_state *callee,
+					 int insn_idx)
+{
+	struct bpf_map *map_ptr = caller->regs[BPF_REG_1].map_ptr;
+
+	/* bpf_rbtree_add(struct bpf_map *map, void *value, void *cb)
+	 * cb(struct rb_node *a, const struct rb_node *b);
+	 */
+	callee->regs[BPF_REG_1].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_3]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	return 0;
+}
+
+static int set_rbtree_find_callback_state(struct bpf_verifier_env *env,
+					  struct bpf_func_state *caller,
+					  struct bpf_func_state *callee,
+					  int insn_idx)
+{
+	struct bpf_map *map_ptr = caller->regs[BPF_REG_1].map_ptr;
+
+	/* bpf_rbtree_find(struct bpf_map *map, void *key, void *cb)
+	 * cb(void *key, const struct rb_node *b);
+	 */
+	callee->regs[BPF_REG_1].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_3]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	callee->callback_ret_range = tnum_range(0, U64_MAX);
+	return 0;
+}
+
 static int set_timer_callback_state(struct bpf_verifier_env *env,
 				    struct bpf_func_state *caller,
 				    struct bpf_func_state *callee,
@@ -7310,6 +7407,14 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
 					set_loop_callback_state);
 		break;
+	case BPF_FUNC_rbtree_add:
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_rbtree_add_callback_state);
+		break;
+	case BPF_FUNC_rbtree_find:
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_rbtree_find_callback_state);
+		break;
 	case BPF_FUNC_dynptr_from_mem:
 		if (regs[BPF_REG_1].type != PTR_TO_MAP_VALUE) {
 			verbose(env, "Unsupported reg type %s for bpf_dynptr_from_mem data\n",
@@ -7424,6 +7529,9 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		if (func_id == BPF_FUNC_kptr_xchg) {
 			ret_btf = meta.kptr_off_desc->kptr.btf;
 			ret_btf_id = meta.kptr_off_desc->kptr.btf_id;
+		} else if (function_returns_rbtree_node(func_id)) {
+			ret_btf = meta.map_ptr->btf;
+			ret_btf_id = meta.map_ptr->btf_value_type_id;
 		} else {
 			ret_btf = btf_vmlinux;
 			ret_btf_id = *fn->ret_btf_id;
@@ -13462,8 +13570,10 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 					BPF_SIZE((insn)->code);
 				env->prog->aux->num_exentries++;
 			} else if (resolve_prog_type(env->prog) != BPF_PROG_TYPE_STRUCT_OPS) {
+				/*TODO: Not sure what to do here
 				verbose(env, "Writes through BTF pointers are not allowed\n");
 				return -EINVAL;
+				*/
 			}
 			continue;
 		default:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ffcbf79a556b..4688ce88caf4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -909,6 +909,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_RBTREE,
 };
 
 /* Note that tracing related programs such as
@@ -5328,6 +5329,62 @@ union bpf_attr {
  *		**-EACCES** if the SYN cookie is not valid.
  *
  *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
+ *
+ * void *bpf_rbtree_alloc_node(struct bpf_map *map, u32 sz)
+ *	Description
+ *		Allocate a node of size *sz* bytes for use in rbtree *map*.
+ *
+ *		*sz* must be >= sizeof(struct rb_node)
+ *	Return
+ *		A pointer to the allocated node if successful, otherwise NULL.
+ *
+ *		The verifier considers the type of the returned pointer to be
+ *		the BTF id of *map*'s value type.
+ *
+ * void *bpf_rbtree_add(struct bpf_map *map, void *node, void *cb)
+ *	Description
+ *		Add *node* to rbtree *map* using *cb* comparator callback to
+ *		walk the rbtree.
+ *
+ *		*cb* must take (struct rb_node \*, const struct rb_node \*) as
+ *		input and return a bool signifying whether the first rb_node's
+ *		struct is less than the second.
+ *
+ *	Return
+ *		If success, returns a pointer to the added node in the rbtree.
+ *
+ *		If add fails, returns NULL
+ *
+ * long bpf_rbtree_find(struct bpf_map *map, void *key, void *cb)
+ *	Description
+ *		Find *key* in rbtree *map* using *cb* comparator callback to walk the
+ *		rbtree.
+ *
+ *		*cb* must take (const void \*key, const struct rb_node \*n) as
+ *		input and return an int. If *cb* determines *n* to match *key*, *cb* must
+ *		return 0. If larger, a positive int, and a negative int if smaller.
+ *
+ *		*key* does not need to be a rbtree node struct.
+ *
+ *	Return
+ *		Ptr to rbtree node if found, otherwise NULL.
+ *
+ * void *bpf_rbtree_remove(struct bpf_map *map, void *elem)
+ *	Description
+ *		Remove *elem* from rbtree *map*.
+ *
+ *	Return
+ *		If success, returns a pointer to the removed node.
+ *
+ *		If remove fails, returns NULL
+ *
+ * long bpf_rbtree_free_node(struct bpf_map *map, void *elem)
+ *	Description
+ *		Free rb_node that isn't associated w/ a tree.
+ *
+ *	Return
+ *		0
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5538,6 +5595,11 @@ union bpf_attr {
 	FN(tcp_raw_gen_syncookie_ipv6),	\
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
+	FN(rbtree_alloc_node),		\
+	FN(rbtree_add),		\
+	FN(rbtree_find),		\
+	FN(rbtree_remove),		\
+	FN(rbtree_free_node),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2



* [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (3 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 04/11] bpf: Add rbtree map Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-08-01 22:17   ` Alexei Starovoitov
  2022-07-22 18:34 ` [RFC PATCH bpf-next 06/11] bpf: Add bpf_rbtree_{lock,unlock} helpers Dave Marchevsky
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

This patch adds a struct bpf_spin_lock *lock member to bpf_rbtree, as
well as a bpf_rbtree_get_lock helper which allows bpf programs to access
the lock.

Ideally the bpf_spin_lock would be created independently outside of the
tree and associated with it before the tree is used, either as part of
the map definition or via some call like rbtree_init(&rbtree, &lock).
Doing this in an ergonomic way is proving harder than expected, so for
now use this workaround.

Why is creating the bpf_spin_lock independently and associating it with
the tree preferable? Because we want to be able to transfer nodes
between trees atomically, and for this to work the same lock must be
associated with both trees.

Further locking-related patches will make it possible for the lock to be
used in BPF programs and add code which enforces that the lock is held
when doing any operation on the tree.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/uapi/linux/bpf.h       |  7 +++++++
 kernel/bpf/helpers.c           |  3 +++
 kernel/bpf/rbtree.c            | 24 ++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  7 +++++++
 4 files changed, 41 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4688ce88caf4..c677d92de3bc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5385,6 +5385,12 @@ union bpf_attr {
  *	Return
  *		0
  *
+ * void *bpf_rbtree_get_lock(struct bpf_map *map)
+ *	Description
+ *		Return the bpf_spin_lock associated with the rbtree
+ *
+ *	Return
+ *		Ptr to lock
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5600,6 +5606,7 @@ union bpf_attr {
 	FN(rbtree_find),		\
 	FN(rbtree_remove),		\
 	FN(rbtree_free_node),		\
+	FN(rbtree_get_lock),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 35eb66d11bf6..257a808bb767 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1587,6 +1587,7 @@ const struct bpf_func_proto bpf_rbtree_add_proto __weak;
 const struct bpf_func_proto bpf_rbtree_find_proto __weak;
 const struct bpf_func_proto bpf_rbtree_remove_proto __weak;
 const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
+const struct bpf_func_proto bpf_rbtree_get_lock_proto __weak;
 
 const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
@@ -1686,6 +1687,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_rbtree_remove_proto;
 	case BPF_FUNC_rbtree_free_node:
 		return &bpf_rbtree_free_node_proto;
+	case BPF_FUNC_rbtree_get_lock:
+		return &bpf_rbtree_get_lock_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
index 250d62210804..c6f0a2a083f6 100644
--- a/kernel/bpf/rbtree.c
+++ b/kernel/bpf/rbtree.c
@@ -9,6 +9,7 @@
 struct bpf_rbtree {
 	struct bpf_map map;
 	struct rb_root_cached root;
+	struct bpf_spin_lock *lock;
 };
 
 BTF_ID_LIST_SINGLE(bpf_rbtree_btf_ids, struct, rb_node);
@@ -39,6 +40,14 @@ static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
 
 	tree->root = RB_ROOT_CACHED;
 	bpf_map_init_from_attr(&tree->map, attr);
+
+	tree->lock = bpf_map_kzalloc(&tree->map, sizeof(struct bpf_spin_lock),
+				     GFP_KERNEL | __GFP_NOWARN);
+	if (!tree->lock) {
+		bpf_map_area_free(tree);
+		return ERR_PTR(-ENOMEM);
+	}
+
 	return &tree->map;
 }
 
@@ -139,6 +148,7 @@ static void rbtree_map_free(struct bpf_map *map)
 
 	bpf_rbtree_postorder_for_each_entry_safe(pos, n, &tree->root.rb_root)
 		kfree(pos);
+	kfree(tree->lock);
 	bpf_map_area_free(tree);
 }
 
@@ -238,6 +248,20 @@ static int rbtree_map_get_next_key(struct bpf_map *map, void *key,
 	return -ENOTSUPP;
 }
 
+BPF_CALL_1(bpf_rbtree_get_lock, struct bpf_map *, map)
+{
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+
+	return (u64)tree->lock;
+}
+
+const struct bpf_func_proto bpf_rbtree_get_lock_proto = {
+	.func = bpf_rbtree_get_lock,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_MAP_VALUE,
+	.arg1_type = ARG_CONST_MAP_PTR,
+};
+
 BTF_ID_LIST_SINGLE(bpf_rbtree_map_btf_ids, struct, bpf_rbtree)
 const struct bpf_map_ops rbtree_map_ops = {
 	.map_meta_equal = bpf_map_meta_equal,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4688ce88caf4..c677d92de3bc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5385,6 +5385,12 @@ union bpf_attr {
  *	Return
  *		0
  *
+ * void *bpf_rbtree_get_lock(struct bpf_map *map)
+ *	Description
+ *		Return the bpf_spin_lock associated with the rbtree
+ *
+ *	Return
+ *		Ptr to lock
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5600,6 +5606,7 @@ union bpf_attr {
 	FN(rbtree_find),		\
 	FN(rbtree_remove),		\
 	FN(rbtree_free_node),		\
+	FN(rbtree_get_lock),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2



* [RFC PATCH bpf-next 06/11] bpf: Add bpf_rbtree_{lock,unlock} helpers
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (4 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-08-01 21:58   ` Alexei Starovoitov
  2022-07-22 18:34 ` [RFC PATCH bpf-next 07/11] bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find} Dave Marchevsky
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

These helpers are equivalent to bpf_spin_{lock,unlock}, but the verifier
doesn't try to enforce that no helper calls occur when there's an active
spin lock.

[ TODO: Currently the verifier doesn't do _anything_ spinlock related
when it sees one of these, including setting active_spin_lock. This is
probably too lenient. Also, EXPORT_SYMBOL for internal lock helpers
might not be the best code structure. ]

Future patches will add enforcement of the "rbtree helpers must always
be called with the lock held" constraint.
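
A sketch of the intended usage (mirroring the cover letter's example;
'node', 'ret' and the 'less' callback are illustrative):

  bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));

  ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
  if (!ret)
          bpf_rbtree_free_node(&rbtree, node);

  bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));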

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/uapi/linux/bpf.h       | 20 ++++++++++++++++++++
 kernel/bpf/helpers.c           | 12 ++++++++++--
 kernel/bpf/rbtree.c            | 29 +++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |  2 ++
 tools/include/uapi/linux/bpf.h | 20 ++++++++++++++++++++
 5 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c677d92de3bc..d21e2c99ea14 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5391,6 +5391,24 @@ union bpf_attr {
  *
  *	Return
  *		Ptr to lock
+ *
+ * void *bpf_rbtree_lock(struct bpf_spin_lock *lock)
+ *	Description
+ *		Like bpf_spin_lock helper, but use separate helper for now
+ *		as we don't want this helper to have special meaning to the verifier
+ *		so that we can do rbtree helper calls between rbtree_lock/unlock
+ *
+ *	Return
+ *		0
+ *
+ * void *bpf_rbtree_unlock(struct bpf_spin_lock *lock)
+ *	Description
+ *		Like bpf_spin_unlock helper, but use separate helper for now
+ *		as we don't want this helper to have special meaning to the verifier
+ *		so that we can do rbtree helper calls between rbtree_lock/unlock
+ *
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5607,6 +5625,8 @@ union bpf_attr {
 	FN(rbtree_remove),		\
 	FN(rbtree_free_node),		\
 	FN(rbtree_get_lock),		\
+	FN(rbtree_lock),		\
+	FN(rbtree_unlock),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 257a808bb767..fa2dba1dcec8 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -303,7 +303,7 @@ static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
 
 static DEFINE_PER_CPU(unsigned long, irqsave_flags);
 
-static inline void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock)
+inline void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock)
 {
 	unsigned long flags;
 
@@ -311,6 +311,7 @@ static inline void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock)
 	__bpf_spin_lock(lock);
 	__this_cpu_write(irqsave_flags, flags);
 }
+EXPORT_SYMBOL(__bpf_spin_lock_irqsave);
 
 notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
 {
@@ -325,7 +326,7 @@ const struct bpf_func_proto bpf_spin_lock_proto = {
 	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
 };
 
-static inline void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock)
+inline void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock)
 {
 	unsigned long flags;
 
@@ -333,6 +334,7 @@ static inline void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock)
 	__bpf_spin_unlock(lock);
 	local_irq_restore(flags);
 }
+EXPORT_SYMBOL(__bpf_spin_unlock_irqrestore);
 
 notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
 {
@@ -1588,6 +1590,8 @@ const struct bpf_func_proto bpf_rbtree_find_proto __weak;
 const struct bpf_func_proto bpf_rbtree_remove_proto __weak;
 const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
 const struct bpf_func_proto bpf_rbtree_get_lock_proto __weak;
+const struct bpf_func_proto bpf_rbtree_lock_proto __weak;
+const struct bpf_func_proto bpf_rbtree_unlock_proto __weak;
 
 const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
@@ -1689,6 +1693,10 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_rbtree_free_node_proto;
 	case BPF_FUNC_rbtree_get_lock:
 		return &bpf_rbtree_get_lock_proto;
+	case BPF_FUNC_rbtree_lock:
+		return &bpf_rbtree_lock_proto;
+	case BPF_FUNC_rbtree_unlock:
+		return &bpf_rbtree_unlock_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
index c6f0a2a083f6..bf2e30af82ec 100644
--- a/kernel/bpf/rbtree.c
+++ b/kernel/bpf/rbtree.c
@@ -262,6 +262,35 @@ const struct bpf_func_proto bpf_rbtree_get_lock_proto = {
 	.arg1_type = ARG_CONST_MAP_PTR,
 };
 
+extern void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock);
+extern void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock);
+
+BPF_CALL_1(bpf_rbtree_lock, void *, lock)
+{
+	__bpf_spin_lock_irqsave((struct bpf_spin_lock *)lock);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_rbtree_lock_proto = {
+	.func = bpf_rbtree_lock,
+	.gpl_only = true,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_SPIN_LOCK,
+};
+
+BPF_CALL_1(bpf_rbtree_unlock, void *, lock)
+{
+	__bpf_spin_unlock_irqrestore((struct bpf_spin_lock *)lock);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_rbtree_unlock_proto = {
+	.func = bpf_rbtree_unlock,
+	.gpl_only = true,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_SPIN_LOCK,
+};
+
 BTF_ID_LIST_SINGLE(bpf_rbtree_map_btf_ids, struct, bpf_rbtree)
 const struct bpf_map_ops rbtree_map_ops = {
 	.map_meta_equal = bpf_map_meta_equal,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 535f673882cd..174a355d97fd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6047,6 +6047,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 		} else if (meta->func_id == BPF_FUNC_spin_unlock) {
 			if (process_spin_lock(env, regno, false))
 				return -EACCES;
+		} else if (meta->func_id == BPF_FUNC_rbtree_lock ||
+			   meta->func_id == BPF_FUNC_rbtree_unlock) { // Do nothing for now
 		} else {
 			verbose(env, "verifier internal error\n");
 			return -EFAULT;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c677d92de3bc..d21e2c99ea14 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5391,6 +5391,24 @@ union bpf_attr {
  *
  *	Return
  *		Ptr to lock
+ *
+ * void *bpf_rbtree_lock(struct bpf_spin_lock *lock)
+ *	Description
+ *		Like bpf_spin_lock helper, but use separate helper for now
+ *		as we don't want this helper to have special meaning to the verifier
+ *		so that we can do rbtree helper calls between rbtree_lock/unlock
+ *
+ *	Return
+ *		0
+ *
+ * void *bpf_rbtree_unlock(struct bpf_spin_lock *lock)
+ *	Description
+ *		Like bpf_spin_unlock helper, but use separate helper for now
+ *		as we don't want this helper to have special meaning to the verifier
+ *		so that we can do rbtree helper calls between rbtree_lock/unlock
+ *
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5607,6 +5625,8 @@ union bpf_attr {
 	FN(rbtree_remove),		\
 	FN(rbtree_free_node),		\
 	FN(rbtree_get_lock),		\
+	FN(rbtree_lock),		\
+	FN(rbtree_unlock),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH bpf-next 07/11] bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (5 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 06/11] bpf: Add bpf_rbtree_{lock,unlock} helpers Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-07-22 18:34 ` [RFC PATCH bpf-next 08/11] bpf: Add OBJ_NON_OWNING_REF type flag Dave Marchevsky
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

The bpf program calling these helpers must hold the spinlock associated
with the rbtree map when doing so. Otherwise, a concurrent add/remove
operation could corrupt the tree while {add,remove,find} are walking it
via the comparison callback or rebalancing it after an update.
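
For reference, a minimal sketch of the expected BPF-side usage
('node_data', 'search' and 'cmp' are placeholders matching the selftests
later in this series; with this patch each helper below returns NULL if
the lock is not held):

  struct node_data *found, *removed;

  bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));

  found = (struct node_data *)bpf_rbtree_find(&rbtree, &search, cmp);
  if (found) {
    removed = (struct node_data *)bpf_rbtree_remove(&rbtree, found);
    if (removed)
      bpf_rbtree_free_node(&rbtree, removed);
  }

  bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));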

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 kernel/bpf/rbtree.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
index bf2e30af82ec..5b1ab73e164f 100644
--- a/kernel/bpf/rbtree.c
+++ b/kernel/bpf/rbtree.c
@@ -14,6 +14,11 @@ struct bpf_rbtree {
 
 BTF_ID_LIST_SINGLE(bpf_rbtree_btf_ids, struct, rb_node);
 
+static bool __rbtree_lock_held(struct bpf_rbtree *tree)
+{
+	return spin_is_locked((spinlock_t *)tree->lock);
+}
+
 static int rbtree_map_alloc_check(union bpf_attr *attr)
 {
 	if (attr->max_entries || !attr->btf_value_type_id)
@@ -93,6 +98,9 @@ BPF_CALL_3(bpf_rbtree_add, struct bpf_map *, map, void *, value, void *, cb)
 	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
 	struct rb_node *node = (struct rb_node *)value;
 
+	if (!__rbtree_lock_held(tree))
+		return (u64)NULL;
+
 	if (WARN_ON_ONCE(!RB_EMPTY_NODE(node)))
 		return (u64)NULL;
 
@@ -114,6 +122,9 @@ BPF_CALL_3(bpf_rbtree_find, struct bpf_map *, map, void *, key, void *, cb)
 {
 	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
 
+	if (!__rbtree_lock_held(tree))
+		return (u64)NULL;
+
 	return (u64)rb_find(key, &tree->root.rb_root,
 			    (int (*)(const void *key,
 				     const struct rb_node *))cb);
@@ -206,6 +217,9 @@ BPF_CALL_2(bpf_rbtree_remove, struct bpf_map *, map, void *, value)
 	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
 	struct rb_node *node = (struct rb_node *)value;
 
+	if (!__rbtree_lock_held(tree))
+		return (u64)NULL;
+
 	if (WARN_ON_ONCE(RB_EMPTY_NODE(node)))
 		return (u64)NULL;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH bpf-next 08/11] bpf: Add OBJ_NON_OWNING_REF type flag
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (6 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 07/11] bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find} Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-08-01 22:41   ` Alexei Starovoitov
  2022-07-22 18:34 ` [RFC PATCH bpf-next 09/11] bpf: Add CONDITIONAL_RELEASE " Dave Marchevsky
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

Consider a pointer to a type that would normally need acquire / release
semantics to be safely held. There may be scenarios where such a pointer
can be safely held without the need to acquire a reference.

For example, although a PTR_TO_BTF_ID for a rbtree_map node is released
via bpf_rbtree_add helper, the helper doesn't change the address of the
node and must be called with the rbtree_map's spinlock held. Since the
only way to remove a node from the rbtree - bpf_rbtree_remove helper -
requires the same lock, the newly-added node cannot be removed by a
concurrently-running program until the lock is released. Therefore it is
safe to hold a reference to this node until bpf_rbtree_unlock is called.

This patch introduces a new type flag and associated verifier logic to
handle such "non-owning" references.

Currently the only usecase I have is the rbtree example above, so the
verifier logic is straightforward:
  * Tag return types of bpf_rbtree_{add,find} with OBJ_NON_OWNING_REF
    * Both helpers require the rbtree lock to be held to return anything
      non-NULL
    * Since ret type for both is PTR_TO_BTF_ID_OR_NULL, if the lock is not
      held and NULL is returned, existing mark_ptr_or_null_reg logic will
      clear the reg type.
    * So if mark_ptr_or_null_reg logic turns the returned reg into a
      PTR_TO_BTF_ID | OBJ_NON_OWNING_REF, the verifier knows the lock is
      held.

  * When the lock is released the verifier invalidates any regs holding
    non-owning refs, similarly to existing release_reference logic - but
    there is no need to clear ref_obj_id as an 'owning' reference was
    never acquired.
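
To illustrate, a rough sketch of the intended lifetime of such a
non-owning ref (assuming the same 'rbtree' map, 'node_data' value type and
'less' callback used in the selftests later in this series; failure
handling elided):

  struct node_data *ret;
  __u32 x;

  bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));

  ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
  if (ret)
    x = ret->two;   /* ok: lock still held, ret is a non-owning ref */

  bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));

  if (ret)
    x = ret->two;   /* rejected: unlock invalidated the non-owning ref */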

[ TODO: Currently the invalidation logic in
clear_rbtree_node_non_owning_refs is not parametrized by map so
unlocking any rbtree lock will invalidate all non-owning refs ]

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/rbtree.c   |  4 +--
 kernel/bpf/verifier.c | 63 +++++++++++++++++++++++++++++++++++++++----
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index eb8c550db0e2..c9c4b4fb019c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -412,6 +412,7 @@ enum bpf_type_flag {
 	/* Size is known at compile time. */
 	MEM_FIXED_SIZE		= BIT(10 + BPF_BASE_TYPE_BITS),
 
+	OBJ_NON_OWNING_REF	= BIT(11 + BPF_BASE_TYPE_BITS),
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
index 5b1ab73e164f..34864fc83209 100644
--- a/kernel/bpf/rbtree.c
+++ b/kernel/bpf/rbtree.c
@@ -111,7 +111,7 @@ BPF_CALL_3(bpf_rbtree_add, struct bpf_map *, map, void *, value, void *, cb)
 const struct bpf_func_proto bpf_rbtree_add_proto = {
 	.func = bpf_rbtree_add,
 	.gpl_only = true,
-	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | OBJ_NON_OWNING_REF,
 	.arg1_type = ARG_CONST_MAP_PTR,
 	.arg2_type = ARG_PTR_TO_BTF_ID | OBJ_RELEASE,
 	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
@@ -133,7 +133,7 @@ BPF_CALL_3(bpf_rbtree_find, struct bpf_map *, map, void *, key, void *, cb)
 const struct bpf_func_proto bpf_rbtree_find_proto = {
 	.func = bpf_rbtree_find,
 	.gpl_only = true,
-	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | OBJ_NON_OWNING_REF,
 	.ret_btf_id = &bpf_rbtree_btf_ids[0],
 	.arg1_type = ARG_CONST_MAP_PTR,
 	.arg2_type = ARG_ANYTHING,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 174a355d97fd..4f46b2dfbc4b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -467,6 +467,11 @@ static bool type_is_rdonly_mem(u32 type)
 	return type & MEM_RDONLY;
 }
 
+static bool type_is_non_owning_ref(u32 type)
+{
+	return type & OBJ_NON_OWNING_REF;
+}
+
 static bool type_may_be_null(u32 type)
 {
 	return type & PTR_MAYBE_NULL;
@@ -555,7 +560,9 @@ static bool function_returns_rbtree_node(enum bpf_func_id func_id)
 static const char *reg_type_str(struct bpf_verifier_env *env,
 				enum bpf_reg_type type)
 {
-	char postfix[16] = {0}, prefix[32] = {0};
+	char postfix[32] = {0}, prefix[32] = {0};
+	unsigned int postfix_idx = 0;
+
 	static const char * const str[] = {
 		[NOT_INIT]		= "?",
 		[SCALAR_VALUE]		= "scalar",
@@ -579,11 +586,18 @@ static const char *reg_type_str(struct bpf_verifier_env *env,
 		[PTR_TO_MAP_KEY]	= "map_key",
 	};
 
-	if (type & PTR_MAYBE_NULL) {
+	if (type_may_be_null(type)) {
 		if (base_type(type) == PTR_TO_BTF_ID)
-			strncpy(postfix, "or_null_", 16);
+			postfix_idx += strlcpy(postfix + postfix_idx, "or_null_", 32 - postfix_idx);
 		else
-			strncpy(postfix, "_or_null", 16);
+			postfix_idx += strlcpy(postfix + postfix_idx, "_or_null", 32 - postfix_idx);
+	}
+
+	if (type_is_non_owning_ref(type)) {
+		if (base_type(type) == PTR_TO_BTF_ID)
+			postfix_idx += strlcpy(postfix + postfix_idx, "non_own_", 32 - postfix_idx);
+		else
+			postfix_idx += strlcpy(postfix + postfix_idx, "_non_own", 32 - postfix_idx);
 	}
 
 	if (type & MEM_RDONLY)
@@ -5684,12 +5698,18 @@ static const struct bpf_reg_types int_ptr_types = {
 	},
 };
 
+static const struct bpf_reg_types btf_ptr_types = {
+	.types = {
+		PTR_TO_BTF_ID,
+		PTR_TO_BTF_ID | OBJ_NON_OWNING_REF,
+	},
+};
+
 static const struct bpf_reg_types fullsock_types = { .types = { PTR_TO_SOCKET } };
 static const struct bpf_reg_types scalar_types = { .types = { SCALAR_VALUE } };
 static const struct bpf_reg_types context_types = { .types = { PTR_TO_CTX } };
 static const struct bpf_reg_types alloc_mem_types = { .types = { PTR_TO_MEM | MEM_ALLOC } };
 static const struct bpf_reg_types const_map_ptr_types = { .types = { CONST_PTR_TO_MAP } };
-static const struct bpf_reg_types btf_ptr_types = { .types = { PTR_TO_BTF_ID } };
 static const struct bpf_reg_types spin_lock_types = { .types = { PTR_TO_MAP_VALUE } };
 static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_BTF_ID | MEM_PERCPU } };
 static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
@@ -6635,6 +6655,33 @@ static int release_reference(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static void clear_non_owning_ref_regs(struct bpf_verifier_env *env,
+				      struct bpf_func_state *state)
+{
+	struct bpf_reg_state *regs = state->regs, *reg;
+	int i;
+
+	for (i = 0; i < MAX_BPF_REG; i++)
+		if (type_is_non_owning_ref(regs[i].type))
+			mark_reg_unknown(env, regs, i);
+
+	bpf_for_each_spilled_reg(i, state, reg) {
+		if (!reg)
+			continue;
+		if (type_is_non_owning_ref(reg->type))
+			__mark_reg_unknown(env, reg);
+	}
+}
+
+static void clear_rbtree_node_non_owning_refs(struct bpf_verifier_env *env)
+{
+	struct bpf_verifier_state *vstate = env->cur_state;
+	int i;
+
+	for (i = 0; i <= vstate->curframe; i++)
+		clear_non_owning_ref_regs(env, vstate->frame[i]);
+}
+
 static void clear_caller_saved_regs(struct bpf_verifier_env *env,
 				    struct bpf_reg_state *regs)
 {
@@ -7436,6 +7483,12 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			}
 		}
 		break;
+	case BPF_FUNC_rbtree_unlock:
+		/* TODO clear_rbtree_node_non_owning_refs calls should be
+		 * parametrized by base_type or ideally by owning map
+		 */
+		clear_rbtree_node_non_owning_refs(env);
+		break;
 	}
 
 	if (err)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH bpf-next 09/11] bpf: Add CONDITIONAL_RELEASE type flag
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (7 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 08/11] bpf: Add OBJ_NON_OWNING_REF type flag Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-08-01 22:23   ` Alexei Starovoitov
  2022-07-22 18:34 ` [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags Dave Marchevsky
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

Currently if a helper proto has an arg with OBJ_RELEASE flag, the verifier
assumes the 'release' logic in the helper will always succeed, and
therefore always clears the arg reg and any other regs w/ the same
ref_obj_id, and releases the reference. This poses a problem for
'release' helpers which may not always succeed.

For example, bpf_rbtree_add will fail to add the passed-in node to a
bpf_rbtree if the rbtree's lock is not held when the helper is called.
In this case the helper returns NULL and the calling bpf program must
release the node in another way before terminating to avoid leaking
memory. An example of such logic:

1  struct node_data *node = bpf_rbtree_alloc_node(&rbtree, ...);
2  struct node_data *ret = bpf_rbtree_add(&rbtree, node);
3  if (!ret) {
4    bpf_rbtree_free_node(node);
5    return 0;
6  }
7  bpf_trace_printk("%d\n", ret->id);

However, current verifier logic does not allow this: after line 2,
the reference associated with the reg holding 'node' (BPF_REG_2) will be
released via the release_reference function, which will mark BPF_REG_2 and
any other reg
with same ref_obj_id as unknown. As a result neither ret nor node will
point to anything useful if line 3's check passes. Additionally, since the
reference is unconditionally released, the program would pass the
verifier without the null check.

This patch adds 'conditional release' semantics so that the verifier can
handle the above example correctly. The CONDITIONAL_RELEASE type flag
works in concert with the existing OBJ_RELEASE flag - the latter is used
to tag an argument, while the new type flag is used to tag return type.

If a helper has an OBJ_RELEASE arg type and CONDITIONAL_RELEASE return
type, the helper is considered to use its return value to indicate
success or failure of the release operation. NULL is returned if release
fails, non-null otherwise.

For my concrete usecase - bpf_rbtree_add - CONDITIONAL_RELEASE works in
concert with OBJ_NON_OWNING_REF: successful release results in a non-owning
reference being returned, allowing line 7 in above example.

Instead of unconditionally releasing the OBJ_RELEASE reference when
doing check_helper_call, for CONDITIONAL_RELEASE helpers the verifier
will wait until the return value is checked for null.
  If not null: the reference is released.

  If null: no reference is released. Since other regs w/ same ref_obj_id
           were not marked unknown by check_helper_call, they can be
           used to release the reference via other means (line 4 above).

It's necessary to prevent conditionally-released ref_obj_id regs from
being used between the release helper and null check. For example:

1  struct node_data *node = bpf_rbtree_alloc_node(&rbtree, ...);
2  struct node_data *ret = bpf_rbtree_add(&rbtree, node);
3  do_something_with_a_node(node);
4  if (!ret) {
5    bpf_rbtree_free_node(node);
6    return 0;
7  }

Line 3 shouldn't be allowed since node may have been released. For this
reason the verifier tags all regs sharing the conditionally-released arg's
ref_obj_id with CONDITIONAL_RELEASE for the period between the helper call
and the null check.

Why no matching CONDITIONAL_ACQUIRE type flag? Existing verifier logic
already treats acquire of an _OR_NULL type as a conditional acquire.
Consider this code:

1  struct thing *i = acquire_helper_that_returns_thing_or_null();
2  if (!i)
3    return 0;
4  manipulate_thing(i);
5  release_thing(i);

After line 1, BPF_REG_0 will have an _OR_NULL type and a ref_obj_id set.
When the verifier sees line 2's conditional jump, existing logic in
mark_ptr_or_null_regs, specifically the if:

  if (ref_obj_id && ref_obj_id == id && is_null)
          /* regs[regno] is in the " == NULL" branch.
           * No one could have freed the reference state before
           * doing the NULL check.
           */
           WARN_ON_ONCE(release_reference_state(state, id));

will release the reference in the is_null state.

[ TODO: Either need to remove WARN_ON_ONCE there without adding
CONDITIONAL_ACQUIRE flag or add the flag and don't WARN_ON_ONCE if it's
set. Left out of first pass for simplicity's sake. ]

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h          |   3 +
 include/linux/bpf_verifier.h |   1 +
 kernel/bpf/rbtree.c          |   2 +-
 kernel/bpf/verifier.c        | 122 +++++++++++++++++++++++++++++++----
 4 files changed, 113 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c9c4b4fb019c..a601ab30a2b1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -413,6 +413,9 @@ enum bpf_type_flag {
 	MEM_FIXED_SIZE		= BIT(10 + BPF_BASE_TYPE_BITS),
 
 	OBJ_NON_OWNING_REF	= BIT(11 + BPF_BASE_TYPE_BITS),
+
+	CONDITIONAL_RELEASE	= BIT(12 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 9c017575c034..bdc8c48c2343 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -313,6 +313,7 @@ struct bpf_verifier_state {
 	u32 insn_idx;
 	u32 curframe;
 	u32 active_spin_lock;
+	u32 active_cond_ref_obj_id;
 	bool speculative;
 
 	/* first and last insn idx of this verifier state */
diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
index 34864fc83209..dcf8f69d4ada 100644
--- a/kernel/bpf/rbtree.c
+++ b/kernel/bpf/rbtree.c
@@ -111,7 +111,7 @@ BPF_CALL_3(bpf_rbtree_add, struct bpf_map *, map, void *, value, void *, cb)
 const struct bpf_func_proto bpf_rbtree_add_proto = {
 	.func = bpf_rbtree_add,
 	.gpl_only = true,
-	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | OBJ_NON_OWNING_REF,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | OBJ_NON_OWNING_REF | CONDITIONAL_RELEASE,
 	.arg1_type = ARG_CONST_MAP_PTR,
 	.arg2_type = ARG_PTR_TO_BTF_ID | OBJ_RELEASE,
 	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4f46b2dfbc4b..f80e161170de 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -472,6 +472,11 @@ static bool type_is_non_owning_ref(u32 type)
 	return type & OBJ_NON_OWNING_REF;
 }
 
+static bool type_is_cond_release(u32 type)
+{
+	return type & CONDITIONAL_RELEASE;
+}
+
 static bool type_may_be_null(u32 type)
 {
 	return type & PTR_MAYBE_NULL;
@@ -600,6 +605,15 @@ static const char *reg_type_str(struct bpf_verifier_env *env,
 			postfix_idx += strlcpy(postfix + postfix_idx, "_non_own", 32 - postfix_idx);
 	}
 
+	if (type_is_cond_release(type)) {
+		if (base_type(type) == PTR_TO_BTF_ID)
+			postfix_idx += strlcpy(postfix + postfix_idx, "cond_rel_",
+					       32 - postfix_idx);
+		else
+			postfix_idx += strlcpy(postfix + postfix_idx, "_cond_rel",
+					       32 - postfix_idx);
+	}
+
 	if (type & MEM_RDONLY)
 		strncpy(prefix, "rdonly_", 32);
 	if (type & MEM_ALLOC)
@@ -1211,6 +1225,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
 	dst_state->speculative = src->speculative;
 	dst_state->curframe = src->curframe;
 	dst_state->active_spin_lock = src->active_spin_lock;
+	dst_state->active_cond_ref_obj_id = src->active_cond_ref_obj_id;
 	dst_state->branches = src->branches;
 	dst_state->parent = src->parent;
 	dst_state->first_insn_idx = src->first_insn_idx;
@@ -1418,6 +1433,7 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
 		return;
 	}
 
+	reg->type &= ~CONDITIONAL_RELEASE;
 	reg->type &= ~PTR_MAYBE_NULL;
 }
 
@@ -6635,24 +6651,78 @@ static void release_reg_references(struct bpf_verifier_env *env,
 	}
 }
 
+static int __release_reference(struct bpf_verifier_env *env, struct bpf_verifier_state *vstate,
+			       int ref_obj_id)
+{
+	int err;
+	int i;
+
+	err = release_reference_state(vstate->frame[vstate->curframe], ref_obj_id);
+	if (err)
+		return err;
+
+	for (i = 0; i <= vstate->curframe; i++)
+		release_reg_references(env, vstate->frame[i], ref_obj_id);
+	return 0;
+}
+
 /* The pointer with the specified id has released its reference to kernel
  * resources. Identify all copies of the same pointer and clear the reference.
  */
 static int release_reference(struct bpf_verifier_env *env,
 			     int ref_obj_id)
 {
-	struct bpf_verifier_state *vstate = env->cur_state;
-	int err;
+	return __release_reference(env, env->cur_state, ref_obj_id);
+}
+
+static void tag_reference_cond_release_regs(struct bpf_verifier_env *env,
+					    struct bpf_func_state *state,
+					    int ref_obj_id,
+					    bool remove)
+{
+	struct bpf_reg_state *regs = state->regs, *reg;
 	int i;
 
-	err = release_reference_state(cur_func(env), ref_obj_id);
-	if (err)
-		return err;
+	for (i = 0; i < MAX_BPF_REG; i++)
+		if (regs[i].ref_obj_id == ref_obj_id) {
+			if (remove)
+				regs[i].type &= ~CONDITIONAL_RELEASE;
+			else
+				regs[i].type |= CONDITIONAL_RELEASE;
+		}
+
+	bpf_for_each_spilled_reg(i, state, reg) {
+		if (!reg)
+			continue;
+		if (reg->ref_obj_id == ref_obj_id) {
+			if (remove)
+				reg->type &= ~CONDITIONAL_RELEASE;
+			else
+				reg->type |= CONDITIONAL_RELEASE;
+		}
+	}
+}
+
+static void tag_reference_cond_release(struct bpf_verifier_env *env,
+				       int ref_obj_id)
+{
+	struct bpf_verifier_state *vstate = env->cur_state;
+	int i;
 
 	for (i = 0; i <= vstate->curframe; i++)
-		release_reg_references(env, vstate->frame[i], ref_obj_id);
+		tag_reference_cond_release_regs(env, vstate->frame[i],
+						ref_obj_id, false);
+}
 
-	return 0;
+static void untag_reference_cond_release(struct bpf_verifier_env *env,
+					 struct bpf_verifier_state *vstate,
+					 int ref_obj_id)
+{
+	int i;
+
+	for (i = 0; i <= vstate->curframe; i++)
+		tag_reference_cond_release_regs(env, vstate->frame[i],
+						ref_obj_id, true);
 }
 
 static void clear_non_owning_ref_regs(struct bpf_verifier_env *env,
@@ -7406,7 +7476,17 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		if (arg_type_is_dynptr(fn->arg_type[meta.release_regno - BPF_REG_1]))
 			err = unmark_stack_slots_dynptr(env, &regs[meta.release_regno]);
 		else if (meta.ref_obj_id)
-			err = release_reference(env, meta.ref_obj_id);
+			if (type_is_cond_release(fn->ret_type)) {
+				if (env->cur_state->active_cond_ref_obj_id) {
+					verbose(env, "can't handle >1 cond_release\n");
+					return err;
+				}
+				env->cur_state->active_cond_ref_obj_id = meta.ref_obj_id;
+				tag_reference_cond_release(env, meta.ref_obj_id);
+				err = 0;
+			} else {
+				err = release_reference(env, meta.ref_obj_id);
+			}
 		/* meta.ref_obj_id can only be 0 if register that is meant to be
 		 * released is NULL, which must be > R0.
 		 */
@@ -10040,8 +10120,8 @@ static void __mark_ptr_or_null_regs(struct bpf_func_state *state, u32 id,
 /* The logic is similar to find_good_pkt_pointers(), both could eventually
  * be folded together at some point.
  */
-static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
-				  bool is_null)
+static int mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
+				 bool is_null, struct bpf_verifier_env *env)
 {
 	struct bpf_func_state *state = vstate->frame[vstate->curframe];
 	struct bpf_reg_state *regs = state->regs;
@@ -10056,8 +10136,19 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
 		 */
 		WARN_ON_ONCE(release_reference_state(state, id));
 
+	if (type_is_cond_release(regs[regno].type)) {
+		if (!is_null) {
+			__release_reference(env, vstate, vstate->active_cond_ref_obj_id);
+			vstate->active_cond_ref_obj_id = 0;
+		} else {
+			untag_reference_cond_release(env, vstate, vstate->active_cond_ref_obj_id);
+			vstate->active_cond_ref_obj_id = 0;
+		}
+	}
 	for (i = 0; i <= vstate->curframe; i++)
 		__mark_ptr_or_null_regs(vstate->frame[i], id, is_null);
+
+	return 0;
 }
 
 static bool try_match_pkt_pointers(const struct bpf_insn *insn,
@@ -10365,10 +10456,10 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		/* Mark all identical registers in each branch as either
 		 * safe or unknown depending R == 0 or R != 0 conditional.
 		 */
-		mark_ptr_or_null_regs(this_branch, insn->dst_reg,
-				      opcode == BPF_JNE);
-		mark_ptr_or_null_regs(other_branch, insn->dst_reg,
-				      opcode == BPF_JEQ);
+		err = mark_ptr_or_null_regs(this_branch, insn->dst_reg,
+					    opcode == BPF_JNE, env);
+		err = mark_ptr_or_null_regs(other_branch, insn->dst_reg,
+					    opcode == BPF_JEQ, env);
 	} else if (!try_match_pkt_pointers(insn, dst_reg, &regs[insn->src_reg],
 					   this_branch, other_branch) &&
 		   is_pointer_value(env, insn->dst_reg)) {
@@ -11809,6 +11900,9 @@ static bool states_equal(struct bpf_verifier_env *env,
 	if (old->active_spin_lock != cur->active_spin_lock)
 		return false;
 
+	if (old->active_cond_ref_obj_id != cur->active_cond_ref_obj_id)
+		return false;
+
 	/* for states to be equal callsites have to be the same
 	 * and all frame states need to be equivalent
 	 */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (8 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 09/11] bpf: Add CONDITIONAL_RELEASE " Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-07-29 16:31   ` Tejun Heo
  2022-08-01 22:44   ` Alexei Starovoitov
  2022-07-22 18:34 ` [RFC PATCH bpf-next 11/11] selftests/bpf: Add rbtree map tests Dave Marchevsky
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

Add two type flags, PTR_ITER and PTR_ITER_END, meant to support the
following pattern for iterating data structures:

  struct node *elem = data_structure_iter_first(&some_map)
  while (elem) {
    do_something();
    elem = data_structure_iter_next(&some_map, elem);
  }

Currently the ret_type of both _first() and _next() helpers would be
PTR_TO_BTF_ID_OR_NULL and the verifier would consider the loop to be
infinite as it knows nothing about an arbitrary PTR_TO_BTF_ID value.

However, if we can express "this PTR_TO_BTF_ID will eventually be
NULL if used in iteration helpers" via a new type flag, the verifier can
use this information to determine that the loop will terminate while
still verifying the loop body.

So for our contrived example above, _first() now returns a
PTR_TO_BTF_ID_OR_NULL | PTR_ITER, which _next() expects as input,
itself returning PTR_TO_BTF_ID_OR_NULL | PTR_ITER_END.

The verifier does nothing special for PTR_ITER, so the first iteration
of the example loop will result in both elem == NULL and elem != NULL
branches being verified. When verifying the latter branch, elem will
become a PTR_TO_BTF_ID_OR_NULL | PTR_ITER_END.

When the verifier sees a reg holding PTR_ITER_END in a conditional jump
it knows the reg type and val can be replaced with SCALAR 0. So in the
example loop on 2nd iteration elem == NULL will be assumed and
verification will continue with that branch only.
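
Concretely, with the rbtree helpers added earlier in this series the same
pattern looks roughly like the below (a sketch only; the rbtree lock must
be held around these calls per the earlier patches and locking is elided
here; do_something() is a placeholder as in the example above):

  struct node_data *iter;

  /* _first() returns PTR_ITER: both branches of the loop condition
   * are explored on the first iteration
   */
  iter = (struct node_data *)bpf_rbtree_first(&rbtree);
  while (iter) {
    do_something();
    /* _next() returns PTR_ITER_END: on the next loop-condition check
     * the verifier assumes iter == NULL and stops exploring the body
     */
    iter = (struct node_data *)bpf_rbtree_next(&rbtree, iter);
  }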

[ TODO:
  * PTR_ITER needs to be associated with the helper that produced it, to
  prevent something like:
    struct node *elem = data_structure_iter_last(&some_map)
    while (elem) {
      do_something();
      elem = data_structure_iter_next(&some_map, elem);
    }

  i.e. _first() and _next() should be used together, same with _last()
  and _prev().

  * This is currently very unsafe. Per multiple conversations w/ Alexei
  and Andrii, there are probably some ways forward here, but they may be
  more complex than it's worth. If so, we can migrate to a callback-based
  approach with similar functionality.
]

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h            |  3 ++
 include/uapi/linux/bpf.h       | 34 ++++++++++++++-
 kernel/bpf/helpers.c           | 12 ++++++
 kernel/bpf/rbtree.c            | 78 ++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          | 53 +++++++++++++++++++++--
 tools/include/uapi/linux/bpf.h | 34 ++++++++++++++-
 6 files changed, 209 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a601ab30a2b1..819778e05060 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -416,6 +416,9 @@ enum bpf_type_flag {
 
 	CONDITIONAL_RELEASE	= BIT(12 + BPF_BASE_TYPE_BITS),
 
+	PTR_ITER	= BIT(13 + BPF_BASE_TYPE_BITS),
+	PTR_ITER_END	= BIT(14 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d21e2c99ea14..3869653c6aeb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5409,6 +5409,34 @@ union bpf_attr {
  *
  *	Return
  *		0
+ *
+ * long bpf_rbtree_first(struct bpf_map *map)
+ *	Description
+ *		Return the first node in the tree according to sort order
+ *
+ *	Return
+ *		If found, ptr to node, otherwise NULL
+ *
+ * long bpf_rbtree_last(struct bpf_map *map)
+ *	Description
+ *		Return the last node in the tree according to sort order
+ *
+ *	Return
+ *		If found, ptr to node, otherwise NULL
+ *
+ * long bpf_rbtree_next(struct bpf_map *map, void *cur)
+ *	Description
+ *		Return the next node in the tree according to sort order
+ *
+ *	Return
+ *		If found, ptr to node, otherwise NULL
+ *
+ * long bpf_rbtree_prev(struct bpf_map *map, void *cur)
+ *	Description
+ *		Return the previous node in the tree according to sort order
+ *
+ *	Return
+ *		If found, ptr to node, otherwise NULL
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5620,13 +5648,17 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(rbtree_alloc_node),		\
-	FN(rbtree_add),		\
+	FN(rbtree_add),			\
 	FN(rbtree_find),		\
 	FN(rbtree_remove),		\
 	FN(rbtree_free_node),		\
 	FN(rbtree_get_lock),		\
 	FN(rbtree_lock),		\
 	FN(rbtree_unlock),		\
+	FN(rbtree_first),		\
+	FN(rbtree_last),		\
+	FN(rbtree_next),		\
+	FN(rbtree_prev),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index fa2dba1dcec8..fdc505139b99 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1592,6 +1592,10 @@ const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
 const struct bpf_func_proto bpf_rbtree_get_lock_proto __weak;
 const struct bpf_func_proto bpf_rbtree_lock_proto __weak;
 const struct bpf_func_proto bpf_rbtree_unlock_proto __weak;
+const struct bpf_func_proto bpf_rbtree_first_proto __weak;
+const struct bpf_func_proto bpf_rbtree_last_proto __weak;
+const struct bpf_func_proto bpf_rbtree_next_proto __weak;
+const struct bpf_func_proto bpf_rbtree_prev_proto __weak;
 
 const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
@@ -1697,6 +1701,14 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_rbtree_lock_proto;
 	case BPF_FUNC_rbtree_unlock:
 		return &bpf_rbtree_unlock_proto;
+	case BPF_FUNC_rbtree_first:
+		return &bpf_rbtree_first_proto;
+	case BPF_FUNC_rbtree_last:
+		return &bpf_rbtree_last_proto;
+	case BPF_FUNC_rbtree_next:
+		return &bpf_rbtree_next_proto;
+	case BPF_FUNC_rbtree_prev:
+		return &bpf_rbtree_prev_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
index dcf8f69d4ada..f1a06c340d88 100644
--- a/kernel/bpf/rbtree.c
+++ b/kernel/bpf/rbtree.c
@@ -140,6 +140,84 @@ const struct bpf_func_proto bpf_rbtree_find_proto = {
 	.arg3_type = ARG_PTR_TO_FUNC,
 };
 
+BPF_CALL_1(bpf_rbtree_first, struct bpf_map *, map)
+{
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+
+	if (!__rbtree_lock_held(tree))
+		return (u64)NULL;
+
+	return (u64)rb_first_cached(&tree->root);
+}
+
+const struct bpf_func_proto bpf_rbtree_first_proto = {
+	.func = bpf_rbtree_first,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | PTR_ITER | OBJ_NON_OWNING_REF,
+	.ret_btf_id = &bpf_rbtree_btf_ids[0],
+	.arg1_type = ARG_CONST_MAP_PTR,
+};
+
+BPF_CALL_1(bpf_rbtree_last, struct bpf_map *, map)
+{
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+
+	if (!__rbtree_lock_held(tree))
+		return (u64)NULL;
+
+	return (u64)rb_last(&tree->root.rb_root);
+}
+
+const struct bpf_func_proto bpf_rbtree_last_proto = {
+	.func = bpf_rbtree_last,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | PTR_ITER | OBJ_NON_OWNING_REF,
+	.ret_btf_id = &bpf_rbtree_btf_ids[0],
+	.arg1_type = ARG_CONST_MAP_PTR,
+};
+
+BPF_CALL_2(bpf_rbtree_next, struct bpf_map *, map, void *, cur)
+{
+	struct rb_node *next = rb_next((struct rb_node *)cur);
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+
+	if (!__rbtree_lock_held(tree))
+		return (u64)NULL;
+
+	return (u64)next;
+}
+
+const struct bpf_func_proto bpf_rbtree_next_proto = {
+	.func = bpf_rbtree_next,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | PTR_ITER_END | OBJ_NON_OWNING_REF,
+	.ret_btf_id = &bpf_rbtree_btf_ids[0],
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID | PTR_ITER,
+	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
+};
+
+BPF_CALL_2(bpf_rbtree_prev, struct bpf_map *, map, void *, cur)
+{
+	struct rb_node *node = (struct rb_node *)cur;
+	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
+
+	if (!__rbtree_lock_held(tree))
+		return (u64)NULL;
+
+	return (u64)rb_prev(node);
+}
+
+const struct bpf_func_proto bpf_rbtree_prev_proto = {
+	.func = bpf_rbtree_prev,
+	.gpl_only = true,
+	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | PTR_ITER_END | OBJ_NON_OWNING_REF,
+	.ret_btf_id = &bpf_rbtree_btf_ids[0],
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID | PTR_ITER,
+	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
+};
+
 /* Like rbtree_postorder_for_each_entry_safe, but 'pos' and 'n' are
  * 'rb_node *', so field name of rb_node within containing struct is not
  * needed.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f80e161170de..e1580a01cfe3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -482,6 +482,11 @@ static bool type_may_be_null(u32 type)
 	return type & PTR_MAYBE_NULL;
 }
 
+static bool type_is_iter(u32 type)
+{
+	return type & PTR_ITER || type & PTR_ITER_END;
+}
+
 static bool is_acquire_function(enum bpf_func_id func_id,
 				const struct bpf_map *map)
 {
@@ -547,14 +552,20 @@ static bool function_manipulates_rbtree_node(enum bpf_func_id func_id)
 {
 	return func_id == BPF_FUNC_rbtree_add ||
 		func_id == BPF_FUNC_rbtree_remove ||
-		func_id == BPF_FUNC_rbtree_free_node;
+		func_id == BPF_FUNC_rbtree_free_node ||
+		func_id == BPF_FUNC_rbtree_next ||
+		func_id == BPF_FUNC_rbtree_prev;
 }
 
 static bool function_returns_rbtree_node(enum bpf_func_id func_id)
 {
 	return func_id == BPF_FUNC_rbtree_alloc_node ||
 		func_id == BPF_FUNC_rbtree_add ||
-		func_id == BPF_FUNC_rbtree_remove;
+		func_id == BPF_FUNC_rbtree_remove ||
+		func_id == BPF_FUNC_rbtree_first ||
+		func_id == BPF_FUNC_rbtree_last ||
+		func_id == BPF_FUNC_rbtree_next ||
+		func_id == BPF_FUNC_rbtree_prev;
 }
 
 /* string representation of 'enum bpf_reg_type'
@@ -614,6 +625,12 @@ static const char *reg_type_str(struct bpf_verifier_env *env,
 					       32 - postfix_idx);
 	}
 
+	if (type_is_iter(type)) {
+		postfix_idx += strlcpy(postfix + postfix_idx, "_iter", 32 - postfix_idx);
+		if (type & PTR_ITER_END)
+			postfix_idx += strlcpy(postfix + postfix_idx, "_end", 32 - postfix_idx);
+	}
+
 	if (type & MEM_RDONLY)
 		strncpy(prefix, "rdonly_", 32);
 	if (type & MEM_ALLOC)
@@ -1428,7 +1445,7 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
 			   map->map_type == BPF_MAP_TYPE_SOCKHASH) {
 			reg->type = PTR_TO_SOCKET;
 		} else {
-			reg->type = PTR_TO_MAP_VALUE;
+			reg->type &= ~PTR_MAYBE_NULL;
 		}
 		return;
 	}
@@ -3021,6 +3038,11 @@ static bool __is_pointer_value(bool allow_ptr_leaks,
 	return reg->type != SCALAR_VALUE;
 }
 
+static bool __is_iter_end(const struct bpf_reg_state *reg)
+{
+	return reg->type & PTR_ITER_END;
+}
+
 static void save_register_state(struct bpf_func_state *state,
 				int spi, struct bpf_reg_state *reg,
 				int size)
@@ -5666,6 +5688,8 @@ static const struct bpf_reg_types map_key_value_types = {
 		PTR_TO_PACKET_META,
 		PTR_TO_MAP_KEY,
 		PTR_TO_MAP_VALUE,
+		PTR_TO_MAP_VALUE | PTR_ITER,
+		PTR_TO_MAP_VALUE | PTR_ITER_END,
 	},
 };
 
@@ -5793,6 +5817,17 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
 	if (arg_type & PTR_MAYBE_NULL)
 		type &= ~PTR_MAYBE_NULL;
 
+	/* TYPE | PTR_ITER is valid input for helpers that expect TYPE
+	 * TYPE is not valid input for helpers that expect TYPE | PTR_ITER
+	 */
+	if (type_is_iter(arg_type)) {
+		if (!type_is_iter(type))
+			goto not_found;
+
+		type &= ~PTR_ITER;
+		type &= ~PTR_ITER_END;
+	}
+
 	for (i = 0; i < ARRAY_SIZE(compatible->types); i++) {
 		expected = compatible->types[i];
 		if (expected == NOT_INIT)
@@ -5802,6 +5837,7 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
 			goto found;
 	}
 
+not_found:
 	verbose(env, "R%d type=%s expected=", regno, reg_type_str(env, reg->type));
 	for (j = 0; j + 1 < i; j++)
 		verbose(env, "%s, ", reg_type_str(env, compatible->types[j]));
@@ -9762,6 +9798,17 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode,
 			   bool is_jmp32)
 {
 	if (__is_pointer_value(false, reg)) {
+		if (__is_iter_end(reg) && val == 0) {
+			__mark_reg_const_zero(reg);
+			switch (opcode) {
+			case BPF_JEQ:
+				return 1;
+			case BPF_JNE:
+				return 0;
+			default:
+				return -1;
+			}
+		}
 		if (!reg_type_not_null(reg->type))
 			return -1;
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d21e2c99ea14..3869653c6aeb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5409,6 +5409,34 @@ union bpf_attr {
  *
  *	Return
  *		0
+ *
+ * long bpf_rbtree_first(struct bpf_map *map)
+ *	Description
+ *		Return the first node in the tree according to sort order
+ *
+ *	Return
+ *		If found, ptr to node, otherwise NULL
+ *
+ * long bpf_rbtree_last(struct bpf_map *map)
+ *	Description
+ *		Return the last node in the tree according to sort order
+ *
+ *	Return
+ *		If found, ptr to node, otherwise NULL
+ *
+ * long bpf_rbtree_next(struct bpf_map *map, void *cur)
+ *	Description
+ *		Return the next node in the tree according to sort order
+ *
+ *	Return
+ *		If found, ptr to node, otherwise NULL
+ *
+ * long bpf_rbtree_prev(struct bpf_map *map, void *cur)
+ *	Description
+ *		Return the previous node in the tree according to sort order
+ *
+ *	Return
+ *		If found, ptr to node, otherwise NULL
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5620,13 +5648,17 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(rbtree_alloc_node),		\
-	FN(rbtree_add),		\
+	FN(rbtree_add),			\
 	FN(rbtree_find),		\
 	FN(rbtree_remove),		\
 	FN(rbtree_free_node),		\
 	FN(rbtree_get_lock),		\
 	FN(rbtree_lock),		\
 	FN(rbtree_unlock),		\
+	FN(rbtree_first),		\
+	FN(rbtree_last),		\
+	FN(rbtree_next),		\
+	FN(rbtree_prev),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH bpf-next 11/11] selftests/bpf: Add rbtree map tests
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (9 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags Dave Marchevsky
@ 2022-07-22 18:34 ` Dave Marchevsky
  2022-07-28  7:18   ` Yonghong Song
  2022-07-28  7:04 ` [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Yonghong Song
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dave Marchevsky @ 2022-07-22 18:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo, Dave Marchevsky

Add tests demonstrating happy path of rbtree map usage as well as
exercising numerous failure paths and conditions. Structure of failing
test runner is based on dynptr tests.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 .../selftests/bpf/prog_tests/rbtree_map.c     | 164 ++++++++++++
 .../testing/selftests/bpf/progs/rbtree_map.c  | 111 ++++++++
 .../selftests/bpf/progs/rbtree_map_fail.c     | 236 ++++++++++++++++++
 .../bpf/progs/rbtree_map_load_fail.c          |  24 ++
 4 files changed, 535 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_fail.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c

diff --git a/tools/testing/selftests/bpf/prog_tests/rbtree_map.c b/tools/testing/selftests/bpf/prog_tests/rbtree_map.c
new file mode 100644
index 000000000000..17cadcd05ee4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/rbtree_map.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <sys/syscall.h>
+#include <test_progs.h>
+#include "rbtree_map.skel.h"
+#include "rbtree_map_fail.skel.h"
+#include "rbtree_map_load_fail.skel.h"
+
+static size_t log_buf_sz = 1048576; /* 1 MB */
+static char obj_log_buf[1048576];
+
+static struct {
+	const char *prog_name;
+	const char *expected_err_msg;
+} rbtree_prog_load_fail_tests[] = {
+	{"rb_node__field_store", "only read is supported"},
+	{"rb_node__alloc_no_add", "Unreleased reference id=2 alloc_insn=3"},
+	{"rb_node__two_alloc_one_add", "Unreleased reference id=2 alloc_insn=3"},
+	{"rb_node__remove_no_free", "Unreleased reference id=5 alloc_insn=28"},
+	{"rb_tree__add_wrong_type", "rbtree: R2 is of type task_struct but node_data is expected"},
+	{"rb_tree__conditional_release_helper_usage",
+		"R2 type=ptr_cond_rel_ expected=ptr_"},
+};
+
+void test_rbtree_map_load_fail(void)
+{
+	struct rbtree_map_load_fail *skel;
+
+	skel = rbtree_map_load_fail__open_and_load();
+	if (!ASSERT_ERR_PTR(skel, "rbtree_map_load_fail__open_and_load"))
+		rbtree_map_load_fail__destroy(skel);
+}
+
+static void verify_fail(const char *prog_name, const char *expected_err_msg)
+{
+	LIBBPF_OPTS(bpf_object_open_opts, opts);
+	struct rbtree_map_fail *skel;
+	struct bpf_program *prog;
+	int err;
+
+	opts.kernel_log_buf = obj_log_buf;
+	opts.kernel_log_size = log_buf_sz;
+	opts.kernel_log_level = 1;
+
+	skel = rbtree_map_fail__open_opts(&opts);
+	if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open_opts"))
+		goto cleanup;
+
+	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+		goto cleanup;
+
+	bpf_program__set_autoload(prog, true);
+	err = rbtree_map_fail__load(skel);
+	if (!ASSERT_ERR(err, "unexpected load success"))
+		goto cleanup;
+
+	if (!ASSERT_OK_PTR(strstr(obj_log_buf, expected_err_msg), "expected_err_msg")) {
+		fprintf(stderr, "Expected err_msg: %s\n", expected_err_msg);
+		fprintf(stderr, "Verifier output: %s\n", obj_log_buf);
+	}
+
+cleanup:
+	rbtree_map_fail__destroy(skel);
+}
+
+void test_rbtree_map_alloc_node__size_too_small(void)
+{
+	struct rbtree_map_fail *skel;
+	struct bpf_program *prog;
+	struct bpf_link *link;
+	int err;
+
+	skel = rbtree_map_fail__open();
+	if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open"))
+		goto cleanup;
+
+	prog = skel->progs.alloc_node__size_too_small;
+	bpf_program__set_autoload(prog, true);
+
+	err = rbtree_map_fail__load(skel);
+	if (!ASSERT_OK(err, "unexpected load fail"))
+		goto cleanup;
+
+	link = bpf_program__attach(skel->progs.alloc_node__size_too_small);
+	if (!ASSERT_OK_PTR(link, "link"))
+		goto cleanup;
+
+	syscall(SYS_getpgid);
+
+	ASSERT_EQ(skel->bss->size_too_small__alloc_fail, 1, "alloc_fail");
+
+	bpf_link__destroy(link);
+cleanup:
+	rbtree_map_fail__destroy(skel);
+}
+
+void test_rbtree_map_add_node__no_lock(void)
+{
+	struct rbtree_map_fail *skel;
+	struct bpf_program *prog;
+	struct bpf_link *link;
+	int err;
+
+	skel = rbtree_map_fail__open();
+	if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open"))
+		goto cleanup;
+
+	prog = skel->progs.add_node__no_lock;
+	bpf_program__set_autoload(prog, true);
+
+	err = rbtree_map_fail__load(skel);
+	if (!ASSERT_OK(err, "unexpected load fail"))
+		goto cleanup;
+
+	link = bpf_program__attach(skel->progs.add_node__no_lock);
+	if (!ASSERT_OK_PTR(link, "link"))
+		goto cleanup;
+
+	syscall(SYS_getpgid);
+
+	ASSERT_EQ(skel->bss->no_lock_add__fail, 1, "alloc_fail");
+
+	bpf_link__destroy(link);
+cleanup:
+	rbtree_map_fail__destroy(skel);
+}
+
+void test_rbtree_map_prog_load_fail(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(rbtree_prog_load_fail_tests); i++) {
+		if (!test__start_subtest(rbtree_prog_load_fail_tests[i].prog_name))
+			continue;
+
+		verify_fail(rbtree_prog_load_fail_tests[i].prog_name,
+			    rbtree_prog_load_fail_tests[i].expected_err_msg);
+	}
+}
+
+void test_rbtree_map(void)
+{
+	struct rbtree_map *skel;
+	struct bpf_link *link;
+
+	skel = rbtree_map__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "rbtree_map__open_and_load"))
+		goto cleanup;
+
+	link = bpf_program__attach(skel->progs.check_rbtree);
+	if (!ASSERT_OK_PTR(link, "link"))
+		goto cleanup;
+
+	for (int i = 0; i < 100; i++)
+		syscall(SYS_getpgid);
+
+	ASSERT_EQ(skel->bss->calls, 100, "calls_equal");
+
+	bpf_link__destroy(link);
+cleanup:
+	rbtree_map__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/rbtree_map.c b/tools/testing/selftests/bpf/progs/rbtree_map.c
new file mode 100644
index 000000000000..0cd467838f6e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rbtree_map.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+struct node_data {
+	struct rb_node node;
+	__u32 one;
+	__u32 two;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RBTREE);
+	__type(value, struct node_data);
+} rbtree SEC(".maps");
+
+long calls;
+
+static bool less(struct rb_node *a, const struct rb_node *b)
+{
+	struct node_data *node_a;
+	struct node_data *node_b;
+
+	node_a = container_of(a, struct node_data, node);
+	node_b = container_of(b, struct node_data, node);
+
+	return node_a->one < node_b->one;
+}
+
+// Key = node_data
+static int cmp(const void *key, const struct rb_node *b)
+{
+	struct node_data *node_a;
+	struct node_data *node_b;
+
+	node_a = container_of(key, struct node_data, node);
+	node_b = container_of(b, struct node_data, node);
+
+	return node_b->one - node_a->one;
+}
+
+// Key = just node_data.one
+static int cmp2(const void *key, const struct rb_node *b)
+{
+	__u32 one;
+	struct node_data *node_b;
+
+	one = *(__u32 *)key;
+	node_b = container_of(b, struct node_data, node);
+
+	return node_b->one - one;
+}
+
+SEC("fentry/" SYS_PREFIX "sys_getpgid")
+int check_rbtree(void *ctx)
+{
+	struct node_data *node, *found, *ret;
+	struct node_data popped;
+	struct node_data search;
+	__u32 search2;
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
+	if (!node)
+		return 0;
+
+	node->one = calls;
+	node->two = 6;
+	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
+
+	ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
+	if (!ret) {
+		bpf_rbtree_free_node(&rbtree, node);
+		goto unlock_ret;
+	}
+
+	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
+
+	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
+
+	search.one = calls;
+	found = (struct node_data *)bpf_rbtree_find(&rbtree, &search, cmp);
+	if (!found)
+		goto unlock_ret;
+
+	int node_ct = 0;
+	struct node_data *iter = (struct node_data *)bpf_rbtree_first(&rbtree);
+
+	while (iter) {
+		node_ct++;
+		iter = (struct node_data *)bpf_rbtree_next(&rbtree, iter);
+	}
+
+	ret = (struct node_data *)bpf_rbtree_remove(&rbtree, found);
+	if (!ret)
+		goto unlock_ret;
+
+	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
+
+	bpf_rbtree_free_node(&rbtree, ret);
+
+	__sync_fetch_and_add(&calls, 1);
+	return 0;
+
+unlock_ret:
+	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/rbtree_map_fail.c b/tools/testing/selftests/bpf/progs/rbtree_map_fail.c
new file mode 100644
index 000000000000..287c8db080d8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rbtree_map_fail.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+struct node_data {
+	struct rb_node node;
+	__u32 one;
+	__u32 two;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RBTREE);
+	__type(value, struct node_data);
+} rbtree SEC(".maps");
+
+long calls;
+
+static bool less(struct rb_node *a, const struct rb_node *b)
+{
+	struct node_data *node_a;
+	struct node_data *node_b;
+
+	node_a = container_of(a, struct node_data, node);
+	node_b = container_of(b, struct node_data, node);
+
+	return node_a->one < node_b->one;
+}
+
+// Key = node_data
+static int cmp(const void *key, const struct rb_node *b)
+{
+	struct node_data *node_a;
+	struct node_data *node_b;
+
+	node_a = container_of(key, struct node_data, node);
+	node_b = container_of(b, struct node_data, node);
+
+	return node_b->one - node_a->one;
+}
+
+long size_too_small__alloc_fail;
+
+SEC("?fentry/" SYS_PREFIX "sys_getpgid")
+int alloc_node__size_too_small(void *ctx)
+{
+	struct node_data *node, *ret;
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(char));
+	if (!node) {
+		size_too_small__alloc_fail++;
+		return 0;
+	}
+
+	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
+	/* will never execute, alloc_node should fail */
+	node->one = 1;
+	ret = bpf_rbtree_add(&rbtree, node, less);
+	if (!ret) {
+		bpf_rbtree_free_node(&rbtree, node);
+		goto unlock_ret;
+	}
+
+unlock_ret:
+	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
+	return 0;
+}
+
+long no_lock_add__fail;
+
+SEC("?fentry/" SYS_PREFIX "sys_getpgid")
+int add_node__no_lock(void *ctx)
+{
+	struct node_data *node, *ret;
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
+	if (!node)
+		return 0;
+
+	node->one = 1;
+	ret = bpf_rbtree_add(&rbtree, node, less);
+	if (!ret) {
+		no_lock_add__fail++;
+		/* will always execute, rbtree_add should fail
+		 * because no lock held
+		 */
+		bpf_rbtree_free_node(&rbtree, node);
+	}
+
+unlock_ret:
+	return 0;
+}
+
+SEC("?fentry/" SYS_PREFIX "sys_getpgid")
+int rb_node__field_store(void *ctx)
+{
+	struct node_data *node;
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
+	if (!node)
+		return 0;
+
+	/* Only rbtree_map helpers can modify rb_node field */
+	node->node.rb_left = NULL;
+	return 0;
+}
+
+SEC("?fentry/" SYS_PREFIX "sys_getpgid")
+int rb_node__alloc_no_add(void *ctx)
+{
+	struct node_data *node;
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
+	if (!node)
+		return 0;
+	/* The node alloc'd above is never added to the rbtree. It must be
+	 * added or free'd before prog terminates.
+	 */
+
+	node->one = 42;
+	return 0;
+}
+
+SEC("?fentry/" SYS_PREFIX "sys_getpgid")
+int rb_node__two_alloc_one_add(void *ctx)
+{
+	struct node_data *node, *ret;
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
+	if (!node)
+		return 0;
+	node->one = 1;
+	/* The node alloc'd above is never added to the rbtree. It must be
+	 * added or free'd before prog terminates.
+	 */
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
+	if (!node)
+		return 0;
+	node->one = 42;
+
+	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
+
+	ret = bpf_rbtree_add(&rbtree, node, less);
+	if (!ret) {
+		bpf_rbtree_free_node(&rbtree, node);
+		goto unlock_ret;
+	}
+
+unlock_ret:
+	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
+	return 0;
+}
+
+SEC("?fentry/" SYS_PREFIX "sys_getpgid")
+int rb_node__remove_no_free(void *ctx)
+{
+	struct node_data *node, *ret;
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
+	if (!node)
+		return 0;
+	node->one = 42;
+
+	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
+
+	ret = bpf_rbtree_add(&rbtree, node, less);
+	if (!ret) {
+		bpf_rbtree_free_node(&rbtree, node);
+		goto unlock_ret;
+	}
+
+	ret = bpf_rbtree_remove(&rbtree, ret);
+	if (!ret)
+		goto unlock_ret;
+	/* At this point we've successfully acquired a reference from
+	 * bpf_rbtree_remove. It must be released via rbtree_add or
+	 * rbtree_free_node before prog terminates.
+	 */
+
+unlock_ret:
+	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
+	return 0;
+}
+
+SEC("?fentry/" SYS_PREFIX "sys_getpgid")
+int rb_tree__add_wrong_type(void *ctx)
+{
+	/* Can't add a task_struct to rbtree
+	 */
+	struct task_struct *task;
+	struct node_data *ret;
+
+	task = bpf_get_current_task_btf();
+
+	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
+
+	ret = bpf_rbtree_add(&rbtree, task, less);
+	/* Verifier should fail at bpf_rbtree_add, so don't bother handling
+	 * failure.
+	 */
+
+	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
+	return 0;
+}
+
+SEC("?fentry/" SYS_PREFIX "sys_getpgid")
+int rb_tree__conditional_release_helper_usage(void *ctx)
+{
+	struct node_data *node, *ret;
+
+	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
+	if (!node)
+		return 0;
+	node->one = 42;
+
+	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
+
+	ret = bpf_rbtree_add(&rbtree, node, less);
+	/* Verifier should fail when trying to use CONDITIONAL_RELEASE
+	 * type in a helper
+	 */
+	bpf_rbtree_free_node(&rbtree, node);
+	if (!ret) {
+		bpf_rbtree_free_node(&rbtree, node);
+		goto unlock_ret;
+	}
+
+unlock_ret:
+	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c b/tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c
new file mode 100644
index 000000000000..036558eedd66
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+struct node_data_no_rb_node {
+	__u64 one;
+	__u64 two;
+	__u64 three;
+	__u64 four;
+	__u64 five;
+	__u64 six;
+	__u64 seven;
+};
+
+/* Should fail because value struct has no rb_node
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_RBTREE);
+	__type(value, struct node_data_no_rb_node);
+} rbtree_fail_no_rb_node SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (10 preceding siblings ...)
  2022-07-22 18:34 ` [RFC PATCH bpf-next 11/11] selftests/bpf: Add rbtree map tests Dave Marchevsky
@ 2022-07-28  7:04 ` Yonghong Song
  2022-08-10 17:54   ` Dave Marchevsky
  2022-08-01 21:27 ` Alexei Starovoitov
  2022-08-02 22:02 ` Andrii Nakryiko
  13 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2022-07-28  7:04 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo



On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> Introduce bpf_rbtree map data structure. As the name implies, rbtree map
> allows bpf programs to use red-black trees similarly to kernel code.
> Programs interact with rbtree maps in a much more open-coded way than
> more classic map implementations. Some example code to demonstrate:
> 
>    node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
>    if (!node)
>      return 0;
> 
>    node->one = calls;
>    node->two = 6;
>    bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));

Can we just do
      bpf_rbtree_lock(&rbtree)
      bpf_rbtree_unlock(&rbtree)
? Looks like the only places bpf_rbtree_get_lock() is used are
as an argument to bpf_rbtree_lock/unlock or bpf_spin_lock/unlock?
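
E.g. the quoted snippet would then read (sketch only, assuming the
helpers take the map pointer and look up the lock internally):

	bpf_rbtree_lock(&rbtree);

	ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
	if (!ret)
		bpf_rbtree_free_node(&rbtree, node);

	bpf_rbtree_unlock(&rbtree);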

> 
>    ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
>    if (!ret) {
>      bpf_rbtree_free_node(&rbtree, node);
>      goto unlock_ret;
>    }
> 
> unlock_ret:
>    bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>    return 0;
> 
> 
> This series is in a heavy RFC state, with some added verifier semantics
> needing improvement before they can be considered safe. I am sending
> early to gather feedback on approach:
> 
>    * Does the API seem reasonable and might it be useful for others?
> 
>    * Do new verifier semantics added in this series make logical sense?
>      Are there any glaring safety holes aside from those called out in
>      individual patches?
> 
> Please see individual patches for more in-depth explanation. A quick
> summary of patches follows:
> 
> 
> Patches 1-3 extend verifier and BTF searching logic in minor ways to
> prepare for rbtree implementation patch.
>    bpf: Pull repeated reg access bounds check into helper fn
>    bpf: Add verifier support for custom callback return range
>    bpf: Add rb_node_off to bpf_map
> 
> 
> Patch 4 adds basic rbtree map implementation.
>    bpf: Add rbtree map
> 
> Note that 'complete' implementation requires concepts and changes
> introduced in further patches in the series. The series is currently
> arranged in this way to ease RFC review.
> 
> 
> Patches 5-7 add a spinlock to the rbtree map, with some differing
> semantics from existing verifier spinlock handling.
>    bpf: Add bpf_spin_lock member to rbtree
>    bpf: Add bpf_rbtree_{lock,unlock} helpers
>    bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}
> 
> Notably, rbtree's bpf_spin_lock must be held while manipulating the tree
> via helpers, while existing spinlock verifier logic prevents any helper
> calls while lock is held. In current state this is worked around by not
> having the verifier treat rbtree's lock specially in any way. This
> needs to be improved before leaving RFC state as it's unsafe.
> 
[...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 11/11] selftests/bpf: Add rbtree map tests
  2022-07-22 18:34 ` [RFC PATCH bpf-next 11/11] selftests/bpf: Add rbtree map tests Dave Marchevsky
@ 2022-07-28  7:18   ` Yonghong Song
  2022-08-10 17:48     ` Dave Marchevsky
  0 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2022-07-28  7:18 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo



On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> Add tests demonstrating happy path of rbtree map usage as well as
> exercising numerous failure paths and conditions. Structure of failing
> test runner is based on dynptr tests.
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>   .../selftests/bpf/prog_tests/rbtree_map.c     | 164 ++++++++++++
>   .../testing/selftests/bpf/progs/rbtree_map.c  | 111 ++++++++
>   .../selftests/bpf/progs/rbtree_map_fail.c     | 236 ++++++++++++++++++
>   .../bpf/progs/rbtree_map_load_fail.c          |  24 ++
>   4 files changed, 535 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree_map.c
>   create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map.c
>   create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_fail.c
>   create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/rbtree_map.c b/tools/testing/selftests/bpf/prog_tests/rbtree_map.c
> new file mode 100644
> index 000000000000..17cadcd05ee4
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/rbtree_map.c
> @@ -0,0 +1,164 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +
> +#include <sys/syscall.h>
> +#include <test_progs.h>
> +#include "rbtree_map.skel.h"
> +#include "rbtree_map_fail.skel.h"
> +#include "rbtree_map_load_fail.skel.h"
> +
> +static size_t log_buf_sz = 1048576; /* 1 MB */
> +static char obj_log_buf[1048576];
> +
> +static struct {
> +	const char *prog_name;
> +	const char *expected_err_msg;
> +} rbtree_prog_load_fail_tests[] = {
> +	{"rb_node__field_store", "only read is supported"},
> +	{"rb_node__alloc_no_add", "Unreleased reference id=2 alloc_insn=3"},
> +	{"rb_node__two_alloc_one_add", "Unreleased reference id=2 alloc_insn=3"},
> +	{"rb_node__remove_no_free", "Unreleased reference id=5 alloc_insn=28"},
> +	{"rb_tree__add_wrong_type", "rbtree: R2 is of type task_struct but node_data is expected"},
> +	{"rb_tree__conditional_release_helper_usage",
> +		"R2 type=ptr_cond_rel_ expected=ptr_"},
> +};
> +
> +void test_rbtree_map_load_fail(void)
> +{
> +	struct rbtree_map_load_fail *skel;
> +
> +	skel = rbtree_map_load_fail__open_and_load();
> +	if (!ASSERT_ERR_PTR(skel, "rbtree_map_load_fail__open_and_load"))
> +		rbtree_map_load_fail__destroy(skel);
> +}
> +
> +static void verify_fail(const char *prog_name, const char *expected_err_msg)
> +{
> +	LIBBPF_OPTS(bpf_object_open_opts, opts);
> +	struct rbtree_map_fail *skel;
> +	struct bpf_program *prog;
> +	int err;
> +
> +	opts.kernel_log_buf = obj_log_buf;
> +	opts.kernel_log_size = log_buf_sz;
> +	opts.kernel_log_level = 1;
> +
> +	skel = rbtree_map_fail__open_opts(&opts);
> +	if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open_opts"))
> +		goto cleanup;
> +
> +	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
> +	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
> +		goto cleanup;
> +
> +	bpf_program__set_autoload(prog, true);
> +	err = rbtree_map_fail__load(skel);
> +	if (!ASSERT_ERR(err, "unexpected load success"))
> +		goto cleanup;
> +
> +	if (!ASSERT_OK_PTR(strstr(obj_log_buf, expected_err_msg), "expected_err_msg")) {
> +		fprintf(stderr, "Expected err_msg: %s\n", expected_err_msg);
> +		fprintf(stderr, "Verifier output: %s\n", obj_log_buf);
> +	}
> +
> +cleanup:
> +	rbtree_map_fail__destroy(skel);
> +}
> +
> +void test_rbtree_map_alloc_node__size_too_small(void)
> +{
> +	struct rbtree_map_fail *skel;
> +	struct bpf_program *prog;
> +	struct bpf_link *link;
> +	int err;
> +
> +	skel = rbtree_map_fail__open();
> +	if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open"))
> +		goto cleanup;
> +
> +	prog = skel->progs.alloc_node__size_too_small;
> +	bpf_program__set_autoload(prog, true);
> +
> +	err = rbtree_map_fail__load(skel);
> +	if (!ASSERT_OK(err, "unexpected load fail"))
> +		goto cleanup;
> +
> +	link = bpf_program__attach(skel->progs.alloc_node__size_too_small);
> +	if (!ASSERT_OK_PTR(link, "link"))
> +		goto cleanup;
> +
> +	syscall(SYS_getpgid);
> +
> +	ASSERT_EQ(skel->bss->size_too_small__alloc_fail, 1, "alloc_fail");
> +
> +	bpf_link__destroy(link);
> +cleanup:
> +	rbtree_map_fail__destroy(skel);
> +}
> +
> +void test_rbtree_map_add_node__no_lock(void)
> +{
> +	struct rbtree_map_fail *skel;
> +	struct bpf_program *prog;
> +	struct bpf_link *link;
> +	int err;
> +
> +	skel = rbtree_map_fail__open();
> +	if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open"))
> +		goto cleanup;
> +
> +	prog = skel->progs.add_node__no_lock;
> +	bpf_program__set_autoload(prog, true);
> +
> +	err = rbtree_map_fail__load(skel);
> +	if (!ASSERT_OK(err, "unexpected load fail"))
> +		goto cleanup;
> +
> +	link = bpf_program__attach(skel->progs.add_node__no_lock);
> +	if (!ASSERT_OK_PTR(link, "link"))
> +		goto cleanup;
> +
> +	syscall(SYS_getpgid);
> +
> +	ASSERT_EQ(skel->bss->no_lock_add__fail, 1, "alloc_fail");
> +
> +	bpf_link__destroy(link);
> +cleanup:
> +	rbtree_map_fail__destroy(skel);
> +}
> +
> +void test_rbtree_map_prog_load_fail(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(rbtree_prog_load_fail_tests); i++) {
> +		if (!test__start_subtest(rbtree_prog_load_fail_tests[i].prog_name))
> +			continue;
> +
> +		verify_fail(rbtree_prog_load_fail_tests[i].prog_name,
> +			    rbtree_prog_load_fail_tests[i].expected_err_msg);
> +	}
> +}
> +
> +void test_rbtree_map(void)
> +{
> +	struct rbtree_map *skel;
> +	struct bpf_link *link;
> +
> +	skel = rbtree_map__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "rbtree_map__open_and_load"))
> +		goto cleanup;
> +
> +	link = bpf_program__attach(skel->progs.check_rbtree);
> +	if (!ASSERT_OK_PTR(link, "link"))
> +		goto cleanup;
> +
> +	for (int i = 0; i < 100; i++)
> +		syscall(SYS_getpgid);
> +
> +	ASSERT_EQ(skel->bss->calls, 100, "calls_equal");
> +
> +	bpf_link__destroy(link);
> +cleanup:
> +	rbtree_map__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/rbtree_map.c b/tools/testing/selftests/bpf/progs/rbtree_map.c
> new file mode 100644
> index 000000000000..0cd467838f6e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/rbtree_map.c
> @@ -0,0 +1,111 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include "bpf_misc.h"
> +
> +struct node_data {
> +	struct rb_node node;
> +	__u32 one;
> +	__u32 two;
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_RBTREE);
> +	__type(value, struct node_data);
> +} rbtree SEC(".maps");
> +
> +long calls;
> +
> +static bool less(struct rb_node *a, const struct rb_node *b)
> +{
> +	struct node_data *node_a;
> +	struct node_data *node_b;
> +
> +	node_a = container_of(a, struct node_data, node);
> +	node_b = container_of(b, struct node_data, node);
> +
> +	return node_a->one < node_b->one;
> +}
> +
> +// Key = node_data
> +static int cmp(const void *key, const struct rb_node *b)
> +{
> +	struct node_data *node_a;
> +	struct node_data *node_b;
> +
> +	node_a = container_of(key, struct node_data, node);
> +	node_b = container_of(b, struct node_data, node);
> +
> +	return node_b->one - node_a->one;
> +}
> +
> +// Key = just node_data.one
> +static int cmp2(const void *key, const struct rb_node *b)
> +{
> +	__u32 one;
> +	struct node_data *node_b;
> +
> +	one = *(__u32 *)key;
> +	node_b = container_of(b, struct node_data, node);
> +
> +	return node_b->one - one;
> +}
> +
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")
> +int check_rbtree(void *ctx)
> +{
> +	struct node_data *node, *found, *ret;
> +	struct node_data popped;
> +	struct node_data search;
> +	__u32 search2;
> +
> +	node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));

If I understand correctly, bpf_rbtree_alloc_node() may cause a reschedule
inside the function. So the program should be sleepable, right?

> +	if (!node)
> +		return 0;
> +
> +	node->one = calls;
> +	node->two = 6;
> +	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
> +
> +	ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
> +	if (!ret) {
> +		bpf_rbtree_free_node(&rbtree, node);
> +		goto unlock_ret;
> +	}
> +
> +	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
> +
> +	bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
> +
> +	search.one = calls;
> +	found = (struct node_data *)bpf_rbtree_find(&rbtree, &search, cmp);
> +	if (!found)
> +		goto unlock_ret;
> +
> +	int node_ct = 0;
> +	struct node_data *iter = (struct node_data *)bpf_rbtree_first(&rbtree);
> +
> +	while (iter) {
> +		node_ct++;
> +		iter = (struct node_data *)bpf_rbtree_next(&rbtree, iter);
> +	}
> +
> +	ret = (struct node_data *)bpf_rbtree_remove(&rbtree, found);
> +	if (!ret)
> +		goto unlock_ret;
> +
> +	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
> +
> +	bpf_rbtree_free_node(&rbtree, ret);
> +
> +	__sync_fetch_and_add(&calls, 1);
> +	return 0;
> +
> +unlock_ret:
> +	bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
> +	return 0;
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags
  2022-07-22 18:34 ` [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags Dave Marchevsky
@ 2022-07-29 16:31   ` Tejun Heo
  2022-08-01 22:44   ` Alexei Starovoitov
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2022-07-29 16:31 UTC (permalink / raw)
  To: Dave Marchevsky
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team

Hello,

One nit clang found.

On Fri, Jul 22, 2022 at 11:34:37AM -0700, Dave Marchevsky wrote:
> @@ -5793,6 +5817,17 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
>  	if (arg_type & PTR_MAYBE_NULL)
>  		type &= ~PTR_MAYBE_NULL;
>  
> +	/* TYPE | PTR_ITER is valid input for helpers that expect TYPE
> +	 * TYPE is not valid input for helpers that expect TYPE | PTR_ITER
> +	 */
> +	if (type_is_iter(arg_type)) {
> +		if (!type_is_iter(type))
> +			goto not_found;

Here, we jump to not_found with @i uninitialized, and the not_found block
loops up to @i.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (11 preceding siblings ...)
  2022-07-28  7:04 ` [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Yonghong Song
@ 2022-08-01 21:27 ` Alexei Starovoitov
  2022-08-10 18:11   ` Dave Marchevsky
  2022-08-02 22:02 ` Andrii Nakryiko
  13 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 21:27 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> Introduce bpf_rbtree map data structure. As the name implies, rbtree map
> allows bpf programs to use red-black trees similarly to kernel code.
> Programs interact with rbtree maps in a much more open-coded way than
> more classic map implementations. Some example code to demonstrate:
> 
>    node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
>    if (!node)
>      return 0;
> 
>    node->one = calls;
>    node->two = 6;
>    bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
> 
>    ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
>    if (!ret) {
>      bpf_rbtree_free_node(&rbtree, node);
>      goto unlock_ret;
>    }
> 
> unlock_ret:
>    bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>    return 0;
> 
> 
> This series is in a heavy RFC state, with some added verifier semantics
> needing improvement before they can be considered safe. I am sending
> early to gather feedback on approach:
> 
>    * Does the API seem reasonable and might it be useful for others?

Overall it seems to be on the right track.
The two main issues to resolve are locks and iterators.
The rest needs work too, but it's getting close :)
Will reply in individual patches.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 04/11] bpf: Add rbtree map
  2022-07-22 18:34 ` [RFC PATCH bpf-next 04/11] bpf: Add rbtree map Dave Marchevsky
@ 2022-08-01 21:49   ` Alexei Starovoitov
  0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 21:49 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> +
> +static struct rb_node *rbtree_map_alloc_node(struct bpf_map *map, size_t sz)
> +{
> +	struct rb_node *node;
> +
> +	node = bpf_map_kmalloc_node(map, sz, GFP_KERNEL, map->numa_node);

As Yonghong pointed out this should be GFP_NOWAIT for now.
Later we can convert this to bpf_mem_alloc to make sure it's safe from 
any context.
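
I.e. for now something like (sketch of the suggested interim change):

	node = bpf_map_kmalloc_node(map, sz, GFP_NOWAIT, map->numa_node);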

> +	if (!node)
> +		return NULL;
> +	RB_CLEAR_NODE(node);
> +	return node;
> +}
> +
> +BPF_CALL_2(bpf_rbtree_alloc_node, struct bpf_map *, map, u32, sz)
> +{
> +	struct rb_node *node;
> +
> +	if (map->map_type != BPF_MAP_TYPE_RBTREE)
> +		return (u64)NULL;
> +
> +	if (sz < sizeof(*node))
> +		return (u64)NULL;
> +
> +	node = rbtree_map_alloc_node(map, (size_t)sz);
> +	if (!node)
> +		return (u64)NULL;
> +
> +	return (u64)node;
> +}
> +
> +const struct bpf_func_proto bpf_rbtree_alloc_node_proto = {
> +	.func = bpf_rbtree_alloc_node,
> +	.gpl_only = true,
> +	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
> +	.ret_btf_id = &bpf_rbtree_btf_ids[0],

since the btf_id is unused please use
.ret_btf_id   = BPF_PTR_POISON
as bpf_kptr_xchg_proto() is doing.
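
I.e. (sketch):

	.ret_type	= RET_PTR_TO_BTF_ID_OR_NULL,
	.ret_btf_id	= BPF_PTR_POISON,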

> +
> +BPF_CALL_2(bpf_rbtree_remove, struct bpf_map *, map, void *, value)
> +{
> +	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
> +	struct rb_node *node = (struct rb_node *)value;
> +
> +	if (WARN_ON_ONCE(RB_EMPTY_NODE(node)))
> +		return (u64)NULL;
> +
> +	rb_erase_cached(node, &tree->root);
> +	RB_CLEAR_NODE(node);
> +	return (u64)node;
> +}
> +
> +const struct bpf_func_proto bpf_rbtree_remove_proto = {
> +	.func = bpf_rbtree_remove,
> +	.gpl_only = true,
> +	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
> +	.ret_btf_id = &bpf_rbtree_btf_ids[0],
> +	.arg1_type = ARG_CONST_MAP_PTR,
> +	.arg2_type = ARG_PTR_TO_BTF_ID,
> +	.arg2_btf_id = &bpf_rbtree_btf_ids[0],

same for args.

> +
> +BTF_ID_LIST_SINGLE(bpf_rbtree_map_btf_ids, struct, bpf_rbtree)

can be removed?

> +const struct bpf_map_ops rbtree_map_ops = {
> +	.map_meta_equal = bpf_map_meta_equal,
> +	.map_alloc_check = rbtree_map_alloc_check,
> +	.map_alloc = rbtree_map_alloc,
> +	.map_free = rbtree_map_free,
> +	.map_get_next_key = rbtree_map_get_next_key,
> +	.map_push_elem = rbtree_map_push_elem,
> +	.map_peek_elem = rbtree_map_peek_elem,
> +	.map_pop_elem = rbtree_map_pop_elem,
> +	.map_lookup_elem = rbtree_map_lookup_elem,
> +	.map_update_elem = rbtree_map_update_elem,
> +	.map_delete_elem = rbtree_map_delete_elem,
> +	.map_check_btf = rbtree_map_check_btf,
> +	.map_btf_id = &bpf_rbtree_map_btf_ids[0],
> +};
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1f50becce141..535f673882cd 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -481,7 +481,9 @@ static bool is_acquire_function(enum bpf_func_id func_id,
>   	    func_id == BPF_FUNC_sk_lookup_udp ||
>   	    func_id == BPF_FUNC_skc_lookup_tcp ||
>   	    func_id == BPF_FUNC_ringbuf_reserve ||
> -	    func_id == BPF_FUNC_kptr_xchg)
> +	    func_id == BPF_FUNC_kptr_xchg ||
> +	    func_id == BPF_FUNC_rbtree_alloc_node ||
> +	    func_id == BPF_FUNC_rbtree_remove)
>   		return true;
>   
>   	if (func_id == BPF_FUNC_map_lookup_elem &&
> @@ -531,6 +533,20 @@ static bool is_cmpxchg_insn(const struct bpf_insn *insn)
>   	       insn->imm == BPF_CMPXCHG;
>   }
>   
> +static bool function_manipulates_rbtree_node(enum bpf_func_id func_id)
> +{
> +	return func_id == BPF_FUNC_rbtree_add ||
> +		func_id == BPF_FUNC_rbtree_remove ||
> +		func_id == BPF_FUNC_rbtree_free_node;
> +}
> +
> +static bool function_returns_rbtree_node(enum bpf_func_id func_id)
> +{
> +	return func_id == BPF_FUNC_rbtree_alloc_node ||
> +		func_id == BPF_FUNC_rbtree_add ||
> +		func_id == BPF_FUNC_rbtree_remove;
> +}
> +
>   /* string representation of 'enum bpf_reg_type'
>    *
>    * Note that reg_type_str() can not appear more than once in a single verbose()
> @@ -3784,6 +3800,13 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
>   	return 0;
>   }
>

 >    * [ TODO: Existing logic prevents any writes to PTR_TO_BTF_ID. This
 >      is broadly turned off in this patch and replaced with "no writes to
 >      struct rb_node if the PTR_TO_BTF_ID struct has one". This is a hack and
 >      needs to be replaced. ]

..

> +static bool access_may_touch_field(u32 access_off, size_t access_sz,

can_write is more accurate.
There is no ambiguity here. atype == BPF_WRITE.

> +				   u32 field_off, size_t field_sz)
> +{
> +	return access_off < field_off + field_sz &&
> +		field_off < access_off + access_sz;
> +}
> +
>   /* if any part of struct field can be touched by
>    * load/store reject this program.
>    * To check that [x1, x2) overlaps with [y1, y2)
> @@ -4490,7 +4513,7 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
>   	const char *tname = btf_name_by_offset(reg->btf, t->name_off);
>   	enum bpf_type_flag flag = 0;
>   	u32 btf_id;
> -	int ret;
> +	int ret, rb_node_off;
>   
>   	if (off < 0) {
>   		verbose(env,
> @@ -4527,8 +4550,13 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
>   						  off, size, atype, &btf_id, &flag);
>   	} else {
>   		if (atype != BPF_READ) {
> -			verbose(env, "only read is supported\n");
> -			return -EACCES;
> +			rb_node_off = btf_find_rb_node(reg->btf, t);
> +			if (rb_node_off < 0 ||
> +			    access_may_touch_field(off, size, rb_node_off,
> +						   sizeof(struct rb_node))) {
> +				verbose(env, "only read is supported\n");
> +				return -EACCES;
> +			}

Allowing writes into ptr_to_btf_id probably should be a separate patch.
It's a big change.
btf_find_rb_node() alone is not enough.
Otherwise bpf progs will be able to write into any struct that has 
'rb_node'.
Maybe check that reg->btf == this prog's btf ?
Also allow writes into scalars only?
All pointers in prog's struct should be __kptr anyway to be safe.
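
Roughly something like this (sketch only; the scalars-only restriction
mentioned above is left out, and the prog-BTF check may need refinement):

	if (atype != BPF_READ) {
		rb_node_off = btf_find_rb_node(reg->btf, t);
		/* sketch: only allow writes into the program's own BTF types,
		 * and never into the embedded rb_node itself
		 */
		if (reg->btf != env->prog->aux->btf ||
		    rb_node_off < 0 ||
		    access_may_touch_field(off, size, rb_node_off,
					   sizeof(struct rb_node))) {
			verbose(env, "only read is supported\n");
			return -EACCES;
		}
	}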


>   		}
>   
>   		ret = btf_struct_access(&env->log, reg->btf, t, off, size,
> @@ -5764,6 +5792,17 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
>   		if (meta->func_id == BPF_FUNC_kptr_xchg) {
>   			if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno))
>   				return -EACCES;
> +		} else if (function_manipulates_rbtree_node(meta->func_id)) {
> +			if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> +						  meta->map_ptr->btf,
> +						  meta->map_ptr->btf_value_type_id,
> +						  strict_type_match)) {
> +				verbose(env, "rbtree: R%d is of type %s but %s is expected\n",
> +					regno, kernel_type_name(reg->btf, reg->btf_id),
> +					kernel_type_name(meta->map_ptr->btf,
> +							 meta->map_ptr->btf_value_type_id));
> +				return -EACCES;
> +			}
>   		} else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
>   						 btf_vmlinux, *arg_btf_id,
>   						 strict_type_match)) {
> @@ -6369,10 +6408,17 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>   		break;
>   	case BPF_FUNC_map_pop_elem:
>   		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
> +		    map->map_type != BPF_MAP_TYPE_RBTREE &&
>   		    map->map_type != BPF_MAP_TYPE_STACK)
>   			goto error;
>   		break;
>   	case BPF_FUNC_map_peek_elem:
> +		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
> +		    map->map_type != BPF_MAP_TYPE_STACK &&
> +		    map->map_type != BPF_MAP_TYPE_RBTREE &&
> +		    map->map_type != BPF_MAP_TYPE_BLOOM_FILTER)
> +			goto error;
> +		break;
>   	case BPF_FUNC_map_push_elem:
>   		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
>   		    map->map_type != BPF_MAP_TYPE_STACK &&
> @@ -6828,6 +6874,57 @@ static int set_loop_callback_state(struct bpf_verifier_env *env,
>   	return 0;
>   }
>   
> +static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
> +					 struct bpf_func_state *caller,
> +					 struct bpf_func_state *callee,
> +					 int insn_idx)
> +{
> +	struct bpf_map *map_ptr = caller->regs[BPF_REG_1].map_ptr;
> +
> +	/* bpf_rbtree_add(struct bpf_map *map, void *value, void *cb)
> +	 * cb(struct rb_node *a, const struct rb_node *b);
> +	 */
> +	callee->regs[BPF_REG_1].type = PTR_TO_MAP_VALUE;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
> +	callee->regs[BPF_REG_1].map_ptr = map_ptr;
> +
> +	callee->regs[BPF_REG_2].type = PTR_TO_MAP_VALUE;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
> +	callee->regs[BPF_REG_2].map_ptr = map_ptr;
> +
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_3]);
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
> +	callee->in_callback_fn = true;
> +	return 0;
> +}
> +
> +static int set_rbtree_find_callback_state(struct bpf_verifier_env *env,
> +					  struct bpf_func_state *caller,
> +					  struct bpf_func_state *callee,
> +					  int insn_idx)
> +{
> +	struct bpf_map *map_ptr = caller->regs[BPF_REG_1].map_ptr;
> +
> +	/* bpf_rbtree_find(struct bpf_map *map, void *key, void *cb)
> +	 * cb(void *key, const struct rb_node *b);
> +	 */
> +	callee->regs[BPF_REG_1].type = PTR_TO_MAP_VALUE;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
> +	callee->regs[BPF_REG_1].map_ptr = map_ptr;
> +
> +	callee->regs[BPF_REG_2].type = PTR_TO_MAP_VALUE;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
> +	callee->regs[BPF_REG_2].map_ptr = map_ptr;
> +
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_3]);
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
> +	callee->in_callback_fn = true;

add and find look the same up to this point.
Reuse set_rbtree_add_callback_state here?

> +	callee->callback_ret_range = tnum_range(0, U64_MAX);

that's to enforce that add's cb can only return 0 or 1 ?
But that would require bpf prog to have different cb-s for add and find.
Is this ok?
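
For the reuse idea above, roughly (sketch):

	static int set_rbtree_find_callback_state(struct bpf_verifier_env *env,
						  struct bpf_func_state *caller,
						  struct bpf_func_state *callee,
						  int insn_idx)
	{
		int err;

		err = set_rbtree_add_callback_state(env, caller, callee, insn_idx);
		if (err)
			return err;

		/* find's cb returns a -/0/+ comparison result, not just 0/1 */
		callee->callback_ret_range = tnum_range(0, U64_MAX);
		return 0;
	}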

> +	return 0;
> +}
> +
>   static int set_timer_callback_state(struct bpf_verifier_env *env,
>   				    struct bpf_func_state *caller,
>   				    struct bpf_func_state *callee,
> @@ -7310,6 +7407,14 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>   		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
>   					set_loop_callback_state);
>   		break;
> +	case BPF_FUNC_rbtree_add:
> +		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> +					set_rbtree_add_callback_state);
> +		break;
> +	case BPF_FUNC_rbtree_find:
> +		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> +					set_rbtree_find_callback_state);
> +		break;
>   	case BPF_FUNC_dynptr_from_mem:
>   		if (regs[BPF_REG_1].type != PTR_TO_MAP_VALUE) {
>   			verbose(env, "Unsupported reg type %s for bpf_dynptr_from_mem data\n",
> @@ -7424,6 +7529,9 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>   		if (func_id == BPF_FUNC_kptr_xchg) {
>   			ret_btf = meta.kptr_off_desc->kptr.btf;
>   			ret_btf_id = meta.kptr_off_desc->kptr.btf_id;
> +		} else if (function_returns_rbtree_node(func_id)) {
> +			ret_btf = meta.map_ptr->btf;
> +			ret_btf_id = meta.map_ptr->btf_value_type_id;
>   		} else {
>   			ret_btf = btf_vmlinux;
>   			ret_btf_id = *fn->ret_btf_id;
> @@ -13462,8 +13570,10 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>   					BPF_SIZE((insn)->code);
>   				env->prog->aux->num_exentries++;
>   			} else if (resolve_prog_type(env->prog) != BPF_PROG_TYPE_STRUCT_OPS) {
> +				/*TODO: Not sure what to do here
>   				verbose(env, "Writes through BTF pointers are not allowed\n");
>   				return -EINVAL;
> +				*/

Not sure whether it's worth defining PTR_TO_BTF_ID | PROGS_BTF
for writeable ptr_to_btf_id as return value from rb_alloc/find/add.
It may help here and earlier ?

All ptr_to_btf_id were kernel's or module's BTF so far.
With this change the verifier will see prog's ptr_to_btf_id that point
to prog's BTF.
PROGS_BTF might be a useful flag to have ?

Then instead of
 > +		} else if (function_returns_rbtree_node(func_id)) {
 > +			ret_btf = meta.map_ptr->btf;
 > +			ret_btf_id = meta.map_ptr->btf_value_type_id;

it will check PROGS_BTF flag in proto->ret_type ?
Still not fully generic, since it takes btf and btf_id from the map.
But a bit cleaner than switch() by func_id ?
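
E.g. (sketch; flag name and bit position are illustrative):

	/* in enum bpf_type_flag: pointed-to type lives in the prog's BTF */
	PROGS_BTF		= BIT(13 + BPF_BASE_TYPE_BITS),

	/* in check_helper_call(), instead of switching on func_id: */
	if (func_id == BPF_FUNC_kptr_xchg) {
		ret_btf = meta.kptr_off_desc->kptr.btf;
		ret_btf_id = meta.kptr_off_desc->kptr.btf_id;
	} else if (fn->ret_type & PROGS_BTF) {
		ret_btf = meta.map_ptr->btf;
		ret_btf_id = meta.map_ptr->btf_value_type_id;
	} else {
		ret_btf = btf_vmlinux;
		ret_btf_id = *fn->ret_btf_id;
	}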

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 06/11] bpf: Add bpf_rbtree_{lock,unlock} helpers
  2022-07-22 18:34 ` [RFC PATCH bpf-next 06/11] bpf: Add bpf_rbtree_{lock,unlock} helpers Dave Marchevsky
@ 2022-08-01 21:58   ` Alexei Starovoitov
  0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 21:58 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> These helpers are equivalent to bpf_spin_{lock,unlock}, but the verifier
> doesn't try to enforce that no helper calls occur when there's an active
> spin lock.
> 
> [ TODO: Currently the verifier doesn't do _anything_ spinlock related
> when it sees one of these, including setting active_spin_lock. This is
> probably too lenient. Also, EXPORT_SYMBOL for internal lock helpers
> might not be the best code structure. ]
> 
> Future patches will add enforcement of "rbtree helpers must always be
> called when lock is held" constraint.
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>   include/uapi/linux/bpf.h       | 20 ++++++++++++++++++++
>   kernel/bpf/helpers.c           | 12 ++++++++++--
>   kernel/bpf/rbtree.c            | 29 +++++++++++++++++++++++++++++
>   kernel/bpf/verifier.c          |  2 ++
>   tools/include/uapi/linux/bpf.h | 20 ++++++++++++++++++++
>   5 files changed, 81 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c677d92de3bc..d21e2c99ea14 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5391,6 +5391,24 @@ union bpf_attr {
>    *
>    *	Return
>    *		Ptr to lock
> + *
> + * void *bpf_rbtree_lock(struct bpf_spin_lock *lock)
> + *	Description
> + *		Like bpf_spin_lock helper, but use separate helper for now
> + *		as we don't want this helper to have special meaning to the verifier
> + *		so that we can do rbtree helper calls between rbtree_lock/unlock
> + *
> + *	Return
> + *		0
> + *
> + * void *bpf_rbtree_unlock(struct bpf_spin_lock *lock)
> + *	Description
> + *		Like bpf_spin_unlock helper, but use separate helper for now
> + *		as we don't want this helper to have special meaning to the verifier
> + *		so that we can do rbtree helper calls between rbtree_lock/unlock
> + *
> + *	Return
> + *		0
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -5607,6 +5625,8 @@ union bpf_attr {
>   	FN(rbtree_remove),		\
>   	FN(rbtree_free_node),		\
>   	FN(rbtree_get_lock),		\
> +	FN(rbtree_lock),		\
> +	FN(rbtree_unlock),		\
>   	/* */
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 257a808bb767..fa2dba1dcec8 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -303,7 +303,7 @@ static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
>   
>   static DEFINE_PER_CPU(unsigned long, irqsave_flags);
>   
> -static inline void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock)
> +inline void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock)
>   {
>   	unsigned long flags;
>   
> @@ -311,6 +311,7 @@ static inline void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock)
>   	__bpf_spin_lock(lock);
>   	__this_cpu_write(irqsave_flags, flags);
>   }
> +EXPORT_SYMBOL(__bpf_spin_lock_irqsave);

What is it for?
It's not used from modules.

>   
>   notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
>   {
> @@ -325,7 +326,7 @@ const struct bpf_func_proto bpf_spin_lock_proto = {
>   	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
>   };
>   
> -static inline void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock)
> +inline void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock)
>   {
>   	unsigned long flags;
>   
> @@ -333,6 +334,7 @@ static inline void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock)
>   	__bpf_spin_unlock(lock);
>   	local_irq_restore(flags);
>   }
> +EXPORT_SYMBOL(__bpf_spin_unlock_irqrestore);
>   
>   notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
>   {
> @@ -1588,6 +1590,8 @@ const struct bpf_func_proto bpf_rbtree_find_proto __weak;
>   const struct bpf_func_proto bpf_rbtree_remove_proto __weak;
>   const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
>   const struct bpf_func_proto bpf_rbtree_get_lock_proto __weak;
> +const struct bpf_func_proto bpf_rbtree_lock_proto __weak;
> +const struct bpf_func_proto bpf_rbtree_unlock_proto __weak;
>   
>   const struct bpf_func_proto *
>   bpf_base_func_proto(enum bpf_func_id func_id)
> @@ -1689,6 +1693,10 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>   		return &bpf_rbtree_free_node_proto;
>   	case BPF_FUNC_rbtree_get_lock:
>   		return &bpf_rbtree_get_lock_proto;
> +	case BPF_FUNC_rbtree_lock:
> +		return &bpf_rbtree_lock_proto;
> +	case BPF_FUNC_rbtree_unlock:
> +		return &bpf_rbtree_unlock_proto;
>   	default:
>   		break;
>   	}
> diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
> index c6f0a2a083f6..bf2e30af82ec 100644
> --- a/kernel/bpf/rbtree.c
> +++ b/kernel/bpf/rbtree.c
> @@ -262,6 +262,35 @@ const struct bpf_func_proto bpf_rbtree_get_lock_proto = {
>   	.arg1_type = ARG_CONST_MAP_PTR,
>   };
>   
> +extern void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock);
> +extern void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock);
> +
> +BPF_CALL_1(bpf_rbtree_lock, void *, lock)
> +{
> +	__bpf_spin_lock_irqsave((struct bpf_spin_lock *)lock);
> +	return 0;
> +}

it doesn't have to be bpf_spin_lock.
Normal spin_lock will do.
bpf_spin_lock has a specific size requirement, so when it's used inside a
map value the value size doesn't change from kernel to kernel.
Since this lock is hidden it can be any lock.

Also it needs to remember current task or something.
Just spin_is_locked() from bpf_rbtree_add() is not enough.
Instead of remembering current we can pass hidden 'prog' pointer
and remember that in bpf_rbtree_lock.
Also pass that hidden prog ptr to add/remove/find and compare.
But probably overkill. Current task should be fine.
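
A minimal sketch of the "remember current" idea (struct layout and names
are illustrative, not from this series; irqsave handling omitted):

	struct bpf_rbtree {
		struct bpf_map map;
		struct rb_root_cached root;
		raw_spinlock_t lock;
		struct task_struct *lock_owner;
	};

	static void rbtree_lock(struct bpf_rbtree *tree)
	{
		raw_spin_lock(&tree->lock);
		WRITE_ONCE(tree->lock_owner, current);
	}

	static void rbtree_unlock(struct bpf_rbtree *tree)
	{
		WRITE_ONCE(tree->lock_owner, NULL);
		raw_spin_unlock(&tree->lock);
	}

	/* add/remove/find would check this instead of spin_is_locked() */
	static bool rbtree_lock_held(struct bpf_rbtree *tree)
	{
		return READ_ONCE(tree->lock_owner) == current;
	}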


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-07-22 18:34 ` [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree Dave Marchevsky
@ 2022-08-01 22:17   ` Alexei Starovoitov
  2022-08-02 13:59     ` Kumar Kartikeya Dwivedi
  2022-08-10 21:46     ` Kumar Kartikeya Dwivedi
  0 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 22:17 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> This patch adds a struct bpf_spin_lock *lock member to bpf_rbtree, as
> well as a bpf_rbtree_get_lock helper which allows bpf programs to access
> the lock.
> 
> Ideally the bpf_spin_lock would be created independently oustide of the
> tree and associated with it before the tree is used, either as part of
> map definition or via some call like rbtree_init(&rbtree, &lock). Doing
> this in an ergonomic way is proving harder than expected, so for now use
> this workaround.
> 
> Why is creating the bpf_spin_lock independently and associating it with
> the tree preferable? Because we want to be able to transfer nodes
> between trees atomically, and for this to work need same lock associated
> with 2 trees.

Right. We need one lock to protect multiple rbtrees.
Since add/find/remove helpers will look into rbtree->lock
the two different rbtrees (== maps) have to point to the same lock.
Other than rbtree_init(&rbtree, &lock) what would be an alternative ?

> 
> Further locking-related patches will make it possible for the lock to be
> used in BPF programs and add code which enforces that the lock is held
> when doing any operation on the tree.
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>   include/uapi/linux/bpf.h       |  7 +++++++
>   kernel/bpf/helpers.c           |  3 +++
>   kernel/bpf/rbtree.c            | 24 ++++++++++++++++++++++++
>   tools/include/uapi/linux/bpf.h |  7 +++++++
>   4 files changed, 41 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4688ce88caf4..c677d92de3bc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5385,6 +5385,12 @@ union bpf_attr {
>    *	Return
>    *		0
>    *
> + * void *bpf_rbtree_get_lock(struct bpf_map *map)
> + *	Description
> + *		Return the bpf_spin_lock associated with the rbtree
> + *
> + *	Return
> + *		Ptr to lock
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -5600,6 +5606,7 @@ union bpf_attr {
>   	FN(rbtree_find),		\
>   	FN(rbtree_remove),		\
>   	FN(rbtree_free_node),		\
> +	FN(rbtree_get_lock),		\
>   	/* */
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 35eb66d11bf6..257a808bb767 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1587,6 +1587,7 @@ const struct bpf_func_proto bpf_rbtree_add_proto __weak;
>   const struct bpf_func_proto bpf_rbtree_find_proto __weak;
>   const struct bpf_func_proto bpf_rbtree_remove_proto __weak;
>   const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
> +const struct bpf_func_proto bpf_rbtree_get_lock_proto __weak;
>   
>   const struct bpf_func_proto *
>   bpf_base_func_proto(enum bpf_func_id func_id)
> @@ -1686,6 +1687,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>   		return &bpf_rbtree_remove_proto;
>   	case BPF_FUNC_rbtree_free_node:
>   		return &bpf_rbtree_free_node_proto;
> +	case BPF_FUNC_rbtree_get_lock:
> +		return &bpf_rbtree_get_lock_proto;
>   	default:
>   		break;
>   	}
> diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
> index 250d62210804..c6f0a2a083f6 100644
> --- a/kernel/bpf/rbtree.c
> +++ b/kernel/bpf/rbtree.c
> @@ -9,6 +9,7 @@
>   struct bpf_rbtree {
>   	struct bpf_map map;
>   	struct rb_root_cached root;
> +	struct bpf_spin_lock *lock;
>   };
>   
>   BTF_ID_LIST_SINGLE(bpf_rbtree_btf_ids, struct, rb_node);
> @@ -39,6 +40,14 @@ static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
>   
>   	tree->root = RB_ROOT_CACHED;
>   	bpf_map_init_from_attr(&tree->map, attr);
> +
> +	tree->lock = bpf_map_kzalloc(&tree->map, sizeof(struct bpf_spin_lock),
> +				     GFP_KERNEL | __GFP_NOWARN);
> +	if (!tree->lock) {
> +		bpf_map_area_free(tree);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
>   	return &tree->map;
>   }
>   
> @@ -139,6 +148,7 @@ static void rbtree_map_free(struct bpf_map *map)
>   
>   	bpf_rbtree_postorder_for_each_entry_safe(pos, n, &tree->root.rb_root)
>   		kfree(pos);
> +	kfree(tree->lock);
>   	bpf_map_area_free(tree);
>   }
>   
> @@ -238,6 +248,20 @@ static int rbtree_map_get_next_key(struct bpf_map *map, void *key,
>   	return -ENOTSUPP;
>   }
>   
> +BPF_CALL_1(bpf_rbtree_get_lock, struct bpf_map *, map)
> +{
> +	struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
> +
> +	return (u64)tree->lock;
> +}
> +
> +const struct bpf_func_proto bpf_rbtree_get_lock_proto = {
> +	.func = bpf_rbtree_get_lock,
> +	.gpl_only = true,
> +	.ret_type = RET_PTR_TO_MAP_VALUE,

This hack and

+const struct bpf_func_proto bpf_rbtree_unlock_proto = {
+	.func = bpf_rbtree_unlock,
+	.gpl_only = true,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_SPIN_LOCK,

may be too much arm twisting to reuse bpf_spin_lock.

Maybe return ptr_to_btf_id here and have bpf_rbtree_lock
match the type?
It could be new 'struct bpf_lock' in kernel's BTF.

Let's figure out how to associate locks with rbtrees.

Reiterating my proposal that was done earlier in the context
of Delyan's irq_work, but for different type:
How about:
struct bpf_lock *l;
l = bpf_mem_alloc(allocator, bpf_core_type_id_kernel(*l));

that would allocate ptr_to_btf_id object from kernel's btf.
The bpf_lock would have constructor and destructor provided by the 
kernel code.
constructor will set bpf_lock's refcnt to 1.
then bpf_rbtree_init(&map, lock) will bump the refcnt.
and dtor will eventually free it when all rbtrees are freed.
That would be similar to kptr's logic with kptr_get and dtor's 
associated with kernel's btf_id-s.
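
From the prog side that could look roughly like this (sketch only:
bpf_rbtree_init(), "allocator" and "struct bpf_lock" are hypothetical
names in this context, not part of this series):

	struct bpf_lock *l;

	l = bpf_mem_alloc(allocator, bpf_core_type_id_kernel(struct bpf_lock));
	if (!l)
		return 0;

	/* one lock protects both trees, so nodes can be moved from
	 * tree_a to tree_b atomically while it's held
	 */
	bpf_rbtree_init(&tree_a, l);
	bpf_rbtree_init(&tree_b, l);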

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 03/11] bpf: Add rb_node_off to bpf_map
  2022-07-22 18:34 ` [RFC PATCH bpf-next 03/11] bpf: Add rb_node_off to bpf_map Dave Marchevsky
@ 2022-08-01 22:19   ` Alexei Starovoitov
  0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 22:19 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/22/22 11:34 AM, Dave Marchevsky wrote:

> +	case BTF_FIELD_RB_NODE:
> +		name = "rb_node";
> +		sz = sizeof(struct rb_node);
> +		align = __alignof__(struct rb_node);
> +		break;

and its usage in bpf prog:
+struct node_data {
+	struct rb_node node;
+	__u32 one;
+	__u32 two;
+};

may break from kernel to kernel.
let's wrap it into bpf_rb_node ?
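
E.g. (sketch, following the uapi bpf_timer/bpf_spin_lock convention;
24 bytes matches struct rb_node on 64-bit):

	struct bpf_rb_node {
		__u64 :64;
		__u64 :64;
		__u64 :64;
	} __attribute__((aligned(8)));

and in the prog:

	struct node_data {
		struct bpf_rb_node node;
		__u32 one;
		__u32 two;
	};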


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 09/11] bpf: Add CONDITIONAL_RELEASE type flag
  2022-07-22 18:34 ` [RFC PATCH bpf-next 09/11] bpf: Add CONDITIONAL_RELEASE " Dave Marchevsky
@ 2022-08-01 22:23   ` Alexei Starovoitov
  0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 22:23 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> Currently if a helper proto has an arg with OBJ_RELEASE flag, the verifier
> assumes the 'release' logic in the helper will always succeed, and
> therefore always clears the arg reg, other regs w/ same
> ref_obj_id, and releases the reference. This poses a problem for
> 'release' helpers which may not always succeed.
> 
> For example, bpf_rbtree_add will fail to add the passed-in node to a
> bpf_rbtree if the rbtree's lock is not held when the helper is called.
> In this case the helper returns NULL and the calling bpf program must
> release the node in another way before terminating to avoid leaking
> memory. An example of such logic:
> 
> 1  struct node_data *node = bpf_rbtree_alloc_node(&rbtree, ...);
> 2  struct node_data *ret = bpf_rbtree_add(&rbtree, node);
> 3  if (!ret) {
> 4    bpf_rbtree_free_node(node);
> 5    return 0;
> 6  }
> 7  bpf_trace_printk("%d\n", ret->id);
> 
> However, current verifier logic does not allow this: after line 2,
> ref_obj_id of reg holding 'node' (BPF_REG_2) will be released via
> release_reference function, which will mark BPF_REG_2 and any other reg
> with same ref_obj_id as unknown. As a result neither ret nor node will
> point to anything useful if line 3's check passes. Additionally, since the
> reference is unconditionally released, the program would pass the
> verifier without the null check.
> 
> This patch adds 'conditional release' semantics so that the verifier can
> handle the above example correctly. The CONDITIONAL_RELEASE type flag
> works in concert with the existing OBJ_RELEASE flag - the latter is used
> to tag an argument, while the new type flag is used to tag return type.
> 
> If a helper has an OBJ_RELEASE arg type and CONDITIONAL_RELEASE return
> type, the helper is considered to use its return value to indicate
> success or failure of the release operation. NULL is returned if release
> fails, non-null otherwise.
> 
> For my concrete usecase - bpf_rbtree_add - CONDITIONAL_RELEASE works in
> concert with OBJ_NON_OWNING_REF: successful release results in a non-owning
> reference being returned, allowing line 7 in above example.
> 
> Instead of unconditionally releasing the OBJ_RELEASE reference when
> doing check_helper_call, for CONDITIONAL_RELEASE helpers the verifier
> will wait until the return value is checked for null.
>    If not null: the reference is released
> 
>    If null: no reference is released. Since other regs w/ same ref_obj_id
>             were not marked unknown by check_helper_call, they can be
>             used to release the reference via other means (line 4 above),
> 
> It's necessary to prevent conditionally-released ref_obj_id regs from
> being used between the release helper and null check. For example:
> 
> 1  struct node_data *node = bpf_rbtree_alloc_node(&rbtree, ...);
> 2  struct node_data *ret = bpf_rbtree_add(&rbtree, node);
> 3  do_something_with_a_node(node);
> 4  if (!ret) {
> 5    bpf_rbtree_free_node(node);
> 6    return 0;
> 7  }
> 
> Line 3 shouldn't be allowed since node may have been released. The
> verifier tags all regs with ref_obj_id of the conditionally-released arg
> in the period between the helper call and null check for this reason.
> 
> Why no matching CONDITIONAL_ACQUIRE type flag? Existing verifier logic
> already treats acquire of an _OR_NULL type as a conditional acquire.
> Consider this code:
> 
> 1  struct thing *i = acquire_helper_that_returns_thing_or_null();
> 2  if (!i)
> 3    return 0;
> 4  manipulate_thing(i);
> 5  release_thing(i);
> 
> After line 1, BPF_REG_0 will have an _OR_NULL type and a ref_obj_id set.
> When the verifier sees line 2's conditional jump, existing logic in
> mark_ptr_or_null_regs, specifically the if:
> 
>    if (ref_obj_id && ref_obj_id == id && is_null)
>            /* regs[regno] is in the " == NULL" branch.
>             * No one could have freed the reference state before
>             * doing the NULL check.
>             */
>             WARN_ON_ONCE(release_reference_state(state, id));
> 
> will release the reference in the is_null state.
> 
> [ TODO: Either need to remove WARN_ON_ONCE there without adding
> CONDITIONAL_ACQUIRE flag or add the flag and don't WARN_ON_ONCE if it's
> set. Left out of first pass for simplicity's sake. ]
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>   include/linux/bpf.h          |   3 +
>   include/linux/bpf_verifier.h |   1 +
>   kernel/bpf/rbtree.c          |   2 +-
>   kernel/bpf/verifier.c        | 122 +++++++++++++++++++++++++++++++----
>   4 files changed, 113 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c9c4b4fb019c..a601ab30a2b1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -413,6 +413,9 @@ enum bpf_type_flag {
>   	MEM_FIXED_SIZE		= BIT(10 + BPF_BASE_TYPE_BITS),
>   
>   	OBJ_NON_OWNING_REF	= BIT(11 + BPF_BASE_TYPE_BITS),
> +
> +	CONDITIONAL_RELEASE	= BIT(12 + BPF_BASE_TYPE_BITS),
> +
>   	__BPF_TYPE_FLAG_MAX,
>   	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
>   };
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 9c017575c034..bdc8c48c2343 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -313,6 +313,7 @@ struct bpf_verifier_state {
>   	u32 insn_idx;
>   	u32 curframe;
>   	u32 active_spin_lock;
> +	u32 active_cond_ref_obj_id;
>   	bool speculative;
>   
>   	/* first and last insn idx of this verifier state */
> diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
> index 34864fc83209..dcf8f69d4ada 100644
> --- a/kernel/bpf/rbtree.c
> +++ b/kernel/bpf/rbtree.c
> @@ -111,7 +111,7 @@ BPF_CALL_3(bpf_rbtree_add, struct bpf_map *, map, void *, value, void *, cb)
>   const struct bpf_func_proto bpf_rbtree_add_proto = {
>   	.func = bpf_rbtree_add,
>   	.gpl_only = true,
> -	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | OBJ_NON_OWNING_REF,
> +	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | OBJ_NON_OWNING_REF | CONDITIONAL_RELEASE,
>   	.arg1_type = ARG_CONST_MAP_PTR,
>   	.arg2_type = ARG_PTR_TO_BTF_ID | OBJ_RELEASE,
>   	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 4f46b2dfbc4b..f80e161170de 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -472,6 +472,11 @@ static bool type_is_non_owning_ref(u32 type)
>   	return type & OBJ_NON_OWNING_REF;
>   }
>   
> +static bool type_is_cond_release(u32 type)
> +{
> +	return type & CONDITIONAL_RELEASE;
> +}
> +
>   static bool type_may_be_null(u32 type)
>   {
>   	return type & PTR_MAYBE_NULL;
> @@ -600,6 +605,15 @@ static const char *reg_type_str(struct bpf_verifier_env *env,
>   			postfix_idx += strlcpy(postfix + postfix_idx, "_non_own", 32 - postfix_idx);
>   	}
>   
> +	if (type_is_cond_release(type)) {
> +		if (base_type(type) == PTR_TO_BTF_ID)
> +			postfix_idx += strlcpy(postfix + postfix_idx, "cond_rel_",
> +					       32 - postfix_idx);
> +		else
> +			postfix_idx += strlcpy(postfix + postfix_idx, "_cond_rel",
> +					       32 - postfix_idx);
> +	}
> +
>   	if (type & MEM_RDONLY)
>   		strncpy(prefix, "rdonly_", 32);
>   	if (type & MEM_ALLOC)
> @@ -1211,6 +1225,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
>   	dst_state->speculative = src->speculative;
>   	dst_state->curframe = src->curframe;
>   	dst_state->active_spin_lock = src->active_spin_lock;
> +	dst_state->active_cond_ref_obj_id = src->active_cond_ref_obj_id;
>   	dst_state->branches = src->branches;
>   	dst_state->parent = src->parent;
>   	dst_state->first_insn_idx = src->first_insn_idx;
> @@ -1418,6 +1433,7 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
>   		return;
>   	}
>   
> +	reg->type &= ~CONDITIONAL_RELEASE;
>   	reg->type &= ~PTR_MAYBE_NULL;
>   }
>   
> @@ -6635,24 +6651,78 @@ static void release_reg_references(struct bpf_verifier_env *env,
>   	}
>   }
>   
> +static int __release_reference(struct bpf_verifier_env *env, struct bpf_verifier_state *vstate,
> +			       int ref_obj_id)
> +{
> +	int err;
> +	int i;
> +
> +	err = release_reference_state(vstate->frame[vstate->curframe], ref_obj_id);
> +	if (err)
> +		return err;
> +
> +	for (i = 0; i <= vstate->curframe; i++)
> +		release_reg_references(env, vstate->frame[i], ref_obj_id);
> +	return 0;
> +}
> +
>   /* The pointer with the specified id has released its reference to kernel
>    * resources. Identify all copies of the same pointer and clear the reference.
>    */
>   static int release_reference(struct bpf_verifier_env *env,
>   			     int ref_obj_id)
>   {
> -	struct bpf_verifier_state *vstate = env->cur_state;
> -	int err;
> +	return __release_reference(env, env->cur_state, ref_obj_id);
> +}
> +
> +static void tag_reference_cond_release_regs(struct bpf_verifier_env *env,
> +					    struct bpf_func_state *state,
> +					    int ref_obj_id,
> +					    bool remove)
> +{
> +	struct bpf_reg_state *regs = state->regs, *reg;
>   	int i;
>   
> -	err = release_reference_state(cur_func(env), ref_obj_id);
> -	if (err)
> -		return err;
> +	for (i = 0; i < MAX_BPF_REG; i++)
> +		if (regs[i].ref_obj_id == ref_obj_id) {
> +			if (remove)
> +				regs[i].type &= ~CONDITIONAL_RELEASE;
> +			else
> +				regs[i].type |= CONDITIONAL_RELEASE;
> +		}
> +
> +	bpf_for_each_spilled_reg(i, state, reg) {
> +		if (!reg)
> +			continue;
> +		if (reg->ref_obj_id == ref_obj_id) {
> +			if (remove)
> +				reg->type &= ~CONDITIONAL_RELEASE;
> +			else
> +				reg->type |= CONDITIONAL_RELEASE;
> +		}
> +	}
> +}
> +
> +static void tag_reference_cond_release(struct bpf_verifier_env *env,
> +				       int ref_obj_id)
> +{
> +	struct bpf_verifier_state *vstate = env->cur_state;
> +	int i;
>   
>   	for (i = 0; i <= vstate->curframe; i++)
> -		release_reg_references(env, vstate->frame[i], ref_obj_id);
> +		tag_reference_cond_release_regs(env, vstate->frame[i],
> +						ref_obj_id, false);
> +}
>   
> -	return 0;
> +static void untag_reference_cond_release(struct bpf_verifier_env *env,
> +					 struct bpf_verifier_state *vstate,
> +					 int ref_obj_id)
> +{
> +	int i;
> +
> +	for (i = 0; i <= vstate->curframe; i++)
> +		tag_reference_cond_release_regs(env, vstate->frame[i],
> +						ref_obj_id, true);
>   }
>   
>   static void clear_non_owning_ref_regs(struct bpf_verifier_env *env,
> @@ -7406,7 +7476,17 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>   		if (arg_type_is_dynptr(fn->arg_type[meta.release_regno - BPF_REG_1]))
>   			err = unmark_stack_slots_dynptr(env, &regs[meta.release_regno]);
>   		else if (meta.ref_obj_id)
> -			err = release_reference(env, meta.ref_obj_id);
> +			if (type_is_cond_release(fn->ret_type)) {
> +				if (env->cur_state->active_cond_ref_obj_id) {
> +					verbose(env, "can't handle >1 cond_release\n");
> +					return err;
> +				}
> +				env->cur_state->active_cond_ref_obj_id = meta.ref_obj_id;
> +				tag_reference_cond_release(env, meta.ref_obj_id);
> +				err = 0;
> +			} else {
> +				err = release_reference(env, meta.ref_obj_id);
> +			}
>   		/* meta.ref_obj_id can only be 0 if register that is meant to be
>   		 * released is NULL, which must be > R0.
>   		 */
> @@ -10040,8 +10120,8 @@ static void __mark_ptr_or_null_regs(struct bpf_func_state *state, u32 id,
>   /* The logic is similar to find_good_pkt_pointers(), both could eventually
>    * be folded together at some point.
>    */
> -static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
> -				  bool is_null)
> +static int mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
> +				 bool is_null, struct bpf_verifier_env *env)
>   {
>   	struct bpf_func_state *state = vstate->frame[vstate->curframe];
>   	struct bpf_reg_state *regs = state->regs;
> @@ -10056,8 +10136,19 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
>   		 */
>   		WARN_ON_ONCE(release_reference_state(state, id));
>   
> +	if (type_is_cond_release(regs[regno].type)) {
> +		if (!is_null) {
> +			__release_reference(env, vstate, vstate->active_cond_ref_obj_id);
> +			vstate->active_cond_ref_obj_id = 0;
> +		} else {
> +			untag_reference_cond_release(env, vstate, vstate->active_cond_ref_obj_id);
> +			vstate->active_cond_ref_obj_id = 0;
> +		}
> +	}
>   	for (i = 0; i <= vstate->curframe; i++)
>   		__mark_ptr_or_null_regs(vstate->frame[i], id, is_null);
> +
> +	return 0;
>   }
>   
>   static bool try_match_pkt_pointers(const struct bpf_insn *insn,
> @@ -10365,10 +10456,10 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
>   		/* Mark all identical registers in each branch as either
>   		 * safe or unknown depending R == 0 or R != 0 conditional.
>   		 */
> -		mark_ptr_or_null_regs(this_branch, insn->dst_reg,
> -				      opcode == BPF_JNE);
> -		mark_ptr_or_null_regs(other_branch, insn->dst_reg,
> -				      opcode == BPF_JEQ);
> +		err = mark_ptr_or_null_regs(this_branch, insn->dst_reg,
> +					    opcode == BPF_JNE, env);
> +		err = mark_ptr_or_null_regs(other_branch, insn->dst_reg,
> +					    opcode == BPF_JEQ, env);

CONDITIONAL_RELEASE concept doesn't look too horrible :)
I couldn't come up with anything better.

But what's the point of returning 0 from mark_ptr_or_null_regs() ?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 08/11] bpf: Add OBJ_NON_OWNING_REF type flag
  2022-07-22 18:34 ` [RFC PATCH bpf-next 08/11] bpf: Add OBJ_NON_OWNING_REF type flag Dave Marchevsky
@ 2022-08-01 22:41   ` Alexei Starovoitov
  0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 22:41 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> Consider a pointer to a type that would normally need acquire / release
> semantics to be safely held. There may be scenarios where such a pointer
> can be safely held without the need to acquire a reference.
> 
> For example, although a PTR_TO_BTF_ID for a rbtree_map node is released
> via bpf_rbtree_add helper, the helper doesn't change the address of the
> node and must be called with the rbtree_map's spinlock held. Since the
> only way to remove a node from the rbtree - bpf_rbtree_remove helper -
> requires the same lock, the newly-added node cannot be removed by a
> concurrently-running program until the lock is released. Therefore it is
> safe to hold a reference to this node until bpf_rbtree_unlock is called.
> 
> This patch introduces a new type flag and associated verifier logic to
> handle such "non-owning" references.
> 
> Currently the only usecase I have is the rbtree example above, so the
> verifier logic is straightforward:
>    * Tag return types of bpf_rbtree_{add,find} with OBJ_NON_OWNING_REF
>      * These both require the rbtree lock to be held to return anything
>      non-NULL
>      * Since ret type for both is PTR_TO_BTF_ID_OR_NULL, if lock is not
>      held and NULL is returned, existing mark_ptr_or_null_reg logic
>      will clear reg type.
>      * So if mark_ptr_or_null_reg logic turns the returned reg into a
>      PTR_TO_BTF_ID | OBJ_NON_OWNING_REF, verifier knows lock is held.
> 
>    * When the lock is released the verifier invalidates any regs holding
>    non owning refs similarly to existing release_reference logic - but no
>    need to clear ref_obj_id as an 'owning' reference was never acquired.
> 
> [ TODO: Currently the invalidation logic in
> clear_rbtree_node_non_owning_refs is not parametrized by map so
> unlocking any rbtree lock will invalidate all non-owning refs ]

probably should be parametrized by 'lock' instead of by 'map'.

> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>   include/linux/bpf.h   |  1 +
>   kernel/bpf/rbtree.c   |  4 +--
>   kernel/bpf/verifier.c | 63 +++++++++++++++++++++++++++++++++++++++----
>   3 files changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index eb8c550db0e2..c9c4b4fb019c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -412,6 +412,7 @@ enum bpf_type_flag {
>   	/* Size is known at compile time. */
>   	MEM_FIXED_SIZE		= BIT(10 + BPF_BASE_TYPE_BITS),
>   
> +	OBJ_NON_OWNING_REF	= BIT(11 + BPF_BASE_TYPE_BITS),
>   	__BPF_TYPE_FLAG_MAX,
>   	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
>   };
> diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
> index 5b1ab73e164f..34864fc83209 100644
> --- a/kernel/bpf/rbtree.c
> +++ b/kernel/bpf/rbtree.c
> @@ -111,7 +111,7 @@ BPF_CALL_3(bpf_rbtree_add, struct bpf_map *, map, void *, value, void *, cb)
>   const struct bpf_func_proto bpf_rbtree_add_proto = {
>   	.func = bpf_rbtree_add,
>   	.gpl_only = true,
> -	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
> +	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | OBJ_NON_OWNING_REF,
>   	.arg1_type = ARG_CONST_MAP_PTR,
>   	.arg2_type = ARG_PTR_TO_BTF_ID | OBJ_RELEASE,
>   	.arg2_btf_id = &bpf_rbtree_btf_ids[0],
> @@ -133,7 +133,7 @@ BPF_CALL_3(bpf_rbtree_find, struct bpf_map *, map, void *, key, void *, cb)
>   const struct bpf_func_proto bpf_rbtree_find_proto = {
>   	.func = bpf_rbtree_find,
>   	.gpl_only = true,
> -	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
> +	.ret_type = RET_PTR_TO_BTF_ID_OR_NULL | OBJ_NON_OWNING_REF,
>   	.ret_btf_id = &bpf_rbtree_btf_ids[0],
>   	.arg1_type = ARG_CONST_MAP_PTR,
>   	.arg2_type = ARG_ANYTHING,

To prevent races, tagging bpf_rbtree_find/add alone feels insufficient.
bpf_rbtree_alloc_node probably needs it too?
Otherwise, after bpf_rbtree_unlock the pointer stashed from alloc
will still be accessible?
Maybe it's actually ok for the prog to access it?
Due to the way preallocated maps work, the elements can only be
use-after-freed within the same map. It's similar to SLAB_TYPESAFE_BY_RCU.
The elements won't be released into global kernel memory while progs are
looking at them.
It seems to me we can do without the OBJ_NON_OWNING_REF concept.
The prog might hold pointers to rbtree elements and they will remain
accessible to the prog even after bpf_rbtree_unlock().
This access will be racy, but it's safe.
So just drop this patch?
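
For illustration, a minimal sketch of the access pattern in question,
using the helper names from this series (whether such a read should be
allowed at all is exactly the open question above):

  struct node_data *n;
  __u32 v = 0;

  bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
  n = (struct node_data *)bpf_rbtree_find(&rbtree, &search, cmp);
  bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));

  if (n)
    v = n->one; /* racy: the node may be reused within the same map,
                 * but its memory is never returned to the kernel while
                 * progs run, so this is not a classic use-after-free
                 */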

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags
  2022-07-22 18:34 ` [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags Dave Marchevsky
  2022-07-29 16:31   ` Tejun Heo
@ 2022-08-01 22:44   ` Alexei Starovoitov
  2022-08-02 13:05     ` Kumar Kartikeya Dwivedi
  2022-08-10 17:56     ` Dave Marchevsky
  1 sibling, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 22:44 UTC (permalink / raw)
  To: Dave Marchevsky, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>   	if (__is_pointer_value(false, reg)) {
> +		if (__is_iter_end(reg) && val == 0) {
> +			__mark_reg_const_zero(reg);
> +			switch (opcode) {
> +			case BPF_JEQ:
> +				return 1;
> +			case BPF_JNE:
> +				return 0;
> +			default:
> +				return -1;
> +			}
> +		}

as discussed, verifying the loop only twice is not safe.
This needs more advanced verifier hacking.
Maybe let's postpone rbtree iters for now and resolve all the rest?
Or do iters with a callback, since that's more or less a clear path fwd?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags
  2022-08-01 22:44   ` Alexei Starovoitov
@ 2022-08-02 13:05     ` Kumar Kartikeya Dwivedi
  2022-08-02 15:10       ` Alexei Starovoitov
  2022-08-10 17:56     ` Dave Marchevsky
  1 sibling, 1 reply; 37+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-02 13:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team, Tejun Heo

On Tue, 2 Aug 2022 at 00:46, Alexei Starovoitov <ast@fb.com> wrote:
>
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> >       if (__is_pointer_value(false, reg)) {
> > +             if (__is_iter_end(reg) && val == 0) {
> > +                     __mark_reg_const_zero(reg);
> > +                     switch (opcode) {
> > +                     case BPF_JEQ:
> > +                             return 1;
> > +                     case BPF_JNE:
> > +                             return 0;
> > +                     default:
> > +                             return -1;
> > +                     }
> > +             }
>
> as discussed the verifying the loop twice is not safe.
> This needs more advanced verifier hacking.
> Maybe let's postpone rbtree iters for now and resolve all the rest?
> Or do iters with a callback, since that's more or less a clear path fwd?
>

Can you elaborate a bit on what you think the challenges/concerns are
(even just for educational purposes)? I am exploring a similar
approach for one of my use cases.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-08-01 22:17   ` Alexei Starovoitov
@ 2022-08-02 13:59     ` Kumar Kartikeya Dwivedi
  2022-08-02 15:30       ` Alexei Starovoitov
  2022-08-10 21:46     ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 37+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-02 13:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team, Tejun Heo, toke

On Tue, 2 Aug 2022 at 00:23, Alexei Starovoitov <ast@fb.com> wrote:
>
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> > This patch adds a struct bpf_spin_lock *lock member to bpf_rbtree, as
> > well as a bpf_rbtree_get_lock helper which allows bpf programs to access
> > the lock.
> >
> > Ideally the bpf_spin_lock would be created independently oustide of the
> > tree and associated with it before the tree is used, either as part of
> > map definition or via some call like rbtree_init(&rbtree, &lock). Doing
> > this in an ergonomic way is proving harder than expected, so for now use
> > this workaround.
> >
> > Why is creating the bpf_spin_lock independently and associating it with
> > the tree preferable? Because we want to be able to transfer nodes
> > between trees atomically, and for this to work need same lock associated
> > with 2 trees.
>
> Right. We need one lock to protect multiple rbtrees.
> Since add/find/remove helpers will look into rbtree->lock
> the two different rbtree (== map) have to point to the same lock.
> Other than rbtree_init(&rbtree, &lock) what would be an alternative ?
>

Instead of dynamically associating locks with the rbtree map, why not
have lock embedded with the rbtree map, and construct a global locking
order among rbtree maps during prog load time that the verifier
maintains globally.

Prog A always takes rb_mapA.lock, then rb_mapB.lock.
If we see Prog B being loaded and it takes them in the opposite order, we
reject the load because it can lead to an ABBA deadlock. We also know the
map pointer statically, so we can ensure rb_mapA.lock cannot be taken
recursively.

Every time a prog is loaded, it is validated against this single list of
lock orders amongst maps. Some of them may not have interdependencies
at all. There is a single total order, and any cycle leads to
verification failure. This might also work out for normal
bpf_spin_locks, allowing us to take more than one at a time.

Then you can do moves atomically across two maps, by acquiring the
locks for both. Maybe we limit this to two locks for now only. There
could also be a C++ style std::lock{lock1, lock2} helper that takes
multiple locks to acquire in order, if you want to prevent anything
from happening between those two calls; just an idea.
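
As a rough sketch of how such a multi-lock helper could be used (every
name here is hypothetical; nothing like this exists today):

  /* hypothetical: the verifier has established a global order
   * rb_mapA.lock -> rb_mapB.lock across all loaded progs
   */
  bpf_rbtree_lock2(&rb_mapA, &rb_mapB); /* acquires both locks in that order */
  n = bpf_rbtree_remove(&rb_mapA, found);
  if (n)
    bpf_rbtree_add(&rb_mapB, n, less);
  bpf_rbtree_unlock2(&rb_mapA, &rb_mapB);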

Probably need to make rb_node add/find helpers notrace so that the bpf
program does not invoke lock again recursively when invoked from the
helper.

Then you can embed locked state statically in map pointer reg, and you
don't need to dynamically check whether map is locked. It will be
passed into the helpers, and from there a member offset can be fixed
which indicates whether the map is 'locked'.

If you have already explored this, then please share what the
problems/limitations were that you ran into, or if this is impossible.
It does sound too good to be true, so maybe I missed something
important here.

I was prototyping something similar for the pifomap in the XDP
queueing series, where we were exploring exposing the underlying
locking of the map to the user (to be able to batch insertions to the
map). The major concern was leaking too many implementation details to
the UAPI and locking (pun intended) ourselves out of improvements to
the map implementation later, so we held off on that for now (and also
wanted to evaluate other alternatives before doubling down on it).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags
  2022-08-02 13:05     ` Kumar Kartikeya Dwivedi
@ 2022-08-02 15:10       ` Alexei Starovoitov
  0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-02 15:10 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team, Tejun Heo

On 8/2/22 6:05 AM, Kumar Kartikeya Dwivedi wrote:
> On Tue, 2 Aug 2022 at 00:46, Alexei Starovoitov <ast@fb.com> wrote:
>>
>> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>>>        if (__is_pointer_value(false, reg)) {
>>> +             if (__is_iter_end(reg) && val == 0) {
>>> +                     __mark_reg_const_zero(reg);
>>> +                     switch (opcode) {
>>> +                     case BPF_JEQ:
>>> +                             return 1;
>>> +                     case BPF_JNE:
>>> +                             return 0;
>>> +                     default:
>>> +                             return -1;
>>> +                     }
>>> +             }
>>
>> as discussed the verifying the loop twice is not safe.
>> This needs more advanced verifier hacking.
>> Maybe let's postpone rbtree iters for now and resolve all the rest?
>> Or do iters with a callback, since that's more or less a clear path fwd?
>>
> 
> Can you elaborate a bit on what you think the challenges/concerns are
> (even just for educational purposes)? I am exploring a similar
> approach for one of my use cases.

struct node *elem = data_structure_iter_first(&some_map);
int i = 0;

while (elem) {
     array[i++] = 0;
     elem = data_structure_iter_next(&some_map, elem);
}

If the verifier checks the loop body only twice, the array[] access
will go out of bounds.
We discussed downgrading all scalars to be not-precise (in verifier
terms) while verifying the body. Then we can catch such cases.
In other words, the verifier needs to treat any induction variable
as unbounded, since the loop count is non-deterministic.
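
For comparison, a sketch of what the program itself would have to do to
stay safe regardless of how many iterations actually run: bound the
index explicitly so the access is provably in range on every path
(names reused from the example above; ARRAY_SIZE is the usual macro):

while (elem && i < ARRAY_SIZE(array)) {
     array[i++] = 0;   /* index is now provably < ARRAY_SIZE(array) */
     elem = data_structure_iter_next(&some_map, elem);
}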

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-08-02 13:59     ` Kumar Kartikeya Dwivedi
@ 2022-08-02 15:30       ` Alexei Starovoitov
  0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-02 15:30 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team, Tejun Heo, toke

On 8/2/22 6:59 AM, Kumar Kartikeya Dwivedi wrote:
> On Tue, 2 Aug 2022 at 00:23, Alexei Starovoitov <ast@fb.com> wrote:
>>
>> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>>> This patch adds a struct bpf_spin_lock *lock member to bpf_rbtree, as
>>> well as a bpf_rbtree_get_lock helper which allows bpf programs to access
>>> the lock.
>>>
>>> Ideally the bpf_spin_lock would be created independently oustide of the
>>> tree and associated with it before the tree is used, either as part of
>>> map definition or via some call like rbtree_init(&rbtree, &lock). Doing
>>> this in an ergonomic way is proving harder than expected, so for now use
>>> this workaround.
>>>
>>> Why is creating the bpf_spin_lock independently and associating it with
>>> the tree preferable? Because we want to be able to transfer nodes
>>> between trees atomically, and for this to work need same lock associated
>>> with 2 trees.
>>
>> Right. We need one lock to protect multiple rbtrees.
>> Since add/find/remove helpers will look into rbtree->lock
>> the two different rbtree (== map) have to point to the same lock.
>> Other than rbtree_init(&rbtree, &lock) what would be an alternative ?
>>
> 
> Instead of dynamically associating locks with the rbtree map, why not
> have lock embedded with the rbtree map, and construct a global locking
> order among rbtree maps during prog load time that the verifier
> maintains globally.
> 
> Prog A always takes rb_mapA.lock, then rb_mapB.lock,
> If we see Prog B being loaded and it takes them in opposite order, we
> reject the load because it can lead to ABBA deadlock. We also know the
> map pointer statically so we ensure rb_mapA.lock cannot be called
> recursively.
> 
> Everytime a prog is loaded, it validates against this single list of
> lock orders amongst maps. Some of them may not have interdependencies
> at all. There is a single total order, and any cycles lead to
> verification failure. This might also work out for normal
> bpf_spin_locks allowing us to take more than 1 at a time.
> 
> Then you can do moves atomically across two maps, by acquiring the
> locks for both. Maybe we limit this to two locks for now only. There
> could also be a C++ style std::lock{lock1, lock2} helper that takes
> multiple locks to acquire in order, if you want to prevent anything
> from happening between those two calls; just an idea.

Consider the case of N rbtrees, one for each cpu.
At run-time the bpf prog might grab the lock of cpu X's rbtree and
then decide to transfer the node to a higher or lower cpu.
The order cannot be determined statically by the verifier.
The simple fix for this would be to share one lock across all rbtrees.
Such a 'global' lock won't necessarily scale; it's just an example
of where sharing the lock is useful.

> Probably need to make rb_node add/find helpers notrace so that the bpf
> program does not invoke lock again recursively when invoked from the
> helper.
> 
> Then you can embed locked state statically in map pointer reg, and you
> don't need to dynamically check whether map is locked. It will be
> passed into the helpers, and from there a member offset can be fixed
> which indicates whether the map is 'locked'.

We considered that. So far it doesn't look like we can avoid run-time checks.
And if we do run-time checks, there is no need to complicate the verifier.

> If you have already explored this, then please share what the
> problems/limitations were that you ran into, or if this is impossible.
> It does sound too good to be true, so maybe I missed something
> important here.
> 
> I was prototyping something similar for the pifomap in the XDP
> queueing series, where we were exploring exposing the underlying
> locking of the map to the user (to be able to batch insertions to the
> map). The major concern was leaking too many implementation details to
> the UAPI and locking (pun intended) ourselves out of improvements to
> the map implementation later, so we held off on that for now (and also
> wanted to evaluate other alternatives before doubling down on it).

That's probably not a concern. The spin_lock concept is natural.
Its implementation could be different, but we already hide it in
bpf_spin_lock.

As far as locking...
beyond rbtree we will likely add linked lists. They would need to be
protected by locks as well.
The same bpf struct element might have two fields: bpf_rb_node and
bpf_link_list_node. It could be part of one rbtree while at the same
time linked into some other list.
Sharing one lock between rbtree operations and linked list manipulations
would be necessary.
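
For example, such an element could look roughly like this (the field
type names follow the ones used above and are purely illustrative):

struct elem {
     struct bpf_rb_node rb;        /* linked into some rbtree */
     struct bpf_link_list_node ll; /* and into some list at the same time */
     __u64 key;
};
/* one shared lock would have to protect both the rbtree and the list */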

In the past we hid locks inside maps. Exposing 'bpf_lock' as a
standalone object and associating the lock with rbtrees, lists, etc.
would allow for greater flexibility and allow bpf progs to create
more advanced data structures.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map
  2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
                   ` (12 preceding siblings ...)
  2022-08-01 21:27 ` Alexei Starovoitov
@ 2022-08-02 22:02 ` Andrii Nakryiko
  13 siblings, 0 replies; 37+ messages in thread
From: Andrii Nakryiko @ 2022-08-02 22:02 UTC (permalink / raw)
  To: Dave Marchevsky
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On Fri, Jul 22, 2022 at 11:34 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> Introduce bpf_rbtree map data structure. As the name implies, rbtree map
> allows bpf programs to use red-black trees similarly to kernel code.
> Programs interact with rbtree maps in a much more open-coded way than
> more classic map implementations. Some example code to demonstrate:
>
>   node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
>   if (!node)
>     return 0;
>
>   node->one = calls;
>   node->two = 6;
>   bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
>
>   ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
>   if (!ret) {
>     bpf_rbtree_free_node(&rbtree, node);
>     goto unlock_ret;
>   }
>
> unlock_ret:
>   bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>   return 0;
>
>
> This series is in a heavy RFC state, with some added verifier semantics
> needing improvement before they can be considered safe. I am sending
> early to gather feedback on approach:
>
>   * Does the API seem reasonable and might it be useful for others?
>
>   * Do new verifier semantics added in this series make logical sense?
>     Are there any glaring safety holes aside from those called out in
>     individual patches?
>
> Please see individual patches for more in-depth explanation. A quick
> summary of patches follows:
>
>
> Patches 1-3 extend verifier and BTF searching logic in minor ways to
> prepare for rbtree implementation patch.
>   bpf: Pull repeated reg access bounds check into helper fn
>   bpf: Add verifier support for custom callback return range
>   bpf: Add rb_node_off to bpf_map
>
>
> Patch 4 adds basic rbtree map implementation.
>   bpf: Add rbtree map
>
> Note that 'complete' implementation requires concepts and changes
> introduced in further patches in the series. The series is currently
> arranged in this way to ease RFC review.
>
>
> Patches 5-7 add a spinlock to the rbtree map, with some differing
> semantics from existing verifier spinlock handling.
>   bpf: Add bpf_spin_lock member to rbtree
>   bpf: Add bpf_rbtree_{lock,unlock} helpers
>   bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}
>
> Notably, rbtree's bpf_spin_lock must be held while manipulating the tree
> via helpers, while existing spinlock verifier logic prevents any helper
> calls while lock is held. In current state this is worked around by not
> having the verifier treat rbtree's lock specially in any way. This
> needs to be improved before leaving RFC state as it's unsafe.
>
>
> Patch 8 adds the concept of non-owning references, firming up the
> semantics of helpers that return a ptr to node which is owned by
> a rbtree. See patch 4's summary for additional discussion of node
> ownership.
>
>
> Patch 9 adds a 'conditional release' concept: helpers which release a
> resource, but may fail to do so and need to enforce that the BPF program
> handles this failure appropriately, namely by freeing the resource
> another way.
>
>
> Path 10 adds 'iter' type flags which teach the verifier to understand
> open-coded iteration of a data structure. Specifically, with such flags
> the verifier can understand that this loop eventually ends:
>
>   struct node_data *iter = (struct node_data *)bpf_rbtree_first(&rbtree);
>
>   while (iter) {
>     node_ct++;
>     iter = (struct node_data *)bpf_rbtree_next(&rbtree, iter);
>   }
>
> NOTE: Patch 10's logic is currently very unsafe and it's unclear whether
> there's a safe path forward that isn't too complex. It's the most RFC-ey
> of all the patches.
>
>
> Patch 11 adds tests. Best to start here to see BPF programs using rbtree
> map as intended.
>
>
> This series is based ontop of "bpf: Cleanup check_refcount_ok" patch,
> which was submitted separately [0] and therefore is not included here. That
> patch is likely to be applied before this is out of RFC state, so will
> just rebase on newer bpf-next/master.
>
>   [0]: lore.kernel.org/bpf/20220719215536.2787530-1-davemarchevsky@fb.com/
>
> Dave Marchevsky (11):
>   bpf: Pull repeated reg access bounds check into helper fn
>   bpf: Add verifier support for custom callback return range
>   bpf: Add rb_node_off to bpf_map
>   bpf: Add rbtree map
>   bpf: Add bpf_spin_lock member to rbtree
>   bpf: Add bpf_rbtree_{lock,unlock} helpers
>   bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}
>   bpf: Add OBJ_NON_OWNING_REF type flag
>   bpf: Add CONDITIONAL_RELEASE type flag
>   bpf: Introduce PTR_ITER and PTR_ITER_END type flags
>   selftests/bpf: Add rbtree map tests
>
>  include/linux/bpf.h                           |  13 +
>  include/linux/bpf_types.h                     |   1 +
>  include/linux/bpf_verifier.h                  |   2 +
>  include/linux/btf.h                           |   1 +
>  include/uapi/linux/bpf.h                      | 121 ++++++
>  kernel/bpf/Makefile                           |   2 +-
>  kernel/bpf/btf.c                              |  21 +
>  kernel/bpf/helpers.c                          |  42 +-
>  kernel/bpf/rbtree.c                           | 401 ++++++++++++++++++
>  kernel/bpf/syscall.c                          |   3 +
>  kernel/bpf/verifier.c                         | 382 +++++++++++++++--
>  tools/include/uapi/linux/bpf.h                | 121 ++++++
>  .../selftests/bpf/prog_tests/rbtree_map.c     | 164 +++++++
>  .../testing/selftests/bpf/progs/rbtree_map.c  | 108 +++++
>  .../selftests/bpf/progs/rbtree_map_fail.c     | 236 +++++++++++
>  .../bpf/progs/rbtree_map_load_fail.c          |  24 ++
>  16 files changed, 1605 insertions(+), 37 deletions(-)
>  create mode 100644 kernel/bpf/rbtree.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree_map.c
>  create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map.c
>  create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_fail.c
>  create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c
>
> --
> 2.30.2
>

I just skimmed through the commit descriptions and wanted to ask a few questions.

1. It's not exactly clear from the descriptions what the advantages of
the open-coded struct node_data approach are vs sticking to the more
typical BPF map interface with bpf_map_lookup_elem() and so on. Am I
right that the biggest advantage is that we can move a node from one RB
tree to another? But if that's the only reason, why can't we have a
specialized bpf_rbtree_move(rb1, rb2, value) (where rb1 and rb2 are the
trees involved in the move, and value is whatever is returned by
bpf_map_lookup_elem, which internally would also keep the rbtree_node
hidden in front of the value pointer, just like we do with ringbuf)?

2. As for an rbtree node being pre-allocated in a more permissive mode
and then inserted into the RB tree later on in a more restrictive
(e.g., NMI) mode: wouldn't this problem be mostly solved by Alexei's
BPF-specific memory allocator? And right now you are requiring the
insert (or free) to happen in the same mode as when the node was
allocated, right? Is that just a current limitation that you are
planning to lift? If yes, what would be the mechanism to "temporarily"
store the rbtree_node somewhere? kptr? dynptr? In both cases it's not
exactly clear how type information is preserved.

3. As for locking: assuming we add a bpf_rbtree_move_node() helper, you
can pass the lock as an argument to it instead of storing the lock
inside the RB tree itself. Preventing deadlocks and the like is still
hard and tricky, but at least you won't have to worry about storing
locks inside RB trees themselves.
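
A hypothetical signature tying questions 1 and 3 together might look
like this (purely illustrative, not something in this series):

/* hypothetical: move 'value' (a node currently owned by 'src') into 'dst';
 * the lock protecting both trees is passed in explicitly instead of being
 * stored inside either tree
 */
long bpf_rbtree_move_node(struct bpf_map *src, struct bpf_map *dst,
                          void *value, struct bpf_spin_lock *lock);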


Just some questions that I still remember from a brief reading of
this. I think it would be extremely helpful for me and others to have
the "why do we need the open-coded rbtree_node approach" question
answered clearly, with a comparison to a possible alternative using a
more conventional BPF approach.

Overall, I'm still trying to wrap my head around this; the approach
and stated goals (e.g., moving nodes between two trees, all the
locking story, etc.) look very complicated, so it will take some time
to internalize this new approach and convince myself it's doable.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 11/11] selftests/bpf: Add rbtree map tests
  2022-07-28  7:18   ` Yonghong Song
@ 2022-08-10 17:48     ` Dave Marchevsky
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-08-10 17:48 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/28/22 3:18 AM, Yonghong Song wrote:   
> 
> 
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>> Add tests demonstrating happy path of rbtree map usage as well as
>> exercising numerous failure paths and conditions. Structure of failing
>> test runner is based on dynptr tests.
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>>   .../selftests/bpf/prog_tests/rbtree_map.c     | 164 ++++++++++++
>>   .../testing/selftests/bpf/progs/rbtree_map.c  | 111 ++++++++
>>   .../selftests/bpf/progs/rbtree_map_fail.c     | 236 ++++++++++++++++++
>>   .../bpf/progs/rbtree_map_load_fail.c          |  24 ++
>>   4 files changed, 535 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree_map.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_fail.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/rbtree_map.c b/tools/testing/selftests/bpf/prog_tests/rbtree_map.c
>> new file mode 100644
>> index 000000000000..17cadcd05ee4
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/rbtree_map.c
>> @@ -0,0 +1,164 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
>> +
>> +#include <sys/syscall.h>
>> +#include <test_progs.h>
>> +#include "rbtree_map.skel.h"
>> +#include "rbtree_map_fail.skel.h"
>> +#include "rbtree_map_load_fail.skel.h"
>> +
>> +static size_t log_buf_sz = 1048576; /* 1 MB */
>> +static char obj_log_buf[1048576];
>> +
>> +static struct {
>> +    const char *prog_name;
>> +    const char *expected_err_msg;
>> +} rbtree_prog_load_fail_tests[] = {
>> +    {"rb_node__field_store", "only read is supported"},
>> +    {"rb_node__alloc_no_add", "Unreleased reference id=2 alloc_insn=3"},
>> +    {"rb_node__two_alloc_one_add", "Unreleased reference id=2 alloc_insn=3"},
>> +    {"rb_node__remove_no_free", "Unreleased reference id=5 alloc_insn=28"},
>> +    {"rb_tree__add_wrong_type", "rbtree: R2 is of type task_struct but node_data is expected"},
>> +    {"rb_tree__conditional_release_helper_usage",
>> +        "R2 type=ptr_cond_rel_ expected=ptr_"},
>> +};
>> +
>> +void test_rbtree_map_load_fail(void)
>> +{
>> +    struct rbtree_map_load_fail *skel;
>> +
>> +    skel = rbtree_map_load_fail__open_and_load();
>> +    if (!ASSERT_ERR_PTR(skel, "rbtree_map_load_fail__open_and_load"))
>> +        rbtree_map_load_fail__destroy(skel);
>> +}
>> +
>> +static void verify_fail(const char *prog_name, const char *expected_err_msg)
>> +{
>> +    LIBBPF_OPTS(bpf_object_open_opts, opts);
>> +    struct rbtree_map_fail *skel;
>> +    struct bpf_program *prog;
>> +    int err;
>> +
>> +    opts.kernel_log_buf = obj_log_buf;
>> +    opts.kernel_log_size = log_buf_sz;
>> +    opts.kernel_log_level = 1;
>> +
>> +    skel = rbtree_map_fail__open_opts(&opts);
>> +    if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open_opts"))
>> +        goto cleanup;
>> +
>> +    prog = bpf_object__find_program_by_name(skel->obj, prog_name);
>> +    if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
>> +        goto cleanup;
>> +
>> +    bpf_program__set_autoload(prog, true);
>> +    err = rbtree_map_fail__load(skel);
>> +    if (!ASSERT_ERR(err, "unexpected load success"))
>> +        goto cleanup;
>> +
>> +    if (!ASSERT_OK_PTR(strstr(obj_log_buf, expected_err_msg), "expected_err_msg")) {
>> +        fprintf(stderr, "Expected err_msg: %s\n", expected_err_msg);
>> +        fprintf(stderr, "Verifier output: %s\n", obj_log_buf);
>> +    }
>> +
>> +cleanup:
>> +    rbtree_map_fail__destroy(skel);
>> +}
>> +
>> +void test_rbtree_map_alloc_node__size_too_small(void)
>> +{
>> +    struct rbtree_map_fail *skel;
>> +    struct bpf_program *prog;
>> +    struct bpf_link *link;
>> +    int err;
>> +
>> +    skel = rbtree_map_fail__open();
>> +    if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open"))
>> +        goto cleanup;
>> +
>> +    prog = skel->progs.alloc_node__size_too_small;
>> +    bpf_program__set_autoload(prog, true);
>> +
>> +    err = rbtree_map_fail__load(skel);
>> +    if (!ASSERT_OK(err, "unexpected load fail"))
>> +        goto cleanup;
>> +
>> +    link = bpf_program__attach(skel->progs.alloc_node__size_too_small);
>> +    if (!ASSERT_OK_PTR(link, "link"))
>> +        goto cleanup;
>> +
>> +    syscall(SYS_getpgid);
>> +
>> +    ASSERT_EQ(skel->bss->size_too_small__alloc_fail, 1, "alloc_fail");
>> +
>> +    bpf_link__destroy(link);
>> +cleanup:
>> +    rbtree_map_fail__destroy(skel);
>> +}
>> +
>> +void test_rbtree_map_add_node__no_lock(void)
>> +{
>> +    struct rbtree_map_fail *skel;
>> +    struct bpf_program *prog;
>> +    struct bpf_link *link;
>> +    int err;
>> +
>> +    skel = rbtree_map_fail__open();
>> +    if (!ASSERT_OK_PTR(skel, "rbtree_map_fail__open"))
>> +        goto cleanup;
>> +
>> +    prog = skel->progs.add_node__no_lock;
>> +    bpf_program__set_autoload(prog, true);
>> +
>> +    err = rbtree_map_fail__load(skel);
>> +    if (!ASSERT_OK(err, "unexpected load fail"))
>> +        goto cleanup;
>> +
>> +    link = bpf_program__attach(skel->progs.add_node__no_lock);
>> +    if (!ASSERT_OK_PTR(link, "link"))
>> +        goto cleanup;
>> +
>> +    syscall(SYS_getpgid);
>> +
>> +    ASSERT_EQ(skel->bss->no_lock_add__fail, 1, "alloc_fail");
>> +
>> +    bpf_link__destroy(link);
>> +cleanup:
>> +    rbtree_map_fail__destroy(skel);
>> +}
>> +
>> +void test_rbtree_map_prog_load_fail(void)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < ARRAY_SIZE(rbtree_prog_load_fail_tests); i++) {
>> +        if (!test__start_subtest(rbtree_prog_load_fail_tests[i].prog_name))
>> +            continue;
>> +
>> +        verify_fail(rbtree_prog_load_fail_tests[i].prog_name,
>> +                rbtree_prog_load_fail_tests[i].expected_err_msg);
>> +    }
>> +}
>> +
>> +void test_rbtree_map(void)
>> +{
>> +    struct rbtree_map *skel;
>> +    struct bpf_link *link;
>> +
>> +    skel = rbtree_map__open_and_load();
>> +    if (!ASSERT_OK_PTR(skel, "rbtree_map__open_and_load"))
>> +        goto cleanup;
>> +
>> +    link = bpf_program__attach(skel->progs.check_rbtree);
>> +    if (!ASSERT_OK_PTR(link, "link"))
>> +        goto cleanup;
>> +
>> +    for (int i = 0; i < 100; i++)
>> +        syscall(SYS_getpgid);
>> +
>> +    ASSERT_EQ(skel->bss->calls, 100, "calls_equal");
>> +
>> +    bpf_link__destroy(link);
>> +cleanup:
>> +    rbtree_map__destroy(skel);
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/rbtree_map.c b/tools/testing/selftests/bpf/progs/rbtree_map.c
>> new file mode 100644
>> index 000000000000..0cd467838f6e
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/rbtree_map.c
>> @@ -0,0 +1,111 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
>> +
>> +#include "vmlinux.h"
>> +#include <bpf/bpf_helpers.h>
>> +#include "bpf_misc.h"
>> +
>> +struct node_data {
>> +    struct rb_node node;
>> +    __u32 one;
>> +    __u32 two;
>> +};
>> +
>> +struct {
>> +    __uint(type, BPF_MAP_TYPE_RBTREE);
>> +    __type(value, struct node_data);
>> +} rbtree SEC(".maps");
>> +
>> +long calls;
>> +
>> +static bool less(struct rb_node *a, const struct rb_node *b)
>> +{
>> +    struct node_data *node_a;
>> +    struct node_data *node_b;
>> +
>> +    node_a = container_of(a, struct node_data, node);
>> +    node_b = container_of(b, struct node_data, node);
>> +
>> +    return node_a->one < node_b->one;
>> +}
>> +
>> +// Key = node_datq
>> +static int cmp(const void *key, const struct rb_node *b)
>> +{
>> +    struct node_data *node_a;
>> +    struct node_data *node_b;
>> +
>> +    node_a = container_of(key, struct node_data, node);
>> +    node_b = container_of(b, struct node_data, node);
>> +
>> +    return node_b->one - node_a->one;
>> +}
>> +
>> +// Key = just node_data.one
>> +static int cmp2(const void *key, const struct rb_node *b)
>> +{
>> +    __u32 one;
>> +    struct node_data *node_b;
>> +
>> +    one = *(__u32 *)key;
>> +    node_b = container_of(b, struct node_data, node);
>> +
>> +    return node_b->one - one;
>> +}
>> +
>> +SEC("fentry/" SYS_PREFIX "sys_getpgid")
>> +int check_rbtree(void *ctx)
>> +{
>> +    struct node_data *node, *found, *ret;
>> +    struct node_data popped;
>> +    struct node_data search;
>> +    __u32 search2;
>> +
>> +    node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
> 
> If I understand correctly, bpf_rtbree_alloc_node() may cause reschedule
> inside the function. So, the program should be sleepable, right?
> 

My mistake, will change alloc flag to GFP_NOWAIT.

>> +    if (!node)
>> +        return 0;
>> +
>> +    node->one = calls;
>> +    node->two = 6;
>> +    bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
>> +
>> +    ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
>> +    if (!ret) {
>> +        bpf_rbtree_free_node(&rbtree, node);
>> +        goto unlock_ret;
>> +    }
>> +
>> +    bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>> +
>> +    bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
>> +
>> +    search.one = calls;
>> +    found = (struct node_data *)bpf_rbtree_find(&rbtree, &search, cmp);
>> +    if (!found)
>> +        goto unlock_ret;
>> +
>> +    int node_ct = 0;
>> +    struct node_data *iter = (struct node_data *)bpf_rbtree_first(&rbtree);
>> +
>> +    while (iter) {
>> +        node_ct++;
>> +        iter = (struct node_data *)bpf_rbtree_next(&rbtree, iter);
>> +    }
>> +
>> +    ret = (struct node_data *)bpf_rbtree_remove(&rbtree, found);
>> +    if (!ret)
>> +        goto unlock_ret;
>> +
>> +    bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>> +
>> +    bpf_rbtree_free_node(&rbtree, ret);
>> +
>> +    __sync_fetch_and_add(&calls, 1);
>> +    return 0;
>> +
>> +unlock_ret:
>> +    bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>> +    return 0;
>> +}
>> +
> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map
  2022-07-28  7:04 ` [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Yonghong Song
@ 2022-08-10 17:54   ` Dave Marchevsky
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-08-10 17:54 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 7/28/22 3:04 AM, Yonghong Song wrote:   
> 
> 
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>> Introduce bpf_rbtree map data structure. As the name implies, rbtree map
>> allows bpf programs to use red-black trees similarly to kernel code.
>> Programs interact with rbtree maps in a much more open-coded way than
>> more classic map implementations. Some example code to demonstrate:
>>
>>    node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
>>    if (!node)
>>      return 0;
>>
>>    node->one = calls;
>>    node->two = 6;
>>    bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
> 
> Can we just do
>      bpf_rbtree_lock(&rbtree)
>      bpf_rbtree_unlock(&rbtree)
> ? Looks like the only places bpf_rbtree_get_lock() is used are
> as arguments of bpf_rbtree_lock/unlock or bpf_spin_lock/unlock?
> 

Summarizing our VC convo: the intent here is to have the lock live separately
from the tree, meaning it'd be separately initialized and passed into the tree
instead of the current state where it's a bpf_rbtree field.

If the tree still keeps a pointer to the lock - ideally passed in when the
rbtree map is instantiated - it'll be possible to move to a cleaner API like
the one you describe. The very explicit way it's done in this series is not
the end state I'd like either.
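
Purely as a sketch of that end state (none of this syntax exists yet):
once the tree holds a pointer to an externally supplied lock, the
prog-facing API could collapse to something like:

  /* hypothetical: the lock was associated with the map at instantiation
   * time, e.g. via something like rbtree_init(&rbtree, &lock), so the
   * helpers can resolve it internally
   */
  bpf_rbtree_lock(&rbtree);
  ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
  if (!ret)
    bpf_rbtree_free_node(&rbtree, node);
  bpf_rbtree_unlock(&rbtree);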

>>
>>    ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
>>    if (!ret) {
>>      bpf_rbtree_free_node(&rbtree, node);
>>      goto unlock_ret;
>>    }
>>
>> unlock_ret:
>>    bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>>    return 0;
>>
>>
>> This series is in a heavy RFC state, with some added verifier semantics
>> needing improvement before they can be considered safe. I am sending
>> early to gather feedback on approach:
>>
>>    * Does the API seem reasonable and might it be useful for others?
>>
>>    * Do new verifier semantics added in this series make logical sense?
>>      Are there any glaring safety holes aside from those called out in
>>      individual patches?
>>
>> Please see individual patches for more in-depth explanation. A quick
>> summary of patches follows:
>>
>>
>> Patches 1-3 extend verifier and BTF searching logic in minor ways to
>> prepare for rbtree implementation patch.
>>    bpf: Pull repeated reg access bounds check into helper fn
>>    bpf: Add verifier support for custom callback return range
>>    bpf: Add rb_node_off to bpf_map
>>
>>
>> Patch 4 adds basic rbtree map implementation.
>>    bpf: Add rbtree map
>>
>> Note that 'complete' implementation requires concepts and changes
>> introduced in further patches in the series. The series is currently
>> arranged in this way to ease RFC review.
>>
>>
>> Patches 5-7 add a spinlock to the rbtree map, with some differing
>> semantics from existing verifier spinlock handling.
>>    bpf: Add bpf_spin_lock member to rbtree
>>    bpf: Add bpf_rbtree_{lock,unlock} helpers
>>    bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}
>>
>> Notably, rbtree's bpf_spin_lock must be held while manipulating the tree
>> via helpers, while existing spinlock verifier logic prevents any helper
>> calls while lock is held. In current state this is worked around by not
>> having the verifier treat rbtree's lock specially in any way. This
>> needs to be improved before leaving RFC state as it's unsafe.
>>
> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags
  2022-08-01 22:44   ` Alexei Starovoitov
  2022-08-02 13:05     ` Kumar Kartikeya Dwivedi
@ 2022-08-10 17:56     ` Dave Marchevsky
  1 sibling, 0 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-08-10 17:56 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 8/1/22 6:44 PM, Alexei Starovoitov wrote:   
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>>       if (__is_pointer_value(false, reg)) {
>> +        if (__is_iter_end(reg) && val == 0) {
>> +            __mark_reg_const_zero(reg);
>> +            switch (opcode) {
>> +            case BPF_JEQ:
>> +                return 1;
>> +            case BPF_JNE:
>> +                return 0;
>> +            default:
>> +                return -1;
>> +            }
>> +        }
> 
> as discussed the verifying the loop twice is not safe.
> This needs more advanced verifier hacking.
> Maybe let's postpone rbtree iters for now and resolve all the rest?
> Or do iters with a callback, since that's more or less a clear path fwd?
> 
Yep, I will drop it and move to a callback-based approach for now. As we
discussed over VC, getting open-coded iteration right will take a long time
and would hold up the rest of the patchset.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map
  2022-08-01 21:27 ` Alexei Starovoitov
@ 2022-08-10 18:11   ` Dave Marchevsky
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Marchevsky @ 2022-08-10 18:11 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Tejun Heo

On 8/1/22 5:27 PM, Alexei Starovoitov wrote:   
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>> Introduce bpf_rbtree map data structure. As the name implies, rbtree map
>> allows bpf programs to use red-black trees similarly to kernel code.
>> Programs interact with rbtree maps in a much more open-coded way than
>> more classic map implementations. Some example code to demonstrate:
>>
>>    node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
>>    if (!node)
>>      return 0;
>>
>>    node->one = calls;
>>    node->two = 6;
>>    bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
>>
>>    ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
>>    if (!ret) {
>>      bpf_rbtree_free_node(&rbtree, node);
>>      goto unlock_ret;
>>    }
>>
>> unlock_ret:
>>    bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>>    return 0;
>>
>>
>> This series is in a heavy RFC state, with some added verifier semantics
>> needing improvement before they can be considered safe. I am sending
>> early to gather feedback on approach:
>>
>>    * Does the API seem reasonable and might it be useful for others?
> 
> Overall it seems to be on the right track.
> The two main issue to resolve are locks and iterators.
> The rest needs work too, but it's getting close :)
> Will reply in individual patches.

I generally agree with the comments on individual patches. We talked about the
whole series over VC quite a bit; I will summarize the larger points below.

* It's worth experimenting with statically verifying correct locking instead
of the current dynamic approach, where helpers that really shouldn't fail -
e.g. adding to / removing from the tree - can fail if the lock isn't held. If
it's possible to get that working we can get rid of the CONDITIONAL_RELEASE
concept, since a failing add operation is the only thing that needs it.

* You're not sold on the need for the OBJ_NON_OWNING_REF flag in the verifier,
and feel that it makes the API more confusing to work with. It's worth
experimenting with a 'valid use-after-free' (really use-after-release here)
concept that doesn't require a new type flag, or just accepting "racy, but
safe".

* Open-coded iteration will take a long time to do correctly and
should be dropped for now.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-08-01 22:17   ` Alexei Starovoitov
  2022-08-02 13:59     ` Kumar Kartikeya Dwivedi
@ 2022-08-10 21:46     ` Kumar Kartikeya Dwivedi
  2022-08-10 22:06       ` Alexei Starovoitov
  2022-08-15  5:33       ` Yonghong Song
  1 sibling, 2 replies; 37+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-10 21:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team, Tejun Heo

On Tue, 2 Aug 2022 at 00:23, Alexei Starovoitov <ast@fb.com> wrote:
>
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> > This patch adds a struct bpf_spin_lock *lock member to bpf_rbtree, as
> > well as a bpf_rbtree_get_lock helper which allows bpf programs to access
> > the lock.
> >
> > Ideally the bpf_spin_lock would be created independently oustide of the
> > tree and associated with it before the tree is used, either as part of
> > map definition or via some call like rbtree_init(&rbtree, &lock). Doing
> > this in an ergonomic way is proving harder than expected, so for now use
> > this workaround.
> >
> > Why is creating the bpf_spin_lock independently and associating it with
> > the tree preferable? Because we want to be able to transfer nodes
> > between trees atomically, and for this to work need same lock associated
> > with 2 trees.
>
> Right. We need one lock to protect multiple rbtrees.
> Since add/find/remove helpers will look into rbtree->lock
> the two different rbtree (== map) have to point to the same lock.
> Other than rbtree_init(&rbtree, &lock) what would be an alternative ?
>
> >
> > Further locking-related patches will make it possible for the lock to be
> > used in BPF programs and add code which enforces that the lock is held
> > when doing any operation on the tree.
> >
> > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> > ---
> >   include/uapi/linux/bpf.h       |  7 +++++++
> >   kernel/bpf/helpers.c           |  3 +++
> >   kernel/bpf/rbtree.c            | 24 ++++++++++++++++++++++++
> >   tools/include/uapi/linux/bpf.h |  7 +++++++
> >   4 files changed, 41 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4688ce88caf4..c677d92de3bc 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5385,6 +5385,12 @@ union bpf_attr {
> >    *  Return
> >    *          0
> >    *
> > + * void *bpf_rbtree_get_lock(struct bpf_map *map)
> > + *   Description
> > + *           Return the bpf_spin_lock associated with the rbtree
> > + *
> > + *   Return
> > + *           Ptr to lock
> >    */
> >   #define __BPF_FUNC_MAPPER(FN)               \
> >       FN(unspec),                     \
> > @@ -5600,6 +5606,7 @@ union bpf_attr {
> >       FN(rbtree_find),                \
> >       FN(rbtree_remove),              \
> >       FN(rbtree_free_node),           \
> > +     FN(rbtree_get_lock),            \
> >       /* */
> >
> >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 35eb66d11bf6..257a808bb767 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1587,6 +1587,7 @@ const struct bpf_func_proto bpf_rbtree_add_proto __weak;
> >   const struct bpf_func_proto bpf_rbtree_find_proto __weak;
> >   const struct bpf_func_proto bpf_rbtree_remove_proto __weak;
> >   const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
> > +const struct bpf_func_proto bpf_rbtree_get_lock_proto __weak;
> >
> >   const struct bpf_func_proto *
> >   bpf_base_func_proto(enum bpf_func_id func_id)
> > @@ -1686,6 +1687,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> >               return &bpf_rbtree_remove_proto;
> >       case BPF_FUNC_rbtree_free_node:
> >               return &bpf_rbtree_free_node_proto;
> > +     case BPF_FUNC_rbtree_get_lock:
> > +             return &bpf_rbtree_get_lock_proto;
> >       default:
> >               break;
> >       }
> > diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
> > index 250d62210804..c6f0a2a083f6 100644
> > --- a/kernel/bpf/rbtree.c
> > +++ b/kernel/bpf/rbtree.c
> > @@ -9,6 +9,7 @@
> >   struct bpf_rbtree {
> >       struct bpf_map map;
> >       struct rb_root_cached root;
> > +     struct bpf_spin_lock *lock;
> >   };
> >
> >   BTF_ID_LIST_SINGLE(bpf_rbtree_btf_ids, struct, rb_node);
> > @@ -39,6 +40,14 @@ static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
> >
> >       tree->root = RB_ROOT_CACHED;
> >       bpf_map_init_from_attr(&tree->map, attr);
> > +
> > +     tree->lock = bpf_map_kzalloc(&tree->map, sizeof(struct bpf_spin_lock),
> > +                                  GFP_KERNEL | __GFP_NOWARN);
> > +     if (!tree->lock) {
> > +             bpf_map_area_free(tree);
> > +             return ERR_PTR(-ENOMEM);
> > +     }
> > +
> >       return &tree->map;
> >   }
> >
> > @@ -139,6 +148,7 @@ static void rbtree_map_free(struct bpf_map *map)
> >
> >       bpf_rbtree_postorder_for_each_entry_safe(pos, n, &tree->root.rb_root)
> >               kfree(pos);
> > +     kfree(tree->lock);
> >       bpf_map_area_free(tree);
> >   }
> >
> > @@ -238,6 +248,20 @@ static int rbtree_map_get_next_key(struct bpf_map *map, void *key,
> >       return -ENOTSUPP;
> >   }
> >
> > +BPF_CALL_1(bpf_rbtree_get_lock, struct bpf_map *, map)
> > +{
> > +     struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
> > +
> > +     return (u64)tree->lock;
> > +}
> > +
> > +const struct bpf_func_proto bpf_rbtree_get_lock_proto = {
> > +     .func = bpf_rbtree_get_lock,
> > +     .gpl_only = true,
> > +     .ret_type = RET_PTR_TO_MAP_VALUE,
>
> This hack and
>
> +const struct bpf_func_proto bpf_rbtree_unlock_proto = {
> +       .func = bpf_rbtree_unlock,
> +       .gpl_only = true,
> +       .ret_type = RET_INTEGER,
> +       .arg1_type = ARG_PTR_TO_SPIN_LOCK,
>
> may be too much arm twisting to reuse bpf_spin_lock.
>
> May be return ptr_to_btf_id here and bpf_rbtree_lock
> should match the type?
> It could be new 'struct bpf_lock' in kernel's BTF.
>
> Let's figure out how to associate locks with rbtrees.
>
> Reiterating my proposal that was done earlier in the context
> of Delyan's irq_work, but for different type:
> How about:
> struct bpf_lock *l;
> l = bpf_mem_alloc(allocator, bpf_core_type_id_kernel(*l));
>
> that would allocate ptr_to_btf_id object from kernel's btf.
> The bpf_lock would have constructor and destructor provided by the
> kernel code.
> constructor will set bpf_lock's refcnt to 1.
> then bpf_rbtree_init(&map, lock) will bump the refcnt.
> and dtor will eventually free it when all rbtrees are freed.
> That would be similar to kptr's logic with kptr_get and dtor's
> associated with kernel's btf_id-s.

Just to continue brainstorming: Comments on this?

Instead of a rbtree map, you have a struct bpf_rbtree global variable
which works like a rbtree. To associate a lock with multiple
bpf_rbtree, you do clang style thread safety annotation in the bpf
program:

#define __guarded_by(lock) __attribute__((btf_type_tag("guarded_by:" #lock))

struct bpf_spin_lock shared_lock;
struct bpf_rbtree rbtree1 __guarded_by(shared_lock);
struct bpf_rbtree rbtree2 __guarded_by(shared_lock);

guarded_by tag is mandatory for the rbtree. Verifier ensures
shared_lock spin lock is held whenever rbtree1 or rbtree2 is being
accessed, and whitelists add/remove helpers inside the critical
section.
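
A sketch of what a program under that scheme might look like (entirely
hypothetical; the annotation and its enforcement don't exist yet):

  bpf_spin_lock(&shared_lock);
  /* the verifier would only allow these calls while shared_lock is held,
   * because both trees are __guarded_by(shared_lock)
   */
  n = bpf_rbtree_remove(&rbtree1, found);
  if (n)
    bpf_rbtree_add(&rbtree2, n, less);
  bpf_spin_unlock(&shared_lock);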

I don't think associating locks to rbtree dynamically is a hard
requirement for your use case? But if you need that, you may probably
also allocate sets of rbtree that are part of the same lock "class"
dynamically using bpf_kptr_alloc, and do similar annotation for the
struct being allocated.
struct rbtree_set {
  struct bpf_spin_lock lock;
  struct bpf_rbtree rbtree1 __guarded_by(lock);
  struct bpf_rbtree rbtree2 __guarded_by(lock);
};
struct rbtree_set *s = bpf_kptr_alloc(sizeof(*s), btf_local_type_id(*s));
// Just stash the pointer somewhere with kptr_xchg
On bpf_kptr_free, the verifier knows this is not a "trivial" struct,
so inserts destructor calls for bpf_rbtree fields during program
fixup.

The main insight is that lock and rbtree are part of the same
allocation (map value for global case, ptr_to_btf_id for dynamic case)
so the locked state can be bound to the reg state in the verifier.
Then we can also make this new rbtree API use kfuncs instead of UAPI
helpers, to get some field experience before baking it in.

Any opinions? Any brainos or deficiencies in the scheme above?

Background: I have been thinking about how to combine kptr and normal
data synchronization without paying the unneeded atomic xchg cost when a
lock is already protecting the kptr. In my tests this guarded_by
annotation has been proving very useful (you mark data and kptrs
protected by a lock as guarded_by some spin lock member in the same map
value, the verifier ensures the lock is held during access, kptr_xchg
for a guarded kptr is lowered to a non-atomic load/store assuming no
concurrent access, and kptr_xchg from outside the lock section is
rejected).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-08-10 21:46     ` Kumar Kartikeya Dwivedi
@ 2022-08-10 22:06       ` Alexei Starovoitov
  2022-08-10 23:16         ` Kumar Kartikeya Dwivedi
  2022-08-15  5:33       ` Yonghong Song
  1 sibling, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2022-08-10 22:06 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Alexei Starovoitov, Dave Marchevsky, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo

On Wed, Aug 10, 2022 at 2:47 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> Just to continue brainstorming: Comments on this?
>
> Instead of a rbtree map, you have a struct bpf_rbtree global variable
> which works like a rbtree. To associate a lock with multiple
> bpf_rbtree, you do clang style thread safety annotation in the bpf
> program:
>
> #define __guarded_by(lock) __attribute__((btf_type_tag("guarded_by:" #lock))
>
> struct bpf_spin_lock shared_lock;
> struct bpf_rbtree rbtree1 __guarded_by(shared_lock);
> struct bpf_rbtree rbtree2 __guarded_by(shared_lock);
>
> guarded_by tag is mandatory for the rbtree. Verifier ensures
> shared_lock spin lock is held whenever rbtree1 or rbtree2 is being
> accessed, and whitelists add/remove helpers inside the critical
> section.

We've been discussing exactly this kind of btf_type_tag annotation
to associate an rbtree with a lock :)
Thanks for bringing it up as well.
Great that we're aligned.

> I don't think associating locks to rbtree dynamically is a hard
> requirement for your use case? But if you need that, you may probably
> also allocate sets of rbtree that are part of the same lock "class"
> dynamically using bpf_kptr_alloc, and do similar annotation for the
> struct being allocated.
> struct rbtree_set {
>   struct bpf_spin_lock lock;
>   struct bpf_rbtree rbtree1 __guarded_by(lock);
>   struct bpf_rbtree rbtree2 __guarded_by(lock);
> };
> struct rbtree_set *s = bpf_kptr_alloc(sizeof(*s), btf_local_type_id(*s));
> // Just stash the pointer somewhere with kptr_xchg
> On bpf_kptr_free, the verifier knows this is not a "trivial" struct,
> so inserts destructor calls for bpf_rbtree fields during program
> fixup.
>
> The main insight is that lock and rbtree are part of the same
> allocation (map value for global case, ptr_to_btf_id for dynamic case)
> so the locked state can be bound to the reg state in the verifier.

It works nicely in the static case, but ffwd a bit.
We might need an rbtree of rbtrees.
Pretty much like map-in-map that we have today for hash tables.

And the rbtree_node might be part of an rbtree while
being chained into a separate linked list.

We need a lock to protect operations on both the rbtree and the linked list.
Then we might need to create an rbtree dynamically for each cgroup
and store a pointer to its rb_root in cgroup local storage.
Maybe allocating a lock + rb_root + linklist_head as one
allocation will not be too restrictive.
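
To make that concrete, a rough sketch of the kind of single allocation
being described. All type and field names here (cgrp_queue,
bpf_list_head, etc.) are made up for illustration:

  /* one such object per cgroup, created dynamically and stashed as a
   * kptr in cgroup local storage */
  struct cgrp_queue {
    struct bpf_spin_lock lock;
    struct bpf_rbtree rbtree __guarded_by(lock);
    struct bpf_list_head list __guarded_by(lock);
  };

  /* a node that sits in the rbtree and on the list at the same time,
   * both protected by the single lock above */
  struct node_data {
    struct rb_node rb;
    struct list_head link;
    __u64 key;
  };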

> Then we can also make this new rbtree API use kfuncs instead of UAPI
> helpers, to get some field experience before baking it in.

+1 to that.

> Any opinions? Any brainos or deficiencies in the scheme above?

It would be great if we could do the lock checks statically.
Dynamic locks mean that rbtree_add/erase, and in the future
linked list insert/remove, might fail, which is horrible from a
programmer's perspective.
We've been thinking about an "abort" concept for such cases.
When a helper detects an unsafe condition, like the correct lock not
being taken, it can abort its own execution and that of the bpf
program, and prevent the program from executing in the future.
Conceptually it sounds great and would solve all kinds of ugliness,
but it's not yet clear how to implement such an abort mechanism,
which would mean stack unwinding and calling destructors for kptrs,
refcnt decrements for acquired objects like sockets, etc.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-08-10 22:06       ` Alexei Starovoitov
@ 2022-08-10 23:16         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 37+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-10 23:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Dave Marchevsky, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo

On Thu, 11 Aug 2022 at 00:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 10, 2022 at 2:47 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > Just to continue brainstorming: Comments on this?
> >
> > Instead of a rbtree map, you have a struct bpf_rbtree global variable
> > which works like a rbtree. To associate a lock with multiple
> > bpf_rbtree, you do clang style thread safety annotation in the bpf
> > program:
> >
> > #define __guarded_by(lock) __attribute__((btf_type_tag("guarded_by:" #lock)))
> >
> > struct bpf_spin_lock shared_lock;
> > struct bpf_rbtree rbtree1 __guarded_by(shared_lock);
> > struct bpf_rbtree rbtree2 __guarded_by(shared_lock);
> >
> > guarded_by tag is mandatory for the rbtree. Verifier ensures
> > shared_lock spin lock is held whenever rbtree1 or rbtree2 is being
> > accessed, and whitelists add/remove helpers inside the critical
> > section.
>
> We've been discussing exactly btf_type_tag annotation
> to associate rbtree with a lock :)
> Thanks for bringing it up as well.
> Great that we're aligned.
>

You know how the saying goes, "Great minds think alike" ;-)

> > I don't think associating locks to rbtree dynamically is a hard
> > requirement for your use case? But if you need that, you may probably
> > also allocate sets of rbtree that are part of the same lock "class"
> > dynamically using bpf_kptr_alloc, and do similar annotation for the
> > struct being allocated.
> > struct rbtree_set {
> >   struct bpf_spin_lock lock;
> >   struct bpf_rbtree rbtree1 __guarded_by(lock);
> >   struct bpf_rbtree rbtree2 __guarded_by(lock);
> > };
> > struct rbtree_set *s = bpf_kptr_alloc(sizeof(*s), btf_local_type_id(*s));
> > // Just stash the pointer somewhere with kptr_xchg
> > On bpf_kptr_free, the verifier knows this is not a "trivial" struct,
> > so inserts destructor calls for bpf_rbtree fields during program
> > fixup.
> >
> > The main insight is that lock and rbtree are part of the same
> > allocation (map value for global case, ptr_to_btf_id for dynamic case)
> > so the locked state can be bound to the reg state in the verifier.
>
> It works nicely in the static case, but ffwd a bit.
> We might need an rbtree of rbtrees.
> Pretty much like map-in-map that we have today for hash tables.
>

True, the map in map case makes things harder. But just like
inner_map_fd, you will need to parameterize such rb_root/list_head
containers with a value type (probably type tag again) so the result
of find/remove can be known statically.
From there, we know the type of lock associated with the rbtree in the
value, _if_ we follow the convention of binding rb_root + lock + ...
in a single data item. Without that, it's pretty hard to do without
runtime checks, as you've pointed out in a previous reply.
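
Something along these lines, where __contains is a made-up tag (decl
tag or type tag, the exact mechanism is still open) used purely to
illustrate the parameterization:

  #define __contains(type) __attribute__((btf_decl_tag("contains:" #type)))

  struct node_data {
    struct rb_node rb;
    __u64 key;
  };

  struct rbtree_set {
    struct bpf_spin_lock lock;
    struct bpf_rbtree rbtree __guarded_by(lock) __contains(node_data);
  };

  /* with the container parameterized like this, the verifier can type
   * the result of find/remove as 'struct node_data *' statically, and
   * knows 'lock' must be held around those calls */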

My idea was that we should try to associate the data being locked with
its lock in the same allocation, working a bit like a bpf_spin_lock<T>,
but more flexibly: annotating fields with guarded_by style tags lets
you also keep additional unprotected data in the same struct.

There might be some use cases where we really require dynamically
binding locks to a data structure at runtime; if so, we should probably
add that later when such a use case comes up, and keep that set of APIs
separate from the simpler case. As you mentioned, in the dynamic case
the add helper becomes fallible. The same will happen with the linked
list helpers.

> And the rbtree_node might be part of an rbtree while
> being chained into a separate linked list.
>
> We need a lock to protect operations on both the rbtree and the linked list.
> Then we might need to create an rbtree dynamically for each cgroup
> and store a pointer to its rb_root in cgroup local storage.
> Maybe allocating a lock + rb_root + linklist_head as one
> allocation will not be too restrictive.
>
> > Then we can also make this new rbtree API use kfuncs instead of UAPI
> > helpers, to get some field experience before baking it in.
>
> +1 to that.
>
> > Any opinions? Any brainos or deficiencies in the scheme above?
>
> It would be great if we could do the lock checks statically.
> Dynamic locks mean that rbtree_add/erase, and in the future
> linked list insert/remove, might fail, which is horrible from a
> programmer's perspective.

+1. I'll also be happy to help with code review (for as much as I
understand) when Dave reposts the series.

> We've been thinking about an "abort" concept for such cases.
> When a helper detects an unsafe condition, like the correct lock not
> being taken, it can abort its own execution and that of the bpf
> program, and prevent the program from executing in the future.
> Conceptually it sounds great and would solve all kinds of ugliness,
> but it's not yet clear how to implement such an abort mechanism,
> which would mean stack unwinding and calling destructors for kptrs,
> refcnt decrements for acquired objects like sockets, etc.

Fancy. Though it certainly looks very painful to implement. Just
thinking about kptrs: say the release kfunc is NMI unsafe, and we
detect such a live kptr moved out of a map in a perf prog; we would
probably then need to release it via irq_work_queue, but also allocate
a work item for it, which might fail in NMI context during unwind, so
we'd leak memory.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-08-10 21:46     ` Kumar Kartikeya Dwivedi
  2022-08-10 22:06       ` Alexei Starovoitov
@ 2022-08-15  5:33       ` Yonghong Song
  2022-08-15  5:37         ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2022-08-15  5:33 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Alexei Starovoitov
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team, Tejun Heo



On 8/10/22 2:46 PM, Kumar Kartikeya Dwivedi wrote:
> On Tue, 2 Aug 2022 at 00:23, Alexei Starovoitov <ast@fb.com> wrote:
>>
>> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>>> This patch adds a struct bpf_spin_lock *lock member to bpf_rbtree, as
>>> well as a bpf_rbtree_get_lock helper which allows bpf programs to access
>>> the lock.
>>>
>>> Ideally the bpf_spin_lock would be created independently outside of the
>>> tree and associated with it before the tree is used, either as part of
>>> map definition or via some call like rbtree_init(&rbtree, &lock). Doing
>>> this in an ergonomic way is proving harder than expected, so for now use
>>> this workaround.
>>>
>>> Why is creating the bpf_spin_lock independently and associating it with
>>> the tree preferable? Because we want to be able to transfer nodes
>>> between trees atomically, and for this to work need same lock associated
>>> with 2 trees.
>>
>> Right. We need one lock to protect multiple rbtrees.
>> Since add/find/remove helpers will look into rbtree->lock
>> the two different rbtree (== map) have to point to the same lock.
>> Other than rbtree_init(&rbtree, &lock) what would be an alternative ?
>>
>>>
>>> Further locking-related patches will make it possible for the lock to be
>>> used in BPF programs and add code which enforces that the lock is held
>>> when doing any operation on the tree.
>>>
>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>>> ---
>>>    include/uapi/linux/bpf.h       |  7 +++++++
>>>    kernel/bpf/helpers.c           |  3 +++
>>>    kernel/bpf/rbtree.c            | 24 ++++++++++++++++++++++++
>>>    tools/include/uapi/linux/bpf.h |  7 +++++++
>>>    4 files changed, 41 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 4688ce88caf4..c677d92de3bc 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -5385,6 +5385,12 @@ union bpf_attr {
>>>     *  Return
>>>     *          0
>>>     *
>>> + * void *bpf_rbtree_get_lock(struct bpf_map *map)
>>> + *   Description
>>> + *           Return the bpf_spin_lock associated with the rbtree
>>> + *
>>> + *   Return
>>> + *           Ptr to lock
>>>     */
>>>    #define __BPF_FUNC_MAPPER(FN)               \
>>>        FN(unspec),                     \
>>> @@ -5600,6 +5606,7 @@ union bpf_attr {
>>>        FN(rbtree_find),                \
>>>        FN(rbtree_remove),              \
>>>        FN(rbtree_free_node),           \
>>> +     FN(rbtree_get_lock),            \
>>>        /* */
>>>
>>>    /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>> index 35eb66d11bf6..257a808bb767 100644
>>> --- a/kernel/bpf/helpers.c
>>> +++ b/kernel/bpf/helpers.c
>>> @@ -1587,6 +1587,7 @@ const struct bpf_func_proto bpf_rbtree_add_proto __weak;
>>>    const struct bpf_func_proto bpf_rbtree_find_proto __weak;
>>>    const struct bpf_func_proto bpf_rbtree_remove_proto __weak;
>>>    const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
>>> +const struct bpf_func_proto bpf_rbtree_get_lock_proto __weak;
>>>
>>>    const struct bpf_func_proto *
>>>    bpf_base_func_proto(enum bpf_func_id func_id)
>>> @@ -1686,6 +1687,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>>>                return &bpf_rbtree_remove_proto;
>>>        case BPF_FUNC_rbtree_free_node:
>>>                return &bpf_rbtree_free_node_proto;
>>> +     case BPF_FUNC_rbtree_get_lock:
>>> +             return &bpf_rbtree_get_lock_proto;
>>>        default:
>>>                break;
>>>        }
>>> diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
>>> index 250d62210804..c6f0a2a083f6 100644
>>> --- a/kernel/bpf/rbtree.c
>>> +++ b/kernel/bpf/rbtree.c
>>> @@ -9,6 +9,7 @@
>>>    struct bpf_rbtree {
>>>        struct bpf_map map;
>>>        struct rb_root_cached root;
>>> +     struct bpf_spin_lock *lock;
>>>    };
>>>
>>>    BTF_ID_LIST_SINGLE(bpf_rbtree_btf_ids, struct, rb_node);
>>> @@ -39,6 +40,14 @@ static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
>>>
>>>        tree->root = RB_ROOT_CACHED;
>>>        bpf_map_init_from_attr(&tree->map, attr);
>>> +
>>> +     tree->lock = bpf_map_kzalloc(&tree->map, sizeof(struct bpf_spin_lock),
>>> +                                  GFP_KERNEL | __GFP_NOWARN);
>>> +     if (!tree->lock) {
>>> +             bpf_map_area_free(tree);
>>> +             return ERR_PTR(-ENOMEM);
>>> +     }
>>> +
>>>        return &tree->map;
>>>    }
>>>
>>> @@ -139,6 +148,7 @@ static void rbtree_map_free(struct bpf_map *map)
>>>
>>>        bpf_rbtree_postorder_for_each_entry_safe(pos, n, &tree->root.rb_root)
>>>                kfree(pos);
>>> +     kfree(tree->lock);
>>>        bpf_map_area_free(tree);
>>>    }
>>>
>>> @@ -238,6 +248,20 @@ static int rbtree_map_get_next_key(struct bpf_map *map, void *key,
>>>        return -ENOTSUPP;
>>>    }
>>>
>>> +BPF_CALL_1(bpf_rbtree_get_lock, struct bpf_map *, map)
>>> +{
>>> +     struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
>>> +
>>> +     return (u64)tree->lock;
>>> +}
>>> +
>>> +const struct bpf_func_proto bpf_rbtree_get_lock_proto = {
>>> +     .func = bpf_rbtree_get_lock,
>>> +     .gpl_only = true,
>>> +     .ret_type = RET_PTR_TO_MAP_VALUE,
>>
>> This hack and
>>
>> +const struct bpf_func_proto bpf_rbtree_unlock_proto = {
>> +       .func = bpf_rbtree_unlock,
>> +       .gpl_only = true,
>> +       .ret_type = RET_INTEGER,
>> +       .arg1_type = ARG_PTR_TO_SPIN_LOCK,
>>
>> may be too much arm twisting to reuse bpf_spin_lock.
>>
>> Maybe return ptr_to_btf_id here and bpf_rbtree_lock
>> should match the type?
>> It could be new 'struct bpf_lock' in kernel's BTF.
>>
>> Let's figure out how to associate locks with rbtrees.
>>
>> Reiterating my proposal that was done earlier in the context
>> of Delyan's irq_work, but for different type:
>> How about:
>> struct bpf_lock *l;
>> l = bpf_mem_alloc(allocator, bpf_core_type_id_kernel(*l));
>>
>> that would allocate ptr_to_btf_id object from kernel's btf.
>> The bpf_lock would have constructor and destructor provided by the
>> kernel code.
>> constructor will set bpf_lock's refcnt to 1.
>> then bpf_rbtree_init(&map, lock) will bump the refcnt.
>> and dtor will eventually free it when all rbtrees are freed.
>> That would be similar to kptr's logic with kptr_get and dtor's
>> associated with kernel's btf_id-s.
> 
> Just to continue brainstorming: Comments on this?
> 
> Instead of a rbtree map, you have a struct bpf_rbtree global variable
> which works like a rbtree. To associate a lock with multiple
> bpf_rbtree, you do clang style thread safety annotation in the bpf
> program:
> 
> #define __guarded_by(lock) __attribute__((btf_type_tag("guarded_by:" #lock)))
> 
> struct bpf_spin_lock shared_lock;
> struct bpf_rbtree rbtree1 __guarded_by(shared_lock);
> struct bpf_rbtree rbtree2 __guarded_by(shared_lock);

For the above __guarded_by macro, we should use
btf_decl_tag instead of btf_type_tag

#define __guarded_by(lock) __attribute__((btf_decl_tag("guarded_by:" #lock)))

Currently, in the llvm implementation, btf_type_tag only applies
to pointee types. btf_decl_tag can apply to global variables,
function arguments, function return values and struct/union members.
So btf_decl_tag should work for the above global variable case or the
below struct rbtree_set member case.
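
To illustrate the difference (a sketch, not tied to a specific llvm
version):

  /* btf_type_tag annotates the pointed-to type, so it only attaches
   * through a pointer: */
  struct bpf_rbtree __attribute__((btf_type_tag("guarded_by:shared_lock"))) *rbtree_ptr;

  /* btf_decl_tag annotates the declaration itself, so it can sit on a
   * plain global variable or a struct/union member: */
  struct bpf_rbtree rbtree1 __attribute__((btf_decl_tag("guarded_by:shared_lock")));

  struct rbtree_set {
    struct bpf_spin_lock lock;
    struct bpf_rbtree rbtree2 __attribute__((btf_decl_tag("guarded_by:lock")));
  };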

> 
> guarded_by tag is mandatory for the rbtree. Verifier ensures
> shared_lock spin lock is held whenever rbtree1 or rbtree2 is being
> accessed, and whitelists add/remove helpers inside the critical
> section.
> 
> I don't think associating locks to rbtree dynamically is a hard
> requirement for your use case? But if you need that, you may probably
> also allocate sets of rbtree that are part of the same lock "class"
> dynamically using bpf_kptr_alloc, and do similar annotation for the
> struct being allocated.
> struct rbtree_set {
>    struct bpf_spin_lock lock;
>    struct bpf_rbtree rbtree1 __guarded_by(lock);
>    struct bpf_rbtree rbtree2 __guarded_by(lock);
> };
> struct rbtree_set *s = bpf_kptr_alloc(sizeof(*s), btf_local_type_id(*s));
> // Just stash the pointer somewhere with kptr_xchg
> On bpf_kptr_free, the verifier knows this is not a "trivial" struct,
> so inserts destructor calls for bpf_rbtree fields during program
> fixup.
> 
> The main insight is that lock and rbtree are part of the same
> allocation (map value for global case, ptr_to_btf_id for dynamic case)
> so the locked state can be bound to the reg state in the verifier.
> Then we can also make this new rbtree API use kfuncs instead of UAPI
> helpers, to get some field experience before baking it in.
> 
> Any opinions? Any brainos or deficiencies in the scheme above?
> 
> Background: I have been thinking about how to tie kptr and normal
> data synchronization together without paying an unneeded atomic xchg
> cost when a lock is already protecting the kptr. In my tests this
> guarded_by annotation has proven very useful: you mark data and kptrs
> protected by a lock as guarded_by some spin lock member in the same
> map value, the verifier ensures the lock is held during access,
> kptr_xchg for a guarded kptr is lowered to a non-atomic load/store
> (assuming no concurrent access), and kptr_xchg from outside the lock
> section is rejected.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
  2022-08-15  5:33       ` Yonghong Song
@ 2022-08-15  5:37         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 37+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-15  5:37 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Dave Marchevsky, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo

On Mon, 15 Aug 2022 at 07:34, Yonghong Song <yhs@fb.com> wrote:
>
> On 8/10/22 2:46 PM, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 2 Aug 2022 at 00:23, Alexei Starovoitov <ast@fb.com> wrote:
> [...]
> >
> > Just to continue brainstorming: Comments on this?
> >
> > Instead of a rbtree map, you have a struct bpf_rbtree global variable
> > which works like a rbtree. To associate a lock with multiple
> > bpf_rbtree, you do clang style thread safety annotation in the bpf
> > program:
> >
> > #define __guarded_by(lock) __attribute__((btf_type_tag("guarded_by:" #lock)))
> >
> > struct bpf_spin_lock shared_lock;
> > struct bpf_rbtree rbtree1 __guarded_by(shared_lock);
> > struct bpf_rbtree rbtree2 __guarded_by(shared_lock);
>
> For the above __guarded_by macro, we should use
> btf_decl_tag instead of btf_type_tag
>
> #define __guarded_by(lock) __attribute__((btf_decl_tag("guarded_by:" #lock)))
>
> Currently, in the llvm implementation, btf_type_tag only applies
> to pointee types. btf_decl_tag can apply to global variables,
> function arguments, function return values and struct/union members.
> So btf_decl_tag should work for the above global variable case or the
> below struct rbtree_set member case.
>

Yep, I actually wrote a prototype list implementation using declaration
tags like this (it's very close at this point, so I can probably post
it as an RFC very soon). For now there is only one bpf_spin_lock per
map value or globally, so I didn't add guarded_by, but if you look at
[0] the tags can be used to express e.g. which value_type a specific
bpf_list_head holds, which node in that value type can be linked, etc.
  [0]: https://github.com/kkdwivedi/linux/commits/bpf-list
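
Guessing at the shape of such annotations (this is not the actual API
from [0], just an illustration of what the tags could express):

  /* hypothetical decl tags: which value type the list holds, and which
   * member of that type is the link node */
  #define __contains(type) __attribute__((btf_decl_tag("contains:" #type)))
  #define __node(field)    __attribute__((btf_decl_tag("node:" #field)))

  struct foo {
    struct bpf_list_node node;
    __u64 data;
  };

  struct map_value {
    struct bpf_spin_lock lock;
    struct bpf_list_head head __contains(foo) __node(node);
  };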

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-08-15  5:38 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-22 18:34 [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Dave Marchevsky
2022-07-22 18:34 ` [RFC PATCH bpf-next 01/11] bpf: Pull repeated reg access bounds check into helper fn Dave Marchevsky
2022-07-22 18:34 ` [RFC PATCH bpf-next 02/11] bpf: Add verifier support for custom callback return range Dave Marchevsky
2022-07-22 18:34 ` [RFC PATCH bpf-next 03/11] bpf: Add rb_node_off to bpf_map Dave Marchevsky
2022-08-01 22:19   ` Alexei Starovoitov
2022-07-22 18:34 ` [RFC PATCH bpf-next 04/11] bpf: Add rbtree map Dave Marchevsky
2022-08-01 21:49   ` Alexei Starovoitov
2022-07-22 18:34 ` [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree Dave Marchevsky
2022-08-01 22:17   ` Alexei Starovoitov
2022-08-02 13:59     ` Kumar Kartikeya Dwivedi
2022-08-02 15:30       ` Alexei Starovoitov
2022-08-10 21:46     ` Kumar Kartikeya Dwivedi
2022-08-10 22:06       ` Alexei Starovoitov
2022-08-10 23:16         ` Kumar Kartikeya Dwivedi
2022-08-15  5:33       ` Yonghong Song
2022-08-15  5:37         ` Kumar Kartikeya Dwivedi
2022-07-22 18:34 ` [RFC PATCH bpf-next 06/11] bpf: Add bpf_rbtree_{lock,unlock} helpers Dave Marchevsky
2022-08-01 21:58   ` Alexei Starovoitov
2022-07-22 18:34 ` [RFC PATCH bpf-next 07/11] bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find} Dave Marchevsky
2022-07-22 18:34 ` [RFC PATCH bpf-next 08/11] bpf: Add OBJ_NON_OWNING_REF type flag Dave Marchevsky
2022-08-01 22:41   ` Alexei Starovoitov
2022-07-22 18:34 ` [RFC PATCH bpf-next 09/11] bpf: Add CONDITIONAL_RELEASE " Dave Marchevsky
2022-08-01 22:23   ` Alexei Starovoitov
2022-07-22 18:34 ` [RFC PATCH bpf-next 10/11] bpf: Introduce PTR_ITER and PTR_ITER_END type flags Dave Marchevsky
2022-07-29 16:31   ` Tejun Heo
2022-08-01 22:44   ` Alexei Starovoitov
2022-08-02 13:05     ` Kumar Kartikeya Dwivedi
2022-08-02 15:10       ` Alexei Starovoitov
2022-08-10 17:56     ` Dave Marchevsky
2022-07-22 18:34 ` [RFC PATCH bpf-next 11/11] selftests/bpf: Add rbtree map tests Dave Marchevsky
2022-07-28  7:18   ` Yonghong Song
2022-08-10 17:48     ` Dave Marchevsky
2022-07-28  7:04 ` [RFC PATCH bpf-next 00/11] bpf: Introduce rbtree map Yonghong Song
2022-08-10 17:54   ` Dave Marchevsky
2022-08-01 21:27 ` Alexei Starovoitov
2022-08-10 18:11   ` Dave Marchevsky
2022-08-02 22:02 ` Andrii Nakryiko
