* [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs
@ 2023-04-15 20:18 Dave Marchevsky
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

This series adds support for refcounted local kptrs to the verifier. A local
kptr is 'refcounted' if its type contains a struct bpf_refcount field:

  struct refcounted_node {
    long data;
    struct bpf_list_node ll;
    struct bpf_refcount ref;
  };

bpf_refcount is used to implement shared ownership for local kptrs.

Motivating usecase
==================

If a struct has two collection node fields, e.g.:

  struct node {
    long key;
    long val;
    struct bpf_rb_node rb;
    struct bpf_list_node ll;
  };

It's not currently possible to add a node to both the list and rbtree:

  long bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(typeof(*n));
    if (!n) { /* ... */ }

    bpf_spin_lock(&lock);

    bpf_list_push_back(&head, &n->ll);
    bpf_rbtree_add(&root, &n->rb, less); /* Assume a reasonable less() */
    bpf_spin_unlock(&lock);
  }

The above program will fail verification due to current owning / non-owning ref
logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
passed to bpf_rbtree_add. The only way to get an owning reference for the node
that was added is to bpf_list_pop_{front,back} it.

More generally, verifier ownership semantics expect that a node has one
owner (program, collection, or stashed in a map) with exclusive ownership
of the node's lifetime. The owner frees the node's underlying memory when it
itself goes away.

Without a shared ownership concept it's impossible to express many real-world
usecases such that they pass verification.
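
With the full series applied, the above usecase can be expressed by adding a
struct bpf_refcount field to struct node and taking a second owning reference
before the second insert. A rough sketch - reusing the placeholder lock, head,
root, and less names from above, with error handling elided:

  struct node {
    long key;
    long val;
    struct bpf_rb_node rb;
    struct bpf_list_node ll;
    struct bpf_refcount ref;
  };

  long bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(typeof(*n)), *m;
    if (!n) { /* ... */ }

    m = bpf_refcount_acquire(n);

    bpf_spin_lock(&lock);
    bpf_list_push_back(&head, &n->ll);
    bpf_rbtree_add(&root, &m->rb, less);
    bpf_spin_unlock(&lock);
  }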

Semantic Changes
================

Before this series, the verifier could make this statement: "whoever has the
owning reference has exclusive ownership of the referent's lifetime". As
demonstrated in the previous section, this implies that a BPF program can't
have an owning reference to some node if that node is in a collection. If
such a state were possible, the node would have multiple owners, each thinking
they have exclusive ownership. In order to support shared ownership it's
necessary to modify the exclusive ownership semantic.

After this series' changes, an owning reference has ownership of the referent's
lifetime, but it's not necessarily exclusive. The referent's underlying memory
is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
used for collection insert.

This change doesn't affect UX of owning or non-owning references much:

  * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
    an owning reference arg, as ownership still must be passed to the
    collection in a shared-ownership world.

  * non-owning references still refer to valid memory without claiming
    any ownership.

One important conclusion that followed from the "exclusive ownership"
statement is no longer valid, though. In an exclusive-ownership world, if a
BPF prog has an owning reference to a node, the verifier can conclude that no
collection has ownership of it. This conclusion was used to avoid runtime
checking in the implementations of insert and remove operations ("has the
node already been {inserted, removed}?").

In a shared-ownership world the aforementioned conclusion no longer holds, so
the insert and remove kfuncs must do runtime checking and may fail to insert
or remove anything.
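
With the insert/remove patches in this series applied, the insert kfuncs
return an int and bpf_rbtree_remove may return NULL; programs can (but are
not required to) check for failure. A rough sketch, again reusing the
placeholder names from the motivating usecase and assuming struct node also
has a bpf_refcount field:

  struct node *removed;

  bpf_spin_lock(&lock);
  if (bpf_rbtree_add(&root, &n->rb, less)) {
    /* n's rb_node was already in use; the kfunc bpf_obj_drop'd n */
  }
  /* n is a non-owning reference from here until the critical section ends */
  removed = bpf_rbtree_remove(&root, &n->rb);
  bpf_spin_unlock(&lock);

  if (removed) /* NULL if the node was no longer in the tree */
    bpf_obj_drop(removed);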

Luckily the verifier changes necessary to go from exclusive to shared ownership
were fairly minimal. Patches in this series which do change verifier semantics
generally have some summary dedicated to explaining why certain usecases
Just Work for shared ownership without verifier changes.

Implementation
==============

The changes in this series can be categorized as follows:

  * struct bpf_refcount opaque field + plumbing
  * support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
  * bpf_refcount_acquire kfunc
    * enables shared ownership by bumping refcount + acquiring owning ref
  * support for possibly-failing collection insertion and removal
    * insertion changes are more complex

If a patch's changes have some nuance to their effect - or lack of effect - on
verifier behavior, the patch summary talks about it at length.

Patch contents:
  * Patch 1 removes btf_field_offs struct
  * Patch 2 adds struct bpf_refcount and associated plumbing
  * Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle
    refcounted kptrs
  * Patch 4 adds bpf_refcount_acquire
  * Patches 5-7 add support for possibly-failing collection insert and remove
  * Patch 8 centralizes constructor-like functionality for local kptr types
  * Patch 9 adds tests for new functionality

base-commit: 4a1e885c6d143ff1b557ec7f3fc6ddf39c51502f

Changelog:

v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com

Patch #s used below refer to the patch's position in v1 unless otherwise
specified.

  * General
    * Rebase onto latest bpf-next (base-commit updated above)

  * Patch 4 - "bpf: Add bpf_refcount_acquire kfunc"
    * Fix typo in summary (Alexei)
  * Patch 7 - "Migrate bpf_rbtree_remove to possibly fail"
    * Modify a paragraph in patch summary to more clearly state that only
      bpf_rbtree_remove's non-owning ref clobbering behavior is changed by the
      patch (Alexei)
    * refcount_off == -1 -> refcount_off < 0  in "node type w/ both list
      and rb_node fields" check, since any negative value means "no
      bpf_refcount field found", and furthermore refcount_off is never
      explicitly set to -1, but rather -EINVAL. (Alexei)
    * Instead of just changing "btf: list_node and rb_node in same struct" test
      expectation to pass instead of fail, do some refactoring to test both
      "list_node, rb_node, and bpf_refcount" (success) and "list_node, rb_node,
      _no_ bpf_refcount" (failure) cases. This ensures that logic change in
      previous bullet point is correct.
      * v1's "btf: list_node and rb_node in same struct" test changes didn't
        add bpf_refcount, so the fact that btf load succeeded w/ list and
        rb_nodes but no bpf_refcount field is further proof that this logic
        was incorrect in v1.
  * Patch 8 - "bpf: Centralize btf_field-specific initialization logic"
    * Instead of doing __init_field_infer_size in kfuncs when taking
      bpf_list_head type input which might've been 0-initialized in map, go
      back to simple oneliner initialization. Add short comment explaining why
      this is necessary. (Alexei)
  * Patch 9 - "selftests/bpf: Add refcounted_kptr tests"
    * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei)

Dave Marchevsky (9):
  bpf: Remove btf_field_offs, use btf_record's fields instead
  bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
  bpf: Support refcounted local kptrs in existing semantics
  bpf: Add bpf_refcount_acquire kfunc
  bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly
    fail
  selftests/bpf: Modify linked_list tests to work with macro-ified
    inserts
  bpf: Migrate bpf_rbtree_remove to possibly fail
  bpf: Centralize btf_field-specific initialization logic
  selftests/bpf: Add refcounted_kptr tests

 include/linux/bpf.h                           |  80 ++--
 include/linux/bpf_verifier.h                  |   7 +-
 include/linux/btf.h                           |   2 -
 include/uapi/linux/bpf.h                      |   4 +
 kernel/bpf/btf.c                              | 126 ++----
 kernel/bpf/helpers.c                          | 113 +++--
 kernel/bpf/map_in_map.c                       |  15 -
 kernel/bpf/syscall.c                          |  23 +-
 kernel/bpf/verifier.c                         | 155 +++++--
 tools/include/uapi/linux/bpf.h                |   4 +
 .../testing/selftests/bpf/bpf_experimental.h  |  60 ++-
 .../selftests/bpf/prog_tests/linked_list.c    |  96 +++--
 .../testing/selftests/bpf/prog_tests/rbtree.c |  25 ++
 .../bpf/prog_tests/refcounted_kptr.c          |  18 +
 .../testing/selftests/bpf/progs/linked_list.c |  34 +-
 .../testing/selftests/bpf/progs/linked_list.h |   4 +-
 .../selftests/bpf/progs/linked_list_fail.c    |  96 +++--
 tools/testing/selftests/bpf/progs/rbtree.c    |  74 +++-
 .../testing/selftests/bpf/progs/rbtree_fail.c |  77 ++--
 .../selftests/bpf/progs/refcounted_kptr.c     | 406 ++++++++++++++++++
 .../bpf/progs/refcounted_kptr_fail.c          |  72 ++++
 21 files changed, 1114 insertions(+), 377 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/refcounted_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/refcounted_kptr_fail.c

-- 
2.34.1

* [PATCH v2 bpf-next 1/9] bpf: Remove btf_field_offs, use btf_record's fields instead
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

The btf_field_offs struct contains (offset, size) pairs for btf_record
fields, sorted by offset. btf_field_offs is always used in conjunction with
btf_record, whose btf_field 'fields' array holds (offset, type) pairs;
btf_field_offs' sizes are derived from those types via btf_field_type_size.

This patch adds a size field to struct btf_field and sorts btf_record's
fields by offset, making it possible to get rid of btf_field_offs. The result
is less data duplication and less code complexity.

Since btf_field_offs' lifetime closely followed the btf_record used to
populate it, most complexity wins are from removal of initialization
code like:

  if (btf_record_successfully_initialized) {
    foffs = btf_parse_field_offs(rec);
    if (IS_ERR_OR_NULL(foffs))
      // free the btf_record and return err
  }

Other changes in this patch are pretty mechanical:

  * foffs->field_off[i] -> rec->fields[i].offset
  * foffs->field_sz[i] -> rec->fields[i].size
  * Sort rec->fields in btf_parse_fields before returning
    * It's possible that this is necessary independently of the other
      changes in this patch. btf_record_find in syscall.c expects
      btf_record's fields to be sorted by offset, yet there's no
      explicit sorting of them before this patch; the record's fields are
      populated in the order they're read from the BTF struct definition.
      BTF docs don't say anything about the sortedness of struct fields.
      (A sketch of why sortedness matters is included after this list.)
  * All functions taking struct btf_field_offs * input now instead take
    struct btf_record *. All callsites of these functions already have
    access to the correct btf_record.
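
For reference, sorting is what lets btf_record_find look fields up by offset
efficiently. An illustrative sketch of such a lookup over the now-sorted
array (not the actual kernel function; find_field_by_off is a made-up name
for illustration):

  /* Illustrative only: binary-search rec->fields, now sorted by offset */
  static const struct btf_field *find_field_by_off(const struct btf_record *rec,
                                                   u32 offset)
  {
    int lo = 0, hi = rec->cnt - 1;

    while (lo <= hi) {
      int mid = lo + (hi - lo) / 2;

      if (rec->fields[mid].offset == offset)
        return &rec->fields[mid];
      if (rec->fields[mid].offset < offset)
        lo = mid + 1;
      else
        hi = mid - 1;
    }
    return NULL;
  }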

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h     | 44 +++++++++----------
 include/linux/btf.h     |  2 -
 kernel/bpf/btf.c        | 93 ++++++++++-------------------------------
 kernel/bpf/helpers.c    |  2 +-
 kernel/bpf/map_in_map.c | 15 -------
 kernel/bpf/syscall.c    | 17 +-------
 6 files changed, 43 insertions(+), 130 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 88845aadc47d..7888ed497432 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -210,6 +210,7 @@ struct btf_field_graph_root {
 
 struct btf_field {
 	u32 offset;
+	u32 size;
 	enum btf_field_type type;
 	union {
 		struct btf_field_kptr kptr;
@@ -225,12 +226,6 @@ struct btf_record {
 	struct btf_field fields[];
 };
 
-struct btf_field_offs {
-	u32 cnt;
-	u32 field_off[BTF_FIELDS_MAX];
-	u8 field_sz[BTF_FIELDS_MAX];
-};
-
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -257,7 +252,6 @@ struct bpf_map {
 	struct obj_cgroup *objcg;
 #endif
 	char name[BPF_OBJ_NAME_LEN];
-	struct btf_field_offs *field_offs;
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
 	 */
@@ -360,14 +354,14 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
 	return rec->field_mask & type;
 }
 
-static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
+static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
 {
 	int i;
 
-	if (!foffs)
+	if (IS_ERR_OR_NULL(rec))
 		return;
-	for (i = 0; i < foffs->cnt; i++)
-		memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
+	for (i = 0; i < rec->cnt; i++)
+		memset(obj + rec->fields[i].offset, 0, rec->fields[i].size);
 }
 
 /* 'dst' must be a temporary buffer and should not point to memory that is being
@@ -379,7 +373,7 @@ static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
  */
 static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 {
-	bpf_obj_init(map->field_offs, dst);
+	bpf_obj_init(map->record, dst);
 }
 
 /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
@@ -399,14 +393,14 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 }
 
 /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
-static inline void bpf_obj_memcpy(struct btf_field_offs *foffs,
+static inline void bpf_obj_memcpy(struct btf_record *rec,
 				  void *dst, void *src, u32 size,
 				  bool long_memcpy)
 {
 	u32 curr_off = 0;
 	int i;
 
-	if (likely(!foffs)) {
+	if (IS_ERR_OR_NULL(rec)) {
 		if (long_memcpy)
 			bpf_long_memcpy(dst, src, round_up(size, 8));
 		else
@@ -414,49 +408,49 @@ static inline void bpf_obj_memcpy(struct btf_field_offs *foffs,
 		return;
 	}
 
-	for (i = 0; i < foffs->cnt; i++) {
-		u32 next_off = foffs->field_off[i];
+	for (i = 0; i < rec->cnt; i++) {
+		u32 next_off = rec->fields[i].offset;
 		u32 sz = next_off - curr_off;
 
 		memcpy(dst + curr_off, src + curr_off, sz);
-		curr_off += foffs->field_sz[i] + sz;
+		curr_off += rec->fields[i].size + sz;
 	}
 	memcpy(dst + curr_off, src + curr_off, size - curr_off);
 }
 
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
-	bpf_obj_memcpy(map->field_offs, dst, src, map->value_size, false);
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
 }
 
 static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
 {
-	bpf_obj_memcpy(map->field_offs, dst, src, map->value_size, true);
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, true);
 }
 
-static inline void bpf_obj_memzero(struct btf_field_offs *foffs, void *dst, u32 size)
+static inline void bpf_obj_memzero(struct btf_record *rec, void *dst, u32 size)
 {
 	u32 curr_off = 0;
 	int i;
 
-	if (likely(!foffs)) {
+	if (IS_ERR_OR_NULL(rec)) {
 		memset(dst, 0, size);
 		return;
 	}
 
-	for (i = 0; i < foffs->cnt; i++) {
-		u32 next_off = foffs->field_off[i];
+	for (i = 0; i < rec->cnt; i++) {
+		u32 next_off = rec->fields[i].offset;
 		u32 sz = next_off - curr_off;
 
 		memset(dst + curr_off, 0, sz);
-		curr_off += foffs->field_sz[i] + sz;
+		curr_off += rec->fields[i].size + sz;
 	}
 	memset(dst + curr_off, 0, size - curr_off);
 }
 
 static inline void zero_map_value(struct bpf_map *map, void *dst)
 {
-	bpf_obj_memzero(map->field_offs, dst, map->value_size);
+	bpf_obj_memzero(map->record, dst, map->value_size);
 }
 
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 495250162422..813227bff58a 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -113,7 +113,6 @@ struct btf_id_dtor_kfunc {
 struct btf_struct_meta {
 	u32 btf_id;
 	struct btf_record *record;
-	struct btf_field_offs *field_offs;
 };
 
 struct btf_struct_metas {
@@ -207,7 +206,6 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t);
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
 				    u32 field_mask, u32 value_size);
 int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec);
-struct btf_field_offs *btf_parse_field_offs(struct btf_record *rec);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 2c2d1fb9f410..f3c998feeccb 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -1666,10 +1666,8 @@ static void btf_struct_metas_free(struct btf_struct_metas *tab)
 
 	if (!tab)
 		return;
-	for (i = 0; i < tab->cnt; i++) {
+	for (i = 0; i < tab->cnt; i++)
 		btf_record_free(tab->types[i].record);
-		kfree(tab->types[i].field_offs);
-	}
 	kfree(tab);
 }
 
@@ -3700,12 +3698,24 @@ static int btf_parse_rb_root(const struct btf *btf, struct btf_field *field,
 					    __alignof__(struct bpf_rb_node));
 }
 
+static int btf_field_cmp(const void *_a, const void *_b, const void *priv)
+{
+	const struct btf_field *a = (const struct btf_field *)_a;
+	const struct btf_field *b = (const struct btf_field *)_b;
+
+	if (a->offset < b->offset)
+		return -1;
+	else if (a->offset > b->offset)
+		return 1;
+	return 0;
+}
+
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
 				    u32 field_mask, u32 value_size)
 {
 	struct btf_field_info info_arr[BTF_FIELDS_MAX];
+	u32 next_off = 0, field_type_size;
 	struct btf_record *rec;
-	u32 next_off = 0;
 	int ret, i, cnt;
 
 	ret = btf_find_field(btf, t, field_mask, info_arr, ARRAY_SIZE(info_arr));
@@ -3725,7 +3735,8 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 	rec->spin_lock_off = -EINVAL;
 	rec->timer_off = -EINVAL;
 	for (i = 0; i < cnt; i++) {
-		if (info_arr[i].off + btf_field_type_size(info_arr[i].type) > value_size) {
+		field_type_size = btf_field_type_size(info_arr[i].type);
+		if (info_arr[i].off + field_type_size > value_size) {
 			WARN_ONCE(1, "verifier bug off %d size %d", info_arr[i].off, value_size);
 			ret = -EFAULT;
 			goto end;
@@ -3734,11 +3745,12 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 			ret = -EEXIST;
 			goto end;
 		}
-		next_off = info_arr[i].off + btf_field_type_size(info_arr[i].type);
+		next_off = info_arr[i].off + field_type_size;
 
 		rec->field_mask |= info_arr[i].type;
 		rec->fields[i].offset = info_arr[i].off;
 		rec->fields[i].type = info_arr[i].type;
+		rec->fields[i].size = field_type_size;
 
 		switch (info_arr[i].type) {
 		case BPF_SPIN_LOCK:
@@ -3808,6 +3820,9 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 		goto end;
 	}
 
+	sort_r(rec->fields, rec->cnt, sizeof(struct btf_field), btf_field_cmp,
+	       NULL, rec);
+
 	return rec;
 end:
 	btf_record_free(rec);
@@ -3889,61 +3904,6 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
 	return 0;
 }
 
-static int btf_field_offs_cmp(const void *_a, const void *_b, const void *priv)
-{
-	const u32 a = *(const u32 *)_a;
-	const u32 b = *(const u32 *)_b;
-
-	if (a < b)
-		return -1;
-	else if (a > b)
-		return 1;
-	return 0;
-}
-
-static void btf_field_offs_swap(void *_a, void *_b, int size, const void *priv)
-{
-	struct btf_field_offs *foffs = (void *)priv;
-	u32 *off_base = foffs->field_off;
-	u32 *a = _a, *b = _b;
-	u8 *sz_a, *sz_b;
-
-	sz_a = foffs->field_sz + (a - off_base);
-	sz_b = foffs->field_sz + (b - off_base);
-
-	swap(*a, *b);
-	swap(*sz_a, *sz_b);
-}
-
-struct btf_field_offs *btf_parse_field_offs(struct btf_record *rec)
-{
-	struct btf_field_offs *foffs;
-	u32 i, *off;
-	u8 *sz;
-
-	BUILD_BUG_ON(ARRAY_SIZE(foffs->field_off) != ARRAY_SIZE(foffs->field_sz));
-	if (IS_ERR_OR_NULL(rec))
-		return NULL;
-
-	foffs = kzalloc(sizeof(*foffs), GFP_KERNEL | __GFP_NOWARN);
-	if (!foffs)
-		return ERR_PTR(-ENOMEM);
-
-	off = foffs->field_off;
-	sz = foffs->field_sz;
-	for (i = 0; i < rec->cnt; i++) {
-		off[i] = rec->fields[i].offset;
-		sz[i] = btf_field_type_size(rec->fields[i].type);
-	}
-	foffs->cnt = rec->cnt;
-
-	if (foffs->cnt == 1)
-		return foffs;
-	sort_r(foffs->field_off, foffs->cnt, sizeof(foffs->field_off[0]),
-	       btf_field_offs_cmp, btf_field_offs_swap, foffs);
-	return foffs;
-}
-
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
 			      u32 type_id, void *data, u8 bits_offset,
 			      struct btf_show *show)
@@ -5386,7 +5346,6 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
 	for (i = 1; i < n; i++) {
 		struct btf_struct_metas *new_tab;
 		const struct btf_member *member;
-		struct btf_field_offs *foffs;
 		struct btf_struct_meta *type;
 		struct btf_record *record;
 		const struct btf_type *t;
@@ -5428,17 +5387,7 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
 			ret = PTR_ERR_OR_ZERO(record) ?: -EFAULT;
 			goto free;
 		}
-		foffs = btf_parse_field_offs(record);
-		/* We need the field_offs to be valid for a valid record,
-		 * either both should be set or both should be unset.
-		 */
-		if (IS_ERR_OR_NULL(foffs)) {
-			btf_record_free(record);
-			ret = -EFAULT;
-			goto free;
-		}
 		type->record = record;
-		type->field_offs = foffs;
 		tab->cnt++;
 	}
 	return tab;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index f04e60a4847f..e989b6460147 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1893,7 +1893,7 @@ __bpf_kfunc void *bpf_obj_new_impl(u64 local_type_id__k, void *meta__ign)
 	if (!p)
 		return NULL;
 	if (meta)
-		bpf_obj_init(meta->field_offs, p);
+		bpf_obj_init(meta->record, p);
 	return p;
 }
 
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 38136ec4e095..2c5c64c2a53b 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -56,18 +56,6 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 		ret = PTR_ERR(inner_map_meta->record);
 		goto free;
 	}
-	if (inner_map_meta->record) {
-		struct btf_field_offs *field_offs;
-		/* If btf_record is !IS_ERR_OR_NULL, then field_offs is always
-		 * valid.
-		 */
-		field_offs = kmemdup(inner_map->field_offs, sizeof(*inner_map->field_offs), GFP_KERNEL | __GFP_NOWARN);
-		if (!field_offs) {
-			ret = -ENOMEM;
-			goto free_rec;
-		}
-		inner_map_meta->field_offs = field_offs;
-	}
 	/* Note: We must use the same BTF, as we also used btf_record_dup above
 	 * which relies on BTF being same for both maps, as some members like
 	 * record->fields.list_head have pointers like value_rec pointing into
@@ -88,8 +76,6 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 
 	fdput(f);
 	return inner_map_meta;
-free_rec:
-	btf_record_free(inner_map_meta->record);
 free:
 	kfree(inner_map_meta);
 put:
@@ -99,7 +85,6 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 
 void bpf_map_meta_free(struct bpf_map *map_meta)
 {
-	kfree(map_meta->field_offs);
 	bpf_map_free_record(map_meta);
 	btf_put(map_meta->btf);
 	kfree(map_meta);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6d575505f89c..c08b7933bf8f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -717,14 +717,13 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
-	struct btf_field_offs *foffs = map->field_offs;
 	struct btf_record *rec = map->record;
 
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
-	/* Delay freeing of field_offs and btf_record for maps, as map_free
+	/* Delay freeing of btf_record for maps, as map_free
 	 * callback usually needs access to them. It is better to do it here
 	 * than require each callback to do the free itself manually.
 	 *
@@ -733,7 +732,6 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	 * eventually calls bpf_map_free_meta, since inner_map_meta is only a
 	 * template bpf_map struct used during verification.
 	 */
-	kfree(foffs);
 	btf_record_free(rec);
 }
 
@@ -1125,7 +1123,6 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 static int map_create(union bpf_attr *attr)
 {
 	int numa_node = bpf_map_attr_numa_node(attr);
-	struct btf_field_offs *foffs;
 	struct bpf_map *map;
 	int f_flags;
 	int err;
@@ -1205,17 +1202,9 @@ static int map_create(union bpf_attr *attr)
 			attr->btf_vmlinux_value_type_id;
 	}
 
-
-	foffs = btf_parse_field_offs(map->record);
-	if (IS_ERR(foffs)) {
-		err = PTR_ERR(foffs);
-		goto free_map;
-	}
-	map->field_offs = foffs;
-
 	err = security_bpf_map_alloc(map);
 	if (err)
-		goto free_map_field_offs;
+		goto free_map;
 
 	err = bpf_map_alloc_id(map);
 	if (err)
@@ -1239,8 +1228,6 @@ static int map_create(union bpf_attr *attr)
 
 free_map_sec:
 	security_bpf_map_free(map);
-free_map_field_offs:
-	kfree(map->field_offs);
 free_map:
 	btf_put(map->btf);
 	map->ops->map_free(map);
-- 
2.34.1


* [PATCH v2 bpf-next 2/9] bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types
meant for use in BPF programs. Similarly to other opaque types like
bpf_spin_lock and bpf_rb_node, the verifier needs to know where in
user-defined struct types a bpf_refcount can be located, so necessary
btf_record plumbing is added to enable this. bpf_refcount is sized to
hold a refcount_t.

Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in
btf_record as refcount_off in addition to being in the field array.
Caching refcount_off makes sense for this field because further patches
in the series will modify functions that take local kptrs (e.g.
bpf_obj_drop) to change their behavior if the type they're operating on
is refcounted. So enabling fast "is this type refcounted?" checks is
desirable.

No such verifier behavior changes are introduced in this patch, just
logic to recognize 'struct bpf_refcount' in btf_record.
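
For context, a program-defined type that this plumbing will recognize looks
like the cover letter's example (field names here are arbitrary):

  struct refcounted_node {
    long data;
    struct bpf_list_node ll;
    struct bpf_refcount ref;
  };

With refcount_off cached, later patches can answer "is this type refcounted?"
with a simple rec->refcount_off >= 0 check.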

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h            |  8 ++++++++
 include/uapi/linux/bpf.h       |  4 ++++
 kernel/bpf/btf.c               | 12 +++++++++++-
 kernel/bpf/syscall.c           |  6 +++++-
 tools/include/uapi/linux/bpf.h |  4 ++++
 5 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7888ed497432..be44d765b7a4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -187,6 +187,7 @@ enum btf_field_type {
 	BPF_RB_NODE    = (1 << 7),
 	BPF_GRAPH_NODE_OR_ROOT = BPF_LIST_NODE | BPF_LIST_HEAD |
 				 BPF_RB_NODE | BPF_RB_ROOT,
+	BPF_REFCOUNT   = (1 << 8),
 };
 
 typedef void (*btf_dtor_kfunc_t)(void *);
@@ -223,6 +224,7 @@ struct btf_record {
 	u32 field_mask;
 	int spin_lock_off;
 	int timer_off;
+	int refcount_off;
 	struct btf_field fields[];
 };
 
@@ -293,6 +295,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 		return "bpf_rb_root";
 	case BPF_RB_NODE:
 		return "bpf_rb_node";
+	case BPF_REFCOUNT:
+		return "bpf_refcount";
 	default:
 		WARN_ON_ONCE(1);
 		return "unknown";
@@ -317,6 +321,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 		return sizeof(struct bpf_rb_root);
 	case BPF_RB_NODE:
 		return sizeof(struct bpf_rb_node);
+	case BPF_REFCOUNT:
+		return sizeof(struct bpf_refcount);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -341,6 +347,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 		return __alignof__(struct bpf_rb_root);
 	case BPF_RB_NODE:
 		return __alignof__(struct bpf_rb_node);
+	case BPF_REFCOUNT:
+		return __alignof__(struct bpf_refcount);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3823100b7934..4b20a7269bee 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6985,6 +6985,10 @@ struct bpf_rb_node {
 	__u64 :64;
 } __attribute__((aligned(8)));
 
+struct bpf_refcount {
+	__u32 :32;
+} __attribute__((aligned(4)));
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index f3c998feeccb..14889fd5ba8e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3391,6 +3391,7 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
 	field_mask_test_name(BPF_LIST_NODE, "bpf_list_node");
 	field_mask_test_name(BPF_RB_ROOT,   "bpf_rb_root");
 	field_mask_test_name(BPF_RB_NODE,   "bpf_rb_node");
+	field_mask_test_name(BPF_REFCOUNT,  "bpf_refcount");
 
 	/* Only return BPF_KPTR when all other types with matchable names fail */
 	if (field_mask & BPF_KPTR) {
@@ -3439,6 +3440,7 @@ static int btf_find_struct_field(const struct btf *btf,
 		case BPF_TIMER:
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
+		case BPF_REFCOUNT:
 			ret = btf_find_struct(btf, member_type, off, sz, field_type,
 					      idx < info_cnt ? &info[idx] : &tmp);
 			if (ret < 0)
@@ -3504,6 +3506,7 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
 		case BPF_TIMER:
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
+		case BPF_REFCOUNT:
 			ret = btf_find_struct(btf, var_type, off, sz, field_type,
 					      idx < info_cnt ? &info[idx] : &tmp);
 			if (ret < 0)
@@ -3734,6 +3737,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 
 	rec->spin_lock_off = -EINVAL;
 	rec->timer_off = -EINVAL;
+	rec->refcount_off = -EINVAL;
 	for (i = 0; i < cnt; i++) {
 		field_type_size = btf_field_type_size(info_arr[i].type);
 		if (info_arr[i].off + field_type_size > value_size) {
@@ -3763,6 +3767,11 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 			/* Cache offset for faster lookup at runtime */
 			rec->timer_off = rec->fields[i].offset;
 			break;
+		case BPF_REFCOUNT:
+			WARN_ON_ONCE(rec->refcount_off >= 0);
+			/* Cache offset for faster lookup at runtime */
+			rec->refcount_off = rec->fields[i].offset;
+			break;
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 			ret = btf_parse_kptr(btf, &rec->fields[i], &info_arr[i]);
@@ -5308,6 +5317,7 @@ static const char *alloc_obj_fields[] = {
 	"bpf_list_node",
 	"bpf_rb_root",
 	"bpf_rb_node",
+	"bpf_refcount",
 };
 
 static struct btf_struct_metas *
@@ -5381,7 +5391,7 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
 		type = &tab->types[tab->cnt];
 		type->btf_id = i;
 		record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
-						  BPF_RB_ROOT | BPF_RB_NODE, t->size);
+						  BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT, t->size);
 		/* The record cannot be unset, treat it as an error if so */
 		if (IS_ERR_OR_NULL(record)) {
 			ret = PTR_ERR_OR_ZERO(record) ?: -EFAULT;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c08b7933bf8f..28eac7434d32 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -552,6 +552,7 @@ void btf_record_free(struct btf_record *rec)
 		case BPF_RB_NODE:
 		case BPF_SPIN_LOCK:
 		case BPF_TIMER:
+		case BPF_REFCOUNT:
 			/* Nothing to release */
 			break;
 		default:
@@ -599,6 +600,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 		case BPF_RB_NODE:
 		case BPF_SPIN_LOCK:
 		case BPF_TIMER:
+		case BPF_REFCOUNT:
 			/* Nothing to acquire */
 			break;
 		default:
@@ -705,6 +707,7 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 			break;
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
+		case BPF_REFCOUNT:
 			break;
 		default:
 			WARN_ON_ONCE(1);
@@ -1032,7 +1035,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 
 	map->record = btf_parse_fields(btf, value_type,
 				       BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
-				       BPF_RB_ROOT,
+				       BPF_RB_ROOT | BPF_REFCOUNT,
 				       map->value_size);
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
@@ -1071,6 +1074,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 				break;
 			case BPF_KPTR_UNREF:
 			case BPF_KPTR_REF:
+			case BPF_REFCOUNT:
 				if (map->map_type != BPF_MAP_TYPE_HASH &&
 				    map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
 				    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3823100b7934..4b20a7269bee 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6985,6 +6985,10 @@ struct bpf_rb_node {
 	__u64 :64;
 } __attribute__((aligned(8)));
 
+struct bpf_refcount {
+	__u32 :32;
+} __attribute__((aligned(4)));
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
-- 
2.34.1


* [PATCH v2 bpf-next 3/9] bpf: Support refcounted local kptrs in existing semantics
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

A local kptr is considered 'refcounted' when it is of a type that has a
bpf_refcount field. When such a kptr is created, its refcount should be
initialized to 1; when destroyed, the object should be free'd only if the
refcount decrement results in a refcount of 0.

Existing logic always frees the underlying memory when destroying a
local kptr, and 0-initializes all btf_record fields. This patch adds
checks for "is local kptr refcounted?" and new logic for that case in
the appropriate places.

This patch focuses on changing existing semantics and thus conspicuously
does _not_ provide a way for BPF programs to increment the refcount. That
follows later in the series.

__bpf_obj_drop_impl is modified to do the right thing when it sees a
refcounted type. Container types for graph nodes (list, tree, stashed in
map) are migrated to use __bpf_obj_drop_impl as a destructor for their
nodes instead of each having custom destruction code in their _free
paths. Now that "drop" isn't a synonym for "free" when the type is
refcounted it makes sense to centralize this logic.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h  |  3 +++
 kernel/bpf/helpers.c | 21 +++++++++++++--------
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index be44d765b7a4..b065de2cfcea 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -370,6 +370,9 @@ static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
 		return;
 	for (i = 0; i < rec->cnt; i++)
 		memset(obj + rec->fields[i].offset, 0, rec->fields[i].size);
+
+	if (rec->refcount_off >= 0)
+		refcount_set((refcount_t *)(obj + rec->refcount_off), 1);
 }
 
 /* 'dst' must be a temporary buffer and should not point to memory that is being
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index e989b6460147..e2dbd9644e5c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1798,6 +1798,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+void __bpf_obj_drop_impl(void *p, const struct btf_record *rec);
+
 void bpf_list_head_free(const struct btf_field *field, void *list_head,
 			struct bpf_spin_lock *spin_lock)
 {
@@ -1828,13 +1830,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
 		/* The contained type can also have resources, including a
 		 * bpf_list_head which needs to be freed.
 		 */
-		bpf_obj_free_fields(field->graph_root.value_rec, obj);
-		/* bpf_mem_free requires migrate_disable(), since we can be
-		 * called from map free path as well apart from BPF program (as
-		 * part of map ops doing bpf_obj_free_fields).
-		 */
 		migrate_disable();
-		bpf_mem_free(&bpf_global_ma, obj);
+		__bpf_obj_drop_impl(obj, field->graph_root.value_rec);
 		migrate_enable();
 	}
 }
@@ -1871,10 +1868,9 @@ void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
 		obj = pos;
 		obj -= field->graph_root.node_offset;
 
-		bpf_obj_free_fields(field->graph_root.value_rec, obj);
 
 		migrate_disable();
-		bpf_mem_free(&bpf_global_ma, obj);
+		__bpf_obj_drop_impl(obj, field->graph_root.value_rec);
 		migrate_enable();
 	}
 }
@@ -1897,8 +1893,17 @@ __bpf_kfunc void *bpf_obj_new_impl(u64 local_type_id__k, void *meta__ign)
 	return p;
 }
 
+/* Must be called under migrate_disable(), as required by bpf_mem_free */
 void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
 {
+	if (rec && rec->refcount_off >= 0 &&
+	    !refcount_dec_and_test((refcount_t *)(p + rec->refcount_off))) {
+		/* Object is refcounted and refcount_dec didn't result in 0
+		 * refcount. Return without freeing the object
+		 */
+		return;
+	}
+
 	if (rec)
 		bpf_obj_free_fields(rec, p);
 	bpf_mem_free(&bpf_global_ma, p);
-- 
2.34.1


* [PATCH v2 bpf-next 4/9] bpf: Add bpf_refcount_acquire kfunc
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

Currently, BPF programs can interact with the lifetime of refcounted
local kptrs in the following ways:

  bpf_obj_new  - Initialize refcount to 1 as part of new object creation
  bpf_obj_drop - Decrement refcount and free object if it's 0
  collection add - Pass ownership to the collection. No change to
                   refcount but collection is responsible for
                   bpf_obj_dropping it

In order to be able to add a refcounted local kptr to multiple
collections we need to be able to increment the refcount and acquire a
new owning reference. This patch adds a kfunc, bpf_refcount_acquire,
implementing such an operation.

bpf_refcount_acquire takes a refcounted local kptr and returns a new
owning reference to the same underlying memory as the input. The input
can be either owning or non-owning. To reinforce why this is safe,
consider the following code snippets:

  struct node *n = bpf_obj_new(typeof(*n)); // A
  struct node *m = bpf_refcount_acquire(n); // B

In the above snippet, n will be alive with refcount=1 after (A), and
since nothing changes that state before (B), it's obviously safe. If
n is instead added to some rbtree, we can still safely refcount_acquire
it:

  struct node *n = bpf_obj_new(typeof(*n));
  struct node *m;

  bpf_spin_lock(&glock);
  bpf_rbtree_add(&groot, &n->node, less);   // A
  m = bpf_refcount_acquire(n);              // B
  bpf_spin_unlock(&glock);

In the above snippet, after (A) n is a non-owning reference, and after
(B) m is an owning reference pointing to the same memory as n. Although
n has no ownership of that memory's lifetime, it's guaranteed to be
alive until the end of the critical section, and n would be clobbered if
we were past the end of the critical section, so it's safe to bump
refcount.

Implementation details:

* From verifier's perspective, bpf_refcount_acquire handling is similar
  to bpf_obj_new and bpf_obj_drop. Like the former, it returns a new
  owning reference matching input type, although like the latter, type
  can be inferred from concrete kptr input. Verifier changes in
  {check,fixup}_kfunc_call and check_kfunc_args are largely copied from
  aforementioned functions' verifier changes.

* An exception to the above is the new KF_ARG_PTR_TO_REFCOUNTED_KPTR
  arg, indicated by new "__refcounted_kptr" kfunc arg suffix. This is
  necessary in order to handle both owning and non-owning input without
  adding special-casing to "__alloc" arg handling. Also a convenient
  place to confirm that input type has bpf_refcount field.

* The implemented kfunc is actually bpf_refcount_acquire_impl, with
  'hidden' second arg that the verifier sets to the type's struct_meta
  in fixup_kfunc_call.
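
From the BPF program side, usage goes through the convenience macro added to
bpf_experimental.h in this patch. A minimal sketch, where n is an owning or
non-owning reference to a refcounted local kptr type (called struct node here
for illustration):

  struct node *m;

  m = bpf_refcount_acquire(n);
  /* m is a new owning reference to the object n points to; like any owning
   * ref, it must eventually be passed to a collection insert or bpf_obj_drop
   */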

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 kernel/bpf/helpers.c                          | 15 ++++
 kernel/bpf/verifier.c                         | 74 ++++++++++++++++---
 .../testing/selftests/bpf/bpf_experimental.h  | 13 ++++
 3 files changed, 91 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index e2dbd9644e5c..57ff8a60222c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1917,6 +1917,20 @@ __bpf_kfunc void bpf_obj_drop_impl(void *p__alloc, void *meta__ign)
 	__bpf_obj_drop_impl(p, meta ? meta->record : NULL);
 }
 
+__bpf_kfunc void *bpf_refcount_acquire_impl(void *p__refcounted_kptr, void *meta__ign)
+{
+	struct btf_struct_meta *meta = meta__ign;
+	struct bpf_refcount *ref;
+
+	/* Could just cast directly to refcount_t *, but need some code using
+	 * bpf_refcount type so that it is emitted in vmlinux BTF
+	 */
+	ref = (struct bpf_refcount *)(p__refcounted_kptr + meta->record->refcount_off);
+
+	refcount_inc((refcount_t *)ref);
+	return (void *)p__refcounted_kptr;
+}
+
 static void __bpf_list_add(struct bpf_list_node *node, struct bpf_list_head *head, bool tail)
 {
 	struct list_head *n = (void *)node, *h = (void *)head;
@@ -2276,6 +2290,7 @@ BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
 #endif
 BTF_ID_FLAGS(func, bpf_obj_new_impl, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_obj_drop_impl, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_refcount_acquire_impl, KF_ACQUIRE)
 BTF_ID_FLAGS(func, bpf_list_push_front)
 BTF_ID_FLAGS(func, bpf_list_push_back)
 BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4aa6d715e655..29e106f7ccaa 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -273,6 +273,11 @@ struct bpf_call_arg_meta {
 	struct btf_field *kptr_field;
 };
 
+struct btf_and_id {
+	struct btf *btf;
+	u32 btf_id;
+};
+
 struct bpf_kfunc_call_arg_meta {
 	/* In parameters */
 	struct btf *btf;
@@ -291,10 +296,10 @@ struct bpf_kfunc_call_arg_meta {
 		u64 value;
 		bool found;
 	} arg_constant;
-	struct {
-		struct btf *btf;
-		u32 btf_id;
-	} arg_obj_drop;
+	union {
+		struct btf_and_id arg_obj_drop;
+		struct btf_and_id arg_refcount_acquire;
+	};
 	struct {
 		struct btf_field *field;
 	} arg_list_head;
@@ -9403,6 +9408,11 @@ static bool is_kfunc_arg_uninit(const struct btf *btf, const struct btf_param *a
 	return __kfunc_param_match_suffix(btf, arg, "__uninit");
 }
 
+static bool is_kfunc_arg_refcounted_kptr(const struct btf *btf, const struct btf_param *arg)
+{
+	return __kfunc_param_match_suffix(btf, arg, "__refcounted_kptr");
+}
+
 static bool is_kfunc_arg_scalar_with_name(const struct btf *btf,
 					  const struct btf_param *arg,
 					  const char *name)
@@ -9542,15 +9552,16 @@ static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
 
 enum kfunc_ptr_arg_type {
 	KF_ARG_PTR_TO_CTX,
-	KF_ARG_PTR_TO_ALLOC_BTF_ID,  /* Allocated object */
-	KF_ARG_PTR_TO_KPTR,	     /* PTR_TO_KPTR but type specific */
+	KF_ARG_PTR_TO_ALLOC_BTF_ID,    /* Allocated object */
+	KF_ARG_PTR_TO_REFCOUNTED_KPTR, /* Refcounted local kptr */
+	KF_ARG_PTR_TO_KPTR,	       /* PTR_TO_KPTR but type specific */
 	KF_ARG_PTR_TO_DYNPTR,
 	KF_ARG_PTR_TO_ITER,
 	KF_ARG_PTR_TO_LIST_HEAD,
 	KF_ARG_PTR_TO_LIST_NODE,
-	KF_ARG_PTR_TO_BTF_ID,	     /* Also covers reg2btf_ids conversions */
+	KF_ARG_PTR_TO_BTF_ID,	       /* Also covers reg2btf_ids conversions */
 	KF_ARG_PTR_TO_MEM,
-	KF_ARG_PTR_TO_MEM_SIZE,	     /* Size derived from next argument, skip it */
+	KF_ARG_PTR_TO_MEM_SIZE,	       /* Size derived from next argument, skip it */
 	KF_ARG_PTR_TO_CALLBACK,
 	KF_ARG_PTR_TO_RB_ROOT,
 	KF_ARG_PTR_TO_RB_NODE,
@@ -9559,6 +9570,7 @@ enum kfunc_ptr_arg_type {
 enum special_kfunc_type {
 	KF_bpf_obj_new_impl,
 	KF_bpf_obj_drop_impl,
+	KF_bpf_refcount_acquire_impl,
 	KF_bpf_list_push_front,
 	KF_bpf_list_push_back,
 	KF_bpf_list_pop_front,
@@ -9579,6 +9591,7 @@ enum special_kfunc_type {
 BTF_SET_START(special_kfunc_set)
 BTF_ID(func, bpf_obj_new_impl)
 BTF_ID(func, bpf_obj_drop_impl)
+BTF_ID(func, bpf_refcount_acquire_impl)
 BTF_ID(func, bpf_list_push_front)
 BTF_ID(func, bpf_list_push_back)
 BTF_ID(func, bpf_list_pop_front)
@@ -9597,6 +9610,7 @@ BTF_SET_END(special_kfunc_set)
 BTF_ID_LIST(special_kfunc_list)
 BTF_ID(func, bpf_obj_new_impl)
 BTF_ID(func, bpf_obj_drop_impl)
+BTF_ID(func, bpf_refcount_acquire_impl)
 BTF_ID(func, bpf_list_push_front)
 BTF_ID(func, bpf_list_push_back)
 BTF_ID(func, bpf_list_pop_front)
@@ -9649,6 +9663,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (is_kfunc_arg_alloc_obj(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_ALLOC_BTF_ID;
 
+	if (is_kfunc_arg_refcounted_kptr(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_REFCOUNTED_KPTR;
+
 	if (is_kfunc_arg_kptr_get(meta, argno)) {
 		if (!btf_type_is_ptr(ref_t)) {
 			verbose(env, "arg#0 BTF type must be a double pointer for kptr_get kfunc\n");
@@ -9952,7 +9969,8 @@ static bool is_bpf_rbtree_api_kfunc(u32 btf_id)
 
 static bool is_bpf_graph_api_kfunc(u32 btf_id)
 {
-	return is_bpf_list_api_kfunc(btf_id) || is_bpf_rbtree_api_kfunc(btf_id);
+	return is_bpf_list_api_kfunc(btf_id) || is_bpf_rbtree_api_kfunc(btf_id) ||
+	       btf_id == special_kfunc_list[KF_bpf_refcount_acquire_impl];
 }
 
 static bool is_callback_calling_kfunc(u32 btf_id)
@@ -10171,6 +10189,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 	const char *func_name = meta->func_name, *ref_tname;
 	const struct btf *btf = meta->btf;
 	const struct btf_param *args;
+	struct btf_record *rec;
 	u32 i, nargs;
 	int ret;
 
@@ -10306,6 +10325,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 		case KF_ARG_PTR_TO_MEM:
 		case KF_ARG_PTR_TO_MEM_SIZE:
 		case KF_ARG_PTR_TO_CALLBACK:
+		case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
 			/* Trusted by default */
 			break;
 		default:
@@ -10523,6 +10543,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 		case KF_ARG_PTR_TO_CALLBACK:
 			meta->subprogno = reg->subprogno;
 			break;
+		case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
+			if (!type_is_ptr_alloc_obj(reg->type) && !type_is_non_owning_ref(reg->type)) {
+				verbose(env, "arg#%d is neither owning or non-owning ref\n", i);
+				return -EINVAL;
+			}
+
+			rec = reg_btf_record(reg);
+			if (!rec) {
+				verbose(env, "verifier internal error: Couldn't find btf_record\n");
+				return -EFAULT;
+			}
+
+			if (rec->refcount_off < 0) {
+				verbose(env, "arg#%d doesn't point to a type with bpf_refcount field\n", i);
+				return -EINVAL;
+			}
+
+			meta->arg_refcount_acquire.btf = reg->btf;
+			meta->arg_refcount_acquire.btf_id = reg->btf_id;
+			break;
 		}
 	}
 
@@ -10699,7 +10739,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 
 	if (is_kfunc_acquire(&meta) && !btf_type_is_struct_ptr(meta.btf, t)) {
 		/* Only exception is bpf_obj_new_impl */
-		if (meta.btf != btf_vmlinux || meta.func_id != special_kfunc_list[KF_bpf_obj_new_impl]) {
+		if (meta.btf != btf_vmlinux ||
+		    (meta.func_id != special_kfunc_list[KF_bpf_obj_new_impl] &&
+		     meta.func_id != special_kfunc_list[KF_bpf_refcount_acquire_impl])) {
 			verbose(env, "acquire kernel function does not return PTR_TO_BTF_ID\n");
 			return -EINVAL;
 		}
@@ -10747,6 +10789,15 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 				insn_aux->obj_new_size = ret_t->size;
 				insn_aux->kptr_struct_meta =
 					btf_find_struct_meta(ret_btf, ret_btf_id);
+			} else if (meta.func_id == special_kfunc_list[KF_bpf_refcount_acquire_impl]) {
+				mark_reg_known_zero(env, regs, BPF_REG_0);
+				regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
+				regs[BPF_REG_0].btf = meta.arg_refcount_acquire.btf;
+				regs[BPF_REG_0].btf_id = meta.arg_refcount_acquire.btf_id;
+
+				insn_aux->kptr_struct_meta =
+					btf_find_struct_meta(meta.arg_refcount_acquire.btf,
+							     meta.arg_refcount_acquire.btf_id);
 			} else if (meta.func_id == special_kfunc_list[KF_bpf_list_pop_front] ||
 				   meta.func_id == special_kfunc_list[KF_bpf_list_pop_back]) {
 				struct btf_field *field = meta.arg_list_head.field;
@@ -17393,7 +17444,8 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		insn_buf[2] = addr[1];
 		insn_buf[3] = *insn;
 		*cnt = 4;
-	} else if (desc->func_id == special_kfunc_list[KF_bpf_obj_drop_impl]) {
+	} else if (desc->func_id == special_kfunc_list[KF_bpf_obj_drop_impl] ||
+		   desc->func_id == special_kfunc_list[KF_bpf_refcount_acquire_impl]) {
 		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
 		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };
 
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index dbd2c729781a..619afcab2ab0 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -37,6 +37,19 @@ extern void bpf_obj_drop_impl(void *kptr, void *meta) __ksym;
 /* Convenience macro to wrap over bpf_obj_drop_impl */
 #define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL)
 
+/* Description
+ *	Increment the refcount on a refcounted local kptr, turning the
+ *	non-owning reference input into an owning reference in the process.
+ *
+ *	The 'meta' parameter is a hidden argument that is ignored.
+ * Returns
+ *	An owning reference to the object pointed to by 'kptr'
+ */
+extern void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym;
+
+/* Convenience macro to wrap over bpf_refcount_acquire_impl */
+#define bpf_refcount_acquire(kptr) bpf_refcount_acquire_impl(kptr, NULL)
+
 /* Description
  *	Add a new entry to the beginning of the BPF linked list.
  * Returns
-- 
2.34.1


* [PATCH v2 bpf-next 5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

Consider this code snippet:

  struct node {
    long key;
    struct bpf_list_node l;
    struct bpf_rb_node r;
    struct bpf_refcount ref;
  };

  int some_bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(/*...*/), *m;

    bpf_spin_lock(&glock);

    bpf_rbtree_add(&some_tree, &n->r, /* ... */);
    m = bpf_refcount_acquire(n);
    bpf_rbtree_add(&other_tree, &m->r, /* ... */);

    bpf_spin_unlock(&glock);

    /* ... */
  }

After bpf_refcount_acquire, n and m point to the same underlying memory,
and that node's bpf_rb_node field is being used by the some_tree insert,
so overwriting it as a result of the second insert is an error. In order
to properly support refcounted nodes, the rbtree and list insert
functions must be allowed to fail. This patch adds such support.

The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to
return an int indicating success/failure, with 0 -> success, nonzero ->
failure.

bpf_obj_drop on failure
=======================

Currently the only reason an insert can fail is the example above: the
bpf_{list,rb}_node is already in use. When such a failure occurs, the
insert kfuncs will bpf_obj_drop the input node. This allows the insert
operations to logically fail without changing their verifier owning ref
behavior, namely the unconditional release_reference of the input
owning ref.

With an insert that always succeeds, ownership of the node is always passed
to the collection, since the node always ends up in the collection.

With a possibly-failed insert w/ bpf_obj_drop, ownership of the node
is always passed either to the collection (success), or to bpf_obj_drop
(failure). Regardless, it's correct to continue unconditionally
releasing the input owning ref, as something is always taking ownership
from the calling program on insert.

Keeping owning ref behavior unchanged results in a nice default UX for
insert functions that can fail. If the program's reaction to a failed
insert is "fine, just get rid of this owning ref for me and let me go
on with my business", then there's no reason to check for failure, since
that's the default behavior, e.g.:

  long important_failures = 0;

  int some_bpf_prog(void *ctx)
  {
    struct node *n, *m, *o; /* all bpf_obj_new'd */

    bpf_spin_lock(&glock);
    bpf_rbtree_add(&some_tree, &n->node, /* ... */);
    bpf_rbtree_add(&some_tree, &m->node, /* ... */);
    if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) {
      important_failures++;
    }
    bpf_spin_unlock(&glock);
  }

If we instead chose to pass ownership back to the program on failed
insert - by returning NULL on success or an owning ref on failure -
programs would always have to do something with the returned ref on
failure. The most likely action is probably "I'll just get rid of this
owning ref and go about my business", which ideally would look like:

  if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */))
    bpf_obj_drop(n);

But bpf_obj_drop isn't allowed in a critical section and inserts must
occur within one, so in reality error handling would become a
hard-to-parse mess.

For refcounted nodes, we can replicate the "pass ownership back to
program on failure" logic with this patch's semantics, albeit in an ugly
way:

  struct node *n = bpf_obj_new(/* ... */), *m;

  bpf_spin_lock(&glock);

  m = bpf_refcount_acquire(n);
  if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) {
    /* Do something with m */
  }

  bpf_spin_unlock(&glock);
  bpf_obj_drop(m);

bpf_refcount_acquire is used to simulate "return owning ref on failure".
This should be an uncommon occurrence, though.

Addition of two verifier-fixup'd args to collection inserts
===========================================================

The actual bpf_obj_drop kfunc is
bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop
macro populating the second arg with 0 and the verifier later filling in
the arg during insn fixup.
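
For reference, the existing wrapper in bpf_experimental.h looks roughly
like this (the program never touches 'meta'):

  #define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL)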

Because bpf_rbtree_add and bpf_list_push_{front,back} now might do
bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be
passed to bpf_obj_drop_impl.

Similarly, because the 'node' param to those insert functions is the
bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a
pointer to the beginning of the node, the insert functions need to be
able to find the beginning of the node struct. A second
verifier-populated param is necessary: the offset of {list,rb}_node within the
node type.

These two new params allow the insert kfuncs to correctly call
__bpf_obj_drop_impl:

  beginning_of_node = bpf_rb_node_ptr - offset
  if (already_inserted)
    __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record);

Similarly to other kfuncs with "hidden" verifier-populated params, the
insert functions are renamed with an _impl suffix and a macro is provided
for common usage. For example, bpf_rbtree_add kfunc is now
bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets
"hidden" args to 0.

Due to the two new args BPF progs will need to be recompiled to work
with the new _impl kfuncs.

This patch also rewrites the "hidden argument" explanation to more
directly say why the BPF program writer doesn't need to populate the
arguments with anything meaningful.

How does this new logic affect non-owning references?
=====================================================

Currently, non-owning refs are valid until the end of the critical
section in which they're created. We can make this guarantee because, if
a non-owning ref exists, the referent was added to some collection. The
collection will drop() its nodes when it goes away, but it can't go away
while our program is accessing it, so that's not a problem. If the
referent is removed from the collection in the same CS that it was added
in, it can't be bpf_obj_drop'd until after CS end. Those are the only
two ways to free the referent's memory and neither can happen until
after the non-owning ref's lifetime ends.

On first glance, having these collection insert functions potentially
bpf_obj_drop their input seems like it breaks the "can't be
bpf_obj_drop'd until after CS end" line of reasoning. But we care about
the memory not being _freed_ until the end of the CS, and a previous patch
in the series modified bpf_obj_drop such that it doesn't free refcounted
nodes until refcount == 0. So the statement can be more accurately
rewritten as "can't be free'd until after CS end".

We can prove that this rewritten statement holds for any non-owning
reference produced by collection insert functions:

* If the input to the insert function is _not_ refcounted
  * We have an owning reference to the input, and can conclude it isn't
    in any collection
    * Inserting a node in a collection turns owning refs into
      non-owning, and since our input type isn't refcounted, there's no
      way to obtain additional owning refs to the same underlying
      memory
  * Because our node isn't in any collection, the insert operation
    cannot fail, so bpf_obj_drop will not execute
  * If bpf_obj_drop is guaranteed not to execute, there's no risk of
    memory being free'd

* Otherwise, the input to the insert function is refcounted
  * If the insert operation fails because the node's bpf_list_node or
    bpf_rb_node is already in use by some collection, there was some
    previous successful insert which passed the refcount to that collection
  * We have an owning reference to the input; it must have been
    acquired via bpf_refcount_acquire, which bumped the refcount
  * refcount must be >= 2 since there's a valid owning reference and the
    node is already in a collection
  * An insert that triggers bpf_obj_drop will only decrement the refcount
    to >= 1, never resulting in a free

So although we may do bpf_obj_drop during the critical section, this
will never result in memory being free'd, and no changes to non-owning
ref logic are needed in this patch.
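
As a concrete sketch of the refcounted case, reusing n, m and the trees
from the example at the top of this message and annotating the refcount
at each step:

  n = bpf_obj_new(typeof(*n));
  if (!n) { /* ... */ }

  bpf_spin_lock(&glock);

  /* refcount 1 is passed to some_tree, n becomes non-owning */
  bpf_rbtree_add(&some_tree, &n->r, /* ... */);
  /* refcount bumped to 2, m is an owning ref */
  m = bpf_refcount_acquire(n);
  /* Fails: the node's bpf_rb_node is already in some_tree. The kfunc
   * bpf_obj_drop's the node internally, dropping refcount from 2 to 1,
   * so the underlying memory stays valid and n is still safe to read
   * until the end of the critical section.
   */
  bpf_rbtree_add(&other_tree, &m->r, /* ... */);

  bpf_spin_unlock(&glock);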

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf_verifier.h                  |  7 +-
 kernel/bpf/helpers.c                          | 65 ++++++++++++----
 kernel/bpf/verifier.c                         | 78 +++++++++++++------
 .../testing/selftests/bpf/bpf_experimental.h  | 49 +++++++++---
 4 files changed, 148 insertions(+), 51 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index f03852b89d28..3dd29a53b711 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -464,7 +464,12 @@ struct bpf_insn_aux_data {
 		 */
 		struct bpf_loop_inline_state loop_inline_state;
 	};
-	u64 obj_new_size; /* remember the size of type passed to bpf_obj_new to rewrite R1 */
+	union {
+		/* remember the size of type passed to bpf_obj_new to rewrite R1 */
+		u64 obj_new_size;
+		/* remember the offset of node field within type to rewrite */
+		u64 insert_off;
+	};
 	struct btf_struct_meta *kptr_struct_meta;
 	u64 map_key_state; /* constant (32 bit) key tracking for maps */
 	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 57ff8a60222c..5067f8d46872 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1931,7 +1931,8 @@ __bpf_kfunc void *bpf_refcount_acquire_impl(void *p__refcounted_kptr, void *meta
 	return (void *)p__refcounted_kptr;
 }
 
-static void __bpf_list_add(struct bpf_list_node *node, struct bpf_list_head *head, bool tail)
+static int __bpf_list_add(struct bpf_list_node *node, struct bpf_list_head *head,
+			  bool tail, struct btf_record *rec, u64 off)
 {
 	struct list_head *n = (void *)node, *h = (void *)head;
 
@@ -1939,17 +1940,35 @@ static void __bpf_list_add(struct bpf_list_node *node, struct bpf_list_head *hea
 		INIT_LIST_HEAD(h);
 	if (unlikely(!n->next))
 		INIT_LIST_HEAD(n);
+	if (!list_empty(n)) {
+		/* Only called from BPF prog, no need to migrate_disable */
+		__bpf_obj_drop_impl(n - off, rec);
+		return -EINVAL;
+	}
+
 	tail ? list_add_tail(n, h) : list_add(n, h);
+
+	return 0;
 }
 
-__bpf_kfunc void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node)
+__bpf_kfunc int bpf_list_push_front_impl(struct bpf_list_head *head,
+					 struct bpf_list_node *node,
+					 void *meta__ign, u64 off)
 {
-	return __bpf_list_add(node, head, false);
+	struct btf_struct_meta *meta = meta__ign;
+
+	return __bpf_list_add(node, head, false,
+			      meta ? meta->record : NULL, off);
 }
 
-__bpf_kfunc void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node)
+__bpf_kfunc int bpf_list_push_back_impl(struct bpf_list_head *head,
+					struct bpf_list_node *node,
+					void *meta__ign, u64 off)
 {
-	return __bpf_list_add(node, head, true);
+	struct btf_struct_meta *meta = meta__ign;
+
+	return __bpf_list_add(node, head, true,
+			      meta ? meta->record : NULL, off);
 }
 
 static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head, bool tail)
@@ -1989,14 +2008,23 @@ __bpf_kfunc struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
 /* Need to copy rbtree_add_cached's logic here because our 'less' is a BPF
  * program
  */
-static void __bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
-			     void *less)
+static int __bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
+			    void *less, struct btf_record *rec, u64 off)
 {
 	struct rb_node **link = &((struct rb_root_cached *)root)->rb_root.rb_node;
+	struct rb_node *parent = NULL, *n = (struct rb_node *)node;
 	bpf_callback_t cb = (bpf_callback_t)less;
-	struct rb_node *parent = NULL;
 	bool leftmost = true;
 
+	if (!n->__rb_parent_color)
+		RB_CLEAR_NODE(n);
+
+	if (!RB_EMPTY_NODE(n)) {
+		/* Only called from BPF prog, no need to migrate_disable */
+		__bpf_obj_drop_impl(n - off, rec);
+		return -EINVAL;
+	}
+
 	while (*link) {
 		parent = *link;
 		if (cb((uintptr_t)node, (uintptr_t)parent, 0, 0, 0)) {
@@ -2007,15 +2035,18 @@ static void __bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
 		}
 	}
 
-	rb_link_node((struct rb_node *)node, parent, link);
-	rb_insert_color_cached((struct rb_node *)node,
-			       (struct rb_root_cached *)root, leftmost);
+	rb_link_node(n, parent, link);
+	rb_insert_color_cached(n, (struct rb_root_cached *)root, leftmost);
+	return 0;
 }
 
-__bpf_kfunc void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
-				bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b))
+__bpf_kfunc int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
+				    bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b),
+				    void *meta__ign, u64 off)
 {
-	__bpf_rbtree_add(root, node, (void *)less);
+	struct btf_struct_meta *meta = meta__ign;
+
+	return __bpf_rbtree_add(root, node, (void *)less, meta ? meta->record : NULL, off);
 }
 
 __bpf_kfunc struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root)
@@ -2291,14 +2322,14 @@ BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
 BTF_ID_FLAGS(func, bpf_obj_new_impl, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_obj_drop_impl, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_refcount_acquire_impl, KF_ACQUIRE)
-BTF_ID_FLAGS(func, bpf_list_push_front)
-BTF_ID_FLAGS(func, bpf_list_push_back)
+BTF_ID_FLAGS(func, bpf_list_push_front_impl)
+BTF_ID_FLAGS(func, bpf_list_push_back_impl)
 BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE)
-BTF_ID_FLAGS(func, bpf_rbtree_add)
+BTF_ID_FLAGS(func, bpf_rbtree_add_impl)
 BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
 
 #ifdef CONFIG_CGROUPS
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 29e106f7ccaa..736cb7cec0bd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8500,10 +8500,10 @@ static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
 					 struct bpf_func_state *callee,
 					 int insn_idx)
 {
-	/* void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
+	/* void bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
 	 *                     bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b));
 	 *
-	 * 'struct bpf_rb_node *node' arg to bpf_rbtree_add is the same PTR_TO_BTF_ID w/ offset
+	 * 'struct bpf_rb_node *node' arg to bpf_rbtree_add_impl is the same PTR_TO_BTF_ID w/ offset
 	 * that 'less' callback args will be receiving. However, 'node' arg was release_reference'd
 	 * by this point, so look at 'root'
 	 */
@@ -9571,8 +9571,8 @@ enum special_kfunc_type {
 	KF_bpf_obj_new_impl,
 	KF_bpf_obj_drop_impl,
 	KF_bpf_refcount_acquire_impl,
-	KF_bpf_list_push_front,
-	KF_bpf_list_push_back,
+	KF_bpf_list_push_front_impl,
+	KF_bpf_list_push_back_impl,
 	KF_bpf_list_pop_front,
 	KF_bpf_list_pop_back,
 	KF_bpf_cast_to_kern_ctx,
@@ -9580,7 +9580,7 @@ enum special_kfunc_type {
 	KF_bpf_rcu_read_lock,
 	KF_bpf_rcu_read_unlock,
 	KF_bpf_rbtree_remove,
-	KF_bpf_rbtree_add,
+	KF_bpf_rbtree_add_impl,
 	KF_bpf_rbtree_first,
 	KF_bpf_dynptr_from_skb,
 	KF_bpf_dynptr_from_xdp,
@@ -9592,14 +9592,14 @@ BTF_SET_START(special_kfunc_set)
 BTF_ID(func, bpf_obj_new_impl)
 BTF_ID(func, bpf_obj_drop_impl)
 BTF_ID(func, bpf_refcount_acquire_impl)
-BTF_ID(func, bpf_list_push_front)
-BTF_ID(func, bpf_list_push_back)
+BTF_ID(func, bpf_list_push_front_impl)
+BTF_ID(func, bpf_list_push_back_impl)
 BTF_ID(func, bpf_list_pop_front)
 BTF_ID(func, bpf_list_pop_back)
 BTF_ID(func, bpf_cast_to_kern_ctx)
 BTF_ID(func, bpf_rdonly_cast)
 BTF_ID(func, bpf_rbtree_remove)
-BTF_ID(func, bpf_rbtree_add)
+BTF_ID(func, bpf_rbtree_add_impl)
 BTF_ID(func, bpf_rbtree_first)
 BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_dynptr_from_xdp)
@@ -9611,8 +9611,8 @@ BTF_ID_LIST(special_kfunc_list)
 BTF_ID(func, bpf_obj_new_impl)
 BTF_ID(func, bpf_obj_drop_impl)
 BTF_ID(func, bpf_refcount_acquire_impl)
-BTF_ID(func, bpf_list_push_front)
-BTF_ID(func, bpf_list_push_back)
+BTF_ID(func, bpf_list_push_front_impl)
+BTF_ID(func, bpf_list_push_back_impl)
 BTF_ID(func, bpf_list_pop_front)
 BTF_ID(func, bpf_list_pop_back)
 BTF_ID(func, bpf_cast_to_kern_ctx)
@@ -9620,7 +9620,7 @@ BTF_ID(func, bpf_rdonly_cast)
 BTF_ID(func, bpf_rcu_read_lock)
 BTF_ID(func, bpf_rcu_read_unlock)
 BTF_ID(func, bpf_rbtree_remove)
-BTF_ID(func, bpf_rbtree_add)
+BTF_ID(func, bpf_rbtree_add_impl)
 BTF_ID(func, bpf_rbtree_first)
 BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_dynptr_from_xdp)
@@ -9954,15 +9954,15 @@ static int check_reg_allocation_locked(struct bpf_verifier_env *env, struct bpf_
 
 static bool is_bpf_list_api_kfunc(u32 btf_id)
 {
-	return btf_id == special_kfunc_list[KF_bpf_list_push_front] ||
-	       btf_id == special_kfunc_list[KF_bpf_list_push_back] ||
+	return btf_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
+	       btf_id == special_kfunc_list[KF_bpf_list_push_back_impl] ||
 	       btf_id == special_kfunc_list[KF_bpf_list_pop_front] ||
 	       btf_id == special_kfunc_list[KF_bpf_list_pop_back];
 }
 
 static bool is_bpf_rbtree_api_kfunc(u32 btf_id)
 {
-	return btf_id == special_kfunc_list[KF_bpf_rbtree_add] ||
+	return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl] ||
 	       btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
 	       btf_id == special_kfunc_list[KF_bpf_rbtree_first];
 }
@@ -9975,7 +9975,7 @@ static bool is_bpf_graph_api_kfunc(u32 btf_id)
 
 static bool is_callback_calling_kfunc(u32 btf_id)
 {
-	return btf_id == special_kfunc_list[KF_bpf_rbtree_add];
+	return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl];
 }
 
 static bool is_rbtree_lock_required_kfunc(u32 btf_id)
@@ -10016,12 +10016,12 @@ static bool check_kfunc_is_graph_node_api(struct bpf_verifier_env *env,
 
 	switch (node_field_type) {
 	case BPF_LIST_NODE:
-		ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_list_push_front] ||
-		       kfunc_btf_id == special_kfunc_list[KF_bpf_list_push_back]);
+		ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
+		       kfunc_btf_id == special_kfunc_list[KF_bpf_list_push_back_impl]);
 		break;
 	case BPF_RB_NODE:
 		ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
-		       kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_add]);
+		       kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl]);
 		break;
 	default:
 		verbose(env, "verifier internal error: unexpected graph node argument type %s\n",
@@ -10702,10 +10702,11 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		}
 	}
 
-	if (meta.func_id == special_kfunc_list[KF_bpf_list_push_front] ||
-	    meta.func_id == special_kfunc_list[KF_bpf_list_push_back] ||
-	    meta.func_id == special_kfunc_list[KF_bpf_rbtree_add]) {
+	if (meta.func_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
+	    meta.func_id == special_kfunc_list[KF_bpf_list_push_back_impl] ||
+	    meta.func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
 		release_ref_obj_id = regs[BPF_REG_2].ref_obj_id;
+		insn_aux->insert_off = regs[BPF_REG_2].off;
 		err = ref_convert_owning_non_owning(env, release_ref_obj_id);
 		if (err) {
 			verbose(env, "kfunc %s#%d conversion of owning ref to non-owning failed\n",
@@ -10721,7 +10722,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		}
 	}
 
-	if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_add]) {
+	if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
 		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
 					set_rbtree_add_callback_state);
 		if (err) {
@@ -14764,7 +14765,7 @@ static bool regs_exact(const struct bpf_reg_state *rold,
 		       const struct bpf_reg_state *rcur,
 		       struct bpf_id_pair *idmap)
 {
-	return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 && 
+	return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 &&
 	       check_ids(rold->id, rcur->id, idmap) &&
 	       check_ids(rold->ref_obj_id, rcur->ref_obj_id, idmap);
 }
@@ -17407,6 +17408,23 @@ static void specialize_kfunc(struct bpf_verifier_env *env,
 	}
 }
 
+static void __fixup_collection_insert_kfunc(struct bpf_insn_aux_data *insn_aux,
+					    u16 struct_meta_reg,
+					    u16 node_offset_reg,
+					    struct bpf_insn *insn,
+					    struct bpf_insn *insn_buf,
+					    int *cnt)
+{
+	struct btf_struct_meta *kptr_struct_meta = insn_aux->kptr_struct_meta;
+	struct bpf_insn addr[2] = { BPF_LD_IMM64(struct_meta_reg, (long)kptr_struct_meta) };
+
+	insn_buf[0] = addr[0];
+	insn_buf[1] = addr[1];
+	insn_buf[2] = BPF_MOV64_IMM(node_offset_reg, insn_aux->insert_off);
+	insn_buf[3] = *insn;
+	*cnt = 4;
+}
+
 static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			    struct bpf_insn *insn_buf, int insn_idx, int *cnt)
 {
@@ -17453,6 +17471,20 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		insn_buf[1] = addr[1];
 		insn_buf[2] = *insn;
 		*cnt = 3;
+	} else if (desc->func_id == special_kfunc_list[KF_bpf_list_push_back_impl] ||
+		   desc->func_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
+		   desc->func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
+		int struct_meta_reg = BPF_REG_3;
+		int node_offset_reg = BPF_REG_4;
+
+		/* rbtree_add has extra 'less' arg, so args-to-fixup are in diff regs */
+		if (desc->func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
+			struct_meta_reg = BPF_REG_4;
+			node_offset_reg = BPF_REG_5;
+		}
+
+		__fixup_collection_insert_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
+						node_offset_reg, insn, insn_buf, cnt);
 	} else if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
 		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
 		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 619afcab2ab0..209811b1993a 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -14,7 +14,8 @@
  *	type ID of a struct in program BTF.
  *
  *	The 'local_type_id' parameter must be a known constant.
- *	The 'meta' parameter is a hidden argument that is ignored.
+ *	The 'meta' parameter is rewritten by the verifier, no need for BPF
+ *	program to set it.
  * Returns
  *	A pointer to an object of the type corresponding to the passed in
  *	'local_type_id', or NULL on failure.
@@ -28,7 +29,8 @@ extern void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym;
  *	Free an allocated object. All fields of the object that require
  *	destruction will be destructed before the storage is freed.
  *
- *	The 'meta' parameter is a hidden argument that is ignored.
+ *	The 'meta' parameter is rewritten by the verifier, no need for BPF
+ *	program to set it.
  * Returns
  *	Void.
  */
@@ -41,7 +43,8 @@ extern void bpf_obj_drop_impl(void *kptr, void *meta) __ksym;
  *	Increment the refcount on a refcounted local kptr, turning the
  *	non-owning reference input into an owning reference in the process.
  *
- *	The 'meta' parameter is a hidden argument that is ignored.
+ *	The 'meta' parameter is rewritten by the verifier, no need for BPF
+ *	program to set it.
  * Returns
  *	An owning reference to the object pointed to by 'kptr'
  */
@@ -52,17 +55,35 @@ extern void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym;
 
 /* Description
  *	Add a new entry to the beginning of the BPF linked list.
+ *
+ *	The 'meta' and 'off' parameters are rewritten by the verifier, no need
+ *	for BPF programs to set them
  * Returns
- *	Void.
+ *	0 if the node was successfully added
+ *	-EINVAL if the node wasn't added because it's already in a list
  */
-extern void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node) __ksym;
+extern int bpf_list_push_front_impl(struct bpf_list_head *head,
+				    struct bpf_list_node *node,
+				    void *meta, __u64 off) __ksym;
+
+/* Convenience macro to wrap over bpf_list_push_front_impl */
+#define bpf_list_push_front(head, node) bpf_list_push_front_impl(head, node, NULL, 0)
 
 /* Description
  *	Add a new entry to the end of the BPF linked list.
+ *
+ *	The 'meta' and 'off' parameters are rewritten by the verifier, no need
+ *	for BPF programs to set them
  * Returns
- *	Void.
+ *	0 if the node was successfully added
+ *	-EINVAL if the node wasn't added because it's already in a list
  */
-extern void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node) __ksym;
+extern int bpf_list_push_back_impl(struct bpf_list_head *head,
+				   struct bpf_list_node *node,
+				   void *meta, __u64 off) __ksym;
+
+/* Convenience macro to wrap over bpf_list_push_back_impl */
+#define bpf_list_push_back(head, node) bpf_list_push_back_impl(head, node, NULL, 0)
 
 /* Description
  *	Remove the entry at the beginning of the BPF linked list.
@@ -88,11 +109,19 @@ extern struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
 
 /* Description
  *	Add 'node' to rbtree with root 'root' using comparator 'less'
+ *
+ *	The 'meta' and 'off' parameters are rewritten by the verifier, no need
+ *	for BPF programs to set them
  * Returns
- *	Nothing
+ *	0 if the node was successfully added
+ *	-EINVAL if the node wasn't added because it's already in a tree
  */
-extern void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
-			   bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b)) __ksym;
+extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
+			       bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b),
+			       void *meta, __u64 off) __ksym;
+
+/* Convenience macro to wrap over bpf_rbtree_add_impl */
+#define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0)
 
 /* Description
  *	Return the first (leftmost) node in input tree
-- 
2.34.1



* [PATCH v2 bpf-next 6/9] selftests/bpf: Modify linked_list tests to work with macro-ified inserts
  2023-04-15 20:18 [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs Dave Marchevsky
                   ` (4 preceding siblings ...)
  2023-04-15 20:18 ` [PATCH v2 bpf-next 5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail Dave Marchevsky
@ 2023-04-15 20:18 ` Dave Marchevsky
  2023-04-15 20:18 ` [PATCH v2 bpf-next 7/9] bpf: Migrate bpf_rbtree_remove to possibly fail Dave Marchevsky
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

The linked_list tests use macros and function pointers to reduce code
duplication. Earlier in the series, bpf_list_push_{front,back} were
modified to be macros, expanding to invoke actual kfuncs
bpf_list_push_{front,back}_impl. Due to this change, a code snippet
like:

  void (*p)(void *, void *) = (void *)&bpf_list_##op;
  p(hexpr, nexpr);

meant to do bpf_list_push_{front,back}(hexpr, nexpr), will no longer
work as it's no longer valid to do &bpf_list_push_{front,back} since
they're no longer functions.
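
Since a macro's address can't be taken, the affected helpers in the
diffs below either expand the macro directly at the call site or take a
flag selecting the operation, roughly:

  /* Before: op is a function pointer to bpf_list_push_{front,back} */
  op(&ghead, &f->node);

  /* After: a bool selects which push macro to expand */
  if (push_front)
    bpf_list_push_front(&ghead, &f->node2);
  else
    bpf_list_push_back(&ghead, &f->node2);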

This patch fixes issues of this type and makes two other minor changes
- one improvement and one fix - both related to the node argument to
list_push_{front,back}.

  * The fix: migration of list_push tests away from (void *, void *)
    func ptr uncovered that some tests were incorrectly passing pointer
    to node, not pointer to struct bpf_list_node within the node. This
    patch fixes such issues (CHECK(..., f) -> CHECK(..., &f->node))

  * The improvement: In linked_list tests, the struct foo type has two
    list_node fields: node and node2, at byte offsets 0 and 40 within
    the struct, respectively. Currently node is used in ~all tests
    involving struct foo and lists. The verifier needs to do some work
    to account for the offset of bpf_list_node within the node type, so
    using node2 instead of node exercises that logic more in the tests.
    This patch migrates linked_list tests to use node2 instead of node.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 .../selftests/bpf/prog_tests/linked_list.c    |  6 +-
 .../testing/selftests/bpf/progs/linked_list.c | 34 +++----
 .../testing/selftests/bpf/progs/linked_list.h |  4 +-
 .../selftests/bpf/progs/linked_list_fail.c    | 96 ++++++++++---------
 4 files changed, 73 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/linked_list.c b/tools/testing/selftests/bpf/prog_tests/linked_list.c
index 0ed8132ce1c3..872e4bd500fd 100644
--- a/tools/testing/selftests/bpf/prog_tests/linked_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/linked_list.c
@@ -84,11 +84,11 @@ static struct {
 	{ "double_push_back", "arg#1 expected pointer to allocated object" },
 	{ "no_node_value_type", "bpf_list_node not found at offset=0" },
 	{ "incorrect_value_type",
-	  "operation on bpf_list_head expects arg#1 bpf_list_node at offset=0 in struct foo, "
+	  "operation on bpf_list_head expects arg#1 bpf_list_node at offset=40 in struct foo, "
 	  "but arg is at offset=0 in struct bar" },
 	{ "incorrect_node_var_off", "variable ptr_ access var_off=(0x0; 0xffffffff) disallowed" },
-	{ "incorrect_node_off1", "bpf_list_node not found at offset=1" },
-	{ "incorrect_node_off2", "arg#1 offset=40, but expected bpf_list_node at offset=0 in struct foo" },
+	{ "incorrect_node_off1", "bpf_list_node not found at offset=41" },
+	{ "incorrect_node_off2", "arg#1 offset=0, but expected bpf_list_node at offset=40 in struct foo" },
 	{ "no_head_type", "bpf_list_head not found at offset=0" },
 	{ "incorrect_head_var_off1", "R1 doesn't have constant offset" },
 	{ "incorrect_head_var_off2", "variable ptr_ access var_off=(0x0; 0xffffffff) disallowed" },
diff --git a/tools/testing/selftests/bpf/progs/linked_list.c b/tools/testing/selftests/bpf/progs/linked_list.c
index 53ded51a3abb..57440a554304 100644
--- a/tools/testing/selftests/bpf/progs/linked_list.c
+++ b/tools/testing/selftests/bpf/progs/linked_list.c
@@ -25,7 +25,7 @@ int list_push_pop(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool l
 	n = bpf_list_pop_front(head);
 	bpf_spin_unlock(lock);
 	if (n) {
-		bpf_obj_drop(container_of(n, struct foo, node));
+		bpf_obj_drop(container_of(n, struct foo, node2));
 		bpf_obj_drop(f);
 		return 3;
 	}
@@ -34,7 +34,7 @@ int list_push_pop(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool l
 	n = bpf_list_pop_back(head);
 	bpf_spin_unlock(lock);
 	if (n) {
-		bpf_obj_drop(container_of(n, struct foo, node));
+		bpf_obj_drop(container_of(n, struct foo, node2));
 		bpf_obj_drop(f);
 		return 4;
 	}
@@ -42,7 +42,7 @@ int list_push_pop(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool l
 
 	bpf_spin_lock(lock);
 	f->data = 42;
-	bpf_list_push_front(head, &f->node);
+	bpf_list_push_front(head, &f->node2);
 	bpf_spin_unlock(lock);
 	if (leave_in_map)
 		return 0;
@@ -51,7 +51,7 @@ int list_push_pop(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool l
 	bpf_spin_unlock(lock);
 	if (!n)
 		return 5;
-	f = container_of(n, struct foo, node);
+	f = container_of(n, struct foo, node2);
 	if (f->data != 42) {
 		bpf_obj_drop(f);
 		return 6;
@@ -59,14 +59,14 @@ int list_push_pop(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool l
 
 	bpf_spin_lock(lock);
 	f->data = 13;
-	bpf_list_push_front(head, &f->node);
+	bpf_list_push_front(head, &f->node2);
 	bpf_spin_unlock(lock);
 	bpf_spin_lock(lock);
 	n = bpf_list_pop_front(head);
 	bpf_spin_unlock(lock);
 	if (!n)
 		return 7;
-	f = container_of(n, struct foo, node);
+	f = container_of(n, struct foo, node2);
 	if (f->data != 13) {
 		bpf_obj_drop(f);
 		return 8;
@@ -77,7 +77,7 @@ int list_push_pop(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool l
 	n = bpf_list_pop_front(head);
 	bpf_spin_unlock(lock);
 	if (n) {
-		bpf_obj_drop(container_of(n, struct foo, node));
+		bpf_obj_drop(container_of(n, struct foo, node2));
 		return 9;
 	}
 
@@ -85,7 +85,7 @@ int list_push_pop(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool l
 	n = bpf_list_pop_back(head);
 	bpf_spin_unlock(lock);
 	if (n) {
-		bpf_obj_drop(container_of(n, struct foo, node));
+		bpf_obj_drop(container_of(n, struct foo, node2));
 		return 10;
 	}
 	return 0;
@@ -119,8 +119,8 @@ int list_push_pop_multiple(struct bpf_spin_lock *lock, struct bpf_list_head *hea
 		f[i + 1]->data = i + 1;
 
 		bpf_spin_lock(lock);
-		bpf_list_push_front(head, &f[i]->node);
-		bpf_list_push_front(head, &f[i + 1]->node);
+		bpf_list_push_front(head, &f[i]->node2);
+		bpf_list_push_front(head, &f[i + 1]->node2);
 		bpf_spin_unlock(lock);
 	}
 
@@ -130,13 +130,13 @@ int list_push_pop_multiple(struct bpf_spin_lock *lock, struct bpf_list_head *hea
 		bpf_spin_unlock(lock);
 		if (!n)
 			return 3;
-		pf = container_of(n, struct foo, node);
+		pf = container_of(n, struct foo, node2);
 		if (pf->data != (ARRAY_SIZE(f) - i - 1)) {
 			bpf_obj_drop(pf);
 			return 4;
 		}
 		bpf_spin_lock(lock);
-		bpf_list_push_back(head, &pf->node);
+		bpf_list_push_back(head, &pf->node2);
 		bpf_spin_unlock(lock);
 	}
 
@@ -149,7 +149,7 @@ int list_push_pop_multiple(struct bpf_spin_lock *lock, struct bpf_list_head *hea
 		bpf_spin_unlock(lock);
 		if (!n)
 			return 5;
-		pf = container_of(n, struct foo, node);
+		pf = container_of(n, struct foo, node2);
 		if (pf->data != i) {
 			bpf_obj_drop(pf);
 			return 6;
@@ -160,7 +160,7 @@ int list_push_pop_multiple(struct bpf_spin_lock *lock, struct bpf_list_head *hea
 	n = bpf_list_pop_back(head);
 	bpf_spin_unlock(lock);
 	if (n) {
-		bpf_obj_drop(container_of(n, struct foo, node));
+		bpf_obj_drop(container_of(n, struct foo, node2));
 		return 7;
 	}
 
@@ -168,7 +168,7 @@ int list_push_pop_multiple(struct bpf_spin_lock *lock, struct bpf_list_head *hea
 	n = bpf_list_pop_front(head);
 	bpf_spin_unlock(lock);
 	if (n) {
-		bpf_obj_drop(container_of(n, struct foo, node));
+		bpf_obj_drop(container_of(n, struct foo, node2));
 		return 8;
 	}
 	return 0;
@@ -199,7 +199,7 @@ int list_in_list(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool le
 
 	bpf_spin_lock(lock);
 	f->data = 42;
-	bpf_list_push_front(head, &f->node);
+	bpf_list_push_front(head, &f->node2);
 	bpf_spin_unlock(lock);
 
 	if (leave_in_map)
@@ -210,7 +210,7 @@ int list_in_list(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool le
 	bpf_spin_unlock(lock);
 	if (!n)
 		return 4;
-	f = container_of(n, struct foo, node);
+	f = container_of(n, struct foo, node2);
 	if (f->data != 42) {
 		bpf_obj_drop(f);
 		return 5;
diff --git a/tools/testing/selftests/bpf/progs/linked_list.h b/tools/testing/selftests/bpf/progs/linked_list.h
index 3fb2412552fc..c0f3609a7ffa 100644
--- a/tools/testing/selftests/bpf/progs/linked_list.h
+++ b/tools/testing/selftests/bpf/progs/linked_list.h
@@ -22,7 +22,7 @@ struct foo {
 struct map_value {
 	struct bpf_spin_lock lock;
 	int data;
-	struct bpf_list_head head __contains(foo, node);
+	struct bpf_list_head head __contains(foo, node2);
 };
 
 struct array_map {
@@ -50,7 +50,7 @@ struct {
 #define private(name) SEC(".bss." #name) __hidden __attribute__((aligned(8)))
 
 private(A) struct bpf_spin_lock glock;
-private(A) struct bpf_list_head ghead __contains(foo, node);
+private(A) struct bpf_list_head ghead __contains(foo, node2);
 private(B) struct bpf_spin_lock glock2;
 
 #endif
diff --git a/tools/testing/selftests/bpf/progs/linked_list_fail.c b/tools/testing/selftests/bpf/progs/linked_list_fail.c
index 41978b46f58e..f4c63daba229 100644
--- a/tools/testing/selftests/bpf/progs/linked_list_fail.c
+++ b/tools/testing/selftests/bpf/progs/linked_list_fail.c
@@ -73,22 +73,21 @@ CHECK(inner_map, pop_back, &iv->head);
 	int test##_missing_lock_##op(void *ctx)				\
 	{								\
 		INIT;							\
-		void (*p)(void *, void *) = (void *)&bpf_list_##op;	\
-		p(hexpr, nexpr);					\
+		bpf_list_##op(hexpr, nexpr);				\
 		return 0;						\
 	}
 
-CHECK(kptr, push_front, &f->head, b);
-CHECK(kptr, push_back, &f->head, b);
+CHECK(kptr, push_front, &f->head, &b->node);
+CHECK(kptr, push_back, &f->head, &b->node);
 
-CHECK(global, push_front, &ghead, f);
-CHECK(global, push_back, &ghead, f);
+CHECK(global, push_front, &ghead, &f->node2);
+CHECK(global, push_back, &ghead, &f->node2);
 
-CHECK(map, push_front, &v->head, f);
-CHECK(map, push_back, &v->head, f);
+CHECK(map, push_front, &v->head, &f->node2);
+CHECK(map, push_back, &v->head, &f->node2);
 
-CHECK(inner_map, push_front, &iv->head, f);
-CHECK(inner_map, push_back, &iv->head, f);
+CHECK(inner_map, push_front, &iv->head, &f->node2);
+CHECK(inner_map, push_back, &iv->head, &f->node2);
 
 #undef CHECK
 
@@ -135,32 +134,31 @@ CHECK_OP(pop_back);
 	int test##_incorrect_lock_##op(void *ctx)			\
 	{								\
 		INIT;							\
-		void (*p)(void *, void*) = (void *)&bpf_list_##op;	\
 		bpf_spin_lock(lexpr);					\
-		p(hexpr, nexpr);					\
+		bpf_list_##op(hexpr, nexpr);				\
 		return 0;						\
 	}
 
 #define CHECK_OP(op)							\
-	CHECK(kptr_kptr, op, &f1->lock, &f2->head, b);			\
-	CHECK(kptr_global, op, &f1->lock, &ghead, f);			\
-	CHECK(kptr_map, op, &f1->lock, &v->head, f);			\
-	CHECK(kptr_inner_map, op, &f1->lock, &iv->head, f);		\
+	CHECK(kptr_kptr, op, &f1->lock, &f2->head, &b->node);		\
+	CHECK(kptr_global, op, &f1->lock, &ghead, &f->node2);		\
+	CHECK(kptr_map, op, &f1->lock, &v->head, &f->node2);		\
+	CHECK(kptr_inner_map, op, &f1->lock, &iv->head, &f->node2);	\
 									\
-	CHECK(global_global, op, &glock2, &ghead, f);			\
-	CHECK(global_kptr, op, &glock, &f1->head, b);			\
-	CHECK(global_map, op, &glock, &v->head, f);			\
-	CHECK(global_inner_map, op, &glock, &iv->head, f);		\
+	CHECK(global_global, op, &glock2, &ghead, &f->node2);		\
+	CHECK(global_kptr, op, &glock, &f1->head, &b->node);		\
+	CHECK(global_map, op, &glock, &v->head, &f->node2);		\
+	CHECK(global_inner_map, op, &glock, &iv->head, &f->node2);	\
 									\
-	CHECK(map_map, op, &v->lock, &v2->head, f);			\
-	CHECK(map_kptr, op, &v->lock, &f2->head, b);			\
-	CHECK(map_global, op, &v->lock, &ghead, f);			\
-	CHECK(map_inner_map, op, &v->lock, &iv->head, f);		\
+	CHECK(map_map, op, &v->lock, &v2->head, &f->node2);		\
+	CHECK(map_kptr, op, &v->lock, &f2->head, &b->node);		\
+	CHECK(map_global, op, &v->lock, &ghead, &f->node2);		\
+	CHECK(map_inner_map, op, &v->lock, &iv->head, &f->node2);	\
 									\
-	CHECK(inner_map_inner_map, op, &iv->lock, &iv2->head, f);	\
-	CHECK(inner_map_kptr, op, &iv->lock, &f2->head, b);		\
-	CHECK(inner_map_global, op, &iv->lock, &ghead, f);		\
-	CHECK(inner_map_map, op, &iv->lock, &v->head, f);
+	CHECK(inner_map_inner_map, op, &iv->lock, &iv2->head, &f->node2);\
+	CHECK(inner_map_kptr, op, &iv->lock, &f2->head, &b->node);	\
+	CHECK(inner_map_global, op, &iv->lock, &ghead, &f->node2);	\
+	CHECK(inner_map_map, op, &iv->lock, &v->head, &f->node2);
 
 CHECK_OP(push_front);
 CHECK_OP(push_back);
@@ -340,7 +338,7 @@ int direct_read_node(void *ctx)
 	f = bpf_obj_new(typeof(*f));
 	if (!f)
 		return 0;
-	return *(int *)&f->node;
+	return *(int *)&f->node2;
 }
 
 SEC("?tc")
@@ -351,12 +349,12 @@ int direct_write_node(void *ctx)
 	f = bpf_obj_new(typeof(*f));
 	if (!f)
 		return 0;
-	*(int *)&f->node = 0;
+	*(int *)&f->node2 = 0;
 	return 0;
 }
 
 static __always_inline
-int use_after_unlock(void (*op)(void *head, void *node))
+int use_after_unlock(bool push_front)
 {
 	struct foo *f;
 
@@ -365,7 +363,10 @@ int use_after_unlock(void (*op)(void *head, void *node))
 		return 0;
 	bpf_spin_lock(&glock);
 	f->data = 42;
-	op(&ghead, &f->node);
+	if (push_front)
+		bpf_list_push_front(&ghead, &f->node2);
+	else
+		bpf_list_push_back(&ghead, &f->node2);
 	bpf_spin_unlock(&glock);
 
 	return f->data;
@@ -374,17 +375,17 @@ int use_after_unlock(void (*op)(void *head, void *node))
 SEC("?tc")
 int use_after_unlock_push_front(void *ctx)
 {
-	return use_after_unlock((void *)bpf_list_push_front);
+	return use_after_unlock(true);
 }
 
 SEC("?tc")
 int use_after_unlock_push_back(void *ctx)
 {
-	return use_after_unlock((void *)bpf_list_push_back);
+	return use_after_unlock(false);
 }
 
 static __always_inline
-int list_double_add(void (*op)(void *head, void *node))
+int list_double_add(bool push_front)
 {
 	struct foo *f;
 
@@ -392,8 +393,13 @@ int list_double_add(void (*op)(void *head, void *node))
 	if (!f)
 		return 0;
 	bpf_spin_lock(&glock);
-	op(&ghead, &f->node);
-	op(&ghead, &f->node);
+	if (push_front) {
+		bpf_list_push_front(&ghead, &f->node2);
+		bpf_list_push_front(&ghead, &f->node2);
+	} else {
+		bpf_list_push_back(&ghead, &f->node2);
+		bpf_list_push_back(&ghead, &f->node2);
+	}
 	bpf_spin_unlock(&glock);
 
 	return 0;
@@ -402,13 +408,13 @@ int list_double_add(void (*op)(void *head, void *node))
 SEC("?tc")
 int double_push_front(void *ctx)
 {
-	return list_double_add((void *)bpf_list_push_front);
+	return list_double_add(true);
 }
 
 SEC("?tc")
 int double_push_back(void *ctx)
 {
-	return list_double_add((void *)bpf_list_push_back);
+	return list_double_add(false);
 }
 
 SEC("?tc")
@@ -450,7 +456,7 @@ int incorrect_node_var_off(struct __sk_buff *ctx)
 	if (!f)
 		return 0;
 	bpf_spin_lock(&glock);
-	bpf_list_push_front(&ghead, (void *)&f->node + ctx->protocol);
+	bpf_list_push_front(&ghead, (void *)&f->node2 + ctx->protocol);
 	bpf_spin_unlock(&glock);
 
 	return 0;
@@ -465,7 +471,7 @@ int incorrect_node_off1(void *ctx)
 	if (!f)
 		return 0;
 	bpf_spin_lock(&glock);
-	bpf_list_push_front(&ghead, (void *)&f->node + 1);
+	bpf_list_push_front(&ghead, (void *)&f->node2 + 1);
 	bpf_spin_unlock(&glock);
 
 	return 0;
@@ -480,7 +486,7 @@ int incorrect_node_off2(void *ctx)
 	if (!f)
 		return 0;
 	bpf_spin_lock(&glock);
-	bpf_list_push_front(&ghead, &f->node2);
+	bpf_list_push_front(&ghead, &f->node);
 	bpf_spin_unlock(&glock);
 
 	return 0;
@@ -510,7 +516,7 @@ int incorrect_head_var_off1(struct __sk_buff *ctx)
 	if (!f)
 		return 0;
 	bpf_spin_lock(&glock);
-	bpf_list_push_front((void *)&ghead + ctx->protocol, &f->node);
+	bpf_list_push_front((void *)&ghead + ctx->protocol, &f->node2);
 	bpf_spin_unlock(&glock);
 
 	return 0;
@@ -525,7 +531,7 @@ int incorrect_head_var_off2(struct __sk_buff *ctx)
 	if (!f)
 		return 0;
 	bpf_spin_lock(&glock);
-	bpf_list_push_front((void *)&f->head + ctx->protocol, &f->node);
+	bpf_list_push_front((void *)&f->head + ctx->protocol, &f->node2);
 	bpf_spin_unlock(&glock);
 
 	return 0;
@@ -563,7 +569,7 @@ int incorrect_head_off2(void *ctx)
 		return 0;
 
 	bpf_spin_lock(&glock);
-	bpf_list_push_front((void *)&ghead + 1, &f->node);
+	bpf_list_push_front((void *)&ghead + 1, &f->node2);
 	bpf_spin_unlock(&glock);
 
 	return 0;
-- 
2.34.1



* [PATCH v2 bpf-next 7/9] bpf: Migrate bpf_rbtree_remove to possibly fail
  2023-04-15 20:18 [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs Dave Marchevsky
                   ` (5 preceding siblings ...)
  2023-04-15 20:18 ` [PATCH v2 bpf-next 6/9] selftests/bpf: Modify linked_list tests to work with macro-ified inserts Dave Marchevsky
@ 2023-04-15 20:18 ` Dave Marchevsky
  2023-04-15 20:18 ` [PATCH v2 bpf-next 8/9] bpf: Centralize btf_field-specific initialization logic Dave Marchevsky
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

This patch modifies bpf_rbtree_remove to account for possible failure
due to the input rb_node not being in any collection.
The function can now return NULL, and does when the aforementioned
scenario occurs. As before, on successful removal an owning reference to
the removed node is returned.

Adding KF_RET_NULL to bpf_rbtree_remove's kfunc flags - now KF_RET_NULL |
KF_ACQUIRE - provides the desired verifier semantics:

  * retval must be checked for NULL before use
  * if NULL, retval's ref_obj_id is released
  * retval is a "maybe acquired" owning ref, not a non-owning ref,
    so it will live past end of critical section (bpf_spin_unlock), and
    thus can be checked for NULL after the end of the CS

BPF programs must add checks
============================

This does change bpf_rbtree_remove's verifier behavior. BPF program
writers will need to add NULL checks to their programs, but the
resulting UX looks natural:

  bpf_spin_lock(&glock);

  n = bpf_rbtree_first(&ghead);
  if (!n) { /* ... */ }
  res = bpf_rbtree_remove(&ghead, &n->node);

  bpf_spin_unlock(&glock);

  if (!res)  /* Newly-added check after this patch */
    return 1;

  n = container_of(res, /* ... */);
  /* Do something else with n */
  bpf_obj_drop(n);
  return 0;

The "if (!res)" check above is the only addition necessary for the above
program to pass verification after this patch.

bpf_rbtree_remove no longer clobbers non-owning refs
====================================================

An issue arises when bpf_rbtree_remove fails, though. Consider this
example:

  struct node_data {
    long key;
    struct bpf_list_node l;
    struct bpf_rb_node r;
    struct bpf_refcount ref;
  };

  long failed_sum;

  void bpf_prog()
  {
    struct node_data *n = bpf_obj_new(/* ... */);
    struct bpf_rb_node *res;
    n->key = 10;

    bpf_spin_lock(&glock);

    bpf_list_push_back(&some_list, &n->l); /* n is now a non-owning ref */
    res = bpf_rbtree_remove(&some_tree, &n->r, /* ... */);
    if (!res)
      failed_sum += n->key;  /* not possible */

    bpf_spin_unlock(&glock);
    /* if (res) { do something useful and drop } ... */
  }

The bpf_rbtree_remove in this example will always fail. Similarly to
bpf_spin_unlock, bpf_rbtree_remove is a non-owning reference
invalidation point. The verifier clobbers all non-owning refs after a
bpf_rbtree_remove call, so the "failed_sum += n->key" line will fail
verification, and in fact there's no good way to get information about
the node whose removal failed after the invalidation. This patch removes
non-owning reference invalidation from bpf_rbtree_remove to allow the
above usecase to pass verification. The logic for why this is now
possible is as follows:

Before this series, bpf_rbtree_remove couldn't fail and thus assumed that
its input, a non-owning reference, was in the tree. But it's easy to
construct an example where two non-owning references pointing to the same
underlying memory are acquired and passed to rbtree_remove one after
another (see rbtree_api_release_aliasing in
selftests/bpf/progs/rbtree_fail.c).

So it was necessary to clobber non-owning refs to prevent this case
and, more generally, to enforce the "non-owning ref is definitely in
some collection" invariant. This series removes that invariant, and the
failure / runtime checking added in this patch provides a clean way to
deal with the aliasing issue - just fail to remove.

Because the aliasing issue previously prevented by clobbering non-owning
refs is now handled at runtime, this patch removes the
invalidate_non_owning_refs call from verifier handling of
bpf_rbtree_remove. Note that
bpf_spin_unlock - the other caller of invalidate_non_owning_refs -
clobbers non-owning refs for a different reason, so its clobbering
behavior remains unchanged.

No BPF program changes are necessary for programs to remain valid as a
result of this clobbering change. A valid program before this patch
passed verification with its non-owning refs having shorter (or equal)
lifetimes due to more aggressive clobbering.

Also, update existing tests to check bpf_rbtree_remove retval for NULL
where necessary, and move rbtree_api_release_aliasing from
progs/rbtree_fail.c to progs/rbtree.c since it's now expected to pass
verification.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 kernel/bpf/btf.c                              | 21 +----
 kernel/bpf/helpers.c                          |  8 +-
 kernel/bpf/verifier.c                         |  3 -
 .../selftests/bpf/prog_tests/linked_list.c    | 90 ++++++++++++-------
 .../testing/selftests/bpf/prog_tests/rbtree.c | 25 ++++++
 tools/testing/selftests/bpf/progs/rbtree.c    | 74 ++++++++++++++-
 .../testing/selftests/bpf/progs/rbtree_fail.c | 77 ++++++----------
 7 files changed, 191 insertions(+), 107 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 14889fd5ba8e..027f9f8a3551 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3805,25 +3805,8 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 		goto end;
 	}
 
-	/* need collection identity for non-owning refs before allowing this
-	 *
-	 * Consider a node type w/ both list and rb_node fields:
-	 *   struct node {
-	 *     struct bpf_list_node l;
-	 *     struct bpf_rb_node r;
-	 *   }
-	 *
-	 * Used like so:
-	 *   struct node *n = bpf_obj_new(....);
-	 *   bpf_list_push_front(&list_head, &n->l);
-	 *   bpf_rbtree_remove(&rb_root, &n->r);
-	 *
-	 * It should not be possible to rbtree_remove the node since it hasn't
-	 * been added to a tree. But push_front converts n to a non-owning
-	 * reference, and rbtree_remove accepts the non-owning reference to
-	 * a type w/ bpf_rb_node field.
-	 */
-	if (btf_record_has_field(rec, BPF_LIST_NODE) &&
+	if (rec->refcount_off < 0 &&
+	    btf_record_has_field(rec, BPF_LIST_NODE) &&
 	    btf_record_has_field(rec, BPF_RB_NODE)) {
 		ret = -EINVAL;
 		goto end;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5067f8d46872..1835df333287 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2000,6 +2000,12 @@ __bpf_kfunc struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
 	struct rb_root_cached *r = (struct rb_root_cached *)root;
 	struct rb_node *n = (struct rb_node *)node;
 
+	if (!n->__rb_parent_color)
+		RB_CLEAR_NODE(n);
+
+	if (RB_EMPTY_NODE(n))
+		return NULL;
+
 	rb_erase_cached(n, r);
 	RB_CLEAR_NODE(n);
 	return (struct bpf_rb_node *)n;
@@ -2328,7 +2334,7 @@ BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
-BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE)
+BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_rbtree_add_impl)
 BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 736cb7cec0bd..6a41b69a424e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10922,9 +10922,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			ref_set_non_owning(env, &regs[BPF_REG_0]);
 		}
 
-		if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove])
-			invalidate_non_owning_refs(env);
-
 		if (reg_may_point_to_spin_lock(&regs[BPF_REG_0]) && !regs[BPF_REG_0].id)
 			regs[BPF_REG_0].id = ++env->id_gen;
 	} else if (btf_type_is_void(t)) {
diff --git a/tools/testing/selftests/bpf/prog_tests/linked_list.c b/tools/testing/selftests/bpf/prog_tests/linked_list.c
index 872e4bd500fd..f63309fd0e28 100644
--- a/tools/testing/selftests/bpf/prog_tests/linked_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/linked_list.c
@@ -266,6 +266,59 @@ static struct btf *init_btf(void)
 	return NULL;
 }
 
+static void list_and_rb_node_same_struct(bool refcount_field)
+{
+	int bpf_rb_node_btf_id, bpf_refcount_btf_id, foo_btf_id;
+	struct btf *btf;
+	int id, err;
+
+	btf = init_btf();
+	if (!ASSERT_OK_PTR(btf, "init_btf"))
+		return;
+
+	bpf_rb_node_btf_id = btf__add_struct(btf, "bpf_rb_node", 24);
+	if (!ASSERT_GT(bpf_rb_node_btf_id, 0, "btf__add_struct bpf_rb_node"))
+		return;
+
+	if (refcount_field) {
+		bpf_refcount_btf_id = btf__add_struct(btf, "bpf_refcount", 4);
+		if (!ASSERT_GT(bpf_refcount_btf_id, 0, "btf__add_struct bpf_refcount"))
+			return;
+	}
+
+	id = btf__add_struct(btf, "bar", refcount_field ? 44 : 40);
+	if (!ASSERT_GT(id, 0, "btf__add_struct bar"))
+		return;
+	err = btf__add_field(btf, "a", LIST_NODE, 0, 0);
+	if (!ASSERT_OK(err, "btf__add_field bar::a"))
+		return;
+	err = btf__add_field(btf, "c", bpf_rb_node_btf_id, 128, 0);
+	if (!ASSERT_OK(err, "btf__add_field bar::c"))
+		return;
+	if (refcount_field) {
+		err = btf__add_field(btf, "ref", bpf_refcount_btf_id, 320, 0);
+		if (!ASSERT_OK(err, "btf__add_field bar::ref"))
+			return;
+	}
+
+	foo_btf_id = btf__add_struct(btf, "foo", 20);
+	if (!ASSERT_GT(foo_btf_id, 0, "btf__add_struct foo"))
+		return;
+	err = btf__add_field(btf, "a", LIST_HEAD, 0, 0);
+	if (!ASSERT_OK(err, "btf__add_field foo::a"))
+		return;
+	err = btf__add_field(btf, "b", SPIN_LOCK, 128, 0);
+	if (!ASSERT_OK(err, "btf__add_field foo::b"))
+		return;
+	id = btf__add_decl_tag(btf, "contains:bar:a", foo_btf_id, 0);
+	if (!ASSERT_GT(id, 0, "btf__add_decl_tag contains:bar:a"))
+		return;
+
+	err = btf__load_into_kernel(btf);
+	ASSERT_EQ(err, refcount_field ? 0 : -EINVAL, "check btf");
+	btf__free(btf);
+}
+
 static void test_btf(void)
 {
 	struct btf *btf = NULL;
@@ -717,39 +770,12 @@ static void test_btf(void)
 	}
 
 	while (test__start_subtest("btf: list_node and rb_node in same struct")) {
-		btf = init_btf();
-		if (!ASSERT_OK_PTR(btf, "init_btf"))
-			break;
-
-		id = btf__add_struct(btf, "bpf_rb_node", 24);
-		if (!ASSERT_EQ(id, 5, "btf__add_struct bpf_rb_node"))
-			break;
-		id = btf__add_struct(btf, "bar", 40);
-		if (!ASSERT_EQ(id, 6, "btf__add_struct bar"))
-			break;
-		err = btf__add_field(btf, "a", LIST_NODE, 0, 0);
-		if (!ASSERT_OK(err, "btf__add_field bar::a"))
-			break;
-		err = btf__add_field(btf, "c", 5, 128, 0);
-		if (!ASSERT_OK(err, "btf__add_field bar::c"))
-			break;
-
-		id = btf__add_struct(btf, "foo", 20);
-		if (!ASSERT_EQ(id, 7, "btf__add_struct foo"))
-			break;
-		err = btf__add_field(btf, "a", LIST_HEAD, 0, 0);
-		if (!ASSERT_OK(err, "btf__add_field foo::a"))
-			break;
-		err = btf__add_field(btf, "b", SPIN_LOCK, 128, 0);
-		if (!ASSERT_OK(err, "btf__add_field foo::b"))
-			break;
-		id = btf__add_decl_tag(btf, "contains:bar:a", 7, 0);
-		if (!ASSERT_EQ(id, 8, "btf__add_decl_tag contains:bar:a"))
-			break;
+		list_and_rb_node_same_struct(true);
+		break;
+	}
 
-		err = btf__load_into_kernel(btf);
-		ASSERT_EQ(err, -EINVAL, "check btf");
-		btf__free(btf);
+	while (test__start_subtest("btf: list_node and rb_node in same struct, no bpf_refcount")) {
+		list_and_rb_node_same_struct(false);
 		break;
 	}
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/rbtree.c b/tools/testing/selftests/bpf/prog_tests/rbtree.c
index 156fa95c42f6..e9300c96607d 100644
--- a/tools/testing/selftests/bpf/prog_tests/rbtree.c
+++ b/tools/testing/selftests/bpf/prog_tests/rbtree.c
@@ -77,6 +77,29 @@ static void test_rbtree_first_and_remove(void)
 	rbtree__destroy(skel);
 }
 
+static void test_rbtree_api_release_aliasing(void)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts,
+		    .data_in = &pkt_v4,
+		    .data_size_in = sizeof(pkt_v4),
+		    .repeat = 1,
+	);
+	struct rbtree *skel;
+	int ret;
+
+	skel = rbtree__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "rbtree__open_and_load"))
+		return;
+
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.rbtree_api_release_aliasing), &opts);
+	ASSERT_OK(ret, "rbtree_api_release_aliasing");
+	ASSERT_OK(opts.retval, "rbtree_api_release_aliasing retval");
+	ASSERT_EQ(skel->data->first_data[0], 42, "rbtree_api_release_aliasing first rbtree_remove()");
+	ASSERT_EQ(skel->data->first_data[1], -1, "rbtree_api_release_aliasing second rbtree_remove()");
+
+	rbtree__destroy(skel);
+}
+
 void test_rbtree_success(void)
 {
 	if (test__start_subtest("rbtree_add_nodes"))
@@ -85,6 +108,8 @@ void test_rbtree_success(void)
 		test_rbtree_add_and_remove();
 	if (test__start_subtest("rbtree_first_and_remove"))
 		test_rbtree_first_and_remove();
+	if (test__start_subtest("rbtree_api_release_aliasing"))
+		test_rbtree_api_release_aliasing();
 }
 
 #define BTF_FAIL_TEST(suffix)									\
diff --git a/tools/testing/selftests/bpf/progs/rbtree.c b/tools/testing/selftests/bpf/progs/rbtree.c
index 4c90aa6abddd..b09f4fffe57c 100644
--- a/tools/testing/selftests/bpf/progs/rbtree.c
+++ b/tools/testing/selftests/bpf/progs/rbtree.c
@@ -93,9 +93,11 @@ long rbtree_add_and_remove(void *ctx)
 	res = bpf_rbtree_remove(&groot, &n->node);
 	bpf_spin_unlock(&glock);
 
+	if (!res)
+		return 1;
+
 	n = container_of(res, struct node_data, node);
 	removed_key = n->key;
-
 	bpf_obj_drop(n);
 
 	return 0;
@@ -148,9 +150,11 @@ long rbtree_first_and_remove(void *ctx)
 	res = bpf_rbtree_remove(&groot, &o->node);
 	bpf_spin_unlock(&glock);
 
+	if (!res)
+		return 5;
+
 	o = container_of(res, struct node_data, node);
 	removed_key = o->key;
-
 	bpf_obj_drop(o);
 
 	bpf_spin_lock(&glock);
@@ -173,4 +177,70 @@ long rbtree_first_and_remove(void *ctx)
 	return 1;
 }
 
+SEC("tc")
+long rbtree_api_release_aliasing(void *ctx)
+{
+	struct node_data *n, *m, *o;
+	struct bpf_rb_node *res, *res2;
+
+	n = bpf_obj_new(typeof(*n));
+	if (!n)
+		return 1;
+	n->key = 41;
+	n->data = 42;
+
+	bpf_spin_lock(&glock);
+	bpf_rbtree_add(&groot, &n->node, less);
+	bpf_spin_unlock(&glock);
+
+	bpf_spin_lock(&glock);
+
+	/* m and o point to the same node,
+	 * but verifier doesn't know this
+	 */
+	res = bpf_rbtree_first(&groot);
+	if (!res)
+		goto err_out;
+	o = container_of(res, struct node_data, node);
+
+	res = bpf_rbtree_first(&groot);
+	if (!res)
+		goto err_out;
+	m = container_of(res, struct node_data, node);
+
+	res = bpf_rbtree_remove(&groot, &m->node);
+	/* Retval of previous remove returns an owning reference to m,
+	 * which is the same node non-owning ref o is pointing at.
+	 * We can safely try to remove o as the second rbtree_remove will
+	 * return NULL since the node isn't in a tree.
+	 *
+	 * Previously we relied on the verifier type system + rbtree_remove
+	 * invalidating non-owning refs to ensure that rbtree_remove couldn't
+	 * fail, but now rbtree_remove does runtime checking so we no longer
+	 * invalidate non-owning refs after remove.
+	 */
+	res2 = bpf_rbtree_remove(&groot, &o->node);
+
+	bpf_spin_unlock(&glock);
+
+	if (res) {
+		o = container_of(res, struct node_data, node);
+		first_data[0] = o->data;
+		bpf_obj_drop(o);
+	}
+	if (res2) {
+		/* The second remove fails, so res2 is null and this doesn't
+		 * execute
+		 */
+		m = container_of(res2, struct node_data, node);
+		first_data[1] = m->data;
+		bpf_obj_drop(m);
+	}
+	return 0;
+
+err_out:
+	bpf_spin_unlock(&glock);
+	return 1;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/rbtree_fail.c b/tools/testing/selftests/bpf/progs/rbtree_fail.c
index 46d7d18a218f..3fecf1c6dfe5 100644
--- a/tools/testing/selftests/bpf/progs/rbtree_fail.c
+++ b/tools/testing/selftests/bpf/progs/rbtree_fail.c
@@ -105,7 +105,7 @@ long rbtree_api_remove_unadded_node(void *ctx)
 }
 
 SEC("?tc")
-__failure __msg("Unreleased reference id=2 alloc_insn=10")
+__failure __msg("Unreleased reference id=3 alloc_insn=10")
 long rbtree_api_remove_no_drop(void *ctx)
 {
 	struct bpf_rb_node *res;
@@ -118,11 +118,13 @@ long rbtree_api_remove_no_drop(void *ctx)
 
 	res = bpf_rbtree_remove(&groot, res);
 
-	n = container_of(res, struct node_data, node);
-	__sink(n);
+	if (res) {
+		n = container_of(res, struct node_data, node);
+		__sink(n);
+	}
 	bpf_spin_unlock(&glock);
 
-	/* bpf_obj_drop(n) is missing here */
+	/* if (res) { bpf_obj_drop(n); } is missing here */
 	return 0;
 
 unlock_err:
@@ -150,35 +152,36 @@ long rbtree_api_add_to_multiple_trees(void *ctx)
 }
 
 SEC("?tc")
-__failure __msg("rbtree_remove node input must be non-owning ref")
-long rbtree_api_add_release_unlock_escape(void *ctx)
+__failure __msg("dereference of modified ptr_or_null_ ptr R2 off=16 disallowed")
+long rbtree_api_use_unchecked_remove_retval(void *ctx)
 {
-	struct node_data *n;
-
-	n = bpf_obj_new(typeof(*n));
-	if (!n)
-		return 1;
+	struct bpf_rb_node *res;
 
 	bpf_spin_lock(&glock);
-	bpf_rbtree_add(&groot, &n->node, less);
+
+	res = bpf_rbtree_first(&groot);
+	if (!res)
+		goto err_out;
+	res = bpf_rbtree_remove(&groot, res);
+
 	bpf_spin_unlock(&glock);
 
 	bpf_spin_lock(&glock);
-	/* After add() in previous critical section, n should be
-	 * release_on_unlock and released after previous spin_unlock,
-	 * so should not be possible to use it here
-	 */
-	bpf_rbtree_remove(&groot, &n->node);
+	/* Must check res for NULL before using in rbtree_add below */
+	bpf_rbtree_add(&groot, res, less);
 	bpf_spin_unlock(&glock);
 	return 0;
+
+err_out:
+	bpf_spin_unlock(&glock);
+	return 1;
 }
 
 SEC("?tc")
 __failure __msg("rbtree_remove node input must be non-owning ref")
-long rbtree_api_release_aliasing(void *ctx)
+long rbtree_api_add_release_unlock_escape(void *ctx)
 {
-	struct node_data *n, *m, *o;
-	struct bpf_rb_node *res;
+	struct node_data *n;
 
 	n = bpf_obj_new(typeof(*n));
 	if (!n)
@@ -189,37 +192,11 @@ long rbtree_api_release_aliasing(void *ctx)
 	bpf_spin_unlock(&glock);
 
 	bpf_spin_lock(&glock);
-
-	/* m and o point to the same node,
-	 * but verifier doesn't know this
-	 */
-	res = bpf_rbtree_first(&groot);
-	if (!res)
-		return 1;
-	o = container_of(res, struct node_data, node);
-
-	res = bpf_rbtree_first(&groot);
-	if (!res)
-		return 1;
-	m = container_of(res, struct node_data, node);
-
-	bpf_rbtree_remove(&groot, &m->node);
-	/* This second remove shouldn't be possible. Retval of previous
-	 * remove returns owning reference to m, which is the same
-	 * node o's non-owning ref is pointing at
-	 *
-	 * In order to preserve property
-	 *   * owning ref must not be in rbtree
-	 *   * non-owning ref must be in rbtree
-	 *
-	 * o's ref must be invalidated after previous remove. Otherwise
-	 * we'd have non-owning ref to node that isn't in rbtree, and
-	 * verifier wouldn't be able to use type system to prevent remove
-	 * of ref that already isn't in any tree. Would have to do runtime
-	 * checks in that case.
+	/* After add() in previous critical section, n should be
+	 * release_on_unlock and released after previous spin_unlock,
+	 * so should not be possible to use it here
 	 */
-	bpf_rbtree_remove(&groot, &o->node);
-
+	bpf_rbtree_remove(&groot, &n->node);
 	bpf_spin_unlock(&glock);
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 bpf-next 8/9] bpf: Centralize btf_field-specific initialization logic
  2023-04-15 20:18 [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs Dave Marchevsky
                   ` (6 preceding siblings ...)
  2023-04-15 20:18 ` [PATCH v2 bpf-next 7/9] bpf: Migrate bpf_rbtree_remove to possibly fail Dave Marchevsky
@ 2023-04-15 20:18 ` Dave Marchevsky
  2023-04-15 20:18 ` [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests Dave Marchevsky
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

All btf_fields in an object are 0-initialized by memset in
bpf_obj_init. This might not be a valid initial state for some field
types, in which case kfuncs that use the type will properly initialize
their input if it's been 0-initialized. Some BPF graph collection types
and kfuncs do this: bpf_list_{head,node} and bpf_rb_node.

An earlier patch in this series added the bpf_refcount field, for which
the 0 state indicates that the refcounted object should be free'd.
bpf_obj_init treats this field specially, setting refcount to 1 instead
of relying on scattered "refcount is 0? Must have just been initialized,
let's set to 1" logic in kfuncs.

This patch extends this treatment to list and rbtree field types,
allowing most scattered initialization logic in kfuncs to be removed.

Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case
they'll be 0-initialized without passing through the newly-added logic,
so scattered initialization logic must remain for these collection root
types.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h  | 33 +++++++++++++++++++++++++++++----
 kernel/bpf/helpers.c | 14 ++++++--------
 2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b065de2cfcea..18b592fde896 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -355,6 +355,34 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 	}
 }
 
+static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
+{
+	memset(addr, 0, field->size);
+
+	switch (field->type) {
+	case BPF_REFCOUNT:
+		refcount_set((refcount_t *)addr, 1);
+		break;
+	case BPF_RB_NODE:
+		RB_CLEAR_NODE((struct rb_node *)addr);
+		break;
+	case BPF_LIST_HEAD:
+	case BPF_LIST_NODE:
+		INIT_LIST_HEAD((struct list_head *)addr);
+		break;
+	case BPF_RB_ROOT:
+		/* RB_ROOT_CACHED 0-inits, no need to do anything after memset */
+	case BPF_SPIN_LOCK:
+	case BPF_TIMER:
+	case BPF_KPTR_UNREF:
+	case BPF_KPTR_REF:
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+}
+
 static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_field_type type)
 {
 	if (IS_ERR_OR_NULL(rec))
@@ -369,10 +397,7 @@ static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
 	if (IS_ERR_OR_NULL(rec))
 		return;
 	for (i = 0; i < rec->cnt; i++)
-		memset(obj + rec->fields[i].offset, 0, rec->fields[i].size);
-
-	if (rec->refcount_off >= 0)
-		refcount_set((refcount_t *)(obj + rec->refcount_off), 1);
+		bpf_obj_init_field(&rec->fields[i], obj + rec->fields[i].offset);
 }
 
 /* 'dst' must be a temporary buffer and should not point to memory that is being
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1835df333287..00e5fb0682ac 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1936,10 +1936,11 @@ static int __bpf_list_add(struct bpf_list_node *node, struct bpf_list_head *head
 {
 	struct list_head *n = (void *)node, *h = (void *)head;
 
+	/* If list_head was 0-initialized by map, bpf_obj_init_field wasn't
+	 * called on its fields, so init here
+	 */
 	if (unlikely(!h->next))
 		INIT_LIST_HEAD(h);
-	if (unlikely(!n->next))
-		INIT_LIST_HEAD(n);
 	if (!list_empty(n)) {
 		/* Only called from BPF prog, no need to migrate_disable */
 		__bpf_obj_drop_impl(n - off, rec);
@@ -1975,6 +1976,9 @@ static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head, bool tai
 {
 	struct list_head *n, *h = (void *)head;
 
+	/* If list_head was 0-initialized by map, bpf_obj_init_field wasn't
+	 * called on its fields, so init here
+	 */
 	if (unlikely(!h->next))
 		INIT_LIST_HEAD(h);
 	if (list_empty(h))
@@ -2000,9 +2004,6 @@ __bpf_kfunc struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
 	struct rb_root_cached *r = (struct rb_root_cached *)root;
 	struct rb_node *n = (struct rb_node *)node;
 
-	if (!n->__rb_parent_color)
-		RB_CLEAR_NODE(n);
-
 	if (RB_EMPTY_NODE(n))
 		return NULL;
 
@@ -2022,9 +2023,6 @@ static int __bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
 	bpf_callback_t cb = (bpf_callback_t)less;
 	bool leftmost = true;
 
-	if (!n->__rb_parent_color)
-		RB_CLEAR_NODE(n);
-
 	if (!RB_EMPTY_NODE(n)) {
 		/* Only called from BPF prog, no need to migrate_disable */
 		__bpf_obj_drop_impl(n - off, rec);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests
  2023-04-15 20:18 [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs Dave Marchevsky
                   ` (7 preceding siblings ...)
  2023-04-15 20:18 ` [PATCH v2 bpf-next 8/9] bpf: Centralize btf_field-specific initialization logic Dave Marchevsky
@ 2023-04-15 20:18 ` Dave Marchevsky
  2023-04-21 22:17   ` Kumar Kartikeya Dwivedi
  2023-04-16  0:50 ` [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs patchwork-bot+netdevbpf
  2023-04-22  2:03 ` Kumar Kartikeya Dwivedi
  10 siblings, 1 reply; 22+ messages in thread
From: Dave Marchevsky @ 2023-04-15 20:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, Dave Marchevsky

Test refcounted local kptr functionality added in previous patches in
the series.

Usecases which pass verification:

* Add refcounted local kptr to both tree and list. Then, read and -
  possibly, depending on test variant - delete from tree, then list.
  * Also test doing read-and-maybe-delete in opposite order
* Stash a refcounted local kptr in a map_value, then add it to a
  rbtree. Read from both, possibly deleting after tree read.
* Add refcounted local kptr to both tree and list. Then, try reading and
  deleting twice from one of the collections.
* bpf_refcount_acquire of just-added non-owning ref should work, as
  should bpf_refcount_acquire of owning ref just out of bpf_obj_new

Usecases which fail verification:

* The simple successful bpf_refcount_acquire cases from above should
  both fail to verify if the newly-acquired owning ref is not dropped

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 .../bpf/prog_tests/refcounted_kptr.c          |  18 +
 .../selftests/bpf/progs/refcounted_kptr.c     | 406 ++++++++++++++++++
 .../bpf/progs/refcounted_kptr_fail.c          |  72 ++++
 3 files changed, 496 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/refcounted_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/refcounted_kptr_fail.c

diff --git a/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c b/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
new file mode 100644
index 000000000000..2ab23832062d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+#include <test_progs.h>
+#include <network_helpers.h>
+
+#include "refcounted_kptr.skel.h"
+#include "refcounted_kptr_fail.skel.h"
+
+void test_refcounted_kptr(void)
+{
+	RUN_TESTS(refcounted_kptr);
+}
+
+void test_refcounted_kptr_fail(void)
+{
+	RUN_TESTS(refcounted_kptr_fail);
+}
diff --git a/tools/testing/selftests/bpf/progs/refcounted_kptr.c b/tools/testing/selftests/bpf/progs/refcounted_kptr.c
new file mode 100644
index 000000000000..1d348a225140
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/refcounted_kptr.c
@@ -0,0 +1,406 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+struct node_data {
+	long key;
+	long list_data;
+	struct bpf_rb_node r;
+	struct bpf_list_node l;
+	struct bpf_refcount ref;
+};
+
+struct map_value {
+	struct node_data __kptr *node;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+} stashed_nodes SEC(".maps");
+
+struct node_acquire {
+	long key;
+	long data;
+	struct bpf_rb_node node;
+	struct bpf_refcount refcount;
+};
+
+#define private(name) SEC(".bss." #name) __hidden __attribute__((aligned(8)))
+private(A) struct bpf_spin_lock lock;
+private(A) struct bpf_rb_root root __contains(node_data, r);
+private(A) struct bpf_list_head head __contains(node_data, l);
+
+private(B) struct bpf_spin_lock alock;
+private(B) struct bpf_rb_root aroot __contains(node_acquire, node);
+
+static bool less(struct bpf_rb_node *node_a, const struct bpf_rb_node *node_b)
+{
+	struct node_data *a;
+	struct node_data *b;
+
+	a = container_of(node_a, struct node_data, r);
+	b = container_of(node_b, struct node_data, r);
+
+	return a->key < b->key;
+}
+
+static bool less_a(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct node_acquire *node_a;
+	struct node_acquire *node_b;
+
+	node_a = container_of(a, struct node_acquire, node);
+	node_b = container_of(b, struct node_acquire, node);
+
+	return node_a->key < node_b->key;
+}
+
+static long __insert_in_tree_and_list(struct bpf_list_head *head,
+				      struct bpf_rb_root *root,
+				      struct bpf_spin_lock *lock)
+{
+	struct node_data *n, *m;
+
+	n = bpf_obj_new(typeof(*n));
+	if (!n)
+		return -1;
+
+	m = bpf_refcount_acquire(n);
+	m->key = 123;
+	m->list_data = 456;
+
+	bpf_spin_lock(lock);
+	if (bpf_rbtree_add(root, &n->r, less)) {
+		/* Failure to insert - unexpected */
+		bpf_spin_unlock(lock);
+		bpf_obj_drop(m);
+		return -2;
+	}
+	bpf_spin_unlock(lock);
+
+	bpf_spin_lock(lock);
+	if (bpf_list_push_front(head, &m->l)) {
+		/* Failure to insert - unexpected */
+		bpf_spin_unlock(lock);
+		return -3;
+	}
+	bpf_spin_unlock(lock);
+	return 0;
+}
+
+static long __stash_map_insert_tree(int idx, int val, struct bpf_rb_root *root,
+				    struct bpf_spin_lock *lock)
+{
+	struct map_value *mapval;
+	struct node_data *n, *m;
+
+	mapval = bpf_map_lookup_elem(&stashed_nodes, &idx);
+	if (!mapval)
+		return -1;
+
+	n = bpf_obj_new(typeof(*n));
+	if (!n)
+		return -2;
+
+	n->key = val;
+	m = bpf_refcount_acquire(n);
+
+	n = bpf_kptr_xchg(&mapval->node, n);
+	if (n) {
+		bpf_obj_drop(n);
+		bpf_obj_drop(m);
+		return -3;
+	}
+
+	bpf_spin_lock(lock);
+	if (bpf_rbtree_add(root, &m->r, less)) {
+		/* Failure to insert - unexpected */
+		bpf_spin_unlock(lock);
+		return -4;
+	}
+	bpf_spin_unlock(lock);
+	return 0;
+}
+
+static long __read_from_tree(struct bpf_rb_root *root,
+			     struct bpf_spin_lock *lock,
+			     bool remove_from_tree)
+{
+	struct bpf_rb_node *rb;
+	struct node_data *n;
+	long res = -99;
+
+	bpf_spin_lock(lock);
+
+	rb = bpf_rbtree_first(root);
+	if (!rb) {
+		bpf_spin_unlock(lock);
+		return -1;
+	}
+
+	n = container_of(rb, struct node_data, r);
+	res = n->key;
+
+	if (!remove_from_tree) {
+		bpf_spin_unlock(lock);
+		return res;
+	}
+
+	rb = bpf_rbtree_remove(root, rb);
+	bpf_spin_unlock(lock);
+	if (!rb)
+		return -2;
+	n = container_of(rb, struct node_data, r);
+	bpf_obj_drop(n);
+	return res;
+}
+
+static long __read_from_list(struct bpf_list_head *head,
+			     struct bpf_spin_lock *lock,
+			     bool remove_from_list)
+{
+	struct bpf_list_node *l;
+	struct node_data *n;
+	long res = -99;
+
+	bpf_spin_lock(lock);
+
+	l = bpf_list_pop_front(head);
+	if (!l) {
+		bpf_spin_unlock(lock);
+		return -1;
+	}
+
+	n = container_of(l, struct node_data, l);
+	res = n->list_data;
+
+	if (!remove_from_list) {
+		if (bpf_list_push_back(head, &n->l)) {
+			bpf_spin_unlock(lock);
+			return -2;
+		}
+	}
+
+	bpf_spin_unlock(lock);
+
+	if (remove_from_list)
+		bpf_obj_drop(n);
+	return res;
+}
+
+static long __read_from_unstash(int idx)
+{
+	struct node_data *n = NULL;
+	struct map_value *mapval;
+	long val = -99;
+
+	mapval = bpf_map_lookup_elem(&stashed_nodes, &idx);
+	if (!mapval)
+		return -1;
+
+	n = bpf_kptr_xchg(&mapval->node, n);
+	if (!n)
+		return -2;
+
+	val = n->key;
+	bpf_obj_drop(n);
+	return val;
+}
+
+#define INSERT_READ_BOTH(rem_tree, rem_list, desc)			\
+SEC("tc")								\
+__description(desc)							\
+__success __retval(579)							\
+long insert_and_remove_tree_##rem_tree##_list_##rem_list(void *ctx)	\
+{									\
+	long err, tree_data, list_data;					\
+									\
+	err = __insert_in_tree_and_list(&head, &root, &lock);		\
+	if (err)							\
+		return err;						\
+									\
+	err = __read_from_tree(&root, &lock, rem_tree);			\
+	if (err < 0)							\
+		return err;						\
+	else								\
+		tree_data = err;					\
+									\
+	err = __read_from_list(&head, &lock, rem_list);			\
+	if (err < 0)							\
+		return err;						\
+	else								\
+		list_data = err;					\
+									\
+	return tree_data + list_data;					\
+}
+
+/* After successful insert of struct node_data into both collections:
+ *   - it should have refcount = 2
+ *   - removing / not removing the node_data from a collection after
+ *     reading should have no effect on ability to read / remove from
+ *     the other collection
+ */
+INSERT_READ_BOTH(true, true, "insert_read_both: remove from tree + list");
+INSERT_READ_BOTH(false, false, "insert_read_both: remove from neither");
+INSERT_READ_BOTH(true, false, "insert_read_both: remove from tree");
+INSERT_READ_BOTH(false, true, "insert_read_both: remove from list");
+
+#undef INSERT_READ_BOTH
+#define INSERT_READ_BOTH(rem_tree, rem_list, desc)			\
+SEC("tc")								\
+__description(desc)							\
+__success __retval(579)							\
+long insert_and_remove_lf_tree_##rem_tree##_list_##rem_list(void *ctx)	\
+{									\
+	long err, tree_data, list_data;					\
+									\
+	err = __insert_in_tree_and_list(&head, &root, &lock);		\
+	if (err)							\
+		return err;						\
+									\
+	err = __read_from_list(&head, &lock, rem_list);			\
+	if (err < 0)							\
+		return err;						\
+	else								\
+		list_data = err;					\
+									\
+	err = __read_from_tree(&root, &lock, rem_tree);			\
+	if (err < 0)							\
+		return err;						\
+	else								\
+		tree_data = err;					\
+									\
+	return tree_data + list_data;					\
+}
+
+/* Similar to insert_read_both, but list data is read and possibly removed
+ * first
+ *
+ * Results should be no different than reading and possibly removing rbtree
+ * node first
+ */
+INSERT_READ_BOTH(true, true, "insert_read_both_list_first: remove from tree + list");
+INSERT_READ_BOTH(false, false, "insert_read_both_list_first: remove from neither");
+INSERT_READ_BOTH(true, false, "insert_read_both_list_first: remove from tree");
+INSERT_READ_BOTH(false, true, "insert_read_both_list_first: remove from list");
+
+#define INSERT_DOUBLE_READ_AND_DEL(read_fn, read_root, desc)		\
+SEC("tc")								\
+__description(desc)							\
+__success __retval(-1)							\
+long insert_double_##read_fn##_and_del_##read_root(void *ctx)		\
+{									\
+	long err, list_data;						\
+									\
+	err = __insert_in_tree_and_list(&head, &root, &lock);		\
+	if (err)							\
+		return err;						\
+									\
+	err = read_fn(&read_root, &lock, true);				\
+	if (err < 0)							\
+		return err;						\
+	else								\
+		list_data = err;					\
+									\
+	err = read_fn(&read_root, &lock, true);				\
+	if (err < 0)							\
+		return err;						\
+									\
+	return err + list_data;						\
+}
+
+/* Insert into both tree and list, then try reading-and-removing from either twice
+ *
+ * The second read-and-remove should fail on read step since the node has
+ * already been removed
+ */
+INSERT_DOUBLE_READ_AND_DEL(__read_from_tree, root, "insert_double_del: 2x read-and-del from tree");
+INSERT_DOUBLE_READ_AND_DEL(__read_from_list, head, "insert_double_del: 2x read-and-del from list");
+
+#define INSERT_STASH_READ(rem_tree, desc)				\
+SEC("tc")								\
+__description(desc)							\
+__success __retval(84)							\
+long insert_rbtree_and_stash__del_tree_##rem_tree(void *ctx)		\
+{									\
+	long err, tree_data, map_data;					\
+									\
+	err = __stash_map_insert_tree(0, 42, &root, &lock);		\
+	if (err)							\
+		return err;						\
+									\
+	err = __read_from_tree(&root, &lock, rem_tree);			\
+	if (err < 0)							\
+		return err;						\
+	else								\
+		tree_data = err;					\
+									\
+	err = __read_from_unstash(0);					\
+	if (err < 0)							\
+		return err;						\
+	else								\
+		map_data = err;						\
+									\
+	return tree_data + map_data;					\
+}
+
+/* Stash a refcounted node in map_val, insert same node into tree, then try
+ * reading data from tree then unstashed map_val, possibly removing from tree
+ *
+ * Removing from tree should have no effect on map_val kptr validity
+ */
+INSERT_STASH_READ(true, "insert_stash_read: remove from tree");
+INSERT_STASH_READ(false, "insert_stash_read: don't remove from tree");
+
+SEC("tc")
+__success
+long rbtree_refcounted_node_ref_escapes(void *ctx)
+{
+	struct node_acquire *n, *m;
+
+	n = bpf_obj_new(typeof(*n));
+	if (!n)
+		return 1;
+
+	bpf_spin_lock(&alock);
+	bpf_rbtree_add(&aroot, &n->node, less_a);
+	m = bpf_refcount_acquire(n);
+	bpf_spin_unlock(&alock);
+
+	m->key = 2;
+	bpf_obj_drop(m);
+	return 0;
+}
+
+SEC("tc")
+__success
+long rbtree_refcounted_node_ref_escapes_owning_input(void *ctx)
+{
+	struct node_acquire *n, *m;
+
+	n = bpf_obj_new(typeof(*n));
+	if (!n)
+		return 1;
+
+	m = bpf_refcount_acquire(n);
+	m->key = 2;
+
+	bpf_spin_lock(&alock);
+	bpf_rbtree_add(&aroot, &n->node, less_a);
+	bpf_spin_unlock(&alock);
+
+	bpf_obj_drop(m);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/refcounted_kptr_fail.c b/tools/testing/selftests/bpf/progs/refcounted_kptr_fail.c
new file mode 100644
index 000000000000..efcb308f80ad
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/refcounted_kptr_fail.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+#include "bpf_misc.h"
+
+struct node_acquire {
+	long key;
+	long data;
+	struct bpf_rb_node node;
+	struct bpf_refcount refcount;
+};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+private(A) struct bpf_spin_lock glock;
+private(A) struct bpf_rb_root groot __contains(node_acquire, node);
+
+static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct node_acquire *node_a;
+	struct node_acquire *node_b;
+
+	node_a = container_of(a, struct node_acquire, node);
+	node_b = container_of(b, struct node_acquire, node);
+
+	return node_a->key < node_b->key;
+}
+
+SEC("?tc")
+__failure __msg("Unreleased reference id=3 alloc_insn=21")
+long rbtree_refcounted_node_ref_escapes(void *ctx)
+{
+	struct node_acquire *n, *m;
+
+	n = bpf_obj_new(typeof(*n));
+	if (!n)
+		return 1;
+
+	bpf_spin_lock(&glock);
+	bpf_rbtree_add(&groot, &n->node, less);
+	/* m becomes an owning ref but is never drop'd or added to a tree */
+	m = bpf_refcount_acquire(n);
+	bpf_spin_unlock(&glock);
+
+	m->key = 2;
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("Unreleased reference id=3 alloc_insn=9")
+long rbtree_refcounted_node_ref_escapes_owning_input(void *ctx)
+{
+	struct node_acquire *n, *m;
+
+	n = bpf_obj_new(typeof(*n));
+	if (!n)
+		return 1;
+
+	/* m becomes an owning ref but is never drop'd or added to a tree */
+	m = bpf_refcount_acquire(n);
+	m->key = 2;
+
+	bpf_spin_lock(&glock);
+	bpf_rbtree_add(&groot, &n->node, less);
+	bpf_spin_unlock(&glock);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs
  2023-04-15 20:18 [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs Dave Marchevsky
                   ` (8 preceding siblings ...)
  2023-04-15 20:18 ` [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests Dave Marchevsky
@ 2023-04-16  0:50 ` patchwork-bot+netdevbpf
  2023-04-22  2:03 ` Kumar Kartikeya Dwivedi
  10 siblings, 0 replies; 22+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-04-16  0:50 UTC (permalink / raw)
  To: Dave Marchevsky; +Cc: bpf, ast, daniel, andrii, martin.lau, kernel-team

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Sat, 15 Apr 2023 13:18:02 -0700 you wrote:
> This series adds support for refcounted local kptrs to the verifier. A local
> kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> 
>   struct refcounted_node {
>     long data;
>     struct bpf_list_node ll;
>     struct bpf_refcount ref;
>   };
> 
> [...]

Here is the summary with links:
  - [v2,bpf-next,1/9] bpf: Remove btf_field_offs, use btf_record's fields instead
    https://git.kernel.org/bpf/bpf-next/c/cd2a8079014a
  - [v2,bpf-next,2/9] bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
    https://git.kernel.org/bpf/bpf-next/c/d54730b50bae
  - [v2,bpf-next,3/9] bpf: Support refcounted local kptrs in existing semantics
    https://git.kernel.org/bpf/bpf-next/c/1512217c47f0
  - [v2,bpf-next,4/9] bpf: Add bpf_refcount_acquire kfunc
    https://git.kernel.org/bpf/bpf-next/c/7c50b1cb76ac
  - [v2,bpf-next,5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail
    https://git.kernel.org/bpf/bpf-next/c/d2dcc67df910
  - [v2,bpf-next,6/9] selftests/bpf: Modify linked_list tests to work with macro-ified inserts
    https://git.kernel.org/bpf/bpf-next/c/de67ba3968fa
  - [v2,bpf-next,7/9] bpf: Migrate bpf_rbtree_remove to possibly fail
    https://git.kernel.org/bpf/bpf-next/c/404ad75a36fb
  - [v2,bpf-next,8/9] bpf: Centralize btf_field-specific initialization logic
    https://git.kernel.org/bpf/bpf-next/c/3e81740a9062
  - [v2,bpf-next,9/9] selftests/bpf: Add refcounted_kptr tests
    https://git.kernel.org/bpf/bpf-next/c/6147f15131e2

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail
  2023-04-15 20:18 ` [PATCH v2 bpf-next 5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail Dave Marchevsky
@ 2023-04-16  1:11   ` Alexei Starovoitov
  2023-04-17 18:08     ` Dave Marchevsky
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2023-04-16  1:11 UTC (permalink / raw)
  To: Dave Marchevsky
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team

On Sat, Apr 15, 2023 at 01:18:07PM -0700, Dave Marchevsky wrote:
> -extern void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
> -			   bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b)) __ksym;
> +extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
> +			       bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b),
> +			       void *meta, __u64 off) __ksym;
> +
> +/* Convenience macro to wrap over bpf_rbtree_add_impl */
> +#define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0)

Applied, but can we do better here?
It's not a new issue. We have the same inefficiency in bpf_obj_drop.
BPF program populates 1 or 2 extra registers, but the verifier patches the call insn
with necessary values for R4 and R5 for bpf_rbtree_add_impl or R2 for bpf_obj_drop_impl.
So the one/two register assignments by the bpf prog are dead code.
Can we come up with a way to avoid this unnecessary register assignment in bpf prog?
Can we keep
extern void bpf_rbtree_add(root, node, less) __ksym; ?
Both in the kernel and in bpf_experimental.h so that libbpf's
bpf_object__resolve_ksym_func_btf_id() -> bpf_core_types_are_compat() check will succeed,
but the kernel bpf_rbtree_add will actually have 5 arguments?
Maybe always_inline or __attribute__((alias(..))) trick we can use?
Or define both and patch bpf code to use _impl later ?

@@ -2053,6 +2053,12 @@ __bpf_kfunc int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node
        return __bpf_rbtree_add(root, node, (void *)less, meta ? meta->record : NULL, off);
 }

+__bpf_kfunc notrace int bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
+                                      bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b))
+{
+       return 0;
+}

Only wastes 3 bytes of .text on x86 and extra BTF_KIND_FUNC in vmlinux BTF,
but will save two register assignments at run-time ?
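
fwiw, the alias flavor could look something like this. Completely untested
sketch; whether pahole/BTF generation and libbpf's ksym matching are happy
with an alias that has a shorter prototype than the real function is exactly
the open question:

/* next to the bpf_rbtree_add_impl definition in kernel/bpf/helpers.c */
int bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
                   bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b))
        __attribute__((alias("bpf_rbtree_add_impl")));

The prototypes intentionally don't match, so the incompatible-alias warning
would need to be silenced, and the verifier would still patch meta/off into
R4/R5 before the call reaches the shared body.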

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail
  2023-04-16  1:11   ` Alexei Starovoitov
@ 2023-04-17 18:08     ` Dave Marchevsky
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Marchevsky @ 2023-04-17 18:08 UTC (permalink / raw)
  To: Alexei Starovoitov, Dave Marchevsky
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team



On 4/15/23 9:11 PM, Alexei Starovoitov wrote:
> On Sat, Apr 15, 2023 at 01:18:07PM -0700, Dave Marchevsky wrote:
>> -extern void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
>> -			   bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b)) __ksym;
>> +extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
>> +			       bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b),
>> +			       void *meta, __u64 off) __ksym;
>> +
>> +/* Convenience macro to wrap over bpf_rbtree_add_impl */
>> +#define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0)
> 
> Applied, but can we do better here?
> It's not a new issue. We have the same inefficiency in bpf_obj_drop.
> BPF program populates 1 or 2 extra registers, but the verifier patches the call insn
> with necessary values for R4 and R5 for bpf_rbtree_add_impl or R2 for bpf_obj_drop_impl.
> So the one/two register assignments by the bpf prog are dead code.
> Can we come up with a way to avoid this unnecessary register assignment in bpf prog?
> Can we keep
> extern void bpf_rbtree_add(root, node, less) __ksym; ?
> Both in the kernel and in bpf_experimental.h so that libbpf's
> bpf_object__resolve_ksym_func_btf_id() -> bpf_core_types_are_compat() check will succeed,
> but the kernel bpf_rbtree_add will actually have 5 arguments?
> Maybe always_inline or __attribute__((alias(..))) trick we can use?
> Or define both and patch bpf code to use _impl later ?
> 
> @@ -2053,6 +2053,12 @@ __bpf_kfunc int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node
>         return __bpf_rbtree_add(root, node, (void *)less, meta ? meta->record : NULL, off);
>  }
> 
> +__bpf_kfunc notrace int bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
> +                                      bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b))
> +{
> +       return 0;
> +}
> 
> Only wastes 3 bytes of .text on x86 and extra BTF_KIND_FUNC in vmlinux BTF,
> but will save two register assignments at run-time ?

I see what you mean, and agree that smarter patching of BPF insns is probably
the best way forward. Will give it a shot.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests
  2023-04-15 20:18 ` [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests Dave Marchevsky
@ 2023-04-21 22:17   ` Kumar Kartikeya Dwivedi
  2023-04-21 23:49     ` Alexei Starovoitov
  2023-04-25  6:53     ` Dave Marchevsky
  0 siblings, 2 replies; 22+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-04-21 22:17 UTC (permalink / raw)
  To: Dave Marchevsky
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team

On Sat, Apr 15, 2023 at 10:18:11PM CEST, Dave Marchevsky wrote:
> Test refcounted local kptr functionality added in previous patches in
> the series.
>
> Usecases which pass verification:
>
> * Add refcounted local kptr to both tree and list. Then, read and -
>   possibly, depending on test variant - delete from tree, then list.
>   * Also test doing read-and-maybe-delete in opposite order
> * Stash a refcounted local kptr in a map_value, then add it to a
>   rbtree. Read from both, possibly deleting after tree read.
> * Add refcounted local kptr to both tree and list. Then, try reading and
>   deleting twice from one of the collections.
> * bpf_refcount_acquire of just-added non-owning ref should work, as
>   should bpf_refcount_acquire of owning ref just out of bpf_obj_new
>
> Usecases which fail verification:
>
> * The simple successful bpf_refcount_acquire cases from above should
>   both fail to verify if the newly-acquired owning ref is not dropped
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> [...]
> +SEC("?tc")
> +__failure __msg("Unreleased reference id=3 alloc_insn=21")
> +long rbtree_refcounted_node_ref_escapes(void *ctx)
> +{
> +	struct node_acquire *n, *m;
> +
> +	n = bpf_obj_new(typeof(*n));
> +	if (!n)
> +		return 1;
> +
> +	bpf_spin_lock(&glock);
> +	bpf_rbtree_add(&groot, &n->node, less);
> +	/* m becomes an owning ref but is never drop'd or added to a tree */
> +	m = bpf_refcount_acquire(n);

I am analyzing the set (and I'll reply in detail to the cover letter), but this
stood out.

Isn't this going to be problematic if n has refcount == 1 and is dropped
internally by bpf_rbtree_add? Are we sure this can never occur? It took me some
time, but the following schedule seems problematic.

CPU 0					CPU 1
n = bpf_obj_new
lock(lock1)
bpf_rbtree_add(rbtree1, n)
m = bpf_refcount_acquire(n)
unlock(lock1)

kptr_xchg(map, m) // move to map
// at this point, refcount = 2
					m = kptr_xchg(map, NULL)
					lock(lock2)
lock(lock1)				bpf_rbtree_add(rbtree2, m)
p = bpf_rbtree_first(rbtree1)			if (!RB_EMPTY_NODE) bpf_obj_drop_impl(m) // A
bpf_rbtree_remove(rbtree1, p)
unlock(lock1)
bpf_obj_drop(p) // B
					bpf_refcount_acquire(m) // use-after-free
					...

B will decrement refcount from 1 to 0, after which bpf_refcount_acquire is
basically performing a use-after-free (when fortunate, one will get a
WARN_ON_ONCE splat for 0 to 1, otherwise, a silent refcount raise for some
different object).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests
  2023-04-21 22:17   ` Kumar Kartikeya Dwivedi
@ 2023-04-21 23:49     ` Alexei Starovoitov
  2023-04-22  2:06       ` Kumar Kartikeya Dwivedi
  2023-04-25  6:53     ` Dave Marchevsky
  1 sibling, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2023-04-21 23:49 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Kernel Team

On Sat, Apr 22, 2023 at 12:17:47AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sat, Apr 15, 2023 at 10:18:11PM CEST, Dave Marchevsky wrote:
> > Test refcounted local kptr functionality added in previous patches in
> > the series.
> >
> > Usecases which pass verification:
> >
> > * Add refcounted local kptr to both tree and list. Then, read and -
> >   possibly, depending on test variant - delete from tree, then list.
> >   * Also test doing read-and-maybe-delete in opposite order
> > * Stash a refcounted local kptr in a map_value, then add it to a
> >   rbtree. Read from both, possibly deleting after tree read.
> > * Add refcounted local kptr to both tree and list. Then, try reading and
> >   deleting twice from one of the collections.
> > * bpf_refcount_acquire of just-added non-owning ref should work, as
> >   should bpf_refcount_acquire of owning ref just out of bpf_obj_new
> >
> > Usecases which fail verification:
> >
> > * The simple successful bpf_refcount_acquire cases from above should
> >   both fail to verify if the newly-acquired owning ref is not dropped
> >
> > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> > ---
> > [...]
> > +SEC("?tc")
> > +__failure __msg("Unreleased reference id=3 alloc_insn=21")
> > +long rbtree_refcounted_node_ref_escapes(void *ctx)
> > +{
> > +	struct node_acquire *n, *m;
> > +
> > +	n = bpf_obj_new(typeof(*n));
> > +	if (!n)
> > +		return 1;
> > +
> > +	bpf_spin_lock(&glock);
> > +	bpf_rbtree_add(&groot, &n->node, less);
> > +	/* m becomes an owning ref but is never drop'd or added to a tree */
> > +	m = bpf_refcount_acquire(n);
> 
> I am analyzing the set (and I'll reply in detail to the cover letter), but this
> stood out.
> 
> Isn't this going to be problematic if n has refcount == 1 and is dropped
> internally by bpf_rbtree_add? Are we sure this can never occur? It took me some
> time, but the following schedule seems problematic.
> 
> CPU 0					CPU 1
> n = bpf_obj_new
> lock(lock1)
> bpf_rbtree_add(rbtree1, n)
> m = bpf_refcount_acquire(n)
> unlock(lock1)
> 
> kptr_xchg(map, m) // move to map
> // at this point, refcount = 2
> 					m = kptr_xchg(map, NULL)
> 					lock(lock2)
> lock(lock1)				bpf_rbtree_add(rbtree2, m)
> p = bpf_rbtree_first(rbtree1)			if (!RB_EMPTY_NODE) bpf_obj_drop_impl(m) // A
> bpf_rbtree_remove(rbtree1, p)
> unlock(lock1)
> bpf_obj_drop(p) // B

You probably meant:
p2 = bpf_rbtree_remove(rbtree1, p)
unlock(lock1)
if (p2)
  bpf_obj_drop(p2)

> 					bpf_refcount_acquire(m) // use-after-free
> 					...
> 
> B will decrement refcount from 1 to 0, after which bpf_refcount_acquire is
> basically performing a use-after-free (when fortunate, one will get a
> WARN_ON_ONCE splat for 0 to 1, otherwise, a silent refcount raise for some
> different object).

As discussed earlier we'll be switching all bpf_obj_new to use BPF_MA_REUSE_AFTER_RCU_GP.

and to address the 0->1 transition.. it does look like we need two flavors of bpf_refcount_acquire.
One for owned refs and another for non-owned.
The owned bpf_refcount_acquire() can stay KF_ACQUIRE with refcount_inc,
while bpf_refcount_acquire() for non-owned refs will use KF_ACQUIRE | KF_RET_NULL and refcount_inc_not_zero.
The bpf prog can use bpf_refcount_acquire everywhere and the verifier will treat it on the spot
differently depending on the argument.
So the code:
n = bpf_obj_new();
if (!n) ...;
m = bpf_refcount_acquire(n);
doesn't need to check if (!m).
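
To make the runtime difference concrete, something like this (just a sketch:
the two helper names and passing the refcount_t directly are made up here,
the real kfunc would stay a single bpf_refcount_acquire and locate the
refcount inside the object via its btf_record):

__bpf_kfunc void *refcount_acquire_owned_sketch(void *obj, refcount_t *ref)
{
        /* owning ref: count is known to be >= 1, plain inc is enough */
        refcount_inc(ref);
        return obj;
}

__bpf_kfunc void *refcount_acquire_nonowned_sketch(void *obj, refcount_t *ref)
{
        /* non-owning ref: the object may already be headed for free,
         * so the 0->1 transition has to be refused
         */
        if (!refcount_inc_not_zero(ref))
                return NULL;    /* KF_RET_NULL: prog must check */
        return obj;
}

The verifier picks between the two behaviors at the call site based on whether
the argument register holds an owning or non-owning ref, so the program keeps
writing plain bpf_refcount_acquire().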

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs
  2023-04-15 20:18 [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs Dave Marchevsky
                   ` (9 preceding siblings ...)
  2023-04-16  0:50 ` [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs patchwork-bot+netdevbpf
@ 2023-04-22  2:03 ` Kumar Kartikeya Dwivedi
  2023-04-22  3:21   ` Alexei Starovoitov
  10 siblings, 1 reply; 22+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-04-22  2:03 UTC (permalink / raw)
  To: Dave Marchevsky
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team

On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> This series adds support for refcounted local kptrs to the verifier. A local
> kptr is 'refcounted' if its type contains a struct bpf_refcount field:
>
>   struct refcounted_node {
>     long data;
>     struct bpf_list_node ll;
>     struct bpf_refcount ref;
>   };
>
> bpf_refcount is used to implement shared ownership for local kptrs.
>
> Motivating usecase
> ==================
>
> If a struct has two collection node fields, e.g.:
>
>   struct node {
>     long key;
>     long val;
>     struct bpf_rb_node rb;
>     struct bpf_list_node ll;
>   };
>
> It's not currently possible to add a node to both the list and rbtree:
>
>   long bpf_prog(void *ctx)
>   {
>     struct node *n = bpf_obj_new(typeof(*n));
>     if (!n) { /* ... */ }
>
>     bpf_spin_lock(&lock);
>
>     bpf_list_push_back(&head, &n->ll);
>     bpf_rbtree_add(&root, &n->rb, less); /* Assume a resonable less() */
>     bpf_spin_unlock(&lock);
>   }
>
> The above program will fail verification due to current owning / non-owning ref
> logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
> passed to bpf_rbtree_add. The only way to get an owning reference for the node
> that was added is to bpf_list_pop_{front,back} it.
>
> More generally, verifier ownership semantics expect that a node has one
> owner (program, collection, or stashed in map) with exclusive ownership
> of the node's lifetime. The owner free's the node's underlying memory when it
> itself goes away.
>
> Without a shared ownership concept it's impossible to express many real-world
> usecases such that they pass verification.
>
> Semantic Changes
> ================
>
> Before this series, the verifier could make this statement: "whoever has the
> owning reference has exclusive ownership of the referent's lifetime". As
> demonstrated in the previous section, this implies that a BPF program can't
> have an owning reference to some node if that node is in a collection. If
> such a state were possible, the node would have multiple owners, each thinking
> they have exclusive ownership. In order to support shared ownership it's
> necessary to modify the exclusive ownership semantic.
>
> After this series' changes, an owning reference has ownership of the referent's
> lifetime, but it's not necessarily exclusive. The referent's underlying memory
> is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
> used for collection insert.
>
> This change doesn't affect UX of owning or non-owning references much:
>
>   * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
>     an owning reference arg, as ownership still must be passed to the
>     collection in a shared-ownership world.
>
>   * non-owning references still refer to valid memory without claiming
>     any ownership.
> [...]

I think there are a series of major issues right now. I am not sure everything
can be addressed using bug fixes.

If I had to summarize the main problems in one-liners:
The mutation of the node fields of an object can be racy.
Lack of collection identity either at runtime or during verification.

--

It is possible for multiple CPUs to get owned references to an object containing
a rbtree or list node, and they can attempt to modify those fields in parallel
without any synchronization.

CPU 0					CPU 1
n = bpf_obj_new(...)
m = bpf_refcount_acquire(n)
kptr_xchg(map, m)
					m = kptr_xchg(map, NULL)
// m == n
bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)

--

Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
sufficient. Consider this:

Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):

CPU 0					CPU 1
n = bpf_obj_new(...)
bpf_spin_lock(lock1)
bpf_rbtree_add(rbtree1, n)
m = bpf_refcount_acquire(n)
bpf_spin_unlock(lock1)
					bpf_spin_lock(lock1)
					n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
					bpf_spin_unlock(lock1)
// let m == n
bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)

--

There's also no discussion on the problem with collection identities we
discussed before (maybe you plan to address it later):
https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com

But static tracking won't be sufficient any longer. Consider another case
where the runtime will be confused about which rbtree a node belongs to.

CPU 0								CPU 1
n = bpf_obj_new(...)
m = bpf_refcount_acquire(n)
kptr_xchg(map, m)
								p = kptr_xchg(map, NULL)
								lock(lock2)
								bpf_rbtree_add(rbtree2, p->rnode)
								unlock(lock2)
lock(lock1)
bpf_list_push_back(head1, n->lnode)
// make n non-owning ref
bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
unlock(lock1)

--

I can come up with multiple other examples. The point I'm trying to drive home
is that it's a more fundamental issue in the way things are set up right now,
not something that was overlooked during the implementation.

The root cause is that access to a node's fields is going to be racy once
multiple CPUs have *mutable* references to it. The lack of ownership information
mapped to the collection either at runtime or during verification is also a
separate problem.

When we were discussing this a few months ago, we tried to form consensus on
synchronizing updates over a node using an 'owner' pointer to eliminate similar
races. Every node field has an associated owner field, which each updater
modifying that node synchronizes over.

In short:
Node's owner describes its state at runtime.
node.owner == ptr_of_ds // part of DS
node.owner == NULL // not part of DS
node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)

cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
bpf_rbtree_add and such will do this to claim node ownership before trying to
link it in, and then set owner to ptr_of_ds. The _store_ will be the
*linearization* point of bpf_rbtree_add, not cmpxchg.

READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
remain in this state until the end of the CS, since the ptr_of_ds to NULL transition only happens
in remove under a held lock for the ds. bpf_rbtree_remove will do this check
before WRITE_ONCE of NULL to unlink a node.

Of course, this is slower, and requires extra space in the object, but unless
this or something else is used to eliminate races, there will be bugs.
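
To make that concrete, a very rough sketch of the kind of update path I have
in mind (all names here are made up and the ds lock handling is elided, so
treat it purely as illustration):

struct shared_rb_node {
        struct rb_node rb;
        void *owner;    /* NULL, BPF_PTR_POISON, or the root it is linked in */
};

static int sketch_rbtree_add(struct rb_root_cached *root, struct shared_rb_node *n,
                             bool (*less)(struct rb_node *, const struct rb_node *))
{
        /* free -> busy; fails if another CPU owns the node or is linking it */
        if (cmpxchg(&n->owner, NULL, BPF_PTR_POISON) != NULL)
                return -EBUSY;
        rb_add_cached(&n->rb, root, less);      /* caller holds the ds lock */
        WRITE_ONCE(n->owner, root);             /* linearization point: now part of the ds */
        return 0;
}

static int sketch_rbtree_remove(struct rb_root_cached *root, struct shared_rb_node *n)
{
        /* node must belong to the ds whose lock we hold, else refuse */
        if (READ_ONCE(n->owner) != root)
                return -EINVAL;
        rb_erase_cached(&n->rb, root);          /* caller holds the ds lock */
        WRITE_ONCE(n->owner, NULL);
        return 0;
}

The list push/pop kfuncs would do the same dance on the list node's owner
field.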

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests
  2023-04-21 23:49     ` Alexei Starovoitov
@ 2023-04-22  2:06       ` Kumar Kartikeya Dwivedi
  2023-04-22  2:18         ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-04-22  2:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Kernel Team

On Sat, Apr 22, 2023 at 01:49:08AM CEST, Alexei Starovoitov wrote:
> On Sat, Apr 22, 2023 at 12:17:47AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Sat, Apr 15, 2023 at 10:18:11PM CEST, Dave Marchevsky wrote:
> > > Test refcounted local kptr functionality added in previous patches in
> > > the series.
> > >
> > > Usecases which pass verification:
> > >
> > > * Add refcounted local kptr to both tree and list. Then, read and -
> > >   possibly, depending on test variant - delete from tree, then list.
> > >   * Also test doing read-and-maybe-delete in opposite order
> > > * Stash a refcounted local kptr in a map_value, then add it to a
> > >   rbtree. Read from both, possibly deleting after tree read.
> > > * Add refcounted local kptr to both tree and list. Then, try reading and
> > >   deleting twice from one of the collections.
> > > * bpf_refcount_acquire of just-added non-owning ref should work, as
> > >   should bpf_refcount_acquire of owning ref just out of bpf_obj_new
> > >
> > > Usecases which fail verification:
> > >
> > > * The simple successful bpf_refcount_acquire cases from above should
> > >   both fail to verify if the newly-acquired owning ref is not dropped
> > >
> > > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> > > ---
> > > [...]
> > > +SEC("?tc")
> > > +__failure __msg("Unreleased reference id=3 alloc_insn=21")
> > > +long rbtree_refcounted_node_ref_escapes(void *ctx)
> > > +{
> > > +	struct node_acquire *n, *m;
> > > +
> > > +	n = bpf_obj_new(typeof(*n));
> > > +	if (!n)
> > > +		return 1;
> > > +
> > > +	bpf_spin_lock(&glock);
> > > +	bpf_rbtree_add(&groot, &n->node, less);
> > > +	/* m becomes an owning ref but is never drop'd or added to a tree */
> > > +	m = bpf_refcount_acquire(n);
> >
> > I am analyzing the set (and I'll reply in detail to the cover letter), but this
> > stood out.
> >
> > Isn't this going to be problematic if n has refcount == 1 and is dropped
> > internally by bpf_rbtree_add? Are we sure this can never occur? It took me some
> > time, but the following schedule seems problematic.
> >
> > CPU 0					CPU 1
> > n = bpf_obj_new
> > lock(lock1)
> > bpf_rbtree_add(rbtree1, n)
> > m = bpf_refcount_acquire(n)
> > unlock(lock1)
> >
> > kptr_xchg(map, m) // move to map
> > // at this point, refcount = 2
> > 					m = kptr_xchg(map, NULL)
> > 					lock(lock2)
> > lock(lock1)				bpf_rbtree_add(rbtree2, m)
> > p = bpf_rbtree_first(rbtree1)			if (!RB_EMPTY_NODE) bpf_obj_drop_impl(m) // A
> > bpf_rbtree_remove(rbtree1, p)
> > unlock(lock1)
> > bpf_obj_drop(p) // B
>
> You probably meant:
> p2 = bpf_rbtree_remove(rbtree1, p)
> unlock(lock1)
> if (p2)
>   bpf_obj_drop(p2)
>
> > 					bpf_refcount_acquire(m) // use-after-free
> > 					...
> >
> > B will decrement refcount from 1 to 0, after which bpf_refcount_acquire is
> > basically performing a use-after-free (when fortunate, one will get a
> > WARN_ON_ONCE splat for 0 to 1, otherwise, a silent refcount raise for some
> > different object).
>
> As discussed earlier we'll be switching all bpf_obj_new to use BPF_MA_REUSE_AFTER_RCU_GP

I probably missed that thread. In that case, is use of this stuff going to
require bpf_rcu_read_lock in sleepable programs?

>
> and to address the 0->1 transition.. it does look like we need two flavors of bpf_refcount_acquire.
> One for owned refs and another for non-owned.
> The owned bpf_refcount_acquire() can stay KF_ACQUIRE with refcount_inc,
> while bpf_refcount_acquire() for non-owned refs will use KF_ACQUIRE | KF_RET_NULL and refcount_inc_not_zero.
> The bpf prog can use bpf_refcount_acquire everywhere and the verifier will treat it on the spot
> differently depending on the argument.
> So the code:
> n = bpf_obj_new();
> if (!n) ...;
> m = bpf_refcount_acquire(n);
> doesn't need to check if (!m).

If memory reuse is prevented, then indeed the above should fix the problem.
Though it might be a bit surprising if a pointer from the same helper has to be
NULL-checked in one context and not in another. That's just a minor point,
though.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests
  2023-04-22  2:06       ` Kumar Kartikeya Dwivedi
@ 2023-04-22  2:18         ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2023-04-22  2:18 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Kernel Team

On Fri, Apr 21, 2023 at 7:06 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, Apr 22, 2023 at 01:49:08AM CEST, Alexei Starovoitov wrote:
> > On Sat, Apr 22, 2023 at 12:17:47AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Sat, Apr 15, 2023 at 10:18:11PM CEST, Dave Marchevsky wrote:
> > > > Test refcounted local kptr functionality added in previous patches in
> > > > the series.
> > > >
> > > > Usecases which pass verification:
> > > >
> > > > * Add refcounted local kptr to both tree and list. Then, read and -
> > > >   possibly, depending on test variant - delete from tree, then list.
> > > >   * Also test doing read-and-maybe-delete in opposite order
> > > > * Stash a refcounted local kptr in a map_value, then add it to a
> > > >   rbtree. Read from both, possibly deleting after tree read.
> > > > * Add refcounted local kptr to both tree and list. Then, try reading and
> > > >   deleting twice from one of the collections.
> > > > * bpf_refcount_acquire of just-added non-owning ref should work, as
> > > >   should bpf_refcount_acquire of owning ref just out of bpf_obj_new
> > > >
> > > > Usecases which fail verification:
> > > >
> > > > * The simple successful bpf_refcount_acquire cases from above should
> > > >   both fail to verify if the newly-acquired owning ref is not dropped
> > > >
> > > > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> > > > ---
> > > > [...]
> > > > +SEC("?tc")
> > > > +__failure __msg("Unreleased reference id=3 alloc_insn=21")
> > > > +long rbtree_refcounted_node_ref_escapes(void *ctx)
> > > > +{
> > > > + struct node_acquire *n, *m;
> > > > +
> > > > + n = bpf_obj_new(typeof(*n));
> > > > + if (!n)
> > > > +         return 1;
> > > > +
> > > > + bpf_spin_lock(&glock);
> > > > + bpf_rbtree_add(&groot, &n->node, less);
> > > > + /* m becomes an owning ref but is never drop'd or added to a tree */
> > > > + m = bpf_refcount_acquire(n);
> > >
> > > I am analyzing the set (and I'll reply in detail to the cover letter), but this
> > > stood out.
> > >
> > > Isn't this going to be problematic if n has refcount == 1 and is dropped
> > > internally by bpf_rbtree_add? Are we sure this can never occur? It took me some
> > > time, but the following schedule seems problematic.
> > >
> > > CPU 0                                       CPU 1
> > > n = bpf_obj_new
> > > lock(lock1)
> > > bpf_rbtree_add(rbtree1, n)
> > > m = bpf_refcount_acquire(n)
> > > unlock(lock1)
> > >
> > > kptr_xchg(map, m) // move to map
> > > // at this point, refcount = 2
> > >                                     m = kptr_xchg(map, NULL)
> > >                                     lock(lock2)
> > > lock(lock1)                         bpf_rbtree_add(rbtree2, m)
> > > p = bpf_rbtree_first(rbtree1)                       if (!RB_EMPTY_NODE) bpf_obj_drop_impl(m) // A
> > > bpf_rbtree_remove(rbtree1, p)
> > > unlock(lock1)
> > > bpf_obj_drop(p) // B
> >
> > You probably meant:
> > p2 = bpf_rbtree_remove(rbtree1, p)
> > unlock(lock1)
> > if (p2)
> >   bpf_obj_drop(p2)
> >
> > >                                     bpf_refcount_acquire(m) // use-after-free
> > >                                     ...
> > >
> > > B will decrement refcount from 1 to 0, after which bpf_refcount_acquire is
> > > basically performing a use-after-free (when fortunate, one will get a
> > > WARN_ON_ONCE splat for 0 to 1, otherwise, a silent refcount raise for some
> > > different object).
> >
> > As discussed earlier we'll be switching all bpf_obj_new to use BPF_MA_REUSE_AFTER_RCU_GP
>
> I probably missed that thread. In that case, is use of this stuff going to
> require bpf_rcu_read_lock in sleepable programs?

Correct.
We've refactored rcu enforcement.
We also decided to stop waiting for gcc btf_tag support to land and
went with allowlisting BTF_TYPE_SAFE_RCU[_OR_NULL] for fields that
we need to use now.
Overall it allowed us to get rid of kptr_get.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs
  2023-04-22  2:03 ` Kumar Kartikeya Dwivedi
@ 2023-04-22  3:21   ` Alexei Starovoitov
  2023-04-22 18:42     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2023-04-22  3:21 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Kernel Team

On Sat, Apr 22, 2023 at 04:03:45AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> > This series adds support for refcounted local kptrs to the verifier. A local
> > kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> >
> >   struct refcounted_node {
> >     long data;
> >     struct bpf_list_node ll;
> >     struct bpf_refcount ref;
> >   };
> >
> > bpf_refcount is used to implement shared ownership for local kptrs.
> >
> > Motivating usecase
> > ==================
> >
> > If a struct has two collection node fields, e.g.:
> >
> >   struct node {
> >     long key;
> >     long val;
> >     struct bpf_rb_node rb;
> >     struct bpf_list_node ll;
> >   };
> >
> > It's not currently possible to add a node to both the list and rbtree:
> >
> >   long bpf_prog(void *ctx)
> >   {
> >     struct node *n = bpf_obj_new(typeof(*n));
> >     if (!n) { /* ... */ }
> >
> >     bpf_spin_lock(&lock);
> >
> >     bpf_list_push_back(&head, &n->ll);
> >     bpf_rbtree_add(&root, &n->rb, less); /* Assume a resonable less() */
> >     bpf_spin_unlock(&lock);
> >   }
> >
> > The above program will fail verification due to current owning / non-owning ref
> > logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
> > passed to bpf_rbtree_add. The only way to get an owning reference for the node
> > that was added is to bpf_list_pop_{front,back} it.
> >
> > More generally, verifier ownership semantics expect that a node has one
> > owner (program, collection, or stashed in map) with exclusive ownership
> > of the node's lifetime. The owner free's the node's underlying memory when it
> > itself goes away.
> >
> > Without a shared ownership concept it's impossible to express many real-world
> > usecases such that they pass verification.
> >
> > Semantic Changes
> > ================
> >
> > Before this series, the verifier could make this statement: "whoever has the
> > owning reference has exclusive ownership of the referent's lifetime". As
> > demonstrated in the previous section, this implies that a BPF program can't
> > have an owning reference to some node if that node is in a collection. If
> > such a state were possible, the node would have multiple owners, each thinking
> > they have exclusive ownership. In order to support shared ownership it's
> > necessary to modify the exclusive ownership semantic.
> >
> > After this series' changes, an owning reference has ownership of the referent's
> > lifetime, but it's not necessarily exclusive. The referent's underlying memory
> > is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
> > used for collection insert.
> >
> > This change doesn't affect UX of owning or non-owning references much:
> >
> >   * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
> >     an owning reference arg, as ownership still must be passed to the
> >     collection in a shared-ownership world.
> >
> >   * non-owning references still refer to valid memory without claiming
> >     any ownership.
> > [...]
> 
> I think there are a series of major issues right now. I am not sure everything
> can be addressed using bug fixes.
> 
> If I had to summarize the main problems in one liners:
> The mutation of the node fields of an object can be racy.
> Lack of collection identity either at runtime or verification.
> 
> --
> 
> It is possible for multiple CPUs to get owned references to an object containing
> a rbtree or list node, and they can attempt to modify those fields in parallel
> without any synchronization.
> 
> CPU 0					CPU 1
> n = bpf_obj_new(...)
> m = bpf_refcount_acquire(n)
> kptr_xchg(map, m)
> 					m = kptr_xchg(map, NULL)
> // m == n
> bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> 
> --
> 
> Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
> sufficient. Consider this:
> 
> Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):
> 
> CPU 0					CPU 1
> n = bpf_obj_new(...)
> bpf_spin_lock(lock1)
> bpf_rbtree_add(rbtree1, n)
> m = bpf_refcount_acquire(n)
> bpf_spin_unlock(lock1)
> 					bpf_spin_lock(lock1)
> 					n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
> 					bpf_spin_unlock(lock1)
> // let m == n
> bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> 
> --
> 
> There's also no discussion on the problem with collection identities we
> discussed before (maybe you plan to address it later):
> https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com
> 
> But static tracking won't be sufficient any longer. Consider another case
> where the runtime will be confused about which rbtree a node belongs to.
> 
> CPU 0								CPU 1
> n = bpf_obj_new(...)
> m = bpf_refcount_acquire(n)
> kptr_xchg(map, m)
> 								p = kptr_xchg(map, NULL)
> 								lock(lock2)
> 								bpf_rbtree_add(rbtree2, p->rnode)
> 								unlock(lock2)
> lock(lock1)
> bpf_list_push_back(head1, n->lnode)
> // make n non-owning ref
> bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
> unlock(lock1)

Thanks for describing the races.

> --
> 
> I can come up with multiple other examples. The point I'm trying to drive home
> is that it's a more fundamental issue in the way things are set up right now,
> not something that was overlooked during the implementation.
> 
> The root cause is that access to a node's fields is going to be racy once
> multiple CPUs have *mutable* references to it. The lack of ownership information
> mapped to the collection either at runtime or during verification is also a
> separate problem.
> 
> When we were discussing this a few months ago, we tried to form consensus on
> synchronizing updates over a node using an 'owner' pointer to eliminate similar
> races. Every node field has an associated owner field, which each updater
> modifying that node synchronizes over.
> 
> In short:
> Node's owner describes its state at runtime.
> node.owner == ptr_of_ds // part of DS

what is DS ?

> node.owner == NULL // not part of DS
> node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)
> 
> cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
> bpf_rbtree_add and such will do this to claim node ownership before trying to
> link it in, and then store owner to ptr_of_ds. The _store_ will be the
> *linearization* point of bpf_rbtree_add, not cmpxchg.
> 
> READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
> remain in this state until the end of CS, since ptr_to_ds to NULL only happens
> in remove under a held lock for the ds. bpf_rbtree_remove will do this check
> before WRITE_ONCE of NULL to unlink a node.
> 
> Of course, this is slower, and requires extra space in the object, but unless
> this or something else is used to eliminate races, there will be bugs.

Makes sense to me. Where do you propose to store the owner? Another explicit
field that bpf prog has to provide inside a node?
A hidden field in bpf_refcount ?
A hidden field in bpf_list_node / bpf_rb_node ?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs
  2023-04-22  3:21   ` Alexei Starovoitov
@ 2023-04-22 18:42     ` Kumar Kartikeya Dwivedi
  2023-04-22 21:25       ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-04-22 18:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Kernel Team

On Sat, Apr 22, 2023 at 05:21:36AM CEST, Alexei Starovoitov wrote:
> On Sat, Apr 22, 2023 at 04:03:45AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> > > This series adds support for refcounted local kptrs to the verifier. A local
> > > kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> > >
> > >   struct refcounted_node {
> > >     long data;
> > >     struct bpf_list_node ll;
> > >     struct bpf_refcount ref;
> > >   };
> > >
> > > bpf_refcount is used to implement shared ownership for local kptrs.
> > >
> > > Motivating usecase
> > > ==================
> > >
> > > If a struct has two collection node fields, e.g.:
> > >
> > >   struct node {
> > >     long key;
> > >     long val;
> > >     struct bpf_rb_node rb;
> > >     struct bpf_list_node ll;
> > >   };
> > >
> > > It's not currently possible to add a node to both the list and rbtree:
> > >
> > >   long bpf_prog(void *ctx)
> > >   {
> > >     struct node *n = bpf_obj_new(typeof(*n));
> > >     if (!n) { /* ... */ }
> > >
> > >     bpf_spin_lock(&lock);
> > >
> > >     bpf_list_push_back(&head, &n->ll);
> > >     bpf_rbtree_add(&root, &n->rb, less); /* Assume a resonable less() */
> > >     bpf_spin_unlock(&lock);
> > >   }
> > >
> > > The above program will fail verification due to current owning / non-owning ref
> > > logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
> > > passed to bpf_rbtree_add. The only way to get an owning reference for the node
> > > that was added is to bpf_list_pop_{front,back} it.
> > >
> > > More generally, verifier ownership semantics expect that a node has one
> > > owner (program, collection, or stashed in map) with exclusive ownership
> > > of the node's lifetime. The owner free's the node's underlying memory when it
> > > itself goes away.
> > >
> > > Without a shared ownership concept it's impossible to express many real-world
> > > usecases such that they pass verification.
> > >
> > > Semantic Changes
> > > ================
> > >
> > > Before this series, the verifier could make this statement: "whoever has the
> > > owning reference has exclusive ownership of the referent's lifetime". As
> > > demonstrated in the previous section, this implies that a BPF program can't
> > > have an owning reference to some node if that node is in a collection. If
> > > such a state were possible, the node would have multiple owners, each thinking
> > > they have exclusive ownership. In order to support shared ownership it's
> > > necessary to modify the exclusive ownership semantic.
> > >
> > > After this series' changes, an owning reference has ownership of the referent's
> > > lifetime, but it's not necessarily exclusive. The referent's underlying memory
> > > is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
> > > used for collection insert.
> > >
> > > This change doesn't affect UX of owning or non-owning references much:
> > >
> > >   * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
> > >     an owning reference arg, as ownership still must be passed to the
> > >     collection in a shared-ownership world.
> > >
> > >   * non-owning references still refer to valid memory without claiming
> > >     any ownership.
> > > [...]
> >
> > I think there are a series of major issues right now. I am not sure everything
> > can be addressed using bug fixes.
> >
> > If I had to summarize the main problems in one liners:
> > The mutation of the node fields of an object can be racy.
> > Lack of collection identity either at runtime or verification.
> >
> > --
> >
> > It is possible for multiple CPUs to get owned references to an object containing
> > a rbtree or list node, and they can attempt to modify those fields in parallel
> > without any synchronization.
> >
> > CPU 0					CPU 1
> > n = bpf_obj_new(...)
> > m = bpf_refcount_acquire(n)
> > kptr_xchg(map, m)
> > 					m = kptr_xchg(map, NULL)
> > // m == n
> > bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> > bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> > 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> >
> > --
> >
> > Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
> > sufficient. Consider this:
> >
> > Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):
> >
> > CPU 0					CPU 1
> > n = bpf_obj_new(...)
> > bpf_spin_lock(lock1)
> > bpf_rbtree_add(rbtree1, n)
> > m = bpf_refcount_acquire(n)
> > bpf_spin_unlock(lock1)
> > 					bpf_spin_lock(lock1)
> > 					n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
> > 					bpf_spin_unlock(lock1)
> > // let m == n
> > bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> > bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> > 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> >
> > --
> >
> > There's also no discussion on the problem with collection identities we
> > discussed before (maybe you plan to address it later):
> > https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com
> >
> > But static tracking won't be sufficient any longer. Consider another case
> > where the runtime will be confused about which rbtree a node belongs to.
> >
> > CPU 0								CPU 1
> > n = bpf_obj_new(...)
> > m = bpf_refcount_acquire(n)
> > kptr_xchg(map, m)
> > 								p = kptr_xchg(map, NULL)
> > 								lock(lock2)
> > 								bpf_rbtree_add(rbtree2, p->rnode)
> > 								unlock(lock2)
> > lock(lock1)
> > bpf_list_push_back(head1, n->lnode)
> > // make n non-owning ref
> > bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
> > unlock(lock1)
>
> Thanks for describing the races.
>
> > --
> >
> > I can come up with multiple other examples. The point I'm trying to drive home
> > is that it's a more fundamental issue in the way things are set up right now,
> > not something that was overlooked during the implementation.
> >
> > The root cause is that access to a node's fields is going to be racy once
> > multiple CPUs have *mutable* references to it. The lack of ownership information
> > mapped to the collection either at runtime or during verification is also a
> > separate problem.
> >
> > When we were discussing this a few months ago, we tried to form consensus on
> > synchronizing updates over a node using an 'owner' pointer to eliminate similar
> > races. Every node field has an associated owner field, which each updater
> > modifying that node synchronizes over.
> >
> > In short:
> > Node's owner describes its state at runtime.
> > node.owner == ptr_of_ds // part of DS
>
> what is DS ?
>

The graph data structure.

> > node.owner == NULL // not part of DS
> > node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)
> >
> > cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
> > bpf_rbtree_add and such will do this to claim node ownership before trying to
> > link it in, and then store owner to ptr_of_ds. The _store_ will be the
> > *linearization* point of bpf_rbtree_add, not cmpxchg.
> >
> > READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
> > remain in this state until the end of CS, since ptr_to_ds to NULL only happens
> > in remove under a held lock for the ds. bpf_rbtree_remove will do this check
> > before WRITE_ONCE of NULL to unlink a node.
> >
> > Of course, this is slower, and requires extra space in the object, but unless
> > this or something else is used to eliminate races, there will be bugs.
>
> Makes sense to me. Where do you propose to store the owner? Another explicit
> field that bpf prog has to provide inside a node?
> A hidden field in bpf_refcount ?
> A hidden field in bpf_list_node / bpf_rb_node ?

It will have to be with each bpf_list_node / bpf_rb_node.

Pseudocode wise, this is how things will work:

bpf_rbtree_add(rbtree, node):
	// Move a node from free to busy state
	if (!cmpxchg(node.owner, NULL, BPF_PTR_POISON))
		return -EINVAL
	__rb_add(...)
	WRITE_ONCE(node.owner, rbtree)

bpf_rbtree_remove(rbtree, node):
	// Check if node is part of rbtree:
	// If != rbtree, it could be changing concurrently
	// If == rbtree, changing it requires the lock we are holding
	if (READ_ONCE(node.owner) != rbtree)
		return -EINVAL
	__rb_remove(...)
	WRITE_ONCE(node.owner, NULL)

Likewise for the linked list helpers.
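
For illustration, here's a rough C sketch of how the list push path could look
under this scheme. The struct layout and function below are placeholders I'm
making up for the example (assuming the kernel's list helpers and
BPF_PTR_POISON), not existing definitions:

/* Sketch only: assume each bpf_list_node gets a hidden owner slot next to
 * the kernel-side list_head.
 */
struct bpf_list_node_kern {
	struct list_head list_head;
	void *owner;
};

static int __bpf_list_push(struct list_head *head,
			   struct bpf_list_node_kern *node, bool tail)
{
	/* Move the node from free (NULL) to busy (BPF_PTR_POISON). Fails if
	 * another CPU already claimed it or it is linked somewhere.
	 */
	if (cmpxchg(&node->owner, NULL, BPF_PTR_POISON))
		return -EINVAL;

	tail ? list_add_tail(&node->list_head, head) :
	       list_add(&node->list_head, head);

	/* Publish ownership of the node; this store is the linearization
	 * point, matching the rbtree pseudocode above.
	 */
	WRITE_ONCE(node->owner, head);
	return 0;
}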

My opinion would be to have separate bpf_list_node_shared and
bpf_rb_node_shared types for the refcount_off >= 0 case, which encode
bpf_list_node and bpf_rb_node together with their owner, since the space and
runtime-check tradeoff is not required if you're using exclusive ownership.
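
Layout-wise, something along these lines (purely illustrative, the names and
reserved sizes are not meant to be final):

struct bpf_list_node_shared {
	__u64 __opaque[2];	/* room for the kernel's list_head */
	__u64 __owner;		/* hidden owner pointer */
} __attribute__((aligned(8)));

struct bpf_rb_node_shared {
	__u64 __opaque[3];	/* room for the kernel's rb_node */
	__u64 __owner;		/* hidden owner pointer */
} __attribute__((aligned(8)));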

One more point is that you can minimize cost of cmpxchg by adding compound
operation helpers.

For instance (under appropriate locking):

bpf_rbtree_move(rbtree1, rbtree2, node):
	if (READ_ONCE(node.owner) != rbtree1)
		return -EINVAL
	__rb_remove(rbtree1, node)
	__rb_add(rbtree2, node)
	WRITE_ONCE(node.owner, rbtree2)

Instead of:
	bpf_rbtree_remove(rbtree1, node)
	bpf_rbtree_add(rbtree2, node)
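
From the BPF program side, usage could look roughly like the sketch below;
bpf_rbtree_move() is the hypothetical helper above, glock/root1/root2/node_data
are made-up names following the selftest conventions, and both roots are
assumed to share the same lock:

struct node_data {
	long key;
	struct bpf_rb_node rb;
	struct bpf_refcount ref;
};

private(A) struct bpf_spin_lock glock;
private(A) struct bpf_rb_root root1 __contains(node_data, rb);
private(A) struct bpf_rb_root root2 __contains(node_data, rb);

SEC("tc")
long move_first_node(void *ctx)
{
	struct bpf_rb_node *r;
	struct node_data *n;

	bpf_spin_lock(&glock);
	r = bpf_rbtree_first(&root1);
	if (!r) {
		bpf_spin_unlock(&glock);
		return 0;
	}
	n = container_of(r, struct node_data, rb);
	/* One owner-field check plus relink instead of remove + add */
	bpf_rbtree_move(&root1, &root2, &n->rb);
	bpf_spin_unlock(&glock);
	return 0;
}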

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs
  2023-04-22 18:42     ` Kumar Kartikeya Dwivedi
@ 2023-04-22 21:25       ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2023-04-22 21:25 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Kernel Team

On Sat, Apr 22, 2023 at 08:42:47PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sat, Apr 22, 2023 at 05:21:36AM CEST, Alexei Starovoitov wrote:
> > On Sat, Apr 22, 2023 at 04:03:45AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> > > > This series adds support for refcounted local kptrs to the verifier. A local
> > > > kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> > > >
> > > >   struct refcounted_node {
> > > >     long data;
> > > >     struct bpf_list_node ll;
> > > >     struct bpf_refcount ref;
> > > >   };
> > > >
> > > > bpf_refcount is used to implement shared ownership for local kptrs.
> > > >
> > > > Motivating usecase
> > > > ==================
> > > >
> > > > If a struct has two collection node fields, e.g.:
> > > >
> > > >   struct node {
> > > >     long key;
> > > >     long val;
> > > >     struct bpf_rb_node rb;
> > > >     struct bpf_list_node ll;
> > > >   };
> > > >
> > > > It's not currently possible to add a node to both the list and rbtree:
> > > >
> > > >   long bpf_prog(void *ctx)
> > > >   {
> > > >     struct node *n = bpf_obj_new(typeof(*n));
> > > >     if (!n) { /* ... */ }
> > > >
> > > >     bpf_spin_lock(&lock);
> > > >
> > > >     bpf_list_push_back(&head, &n->ll);
> > > >     bpf_rbtree_add(&root, &n->rb, less); /* Assume a resonable less() */
> > > >     bpf_spin_unlock(&lock);
> > > >   }
> > > >
> > > > The above program will fail verification due to current owning / non-owning ref
> > > > logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
> > > > passed to bpf_rbtree_add. The only way to get an owning reference for the node
> > > > that was added is to bpf_list_pop_{front,back} it.
> > > >
> > > > More generally, verifier ownership semantics expect that a node has one
> > > > owner (program, collection, or stashed in map) with exclusive ownership
> > > > of the node's lifetime. The owner free's the node's underlying memory when it
> > > > itself goes away.
> > > >
> > > > Without a shared ownership concept it's impossible to express many real-world
> > > > usecases such that they pass verification.
> > > >
> > > > Semantic Changes
> > > > ================
> > > >
> > > > Before this series, the verifier could make this statement: "whoever has the
> > > > owning reference has exclusive ownership of the referent's lifetime". As
> > > > demonstrated in the previous section, this implies that a BPF program can't
> > > > have an owning reference to some node if that node is in a collection. If
> > > > such a state were possible, the node would have multiple owners, each thinking
> > > > they have exclusive ownership. In order to support shared ownership it's
> > > > necessary to modify the exclusive ownership semantic.
> > > >
> > > > After this series' changes, an owning reference has ownership of the referent's
> > > > lifetime, but it's not necessarily exclusive. The referent's underlying memory
> > > > is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
> > > > used for collection insert.
> > > >
> > > > This change doesn't affect UX of owning or non-owning references much:
> > > >
> > > >   * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
> > > >     an owning reference arg, as ownership still must be passed to the
> > > >     collection in a shared-ownership world.
> > > >
> > > >   * non-owning references still refer to valid memory without claiming
> > > >     any ownership.
> > > > [...]
> > >
> > > I think there are a series of major issues right now. I am not sure everything
> > > can be addressed using bug fixes.
> > >
> > > If I had to summarize the main problems in one liners:
> > > The mutation of the node fields of an object can be racy.
> > > Lack of collection identity either at runtime or verification.
> > >
> > > --
> > >
> > > It is possible for multiple CPUs to get owned references to an object containing
> > > a rbtree or list node, and they can attempt to modify those fields in parallel
> > > without any synchronization.
> > >
> > > CPU 0					CPU 1
> > > n = bpf_obj_new(...)
> > > m = bpf_refcount_acquire(n)
> > > kptr_xchg(map, m)
> > > 					m = kptr_xchg(map, NULL)
> > > // m == n
> > > bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> > > bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> > > 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > > bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> > >
> > > --
> > >
> > > Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
> > > sufficient. Consider this:
> > >
> > > Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):
> > >
> > > CPU 0					CPU 1
> > > n = bpf_obj_new(...)
> > > bpf_spin_lock(lock1)
> > > bpf_rbtree_add(rbtree1, n)
> > > m = bpf_refcount_acquire(n)
> > > bpf_spin_unlock(lock1)
> > > 					bpf_spin_lock(lock1)
> > > 					n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
> > > 					bpf_spin_unlock(lock1)
> > > // let m == n
> > > bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> > > bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> > > 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > > bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> > >
> > > --
> > >
> > > There's also no discussion on the problem with collection identities we
> > > discussed before (maybe you plan to address it later):
> > > https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com
> > >
> > > But static tracking won't be sufficient any longer. Consider another case
> > > where the runtime will be confused about which rbtree a node belongs to.
> > >
> > > CPU 0								CPU 1
> > > n = bpf_obj_new(...)
> > > m = bpf_refcount_acquire(n)
> > > kptr_xchg(map, m)
> > > 								p = kptr_xchg(map, NULL)
> > > 								lock(lock2)
> > > 								bpf_rbtree_add(rbtree2, p->rnode)
> > > 								unlock(lock2)
> > > lock(lock1)
> > > bpf_list_push_back(head1, n->lnode)
> > > // make n non-owning ref
> > > bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
> > > unlock(lock1)
> >
> > Thanks for describing the races.
> >
> > > --
> > >
> > > I can come up with multiple other examples. The point I'm trying to drive home
> > > is that it's a more fundamental issue in the way things are set up right now,
> > > not something that was overlooked during the implementation.
> > >
> > > The root cause is that access to a node's fields is going to be racy once
> > > multiple CPUs have *mutable* references to it. The lack of ownership information
> > > mapped to the collection either at runtime or during verification is also a
> > > separate problem.
> > >
> > > When we were discussing this a few months ago, we tried to form consensus on
> > > synchronizing updates over a node using an 'owner' pointer to eliminate similar
> > > races. Every node field has an associated owner field, which each updater
> > > modifying that node synchronizes over.
> > >
> > > In short:
> > > Node's owner describes its state at runtime.
> > > node.owner == ptr_of_ds // part of DS
> >
> > what is DS ?
> >
> 
> The graph data structure.
> 
> > > node.owner == NULL // not part of DS
> > > node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)
> > >
> > > cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
> > > bpf_rbtree_add and such will do this to claim node ownership before trying to
> > > link it in, and then store owner to ptr_of_ds. The _store_ will be the
> > > *linearization* point of bpf_rbtree_add, not cmpxchg.
> > >
> > > READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
> > > remain in this state until the end of CS, since ptr_to_ds to NULL only happens
> > > in remove under a held lock for the ds. bpf_rbtree_remove will do this check
> > > before WRITE_ONCE of NULL to unlink a node.
> > >
> > > Of course, this is slower, and requires extra space in the object, but unless
> > > this or something else is used to eliminate races, there will be bugs.
> >
> > Makes sense to me. Where do you propose to store the owner? Another explicit
> > field that bpf prog has to provide inside a node?
> > A hidden field in bpf_refcount ?
> > A hidden field in bpf_list_node / bpf_rb_node ?
> 
> It will have to be with each bpf_list_node / bpf_rb_node.

Yeah, realized that after pressing 'send'.

> Pseudocode wise, this is how things will work:
> 
> bpf_rbtree_add(rbtree, node):
> 	// Move a node from free to busy state
> 	if (!cmpxchg(node.owner, NULL, BPF_PTR_POISON))
> 		return -EINVAL

There should be a bpf_obj_drop here instead of a plain return.
In other words, the current
if (!RB_EMPTY_NODE()) bpf_obj_drop
check will be replaced by the above cmpxchg.

> 	__rb_add(...)
> 	WRITE_ONCE(node.owner, rbtree)
> 
> bpf_rbtree_remove(rbtree, node):
> 	// Check if node is part of rbtree:
> 	// If != rbtree, it could be changing concurrently
> 	// If == rbtree, changing it requires the lock we are holding
> 	if (READ_ONCE(node.owner) != rbtree)
> 		return -EINVAL

This should be 'return NULL'.
The above check will replace the current RB_EMPTY_NODE check.

> 	__rb_remove(...)
> 	WRITE_ONCE(node.owner, NULL)

Compared to the existing code it's only an extra WRITE_ONCE,
and a cmpxchg instead of a load+cmp.
IMO the perf cost is roughly the same.
Because of that there is no need to have
two variants of bpf_rbtree_add/remove for shared/exclusive nodes.

> Likewise for the linked list helpers.
> 
> My opinion would be to have separate bpf_list_node_shared and
> bpf_rb_node_shared types for the refcount_off >= 0 case, which encode
> bpf_list_node and bpf_rb_node together with their owner, since the space and
> runtime-check tradeoff is not required if you're using exclusive ownership.

True, but the runtime performance of bpf_rbtree_add/remove is the same for both cases.
With bpf_rb_node_shared the bpf prog would save 8 bytes of memory at the expense
of plenty of verifier complexity: we would need the two-sided check
if (btf_id == bpf_rb_node_shared || btf_id == bpf_rb_node) in a bunch of places
and one of the two in another set of places.
I'd rather waste 8 bytes than complicate the verifier.
Having a valid node->owner for exclusive rbtrees and linked lists is a nice debug feature too.
Not saying we screwed up the non-shared rbtree, but I won't be surprised if
somebody figures out how to construct a similar race there in the future.
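
Concretely, I'm picturing the existing opaque uapi types simply growing by one
__u64 to hold the hidden owner, along these lines (sizes illustrative, not a
committed layout):

struct bpf_list_node {
	__u64 __opaque[3];	/* list_head plus hidden owner pointer */
} __attribute__((aligned(8)));

struct bpf_rb_node {
	__u64 __opaque[4];	/* rb_node plus hidden owner pointer */
} __attribute__((aligned(8)));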

> One more point is that you can minimize cost of cmpxchg by adding compound
> operation helpers.
> 
> For instance (under appropriate locking):
> 
> bpf_rbtree_move(rbtree1, rbtree2, node):
> 	if (READ_ONCE(node.owner) != rbtree1)
> 		return -EINVAL
> 	__rb_remove(rbtree1, node)
> 	__rb_add(rbtree2, node)
> 	WRITE_ONCE(node.owner, rbtree2)
> 
> Instead of:
> 	bpf_rbtree_remove(rbtree1, node)
> 	bpf_rbtree_add(rbtree2, node)

That makes sense. Not urgent. Distant follow-up.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests
  2023-04-21 22:17   ` Kumar Kartikeya Dwivedi
  2023-04-21 23:49     ` Alexei Starovoitov
@ 2023-04-25  6:53     ` Dave Marchevsky
  1 sibling, 0 replies; 22+ messages in thread
From: Dave Marchevsky @ 2023-04-25  6:53 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Dave Marchevsky
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team

On 4/21/23 6:17 PM, Kumar Kartikeya Dwivedi wrote:
> On Sat, Apr 15, 2023 at 10:18:11PM CEST, Dave Marchevsky wrote:
>> Test refcounted local kptr functionality added in previous patches in
>> the series.
>>
>> Usecases which pass verification:
>>
>> * Add refcounted local kptr to both tree and list. Then, read and -
>>   possibly, depending on test variant - delete from tree, then list.
>>   * Also test doing read-and-maybe-delete in opposite order
>> * Stash a refcounted local kptr in a map_value, then add it to a
>>   rbtree. Read from both, possibly deleting after tree read.
>> * Add refcounted local kptr to both tree and list. Then, try reading and
>>   deleting twice from one of the collections.
>> * bpf_refcount_acquire of just-added non-owning ref should work, as
>>   should bpf_refcount_acquire of owning ref just out of bpf_obj_new
>>
>> Usecases which fail verification:
>>
>> * The simple successful bpf_refcount_acquire cases from above should
>>   both fail to verify if the newly-acquired owning ref is not dropped
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>> [...]
>> +SEC("?tc")
>> +__failure __msg("Unreleased reference id=3 alloc_insn=21")
>> +long rbtree_refcounted_node_ref_escapes(void *ctx)
>> +{
>> +	struct node_acquire *n, *m;
>> +
>> +	n = bpf_obj_new(typeof(*n));
>> +	if (!n)
>> +		return 1;
>> +
>> +	bpf_spin_lock(&glock);
>> +	bpf_rbtree_add(&groot, &n->node, less);
>> +	/* m becomes an owning ref but is never drop'd or added to a tree */
>> +	m = bpf_refcount_acquire(n);
> 
> I am analyzing the set (and I'll reply in detail to the cover letter), but this
> stood out.
> 
> Isn't this going to be problematic if n has refcount == 1 and is dropped
> internally by bpf_rbtree_add? Are we sure this can never occur? It took me some
> time, but the following schedule seems problematic.
> 
> CPU 0					CPU 1
> n = bpf_obj_new
> lock(lock1)
> bpf_rbtree_add(rbtree1, n)
> m = bpf_refcount_acquire(n)
> unlock(lock1)
> 
> kptr_xchg(map, m) // move to map
> // at this point, refcount = 2
> 					m = kptr_xchg(map, NULL)
> 					lock(lock2)
> lock(lock1)				bpf_rbtree_add(rbtree2, m)
> p = bpf_rbtree_first(rbtree1)			if (!RB_EMPTY_NODE) bpf_obj_drop_impl(m) // A
> bpf_rbtree_remove(rbtree1, p)
> unlock(lock1)
> bpf_obj_drop(p) // B
> 					bpf_refcount_acquire(m) // use-after-free
> 					...
> 
> B will decrement refcount from 1 to 0, after which bpf_refcount_acquire is
> basically performing a use-after-free (when fortunate, one will get a
> WARN_ON_ONCE splat for 0 to 1, otherwise, a silent refcount raise for some
> different object).

Thanks for the detailed feedback here and in the other thread in the series.
I will address the issues you raised ASAP, starting with this one, which I've
confirmed via a repro selftest. Will be sending fixes soon.


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-04-25  6:54 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-15 20:18 [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 1/9] bpf: Remove btf_field_offs, use btf_record's fields instead Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 2/9] bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 3/9] bpf: Support refcounted local kptrs in existing semantics Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 4/9] bpf: Add bpf_refcount_acquire kfunc Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail Dave Marchevsky
2023-04-16  1:11   ` Alexei Starovoitov
2023-04-17 18:08     ` Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 6/9] selftests/bpf: Modify linked_list tests to work with macro-ified inserts Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 7/9] bpf: Migrate bpf_rbtree_remove to possibly fail Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 8/9] bpf: Centralize btf_field-specific initialization logic Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests Dave Marchevsky
2023-04-21 22:17   ` Kumar Kartikeya Dwivedi
2023-04-21 23:49     ` Alexei Starovoitov
2023-04-22  2:06       ` Kumar Kartikeya Dwivedi
2023-04-22  2:18         ` Alexei Starovoitov
2023-04-25  6:53     ` Dave Marchevsky
2023-04-16  0:50 ` [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs patchwork-bot+netdevbpf
2023-04-22  2:03 ` Kumar Kartikeya Dwivedi
2023-04-22  3:21   ` Alexei Starovoitov
2023-04-22 18:42     ` Kumar Kartikeya Dwivedi
2023-04-22 21:25       ` Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).