* [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps
@ 2022-04-09  9:32 Kumar Kartikeya Dwivedi
  2022-04-09  9:32 ` [PATCH bpf-next v4 01/13] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

This set enables storing pointers of a certain type in a BPF map, and extends
the verifier to enforce type safety and lifetime correctness properties.

The infrastructure being added is generic enough to allow storing any kind of
pointer whose type is available using BTF (user or kernel) in the future
(e.g. strongly typed memory allocation in a BPF program). Such pointers are
internally tracked in the verifier as PTR_TO_BTF_ID, but for now the series
limits them to two kinds of pointers obtained from the kernel.

Obviously, use of this feature depends on the map having BTF information.

1. Unreferenced kernel pointer

In this case, there are very few restrictions. The pointer type being stored
must match the type declared in the map value. However, such a pointer, when
loaded from the map, can only be dereferenced, not passed to any in-kernel
helpers or kernel functions available to the program. This is because while
the verifier's exception handling mechanism converts BPF_LDX to PROBE_MEM
loads, which are then handled specially by the JIT implementation, the same
liberty is not available to accesses inside the kernel. By the time the
pointer is passed into a helper, it has no lifetime-related guarantees about
the object it is pointing to, and may well be referencing invalid memory.
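
As a rough sketch, using the __kptr type tag macro added to bpf_helpers.h
later in this series (the map value layout and field names below are
illustrative only, not part of any fixed API):

	#define __kptr __attribute__((btf_type_tag("kptr")))

	struct map_value {
		struct task_struct __kptr *task;
	};

	/* In the BPF program, with v pointing to a map value: */
	struct task_struct *t = v->task; /* t is PTR_TO_BTF_ID_OR_NULL */

	if (t)
		/* Dereference is fine (patched to a PROBE_MEM load), but
		 * t must not be passed to helpers or kfuncs.
		 */
		bpf_printk("pid=%d", t->pid);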

2. Referenced kernel pointer

This case imposes a lot of restrictions on the programmer, to ensure safety.
To transfer the ownership of a reference in the BPF program to the map, the
user must use the bpf_kptr_xchg helper, which returns the old pointer
contained in the map as an acquired reference, and releases the verifier's
reference state for the pointer being exchanged, as it moves into the map.

The returned pointer is a normal PTR_TO_BTF_ID that can be used with
in-kernel helpers and kernel functions callable by the program.
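
For illustration, a minimal sketch of the expected flow; obj_acquire() and
obj_release() are hypothetical stand-ins for whatever acquire/release
functions the program has access to:

	struct map_value {
		struct obj __kptr_ref *ptr;
	};

	struct obj *p, *old;

	p = obj_acquire();               /* acquired reference */
	old = bpf_kptr_xchg(&v->ptr, p); /* reference moves into the map */
	if (old)
		/* The old value comes back as an acquired reference, and
		 * must be released or moved into a map before BPF_EXIT.
		 */
		obj_release(old);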

However, if BPF_LDX is used to load a referenced pointer from the map, it is
still not permitted to pass it to in-kernel helpers or kernel functions. To
obtain a reference usable with helpers, the user must invoke a kfunc helper
which returns a usable reference (which must also eventually be released
before BPF_EXIT, or moved into a map).

Since the load of the pointer (preserving data dependency ordering) must happen
inside the RCU read section, the kfunc helper will take a pointer to the map
value, which must point to the actual pointer of the object whose reference is
to be raised. The type will be verified from the BTF information of the kfunc,
as the prototype must be:

	T *func(T **, ... /* other arguments */);

Then, the verifier checks whether the pointer at that offset in the map value
points to the type T, and permits the call.

This convention is followed so that such helpers may also be called from
sleepable BPF programs, where the RCU read lock is not necessarily held in
the BPF program context; hence the need to pass in a pointer to the actual
pointer, so that the load can be performed inside the RCU read section.
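
A kernel-side sketch of such a kptr_get style kfunc, assuming the object
embeds a refcount (all names here are illustrative; the selftests in this
series add a similar test kfunc in net/bpf/test_run.c):

	/* T *func(T **, ...): raise a reference under RCU, if the object
	 * stored in the map is still alive.
	 */
	struct obj *obj_kptr_get(struct obj **pp)
	{
		struct obj *p;

		rcu_read_lock();
		p = READ_ONCE(*pp);
		if (p && !refcount_inc_not_zero(&p->refcnt))
			p = NULL;
		rcu_read_unlock();
		return p;
	}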

Notes
-----

 * C selftests require https://reviews.llvm.org/D119799 to pass.
 * Unlike BPF timers, kptr is not reset or freed on map_release_uref.
 * Referenced kptr storage is always treated as unsigned long * on kernel side,
   as BPF side cannot mutate it. The storage (8 bytes) is sufficient for both
   32-bit and 64-bit platforms.
 * Use of WRITE_ONCE to reset unreferenced kptr on 32-bit systems is fine:
   the actual pointer is always word sized, so tearing the store into two
   32-bit stores won't be a problem, as the other half is always zeroed out
   (see the sketch below).
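
In other words, the kernel-side reset amounts to roughly the following
sketch, where off is the offset of the 8-byte kptr slot in the map value:

	/* On 32-bit, this u64 store may tear into two 32-bit stores, but
	 * the pointer occupies only one word and the other half of the
	 * slot is always zero, so a concurrent load observes either NULL
	 * or the old valid pointer.
	 */
	WRITE_ONCE(*(u64 *)(map_value + off), 0);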

Changelog:
----------
v3 -> v4
v3: https://lore.kernel.org/bpf/20220320155510.671497-1-memxor@gmail.com

 * Use btf_parse_kptrs, plural kptrs naming (Joanne, Andrii)
 * Remove unused parameters in check_map_kptr_access (Joanne)
 * Handle idx < info_cnt kludge using tmp variable (Andrii)
 * Validate tags always precede modifiers in BTF (Andrii)
   * Split out into https://lore.kernel.org/bpf/20220406004121.282699-1-memxor@gmail.com
 * Store u32 type_id in btf_field_info (Andrii)
 * Use base_type in map_kptr_match_type (Andrii)
 * Free kptr_off_tab when not bpf_capable (Martin)
 * Use PTR_RELEASE flag instead of bools in bpf_func_proto (Joanne)
 * Drop extra reg->off and reg->ref_obj_id checks in map_kptr_match_type (Martin)
 * Use separate u32 and u8 arrays for offs and sizes in off_arr (Andrii)
 * Simplify and remove map->value_size sentinel in copy_map_value (Andrii)
 * Use sort_r to keep both arrays in sync while sorting (Andrii)
 * Rename check_and_free_timers_and_kptr to check_and_free_fields (Andrii)
 * Move dtor prototype checks to registration phase (Alexei)
 * Use ret variable for checking ASSERT_XXX, use shorter strings (Andrii)
 * Fix missing checks for other maps (Jiri)
 * Fix various other nits, and bugs noticed during self review

v2 -> v3
v2: https://lore.kernel.org/bpf/20220317115957.3193097-1-memxor@gmail.com

 * Address comments from Alexei
   * Set name, sz, align in btf_find_field
   * Do idx >= info_cnt check in caller of btf_find_field_*
     * Use extra element in the info_arr to make this safe
   * Remove while loop, reject extra tags
   * Remove cases of defensive programming
   * Move bpf_capable() check to map_check_btf
   * Put check_ptr_off_reg reordering hunk into separate patch
   * Warn for ref_ptr once
   * Make the meta.ref_obj_id == 0 case simpler to read
   * Remove kptr_percpu and kptr_user support, remove their tests
   * Store size of field at offset in off_arr
 * Fix BPF_F_NO_PREALLOC set wrongly for hash map in C selftest
 * Add missing check_mem_reg call for kptr_get kfunc arg#0 check

v1 -> v2
v1: https://lore.kernel.org/bpf/20220220134813.3411982-1-memxor@gmail.com

 * Address comments from Alexei
   * Rename bpf_btf_find_by_name_kind_all to bpf_find_btf_id
   * Reduce indentation level in that function
   * Always take reference regardless of module or vmlinux BTF
   * Also made it the same for btf_get_module_btf
   * Use kptr, kptr_ref, kptr_percpu, kptr_user type tags
   * Don't reserve tag namespace
   * Refactor btf_find_field to be side effect free, allocate and populate
     kptr_off_tab in caller
   * Move module reference to dtor patch
   * Remove support for BPF_XCHG, BPF_CMPXCHG insn
   * Introduce bpf_kptr_xchg helper
   * Embed offset array in struct bpf_map, populate and sort it once
   * Adjust copy_map_value to memcpy directly using this offset array
   * Removed size member from offset array to save space
 * Fix some problems pointed out by kernel test robot
 * Tidy selftests
 * Lots of other minor fixes

Kumar Kartikeya Dwivedi (13):
  bpf: Make btf_find_field more generic
  bpf: Move check_ptr_off_reg before check_map_access
  bpf: Allow storing unreferenced kptr in map
  bpf: Tag argument to be released in bpf_func_proto
  bpf: Allow storing referenced kptr in map
  bpf: Prevent escaping of kptr loaded from maps
  bpf: Adapt copy_map_value for multiple offset case
  bpf: Populate pairs of btf_id and destructor kfunc in btf
  bpf: Wire up freeing of referenced kptr
  bpf: Teach verifier about kptr_get kfunc helpers
  libbpf: Add kptr type tag macros to bpf_helpers.h
  selftests/bpf: Add C tests for kptr
  selftests/bpf: Add verifier tests for kptr

 include/linux/bpf.h                           | 107 +++-
 include/linux/btf.h                           |  23 +
 include/uapi/linux/bpf.h                      |  12 +
 kernel/bpf/arraymap.c                         |  14 +-
 kernel/bpf/btf.c                              | 520 ++++++++++++++++--
 kernel/bpf/hashtab.c                          |  58 +-
 kernel/bpf/helpers.c                          |  21 +
 kernel/bpf/map_in_map.c                       |   5 +-
 kernel/bpf/ringbuf.c                          |   4 +-
 kernel/bpf/syscall.c                          | 249 ++++++++-
 kernel/bpf/verifier.c                         | 368 +++++++++++--
 net/bpf/test_run.c                            |  45 +-
 net/core/filter.c                             |   2 +-
 tools/include/uapi/linux/bpf.h                |  12 +
 tools/lib/bpf/bpf_helpers.h                   |   2 +
 .../selftests/bpf/prog_tests/map_kptr.c       |  37 ++
 tools/testing/selftests/bpf/progs/map_kptr.c  | 190 +++++++
 tools/testing/selftests/bpf/test_verifier.c   |  55 +-
 .../testing/selftests/bpf/verifier/map_kptr.c | 446 +++++++++++++++
 19 files changed, 2014 insertions(+), 156 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/map_kptr.c
 create mode 100644 tools/testing/selftests/bpf/verifier/map_kptr.c

-- 
2.35.1



* [PATCH bpf-next v4 01/13] bpf: Make btf_find_field more generic
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-11 20:20   ` Joanne Koong
  2022-04-09  9:32 ` [PATCH bpf-next v4 02/13] bpf: Move check_ptr_off_reg before check_map_access Kumar Kartikeya Dwivedi
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

The next commit introduces a new field type 'kptr', whose kind will not be
struct, but pointer, and which will not be limited to one offset, but
multiple ones. Make the existing btf_find_struct_field and
btf_find_datasec_var functions amenable to use for finding kptrs in a map
value, by moving the spin_lock and timer specific checks into their own
function.

The alignment and name are checked before the function is called, so it is
the last point where we can skip a field or return an error before the next
loop iteration happens. The name parameter is now optional, and only checked
if it is not NULL. The size and type of the field are meant to be checked
inside the function, and the base type will need to be obtained by skipping
modifiers.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/btf.c | 129 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 96 insertions(+), 33 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 0918a39279f6..db7bf05adfc5 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3163,71 +3163,126 @@ static void btf_struct_log(struct btf_verifier_env *env,
 	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
 }
 
+enum {
+	BTF_FIELD_SPIN_LOCK,
+	BTF_FIELD_TIMER,
+};
+
+struct btf_field_info {
+	u32 off;
+};
+
+static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
+				 u32 off, int sz, struct btf_field_info *info)
+{
+	if (!__btf_type_is_struct(t))
+		return 0;
+	if (t->size != sz)
+		return 0;
+	if (info->off != -ENOENT)
+		/* only one such field is allowed */
+		return -E2BIG;
+	info->off = off;
+	return 0;
+}
+
 static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
-				 const char *name, int sz, int align)
+				 const char *name, int sz, int align, int field_type,
+				 struct btf_field_info *info)
 {
 	const struct btf_member *member;
-	u32 i, off = -ENOENT;
+	u32 i, off;
+	int ret;
 
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
 								    member->type);
-		if (!__btf_type_is_struct(member_type))
-			continue;
-		if (member_type->size != sz)
-			continue;
-		if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
-			continue;
-		if (off != -ENOENT)
-			/* only one such field is allowed */
-			return -E2BIG;
+
 		off = __btf_member_bit_offset(t, member);
+
+		if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
+			continue;
 		if (off % 8)
 			/* valid C code cannot generate such BTF */
 			return -EINVAL;
 		off /= 8;
 		if (off % align)
 			return -EINVAL;
+
+		switch (field_type) {
+		case BTF_FIELD_SPIN_LOCK:
+		case BTF_FIELD_TIMER:
+			ret = btf_find_field_struct(btf, member_type, off, sz, info);
+			if (ret < 0)
+				return ret;
+			break;
+		default:
+			return -EFAULT;
+		}
 	}
-	return off;
+	return 0;
 }
 
 static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
-				const char *name, int sz, int align)
+				const char *name, int sz, int align, int field_type,
+				struct btf_field_info *info)
 {
 	const struct btf_var_secinfo *vsi;
-	u32 i, off = -ENOENT;
+	u32 i, off;
+	int ret;
 
 	for_each_vsi(i, t, vsi) {
 		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
 		const struct btf_type *var_type = btf_type_by_id(btf, var->type);
 
-		if (!__btf_type_is_struct(var_type))
-			continue;
-		if (var_type->size != sz)
+		off = vsi->offset;
+
+		if (name && strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
 			continue;
 		if (vsi->size != sz)
 			continue;
-		if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
-			continue;
-		if (off != -ENOENT)
-			/* only one such field is allowed */
-			return -E2BIG;
-		off = vsi->offset;
 		if (off % align)
 			return -EINVAL;
+
+		switch (field_type) {
+		case BTF_FIELD_SPIN_LOCK:
+		case BTF_FIELD_TIMER:
+			ret = btf_find_field_struct(btf, var_type, off, sz, info);
+			if (ret < 0)
+				return ret;
+			break;
+		default:
+			return -EFAULT;
+		}
 	}
-	return off;
+	return 0;
 }
 
 static int btf_find_field(const struct btf *btf, const struct btf_type *t,
-			  const char *name, int sz, int align)
+			  int field_type, struct btf_field_info *info)
 {
+	const char *name;
+	int sz, align;
+
+	switch (field_type) {
+	case BTF_FIELD_SPIN_LOCK:
+		name = "bpf_spin_lock";
+		sz = sizeof(struct bpf_spin_lock);
+		align = __alignof__(struct bpf_spin_lock);
+		break;
+	case BTF_FIELD_TIMER:
+		name = "bpf_timer";
+		sz = sizeof(struct bpf_timer);
+		align = __alignof__(struct bpf_timer);
+		break;
+	default:
+		return -EFAULT;
+	}
 
 	if (__btf_type_is_struct(t))
-		return btf_find_struct_field(btf, t, name, sz, align);
+		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
 	else if (btf_type_is_datasec(t))
-		return btf_find_datasec_var(btf, t, name, sz, align);
+		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
 	return -EINVAL;
 }
 
@@ -3237,16 +3292,24 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
  */
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
 {
-	return btf_find_field(btf, t, "bpf_spin_lock",
-			      sizeof(struct bpf_spin_lock),
-			      __alignof__(struct bpf_spin_lock));
+	struct btf_field_info info = { .off = -ENOENT };
+	int ret;
+
+	ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
+	if (ret < 0)
+		return ret;
+	return info.off;
 }
 
 int btf_find_timer(const struct btf *btf, const struct btf_type *t)
 {
-	return btf_find_field(btf, t, "bpf_timer",
-			      sizeof(struct bpf_timer),
-			      __alignof__(struct bpf_timer));
+	struct btf_field_info info = { .off = -ENOENT };
+	int ret;
+
+	ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
+	if (ret < 0)
+		return ret;
+	return info.off;
 }
 
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
-- 
2.35.1



* [PATCH bpf-next v4 02/13] bpf: Move check_ptr_off_reg before check_map_access
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
  2022-04-09  9:32 ` [PATCH bpf-next v4 01/13] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-11 20:28   ` Joanne Koong
  2022-04-09  9:32 ` [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Some functions in the next patch want to use this function, and those
functions will be called by check_map_access; hence, move it before
check_map_access.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/verifier.c | 76 +++++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 38 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9c1a02b82ecd..71827d14724a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3469,6 +3469,44 @@ static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno,
 	return 0;
 }
 
+static int __check_ptr_off_reg(struct bpf_verifier_env *env,
+			       const struct bpf_reg_state *reg, int regno,
+			       bool fixed_off_ok)
+{
+	/* Access to this pointer-typed register or passing it to a helper
+	 * is only allowed in its original, unmodified form.
+	 */
+
+	if (reg->off < 0) {
+		verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
+			reg_type_str(env, reg->type), regno, reg->off);
+		return -EACCES;
+	}
+
+	if (!fixed_off_ok && reg->off) {
+		verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
+			reg_type_str(env, reg->type), regno, reg->off);
+		return -EACCES;
+	}
+
+	if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
+		char tn_buf[48];
+
+		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
+		verbose(env, "variable %s access var_off=%s disallowed\n",
+			reg_type_str(env, reg->type), tn_buf);
+		return -EACCES;
+	}
+
+	return 0;
+}
+
+int check_ptr_off_reg(struct bpf_verifier_env *env,
+		      const struct bpf_reg_state *reg, int regno)
+{
+	return __check_ptr_off_reg(env, reg, regno, false);
+}
+
 /* check read/write into a map element with possible variable offset */
 static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			    int off, int size, bool zero_size_allowed)
@@ -3980,44 +4018,6 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
 }
 #endif
 
-static int __check_ptr_off_reg(struct bpf_verifier_env *env,
-			       const struct bpf_reg_state *reg, int regno,
-			       bool fixed_off_ok)
-{
-	/* Access to this pointer-typed register or passing it to a helper
-	 * is only allowed in its original, unmodified form.
-	 */
-
-	if (reg->off < 0) {
-		verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
-			reg_type_str(env, reg->type), regno, reg->off);
-		return -EACCES;
-	}
-
-	if (!fixed_off_ok && reg->off) {
-		verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
-			reg_type_str(env, reg->type), regno, reg->off);
-		return -EACCES;
-	}
-
-	if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
-		char tn_buf[48];
-
-		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
-		verbose(env, "variable %s access var_off=%s disallowed\n",
-			reg_type_str(env, reg->type), tn_buf);
-		return -EACCES;
-	}
-
-	return 0;
-}
-
-int check_ptr_off_reg(struct bpf_verifier_env *env,
-		      const struct bpf_reg_state *reg, int regno)
-{
-	return __check_ptr_off_reg(env, reg, regno, false);
-}
-
 static int __check_buffer_access(struct bpf_verifier_env *env,
 				 const char *buf_info,
 				 const struct bpf_reg_state *reg,
-- 
2.35.1



* [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
  2022-04-09  9:32 ` [PATCH bpf-next v4 01/13] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
  2022-04-09  9:32 ` [PATCH bpf-next v4 02/13] bpf: Move check_ptr_off_reg before check_map_access Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-12  0:32   ` Joanne Koong
  2022-04-13  5:41   ` kernel test robot
  2022-04-09  9:32 ` [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto Kumar Kartikeya Dwivedi
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

This commit introduces a new pointer type 'kptr' which can be embedded in a
map value to hold a PTR_TO_BTF_ID stored by a BPF program during its
invocation. When storing to such a kptr, the BPF program's PTR_TO_BTF_ID
register must have the same type as in the map value's BTF, and loading a
kptr marks the destination register as PTR_TO_BTF_ID with the correct kernel
BTF and BTF ID.

Such kptrs are unreferenced, i.e. by the time another invocation of the BPF
program loads this pointer, the object which the pointer points to may no
longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are patched to
PROBE_MEM loads by the verifier, it is safe to allow the user to still access
such an invalid pointer, but passing such pointers into BPF helpers and
kfuncs should not be permitted. A future patch in this series will close this
gap.

The flexibility offered by allowing programs to dereference such invalid
pointers while being safe at runtime frees the verifier from doing complex
lifetime tracking. As long as the user can ensure that the object remains
valid, they can ensure that the data read from the kernel object is valid.

The user indicates that a certain pointer must be treated as a kptr capable
of accepting stores of PTR_TO_BTF_ID of a certain type, by using the BTF type
tag 'kptr' on the pointed-to type of the pointer. Then, this information is
recorded in the object BTF which will be passed into the kernel by way of the
map's BTF information. The name and kind from the map value BTF are used to
look up the in-kernel type, and the actual BTF and BTF ID are recorded in the
map struct in a new kptr_off_tab member. For now, only storing pointers to
structs is permitted.

An example of this specification is shown below:

	#define __kptr __attribute__((btf_type_tag("kptr")))

	struct map_value {
		...
		struct task_struct __kptr *task;
		...
	};

Then, in a BPF program, the user may store PTR_TO_BTF_ID with the type
task_struct into the map, and then load it later.

Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL: since the
verifier cannot statically know whether the value is NULL, it must treat all
potential loads at that map value offset as loading a possibly NULL pointer.

Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL) are
the instructions allowed to access such a pointer. On BPF_LDX, the
destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX, it is
checked whether the source register type is a PTR_TO_BTF_ID with the same
BTF type as specified in the map BTF. The access size must always be BPF_DW.
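
Concretely, the permitted accesses look roughly as follows (a sketch,
reusing the illustrative map_value layout from the example above, with v
pointing to a map value):

	struct task_struct *t;

	t = v->task;	/* BPF_LDX: t becomes PTR_TO_BTF_ID_OR_NULL */
	v->task = NULL;	/* store of NULL (BPF_ST imm=0) clears the kptr */
	v->task = t;	/* BPF_STX: source must match the declared BTF type */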

For the map in map support, the kptr_off_tab for the outer map is copied
from the inner map's kptr_off_tab. A deep copy was chosen instead of
introducing a refcount on kptr_off_tab, because the copy only needs to be
done when parameterizing using inner_map_fd in the map in map case, and
would hence be unnecessary for all other users.

It is not permitted to use the MAP_FREEZE command or mmap for a BPF map
having kptrs, similar to the bpf_timer case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h     |  29 +++++++-
 include/linux/btf.h     |   2 +
 kernel/bpf/btf.c        | 160 ++++++++++++++++++++++++++++++++++------
 kernel/bpf/map_in_map.c |   5 +-
 kernel/bpf/syscall.c    | 114 +++++++++++++++++++++++++++-
 kernel/bpf/verifier.c   | 116 ++++++++++++++++++++++++++++-
 6 files changed, 399 insertions(+), 27 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bdb5298735ce..e267db260cb7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -155,6 +155,22 @@ struct bpf_map_ops {
 	const struct bpf_iter_seq_info *iter_seq_info;
 };
 
+enum {
+	/* Support at most 8 pointers in a BPF map value */
+	BPF_MAP_VALUE_OFF_MAX = 8,
+};
+
+struct bpf_map_value_off_desc {
+	u32 offset;
+	u32 btf_id;
+	struct btf *btf;
+};
+
+struct bpf_map_value_off {
+	u32 nr_off;
+	struct bpf_map_value_off_desc off[];
+};
+
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -171,6 +187,7 @@ struct bpf_map {
 	u64 map_extra; /* any per-map-type extra fields */
 	u32 map_flags;
 	int spin_lock_off; /* >=0 valid offset, <0 error */
+	struct bpf_map_value_off *kptr_off_tab;
 	int timer_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
@@ -184,7 +201,7 @@ struct bpf_map {
 	char name[BPF_OBJ_NAME_LEN];
 	bool bypass_spec_v1;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
-	/* 14 bytes hole */
+	/* 6 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
@@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
 	return map->timer_off >= 0;
 }
 
+static inline bool map_value_has_kptrs(const struct bpf_map *map)
+{
+	return !IS_ERR_OR_NULL(map->kptr_off_tab);
+}
+
 static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 {
 	if (unlikely(map_value_has_spin_lock(map)))
@@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
 void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
 
+struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
+void bpf_map_free_kptr_off_tab(struct bpf_map *map);
+struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
+bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
+
 struct bpf_map *bpf_map_get(u32 ufd);
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 36bc09b8e890..19c297f9a52f 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 int btf_find_timer(const struct btf *btf, const struct btf_type *t);
+struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
+					  const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index db7bf05adfc5..28b1d9e9124e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3166,9 +3166,16 @@ static void btf_struct_log(struct btf_verifier_env *env,
 enum {
 	BTF_FIELD_SPIN_LOCK,
 	BTF_FIELD_TIMER,
+	BTF_FIELD_KPTR,
+};
+
+enum {
+	BTF_FIELD_IGNORE = 0,
+	BTF_FIELD_FOUND  = 1,
 };
 
 struct btf_field_info {
+	u32 type_id;
 	u32 off;
 };
 
@@ -3176,23 +3183,50 @@ static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
 				 u32 off, int sz, struct btf_field_info *info)
 {
 	if (!__btf_type_is_struct(t))
-		return 0;
+		return BTF_FIELD_IGNORE;
 	if (t->size != sz)
-		return 0;
-	if (info->off != -ENOENT)
-		/* only one such field is allowed */
-		return -E2BIG;
+		return BTF_FIELD_IGNORE;
 	info->off = off;
-	return 0;
+	return BTF_FIELD_FOUND;
+}
+
+static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
+			       u32 off, int sz, struct btf_field_info *info)
+{
+	u32 res_id;
+
+	/* For PTR, sz is always == 8 */
+	if (!btf_type_is_ptr(t))
+		return BTF_FIELD_IGNORE;
+	t = btf_type_by_id(btf, t->type);
+
+	if (!btf_type_is_type_tag(t))
+		return BTF_FIELD_IGNORE;
+	/* Reject extra tags */
+	if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
+		return -EINVAL;
+	if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
+		return -EINVAL;
+
+	/* Get the base type */
+	t = btf_type_skip_modifiers(btf, t->type, &res_id);
+	/* Only pointer to struct is allowed */
+	if (!__btf_type_is_struct(t))
+		return -EINVAL;
+
+	info->type_id = res_id;
+	info->off = off;
+	return BTF_FIELD_FOUND;
 }
 
 static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
 				 const char *name, int sz, int align, int field_type,
-				 struct btf_field_info *info)
+				 struct btf_field_info *info, int info_cnt)
 {
 	const struct btf_member *member;
+	struct btf_field_info tmp;
+	int ret, idx = 0;
 	u32 i, off;
-	int ret;
 
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
@@ -3212,24 +3246,38 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
 		switch (field_type) {
 		case BTF_FIELD_SPIN_LOCK:
 		case BTF_FIELD_TIMER:
-			ret = btf_find_field_struct(btf, member_type, off, sz, info);
+			ret = btf_find_field_struct(btf, member_type, off, sz, idx < info_cnt ?
+						    &info[idx] : &tmp);
+			if (ret < 0)
+				return ret;
+			break;
+		case BTF_FIELD_KPTR:
+			ret = btf_find_field_kptr(btf, member_type, off, sz, idx < info_cnt ?
+						  &info[idx] : &tmp);
 			if (ret < 0)
 				return ret;
 			break;
 		default:
 			return -EFAULT;
 		}
+
+		if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
+			return -E2BIG;
+		else if (ret == BTF_FIELD_IGNORE)
+			continue;
+		++idx;
 	}
-	return 0;
+	return idx;
 }
 
 static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
 				const char *name, int sz, int align, int field_type,
-				struct btf_field_info *info)
+				struct btf_field_info *info, int info_cnt)
 {
 	const struct btf_var_secinfo *vsi;
+	struct btf_field_info tmp;
+	int ret, idx = 0;
 	u32 i, off;
-	int ret;
 
 	for_each_vsi(i, t, vsi) {
 		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
@@ -3247,19 +3295,32 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
 		switch (field_type) {
 		case BTF_FIELD_SPIN_LOCK:
 		case BTF_FIELD_TIMER:
-			ret = btf_find_field_struct(btf, var_type, off, sz, info);
+			ret = btf_find_field_struct(btf, var_type, off, sz, idx < info_cnt ?
+						    &info[idx] : &tmp);
+			if (ret < 0)
+				return ret;
+			break;
+		case BTF_FIELD_KPTR:
+			ret = btf_find_field_kptr(btf, var_type, off, sz, idx < info_cnt ?
+						  &info[idx] : &tmp);
 			if (ret < 0)
 				return ret;
 			break;
 		default:
 			return -EFAULT;
 		}
+
+		if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
+			return -E2BIG;
+		if (ret == BTF_FIELD_IGNORE)
+			continue;
+		++idx;
 	}
-	return 0;
+	return idx;
 }
 
 static int btf_find_field(const struct btf *btf, const struct btf_type *t,
-			  int field_type, struct btf_field_info *info)
+			  int field_type, struct btf_field_info *info, int info_cnt)
 {
 	const char *name;
 	int sz, align;
@@ -3275,14 +3336,19 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
 		sz = sizeof(struct bpf_timer);
 		align = __alignof__(struct bpf_timer);
 		break;
+	case BTF_FIELD_KPTR:
+		name = NULL;
+		sz = sizeof(u64);
+		align = 8;
+		break;
 	default:
 		return -EFAULT;
 	}
 
 	if (__btf_type_is_struct(t))
-		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
+		return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt);
 	else if (btf_type_is_datasec(t))
-		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
+		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt);
 	return -EINVAL;
 }
 
@@ -3292,26 +3358,78 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
  */
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
 {
-	struct btf_field_info info = { .off = -ENOENT };
+	struct btf_field_info info;
 	int ret;
 
-	ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
+	ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info, 1);
 	if (ret < 0)
 		return ret;
+	if (!ret)
+		return -ENOENT;
 	return info.off;
 }
 
 int btf_find_timer(const struct btf *btf, const struct btf_type *t)
 {
-	struct btf_field_info info = { .off = -ENOENT };
+	struct btf_field_info info;
 	int ret;
 
-	ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
+	ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info, 1);
 	if (ret < 0)
 		return ret;
+	if (!ret)
+		return -ENOENT;
 	return info.off;
 }
 
+struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
+					  const struct btf_type *t)
+{
+	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
+	struct bpf_map_value_off *tab;
+	int ret, i, nr_off;
+
+	/* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
+	BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
+
+	ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (!ret)
+		return NULL;
+
+	nr_off = ret;
+	tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
+	if (!tab)
+		return ERR_PTR(-ENOMEM);
+
+	tab->nr_off = 0;
+	for (i = 0; i < nr_off; i++) {
+		const struct btf_type *t;
+		struct btf *off_btf;
+		s32 id;
+
+		t = btf_type_by_id(btf, info_arr[i].type_id);
+		id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
+				     &off_btf);
+		if (id < 0) {
+			ret = id;
+			goto end;
+		}
+
+		tab->off[i].offset = info_arr[i].off;
+		tab->off[i].btf_id = id;
+		tab->off[i].btf = off_btf;
+		tab->nr_off = i + 1;
+	}
+	return tab;
+end:
+	while (tab->nr_off--)
+		btf_put(tab->off[tab->nr_off].btf);
+	kfree(tab);
+	return ERR_PTR(ret);
+}
+
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
 			      u32 type_id, void *data, u8 bits_offset,
 			      struct btf_show *show)
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 5cd8f5277279..135205d0d560 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	inner_map_meta->max_entries = inner_map->max_entries;
 	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
 	inner_map_meta->timer_off = inner_map->timer_off;
+	inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
 	if (inner_map->btf) {
 		btf_get(inner_map->btf);
 		inner_map_meta->btf = inner_map->btf;
@@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 
 void bpf_map_meta_free(struct bpf_map *map_meta)
 {
+	bpf_map_free_kptr_off_tab(map_meta);
 	btf_put(map_meta->btf);
 	kfree(map_meta);
 }
@@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
 		meta0->key_size == meta1->key_size &&
 		meta0->value_size == meta1->value_size &&
 		meta0->timer_off == meta1->timer_off &&
-		meta0->map_flags == meta1->map_flags;
+		meta0->map_flags == meta1->map_flags &&
+		bpf_map_equal_kptr_off_tab(meta0, meta1);
 }
 
 void *bpf_map_fd_get_ptr(struct bpf_map *map,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cdaa1152436a..edfe691284b0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -6,6 +6,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
 #include <linux/bpf_verifier.h>
+#include <linux/bsearch.h>
 #include <linux/btf.h>
 #include <linux/syscalls.h>
 #include <linux/slab.h>
@@ -473,12 +474,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 }
 #endif
 
+static int bpf_map_kptr_off_cmp(const void *a, const void *b)
+{
+	const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
+
+	if (off_desc1->offset < off_desc2->offset)
+		return -1;
+	else if (off_desc1->offset > off_desc2->offset)
+		return 1;
+	return 0;
+}
+
+struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
+{
+	/* Since members are iterated in btf_find_field in increasing order,
+	 * offsets appended to kptr_off_tab are in increasing order, so we can
+	 * do bsearch to find exact match.
+	 */
+	struct bpf_map_value_off *tab;
+
+	if (!map_value_has_kptrs(map))
+		return NULL;
+	tab = map->kptr_off_tab;
+	return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
+}
+
+void bpf_map_free_kptr_off_tab(struct bpf_map *map)
+{
+	struct bpf_map_value_off *tab = map->kptr_off_tab;
+	int i;
+
+	if (!map_value_has_kptrs(map))
+		return;
+	for (i = 0; i < tab->nr_off; i++) {
+		struct btf *btf = tab->off[i].btf;
+
+		btf_put(btf);
+	}
+	kfree(tab);
+	map->kptr_off_tab = NULL;
+}
+
+struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
+{
+	struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
+	int size, i, ret;
+
+	if (!map_value_has_kptrs(map))
+		return ERR_PTR(-ENOENT);
+	/* Do a deep copy of the kptr_off_tab */
+	for (i = 0; i < tab->nr_off; i++)
+		btf_get(tab->off[i].btf);
+
+	size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
+	new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!new_tab) {
+		ret = -ENOMEM;
+		goto end;
+	}
+	memcpy(new_tab, tab, size);
+	return new_tab;
+end:
+	while (i--)
+		btf_put(tab->off[i].btf);
+	return ERR_PTR(ret);
+}
+
+bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
+{
+	struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
+	bool a_has_kptr = map_value_has_kptrs(map_a), b_has_kptr = map_value_has_kptrs(map_b);
+	int size;
+
+	if (!a_has_kptr && !b_has_kptr)
+		return true;
+	if (a_has_kptr != b_has_kptr)
+		return false;
+	if (tab_a->nr_off != tab_b->nr_off)
+		return false;
+	size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
+	return !memcmp(tab_a, tab_b, size);
+}
+
 /* called from workqueue */
 static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 
 	security_bpf_map_free(map);
+	bpf_map_free_kptr_off_tab(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
@@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 	int err;
 
 	if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
-	    map_value_has_timer(map))
+	    map_value_has_timer(map) || map_value_has_kptrs(map))
 		return -ENOTSUPP;
 
 	if (!(vma->vm_flags & VM_SHARED))
@@ -820,9 +904,33 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 			return -EOPNOTSUPP;
 	}
 
-	if (map->ops->map_check_btf)
+	map->kptr_off_tab = btf_parse_kptrs(btf, value_type);
+	if (map_value_has_kptrs(map)) {
+		if (!bpf_capable()) {
+			ret = -EPERM;
+			goto free_map_tab;
+		}
+		if (map->map_flags & BPF_F_RDONLY_PROG) {
+			ret = -EACCES;
+			goto free_map_tab;
+		}
+		if (map->map_type != BPF_MAP_TYPE_HASH &&
+		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY) {
+			ret = -EOPNOTSUPP;
+			goto free_map_tab;
+		}
+	}
+
+	if (map->ops->map_check_btf) {
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
+		if (ret < 0)
+			goto free_map_tab;
+	}
 
+	return ret;
+free_map_tab:
+	bpf_map_free_kptr_off_tab(map);
 	return ret;
 }
 
@@ -1639,7 +1747,7 @@ static int map_freeze(const union bpf_attr *attr)
 		return PTR_ERR(map);
 
 	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
-	    map_value_has_timer(map)) {
+	    map_value_has_timer(map) || map_value_has_kptrs(map)) {
 		fdput(f);
 		return -ENOTSUPP;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 71827d14724a..01d45c5010f9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3507,6 +3507,83 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
 	return __check_ptr_off_reg(env, reg, regno, false);
 }
 
+static int map_kptr_match_type(struct bpf_verifier_env *env,
+			       struct bpf_map_value_off_desc *off_desc,
+			       struct bpf_reg_state *reg, u32 regno)
+{
+	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
+	const char *reg_name = "";
+
+	if (base_type(reg->type) != PTR_TO_BTF_ID || type_flag(reg->type) != PTR_MAYBE_NULL)
+		goto bad_type;
+
+	if (!btf_is_kernel(reg->btf)) {
+		verbose(env, "R%d must point to kernel BTF\n", regno);
+		return -EINVAL;
+	}
+	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
+	reg_name = kernel_type_name(reg->btf, reg->btf_id);
+
+	if (__check_ptr_off_reg(env, reg, regno, true))
+		return -EACCES;
+
+	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
+				  off_desc->btf, off_desc->btf_id))
+		goto bad_type;
+	return 0;
+bad_type:
+	verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
+		reg_type_str(env, reg->type), reg_name);
+	verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
+	return -EINVAL;
+}
+
+static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
+				 int value_regno, int insn_idx,
+				 struct bpf_map_value_off_desc *off_desc)
+{
+	struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
+	int class = BPF_CLASS(insn->code);
+	struct bpf_reg_state *val_reg;
+
+	/* Things we already checked for in check_map_access and caller:
+	 *  - Reject cases where variable offset may touch kptr
+	 *  - size of access (must be BPF_DW)
+	 *  - tnum_is_const(reg->var_off)
+	 *  - off_desc->offset == off + reg->var_off.value
+	 */
+	/* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
+	if (BPF_MODE(insn->code) != BPF_MEM)
+		goto end;
+
+	if (class == BPF_LDX) {
+		val_reg = reg_state(env, value_regno);
+		/* We can simply mark the value_regno receiving the pointer
+		 * value from map as PTR_TO_BTF_ID, with the correct type.
+		 */
+		mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
+				off_desc->btf_id, PTR_MAYBE_NULL);
+		val_reg->id = ++env->id_gen;
+	} else if (class == BPF_STX) {
+		val_reg = reg_state(env, value_regno);
+		if (!register_is_null(val_reg) &&
+		    map_kptr_match_type(env, off_desc, val_reg, value_regno))
+			return -EACCES;
+	} else if (class == BPF_ST) {
+		if (insn->imm) {
+			verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
+				off_desc->offset);
+			return -EACCES;
+		}
+	} else {
+		goto end;
+	}
+	return 0;
+end:
+	verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
+	return -EACCES;
+}
+
 /* check read/write into a map element with possible variable offset */
 static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			    int off, int size, bool zero_size_allowed)
@@ -3545,6 +3622,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 	}
+	if (map_value_has_kptrs(map)) {
+		struct bpf_map_value_off *tab = map->kptr_off_tab;
+		int i;
+
+		for (i = 0; i < tab->nr_off; i++) {
+			u32 p = tab->off[i].offset;
+
+			if (reg->smin_value + off < p + sizeof(u64) &&
+			    p < reg->umax_value + off + size) {
+				if (!tnum_is_const(reg->var_off)) {
+					verbose(env, "kptr access cannot have variable offset\n");
+					return -EACCES;
+				}
+				if (p != off + reg->var_off.value) {
+					verbose(env, "kptr access misaligned expected=%u off=%llu\n",
+						p, off + reg->var_off.value);
+					return -EACCES;
+				}
+				if (size != bpf_size_to_bytes(BPF_DW)) {
+					verbose(env, "kptr access size must be BPF_DW\n");
+					return -EACCES;
+				}
+				break;
+			}
+		}
+	}
 	return err;
 }
 
@@ -4412,6 +4515,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		if (value_regno >= 0)
 			mark_reg_unknown(env, regs, value_regno);
 	} else if (reg->type == PTR_TO_MAP_VALUE) {
+		struct bpf_map_value_off_desc *off_desc = NULL;
+
 		if (t == BPF_WRITE && value_regno >= 0 &&
 		    is_pointer_value(env, value_regno)) {
 			verbose(env, "R%d leaks addr into map\n", value_regno);
@@ -4421,7 +4526,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		if (err)
 			return err;
 		err = check_map_access(env, regno, off, size, false);
-		if (!err && t == BPF_READ && value_regno >= 0) {
+		if (err)
+			return err;
+		if (tnum_is_const(reg->var_off))
+			off_desc = bpf_map_kptr_off_contains(reg->map_ptr,
+							     off + reg->var_off.value);
+		if (off_desc) {
+			err = check_map_kptr_access(env, regno, value_regno, insn_idx, off_desc);
+			if (err)
+				return err;
+		} else if (t == BPF_READ && value_regno >= 0) {
 			struct bpf_map *map = reg->map_ptr;
 
 			/* if map is read-only, track its contents as scalars */
-- 
2.35.1



* [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (2 preceding siblings ...)
  2022-04-09  9:32 ` [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-12 18:16   ` Joanne Koong
  2022-04-09  9:32 ` [PATCH bpf-next v4 05/13] bpf: Allow storing referenced kptr in map Kumar Kartikeya Dwivedi
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Add a new type flag for bpf_arg_type that, when set, tells the verifier that
for a release function, that argument's register will be the one for which
meta.ref_obj_id will be set, and which will then be released using
release_reference. To capture the regno, introduce a new field release_regno
in bpf_call_arg_meta.

This will be required in the next patch, where we may pass either NULL or a
refcounted pointer as an argument to the release function bpf_kptr_xchg.
Releasing only when meta.ref_obj_id is set is not enough, as there is a case
where the type of the argument matches, but ref_obj_id is set to 0. Hence,
we must enforce that whenever meta.ref_obj_id is zero, the register that is
to be released can only be NULL for a release function.
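
The distinction matters for calls like the following sketch (bpf_kptr_xchg
itself is only introduced in the next patch):

	/* p is an acquired reference: meta.ref_obj_id is set, and the
	 * reference is released as it moves into the map.
	 */
	bpf_kptr_xchg(&v->ptr, p);

	/* meta.ref_obj_id == 0: only permitted because the register for
	 * the PTR_RELEASE-tagged argument is NULL.
	 */
	bpf_kptr_xchg(&v->ptr, NULL);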

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h   |  5 ++++-
 kernel/bpf/ringbuf.c  |  4 ++--
 kernel/bpf/verifier.c | 46 ++++++++++++++++++++++++++++++++++++-------
 net/core/filter.c     |  2 +-
 4 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e267db260cb7..a6d1982e8118 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -364,7 +364,10 @@ enum bpf_type_flag {
 	 */
 	MEM_PERCPU		= BIT(4 + BPF_BASE_TYPE_BITS),
 
-	__BPF_TYPE_LAST_FLAG	= MEM_PERCPU,
+	/* Indicates that the pointer argument will be released. */
+	PTR_RELEASE		= BIT(5 + BPF_BASE_TYPE_BITS),
+
+	__BPF_TYPE_LAST_FLAG	= PTR_RELEASE,
 };
 
 /* Max number of base types. */
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 710ba9de12ce..a22c21c0a7ef 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -404,7 +404,7 @@ BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags)
 const struct bpf_func_proto bpf_ringbuf_submit_proto = {
 	.func		= bpf_ringbuf_submit,
 	.ret_type	= RET_VOID,
-	.arg1_type	= ARG_PTR_TO_ALLOC_MEM,
+	.arg1_type	= ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
 	.arg2_type	= ARG_ANYTHING,
 };
 
@@ -417,7 +417,7 @@ BPF_CALL_2(bpf_ringbuf_discard, void *, sample, u64, flags)
 const struct bpf_func_proto bpf_ringbuf_discard_proto = {
 	.func		= bpf_ringbuf_discard,
 	.ret_type	= RET_VOID,
-	.arg1_type	= ARG_PTR_TO_ALLOC_MEM,
+	.arg1_type	= ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
 	.arg2_type	= ARG_ANYTHING,
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 01d45c5010f9..6cc08526e049 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -245,6 +245,7 @@ struct bpf_call_arg_meta {
 	struct bpf_map *map_ptr;
 	bool raw_mode;
 	bool pkt_access;
+	u8 release_regno;
 	int regno;
 	int access_size;
 	int mem_size;
@@ -5300,6 +5301,11 @@ static bool arg_type_is_int_ptr(enum bpf_arg_type type)
 	       type == ARG_PTR_TO_LONG;
 }
 
+static bool arg_type_is_release_ptr(enum bpf_arg_type type)
+{
+	return type & PTR_RELEASE;
+}
+
 static int int_ptr_type_to_size(enum bpf_arg_type type)
 {
 	if (type == ARG_PTR_TO_INT)
@@ -5532,7 +5538,7 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
 		/* Some of the argument types nevertheless require a
 		 * zero register offset.
 		 */
-		if (arg_type != ARG_PTR_TO_ALLOC_MEM)
+		if (base_type(arg_type) != ARG_PTR_TO_ALLOC_MEM)
 			return 0;
 		break;
 	/* All the rest must be rejected, except PTR_TO_BTF_ID which allows
@@ -6124,12 +6130,31 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
 	return true;
 }
 
-static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
+static bool check_release_regno(const struct bpf_func_proto *fn, int func_id,
+				struct bpf_call_arg_meta *meta)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
+		if (arg_type_is_release_ptr(fn->arg_type[i])) {
+			if (!is_release_function(func_id))
+				return false;
+			if (meta->release_regno)
+				return false;
+			meta->release_regno = i + 1;
+		}
+	}
+	return !is_release_function(func_id) || meta->release_regno;
+}
+
+static int check_func_proto(const struct bpf_func_proto *fn, int func_id,
+			    struct bpf_call_arg_meta *meta)
 {
 	return check_raw_mode_ok(fn) &&
 	       check_arg_pair_ok(fn) &&
 	       check_btf_id_ok(fn) &&
-	       check_refcount_ok(fn, func_id) ? 0 : -EINVAL;
+	       check_refcount_ok(fn, func_id) &&
+	       check_release_regno(fn, func_id, meta) ? 0 : -EINVAL;
 }
 
 /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
@@ -6808,7 +6833,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 	memset(&meta, 0, sizeof(meta));
 	meta.pkt_access = fn->pkt_access;
 
-	err = check_func_proto(fn, func_id);
+	err = check_func_proto(fn, func_id, &meta);
 	if (err) {
 		verbose(env, "kernel subsystem misconfigured func %s#%d\n",
 			func_id_name(func_id), func_id);
@@ -6841,8 +6866,17 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return err;
 	}
 
+	regs = cur_regs(env);
+
 	if (is_release_function(func_id)) {
-		err = release_reference(env, meta.ref_obj_id);
+		err = -EINVAL;
+		if (meta.ref_obj_id)
+			err = release_reference(env, meta.ref_obj_id);
+		/* meta.ref_obj_id can only be 0 if register that is meant to be
+		 * released is NULL, which must be > R0.
+		 */
+		else if (meta.release_regno && register_is_null(&regs[meta.release_regno]))
+			err = 0;
 		if (err) {
 			verbose(env, "func %s#%d reference has not been acquired before\n",
 				func_id_name(func_id), func_id);
@@ -6850,8 +6884,6 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		}
 	}
 
-	regs = cur_regs(env);
-
 	switch (func_id) {
 	case BPF_FUNC_tail_call:
 		err = check_reference_leak(env);
diff --git a/net/core/filter.c b/net/core/filter.c
index 143f442a9505..8eb01a997476 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6621,7 +6621,7 @@ static const struct bpf_func_proto bpf_sk_release_proto = {
 	.func		= bpf_sk_release,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON,
+	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON | PTR_RELEASE,
 };
 
 BPF_CALL_5(bpf_xdp_sk_lookup_udp, struct xdp_buff *, ctx,
-- 
2.35.1



* [PATCH bpf-next v4 05/13] bpf: Allow storing referenced kptr in map
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (3 preceding siblings ...)
  2022-04-09  9:32 ` [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-12 23:05   ` Joanne Koong
  2022-04-09  9:32 ` [PATCH bpf-next v4 06/13] bpf: Prevent escaping of kptr loaded from maps Kumar Kartikeya Dwivedi
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Extending the code in the previous commit, introduce referenced kptr
support, which needs to be tagged using the 'kptr_ref' tag instead. Unlike
unreferenced kptrs, referenced kptrs have a lot more restrictions. In
addition to the type matching, only a newly introduced bpf_kptr_xchg helper
is allowed to modify the map value at that offset. This transfers the
referenced pointer being stored into the map, releasing the reference state
for the program, returning the old value, and creating new reference state
for the returned pointer.

Similar to the unreferenced pointer case, the return value for this case
will also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer
must either be eventually released by calling the corresponding release
function, or it must be transferred into another map.

It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
the value, and obtain the old value if any.

BPF_LDX, BPF_STX, and BPF_ST cannot access a referenced kptr. A future
commit will permit using BPF_LDX for such pointers, but will attempt to
make it safe, since the lifetime of the object won't be guaranteed.

There are valid reasons to enforce the restriction of permitting only
bpf_kptr_xchg to operate on a referenced kptr. The pointer value must be
consistent in the face of concurrent modification, and any prior values
contained in the map must also be released before a new one is moved into
the map. To ensure proper transfer of this ownership, bpf_kptr_xchg returns
the old value, which the verifier would require the user to either free or
move into another map, and releases the reference held for the pointer being
moved in.

In the future, the direct BPF_XCHG instruction may also be permitted to work
like the bpf_kptr_xchg helper.
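
A sketch of the intended usage from a BPF program, under the assumption that
the prog_test_ref_kfunc acquire/release test kfuncs from net/bpf/test_run.c
(or an equivalent acquire/release pair) are available to the program:

	struct map_value {
		struct prog_test_ref_kfunc __kptr_ref *ptr;
	};

	struct prog_test_ref_kfunc *p, *old;
	unsigned long arg = 0;

	p = bpf_kfunc_call_test_acquire(&arg);	/* acquired reference */
	if (!p)
		return 0;

	/* Move the new reference into the map; our reference state for p
	 * is released, and the old value comes back as a new reference.
	 */
	old = bpf_kptr_xchg(&v->ptr, p);
	if (old)
		bpf_kfunc_call_test_release(old);

	/* Passing NULL clears the slot and hands back whatever was stored. */
	old = bpf_kptr_xchg(&v->ptr, NULL);
	if (old)
		bpf_kfunc_call_test_release(old);
	return 0;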

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h            |   7 +++
 include/uapi/linux/bpf.h       |  12 ++++
 kernel/bpf/btf.c               |  10 ++-
 kernel/bpf/helpers.c           |  21 +++++++
 kernel/bpf/verifier.c          | 107 +++++++++++++++++++++++++++++----
 tools/include/uapi/linux/bpf.h |  12 ++++
 6 files changed, 155 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a6d1982e8118..bd682c29883a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -160,10 +160,15 @@ enum {
 	BPF_MAP_VALUE_OFF_MAX = 8,
 };
 
+enum {
+	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
+};
+
 struct bpf_map_value_off_desc {
 	u32 offset;
 	u32 btf_id;
 	struct btf *btf;
+	int flags;
 };
 
 struct bpf_map_value_off {
@@ -416,6 +421,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_STACK,	/* pointer to stack */
 	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
 	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
+	ARG_PTR_TO_KPTR,	/* pointer to kptr */
 	__BPF_ARG_TYPE_MAX,
 
 	/* Extended arg_types. */
@@ -425,6 +431,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_SOCKET_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
 	ARG_PTR_TO_ALLOC_MEM_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_ALLOC_MEM,
 	ARG_PTR_TO_STACK_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
+	ARG_PTR_TO_BTF_ID_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
 
 	/* This must be the last entry. Its purpose is to ensure the enum is
 	 * wide enough to hold the higher bits reserved for bpf_type_flag.
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d14b10b85e51..444fe6f1cf35 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5143,6 +5143,17 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void *bpf_kptr_xchg(void *map_value, void *ptr)
+ *	Description
+ *		Exchange kptr at pointer *map_value* with *ptr*, and return the
+ *		old value. *ptr* can be NULL, otherwise it must be a referenced
+ *		pointer which will be released when this helper is called.
+ *	Return
+ *		The old value of kptr (which can be NULL). The returned pointer
+ *		if not NULL, is a reference which must be released using its
+ *		corresponding release function, or moved into a BPF map before
+ *		program exit.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5339,6 +5350,7 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(kptr_xchg),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 28b1d9e9124e..43ea9ed5652e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3177,6 +3177,7 @@ enum {
 struct btf_field_info {
 	u32 type_id;
 	u32 off;
+	int flags;
 };
 
 static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
@@ -3194,6 +3195,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 			       u32 off, int sz, struct btf_field_info *info)
 {
 	u32 res_id;
+	int flags;
 
 	/* For PTR, sz is always == 8 */
 	if (!btf_type_is_ptr(t))
@@ -3205,7 +3207,11 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 	/* Reject extra tags */
 	if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
 		return -EINVAL;
-	if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
+	if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
+		flags = 0;
+	else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off)))
+		flags = BPF_MAP_VALUE_OFF_F_REF;
+	else
 		return -EINVAL;
 
 	/* Get the base type */
@@ -3216,6 +3222,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 
 	info->type_id = res_id;
 	info->off = off;
+	info->flags = flags;
 	return BTF_FIELD_FOUND;
 }
 
@@ -3420,6 +3427,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 		tab->off[i].offset = info_arr[i].off;
 		tab->off[i].btf_id = id;
 		tab->off[i].btf = off_btf;
+		tab->off[i].flags = info_arr[i].flags;
 		tab->nr_off = i + 1;
 	}
 	return tab;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 315053ef6a75..a437d0f0458a 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1374,6 +1374,25 @@ void bpf_timer_cancel_and_free(void *val)
 	kfree(t);
 }
 
+BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
+{
+	unsigned long *kptr = map_value;
+
+	return xchg(kptr, (unsigned long)ptr);
+}
+
+static u32 bpf_kptr_xchg_btf_id;
+
+const struct bpf_func_proto bpf_kptr_xchg_proto = {
+	.func         = bpf_kptr_xchg,
+	.gpl_only     = false,
+	.ret_type     = RET_PTR_TO_BTF_ID_OR_NULL,
+	.ret_btf_id   = &bpf_kptr_xchg_btf_id,
+	.arg1_type    = ARG_PTR_TO_KPTR,
+	.arg2_type    = ARG_PTR_TO_BTF_ID_OR_NULL | PTR_RELEASE,
+	.arg2_btf_id  = &bpf_kptr_xchg_btf_id,
+};
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
@@ -1452,6 +1471,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_timer_start_proto;
 	case BPF_FUNC_timer_cancel:
 		return &bpf_timer_cancel_proto;
+	case BPF_FUNC_kptr_xchg:
+		return &bpf_kptr_xchg_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6cc08526e049..92efe6c3999c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -258,6 +258,7 @@ struct bpf_call_arg_meta {
 	struct btf *ret_btf;
 	u32 ret_btf_id;
 	u32 subprogno;
+	struct bpf_map_value_off_desc *kptr_off_desc;
 };
 
 struct btf *btf_vmlinux;
@@ -480,7 +481,8 @@ static bool is_release_function(enum bpf_func_id func_id)
 {
 	return func_id == BPF_FUNC_sk_release ||
 	       func_id == BPF_FUNC_ringbuf_submit ||
-	       func_id == BPF_FUNC_ringbuf_discard;
+	       func_id == BPF_FUNC_ringbuf_discard ||
+	       func_id == BPF_FUNC_kptr_xchg;
 }
 
 static bool may_be_acquire_function(enum bpf_func_id func_id)
@@ -500,7 +502,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
 	if (func_id == BPF_FUNC_sk_lookup_tcp ||
 	    func_id == BPF_FUNC_sk_lookup_udp ||
 	    func_id == BPF_FUNC_skc_lookup_tcp ||
-	    func_id == BPF_FUNC_ringbuf_reserve)
+	    func_id == BPF_FUNC_ringbuf_reserve ||
+	    func_id == BPF_FUNC_kptr_xchg)
 		return true;
 
 	if (func_id == BPF_FUNC_map_lookup_elem &&
@@ -3525,6 +3528,12 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
 	reg_name = kernel_type_name(reg->btf, reg->btf_id);
 
+	/* For ref_ptr case, release function check should ensure we get one
+	 * referenced PTR_TO_BTF_ID, and that its fixed offset is 0. For the
+	 * normal store of unreferenced kptr, we must ensure var_off is zero.
+	 * Since ref_ptr cannot be accessed directly by BPF insns, checks for
+	 * reg->off and reg->ref_obj_id are not needed here.
+	 */
 	if (__check_ptr_off_reg(env, reg, regno, true))
 		return -EACCES;
 
@@ -3557,6 +3566,12 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 	if (BPF_MODE(insn->code) != BPF_MEM)
 		goto end;
 
+	/* We cannot directly access kptr_ref */
+	if (off_desc->flags & BPF_MAP_VALUE_OFF_F_REF) {
+		verbose(env, "accessing referenced kptr disallowed\n");
+		return -EACCES;
+	}
+
 	if (class == BPF_LDX) {
 		val_reg = reg_state(env, value_regno);
 		/* We can simply mark the value_regno receiving the pointer
@@ -5278,6 +5293,59 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_kptr_func(struct bpf_verifier_env *env, int regno,
+			     struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	struct bpf_map_value_off_desc *off_desc;
+	struct bpf_map *map_ptr = reg->map_ptr;
+	u32 kptr_off;
+	int ret;
+
+	if (!tnum_is_const(reg->var_off)) {
+		verbose(env,
+			"R%d doesn't have constant offset. kptr has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map_ptr->btf) {
+		verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
+			map_ptr->name);
+		return -EINVAL;
+	}
+	if (!map_value_has_kptrs(map_ptr)) {
+		ret = PTR_ERR(map_ptr->kptr_off_tab);
+		if (ret == -E2BIG)
+			verbose(env, "map '%s' has more than %d kptr\n", map_ptr->name,
+				BPF_MAP_VALUE_OFF_MAX);
+		else if (ret == -EEXIST)
+			verbose(env, "map '%s' has repeating kptr BTF tags\n", map_ptr->name);
+		else
+			verbose(env, "map '%s' has no valid kptr\n", map_ptr->name);
+		return -EINVAL;
+	}
+
+	meta->map_ptr = map_ptr;
+	/* Check access for BPF_WRITE */
+	meta->raw_mode = true;
+	ret = check_helper_mem_access(env, regno, sizeof(u64), false, meta);
+	if (ret < 0)
+		return ret;
+
+	kptr_off = reg->off + reg->var_off.value;
+	off_desc = bpf_map_kptr_off_contains(map_ptr, kptr_off);
+	if (!off_desc) {
+		verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
+		return -EACCES;
+	}
+	if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
+		verbose(env, "off=%d kptr isn't referenced kptr\n", kptr_off);
+		return -EACCES;
+	}
+	meta->kptr_off_desc = off_desc;
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return base_type(type) == ARG_PTR_TO_MEM ||
@@ -5418,6 +5486,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
 static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
 static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -5445,11 +5514,13 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_STACK]		= &stack_ptr_types,
 	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
 	[ARG_PTR_TO_TIMER]		= &timer_types,
+	[ARG_PTR_TO_KPTR]		= &kptr_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
 			  enum bpf_arg_type arg_type,
-			  const u32 *arg_btf_id)
+			  const u32 *arg_btf_id,
+			  struct bpf_call_arg_meta *meta)
 {
 	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
 	enum bpf_reg_type expected, type = reg->type;
@@ -5502,8 +5573,11 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
 			arg_btf_id = compatible->btf_id;
 		}
 
-		if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
-					  btf_vmlinux, *arg_btf_id)) {
+		if (meta->func_id == BPF_FUNC_kptr_xchg) {
+			if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno))
+				return -EACCES;
+		} else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
+						 btf_vmlinux, *arg_btf_id)) {
 			verbose(env, "R%d is of type %s but %s is expected\n",
 				regno, kernel_type_name(reg->btf, reg->btf_id),
 				kernel_type_name(btf_vmlinux, *arg_btf_id));
@@ -5613,7 +5687,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 		 */
 		goto skip_type_check;
 
-	err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg]);
+	err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg], meta);
 	if (err)
 		return err;
 
@@ -5778,6 +5852,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "string is not zero-terminated\n");
 			return -EINVAL;
 		}
+	} else if (arg_type == ARG_PTR_TO_KPTR) {
+		if (process_kptr_func(env, regno, meta))
+			return -EACCES;
 	}
 
 	return err;
@@ -6120,10 +6197,10 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
-		if (fn->arg_type[i] == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
+		if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
 			return false;
 
-		if (fn->arg_type[i] != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
+		if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
 			return false;
 	}
 
@@ -7007,21 +7084,25 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			regs[BPF_REG_0].btf_id = meta.ret_btf_id;
 		}
 	} else if (base_type(ret_type) == RET_PTR_TO_BTF_ID) {
+		struct btf *ret_btf;
 		int ret_btf_id;
 
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
-		ret_btf_id = *fn->ret_btf_id;
+		if (func_id == BPF_FUNC_kptr_xchg) {
+			ret_btf = meta.kptr_off_desc->btf;
+			ret_btf_id = meta.kptr_off_desc->btf_id;
+		} else {
+			ret_btf = btf_vmlinux;
+			ret_btf_id = *fn->ret_btf_id;
+		}
 		if (ret_btf_id == 0) {
 			verbose(env, "invalid return type %u of func %s#%d\n",
 				base_type(ret_type), func_id_name(func_id),
 				func_id);
 			return -EINVAL;
 		}
-		/* current BPF helper definitions are only coming from
-		 * built-in code with type IDs from  vmlinux BTF
-		 */
-		regs[BPF_REG_0].btf = btf_vmlinux;
+		regs[BPF_REG_0].btf = ret_btf;
 		regs[BPF_REG_0].btf_id = ret_btf_id;
 	} else {
 		verbose(env, "unknown return type %u of func %s#%d\n",
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d14b10b85e51..444fe6f1cf35 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5143,6 +5143,17 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void *bpf_kptr_xchg(void *map_value, void *ptr)
+ *	Description
+ *		Exchange kptr at pointer *map_value* with *ptr*, and return the
+ *		old value. *ptr* can be NULL, otherwise it must be a referenced
+ *		pointer which will be released when this helper is called.
+ *	Return
+ *		The old value of kptr (which can be NULL). The returned pointer
+ *		if not NULL, is a reference which must be released using its
+ *		corresponding release function, or moved into a BPF map before
+ *		program exit.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5339,6 +5350,7 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(kptr_xchg),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v4 06/13] bpf: Prevent escaping of kptr loaded from maps
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (4 preceding siblings ...)
  2022-04-09  9:32 ` [PATCH bpf-next v4 05/13] bpf: Allow storing referenced kptr in map Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-09  9:32 ` [PATCH bpf-next v4 07/13] bpf: Adapt copy_map_value for multiple offset case Kumar Kartikeya Dwivedi
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

While we can guarantee that even for an unreferenced kptr, the object
it points to being freed etc. can be handled by the verifier's exception
handling (normal loads are patched to PROBE_MEM loads), we still cannot
allow the user to pass these pointers to BPF helpers and kfuncs, because
the same exception handling won't be done for accesses inside the
kernel. The same is true if a referenced pointer is loaded using a
normal load instruction. Since the reference is not guaranteed to be
held while the pointer is used, it must be marked as untrusted.

Hence, introduce a new type flag, PTR_UNTRUSTED, which is used to mark
all registers loading unreferenced and referenced kptr from BPF maps,
and ensure they can never escape the BPF program into the kernel by way
of calling stable/unstable helpers.

In check_ptr_to_btf_access, the !type_may_be_null check to reject type
flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER,
MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first
two are checked inside the function and rejected using a proper error
message, but we still want to allow dereference in the untrusted case.

Also, we make sure to inherit PTR_UNTRUSTED when a chain of pointers is
walked, so that this flag is never dropped once it has been set on a
PTR_TO_BTF_ID (i.e. the trusted to untrusted transition can only go in
one direction).

In convert_ctx_accesses, extend the switch case to consider untrusted
PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM
conversion for BPF_LDX.
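
A sketch of what is now permitted and what is still rejected, reusing
the includes, kfunc declarations, and array_map layout from the sketch
in the previous patch (field names are assumptions):

	SEC("tc")
	int deref_untrusted(struct __sk_buff *ctx)
	{
		struct prog_test_ref_kfunc *p;
		struct map_value *v;
		int key = 0, a;

		v = bpf_map_lookup_elem(&array_map, &key);
		if (!v)
			return 0;
		/* BPF_LDX: marked PTR_TO_BTF_ID | PTR_MAYBE_NULL | PTR_UNTRUSTED */
		p = v->ptr;
		if (!p)
			return 0;
		/* Plain dereference is allowed; it is patched to a
		 * PROBE_MEM load by the verifier.
		 */
		a = p->a;
		/* bpf_kfunc_call_test_release(p) would be rejected here,
		 * since an untrusted pointer must not escape to helpers
		 * or kfuncs.
		 */
		return a;
	}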

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h   | 10 +++++++++-
 kernel/bpf/verifier.c | 35 ++++++++++++++++++++++++++++-------
 2 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bd682c29883a..e9791ecafa5d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -372,7 +372,15 @@ enum bpf_type_flag {
 	/* Indicates that the pointer argument will be released. */
 	PTR_RELEASE		= BIT(5 + BPF_BASE_TYPE_BITS),
 
-	__BPF_TYPE_LAST_FLAG	= PTR_RELEASE,
+	/* PTR is not trusted. This is only used with PTR_TO_BTF_ID, to mark
+	 * unreferenced and referenced kptr loaded from map value using a load
+	 * instruction, so that they can only be dereferenced but not escape the
+	 * BPF program into the kernel (i.e. cannot be passed as arguments to
+	 * kfunc or bpf helpers).
+	 */
+	PTR_UNTRUSTED		= BIT(6 + BPF_BASE_TYPE_BITS),
+
+	__BPF_TYPE_LAST_FLAG	= PTR_UNTRUSTED,
 };
 
 /* Max number of base types. */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 92efe6c3999c..c6cc4180ae45 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -579,6 +579,8 @@ static const char *reg_type_str(struct bpf_verifier_env *env,
 		strncpy(prefix, "user_", 32);
 	if (type & MEM_PERCPU)
 		strncpy(prefix, "percpu_", 32);
+	if (type & PTR_UNTRUSTED)
+		strncpy(prefix, "untrusted_", 32);
 
 	snprintf(env->type_str_buf, TYPE_STR_BUF_LEN, "%s%s%s",
 		 prefix, str[base_type(type)], postfix);
@@ -3516,9 +3518,14 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 			       struct bpf_reg_state *reg, u32 regno)
 {
 	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
+	int perm_flags = PTR_MAYBE_NULL;
 	const char *reg_name = "";
 
-	if (base_type(reg->type) != PTR_TO_BTF_ID || type_flag(reg->type) != PTR_MAYBE_NULL)
+	/* Only unreferenced case accepts untrusted pointers */
+	if (!off_desc->flags)
+		perm_flags |= PTR_UNTRUSTED;
+
+	if (base_type(reg->type) != PTR_TO_BTF_ID || (type_flag(reg->type) & ~perm_flags))
 		goto bad_type;
 
 	if (!btf_is_kernel(reg->btf)) {
@@ -3544,7 +3551,12 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 bad_type:
 	verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
 		reg_type_str(env, reg->type), reg_name);
-	verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
+	verbose(env, "expected=%s%s", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
+	if (!off_desc->flags)
+		verbose(env, " or %s%s\n", reg_type_str(env, PTR_TO_BTF_ID | PTR_UNTRUSTED),
+			targ_name);
+	else
+		verbose(env, "\n");
 	return -EINVAL;
 }
 
@@ -3566,9 +3578,11 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 	if (BPF_MODE(insn->code) != BPF_MEM)
 		goto end;
 
-	/* We cannot directly access kptr_ref */
-	if (off_desc->flags & BPF_MAP_VALUE_OFF_F_REF) {
-		verbose(env, "accessing referenced kptr disallowed\n");
+	/* We only allow loading referenced kptr, since it will be marked as
+	 * untrusted, similar to unreferenced kptr.
+	 */
+	if (class != BPF_LDX && (off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
+		verbose(env, "store to referenced kptr disallowed\n");
 		return -EACCES;
 	}
 
@@ -3578,7 +3592,7 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 		 * value from map as PTR_TO_BTF_ID, with the correct type.
 		 */
 		mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
-				off_desc->btf_id, PTR_MAYBE_NULL);
+				off_desc->btf_id, PTR_MAYBE_NULL | PTR_UNTRUSTED);
 		val_reg->id = ++env->id_gen;
 	} else if (class == BPF_STX) {
 		val_reg = reg_state(env, value_regno);
@@ -4343,6 +4357,12 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 	if (ret < 0)
 		return ret;
 
+	/* If this is an untrusted pointer, all pointers formed by walking it
+	 * also inherit the untrusted flag.
+	 */
+	if (type_flag(reg->type) & PTR_UNTRUSTED)
+		flag |= PTR_UNTRUSTED;
+
 	if (atype == BPF_READ && value_regno >= 0)
 		mark_btf_ld_reg(env, regs, value_regno, ret, reg->btf, btf_id, flag);
 
@@ -13078,7 +13098,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		if (!ctx_access)
 			continue;
 
-		switch (env->insn_aux_data[i + delta].ptr_type) {
+		switch ((int)env->insn_aux_data[i + delta].ptr_type) {
 		case PTR_TO_CTX:
 			if (!ops->convert_ctx_access)
 				continue;
@@ -13095,6 +13115,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			convert_ctx_access = bpf_xdp_sock_convert_ctx_access;
 			break;
 		case PTR_TO_BTF_ID:
+		case PTR_TO_BTF_ID | PTR_UNTRUSTED:
 			if (type == BPF_READ) {
 				insn->code = BPF_LDX | BPF_PROBE_MEM |
 					BPF_SIZE((insn)->code);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v4 07/13] bpf: Adapt copy_map_value for multiple offset case
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (5 preceding siblings ...)
  2022-04-09  9:32 ` [PATCH bpf-next v4 06/13] bpf: Prevent escaping of kptr loaded from maps Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-09  9:32 ` [PATCH bpf-next v4 08/13] bpf: Populate pairs of btf_id and destructor kfunc in btf Kumar Kartikeya Dwivedi
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Since there may now be up to 10 offsets that need handling in
copy_map_value, the manual shuffling and special casing is no longer
going to work. Hence, let's generalise the copy_map_value function by
using a sorted array of offsets to skip regions that must be avoided
while copying into and out of a map value.

When the map is created, we populate the offset array in struct
bpf_map, with one extra element for map->value_size, which is used as
the final offset to subtract the previous offset from. Then,
copy_map_value uses this sorted offset array to memcpy while skipping
the timer, spin lock, and kptr regions.
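
For illustration, a minimal userspace-style sketch of the same
skip-copy walk over a sorted offset array (made-up types and offsets,
mirroring the loop this patch adds to copy_map_value):

	#include <string.h>

	struct off_arr {
		unsigned int cnt;
		unsigned int field_off[10];
		unsigned char field_sz[10];
	};

	/* Copy src to dst, skipping each [field_off, field_off + field_sz)
	 * region; note that the cursor must land past the skipped field,
	 * i.e. at next + field_sz.
	 */
	static void copy_skipping(const struct off_arr *oa, char *dst,
				  const char *src, unsigned int value_size)
	{
		unsigned int curr = 0, i;

		for (i = 0; i < oa->cnt; i++) {
			unsigned int next = oa->field_off[i];

			memcpy(dst + curr, src + curr, next - curr);
			curr = next + oa->field_sz[i];
		}
		memcpy(dst + curr, src + curr, value_size - curr);
	}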

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h  | 56 +++++++++++++++-------------
 kernel/bpf/syscall.c | 88 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 117 insertions(+), 27 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e9791ecafa5d..bd79132c664d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -158,6 +158,9 @@ struct bpf_map_ops {
 enum {
 	/* Support at most 8 pointers in a BPF map value */
 	BPF_MAP_VALUE_OFF_MAX = 8,
+	BPF_MAP_OFF_ARR_MAX   = BPF_MAP_VALUE_OFF_MAX +
+				1 + /* for bpf_spin_lock */
+				1,  /* for bpf_timer */
 };
 
 enum {
@@ -176,6 +179,12 @@ struct bpf_map_value_off {
 	struct bpf_map_value_off_desc off[];
 };
 
+struct bpf_map_off_arr {
+	u32 cnt;
+	u32 field_off[BPF_MAP_OFF_ARR_MAX];
+	u8 field_sz[BPF_MAP_OFF_ARR_MAX];
+};
+
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -204,10 +213,7 @@ struct bpf_map {
 	struct mem_cgroup *memcg;
 #endif
 	char name[BPF_OBJ_NAME_LEN];
-	bool bypass_spec_v1;
-	bool frozen; /* write-once; write-protected by freeze_mutex */
-	/* 6 bytes hole */
-
+	struct bpf_map_off_arr *off_arr;
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
 	 */
@@ -227,6 +233,8 @@ struct bpf_map {
 		bool jited;
 		bool xdp_has_frags;
 	} owner;
+	bool bypass_spec_v1;
+	bool frozen; /* write-once; write-protected by freeze_mutex */
 };
 
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
@@ -250,37 +258,33 @@ static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 		memset(dst + map->spin_lock_off, 0, sizeof(struct bpf_spin_lock));
 	if (unlikely(map_value_has_timer(map)))
 		memset(dst + map->timer_off, 0, sizeof(struct bpf_timer));
+	if (unlikely(map_value_has_kptrs(map))) {
+		struct bpf_map_value_off *tab = map->kptr_off_tab;
+		int i;
+
+		for (i = 0; i < tab->nr_off; i++)
+			*(u64 *)(dst + tab->off[i].offset) = 0;
+	}
 }
 
 /* copy everything but bpf_spin_lock and bpf_timer. There could be one of each. */
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
-	u32 s_off = 0, s_sz = 0, t_off = 0, t_sz = 0;
+	u32 curr_off = 0;
+	int i;
 
-	if (unlikely(map_value_has_spin_lock(map))) {
-		s_off = map->spin_lock_off;
-		s_sz = sizeof(struct bpf_spin_lock);
-	}
-	if (unlikely(map_value_has_timer(map))) {
-		t_off = map->timer_off;
-		t_sz = sizeof(struct bpf_timer);
+	if (likely(!map->off_arr)) {
+		memcpy(dst, src, map->value_size);
+		return;
 	}
 
-	if (unlikely(s_sz || t_sz)) {
-		if (s_off < t_off || !s_sz) {
-			swap(s_off, t_off);
-			swap(s_sz, t_sz);
-		}
-		memcpy(dst, src, t_off);
-		memcpy(dst + t_off + t_sz,
-		       src + t_off + t_sz,
-		       s_off - t_off - t_sz);
-		memcpy(dst + s_off + s_sz,
-		       src + s_off + s_sz,
-		       map->value_size - s_off - s_sz);
-	} else {
-		memcpy(dst, src, map->value_size);
+	for (i = 0; i < map->off_arr->cnt; i++) {
+		u32 next_off = map->off_arr->field_off[i];
+
+		memcpy(dst + curr_off, src + curr_off, next_off - curr_off);
+		curr_off = next_off + map->off_arr->field_sz[i];
 	}
+	memcpy(dst + curr_off, src + curr_off, map->value_size - curr_off);
 }
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index edfe691284b0..481d5bb06203 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -30,6 +30,7 @@
 #include <linux/pgtable.h>
 #include <linux/bpf_lsm.h>
 #include <linux/poll.h>
+#include <linux/sort.h>
 #include <linux/bpf-netns.h>
 #include <linux/rcupdate_trace.h>
 #include <linux/memcontrol.h>
@@ -562,6 +563,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 
 	security_bpf_map_free(map);
+	kfree(map->off_arr);
 	bpf_map_free_kptr_off_tab(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
@@ -851,6 +853,84 @@ int map_check_no_btf(const struct bpf_map *map,
 	return -ENOTSUPP;
 }
 
+static int map_off_arr_cmp(const void *_a, const void *_b, const void *priv)
+{
+	const u32 a = *(const u32 *)_a;
+	const u32 b = *(const u32 *)_b;
+
+	if (a < b)
+		return -1;
+	else if (a > b)
+		return 1;
+	return 0;
+}
+
+static void map_off_arr_swap(void *_a, void *_b, int size, const void *priv)
+{
+	struct bpf_map *map = (struct bpf_map *)priv;
+	u32 *off_base = map->off_arr->field_off;
+	u32 *a = _a, *b = _b;
+	u8 *sz_a, *sz_b;
+
+	sz_a = map->off_arr->field_sz + (a - off_base);
+	sz_b = map->off_arr->field_sz + (b - off_base);
+
+	swap(*a, *b);
+	swap(*sz_a, *sz_b);
+}
+
+static int bpf_map_alloc_off_arr(struct bpf_map *map)
+{
+	bool has_spin_lock = map_value_has_spin_lock(map);
+	bool has_timer = map_value_has_timer(map);
+	bool has_kptrs = map_value_has_kptrs(map);
+	struct bpf_map_off_arr *off_arr;
+	u32 i;
+
+	if (!has_spin_lock && !has_timer && !has_kptrs) {
+		map->off_arr = NULL;
+		return 0;
+	}
+
+	off_arr = kmalloc(sizeof(*map->off_arr), GFP_KERNEL | __GFP_NOWARN);
+	if (!off_arr)
+		return -ENOMEM;
+	map->off_arr = off_arr;
+
+	off_arr->cnt = 0;
+	if (has_spin_lock) {
+		i = off_arr->cnt;
+
+		off_arr->field_off[i] = map->spin_lock_off;
+		off_arr->field_sz[i] = sizeof(struct bpf_spin_lock);
+		off_arr->cnt++;
+	}
+	if (has_timer) {
+		i = off_arr->cnt;
+
+		off_arr->field_off[i] = map->timer_off;
+		off_arr->field_sz[i] = sizeof(struct bpf_timer);
+		off_arr->cnt++;
+	}
+	if (has_kptrs) {
+		struct bpf_map_value_off *tab = map->kptr_off_tab;
+		u32 *off = &off_arr->field_off[off_arr->cnt];
+		u8 *sz = &off_arr->field_sz[off_arr->cnt];
+
+		for (i = 0; i < tab->nr_off; i++) {
+			*off++ = tab->off[i].offset;
+			*sz++ = sizeof(u64);
+		}
+		off_arr->cnt += tab->nr_off;
+	}
+
+	if (off_arr->cnt == 1)
+		return 0;
+	sort_r(off_arr->field_off, off_arr->cnt, sizeof(off_arr->field_off[0]),
+	       map_off_arr_cmp, map_off_arr_swap, map);
+	return 0;
+}
+
 static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 			 u32 btf_key_id, u32 btf_value_id)
 {
@@ -1020,10 +1100,14 @@ static int map_create(union bpf_attr *attr)
 			attr->btf_vmlinux_value_type_id;
 	}
 
-	err = security_bpf_map_alloc(map);
+	err = bpf_map_alloc_off_arr(map);
 	if (err)
 		goto free_map;
 
+	err = security_bpf_map_alloc(map);
+	if (err)
+		goto free_map_off_arr;
+
 	err = bpf_map_alloc_id(map);
 	if (err)
 		goto free_map_sec;
@@ -1046,6 +1130,8 @@ static int map_create(union bpf_attr *attr)
 
 free_map_sec:
 	security_bpf_map_free(map);
+free_map_off_arr:
+	kfree(map->off_arr);
 free_map:
 	btf_put(map->btf);
 	map->ops->map_free(map);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v4 08/13] bpf: Populate pairs of btf_id and destructor kfunc in btf
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (6 preceding siblings ...)
  2022-04-09  9:32 ` [PATCH bpf-next v4 07/13] bpf: Adapt copy_map_value for multiple offset case Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-09  9:32 ` [PATCH bpf-next v4 09/13] bpf: Wire up freeing of referenced kptr Kumar Kartikeya Dwivedi
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

To support storing referenced PTR_TO_BTF_ID in maps, we require
associating a specific BTF ID with a 'destructor' kfunc. This is because
we need to release a live referenced pointer at a certain offset in the
map value from the map destruction path, otherwise we end up leaking
resources.

Hence, introduce support for passing an array of btf_id, kfunc_btf_id
pairs that denote a BTF ID and its associated release function. Then,
add an accessor 'btf_find_dtor_kfunc' which can be used to look up the
destructor kfunc of a certain BTF ID. If found, we can use it to free
the object from the map free path.

The registration of these pairs also serves as a whitelist of structures
which are allowed as referenced PTR_TO_BTF_ID in a BPF map, because
without finding the destructor kfunc, we will bail and return an error.
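
As a sketch, a registration by built-in code would look like this
(using the test types that the selftest patches in this series
register; THIS_MODULE is NULL when the code is not built as a module):

	#include <linux/btf.h>
	#include <linux/btf_ids.h>
	#include <linux/init.h>
	#include <linux/kernel.h>
	#include <linux/module.h>

	BTF_ID_LIST(test_dtor_ids)
	BTF_ID(struct, prog_test_ref_kfunc)
	BTF_ID(func, bpf_kfunc_call_test_release)

	static const struct btf_id_dtor_kfunc test_dtors[] = {
		{
			.btf_id	      = test_dtor_ids[0],
			.kfunc_btf_id = test_dtor_ids[1],
		},
	};

	static int __init test_dtor_init(void)
	{
		return register_btf_id_dtor_kfuncs(test_dtors,
						   ARRAY_SIZE(test_dtors),
						   THIS_MODULE);
	}
	late_initcall(test_dtor_init);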

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/btf.h |  17 +++++++
 kernel/bpf/btf.c    | 108 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 19c297f9a52f..fea424681d66 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -40,6 +40,11 @@ struct btf_kfunc_id_set {
 	};
 };
 
+struct btf_id_dtor_kfunc {
+	u32 btf_id;
+	u32 kfunc_btf_id;
+};
+
 extern const struct file_operations btf_fops;
 
 void btf_get(struct btf *btf);
@@ -346,6 +351,9 @@ bool btf_kfunc_id_set_contains(const struct btf *btf,
 			       enum btf_kfunc_type type, u32 kfunc_btf_id);
 int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
 			      const struct btf_kfunc_id_set *s);
+s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id);
+int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt,
+				struct module *owner);
 #else
 static inline const struct btf_type *btf_type_by_id(const struct btf *btf,
 						    u32 type_id)
@@ -369,6 +377,15 @@ static inline int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
 {
 	return 0;
 }
+static inline s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id)
+{
+	return -ENOENT;
+}
+static inline int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors,
+					      u32 add_cnt, struct module *owner)
+{
+	return 0;
+}
 #endif
 
 #endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 43ea9ed5652e..8d7cdb8a6391 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -207,12 +207,18 @@ enum btf_kfunc_hook {
 
 enum {
 	BTF_KFUNC_SET_MAX_CNT = 32,
+	BTF_DTOR_KFUNC_MAX_CNT = 256,
 };
 
 struct btf_kfunc_set_tab {
 	struct btf_id_set *sets[BTF_KFUNC_HOOK_MAX][BTF_KFUNC_TYPE_MAX];
 };
 
+struct btf_id_dtor_kfunc_tab {
+	u32 cnt;
+	struct btf_id_dtor_kfunc dtors[];
+};
+
 struct btf {
 	void *data;
 	struct btf_type **types;
@@ -228,6 +234,7 @@ struct btf {
 	u32 id;
 	struct rcu_head rcu;
 	struct btf_kfunc_set_tab *kfunc_set_tab;
+	struct btf_id_dtor_kfunc_tab *dtor_kfunc_tab;
 
 	/* split BTF support */
 	struct btf *base_btf;
@@ -1616,8 +1623,19 @@ static void btf_free_kfunc_set_tab(struct btf *btf)
 	btf->kfunc_set_tab = NULL;
 }
 
+static void btf_free_dtor_kfunc_tab(struct btf *btf)
+{
+	struct btf_id_dtor_kfunc_tab *tab = btf->dtor_kfunc_tab;
+
+	if (!tab)
+		return;
+	kfree(tab);
+	btf->dtor_kfunc_tab = NULL;
+}
+
 static void btf_free(struct btf *btf)
 {
+	btf_free_dtor_kfunc_tab(btf);
 	btf_free_kfunc_set_tab(btf);
 	kvfree(btf->types);
 	kvfree(btf->resolved_sizes);
@@ -7021,6 +7039,96 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
 }
 EXPORT_SYMBOL_GPL(register_btf_kfunc_id_set);
 
+s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id)
+{
+	struct btf_id_dtor_kfunc_tab *tab = btf->dtor_kfunc_tab;
+	struct btf_id_dtor_kfunc *dtor;
+
+	if (!tab)
+		return -ENOENT;
+	/* Even though the size of tab->dtors[0] is > sizeof(u32), we only need
+	 * to compare the first u32 with btf_id, so we can reuse btf_id_cmp_func.
+	 */
+	BUILD_BUG_ON(offsetof(struct btf_id_dtor_kfunc, btf_id) != 0);
+	dtor = bsearch(&btf_id, tab->dtors, tab->cnt, sizeof(tab->dtors[0]), btf_id_cmp_func);
+	if (!dtor)
+		return -ENOENT;
+	return dtor->kfunc_btf_id;
+}
+
+/* This function must be invoked only from initcalls/module init functions */
+int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt,
+				struct module *owner)
+{
+	struct btf_id_dtor_kfunc_tab *tab;
+	struct btf *btf;
+	u32 tab_cnt;
+	int ret;
+
+	btf = btf_get_module_btf(owner);
+	if (!btf) {
+		if (!owner && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) {
+			pr_err("missing vmlinux BTF, cannot register dtor kfuncs\n");
+			return -ENOENT;
+		}
+		if (owner && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES)) {
+			pr_err("missing module BTF, cannot register dtor kfuncs\n");
+			return -ENOENT;
+		}
+		return 0;
+	}
+	if (IS_ERR(btf))
+		return PTR_ERR(btf);
+
+	if (add_cnt >= BTF_DTOR_KFUNC_MAX_CNT) {
+		pr_err("cannot register more than %d kfunc destructors\n", BTF_DTOR_KFUNC_MAX_CNT);
+		ret = -E2BIG;
+		goto end;
+	}
+
+	tab = btf->dtor_kfunc_tab;
+	/* Only one call allowed for modules */
+	if (WARN_ON_ONCE(tab && btf_is_module(btf))) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	tab_cnt = tab ? tab->cnt : 0;
+	if (tab_cnt > U32_MAX - add_cnt) {
+		ret = -EOVERFLOW;
+		goto end;
+	}
+	if (tab_cnt + add_cnt >= BTF_DTOR_KFUNC_MAX_CNT) {
+		pr_err("cannot register more than %d kfunc destructors\n", BTF_DTOR_KFUNC_MAX_CNT);
+		ret = -E2BIG;
+		goto end;
+	}
+
+	tab = krealloc(btf->dtor_kfunc_tab,
+		       offsetof(struct btf_id_dtor_kfunc_tab, dtors[tab_cnt + add_cnt]),
+		       GFP_KERNEL | __GFP_NOWARN);
+	if (!tab) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	if (!btf->dtor_kfunc_tab)
+		tab->cnt = 0;
+	btf->dtor_kfunc_tab = tab;
+
+	memcpy(tab->dtors + tab->cnt, dtors, add_cnt * sizeof(tab->dtors[0]));
+	tab->cnt += add_cnt;
+
+	sort(tab->dtors, tab->cnt, sizeof(tab->dtors[0]), btf_id_cmp_func, NULL);
+
+	return 0;
+end:
+	btf_free_dtor_kfunc_tab(btf);
+	btf_put(btf);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(register_btf_id_dtor_kfuncs);
+
 #define MAX_TYPES_ARE_COMPAT_DEPTH 2
 
 static
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v4 09/13] bpf: Wire up freeing of referenced kptr
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (7 preceding siblings ...)
  2022-04-09  9:32 ` [PATCH bpf-next v4 08/13] bpf: Populate pairs of btf_id and destructor kfunc in btf Kumar Kartikeya Dwivedi
@ 2022-04-09  9:32 ` Kumar Kartikeya Dwivedi
  2022-04-09  9:33 ` [PATCH bpf-next v4 10/13] bpf: Teach verifier about kptr_get kfunc helpers Kumar Kartikeya Dwivedi
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.

In this patch, we ensure that the type is sane and capture the function
pointer into the off_desc of the kptr_off_tab for the specific pointer
offset, with the invariant that the dtor pointer is always set when the
'kptr_ref' tag is applied to the pointer's pointee type, which is
indicated by the flag BPF_MAP_VALUE_OFF_F_REF.

Note that only BTF IDs whose destructor kfunc is registered become the
allowed BTF IDs for embedding as a referenced kptr. Hence it serves the
purpose of finding the dtor kfunc BTF ID, as well as acting as a check
against the whitelist of allowed BTF IDs for this purpose.

Finally, wire up the actual freeing of the referenced pointer, if any,
at all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership of a
referenced pointer into it.

The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same is the case with the LRU
map's bpf_lru_push_free/htab_lru_push_free functions, which are extended
to reset unreferenced and free referenced kptrs.

Note that unlike BPF timers, a kptr is not reset or freed when the map
uref drops to zero.
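
For reference, the destructor shape being checked here, sketched with
the test type from the selftests (the teardown body is illustrative,
not the actual implementation):

	/* The prototype must be 'void func(type *)'; any pointer type is
	 * accepted for the argument, since pointer width is the same for
	 * all pointer types on the targets Linux supports.
	 */
	void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p)
	{
		if (p)
			refcount_dec(&p->cnt);	/* illustrative teardown */
	}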

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h   |   4 ++
 include/linux/btf.h   |   2 +
 kernel/bpf/arraymap.c |  14 +++++-
 kernel/bpf/btf.c      | 100 +++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/hashtab.c  |  58 ++++++++++++++++++------
 kernel/bpf/syscall.c  |  57 +++++++++++++++++++++---
 6 files changed, 212 insertions(+), 23 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bd79132c664d..a0a848127007 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/percpu-refcount.h>
 #include <linux/bpfptr.h>
+#include <linux/btf.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -171,6 +172,8 @@ struct bpf_map_value_off_desc {
 	u32 offset;
 	u32 btf_id;
 	struct btf *btf;
+	struct module *module;
+	btf_dtor_kfunc_t dtor;
 	int flags;
 };
 
@@ -1545,6 +1548,7 @@ struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u3
 void bpf_map_free_kptr_off_tab(struct bpf_map *map);
 struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
 bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
+void bpf_map_free_kptrs(struct bpf_map *map, void *map_value);
 
 struct bpf_map *bpf_map_get(u32 ufd);
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
diff --git a/include/linux/btf.h b/include/linux/btf.h
index fea424681d66..f70625dd5bb4 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -45,6 +45,8 @@ struct btf_id_dtor_kfunc {
 	u32 kfunc_btf_id;
 };
 
+typedef void (*btf_dtor_kfunc_t)(void *);
+
 extern const struct file_operations btf_fops;
 
 void btf_get(struct btf *btf);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7f145aefbff8..a84bbca55336 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -287,10 +287,12 @@ static int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key
 	return 0;
 }
 
-static void check_and_free_timer_in_array(struct bpf_array *arr, void *val)
+static void check_and_free_fields(struct bpf_array *arr, void *val)
 {
 	if (unlikely(map_value_has_timer(&arr->map)))
 		bpf_timer_cancel_and_free(val + arr->map.timer_off);
+	if (unlikely(map_value_has_kptrs(&arr->map)))
+		bpf_map_free_kptrs(&arr->map, val);
 }
 
 /* Called from syscall or from eBPF program */
@@ -327,7 +329,7 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 			copy_map_value_locked(map, val, value, false);
 		else
 			copy_map_value(map, val, value);
-		check_and_free_timer_in_array(array, val);
+		check_and_free_fields(array, val);
 	}
 	return 0;
 }
@@ -386,6 +388,7 @@ static void array_map_free_timers(struct bpf_map *map)
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
 
+	/* We don't reset or free kptr on uref dropping to zero. */
 	if (likely(!map_value_has_timer(map)))
 		return;
 
@@ -398,6 +401,13 @@ static void array_map_free_timers(struct bpf_map *map)
 static void array_map_free(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	int i;
+
+	if (unlikely(map_value_has_kptrs(map))) {
+		for (i = 0; i < array->map.max_entries; i++)
+			bpf_map_free_kptrs(map, array->value + array->elem_size * i);
+		bpf_map_free_kptr_off_tab(map);
+	}
 
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8d7cdb8a6391..8978724b0b61 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3412,6 +3412,8 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 {
 	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
 	struct bpf_map_value_off *tab;
+	struct btf *off_btf = NULL;
+	struct module *mod = NULL;
 	int ret, i, nr_off;
 
 	/* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
@@ -3431,7 +3433,6 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 	tab->nr_off = 0;
 	for (i = 0; i < nr_off; i++) {
 		const struct btf_type *t;
-		struct btf *off_btf;
 		s32 id;
 
 		t = btf_type_by_id(btf, info_arr[i].type_id);
@@ -3442,16 +3443,69 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
 			goto end;
 		}
 
+		/* Find and stash the function pointer for the destruction function that
+		 * needs to be eventually invoked from the map free path.
+		 */
+		if (info_arr[i].flags & BPF_MAP_VALUE_OFF_F_REF) {
+			const struct btf_type *dtor_func;
+			const char *dtor_func_name;
+			unsigned long addr;
+			s32 dtor_btf_id;
+
+			/* This call also serves as a whitelist of allowed objects that
+			 * can be used as a referenced pointer and be stored in a map at
+			 * the same time.
+			 */
+			dtor_btf_id = btf_find_dtor_kfunc(off_btf, id);
+			if (dtor_btf_id < 0) {
+				ret = dtor_btf_id;
+				goto end_btf;
+			}
+
+			dtor_func = btf_type_by_id(off_btf, dtor_btf_id);
+			if (!dtor_func) {
+				ret = -ENOENT;
+				goto end_btf;
+			}
+
+			if (btf_is_module(btf)) {
+				mod = btf_try_get_module(off_btf);
+				if (!mod) {
+					ret = -ENXIO;
+					goto end_btf;
+				}
+			}
+
+			/* We already verified dtor_func to be btf_type_is_func
+			 * in register_btf_id_dtor_kfuncs.
+			 */
+			dtor_func_name = __btf_name_by_offset(off_btf, dtor_func->name_off);
+			addr = kallsyms_lookup_name(dtor_func_name);
+			if (!addr) {
+				ret = -EINVAL;
+				goto end_mod;
+			}
+			tab->off[i].dtor = (void *)addr;
+		}
+
 		tab->off[i].offset = info_arr[i].off;
 		tab->off[i].btf_id = id;
 		tab->off[i].btf = off_btf;
+		tab->off[i].module = mod;
 		tab->off[i].flags = info_arr[i].flags;
 		tab->nr_off = i + 1;
 	}
 	return tab;
+end_mod:
+	module_put(mod);
+end_btf:
+	btf_put(off_btf);
 end:
-	while (tab->nr_off--)
+	while (tab->nr_off--) {
 		btf_put(tab->off[tab->nr_off].btf);
+		if (tab->off[tab->nr_off].module)
+			module_put(tab->off[tab->nr_off].module);
+	}
 	kfree(tab);
 	return ERR_PTR(ret);
 }
@@ -7056,6 +7110,43 @@ s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id)
 	return dtor->kfunc_btf_id;
 }
 
+static int btf_check_dtor_kfuncs(struct btf *btf, const struct btf_id_dtor_kfunc *dtors, u32 cnt)
+{
+	const struct btf_type *dtor_func, *dtor_func_proto, *t;
+	const struct btf_param *args;
+	s32 dtor_btf_id;
+	u32 nr_args, i;
+
+	for (i = 0; i < cnt; i++) {
+		dtor_btf_id = dtors[i].kfunc_btf_id;
+
+		dtor_func = btf_type_by_id(btf, dtor_btf_id);
+		if (!dtor_func || !btf_type_is_func(dtor_func))
+			return -EINVAL;
+
+		dtor_func_proto = btf_type_by_id(btf, dtor_func->type);
+		if (!dtor_func_proto || !btf_type_is_func_proto(dtor_func_proto))
+			return -EINVAL;
+
+		/* Make sure the prototype of the destructor kfunc is 'void func(type *)' */
+		t = btf_type_by_id(btf, dtor_func_proto->type);
+		if (!t || !btf_type_is_void(t))
+			return -EINVAL;
+
+		nr_args = btf_type_vlen(dtor_func_proto);
+		if (nr_args != 1)
+			return -EINVAL;
+		args = btf_params(dtor_func_proto);
+		t = btf_type_by_id(btf, args[0].type);
+		/* Allow any pointer type, as width on targets Linux supports
+		 * will be same for all pointer types (i.e. sizeof(void *))
+		 */
+		if (!t || !btf_type_is_ptr(t))
+			return -EINVAL;
+	}
+	return 0;
+}
+
 /* This function must be invoked only from initcalls/module init functions */
 int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt,
 				struct module *owner)
@@ -7086,6 +7177,11 @@ int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_c
 		goto end;
 	}
 
+	/* Ensure that the prototype of dtor kfuncs being registered is sane */
+	ret = btf_check_dtor_kfuncs(btf, dtors, add_cnt);
+	if (ret < 0)
+		goto end;
+
 	tab = btf->dtor_kfunc_tab;
 	/* Only one call allowed for modules */
 	if (WARN_ON_ONCE(tab && btf_is_module(btf))) {
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 65877967f414..d5ef0ae56a61 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -254,6 +254,25 @@ static void htab_free_prealloced_timers(struct bpf_htab *htab)
 	}
 }
 
+static void htab_free_prealloced_kptrs(struct bpf_htab *htab)
+{
+	u32 num_entries = htab->map.max_entries;
+	int i;
+
+	if (likely(!map_value_has_kptrs(&htab->map)))
+		return;
+	if (htab_has_extra_elems(htab))
+		num_entries += num_possible_cpus();
+
+	for (i = 0; i < num_entries; i++) {
+		struct htab_elem *elem;
+
+		elem = get_htab_elem(htab, i);
+		bpf_map_free_kptrs(&htab->map, elem->key + round_up(htab->map.key_size, 8));
+		cond_resched();
+	}
+}
+
 static void htab_free_elems(struct bpf_htab *htab)
 {
 	int i;
@@ -725,12 +744,15 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
 	return insn - insn_buf;
 }
 
-static void check_and_free_timer(struct bpf_htab *htab, struct htab_elem *elem)
+static void check_and_free_fields(struct bpf_htab *htab,
+				  struct htab_elem *elem)
 {
+	void *map_value = elem->key + round_up(htab->map.key_size, 8);
+
 	if (unlikely(map_value_has_timer(&htab->map)))
-		bpf_timer_cancel_and_free(elem->key +
-					  round_up(htab->map.key_size, 8) +
-					  htab->map.timer_off);
+		bpf_timer_cancel_and_free(map_value + htab->map.timer_off);
+	if (unlikely(map_value_has_kptrs(&htab->map)))
+		bpf_map_free_kptrs(&htab->map, map_value);
 }
 
 /* It is called from the bpf_lru_list when the LRU needs to delete
@@ -757,7 +779,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
 		if (l == tgt_l) {
 			hlist_nulls_del_rcu(&l->hash_node);
-			check_and_free_timer(htab, l);
+			check_and_free_fields(htab, l);
 			break;
 		}
 
@@ -829,7 +851,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
-	check_and_free_timer(htab, l);
+	check_and_free_fields(htab, l);
 	kfree(l);
 }
 
@@ -857,7 +879,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 	htab_put_fd_value(htab, l);
 
 	if (htab_is_prealloc(htab)) {
-		check_and_free_timer(htab, l);
+		check_and_free_fields(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		atomic_dec(&htab->count);
@@ -1104,7 +1126,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		if (!htab_is_prealloc(htab))
 			free_htab_elem(htab, l_old);
 		else
-			check_and_free_timer(htab, l_old);
+			check_and_free_fields(htab, l_old);
 	}
 	ret = 0;
 err:
@@ -1114,7 +1136,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 
 static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
 {
-	check_and_free_timer(htab, elem);
+	check_and_free_fields(htab, elem);
 	bpf_lru_push_free(&htab->lru, &elem->lru_node);
 }
 
@@ -1419,8 +1441,14 @@ static void htab_free_malloced_timers(struct bpf_htab *htab)
 		struct hlist_nulls_node *n;
 		struct htab_elem *l;
 
-		hlist_nulls_for_each_entry(l, n, head, hash_node)
-			check_and_free_timer(htab, l);
+		hlist_nulls_for_each_entry(l, n, head, hash_node) {
+			/* We don't reset or free kptr on uref dropping to zero,
+			 * hence just free timer.
+			 */
+			bpf_timer_cancel_and_free(l->key +
+						  round_up(htab->map.key_size, 8) +
+						  htab->map.timer_off);
+		}
 		cond_resched_rcu();
 	}
 	rcu_read_unlock();
@@ -1430,6 +1458,7 @@ static void htab_map_free_timers(struct bpf_map *map)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 
+	/* We don't reset or free kptr on uref dropping to zero. */
 	if (likely(!map_value_has_timer(&htab->map)))
 		return;
 	if (!htab_is_prealloc(htab))
@@ -1453,11 +1482,14 @@ static void htab_map_free(struct bpf_map *map)
 	 * not have executed. Wait for them.
 	 */
 	rcu_barrier();
-	if (!htab_is_prealloc(htab))
+	if (!htab_is_prealloc(htab)) {
 		delete_all_elements(htab);
-	else
+	} else {
+		htab_free_prealloced_kptrs(htab);
 		prealloc_destroy(htab);
+	}
 
+	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 481d5bb06203..de237bfede85 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -508,8 +508,11 @@ void bpf_map_free_kptr_off_tab(struct bpf_map *map)
 	if (!map_value_has_kptrs(map))
 		return;
 	for (i = 0; i < tab->nr_off; i++) {
+		struct module *mod = tab->off[i].module;
 		struct btf *btf = tab->off[i].btf;
 
+		if (mod)
+			module_put(mod);
 		btf_put(btf);
 	}
 	kfree(tab);
@@ -524,8 +527,16 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
 	if (!map_value_has_kptrs(map))
 		return ERR_PTR(-ENOENT);
 	/* Do a deep copy of the kptr_off_tab */
-	for (i = 0; i < tab->nr_off; i++)
-		btf_get(tab->off[i].btf);
+	for (i = 0; i < tab->nr_off; i++) {
+		struct module *mod = tab->off[i].module;
+		struct btf *btf = tab->off[i].btf;
+
+		if (mod && !try_module_get(mod)) {
+			ret = -ENXIO;
+			goto end;
+		}
+		btf_get(btf);
+	}
 
 	size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
 	new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
@@ -536,8 +547,14 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
 	memcpy(new_tab, tab, size);
 	return new_tab;
 end:
-	while (i--)
-		btf_put(tab->off[i].btf);
+	while (i--) {
+		struct module *mod = tab->off[i].module;
+		struct btf *btf = tab->off[i].btf;
+
+		if (mod)
+			module_put(mod);
+		btf_put(btf);
+	}
 	return ERR_PTR(ret);
 }
 
@@ -557,6 +574,33 @@ bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_ma
 	return !memcmp(tab_a, tab_b, size);
 }
 
+/* Caller must ensure map_value_has_kptrs is true. Note that this function can
+ * be called on a map value while the map_value is visible to BPF programs, as
+ * it ensures the correct synchronization, and we already enforce the same using
+ * the bpf_kptr_xchg helper on the BPF program side for referenced kptrs.
+ */
+void bpf_map_free_kptrs(struct bpf_map *map, void *map_value)
+{
+	struct bpf_map_value_off *tab = map->kptr_off_tab;
+	unsigned long *btf_id_ptr;
+	int i;
+
+	for (i = 0; i < tab->nr_off; i++) {
+		struct bpf_map_value_off_desc *off_desc = &tab->off[i];
+		unsigned long old_ptr;
+
+		btf_id_ptr = map_value + off_desc->offset;
+		if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
+			u64 *p = (u64 *)btf_id_ptr;
+
+			WRITE_ONCE(*p, 0);
+			continue;
+		}
+		old_ptr = xchg(btf_id_ptr, 0);
+		off_desc->dtor((void *)old_ptr);
+	}
+}
+
 /* called from workqueue */
 static void bpf_map_free_deferred(struct work_struct *work)
 {
@@ -564,9 +608,10 @@ static void bpf_map_free_deferred(struct work_struct *work)
 
 	security_bpf_map_free(map);
 	kfree(map->off_arr);
-	bpf_map_free_kptr_off_tab(map);
 	bpf_map_release_memcg(map);
-	/* implementation dependent freeing */
+	/* implementation dependent freeing, map_free callback also does
+	 * bpf_map_free_kptr_off_tab, if needed.
+	 */
 	map->ops->map_free(map);
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v4 10/13] bpf: Teach verifier about kptr_get kfunc helpers
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (8 preceding siblings ...)
  2022-04-09  9:32 ` [PATCH bpf-next v4 09/13] bpf: Wire up freeing of referenced kptr Kumar Kartikeya Dwivedi
@ 2022-04-09  9:33 ` Kumar Kartikeya Dwivedi
  2022-04-09  9:33 ` [PATCH bpf-next v4 11/13] libbpf: Add kptr type tag macros to bpf_helpers.h Kumar Kartikeya Dwivedi
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:33 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

We introduce a new style of kfunc helpers, namely *_kptr_get, which
take a pointer to the map value that points to a referenced kernel
pointer contained in the map. Since this is referenced, only
bpf_kptr_xchg from the BPF side and xchg from the kernel side are
allowed to change the current value, and each pointer that resides in
that location would be referenced and RCU protected (this must be kept
in mind while adding kernel types embeddable as referenced kptr in BPF
maps).

This means that if we do the load of the pointer value in an RCU read
section, and find a live pointer, then as long as we hold the RCU read
lock, it won't be freed by a parallel xchg + release operation. This
allows us to implement a safe refcount increment scheme. Hence, enforce
that the first argument of all such kfuncs is a proper PTR_TO_MAP_VALUE
pointing at the right offset to the referenced pointer.

For the rest of the arguments, they are subjected to typical kfunc
argument checks, hence allowing some flexibility in passing more intent
into how the reference should be taken.

For instance, in the case of struct nf_conn, it is not freed until the
RCU grace period ends, but can still be reused for another tuple once
its refcount has dropped to zero. Hence, a bpf_ct_kptr_get helper not
only needs to call refcount_inc_not_zero, but must also do a tuple match
after incrementing the reference, and when it fails to match, put the
reference again and return NULL.

This can be implemented easily if we allow passing additional parameters
to the bpf_ct_kptr_get kfunc, like a struct bpf_sock_tuple * and a
tuple__sz pair.
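
The rough shape such a kfunc would take, sketched for the hypothetical
bpf_ct_kptr_get above (not part of this series; nf_conn refcounting
details abbreviated, tuple match elided):

	struct nf_conn *bpf_ct_kptr_get(struct nf_conn **ptr,
					struct bpf_sock_tuple *tuple,
					u32 tuple__sz)
	{
		struct nf_conn *ct;

		rcu_read_lock();
		/* Load preserving data dependency ordering; the slot is
		 * only mutated with xchg, so a non-NULL value stays live
		 * for the duration of the RCU read section.
		 */
		ct = READ_ONCE(*ptr);
		if (ct && !refcount_inc_not_zero(&ct->ct_general.use))
			ct = NULL;
		rcu_read_unlock();

		/* The entry may have been reused for another tuple:
		 * verify it against 'tuple', and put the reference and
		 * return NULL on mismatch (elided).
		 */
		return ct;
	}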

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/btf.h |  2 ++
 kernel/bpf/btf.c    | 61 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index f70625dd5bb4..2611cea2c2b6 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -17,6 +17,7 @@ enum btf_kfunc_type {
 	BTF_KFUNC_TYPE_ACQUIRE,
 	BTF_KFUNC_TYPE_RELEASE,
 	BTF_KFUNC_TYPE_RET_NULL,
+	BTF_KFUNC_TYPE_KPTR_ACQUIRE,
 	BTF_KFUNC_TYPE_MAX,
 };
 
@@ -35,6 +36,7 @@ struct btf_kfunc_id_set {
 			struct btf_id_set *acquire_set;
 			struct btf_id_set *release_set;
 			struct btf_id_set *ret_null_set;
+			struct btf_id_set *kptr_acquire_set;
 		};
 		struct btf_id_set *sets[BTF_KFUNC_TYPE_MAX];
 	};
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8978724b0b61..ddde28ce0d34 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6033,11 +6033,11 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 	struct bpf_verifier_log *log = &env->log;
 	u32 i, nargs, ref_id, ref_obj_id = 0;
 	bool is_kfunc = btf_is_kernel(btf);
+	bool rel = false, kptr_get = false;
 	const char *func_name, *ref_tname;
 	const struct btf_type *t, *ref_t;
 	const struct btf_param *args;
 	int ref_regno = 0, ret;
-	bool rel = false;
 
 	t = btf_type_by_id(btf, func_id);
 	if (!t || !btf_type_is_func(t)) {
@@ -6063,10 +6063,14 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
-	/* Only kfunc can be release func */
-	if (is_kfunc)
+	if (is_kfunc) {
+		/* Only kfunc can be release func */
 		rel = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog),
 						BTF_KFUNC_TYPE_RELEASE, func_id);
+		kptr_get = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog),
+						     BTF_KFUNC_TYPE_KPTR_ACQUIRE, func_id);
+	}
+
 	/* check that BTF function arguments match actual types that the
 	 * verifier sees.
 	 */
@@ -6095,8 +6099,55 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 		if (ret < 0)
 			return ret;
 
-		if (btf_get_prog_ctx_type(log, btf, t,
-					  env->prog->type, i)) {
+		/* kptr_get is only true for kfunc */
+		if (i == 0 && kptr_get) {
+			struct bpf_map_value_off_desc *off_desc;
+
+			if (reg->type != PTR_TO_MAP_VALUE) {
+				bpf_log(log, "arg#0 expected pointer to map value\n");
+				return -EINVAL;
+			}
+
+			ret = check_mem_reg(env, reg, regno, sizeof(u64));
+			if (ret < 0)
+				return ret;
+
+			/* check_func_arg_reg_off allows var_off for
+			 * PTR_TO_MAP_VALUE, but we need fixed offset to find
+			 * off_desc.
+			 */
+			if (!tnum_is_const(reg->var_off)) {
+				bpf_log(log, "arg#0 must have constant offset\n");
+				return -EINVAL;
+			}
+
+			off_desc = bpf_map_kptr_off_contains(reg->map_ptr, reg->off + reg->var_off.value);
+			if (!off_desc || !(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
+				bpf_log(log, "arg#0 no referenced kptr at map value offset=%llu\n",
+					reg->off + reg->var_off.value);
+				return -EINVAL;
+			}
+
+			if (!btf_type_is_ptr(ref_t)) {
+				bpf_log(log, "arg#0 BTF type must be a double pointer\n");
+				return -EINVAL;
+			}
+
+			ref_t = btf_type_skip_modifiers(btf, ref_t->type, &ref_id);
+			ref_tname = btf_name_by_offset(btf, ref_t->name_off);
+
+			if (!btf_type_is_struct(ref_t)) {
+				bpf_log(log, "kernel function %s args#%d pointer type %s %s is not supported\n",
+					func_name, i, btf_type_str(ref_t), ref_tname);
+				return -EINVAL;
+			}
+			if (!btf_struct_ids_match(log, btf, ref_id, 0, off_desc->btf, off_desc->btf_id)) {
+				bpf_log(log, "kernel function %s args#%d expected pointer to %s %s\n",
+					func_name, i, btf_type_str(ref_t), ref_tname);
+				return -EINVAL;
+			}
+			/* rest of the arguments can be anything, like normal kfunc */
+		} else if (btf_get_prog_ctx_type(log, btf, t, env->prog->type, i)) {
 			/* If function expects ctx type in BTF check that caller
 			 * is passing PTR_TO_CTX.
 			 */
-- 
2.35.1



* [PATCH bpf-next v4 11/13] libbpf: Add kptr type tag macros to bpf_helpers.h
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (9 preceding siblings ...)
  2022-04-09  9:33 ` [PATCH bpf-next v4 10/13] bpf: Teach verifier about kptr_get kfunc helpers Kumar Kartikeya Dwivedi
@ 2022-04-09  9:33 ` Kumar Kartikeya Dwivedi
  2022-04-09  9:33 ` [PATCH bpf-next v4 12/13] selftests/bpf: Add C tests for kptr Kumar Kartikeya Dwivedi
  2022-04-09  9:33 ` [PATCH bpf-next v4 13/13] selftests/bpf: Add verifier " Kumar Kartikeya Dwivedi
  12 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:33 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Include convenience definitions:
__kptr:		Unreferenced BTF ID pointer
__kptr_ref:	Referenced BTF ID pointer

Users can use these macros to tag, directly in the map value definition,
the pointer type meant to be used with the new kptr support.
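
For example, a map value embedding both kinds of kptr could look like
this (task_struct is only an illustration here; the referenced case
additionally requires kernel-side release support for the type):

	struct map_value {
		/* load-only; object may be gone by the time it is read */
		struct task_struct __kptr *unref_task;
		/* managed via bpf_kptr_xchg and kptr_get style kfuncs */
		struct task_struct __kptr_ref *ref_task;
	};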

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 tools/lib/bpf/bpf_helpers.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 44df982d2a5c..bbae9a057bc8 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -149,6 +149,8 @@ enum libbpf_tristate {
 
 #define __kconfig __attribute__((section(".kconfig")))
 #define __ksym __attribute__((section(".ksyms")))
+#define __kptr __attribute__((btf_type_tag("kptr")))
+#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))
 
 #ifndef ___bpf_concat
 #define ___bpf_concat(a, b) a ## b
-- 
2.35.1



* [PATCH bpf-next v4 12/13] selftests/bpf: Add C tests for kptr
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (10 preceding siblings ...)
  2022-04-09  9:33 ` [PATCH bpf-next v4 11/13] libbpf: Add kptr type tag macros to bpf_helpers.h Kumar Kartikeya Dwivedi
@ 2022-04-09  9:33 ` Kumar Kartikeya Dwivedi
  2022-04-09  9:33 ` [PATCH bpf-next v4 13/13] selftests/bpf: Add verifier " Kumar Kartikeya Dwivedi
  12 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:33 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

This uses the __kptr and __kptr_ref macros as well, and exercises the
cases that are supposed to work, since the negative tests live in the
test_verifier suite. Also include some code to test map-in-map support,
verifying that the inner_map_meta matches the kptr_off_tab of the map
added as an element.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/map_kptr.c       |  37 ++++
 tools/testing/selftests/bpf/progs/map_kptr.c  | 190 ++++++++++++++++++
 2 files changed, 227 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/map_kptr.c

diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr.c b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
new file mode 100644
index 000000000000..9e2fbda64a65
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+
+#include "map_kptr.skel.h"
+
+void test_map_kptr(void)
+{
+	struct map_kptr *skel;
+	int key = 0, ret;
+	char buf[24];
+
+	skel = map_kptr__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "map_kptr__open_and_load"))
+		return;
+
+	ret = bpf_map_update_elem(bpf_map__fd(skel->maps.array_map), &key, buf, 0);
+	ASSERT_OK(ret, "array_map update");
+	ret = bpf_map_update_elem(bpf_map__fd(skel->maps.array_map), &key, buf, 0);
+	ASSERT_OK(ret, "array_map update2");
+
+	ret = bpf_map_update_elem(bpf_map__fd(skel->maps.hash_map), &key, buf, 0);
+	ASSERT_OK(ret, "hash_map update");
+	ret = bpf_map_delete_elem(bpf_map__fd(skel->maps.hash_map), &key);
+	ASSERT_OK(ret, "hash_map delete");
+
+	ret = bpf_map_update_elem(bpf_map__fd(skel->maps.hash_malloc_map), &key, buf, 0);
+	ASSERT_OK(ret, "hash_malloc_map update");
+	ret = bpf_map_delete_elem(bpf_map__fd(skel->maps.hash_malloc_map), &key);
+	ASSERT_OK(ret, "hash_malloc_map delete");
+
+	ret = bpf_map_update_elem(bpf_map__fd(skel->maps.lru_hash_map), &key, buf, 0);
+	ASSERT_OK(ret, "lru_hash_map update");
+	ret = bpf_map_delete_elem(bpf_map__fd(skel->maps.lru_hash_map), &key);
+	ASSERT_OK(ret, "lru_hash_map delete");
+
+	map_kptr__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/map_kptr.c b/tools/testing/selftests/bpf/progs/map_kptr.c
new file mode 100644
index 000000000000..1b0e0409eaa5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/map_kptr.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+struct map_value {
+	struct prog_test_ref_kfunc __kptr *unref_ptr;
+	struct prog_test_ref_kfunc __kptr_ref *ref_ptr;
+};
+
+struct array_map {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+} array_map SEC(".maps");
+
+struct hash_map {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+} hash_map SEC(".maps");
+
+struct hash_malloc_map {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} hash_malloc_map SEC(".maps");
+
+struct lru_hash_map {
+	__uint(type, BPF_MAP_TYPE_LRU_HASH);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+} lru_hash_map SEC(".maps");
+
+#define DEFINE_MAP_OF_MAP(map_type, inner_map_type, name)       \
+	struct {                                                \
+		__uint(type, map_type);                         \
+		__uint(max_entries, 1);                         \
+		__uint(key_size, sizeof(int));                  \
+		__uint(value_size, sizeof(int));                \
+		__array(values, struct inner_map_type);         \
+	} name SEC(".maps") = {                                 \
+		.values = { [0] = &inner_map_type },            \
+	}
+
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_map, array_of_array_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_ARRAY_OF_MAPS, hash_map, array_of_hash_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_ARRAY_OF_MAPS, hash_malloc_map, array_of_hash_malloc_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_ARRAY_OF_MAPS, lru_hash_map, array_of_lru_hash_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_HASH_OF_MAPS, array_map, hash_of_array_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_HASH_OF_MAPS, hash_map, hash_of_hash_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_HASH_OF_MAPS, hash_malloc_map, hash_of_hash_malloc_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_HASH_OF_MAPS, lru_hash_map, hash_of_lru_hash_maps);
+
+extern struct prog_test_ref_kfunc *bpf_kfunc_call_test_acquire(unsigned long *sp) __ksym;
+extern struct prog_test_ref_kfunc *
+bpf_kfunc_call_test_kptr_get(struct prog_test_ref_kfunc **p, int a, int b) __ksym;
+extern void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) __ksym;
+
+static void test_kptr_unref(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p;
+
+	p = v->unref_ptr;
+	/* store untrusted_ptr_or_null_ */
+	v->unref_ptr = p;
+	if (!p)
+		return;
+	if (p->a + p->b > 100)
+		return;
+	/* store untrusted_ptr_ */
+	v->unref_ptr = p;
+	/* store NULL */
+	v->unref_ptr = NULL;
+}
+
+static void test_kptr_ref(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p;
+
+	p = v->ref_ptr;
+	/* store ptr_or_null_ */
+	v->unref_ptr = p;
+	if (!p)
+		return;
+	if (p->a + p->b > 100)
+		return;
+	/* store NULL */
+	p = bpf_kptr_xchg(&v->ref_ptr, NULL);
+	if (!p)
+		return;
+	if (p->a + p->b > 100) {
+		bpf_kfunc_call_test_release(p);
+		return;
+	}
+	/* store ptr_ */
+	v->unref_ptr = p;
+	bpf_kfunc_call_test_release(p);
+
+	p = bpf_kfunc_call_test_acquire(&(unsigned long){0});
+	if (!p)
+		return;
+	/* store ptr_ */
+	p = bpf_kptr_xchg(&v->ref_ptr, p);
+	if (!p)
+		return;
+	if (p->a + p->b > 100) {
+		bpf_kfunc_call_test_release(p);
+		return;
+	}
+	bpf_kfunc_call_test_release(p);
+}
+
+static void test_kptr_get(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p;
+
+	p = bpf_kfunc_call_test_kptr_get(&v->ref_ptr, 0, 0);
+	if (!p)
+		return;
+	if (p->a + p->b > 100) {
+		bpf_kfunc_call_test_release(p);
+		return;
+	}
+	bpf_kfunc_call_test_release(p);
+}
+
+static void test_kptr(struct map_value *v)
+{
+	test_kptr_unref(v);
+	test_kptr_ref(v);
+	test_kptr_get(v);
+}
+
+SEC("tc")
+int test_map_kptr(struct __sk_buff *ctx)
+{
+	struct map_value *v;
+	int i, key = 0;
+
+#define TEST(map)					\
+	v = bpf_map_lookup_elem(&map, &key);		\
+	if (!v)						\
+		return 0;				\
+	test_kptr(v)
+
+	TEST(array_map);
+	TEST(hash_map);
+	TEST(hash_malloc_map);
+	TEST(lru_hash_map);
+
+#undef TEST
+	return 0;
+}
+
+SEC("tc")
+int test_map_in_map_kptr(struct __sk_buff *ctx)
+{
+	struct map_value *v;
+	int i, key = 0;
+	void *map;
+
+#define TEST(map_in_map)                                \
+	map = bpf_map_lookup_elem(&map_in_map, &key);   \
+	if (!map)                                       \
+		return 0;                               \
+	v = bpf_map_lookup_elem(map, &key);		\
+	if (!v)						\
+		return 0;				\
+	test_kptr(v)
+
+	TEST(array_of_array_maps);
+	TEST(array_of_hash_maps);
+	TEST(array_of_hash_malloc_maps);
+	TEST(array_of_lru_hash_maps);
+	TEST(hash_of_array_maps);
+	TEST(hash_of_hash_maps);
+	TEST(hash_of_hash_malloc_maps);
+	TEST(hash_of_lru_hash_maps);
+
+#undef TEST
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.35.1



* [PATCH bpf-next v4 13/13] selftests/bpf: Add verifier tests for kptr
  2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (11 preceding siblings ...)
  2022-04-09  9:33 ` [PATCH bpf-next v4 12/13] selftests/bpf: Add C tests for kptr Kumar Kartikeya Dwivedi
@ 2022-04-09  9:33 ` Kumar Kartikeya Dwivedi
  12 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-09  9:33 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Reuse the bpf_prog_test functions to test the support for PTR_TO_BTF_ID
in the BPF map case, including tests that verify implementation sanity
and corner cases.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 net/bpf/test_run.c                            |  45 +-
 tools/testing/selftests/bpf/test_verifier.c   |  55 ++-
 .../testing/selftests/bpf/verifier/map_kptr.c | 446 ++++++++++++++++++
 3 files changed, 539 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/map_kptr.c

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index e7b9c2636d10..29fe32821e7e 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -584,6 +584,12 @@ noinline void bpf_kfunc_call_memb_release(struct prog_test_member *p)
 {
 }
 
+noinline struct prog_test_ref_kfunc *
+bpf_kfunc_call_test_kptr_get(struct prog_test_ref_kfunc **p, int a, int b)
+{
+	return &prog_test_struct;
+}
+
 struct prog_test_pass1 {
 	int x0;
 	struct {
@@ -669,6 +675,7 @@ BTF_ID(func, bpf_kfunc_call_test3)
 BTF_ID(func, bpf_kfunc_call_test_acquire)
 BTF_ID(func, bpf_kfunc_call_test_release)
 BTF_ID(func, bpf_kfunc_call_memb_release)
+BTF_ID(func, bpf_kfunc_call_test_kptr_get)
 BTF_ID(func, bpf_kfunc_call_test_pass_ctx)
 BTF_ID(func, bpf_kfunc_call_test_pass1)
 BTF_ID(func, bpf_kfunc_call_test_pass2)
@@ -682,6 +689,7 @@ BTF_SET_END(test_sk_check_kfunc_ids)
 
 BTF_SET_START(test_sk_acquire_kfunc_ids)
 BTF_ID(func, bpf_kfunc_call_test_acquire)
+BTF_ID(func, bpf_kfunc_call_test_kptr_get)
 BTF_SET_END(test_sk_acquire_kfunc_ids)
 
 BTF_SET_START(test_sk_release_kfunc_ids)
@@ -691,8 +699,13 @@ BTF_SET_END(test_sk_release_kfunc_ids)
 
 BTF_SET_START(test_sk_ret_null_kfunc_ids)
 BTF_ID(func, bpf_kfunc_call_test_acquire)
+BTF_ID(func, bpf_kfunc_call_test_kptr_get)
 BTF_SET_END(test_sk_ret_null_kfunc_ids)
 
+BTF_SET_START(test_sk_kptr_acquire_kfunc_ids)
+BTF_ID(func, bpf_kfunc_call_test_kptr_get)
+BTF_SET_END(test_sk_kptr_acquire_kfunc_ids)
+
 static void *bpf_test_init(const union bpf_attr *kattr, u32 user_size,
 			   u32 size, u32 headroom, u32 tailroom)
 {
@@ -1579,14 +1592,36 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
 
 static const struct btf_kfunc_id_set bpf_prog_test_kfunc_set = {
 	.owner        = THIS_MODULE,
-	.check_set    = &test_sk_check_kfunc_ids,
-	.acquire_set  = &test_sk_acquire_kfunc_ids,
-	.release_set  = &test_sk_release_kfunc_ids,
-	.ret_null_set = &test_sk_ret_null_kfunc_ids,
+	.check_set        = &test_sk_check_kfunc_ids,
+	.acquire_set      = &test_sk_acquire_kfunc_ids,
+	.release_set      = &test_sk_release_kfunc_ids,
+	.ret_null_set     = &test_sk_ret_null_kfunc_ids,
+	.kptr_acquire_set = &test_sk_kptr_acquire_kfunc_ids
 };
 
+BTF_ID_LIST(bpf_prog_test_dtor_kfunc_ids)
+BTF_ID(struct, prog_test_ref_kfunc)
+BTF_ID(func, bpf_kfunc_call_test_release)
+BTF_ID(struct, prog_test_member)
+BTF_ID(func, bpf_kfunc_call_memb_release)
+
 static int __init bpf_prog_test_run_init(void)
 {
-	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_prog_test_kfunc_set);
+	const struct btf_id_dtor_kfunc bpf_prog_test_dtor_kfunc[] = {
+		{
+		  .btf_id       = bpf_prog_test_dtor_kfunc_ids[0],
+		  .kfunc_btf_id = bpf_prog_test_dtor_kfunc_ids[1]
+		},
+		{
+		  .btf_id	= bpf_prog_test_dtor_kfunc_ids[2],
+		  .kfunc_btf_id = bpf_prog_test_dtor_kfunc_ids[3],
+		},
+	};
+	int ret;
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_prog_test_kfunc_set);
+	return ret ?: register_btf_id_dtor_kfuncs(bpf_prog_test_dtor_kfunc,
+						  ARRAY_SIZE(bpf_prog_test_dtor_kfunc),
+						  THIS_MODULE);
 }
 late_initcall(bpf_prog_test_run_init);
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index a2cd236c32eb..372579c9f45e 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -53,7 +53,7 @@
 #define MAX_INSNS	BPF_MAXINSNS
 #define MAX_TEST_INSNS	1000000
 #define MAX_FIXUPS	8
-#define MAX_NR_MAPS	22
+#define MAX_NR_MAPS	23
 #define MAX_TEST_RUNS	8
 #define POINTER_VALUE	0xcafe4all
 #define TEST_DATA_LEN	64
@@ -101,6 +101,7 @@ struct bpf_test {
 	int fixup_map_reuseport_array[MAX_FIXUPS];
 	int fixup_map_ringbuf[MAX_FIXUPS];
 	int fixup_map_timer[MAX_FIXUPS];
+	int fixup_map_kptr[MAX_FIXUPS];
 	struct kfunc_btf_id_pair fixup_kfunc_btf_id[MAX_FIXUPS];
 	/* Expected verifier log output for result REJECT or VERBOSE_ACCEPT.
 	 * Can be a tab-separated sequence of expected strings. An empty string
@@ -621,8 +622,15 @@ static int create_cgroup_storage(bool percpu)
  * struct timer {
  *   struct bpf_timer t;
  * };
+ * struct btf_ptr {
+ *   struct prog_test_ref_kfunc __kptr *ptr;
+ *   struct prog_test_ref_kfunc __kptr_ref *ptr;
+ *   struct prog_test_member __kptr_ref *ptr;
+ * }
  */
-static const char btf_str_sec[] = "\0bpf_spin_lock\0val\0cnt\0l\0bpf_timer\0timer\0t";
+static const char btf_str_sec[] = "\0bpf_spin_lock\0val\0cnt\0l\0bpf_timer\0timer\0t"
+				  "\0btf_ptr\0prog_test_ref_kfunc\0ptr\0kptr\0kptr_ref"
+				  "\0prog_test_member";
 static __u32 btf_raw_types[] = {
 	/* int */
 	BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
@@ -638,6 +646,22 @@ static __u32 btf_raw_types[] = {
 	/* struct timer */                              /* [5] */
 	BTF_TYPE_ENC(35, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 1), 16),
 	BTF_MEMBER_ENC(41, 4, 0), /* struct bpf_timer t; */
+	/* struct prog_test_ref_kfunc */		/* [6] */
+	BTF_STRUCT_ENC(51, 0, 0),
+	BTF_STRUCT_ENC(89, 0, 0),			/* [7] */
+	/* type tag "kptr" */
+	BTF_TYPE_TAG_ENC(75, 6),			/* [8] */
+	/* type tag "kptr_ref" */
+	BTF_TYPE_TAG_ENC(80, 6),			/* [9] */
+	BTF_TYPE_TAG_ENC(80, 7),			/* [10] */
+	BTF_PTR_ENC(8),					/* [11] */
+	BTF_PTR_ENC(9),					/* [12] */
+	BTF_PTR_ENC(10),				/* [13] */
+	/* struct btf_ptr */				/* [14] */
+	BTF_STRUCT_ENC(43, 3, 24),
+	BTF_MEMBER_ENC(71, 11, 0), /* struct prog_test_ref_kfunc __kptr *ptr; */
+	BTF_MEMBER_ENC(71, 12, 64), /* struct prog_test_ref_kfunc __kptr_ref *ptr; */
+	BTF_MEMBER_ENC(71, 13, 128), /* struct prog_test_member __kptr_ref *ptr; */
 };
 
 static int load_btf(void)
@@ -727,6 +751,25 @@ static int create_map_timer(void)
 	return fd;
 }
 
+static int create_map_kptr(void)
+{
+	LIBBPF_OPTS(bpf_map_create_opts, opts,
+		.btf_key_type_id = 1,
+		.btf_value_type_id = 14,
+	);
+	int fd, btf_fd;
+
+	btf_fd = load_btf();
+	if (btf_fd < 0)
+		return -1;
+
+	opts.btf_fd = btf_fd;
+	fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "test_map", 4, 24, 1, &opts);
+	if (fd < 0)
+		printf("Failed to create map with btf_id pointer\n");
+	return fd;
+}
+
 static char bpf_vlog[UINT_MAX >> 8];
 
 static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
@@ -754,6 +797,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	int *fixup_map_reuseport_array = test->fixup_map_reuseport_array;
 	int *fixup_map_ringbuf = test->fixup_map_ringbuf;
 	int *fixup_map_timer = test->fixup_map_timer;
+	int *fixup_map_kptr = test->fixup_map_kptr;
 	struct kfunc_btf_id_pair *fixup_kfunc_btf_id = test->fixup_kfunc_btf_id;
 
 	if (test->fill_helper) {
@@ -947,6 +991,13 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 			fixup_map_timer++;
 		} while (*fixup_map_timer);
 	}
+	if (*fixup_map_kptr) {
+		map_fds[22] = create_map_kptr();
+		do {
+			prog[*fixup_map_kptr].imm = map_fds[22];
+			fixup_map_kptr++;
+		} while (*fixup_map_kptr);
+	}
 
 	/* Patch in kfunc BTF IDs */
 	if (fixup_kfunc_btf_id->kfunc) {
diff --git a/tools/testing/selftests/bpf/verifier/map_kptr.c b/tools/testing/selftests/bpf/verifier/map_kptr.c
new file mode 100644
index 000000000000..609ec2942b4c
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/map_kptr.c
@@ -0,0 +1,446 @@
+/* Common tests */
+{
+	"map_kptr: BPF_ST imm != 0",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "BPF_ST imm must be 0 when storing to kptr at off=0",
+},
+{
+	"map_kptr: size != bpf_size_to_bytes(BPF_DW)",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ST_MEM(BPF_W, BPF_REG_0, 0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "kptr access size must be BPF_DW",
+},
+{
+	"map_kptr: map_value non-const var_off",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2, 0),
+	BPF_JMP_IMM(BPF_JLE, BPF_REG_2, 4, 1),
+	BPF_EXIT_INSN(),
+	BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_2),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_3, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "kptr access cannot have variable offset",
+},
+{
+	"map_kptr: bpf_kptr_xchg non-const var_off",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2, 0),
+	BPF_JMP_IMM(BPF_JLE, BPF_REG_2, 4, 1),
+	BPF_EXIT_INSN(),
+	BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_2),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_3),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 doesn't have constant offset. kptr has to be at the constant offset",
+},
+{
+	"map_kptr: unaligned boundary load/store",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 7),
+	BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "kptr access misaligned expected=0 off=7",
+},
+{
+	"map_kptr: reject var_off != 0",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0),
+	BPF_JMP_IMM(BPF_JLE, BPF_REG_2, 4, 1),
+	BPF_EXIT_INSN(),
+	BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "variable untrusted_ptr_ access var_off=(0x0; 0x7) disallowed",
+},
+/* Tests for unreferenced PTR_TO_BTF_ID */
+{
+	"map_kptr: unref: reject btf_struct_ids_match == false",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "invalid kptr access, R1 type=untrusted_ptr_prog_test_ref_kfunc expected=ptr_prog_test",
+},
+{
+	"map_kptr: unref: loaded pointer marked as untrusted",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R0 invalid mem access 'untrusted_ptr_or_null_'",
+},
+{
+	"map_kptr: unref: correct in kernel type size",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 24),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "access beyond struct prog_test_ref_kfunc at off 24 size 8",
+},
+{
+	"map_kptr: unref: inherit PTR_UNTRUSTED on struct walk",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 16),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_this_cpu_ptr),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 type=untrusted_ptr_ expected=percpu_ptr_",
+},
+{
+	"map_kptr: unref: no reference state created",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = ACCEPT,
+},
+{
+	"map_kptr: unref: bpf_kptr_xchg rejected",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "off=0 kptr isn't referenced kptr",
+},
+{
+	"map_kptr: unref: bpf_kfunc_call_test_kptr_get rejected",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "arg#0 no referenced kptr at map value offset=0",
+	.fixup_kfunc_btf_id = {
+		{ "bpf_kfunc_call_test_kptr_get", 13 },
+	}
+},
+/* Tests for referenced PTR_TO_BTF_ID */
+{
+	"map_kptr: ref: loaded pointer marked as untrusted",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_IMM(BPF_REG_1, 0),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 8),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_this_cpu_ptr),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 type=untrusted_ptr_or_null_ expected=percpu_ptr_",
+},
+{
+	"map_kptr: ref: reject off != 0",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 8),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R2 must have zero offset when passed to release func",
+},
+{
+	"map_kptr: ref: reference state created and released on xchg",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
+	BPF_ST_MEM(BPF_DW, BPF_REG_1, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "Unreleased reference id=5 alloc_insn=20",
+	.fixup_kfunc_btf_id = {
+		{ "bpf_kfunc_call_test_acquire", 15 },
+	}
+},
+{
+	"map_kptr: ref: reject STX",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, 0),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "store to referenced kptr disallowed",
+},
+{
+	"map_kptr: ref: reject ST",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ST_MEM(BPF_DW, BPF_REG_0, 8, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "store to referenced kptr disallowed",
+},
-- 
2.35.1



* Re: [PATCH bpf-next v4 01/13] bpf: Make btf_find_field more generic
  2022-04-09  9:32 ` [PATCH bpf-next v4 01/13] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
@ 2022-04-11 20:20   ` Joanne Koong
  2022-04-12 19:48     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2022-04-11 20:20 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sat, Apr 9, 2022 at 2:32 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> Next commit introduces field type 'kptr' whose kind will not be struct,
> but pointer, and it will not be limited to one offset, but multiple
> ones. Make existing btf_find_struct_field and btf_find_datasec_var
> functions amenable to use for finding kptrs in map value, by moving
> spin_lock and timer specific checks into their own function.
>
> The alignment, and name are checked before the function is called, so it
> is the last point where we can skip field or return an error before the
> next loop iteration happens. The name parameter is now optional, and
> only checked if it is not NULL. Size of the field and type is meant to
> be checked inside the function, and base type will need to be obtained
> by skipping modifiers.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  kernel/bpf/btf.c | 129 +++++++++++++++++++++++++++++++++++------------
>  1 file changed, 96 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 0918a39279f6..db7bf05adfc5 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3163,71 +3163,126 @@ static void btf_struct_log(struct btf_verifier_env *env,
>         btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
>  }
>
> +enum {
> +       BTF_FIELD_SPIN_LOCK,
> +       BTF_FIELD_TIMER,
> +};
> +
> +struct btf_field_info {
> +       u32 off;
> +};
> +
> +static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> +                                u32 off, int sz, struct btf_field_info *info)
> +{
> +       if (!__btf_type_is_struct(t))
> +               return 0;
> +       if (t->size != sz)
> +               return 0;
Do we need these two checks? I think in the original version we did
because we were checking this before doing the name comparison, but
now that the name comparison check is first, if the struct name is a
match, then won't these two things always be true (or if not, then it
seems like we should return -EINVAL)? But maybe I'm missing something
here - I'm still getting more familiar with BTF :)

Also, as a side-note: I personally find this function name
"btf_find_field_struct" confusing since there's also the
"btf_find_struct_field" function that exists. I wonder whether we
should just keep the logic inside btf_find_struct_field instead of
putting it in this separate function?

> +       if (info->off != -ENOENT)
> +               /* only one such field is allowed */
> +               return -E2BIG;
In the future, do you plan to add support for multiple fields? I think
this would be useful for dynptrs as well, so just curious what your
plans for this are.
> +       info->off = off;
> +       return 0;
> +}
> +
>  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> -                                const char *name, int sz, int align)
> +                                const char *name, int sz, int align, int field_type,

What are your thoughts on just passing in field_type in place of name,
sz, and align? As in a function signature like:

static int btf_find_struct_field(const struct btf *btf, const struct
btf_type *t, int field_type, struct btf_field_info *info);

where inside btf_find_struct_field when we do the switch statement on
field_type, we can have the name, sz, and align for each of the
different field types there? That to me seems a bit cleaner where the
descriptors for the field types are all in one place (instead of also
in btf_find_spin_lock() and btf_find_timer() functions) and the
function definition for btf_find_struct_field is more straightforward.
At that point, I don't think we'd even need btf_find_spin_lock() and
btf_find_timer() as functions since it'd be just a straightforward
"btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK/BTF_FIELD_TIMER)" call
instead. Curious to hear your thoughts.
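
Something like this, just to illustrate the shape (untested, with the
member walk elided):

	static int btf_find_struct_field(const struct btf *btf,
					 const struct btf_type *t,
					 int field_type,
					 struct btf_field_info *info)
	{
		const char *name;
		int sz, align;

		switch (field_type) {
		case BTF_FIELD_SPIN_LOCK:
			name = "bpf_spin_lock";
			sz = sizeof(struct bpf_spin_lock);
			align = __alignof__(struct bpf_spin_lock);
			break;
		case BTF_FIELD_TIMER:
			name = "bpf_timer";
			sz = sizeof(struct bpf_timer);
			align = __alignof__(struct bpf_timer);
			break;
		default:
			return -EFAULT;
		}
		/* ... existing for_each_member() walk, using name, sz,
		 * and align resolved above ...
		 */
		return 0;
	}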

nit: should field_type be a u32 since it's an enum? Or should we be
explicit and give the enum a name and define this as something like
"enum btf_field_type type"?

> +                                struct btf_field_info *info)
>  {
>         const struct btf_member *member;
> -       u32 i, off = -ENOENT;
> +       u32 i, off;
> +       int ret;
>
>         for_each_member(i, t, member) {
>                 const struct btf_type *member_type = btf_type_by_id(btf,
>                                                                     member->type);
> -               if (!__btf_type_is_struct(member_type))
> -                       continue;
> -               if (member_type->size != sz)
> -                       continue;
> -               if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> -                       continue;
> -               if (off != -ENOENT)
> -                       /* only one such field is allowed */
> -                       return -E2BIG;
> +
>                 off = __btf_member_bit_offset(t, member);
nit: should this be moved to after the strcmp on the name? Since if
the name doesn't match, there's no point in doing this
__btf_member_bit_offset call
> +
> +               if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
I'm confused by the if (name) part of the check. If name is NULL, then
won't this "btf_find_struct_field" function always return the offset
to the first struct? I don't think name will ever be NULL, so maybe we
should just remove this? Or do something like if (!name) return
-EINVAL; before doing the strcmp?

> +                       continue;
>                 if (off % 8)
>                         /* valid C code cannot generate such BTF */
>                         return -EINVAL;
>                 off /= 8;
>                 if (off % align)
>                         return -EINVAL;
> +
> +               switch (field_type) {
> +               case BTF_FIELD_SPIN_LOCK:
> +               case BTF_FIELD_TIMER:
> +                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
nit: I think we can just do "return btf_find_field_struct(btf,
member_type, off, sz, info);" here and remove the "int ret;"
declaration a few lines above.

> +                       if (ret < 0)
> +                               return ret;
> +                       break;
> +               default:
> +                       return -EFAULT;
> +               }
>         }
> -       return off;
> +       return 0;
>  }
>
>  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> -                               const char *name, int sz, int align)
> +                               const char *name, int sz, int align, int field_type,
> +                               struct btf_field_info *info)
The same comments for the btf_find_struct_field function also apply to
this function
>  {
>         const struct btf_var_secinfo *vsi;
> -       u32 i, off = -ENOENT;
> +       u32 i, off;
> +       int ret;
>
>         for_each_vsi(i, t, vsi) {
>                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
>                 const struct btf_type *var_type = btf_type_by_id(btf, var->type);
>
> -               if (!__btf_type_is_struct(var_type))
> -                       continue;
> -               if (var_type->size != sz)
> +               off = vsi->offset;
> +
> +               if (name && strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
>                         continue;
>                 if (vsi->size != sz)
>                         continue;
> -               if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
> -                       continue;
> -               if (off != -ENOENT)
> -                       /* only one such field is allowed */
> -                       return -E2BIG;
> -               off = vsi->offset;
>                 if (off % align)
>                         return -EINVAL;
> +
> +               switch (field_type) {
> +               case BTF_FIELD_SPIN_LOCK:
> +               case BTF_FIELD_TIMER:
> +                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
> +               default:
> +                       return -EFAULT;
> +               }
>         }
> -       return off;
> +       return 0;
>  }
>
>  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> -                         const char *name, int sz, int align)
> +                         int field_type, struct btf_field_info *info)
>  {
> +       const char *name;
> +       int sz, align;
> +
> +       switch (field_type) {
> +       case BTF_FIELD_SPIN_LOCK:
> +               name = "bpf_spin_lock";
> +               sz = sizeof(struct bpf_spin_lock);
> +               align = __alignof__(struct bpf_spin_lock);
> +               break;
> +       case BTF_FIELD_TIMER:
> +               name = "bpf_timer";
> +               sz = sizeof(struct bpf_timer);
> +               align = __alignof__(struct bpf_timer);
> +               break;
> +       default:
> +               return -EFAULT;
> +       }
>
>         if (__btf_type_is_struct(t))
> -               return btf_find_struct_field(btf, t, name, sz, align);
> +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
>         else if (btf_type_is_datasec(t))
> -               return btf_find_datasec_var(btf, t, name, sz, align);
> +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
>         return -EINVAL;
>  }
>
> @@ -3237,16 +3292,24 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>   */
>  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
>  {
> -       return btf_find_field(btf, t, "bpf_spin_lock",
> -                             sizeof(struct bpf_spin_lock),
> -                             __alignof__(struct bpf_spin_lock));
> +       struct btf_field_info info = { .off = -ENOENT };
> +       int ret;
> +
> +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
I'm confused about why we pass in "struct btf_field_info" as the out
parameter. Maybe I'm missing something here, but why can't
"btf_find_field" just return back the offset?
> +       if (ret < 0)
> +               return ret;
> +       return info.off;
>  }
>
>  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
>  {
> -       return btf_find_field(btf, t, "bpf_timer",
> -                             sizeof(struct bpf_timer),
> -                             __alignof__(struct bpf_timer));
> +       struct btf_field_info info = { .off = -ENOENT };
> +       int ret;
> +
> +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> +       if (ret < 0)
> +               return ret;
> +       return info.off;
>  }
>
>  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> --
> 2.35.1
>


* Re: [PATCH bpf-next v4 02/13] bpf: Move check_ptr_off_reg before check_map_access
  2022-04-09  9:32 ` [PATCH bpf-next v4 02/13] bpf: Move check_ptr_off_reg before check_map_access Kumar Kartikeya Dwivedi
@ 2022-04-11 20:28   ` Joanne Koong
  0 siblings, 0 replies; 29+ messages in thread
From: Joanne Koong @ 2022-04-11 20:28 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sat, Apr 9, 2022 at 1:40 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> Some functions in next patch want to use this function, and those
> functions will be called by check_map_access, hence move it before
> check_map_access.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

LGTM.

Acked-by: Joanne Koong <joannelkoong@gmail.com>

> ---
>  kernel/bpf/verifier.c | 76 +++++++++++++++++++++----------------------
>  1 file changed, 38 insertions(+), 38 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 9c1a02b82ecd..71827d14724a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3469,6 +3469,44 @@ static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno,
>         return 0;
>  }
>
> +static int __check_ptr_off_reg(struct bpf_verifier_env *env,
> +                              const struct bpf_reg_state *reg, int regno,
> +                              bool fixed_off_ok)
> +{
> +       /* Access to this pointer-typed register or passing it to a helper
> +        * is only allowed in its original, unmodified form.
> +        */
> +
> +       if (reg->off < 0) {
> +               verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
> +                       reg_type_str(env, reg->type), regno, reg->off);
> +               return -EACCES;
> +       }
> +
> +       if (!fixed_off_ok && reg->off) {
> +               verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
> +                       reg_type_str(env, reg->type), regno, reg->off);
> +               return -EACCES;
> +       }
> +
> +       if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
> +               char tn_buf[48];
> +
> +               tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> +               verbose(env, "variable %s access var_off=%s disallowed\n",
> +                       reg_type_str(env, reg->type), tn_buf);
> +               return -EACCES;
> +       }
> +
> +       return 0;
> +}
> +
> +int check_ptr_off_reg(struct bpf_verifier_env *env,
> +                     const struct bpf_reg_state *reg, int regno)
> +{
> +       return __check_ptr_off_reg(env, reg, regno, false);
> +}
> +
>  /* check read/write into a map element with possible variable offset */
>  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>                             int off, int size, bool zero_size_allowed)
> @@ -3980,44 +4018,6 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
>  }
>  #endif
>
> -static int __check_ptr_off_reg(struct bpf_verifier_env *env,
> -                              const struct bpf_reg_state *reg, int regno,
> -                              bool fixed_off_ok)
> -{
> -       /* Access to this pointer-typed register or passing it to a helper
> -        * is only allowed in its original, unmodified form.
> -        */
> -
> -       if (reg->off < 0) {
> -               verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
> -                       reg_type_str(env, reg->type), regno, reg->off);
> -               return -EACCES;
> -       }
> -
> -       if (!fixed_off_ok && reg->off) {
> -               verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
> -                       reg_type_str(env, reg->type), regno, reg->off);
> -               return -EACCES;
> -       }
> -
> -       if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
> -               char tn_buf[48];
> -
> -               tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> -               verbose(env, "variable %s access var_off=%s disallowed\n",
> -                       reg_type_str(env, reg->type), tn_buf);
> -               return -EACCES;
> -       }
> -
> -       return 0;
> -}
> -
> -int check_ptr_off_reg(struct bpf_verifier_env *env,
> -                     const struct bpf_reg_state *reg, int regno)
> -{
> -       return __check_ptr_off_reg(env, reg, regno, false);
> -}
> -
>  static int __check_buffer_access(struct bpf_verifier_env *env,
>                                  const char *buf_info,
>                                  const struct bpf_reg_state *reg,
> --
> 2.35.1
>


* Re: [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map
  2022-04-09  9:32 ` [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
@ 2022-04-12  0:32   ` Joanne Koong
  2022-04-12 19:16     ` Kumar Kartikeya Dwivedi
  2022-04-13  5:41   ` kernel test robot
  1 sibling, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2022-04-12  0:32 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sat, Apr 9, 2022 at 6:18 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> This commit introduces a new pointer type 'kptr' which can be embedded
> in a map value to hold a PTR_TO_BTF_ID stored by a BPF program during
> its invocation. When storing to such a kptr, the BPF program's
> PTR_TO_BTF_ID register must have the same type as in the map value's
> BTF, and loading a kptr marks the destination register as PTR_TO_BTF_ID
> with the correct kernel BTF and BTF ID.
>
> Such kptrs are unreferenced, i.e. by the time another invocation of the
> BPF program loads this pointer, the object which the pointer points to
> may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> patched to PROBE_MEM loads by the verifier, it is safe to allow the
> user to still access such an invalid pointer, but passing such pointers
> into BPF helpers and kfuncs should not be permitted. A future patch in
> this series will close this gap.
>
> The flexibility offered by allowing programs to dereference such invalid
> pointers while being safe at runtime frees the verifier from doing
> complex lifetime tracking. As long as the user can ensure that the
> object remains valid, the data read from the kernel object will also be
> valid.
>
> The user indicates that a certain pointer must be treated as a kptr
> capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> a BTF type tag 'kptr' on the pointed-to type of the pointer. Then, this
> information is recorded in the object BTF which will be passed into the
> kernel by way of the map's BTF information. The name and kind from the
> map value BTF are used to look up the in-kernel type, and the actual BTF
> and BTF ID are recorded in the map struct in a new kptr_off_tab member.
> For now, only storing pointers to structs is permitted.
>
> An example of this specification is shown below:
>
>         #define __kptr __attribute__((btf_type_tag("kptr")))
>
>         struct map_value {
>                 ...
>                 struct task_struct __kptr *task;
>                 ...
>         };
>
> Then, in a BPF program, the user may store a PTR_TO_BTF_ID with the
> type task_struct into the map, and then load it later.
>
> Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL; as
> the verifier cannot statically know whether the value is NULL, it must
> treat all potential loads at that map value offset as loading a
> possibly NULL pointer.
>
> Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL)
> are the instructions allowed to access such a pointer. On BPF_LDX, the
> destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> it is checked whether the source register type is a PTR_TO_BTF_ID with
> the same BTF type as specified in the map BTF. The access size must
> always be BPF_DW.
>
> For the map-in-map support, the kptr_off_tab for the outer map is copied
> from the inner map's kptr_off_tab. A deep copy was chosen instead of
> introducing a refcount on kptr_off_tab, because the copy only needs to
> be done when parameterizing using inner_map_fd in the map-in-map case,
> and hence would be unnecessary for all other users.
>
> It is not permitted to use MAP_FREEZE command and mmap for BPF map
> having kptrs, similar to the bpf_timer case.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h     |  29 +++++++-
>  include/linux/btf.h     |   2 +
>  kernel/bpf/btf.c        | 160 ++++++++++++++++++++++++++++++++++------
>  kernel/bpf/map_in_map.c |   5 +-
>  kernel/bpf/syscall.c    | 114 +++++++++++++++++++++++++++-
>  kernel/bpf/verifier.c   | 116 ++++++++++++++++++++++++++++-
>  6 files changed, 399 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index bdb5298735ce..e267db260cb7 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -155,6 +155,22 @@ struct bpf_map_ops {
>         const struct bpf_iter_seq_info *iter_seq_info;
>  };
>
> +enum {
> +       /* Support at most 8 pointers in a BPF map value */
> +       BPF_MAP_VALUE_OFF_MAX = 8,
> +};
nit: should this be a typedef instead of an enum?
> +
> +struct bpf_map_value_off_desc {
> +       u32 offset;
> +       u32 btf_id;
> +       struct btf *btf;
nit: Since bpf_map_value_off_desc is generic and will support
non-kptrs as well, I think embedding "btf_id" and "btf" in a "union {
} kptr;" would make it clearer that only kptrs use these fields.

> +};
> +
> +struct bpf_map_value_off {
> +       u32 nr_off;
> +       struct bpf_map_value_off_desc off[];
> +};
> +
>  struct bpf_map {
>         /* The first two cachelines with read-mostly members of which some
>          * are also accessed in fast-path (e.g. ops, max_entries).
> @@ -171,6 +187,7 @@ struct bpf_map {
>         u64 map_extra; /* any per-map-type extra fields */
>         u32 map_flags;
>         int spin_lock_off; /* >=0 valid offset, <0 error */
> +       struct bpf_map_value_off *kptr_off_tab;
>         int timer_off; /* >=0 valid offset, <0 error */
>         u32 id;
>         int numa_node;
> @@ -184,7 +201,7 @@ struct bpf_map {
>         char name[BPF_OBJ_NAME_LEN];
>         bool bypass_spec_v1;
>         bool frozen; /* write-once; write-protected by freeze_mutex */
> -       /* 14 bytes hole */
> +       /* 6 bytes hole */
>
>         /* The 3rd and 4th cacheline with misc members to avoid false sharing
>          * particularly with refcounting.
> @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
>         return map->timer_off >= 0;
>  }
>
> +static inline bool map_value_has_kptrs(const struct bpf_map *map)
> +{
> +       return !IS_ERR_OR_NULL(map->kptr_off_tab);
> +}
> +
>  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>  {
>         if (unlikely(map_value_has_spin_lock(map)))
> @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
>  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
>  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
>
> +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> +
>  struct bpf_map *bpf_map_get(u32 ufd);
>  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
>  struct bpf_map *__bpf_map_get(struct fd f);
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 36bc09b8e890..19c297f9a52f 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
>                            u32 expected_offset, u32 expected_size);
>  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
>  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> +struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> +                                         const struct btf_type *t);
>  bool btf_type_is_void(const struct btf_type *t);
>  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
>  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index db7bf05adfc5..28b1d9e9124e 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3166,9 +3166,16 @@ static void btf_struct_log(struct btf_verifier_env *env,
>  enum {
>         BTF_FIELD_SPIN_LOCK,
>         BTF_FIELD_TIMER,
> +       BTF_FIELD_KPTR,
> +};
> +
> +enum {
> +       BTF_FIELD_IGNORE = 0,
> +       BTF_FIELD_FOUND  = 1,
>  };
>
>  struct btf_field_info {
> +       u32 type_id;
>         u32 off;
>  };
>
> @@ -3176,23 +3183,50 @@ static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
>                                  u32 off, int sz, struct btf_field_info *info)
>  {
>         if (!__btf_type_is_struct(t))
> -               return 0;
> +               return BTF_FIELD_IGNORE;
>         if (t->size != sz)
> -               return 0;
> -       if (info->off != -ENOENT)
> -               /* only one such field is allowed */
> -               return -E2BIG;
> +               return BTF_FIELD_IGNORE;
>         info->off = off;
> -       return 0;
> +       return BTF_FIELD_FOUND;
> +}
> +
> +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> +                              u32 off, int sz, struct btf_field_info *info)
> +{
> +       u32 res_id;
> +
> +       /* For PTR, sz is always == 8 */
> +       if (!btf_type_is_ptr(t))
> +               return BTF_FIELD_IGNORE;
> +       t = btf_type_by_id(btf, t->type);
> +
> +       if (!btf_type_is_type_tag(t))
> +               return BTF_FIELD_IGNORE;
> +       /* Reject extra tags */
> +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> +               return -EINVAL;
> +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> +               return -EINVAL;
> +
> +       /* Get the base type */
> +       t = btf_type_skip_modifiers(btf, t->type, &res_id);
> +       /* Only pointer to struct is allowed */
> +       if (!__btf_type_is_struct(t))
> +               return -EINVAL;
> +
> +       info->type_id = res_id;
> +       info->off = off;
> +       return BTF_FIELD_FOUND;
>  }
>
>  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
>                                  const char *name, int sz, int align, int field_type,
> -                                struct btf_field_info *info)
> +                                struct btf_field_info *info, int info_cnt)
Ah okay. I should have read this patch first before commenting on the
previous one :) I see now why you are passing in info instead of just
returning the offset.
>  {
>         const struct btf_member *member;
> +       struct btf_field_info tmp;
> +       int ret, idx = 0;
>         u32 i, off;
> -       int ret;
>
>         for_each_member(i, t, member) {
>                 const struct btf_type *member_type = btf_type_by_id(btf,
> @@ -3212,24 +3246,38 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
>                 switch (field_type) {
>                 case BTF_FIELD_SPIN_LOCK:
>                 case BTF_FIELD_TIMER:
> -                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> +                       ret = btf_find_field_struct(btf, member_type, off, sz, idx < info_cnt ?
> +                                                   &info[idx] : &tmp);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
> +               case BTF_FIELD_KPTR:
> +                       ret = btf_find_field_kptr(btf, member_type, off, sz, idx < info_cnt ?
> +                                                 &info[idx] : &tmp);
>                         if (ret < 0)
>                                 return ret;
>                         break;
>                 default:
>                         return -EFAULT;
>                 }
> +
> +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> +                       return -E2BIG;
> +               else if (ret == BTF_FIELD_IGNORE)
> +                       continue;
nit: I think if you check the "ret == BTF_FIELD_IGNORE" first, then
you just need to check idx >= info_cnt instead of "ret ==
BTF_FIELD_FOUND && idx >= info_cnt"
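i.e. something like (untested):

if (ret == BTF_FIELD_IGNORE)
        continue;
if (idx >= info_cnt)
        return -E2BIG;
++idx;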
> +               ++idx;
>         }
> -       return 0;
> +       return idx;
>  }
>
>  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                                 const char *name, int sz, int align, int field_type,
> -                               struct btf_field_info *info)
> +                               struct btf_field_info *info, int info_cnt)
>  {
>         const struct btf_var_secinfo *vsi;
> +       struct btf_field_info tmp;
> +       int ret, idx = 0;
>         u32 i, off;
> -       int ret;
>
>         for_each_vsi(i, t, vsi) {
>                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> @@ -3247,19 +3295,32 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                 switch (field_type) {
>                 case BTF_FIELD_SPIN_LOCK:
>                 case BTF_FIELD_TIMER:
> -                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> +                       ret = btf_find_field_struct(btf, var_type, off, sz, idx < info_cnt ?
> +                                                   &info[idx] : &tmp);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
> +               case BTF_FIELD_KPTR:
> +                       ret = btf_find_field_kptr(btf, var_type, off, sz, idx < info_cnt ?
> +                                                 &info[idx] : &tmp);
>                         if (ret < 0)
>                                 return ret;
>                         break;
>                 default:
>                         return -EFAULT;
>                 }
> +
> +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> +                       return -E2BIG;
> +               if (ret == BTF_FIELD_IGNORE)
> +                       continue;
> +               ++idx;
>         }
> -       return 0;
> +       return idx;
>  }
>
>  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> -                         int field_type, struct btf_field_info *info)
> +                         int field_type, struct btf_field_info *info, int info_cnt)
>  {
>         const char *name;
>         int sz, align;
> @@ -3275,14 +3336,19 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>                 sz = sizeof(struct bpf_timer);
>                 align = __alignof__(struct bpf_timer);
>                 break;
> +       case BTF_FIELD_KPTR:
> +               name = NULL;
I see now why you added the if (name) check in the previous patch.
Maybe that should be part of this patch instead to make it more clear?
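(i.e. the

if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
        continue;

guard - assuming I'm remembering the exact condition correctly.)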
> +               sz = sizeof(u64);
> +               align = 8;
> +               break;
>         default:
>                 return -EFAULT;
>         }
>
>         if (__btf_type_is_struct(t))
> -               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt);
>         else if (btf_type_is_datasec(t))
> -               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt);
>         return -EINVAL;
>  }
>
> @@ -3292,26 +3358,78 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>   */
>  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
>  {
> -       struct btf_field_info info = { .off = -ENOENT };
> +       struct btf_field_info info;
>         int ret;
>
> -       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info, 1);
>         if (ret < 0)
>                 return ret;
> +       if (!ret)
> +               return -ENOENT;
>         return info.off;
>  }
>
>  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
>  {
> -       struct btf_field_info info = { .off = -ENOENT };
> +       struct btf_field_info info;
>         int ret;
>
> -       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info, 1);
>         if (ret < 0)
>                 return ret;
> +       if (!ret)
> +               return -ENOENT;
>         return info.off;
>  }
>
> +struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> +                                         const struct btf_type *t)
> +{
> +       struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
> +       struct bpf_map_value_off *tab;
> +       int ret, i, nr_off;
> +
> +       /* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> +       BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
> +
> +       ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> +       if (ret < 0)
> +               return ERR_PTR(ret);
> +       if (!ret)
> +               return NULL;
> +
> +       nr_off = ret;
> +       tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> +       if (!tab)
> +               return ERR_PTR(-ENOMEM);
> +
> +       tab->nr_off = 0;
tab is kzalloced - I think we can just remove this line
> +       for (i = 0; i < nr_off; i++) {
> +               const struct btf_type *t;
> +               struct btf *off_btf;
> +               s32 id;
> +
> +               t = btf_type_by_id(btf, info_arr[i].type_id);
> +               id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> +                                    &off_btf);
> +               if (id < 0) {
> +                       ret = id;
> +                       goto end;
> +               }
> +
> +               tab->off[i].offset = info_arr[i].off;
> +               tab->off[i].btf_id = id;
> +               tab->off[i].btf = off_btf;
> +               tab->nr_off = i + 1;
> +       }
Instead of incrementing tab->nr_off in every loop iteration, why not just set
"tab->nr_off = nr_off" once after the loop? And then in the error path at the
end: label, we could just do

while (i--)
  btf_put(tab->off[i].btf);

> +       return tab;
> +end:
> +       while (tab->nr_off--)
> +               btf_put(tab->off[tab->nr_off].btf);
> +       kfree(tab);
> +       return ERR_PTR(ret);
> +}
> +
>  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
>                               u32 type_id, void *data, u8 bits_offset,
>                               struct btf_show *show)
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index 5cd8f5277279..135205d0d560 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>         inner_map_meta->max_entries = inner_map->max_entries;
>         inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
>         inner_map_meta->timer_off = inner_map->timer_off;
> +       inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
>         if (inner_map->btf) {
>                 btf_get(inner_map->btf);
>                 inner_map_meta->btf = inner_map->btf;
> @@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>
>  void bpf_map_meta_free(struct bpf_map *map_meta)
>  {
> +       bpf_map_free_kptr_off_tab(map_meta);
>         btf_put(map_meta->btf);
>         kfree(map_meta);
>  }
> @@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
>                 meta0->key_size == meta1->key_size &&
>                 meta0->value_size == meta1->value_size &&
>                 meta0->timer_off == meta1->timer_off &&
> -               meta0->map_flags == meta1->map_flags;
> +               meta0->map_flags == meta1->map_flags &&
> +               bpf_map_equal_kptr_off_tab(meta0, meta1);
>  }
>
>  void *bpf_map_fd_get_ptr(struct bpf_map *map,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index cdaa1152436a..edfe691284b0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -6,6 +6,7 @@
>  #include <linux/bpf_trace.h>
>  #include <linux/bpf_lirc.h>
>  #include <linux/bpf_verifier.h>
> +#include <linux/bsearch.h>
>  #include <linux/btf.h>
>  #include <linux/syscalls.h>
>  #include <linux/slab.h>
> @@ -473,12 +474,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
>  }
>  #endif
>
> +static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> +{
> +       const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
> +
> +       if (off_desc1->offset < off_desc2->offset)
> +               return -1;
> +       else if (off_desc1->offset > off_desc2->offset)
> +               return 1;
> +       return 0;
> +}
> +
> +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> +{
> +       /* Since members are iterated in btf_find_field in increasing order,
> +        * offsets appended to kptr_off_tab are in increasing order, so we can
> +        * do bsearch to find exact match.
> +        */
> +       struct bpf_map_value_off *tab;
> +
> +       if (!map_value_has_kptrs(map))
> +               return NULL;
> +       tab = map->kptr_off_tab;
> +       return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
> +}
> +
> +void bpf_map_free_kptr_off_tab(struct bpf_map *map)
> +{
> +       struct bpf_map_value_off *tab = map->kptr_off_tab;
> +       int i;
> +
> +       if (!map_value_has_kptrs(map))
> +               return;
> +       for (i = 0; i < tab->nr_off; i++) {
> +               struct btf *btf = tab->off[i].btf;
> +
> +               btf_put(btf);
> +       }
> +       kfree(tab);
> +       map->kptr_off_tab = NULL;
> +}
> +
> +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
> +{
> +       struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
> +       int size, i, ret;
> +
> +       if (!map_value_has_kptrs(map))
> +               return ERR_PTR(-ENOENT);
> +       /* Do a deep copy of the kptr_off_tab */
> +       for (i = 0; i < tab->nr_off; i++)
> +               btf_get(tab->off[i].btf);
> +
> +       size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
> +       new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
I think we can get away with not zero-ing out the memory, since we're
going to be memcpying over its contents right after
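i.e. something like (sketch):

new_tab = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!new_tab) {
        ret = -ENOMEM;
        goto end;
}
memcpy(new_tab, tab, size);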
> +       if (!new_tab) {
> +               ret = -ENOMEM;
> +               goto end;
> +       }
> +       memcpy(new_tab, tab, size);
> +       return new_tab;
> +end:
> +       while (i--)
> +               btf_put(tab->off[i].btf);
> +       return ERR_PTR(ret);
> +}
> +
> +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> +{
> +       struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> +       bool a_has_kptr = map_value_has_kptrs(map_a), b_has_kptr = map_value_has_kptrs(map_b);
> +       int size;
> +
> +       if (!a_has_kptr && !b_has_kptr)
> +               return true;
> +       if (a_has_kptr != b_has_kptr)
> +               return false;
> +       if (tab_a->nr_off != tab_b->nr_off)
> +               return false;
> +       size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> +       return !memcmp(tab_a, tab_b, size);
> +}
> +
>  /* called from workqueue */
>  static void bpf_map_free_deferred(struct work_struct *work)
>  {
>         struct bpf_map *map = container_of(work, struct bpf_map, work);
>
>         security_bpf_map_free(map);
> +       bpf_map_free_kptr_off_tab(map);
>         bpf_map_release_memcg(map);
>         /* implementation dependent freeing */
>         map->ops->map_free(map);
> @@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
>         int err;
>
>         if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> -           map_value_has_timer(map))
> +           map_value_has_timer(map) || map_value_has_kptrs(map))
>                 return -ENOTSUPP;
>
>         if (!(vma->vm_flags & VM_SHARED))
> @@ -820,9 +904,33 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
>                         return -EOPNOTSUPP;
>         }
>
> -       if (map->ops->map_check_btf)
> +       map->kptr_off_tab = btf_parse_kptrs(btf, value_type);
Since btf_parse_kptrs can return an ERR_PTR, I think we need to
check here whether map->kptr_off_tab is an ERR_PTR.
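Something like (untested):

map->kptr_off_tab = btf_parse_kptrs(btf, value_type);
if (IS_ERR(map->kptr_off_tab))
        return PTR_ERR(map->kptr_off_tab);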

> +       if (map_value_has_kptrs(map)) {
> +               if (!bpf_capable()) {
> +                       ret = -EPERM;
> +                       goto free_map_tab;
> +               }
> +               if (map->map_flags & BPF_F_RDONLY_PROG) {
Why is it an error if BPF_F_RDONLY_PROG is set? Maybe I'm
misunderstanding what BPF_F_RDONLY_PROG means, but why can't a program
have read-only access to the kptr value?
> +                       ret = -EACCES;
> +                       goto free_map_tab;
> +               }
> +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
Out of curiosity, do you also plan to add kptr support to local
storage maps in the future?
> +                       ret = -EOPNOTSUPP;
> +                       goto free_map_tab;
> +               }
> +       }
> +
> +       if (map->ops->map_check_btf) {
>                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> +               if (ret < 0)
> +                       goto free_map_tab;
> +       }
>
> +       return ret;
> +free_map_tab:
> +       bpf_map_free_kptr_off_tab(map);
>         return ret;
>  }
>
> @@ -1639,7 +1747,7 @@ static int map_freeze(const union bpf_attr *attr)
>                 return PTR_ERR(map);
>
>         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> -           map_value_has_timer(map)) {
> +           map_value_has_timer(map) || map_value_has_kptrs(map)) {
>                 fdput(f);
>                 return -ENOTSUPP;
>         }

Maybe I'm missing something, but I'm not seeing it in this patch - do
we also need to add checks that prohibit userspace programs from
trying to do bpf_map_update_elem syscalls that manipulate kptr map
values?

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 71827d14724a..01d45c5010f9 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3507,6 +3507,83 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
>         return __check_ptr_off_reg(env, reg, regno, false);
>  }
>
> +static int map_kptr_match_type(struct bpf_verifier_env *env,
> +                              struct bpf_map_value_off_desc *off_desc,
> +                              struct bpf_reg_state *reg, u32 regno)
> +{
> +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> +       const char *reg_name = "";
> +
> +       if (base_type(reg->type) != PTR_TO_BTF_ID || type_flag(reg->type) != PTR_MAYBE_NULL)
> +               goto bad_type;
> +
> +       if (!btf_is_kernel(reg->btf)) {
> +               verbose(env, "R%d must point to kernel BTF\n", regno);
> +               return -EINVAL;
> +       }
> +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> +
> +       if (__check_ptr_off_reg(env, reg, regno, true))
> +               return -EACCES;
> +
> +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> +                                 off_desc->btf, off_desc->btf_id))
> +               goto bad_type;
> +       return 0;
> +bad_type:
> +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> +               reg_type_str(env, reg->type), reg_name);
> +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> +       return -EINVAL;
> +}
> +
> +static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> +                                int value_regno, int insn_idx,
> +                                struct bpf_map_value_off_desc *off_desc)
> +{
> +       struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
> +       int class = BPF_CLASS(insn->code);
> +       struct bpf_reg_state *val_reg;
> +
> +       /* Things we already checked for in check_map_access and caller:
> +        *  - Reject cases where variable offset may touch kptr
> +        *  - size of access (must be BPF_DW)
> +        *  - tnum_is_const(reg->var_off)
> +        *  - off_desc->offset == off + reg->var_off.value
> +        */
> +       /* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
> +       if (BPF_MODE(insn->code) != BPF_MEM)
> +               goto end;
I think this needs its own verbose statement - the one at the end:
label doesn't seem to match this error
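e.g. something like (sketch, message wording is just a placeholder):

if (BPF_MODE(insn->code) != BPF_MEM) {
        verbose(env, "kptr in map can only be accessed using BPF_MEM mode\n");
        return -EACCES;
}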
> +
> +       if (class == BPF_LDX) {
> +               val_reg = reg_state(env, value_regno);
> +               /* We can simply mark the value_regno receiving the pointer
> +                * value from map as PTR_TO_BTF_ID, with the correct type.
> +                */
> +               mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
> +                               off_desc->btf_id, PTR_MAYBE_NULL);
> +               val_reg->id = ++env->id_gen;
> +       } else if (class == BPF_STX) {
> +               val_reg = reg_state(env, value_regno);
> +               if (!register_is_null(val_reg) &&
> +                   map_kptr_match_type(env, off_desc, val_reg, value_regno))
> +                       return -EACCES;
> +       } else if (class == BPF_ST) {
> +               if (insn->imm) {
> +                       verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> +                               off_desc->offset);
> +                       return -EACCES;
> +               }
> +       } else {
> +               goto end;
> +       }
> +       return 0;
> +end:
> +       verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
> +       return -EACCES;
> +}
> +
>  /* check read/write into a map element with possible variable offset */
>  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>                             int off, int size, bool zero_size_allowed)
> @@ -3545,6 +3622,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>                         return -EACCES;
>                 }
>         }
> +       if (map_value_has_kptrs(map)) {
> +               struct bpf_map_value_off *tab = map->kptr_off_tab;
> +               int i;
> +
> +               for (i = 0; i < tab->nr_off; i++) {
> +                       u32 p = tab->off[i].offset;
> +
> +                       if (reg->smin_value + off < p + sizeof(u64) &&
> +                           p < reg->umax_value + off + size) {
> +                               if (!tnum_is_const(reg->var_off)) {
> +                                       verbose(env, "kptr access cannot have variable offset\n");
> +                                       return -EACCES;
> +                               }
> +                               if (p != off + reg->var_off.value) {
> +                                       verbose(env, "kptr access misaligned expected=%u off=%llu\n",
> +                                               p, off + reg->var_off.value);
> +                                       return -EACCES;
> +                               }
> +                               if (size != bpf_size_to_bytes(BPF_DW)) {
> +                                       verbose(env, "kptr access size must be BPF_DW\n");
> +                                       return -EACCES;
> +                               }
> +                               break;
> +                       }
> +               }
> +       }
>         return err;
>  }
>
> @@ -4412,6 +4515,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>                 if (value_regno >= 0)
>                         mark_reg_unknown(env, regs, value_regno);
>         } else if (reg->type == PTR_TO_MAP_VALUE) {
> +               struct bpf_map_value_off_desc *off_desc = NULL;
> +
>                 if (t == BPF_WRITE && value_regno >= 0 &&
>                     is_pointer_value(env, value_regno)) {
>                         verbose(env, "R%d leaks addr into map\n", value_regno);
> @@ -4421,7 +4526,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>                 if (err)
>                         return err;
>                 err = check_map_access(env, regno, off, size, false);
> -               if (!err && t == BPF_READ && value_regno >= 0) {
> +               if (err)
> +                       return err;
> +               if (tnum_is_const(reg->var_off))
> +                       off_desc = bpf_map_kptr_off_contains(reg->map_ptr,
> +                                                            off + reg->var_off.value);
> +               if (off_desc) {
I think this logic would be a little clearer if you renamed off_desc
to kptr_off_desc to denote that this only applies to kptrs.
> +                       err = check_map_kptr_access(env, regno, value_regno, insn_idx, off_desc);
> +                       if (err)
> +                               return err;
I don't think you need this if check - it'll return err by default at
the end of the function.
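i.e. roughly (untested sketch):

if (tnum_is_const(reg->var_off))
        kptr_off_desc = bpf_map_kptr_off_contains(reg->map_ptr,
                                                  off + reg->var_off.value);
if (kptr_off_desc) {
        err = check_map_kptr_access(env, regno, value_regno,
                                    insn_idx, kptr_off_desc);
} else if (t == BPF_READ && value_regno >= 0) {
        /* existing read handling */
}

with err returned at the end of the function as before.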
> +               } else if (t == BPF_READ && value_regno >= 0) {
>                         struct bpf_map *map = reg->map_ptr;
>
>                         /* if map is read-only, track its contents as scalars */
> --
> 2.35.1
>


* Re: [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto
  2022-04-09  9:32 ` [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto Kumar Kartikeya Dwivedi
@ 2022-04-12 18:16   ` Joanne Koong
  2022-04-12 20:11     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2022-04-12 18:16 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Apr 10, 2022 at 11:58 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> Add a new type flag for bpf_arg_type that when set tells verifier that
> for a release function, that argument's register will be the one for
> which meta.ref_obj_id will be set, and which will then be released
> using release_reference. To capture the regno, introduce a new field
> release_regno in bpf_call_arg_meta.
>
> This would be required in the next patch, where we may either pass NULL
> or a refcounted pointer as an argument to the release function
> bpf_kptr_xchg. Just releasing only when meta.ref_obj_id is set is not
> enough, as there is a case where the type of argument needed matches,
> but the ref_obj_id is set to 0. Hence, we must enforce that whenever
> meta.ref_obj_id is zero, the register that is to be released can only
> be NULL for a release function.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h   |  5 ++++-
>  kernel/bpf/ringbuf.c  |  4 ++--
>  kernel/bpf/verifier.c | 46 ++++++++++++++++++++++++++++++++++++-------
>  net/core/filter.c     |  2 +-
>  4 files changed, 46 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e267db260cb7..a6d1982e8118 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -364,7 +364,10 @@ enum bpf_type_flag {
>          */
>         MEM_PERCPU              = BIT(4 + BPF_BASE_TYPE_BITS),
>
> -       __BPF_TYPE_LAST_FLAG    = MEM_PERCPU,
> +       /* Indicates that the pointer argument will be released. */
> +       PTR_RELEASE             = BIT(5 + BPF_BASE_TYPE_BITS),
> +
> +       __BPF_TYPE_LAST_FLAG    = PTR_RELEASE,
>  };
>
>  /* Max number of base types. */
> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> index 710ba9de12ce..a22c21c0a7ef 100644
> --- a/kernel/bpf/ringbuf.c
> +++ b/kernel/bpf/ringbuf.c
> @@ -404,7 +404,7 @@ BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags)
>  const struct bpf_func_proto bpf_ringbuf_submit_proto = {
>         .func           = bpf_ringbuf_submit,
>         .ret_type       = RET_VOID,
> -       .arg1_type      = ARG_PTR_TO_ALLOC_MEM,
> +       .arg1_type      = ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
>         .arg2_type      = ARG_ANYTHING,
>  };
>
> @@ -417,7 +417,7 @@ BPF_CALL_2(bpf_ringbuf_discard, void *, sample, u64, flags)
>  const struct bpf_func_proto bpf_ringbuf_discard_proto = {
>         .func           = bpf_ringbuf_discard,
>         .ret_type       = RET_VOID,
> -       .arg1_type      = ARG_PTR_TO_ALLOC_MEM,
> +       .arg1_type      = ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
>         .arg2_type      = ARG_ANYTHING,
>  };
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 01d45c5010f9..6cc08526e049 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -245,6 +245,7 @@ struct bpf_call_arg_meta {
>         struct bpf_map *map_ptr;
>         bool raw_mode;
>         bool pkt_access;
> +       u8 release_regno;
>         int regno;
>         int access_size;
>         int mem_size;
> @@ -5300,6 +5301,11 @@ static bool arg_type_is_int_ptr(enum bpf_arg_type type)
>                type == ARG_PTR_TO_LONG;
>  }
>
> +static bool arg_type_is_release_ptr(enum bpf_arg_type type)
> +{
> +       return type & PTR_RELEASE;
> +}
> +
Now that we have PTR_RELEASE as a bpf arg type descriptor, why do we
still need is_release_function() in the verifier? I think we should
just remove it altogether - it isn't functionally necessary now that
we have PTR_RELEASE, and it's not great that it hardcodes specific
functions into the verifier. What are your thoughts?
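With meta.release_regno from this patch, the release path in
check_helper_call() could perhaps become something like (rough sketch):

if (meta.release_regno) {
        err = -EINVAL;
        if (meta.ref_obj_id)
                err = release_reference(env, meta.ref_obj_id);
        else if (register_is_null(&regs[meta.release_regno]))
                err = 0;
        if (err) {
                verbose(env, "func %s#%d reference has not been acquired before\n",
                        func_id_name(func_id), func_id);
                return err;
        }
}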

>  static int int_ptr_type_to_size(enum bpf_arg_type type)
>  {
>         if (type == ARG_PTR_TO_INT)
> @@ -5532,7 +5538,7 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
>                 /* Some of the argument types nevertheless require a
>                  * zero register offset.
>                  */
> -               if (arg_type != ARG_PTR_TO_ALLOC_MEM)
> +               if (base_type(arg_type) != ARG_PTR_TO_ALLOC_MEM)
>                         return 0;
>                 break;
>         /* All the rest must be rejected, except PTR_TO_BTF_ID which allows

Later on in this check_func_arg_reg_off() function, I think we can get
rid of the hacky workaround for the PTR_TO_BTF_ID case where it relies
on whether the function is a release function and reg->ref_obj_id is
set, to determine whether the argument is a release arg or not. The
arg type is passed directly to check_func_arg_reg_off(), so I think we
could just use arg_type_is_release_ptr(arg_type) instead, which will
also be more robust when/if we support having multiple release args in
the future.
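i.e. in the PTR_TO_BTF_ID case, something like (sketch, exact message is
a placeholder):

if (arg_type_is_release_ptr(arg_type) && reg->off) {
        verbose(env, "R%d must have zero offset when passed to release func\n",
                regno);
        return -EINVAL;
}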

> @@ -6124,12 +6130,31 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
>         return true;
>  }
>
> -static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
> +static bool check_release_regno(const struct bpf_func_proto *fn, int func_id,
> +                               struct bpf_call_arg_meta *meta)
> +{
> +       int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> +               if (arg_type_is_release_ptr(fn->arg_type[i])) {
> +                       if (!is_release_function(func_id))
> +                               return false;
> +                       if (meta->release_regno)
> +                               return false;
> +                       meta->release_regno = i + 1;
> +               }
> +       }
> +       return !is_release_function(func_id) || meta->release_regno;
> +}
Is this check needed? There's already a check in check_func_arg that
there can't be two arg registers with ref_obj_ids set. I think this
already checks against the case where the user tries to pass in two
release registers as arguments.
> +
> +static int check_func_proto(const struct bpf_func_proto *fn, int func_id,
> +                           struct bpf_call_arg_meta *meta)
>  {
>         return check_raw_mode_ok(fn) &&
>                check_arg_pair_ok(fn) &&
>                check_btf_id_ok(fn) &&
> -              check_refcount_ok(fn, func_id) ? 0 : -EINVAL;
> +              check_refcount_ok(fn, func_id) &&
> +              check_release_regno(fn, func_id, meta) ? 0 : -EINVAL;
>  }
>
>  /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
> @@ -6808,7 +6833,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>         memset(&meta, 0, sizeof(meta));
>         meta.pkt_access = fn->pkt_access;
>
> -       err = check_func_proto(fn, func_id);
> +       err = check_func_proto(fn, func_id, &meta);
>         if (err) {
>                 verbose(env, "kernel subsystem misconfigured func %s#%d\n",
>                         func_id_name(func_id), func_id);
> @@ -6841,8 +6866,17 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>                         return err;
>         }
>
> +       regs = cur_regs(env);
> +
>         if (is_release_function(func_id)) {
> -               err = release_reference(env, meta.ref_obj_id);
> +               err = -EINVAL;
> +               if (meta.ref_obj_id)
> +                       err = release_reference(env, meta.ref_obj_id);
> +               /* meta.ref_obj_id can only be 0 if register that is meant to be
> +                * released is NULL, which must be > R0.
> +                */
> +               else if (meta.release_regno && register_is_null(&regs[meta.release_regno]))
> +                       err = 0;
>                 if (err) {
>                         verbose(env, "func %s#%d reference has not been acquired before\n",
>                                 func_id_name(func_id), func_id);
> @@ -6850,8 +6884,6 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>                 }
>         }
>
> -       regs = cur_regs(env);
> -
>         switch (func_id) {
>         case BPF_FUNC_tail_call:
>                 err = check_reference_leak(env);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 143f442a9505..8eb01a997476 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6621,7 +6621,7 @@ static const struct bpf_func_proto bpf_sk_release_proto = {
>         .func           = bpf_sk_release,
>         .gpl_only       = false,
>         .ret_type       = RET_INTEGER,
> -       .arg1_type      = ARG_PTR_TO_BTF_ID_SOCK_COMMON,
> +       .arg1_type      = ARG_PTR_TO_BTF_ID_SOCK_COMMON | PTR_RELEASE,
>  };
>
>  BPF_CALL_5(bpf_xdp_sk_lookup_udp, struct xdp_buff *, ctx,
> --
> 2.35.1
>


* Re: [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map
  2022-04-12  0:32   ` Joanne Koong
@ 2022-04-12 19:16     ` Kumar Kartikeya Dwivedi
  2022-04-12 23:56       ` Joanne Koong
  0 siblings, 1 reply; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-12 19:16 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Tue, Apr 12, 2022 at 06:02:11AM IST, Joanne Koong wrote:
> On Sat, Apr 9, 2022 at 6:18 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > This commit introduces a new pointer type 'kptr' which can be embedded
> > in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> > its invocation. When storing to such a kptr, the BPF program's
> > PTR_TO_BTF_ID register must have the same type as in the map value's
> > BTF, and loading
> > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > kernel BTF and BTF ID.
> >
> > Such kptrs are unreferenced, i.e. by the time another invocation of the
> > BPF program loads this pointer, the object which the pointer points to
> > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > patched to PROBE_MEM loads by the verifier, it would be safe to allow
> > the user to still access such an invalid pointer, but passing such
> > pointers into BPF helpers and kfuncs should not be permitted. A future
> > patch in this series will close this gap.
> >
> > The flexibility offered by allowing programs to dereference such invalid
> > pointers while being safe at runtime frees the verifier from doing
> > complex lifetime tracking. As long as the user can ensure that the
> > object remains valid, the data the program reads from the kernel
> > object is valid.
> >
> > The user indicates that a certain pointer must be treated as kptr
> > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > information is recorded in the object BTF which will be passed into the
> > kernel by way of map's BTF information. The name and kind from the map
> > value BTF is used to look up the in-kernel type, and the actual BTF and
> > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > now, only storing pointers to structs is permitted.
> >
> > An example of this specification is shown below:
> >
> >         #define __kptr __attribute__((btf_type_tag("kptr")))
> >
> >         struct map_value {
> >                 ...
> >                 struct task_struct __kptr *task;
> >                 ...
> >         };
> >
> > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > task_struct into the map, and then load it later.
> >
> > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > the verifier cannot statically know whether the value is NULL; it must
> > treat all potential loads at that map value offset as loading a
> > possibly NULL pointer.
> >
> > Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL)
> > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > same BTF type as specified in the map BTF. The access size must always
> > be BPF_DW.
> >
> > For the map in map support, the kptr_off_tab for outer map is copied
> > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > instead of introducing a refcount to kptr_off_tab, because the copy only
> > needs to be done when parameterizing using inner_map_fd in the map in map
> > case, hence would be unnecessary for all other users.
> >
> > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > having kptrs, similar to the bpf_timer case.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h     |  29 +++++++-
> >  include/linux/btf.h     |   2 +
> >  kernel/bpf/btf.c        | 160 ++++++++++++++++++++++++++++++++++------
> >  kernel/bpf/map_in_map.c |   5 +-
> >  kernel/bpf/syscall.c    | 114 +++++++++++++++++++++++++++-
> >  kernel/bpf/verifier.c   | 116 ++++++++++++++++++++++++++++-
> >  6 files changed, 399 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index bdb5298735ce..e267db260cb7 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -155,6 +155,22 @@ struct bpf_map_ops {
> >         const struct bpf_iter_seq_info *iter_seq_info;
> >  };
> >
> > +enum {
> > +       /* Support at most 8 pointers in a BPF map value */
> > +       BPF_MAP_VALUE_OFF_MAX = 8,
> > +};
> nit: should this be a typedef instead of an enum?

typedef? Do you mean #define? I prefer enum constants as they get emitted to
BTF.
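For example:

enum {
        BPF_MAP_VALUE_OFF_MAX = 8,
};

ends up in vmlinux BTF as a BTF_KIND_ENUM member, while a plain #define
leaves no trace in BTF.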

> > +
> > +struct bpf_map_value_off_desc {
> > +       u32 offset;
> > +       u32 btf_id;
> > +       struct btf *btf;
> nit: Since bpf_map_value_off_desc is generic and will support
> non-kptrs as well, I think embedding "btf_id" and "btf" in a "union {
> } kptr;" would make it clearer that only kptrs will use these
> fields
>

Ok, will do.

> > +};
> > +
> > +struct bpf_map_value_off {
> > +       u32 nr_off;
> > +       struct bpf_map_value_off_desc off[];
> > +};
> > +
> >  struct bpf_map {
> >         /* The first two cachelines with read-mostly members of which some
> >          * are also accessed in fast-path (e.g. ops, max_entries).
> > @@ -171,6 +187,7 @@ struct bpf_map {
> >         u64 map_extra; /* any per-map-type extra fields */
> >         u32 map_flags;
> >         int spin_lock_off; /* >=0 valid offset, <0 error */
> > +       struct bpf_map_value_off *kptr_off_tab;
> >         int timer_off; /* >=0 valid offset, <0 error */
> >         u32 id;
> >         int numa_node;
> > @@ -184,7 +201,7 @@ struct bpf_map {
> >         char name[BPF_OBJ_NAME_LEN];
> >         bool bypass_spec_v1;
> >         bool frozen; /* write-once; write-protected by freeze_mutex */
> > -       /* 14 bytes hole */
> > +       /* 6 bytes hole */
> >
> >         /* The 3rd and 4th cacheline with misc members to avoid false sharing
> >          * particularly with refcounting.
> > @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
> >         return map->timer_off >= 0;
> >  }
> >
> > +static inline bool map_value_has_kptrs(const struct bpf_map *map)
> > +{
> > +       return !IS_ERR_OR_NULL(map->kptr_off_tab);
> > +}
> > +
> >  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> >  {
> >         if (unlikely(map_value_has_spin_lock(map)))
> > @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
> >  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
> >  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
> >
> > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> > +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > +
> >  struct bpf_map *bpf_map_get(u32 ufd);
> >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> >  struct bpf_map *__bpf_map_get(struct fd f);
> > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > index 36bc09b8e890..19c297f9a52f 100644
> > --- a/include/linux/btf.h
> > +++ b/include/linux/btf.h
> > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> >                            u32 expected_offset, u32 expected_size);
> >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > +struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > +                                         const struct btf_type *t);
> >  bool btf_type_is_void(const struct btf_type *t);
> >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index db7bf05adfc5..28b1d9e9124e 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -3166,9 +3166,16 @@ static void btf_struct_log(struct btf_verifier_env *env,
> >  enum {
> >         BTF_FIELD_SPIN_LOCK,
> >         BTF_FIELD_TIMER,
> > +       BTF_FIELD_KPTR,
> > +};
> > +
> > +enum {
> > +       BTF_FIELD_IGNORE = 0,
> > +       BTF_FIELD_FOUND  = 1,
> >  };
> >
> >  struct btf_field_info {
> > +       u32 type_id;
> >         u32 off;
> >  };
> >
> > @@ -3176,23 +3183,50 @@ static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
> >                                  u32 off, int sz, struct btf_field_info *info)
> >  {
> >         if (!__btf_type_is_struct(t))
> > -               return 0;
> > +               return BTF_FIELD_IGNORE;
> >         if (t->size != sz)
> > -               return 0;
> > -       if (info->off != -ENOENT)
> > -               /* only one such field is allowed */
> > -               return -E2BIG;
> > +               return BTF_FIELD_IGNORE;
> >         info->off = off;
> > -       return 0;
> > +       return BTF_FIELD_FOUND;
> > +}
> > +
> > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > +                              u32 off, int sz, struct btf_field_info *info)
> > +{
> > +       u32 res_id;
> > +
> > +       /* For PTR, sz is always == 8 */
> > +       if (!btf_type_is_ptr(t))
> > +               return BTF_FIELD_IGNORE;
> > +       t = btf_type_by_id(btf, t->type);
> > +
> > +       if (!btf_type_is_type_tag(t))
> > +               return BTF_FIELD_IGNORE;
> > +       /* Reject extra tags */
> > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > +               return -EINVAL;
> > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > +               return -EINVAL;
> > +
> > +       /* Get the base type */
> > +       t = btf_type_skip_modifiers(btf, t->type, &res_id);
> > +       /* Only pointer to struct is allowed */
> > +       if (!__btf_type_is_struct(t))
> > +               return -EINVAL;
> > +
> > +       info->type_id = res_id;
> > +       info->off = off;
> > +       return BTF_FIELD_FOUND;
> >  }
> >
> >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> >                                  const char *name, int sz, int align, int field_type,
> > -                                struct btf_field_info *info)
> > +                                struct btf_field_info *info, int info_cnt)
> Ah okay. I should have read this patch first before commenting on the
> previous one :) I see now why you are passing in info instead of just
> returning the offset.
> >  {
> >         const struct btf_member *member;
> > +       struct btf_field_info tmp;
> > +       int ret, idx = 0;
> >         u32 i, off;
> > -       int ret;
> >
> >         for_each_member(i, t, member) {
> >                 const struct btf_type *member_type = btf_type_by_id(btf,
> > @@ -3212,24 +3246,38 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
> >                 switch (field_type) {
> >                 case BTF_FIELD_SPIN_LOCK:
> >                 case BTF_FIELD_TIMER:
> > -                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> > +                       ret = btf_find_field_struct(btf, member_type, off, sz, idx < info_cnt ?
> > +                                                   &info[idx] : &tmp);
> > +                       if (ret < 0)
> > +                               return ret;
> > +                       break;
> > +               case BTF_FIELD_KPTR:
> > +                       ret = btf_find_field_kptr(btf, member_type, off, sz, idx < info_cnt ?
> > +                                                 &info[idx] : &tmp);
> >                         if (ret < 0)
> >                                 return ret;
> >                         break;
> >                 default:
> >                         return -EFAULT;
> >                 }
> > +
> > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> > +                       return -E2BIG;
> > +               else if (ret == BTF_FIELD_IGNORE)
> > +                       continue;
> nit: I think if you check the "ret == BTF_FIELD_IGNORE" first, then
> you just need to check idx >= info_cnt instead of "ret ==
> BTF_FIELD_FOUND && idx >= info_cnt"

Ok, I'll switch the order.

> > +               ++idx;
> >         }
> > -       return 0;
> > +       return idx;
> >  }
> >
> >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> >                                 const char *name, int sz, int align, int field_type,
> > -                               struct btf_field_info *info)
> > +                               struct btf_field_info *info, int info_cnt)
> >  {
> >         const struct btf_var_secinfo *vsi;
> > +       struct btf_field_info tmp;
> > +       int ret, idx = 0;
> >         u32 i, off;
> > -       int ret;
> >
> >         for_each_vsi(i, t, vsi) {
> >                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> > @@ -3247,19 +3295,32 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> >                 switch (field_type) {
> >                 case BTF_FIELD_SPIN_LOCK:
> >                 case BTF_FIELD_TIMER:
> > -                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > +                       ret = btf_find_field_struct(btf, var_type, off, sz, idx < info_cnt ?
> > +                                                   &info[idx] : &tmp);
> > +                       if (ret < 0)
> > +                               return ret;
> > +                       break;
> > +               case BTF_FIELD_KPTR:
> > +                       ret = btf_find_field_kptr(btf, var_type, off, sz, idx < info_cnt ?
> > +                                                 &info[idx] : &tmp);
> >                         if (ret < 0)
> >                                 return ret;
> >                         break;
> >                 default:
> >                         return -EFAULT;
> >                 }
> > +
> > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> > +                       return -E2BIG;
> > +               if (ret == BTF_FIELD_IGNORE)
> > +                       continue;
> > +               ++idx;
> >         }
> > -       return 0;
> > +       return idx;
> >  }
> >
> >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > -                         int field_type, struct btf_field_info *info)
> > +                         int field_type, struct btf_field_info *info, int info_cnt)
> >  {
> >         const char *name;
> >         int sz, align;
> > @@ -3275,14 +3336,19 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >                 sz = sizeof(struct bpf_timer);
> >                 align = __alignof__(struct bpf_timer);
> >                 break;
> > +       case BTF_FIELD_KPTR:
> > +               name = NULL;
> I see now why you added the if (name) check in the previous patch.
> Maybe that should be part of this patch instead to make it more clear?

Yes, I'll move it to this patch.

> > +               sz = sizeof(u64);
> > +               align = 8;
> > +               break;
> >         default:
> >                 return -EFAULT;
> >         }
> >
> >         if (__btf_type_is_struct(t))
> > -               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> > +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt);
> >         else if (btf_type_is_datasec(t))
> > -               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> > +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt);
> >         return -EINVAL;
> >  }
> >
> > @@ -3292,26 +3358,78 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >   */
> >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> >  {
> > -       struct btf_field_info info = { .off = -ENOENT };
> > +       struct btf_field_info info;
> >         int ret;
> >
> > -       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> > +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info, 1);
> >         if (ret < 0)
> >                 return ret;
> > +       if (!ret)
> > +               return -ENOENT;
> >         return info.off;
> >  }
> >
> >  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> >  {
> > -       struct btf_field_info info = { .off = -ENOENT };
> > +       struct btf_field_info info;
> >         int ret;
> >
> > -       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> > +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info, 1);
> >         if (ret < 0)
> >                 return ret;
> > +       if (!ret)
> > +               return -ENOENT;
> >         return info.off;
> >  }
> >
> > +struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > +                                         const struct btf_type *t)
> > +{
> > +       struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
> > +       struct bpf_map_value_off *tab;
> > +       int ret, i, nr_off;
> > +
> > +       /* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> > +       BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
> > +
> > +       ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> > +       if (ret < 0)
> > +               return ERR_PTR(ret);
> > +       if (!ret)
> > +               return NULL;
> > +
> > +       nr_off = ret;
> > +       tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> > +       if (!tab)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       tab->nr_off = 0;
> tab is kzalloced - I think we can just remove this line

Right, will drop this.

> > +       for (i = 0; i < nr_off; i++) {
> > +               const struct btf_type *t;
> > +               struct btf *off_btf;
> > +               s32 id;
> > +
> > +               t = btf_type_by_id(btf, info_arr[i].type_id);
> > +               id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> > +                                    &off_btf);
> > +               if (id < 0) {
> > +                       ret = id;
> > +                       goto end;
> > +               }
> > +
> > +               tab->off[i].offset = info_arr[i].off;
> > +               tab->off[i].btf_id = id;
> > +               tab->off[i].btf = off_btf;
> > +               tab->nr_off = i + 1;
> > +       }
> Instead of incrementing tab->nr_off in every loop iteration, why not just set
> "tab->nr_off = nr_off" once after the loop? And then in the error path at the
> end: label, we could just do
>
> while (i--)
>   btf_put(tab->off[i].btf);
>

That works too, will change.

> > +       return tab;
> > +end:
> > +       while (tab->nr_off--)
> > +               btf_put(tab->off[tab->nr_off].btf);
> > +       kfree(tab);
> > +       return ERR_PTR(ret);
> > +}
> > +
> >  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> >                               u32 type_id, void *data, u8 bits_offset,
> >                               struct btf_show *show)
> > diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> > index 5cd8f5277279..135205d0d560 100644
> > --- a/kernel/bpf/map_in_map.c
> > +++ b/kernel/bpf/map_in_map.c
> > @@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> >         inner_map_meta->max_entries = inner_map->max_entries;
> >         inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
> >         inner_map_meta->timer_off = inner_map->timer_off;
> > +       inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
> >         if (inner_map->btf) {
> >                 btf_get(inner_map->btf);
> >                 inner_map_meta->btf = inner_map->btf;
> > @@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> >
> >  void bpf_map_meta_free(struct bpf_map *map_meta)
> >  {
> > +       bpf_map_free_kptr_off_tab(map_meta);
> >         btf_put(map_meta->btf);
> >         kfree(map_meta);
> >  }
> > @@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
> >                 meta0->key_size == meta1->key_size &&
> >                 meta0->value_size == meta1->value_size &&
> >                 meta0->timer_off == meta1->timer_off &&
> > -               meta0->map_flags == meta1->map_flags;
> > +               meta0->map_flags == meta1->map_flags &&
> > +               bpf_map_equal_kptr_off_tab(meta0, meta1);
> >  }
> >
> >  void *bpf_map_fd_get_ptr(struct bpf_map *map,
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index cdaa1152436a..edfe691284b0 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/bpf_trace.h>
> >  #include <linux/bpf_lirc.h>
> >  #include <linux/bpf_verifier.h>
> > +#include <linux/bsearch.h>
> >  #include <linux/btf.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/slab.h>
> > @@ -473,12 +474,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
> >  }
> >  #endif
> >
> > +static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> > +{
> > +       const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
> > +
> > +       if (off_desc1->offset < off_desc2->offset)
> > +               return -1;
> > +       else if (off_desc1->offset > off_desc2->offset)
> > +               return 1;
> > +       return 0;
> > +}
> > +
> > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> > +{
> > +       /* Since members are iterated in btf_find_field in increasing order,
> > +        * offsets appended to kptr_off_tab are in increasing order, so we can
> > +        * do bsearch to find exact match.
> > +        */
> > +       struct bpf_map_value_off *tab;
> > +
> > +       if (!map_value_has_kptrs(map))
> > +               return NULL;
> > +       tab = map->kptr_off_tab;
> > +       return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
> > +}
> > +
> > +void bpf_map_free_kptr_off_tab(struct bpf_map *map)
> > +{
> > +       struct bpf_map_value_off *tab = map->kptr_off_tab;
> > +       int i;
> > +
> > +       if (!map_value_has_kptrs(map))
> > +               return;
> > +       for (i = 0; i < tab->nr_off; i++) {
> > +               struct btf *btf = tab->off[i].btf;
> > +
> > +               btf_put(btf);
> > +       }
> > +       kfree(tab);
> > +       map->kptr_off_tab = NULL;
> > +}
> > +
> > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
> > +{
> > +       struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
> > +       int size, i, ret;
> > +
> > +       if (!map_value_has_kptrs(map))
> > +               return ERR_PTR(-ENOENT);
> > +       /* Do a deep copy of the kptr_off_tab */
> > +       for (i = 0; i < tab->nr_off; i++)
> > +               btf_get(tab->off[i].btf);
> > +
> > +       size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
> > +       new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> I think we can get away with not zero-ing out the memory, since we're
> going to be memcpying over its contents right after

Right, kmalloc should be fine.

> > +       if (!new_tab) {
> > +               ret = -ENOMEM;
> > +               goto end;
> > +       }
> > +       memcpy(new_tab, tab, size);
> > +       return new_tab;
> > +end:
> > +       while (i--)
> > +               btf_put(tab->off[i].btf);
> > +       return ERR_PTR(ret);
> > +}
> > +
> > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> > +{
> > +       struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> > +       bool a_has_kptr = map_value_has_kptrs(map_a), b_has_kptr = map_value_has_kptrs(map_b);
> > +       int size;
> > +
> > +       if (!a_has_kptr && !b_has_kptr)
> > +               return true;
> > +       if (a_has_kptr != b_has_kptr)
> > +               return false;
> > +       if (tab_a->nr_off != tab_b->nr_off)
> > +               return false;
> > +       size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> > +       return !memcmp(tab_a, tab_b, size);
> > +}
> > +
> >  /* called from workqueue */
> >  static void bpf_map_free_deferred(struct work_struct *work)
> >  {
> >         struct bpf_map *map = container_of(work, struct bpf_map, work);
> >
> >         security_bpf_map_free(map);
> > +       bpf_map_free_kptr_off_tab(map);
> >         bpf_map_release_memcg(map);
> >         /* implementation dependent freeing */
> >         map->ops->map_free(map);
> > @@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >         int err;
> >
> >         if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> > -           map_value_has_timer(map))
> > +           map_value_has_timer(map) || map_value_has_kptrs(map))
> >                 return -ENOTSUPP;
> >
> >         if (!(vma->vm_flags & VM_SHARED))
> > @@ -820,9 +904,33 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> >                         return -EOPNOTSUPP;
> >         }
> >
> > -       if (map->ops->map_check_btf)
> > +       map->kptr_off_tab = btf_parse_kptrs(btf, value_type);
> Since btf_parse_kptrs can return back ERR_PTR, I think we need to
> check here whether map->kptr_offtab is an ERR_PTR.
>

This is already checked by map_value_has_kptrs (which is
!IS_ERR_OR_NULL(map->kptr_off_tab)). We store the ERR_PTR so that
process_kptr_func (added later in the referenced kptr patch) can give the user
an error message that distinguishes the failure reasons.
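
i.e. the check used everywhere is just (sketch, matching the description
above):

	static inline bool map_value_has_kptrs(const struct bpf_map *map)
	{
		return !IS_ERR_OR_NULL(map->kptr_off_tab);
	}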

> > +       if (map_value_has_kptrs(map)) {
> > +               if (!bpf_capable()) {
> > +                       ret = -EPERM;
> > +                       goto free_map_tab;
> > +               }
> > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> Why is it an error if BPF_F_RDONLY_PROG is set? Maybe I'm
> misunderstanding what BPF_F_RDONLY_PROG means, but why can't a program
> have read-only access to the kptr value?

It would be useless: a kptr can only be set from inside a BPF program, so with
BPF_F_RDONLY_PROG it would always remain NULL.
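
i.e. a map like this sketch (the __kptr macro wrapping the BTF type tag is an
assumption of how a program would declare it):

	#define __kptr __attribute__((btf_type_tag("kptr")))

	struct map_value {
		struct task_struct __kptr *task; /* only BPF progs can set this */
	};

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 1);
		__uint(map_flags, BPF_F_RDONLY_PROG); /* progs cannot write */
		__type(key, int);
		__type(value, struct map_value);
	} m SEC(".maps");

...could only ever hold a NULL kptr, hence rejecting it with -EACCES.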

> > +                       ret = -EACCES;
> > +                       goto free_map_tab;
> > +               }
> > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> Out of curiosity, do you also plan to add kptr support in the future
> to local storage maps as well?

Yes, those and percpu maps are on the TODO list.

> > +                       ret = -EOPNOTSUPP;
> > +                       goto free_map_tab;
> > +               }
> > +       }
> > +
> > +       if (map->ops->map_check_btf) {
> >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > +               if (ret < 0)
> > +                       goto free_map_tab;
> > +       }
> >
> > +       return ret;
> > +free_map_tab:
> > +       bpf_map_free_kptr_off_tab(map);
> >         return ret;
> >  }
> >
> > @@ -1639,7 +1747,7 @@ static int map_freeze(const union bpf_attr *attr)
> >                 return PTR_ERR(map);
> >
> >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > -           map_value_has_timer(map)) {
> > +           map_value_has_timer(map) || map_value_has_kptrs(map)) {
> >                 fdput(f);
> >                 return -ENOTSUPP;
> >         }
>
> Maybe I'm missing something, but I'm not seeing it in this patch - do
> we also need to add checks that prohibit userspace programs from
> trying to do bpf_map_update_elem syscalls that manipulate kptr map
> values?

Userspace should be allowed to do bpf_map_update_elem whether the map value
has timers, spin_lock, kptrs, or (in the future) dynptrs. copy_map_value will
skip over these fields when updating the map value. See patch 7, and the
sketch below.
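
For intuition, a simplified sketch of the skipping for a single special field
(the real copy_map_value already does roughly this for spin_lock and timer,
and patch 7 extends it to the kptr offsets):

	/* Copy the value while leaving [off, off + sz) untouched in dst */
	static void copy_value_skip_field(void *dst, void *src, u32 value_size,
					  u32 off, u32 sz)
	{
		memcpy(dst, src, off);
		memcpy(dst + off + sz, src + off + sz, value_size - off - sz);
	}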

>
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 71827d14724a..01d45c5010f9 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -3507,6 +3507,83 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> >         return __check_ptr_off_reg(env, reg, regno, false);
> >  }
> >
> > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > +                              struct bpf_map_value_off_desc *off_desc,
> > +                              struct bpf_reg_state *reg, u32 regno)
> > +{
> > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > +       const char *reg_name = "";
> > +
> > +       if (base_type(reg->type) != PTR_TO_BTF_ID || type_flag(reg->type) != PTR_MAYBE_NULL)
> > +               goto bad_type;
> > +
> > +       if (!btf_is_kernel(reg->btf)) {
> > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > +               return -EINVAL;
> > +       }
> > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > +
> > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > +               return -EACCES;
> > +
> > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > +                                 off_desc->btf, off_desc->btf_id))
> > +               goto bad_type;
> > +       return 0;
> > +bad_type:
> > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > +               reg_type_str(env, reg->type), reg_name);
> > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> > +       return -EINVAL;
> > +}
> > +
> > +static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> > +                                int value_regno, int insn_idx,
> > +                                struct bpf_map_value_off_desc *off_desc)
> > +{
> > +       struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
> > +       int class = BPF_CLASS(insn->code);
> > +       struct bpf_reg_state *val_reg;
> > +
> > +       /* Things we already checked for in check_map_access and caller:
> > +        *  - Reject cases where variable offset may touch kptr
> > +        *  - size of access (must be BPF_DW)
> > +        *  - tnum_is_const(reg->var_off)
> > +        *  - off_desc->offset == off + reg->var_off.value
> > +        */
> > +       /* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
> > +       if (BPF_MODE(insn->code) != BPF_MEM)
> > +               goto end;
> I think this needs its own verbose statement - the one in end: doesn't
> seem to match this error

Maybe we should say BPF_LDX_MEM, BPF_STX_MEM, BPF_ST_MEM?
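
e.g. (sketch of such a message):

	verbose(env, "kptr in map can only be accessed using BPF_LDX_MEM, BPF_STX_MEM, BPF_ST_MEM\n");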

> > +
> > +       if (class == BPF_LDX) {
> > +               val_reg = reg_state(env, value_regno);
> > +               /* We can simply mark the value_regno receiving the pointer
> > +                * value from map as PTR_TO_BTF_ID, with the correct type.
> > +                */
> > +               mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
> > +                               off_desc->btf_id, PTR_MAYBE_NULL);
> > +               val_reg->id = ++env->id_gen;
> > +       } else if (class == BPF_STX) {
> > +               val_reg = reg_state(env, value_regno);
> > +               if (!register_is_null(val_reg) &&
> > +                   map_kptr_match_type(env, off_desc, val_reg, value_regno))
> > +                       return -EACCES;
> > +       } else if (class == BPF_ST) {
> > +               if (insn->imm) {
> > +                       verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> > +                               off_desc->offset);
> > +                       return -EACCES;
> > +               }
> > +       } else {
> > +               goto end;
> > +       }
> > +       return 0;
> > +end:
> > +       verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
> > +       return -EACCES;
> > +}
> > +
> >  /* check read/write into a map element with possible variable offset */
> >  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> >                             int off, int size, bool zero_size_allowed)
> > @@ -3545,6 +3622,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> >                         return -EACCES;
> >                 }
> >         }
> > +       if (map_value_has_kptrs(map)) {
> > +               struct bpf_map_value_off *tab = map->kptr_off_tab;
> > +               int i;
> > +
> > +               for (i = 0; i < tab->nr_off; i++) {
> > +                       u32 p = tab->off[i].offset;
> > +
> > +                       if (reg->smin_value + off < p + sizeof(u64) &&
> > +                           p < reg->umax_value + off + size) {
> > +                               if (!tnum_is_const(reg->var_off)) {
> > +                                       verbose(env, "kptr access cannot have variable offset\n");
> > +                                       return -EACCES;
> > +                               }
> > +                               if (p != off + reg->var_off.value) {
> > +                                       verbose(env, "kptr access misaligned expected=%u off=%llu\n",
> > +                                               p, off + reg->var_off.value);
> > +                                       return -EACCES;
> > +                               }
> > +                               if (size != bpf_size_to_bytes(BPF_DW)) {
> > +                                       verbose(env, "kptr access size must be BPF_DW\n");
> > +                                       return -EACCES;
> > +                               }
> > +                               break;
> > +                       }
> > +               }
> > +       }
> >         return err;
> >  }
> >
> > @@ -4412,6 +4515,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >                 if (value_regno >= 0)
> >                         mark_reg_unknown(env, regs, value_regno);
> >         } else if (reg->type == PTR_TO_MAP_VALUE) {
> > +               struct bpf_map_value_off_desc *off_desc = NULL;
> > +
> >                 if (t == BPF_WRITE && value_regno >= 0 &&
> >                     is_pointer_value(env, value_regno)) {
> >                         verbose(env, "R%d leaks addr into map\n", value_regno);
> > @@ -4421,7 +4526,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >                 if (err)
> >                         return err;
> >                 err = check_map_access(env, regno, off, size, false);
> > -               if (!err && t == BPF_READ && value_regno >= 0) {
> > +               if (err)
> > +                       return err;
> > +               if (tnum_is_const(reg->var_off))
> > +                       off_desc = bpf_map_kptr_off_contains(reg->map_ptr,
> > +                                                            off + reg->var_off.value);
> > +               if (off_desc) {
> I think this logic would be a little clearer if you renamed off_desc
> to kptr_off_desc to denote that this only applies to kptrs.

Ok, will change.

> > +                       err = check_map_kptr_access(env, regno, value_regno, insn_idx, off_desc);
> > +                       if (err)
> > +                               return err;
> I don't think you need this if check - it'll return err by default at
> the end of the function.

Right, will drop this.

> > +               } else if (t == BPF_READ && value_regno >= 0) {
> >                         struct bpf_map *map = reg->map_ptr;
> >
> >                         /* if map is read-only, track its contents as scalars */
> > --
> > 2.35.1
> >

--
Kartikeya


* Re: [PATCH bpf-next v4 01/13] bpf: Make btf_find_field more generic
  2022-04-11 20:20   ` Joanne Koong
@ 2022-04-12 19:48     ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-12 19:48 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Tue, Apr 12, 2022 at 01:50:28AM IST, Joanne Koong wrote:
> On Sat, Apr 9, 2022 at 2:32 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > Next commit introduces field type 'kptr' whose kind will not be struct,
> > but pointer, and it will not be limited to one offset, but multiple
> > ones. Make existing btf_find_struct_field and btf_find_datasec_var
> > functions amenable to use for finding kptrs in map value, by moving
> > spin_lock and timer specific checks into their own function.
> >
> > The alignment, and name are checked before the function is called, so it
> > is the last point where we can skip field or return an error before the
> > next loop iteration happens. The name parameter is now optional, and
> > only checked if it is not NULL. Size of the field and type is meant to
> > be checked inside the function, and base type will need to be obtained
> > by skipping modifiers.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  kernel/bpf/btf.c | 129 +++++++++++++++++++++++++++++++++++------------
> >  1 file changed, 96 insertions(+), 33 deletions(-)
> >
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 0918a39279f6..db7bf05adfc5 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -3163,71 +3163,126 @@ static void btf_struct_log(struct btf_verifier_env *env,
> >         btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
> >  }
> >
> > +enum {
> > +       BTF_FIELD_SPIN_LOCK,
> > +       BTF_FIELD_TIMER,
> > +};
> > +
> > +struct btf_field_info {
> > +       u32 off;
> > +};
> > +
> > +static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > +                                u32 off, int sz, struct btf_field_info *info)
> > +{
> > +       if (!__btf_type_is_struct(t))
> > +               return 0;
> > +       if (t->size != sz)
> > +               return 0;
> Do we need these two checks? I think in the original version we did
> because we were checking this before doing the name comparison, but
> now that the name comparison check is first, if the struct name is a
> match, then won't these two things always be true (or if not, then it
> seems like we should return -EINVAL)? But maybe I'm missing something
> here - I'm still getting more familiar with BTF :)

The name can be the same, but since this comes from the map's BTF, it could be
a different struct that merely shares the name string with the kernel type,
and has a different size as well. So checking both is still needed.

Returning -EINVAL now would be backwards incompatible; the code this is
replacing simply continues when it doesn't find a struct with the required
size.
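
For example (hypothetical map-side BTF):

	/* Defined in the BPF program, so it lands in the map's BTF: same name
	 * as the kernel type, but a different size (8 vs 4 bytes).
	 */
	struct bpf_spin_lock {
		__u64 opaque;
	};

With only the name check this would wrongly match; with the size check it is
simply skipped, as before.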

>
> Also, as a side-note: I personally find this function name
> "btf_find_field_struct" confusing since there's also the
> "btf_find_struct_field" function that exists. I wonder whether we
> should just keep the logic inside btf_find_struct_field instead of
> putting it in this separate function?

I'm open to renaming; how about we just call it btf_find_struct? Then in the
next patch I could rename btf_find_field_kptr to btf_find_kptr.

>
> > +       if (info->off != -ENOENT)
> > +               /* only one such field is allowed */
> > +               return -E2BIG;
> In the future, do you plan to add support for multiple fields? I think
> this would be useful for dynptrs as well, so just curious what your
> plans for this are.

In the next patch it is modified to deal with one info at a time, so
supporting multiple fields is just a matter of passing a larger info_cnt. From
that patch onwards it won't do this info->off check to ensure it only saw one
field; that is handled outside, in the caller's loop.
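
Roughly (mirroring what btf_parse_kptrs does in the next patch):

	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
	int i, nr;

	nr = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
	if (nr < 0)
		return nr;	/* e.g. -E2BIG for more fields than info_cnt */
	for (i = 0; i < nr; i++) {
		/* handle info_arr[i].off, info_arr[i].type_id, ... */
	}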

> > +       info->off = off;
> > +       return 0;
> > +}
> > +
> >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> > -                                const char *name, int sz, int align)
> > +                                const char *name, int sz, int align, int field_type,
>
> What are your thoughts on just passing in field_type in place of name,
> sz, and align? As in a function signature like:
>
> static int btf_find_struct_field(const struct btf *btf, const struct
> btf_type *t, int field_type, struct btf_field_info *info);
>
> where inside btf_find_struct_field when we do the switch statement on
> field_type, we can have the name, sz, and align for each of the
> different field types there? That to me seems a bit cleaner where the
> descriptors for the field types are all in one place (instead of also
> in btf_find_spin_lock() and btf_find_timer() functions) and the
> function definition for btf_find_struct_field is more straightforward.
> At that point, I don't think we'd even need btf_find_spin_lock() and
> btf_find_timer() as functions since it'd be just a straightforward
> "btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK/BTF_FIELD_TIMER)" call
> instead. Curious to hear your thoughts.

Isn't btf_find_field doing exactly this? btf_find_timer e.g. only passes
BTF_FIELD_TIMER; the name, sz, and alignment come from btf_find_field.

Also, after btf_find_field there needs to be handling for the info that was
populated. In the case of timer and spin_lock we just return the offset, but
in the case of kptrs we populate the kptr_off_tab. If we moved this inside
btf_find_field, it would have to be done based on field type inside the same
function (either using if/else or switch cases); I'm not sure that is cleaner
than doing it in separate wrappers.

>
> nit: should field_type be a u32 since it's an enum? Or should we be
> explicit and give the enum a name and define this as something like
> "enum btf_field_type type"?
>

Ok, I'll do that (since I'm respinning anyway), though it doesn't really
matter; the underlying type is still int in C.

> > +                                struct btf_field_info *info)
> >  {
> >         const struct btf_member *member;
> > -       u32 i, off = -ENOENT;
> > +       u32 i, off;
> > +       int ret;
> >
> >         for_each_member(i, t, member) {
> >                 const struct btf_type *member_type = btf_type_by_id(btf,
> >                                                                     member->type);
> > -               if (!__btf_type_is_struct(member_type))
> > -                       continue;
> > -               if (member_type->size != sz)
> > -                       continue;
> > -               if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> > -                       continue;
> > -               if (off != -ENOENT)
> > -                       /* only one such field is allowed */
> > -                       return -E2BIG;
> > +
> >                 off = __btf_member_bit_offset(t, member);
> nit: should this be moved to after the strcmp on the name? Since if
> the name doesn't match, there's no point in doing this
> __btf_member_bit_offset call

Ack.

> > +
> > +               if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> I'm confused by the if (name) part of the check. If name is NULL, then
> won't this "btf_find_struct_field" function always return the offset
> to the first struct? I don't think name will ever be NULL so maybe we
> should just remove this? Or do something like if (name) return
> -EINVAL; before doing the strcmp?
>

I'll move it to the next patch, since you noted there that you realised this is
for kptr.

> > +                       continue;
> >                 if (off % 8)
> >                         /* valid C code cannot generate such BTF */
> >                         return -EINVAL;
> >                 off /= 8;
> >                 if (off % align)
> >                         return -EINVAL;
> > +
> > +               switch (field_type) {
> > +               case BTF_FIELD_SPIN_LOCK:
> > +               case BTF_FIELD_TIMER:
> > +                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> nit: I think we can just do "return btf_find_field_struct(btf,
> member_type, off, sz, info);" here and remove the "int ret;"
> declaration a few lines above.
>

Ack.

> > +                       if (ret < 0)
> > +                               return ret;
> > +                       break;
> > +               default:
> > +                       return -EFAULT;
> > +               }
> >         }
> > -       return off;
> > +       return 0;
> >  }
> >
> >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> > -                               const char *name, int sz, int align)
> > +                               const char *name, int sz, int align, int field_type,
> > +                               struct btf_field_info *info)
> The same comments for the btf_find_struct_field function also apply to
> this function
> >  {
> >         const struct btf_var_secinfo *vsi;
> > -       u32 i, off = -ENOENT;
> > +       u32 i, off;
> > +       int ret;
> >
> >         for_each_vsi(i, t, vsi) {
> >                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> >                 const struct btf_type *var_type = btf_type_by_id(btf, var->type);
> >
> > -               if (!__btf_type_is_struct(var_type))
> > -                       continue;
> > -               if (var_type->size != sz)
> > +               off = vsi->offset;
> > +
> > +               if (name && strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
> >                         continue;
> >                 if (vsi->size != sz)
> >                         continue;
> > -               if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
> > -                       continue;
> > -               if (off != -ENOENT)
> > -                       /* only one such field is allowed */
> > -                       return -E2BIG;
> > -               off = vsi->offset;
> >                 if (off % align)
> >                         return -EINVAL;
> > +
> > +               switch (field_type) {
> > +               case BTF_FIELD_SPIN_LOCK:
> > +               case BTF_FIELD_TIMER:
> > +                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > +                       if (ret < 0)
> > +                               return ret;
> > +                       break;
> > +               default:
> > +                       return -EFAULT;
> > +               }
> >         }
> > -       return off;
> > +       return 0;
> >  }
> >
> >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > -                         const char *name, int sz, int align)
> > +                         int field_type, struct btf_field_info *info)
> >  {
> > +       const char *name;
> > +       int sz, align;
> > +
> > +       switch (field_type) {
> > +       case BTF_FIELD_SPIN_LOCK:
> > +               name = "bpf_spin_lock";
> > +               sz = sizeof(struct bpf_spin_lock);
> > +               align = __alignof__(struct bpf_spin_lock);
> > +               break;
> > +       case BTF_FIELD_TIMER:
> > +               name = "bpf_timer";
> > +               sz = sizeof(struct bpf_timer);
> > +               align = __alignof__(struct bpf_timer);
> > +               break;
> > +       default:
> > +               return -EFAULT;
> > +       }
> >
> >         if (__btf_type_is_struct(t))
> > -               return btf_find_struct_field(btf, t, name, sz, align);
> > +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> >         else if (btf_type_is_datasec(t))
> > -               return btf_find_datasec_var(btf, t, name, sz, align);
> > +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> >         return -EINVAL;
> >  }
> >
> > @@ -3237,16 +3292,24 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >   */
> >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> >  {
> > -       return btf_find_field(btf, t, "bpf_spin_lock",
> > -                             sizeof(struct bpf_spin_lock),
> > -                             __alignof__(struct bpf_spin_lock));
> > +       struct btf_field_info info = { .off = -ENOENT };
> > +       int ret;
> > +
> > +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> I'm confused about why we pass in "struct btf_field_info" as the out
> parameter. Maybe I'm missing something here, but why can't
> "btf_find_field" just return back the offset?
> > +       if (ret < 0)
> > +               return ret;
> > +       return info.off;
> >  }
> >
> >  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> >  {
> > -       return btf_find_field(btf, t, "bpf_timer",
> > -                             sizeof(struct bpf_timer),
> > -                             __alignof__(struct bpf_timer));
> > +       struct btf_field_info info = { .off = -ENOENT };
> > +       int ret;
> > +
> > +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> > +       if (ret < 0)
> > +               return ret;
> > +       return info.off;
> >  }
> >
> >  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> > --
> > 2.35.1
> >

--
Kartikeya


* Re: [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto
  2022-04-12 18:16   ` Joanne Koong
@ 2022-04-12 20:11     ` Kumar Kartikeya Dwivedi
  2022-04-13 18:33       ` Joanne Koong
  0 siblings, 1 reply; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-12 20:11 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Tue, Apr 12, 2022 at 11:46:14PM IST, Joanne Koong wrote:
> On Sun, Apr 10, 2022 at 11:58 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > Add a new type flag for bpf_arg_type that when set tells verifier that
> > for a release function, that argument's register will be the one for
> > which meta.ref_obj_id will be set, and which will then be released
> > using release_reference. To capture the regno, introduce a new field
> > release_regno in bpf_call_arg_meta.
> >
> > This would be required in the next patch, where we may either pass NULL
> > or a refcounted pointer as an argument to the release function
> > bpf_kptr_xchg. Just releasing only when meta.ref_obj_id is set is not
> > enough, as there is a case where the type of argument needed matches,
> > but the ref_obj_id is set to 0. Hence, we must enforce that whenever
> > meta.ref_obj_id is zero, the register that is to be released can only
> > be NULL for a release function.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h   |  5 ++++-
> >  kernel/bpf/ringbuf.c  |  4 ++--
> >  kernel/bpf/verifier.c | 46 ++++++++++++++++++++++++++++++++++++-------
> >  net/core/filter.c     |  2 +-
> >  4 files changed, 46 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index e267db260cb7..a6d1982e8118 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -364,7 +364,10 @@ enum bpf_type_flag {
> >          */
> >         MEM_PERCPU              = BIT(4 + BPF_BASE_TYPE_BITS),
> >
> > -       __BPF_TYPE_LAST_FLAG    = MEM_PERCPU,
> > +       /* Indicates that the pointer argument will be released. */
> > +       PTR_RELEASE             = BIT(5 + BPF_BASE_TYPE_BITS),
> > +
> > +       __BPF_TYPE_LAST_FLAG    = PTR_RELEASE,
> >  };
> >
> >  /* Max number of base types. */
> > diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> > index 710ba9de12ce..a22c21c0a7ef 100644
> > --- a/kernel/bpf/ringbuf.c
> > +++ b/kernel/bpf/ringbuf.c
> > @@ -404,7 +404,7 @@ BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags)
> >  const struct bpf_func_proto bpf_ringbuf_submit_proto = {
> >         .func           = bpf_ringbuf_submit,
> >         .ret_type       = RET_VOID,
> > -       .arg1_type      = ARG_PTR_TO_ALLOC_MEM,
> > +       .arg1_type      = ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
> >         .arg2_type      = ARG_ANYTHING,
> >  };
> >
> > @@ -417,7 +417,7 @@ BPF_CALL_2(bpf_ringbuf_discard, void *, sample, u64, flags)
> >  const struct bpf_func_proto bpf_ringbuf_discard_proto = {
> >         .func           = bpf_ringbuf_discard,
> >         .ret_type       = RET_VOID,
> > -       .arg1_type      = ARG_PTR_TO_ALLOC_MEM,
> > +       .arg1_type      = ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
> >         .arg2_type      = ARG_ANYTHING,
> >  };
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 01d45c5010f9..6cc08526e049 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -245,6 +245,7 @@ struct bpf_call_arg_meta {
> >         struct bpf_map *map_ptr;
> >         bool raw_mode;
> >         bool pkt_access;
> > +       u8 release_regno;
> >         int regno;
> >         int access_size;
> >         int mem_size;
> > @@ -5300,6 +5301,11 @@ static bool arg_type_is_int_ptr(enum bpf_arg_type type)
> >                type == ARG_PTR_TO_LONG;
> >  }
> >
> > +static bool arg_type_is_release_ptr(enum bpf_arg_type type)
> > +{
> > +       return type & PTR_RELEASE;
> > +}
> > +
> Now that we have PTR_RELEASE as a bpf arg type descriptor, why do we
> still need is_release_function() in the verifier? I think we should
> just remove is_release_function() altogether - is_release_function()
> isn't functionally necessary now that we have PTR_RELEASE, and I don't
> think it's great that is_release_function() hardcodes specific
> functions into the verifier. What are your thoughts?

We need it to (at least) guard the meta.ref_obj_id release; otherwise you have
to check for PTR_RELEASE in all arguments to determine that it is a release
function. I guess we could record whether the function is a release function
in meta, so that looping over the arguments isn't needed each time (probably
best to do it in check_release_regno, and set it there).

>
> >  static int int_ptr_type_to_size(enum bpf_arg_type type)
> >  {
> >         if (type == ARG_PTR_TO_INT)
> > @@ -5532,7 +5538,7 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
> >                 /* Some of the argument types nevertheless require a
> >                  * zero register offset.
> >                  */
> > -               if (arg_type != ARG_PTR_TO_ALLOC_MEM)
> > +               if (base_type(arg_type) != ARG_PTR_TO_ALLOC_MEM)
> >                         return 0;
> >                 break;
> >         /* All the rest must be rejected, except PTR_TO_BTF_ID which allows
>
> Later on in this check_func_arg_reg_off() function, I think we can get
> rid of the hacky workaround for the PTR_TO_BTF_ID case where it relies
> on whether the function is a release function and reg->ref_obj_id is
> set, to determine whether the argument is a release arg or not. The
> arg type is passed directly to check_func_arg_reg_off(), so I think we
> could just use arg_type_is_release_ptr(arg_type) instead, which will
> also be more robust when/if we support having multiple release args in
> the future.

Ok, sounds good.

>
> > @@ -6124,12 +6130,31 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
> >         return true;
> >  }
> >
> > -static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
> > +static bool check_release_regno(const struct bpf_func_proto *fn, int func_id,
> > +                               struct bpf_call_arg_meta *meta)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> > +               if (arg_type_is_release_ptr(fn->arg_type[i])) {
> > +                       if (!is_release_function(func_id))
> > +                               return false;
> > +                       if (meta->release_regno)
> > +                               return false;
> > +                       meta->release_regno = i + 1;
> > +               }
> > +       }
> > +       return !is_release_function(func_id) || meta->release_regno;
> > +}
> Is this check needed? There's already a check in check_func_arg that
> there can't be two arg registers with ref_obj_ids set. I think this
> already checks against the case where the user tries to pass in two
> release registers as arguments.

This is different; it is about preventing the case where some func_id is
listed as a release function, but none of its arguments is tagged as
PTR_RELEASE. It also doubles as a way to record the regno being released,
since we need to loop anyway.

If we are removing is_release_function, we can just make sure PTR_RELEASE is
only seen once, and consider such functions release functions (and set
meta.release_function to true), something like the sketch below.
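
Sketch of that direction (meta->release_function is a hypothetical new field,
not part of this series):

	static bool check_release_regno(const struct bpf_func_proto *fn,
					struct bpf_call_arg_meta *meta)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
			if (!(fn->arg_type[i] & PTR_RELEASE))
				continue;
			if (meta->release_regno) /* at most one release arg */
				return false;
			meta->release_regno = i + 1;
			meta->release_function = true; /* replaces is_release_function() */
		}
		return true;
	}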

> > +
> > +static int check_func_proto(const struct bpf_func_proto *fn, int func_id,
> > +                           struct bpf_call_arg_meta *meta)
> >  {
> >         return check_raw_mode_ok(fn) &&
> >                check_arg_pair_ok(fn) &&
> >                check_btf_id_ok(fn) &&
> > -              check_refcount_ok(fn, func_id) ? 0 : -EINVAL;
> > +              check_refcount_ok(fn, func_id) &&
> > +              check_release_regno(fn, func_id, meta) ? 0 : -EINVAL;
> >  }
> >
> >  /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
> > @@ -6808,7 +6833,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >         memset(&meta, 0, sizeof(meta));
> >         meta.pkt_access = fn->pkt_access;
> >
> > -       err = check_func_proto(fn, func_id);
> > +       err = check_func_proto(fn, func_id, &meta);
> >         if (err) {
> >                 verbose(env, "kernel subsystem misconfigured func %s#%d\n",
> >                         func_id_name(func_id), func_id);
> > @@ -6841,8 +6866,17 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >                         return err;
> >         }
> >
> > +       regs = cur_regs(env);
> > +
> >         if (is_release_function(func_id)) {
> > -               err = release_reference(env, meta.ref_obj_id);
> > +               err = -EINVAL;
> > +               if (meta.ref_obj_id)
> > +                       err = release_reference(env, meta.ref_obj_id);
> > +               /* meta.ref_obj_id can only be 0 if register that is meant to be
> > +                * released is NULL, which must be > R0.
> > +                */
> > +               else if (meta.release_regno && register_is_null(&regs[meta.release_regno]))
> > +                       err = 0;
> >                 if (err) {
> >                         verbose(env, "func %s#%d reference has not been acquired before\n",
> >                                 func_id_name(func_id), func_id);
> > @@ -6850,8 +6884,6 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >                 }
> >         }
> >
> > -       regs = cur_regs(env);
> > -
> >         switch (func_id) {
> >         case BPF_FUNC_tail_call:
> >                 err = check_reference_leak(env);
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 143f442a9505..8eb01a997476 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -6621,7 +6621,7 @@ static const struct bpf_func_proto bpf_sk_release_proto = {
> >         .func           = bpf_sk_release,
> >         .gpl_only       = false,
> >         .ret_type       = RET_INTEGER,
> > -       .arg1_type      = ARG_PTR_TO_BTF_ID_SOCK_COMMON,
> > +       .arg1_type      = ARG_PTR_TO_BTF_ID_SOCK_COMMON | PTR_RELEASE,
> >  };
> >
> >  BPF_CALL_5(bpf_xdp_sk_lookup_udp, struct xdp_buff *, ctx,
> > --
> > 2.35.1
> >

--
Kartikeya


* Re: [PATCH bpf-next v4 05/13] bpf: Allow storing referenced kptr in map
  2022-04-09  9:32 ` [PATCH bpf-next v4 05/13] bpf: Allow storing referenced kptr in map Kumar Kartikeya Dwivedi
@ 2022-04-12 23:05   ` Joanne Koong
  2022-04-13  5:36     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2022-04-12 23:05 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Mon, Apr 11, 2022 at 12:25 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> Extending the code in previous commit, introduce referenced kptr
> support, which needs to be tagged using 'kptr_ref' tag instead. Unlike
> unreferenced kptr, referenced kptr have a lot more restrictions. In
> addition to the type matching, only a newly introduced bpf_kptr_xchg
> helper is allowed to modify the map value at that offset. This transfers
> the referenced pointer being stored into the map, releasing the
> references state for the program, and returning the old value and
> creating new reference state for the returned pointer.
>
> Similar to unreferenced pointer case, return value for this case will
> also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer
> must either be eventually released by calling the corresponding release
> function, otherwise it must be transferred into another map.
>
> It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
> the value, and obtain the old value if any.
>
> BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future
> commit will permit using BPF_LDX for such pointers, but attempt at
> making it safe, since the lifetime of object won't be guaranteed.
>
> There are valid reasons to enforce the restriction of permitting only
> bpf_kptr_xchg to operate on referenced kptr. The pointer value must be
> consistent in face of concurrent modification, and any prior values
> contained in the map must also be released before a new one is moved
> into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
> returns the old value, which the verifier would require the user to
> either free or move into another map, and releases the reference held
> for the pointer being moved in.
>
> In the future, direct BPF_XCHG instruction may also be permitted to work
> like bpf_kptr_xchg helper.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h            |   7 +++
>  include/uapi/linux/bpf.h       |  12 ++++
>  kernel/bpf/btf.c               |  10 ++-
>  kernel/bpf/helpers.c           |  21 +++++++
>  kernel/bpf/verifier.c          | 107 +++++++++++++++++++++++++++++----
>  tools/include/uapi/linux/bpf.h |  12 ++++
>  6 files changed, 155 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index a6d1982e8118..bd682c29883a 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -160,10 +160,15 @@ enum {
>         BPF_MAP_VALUE_OFF_MAX = 8,
>  };
>
> +enum {
> +       BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
I personally find this name very confusing. What does the "f" in
"OFF_F_REF" stand for? Also, I think adding a comment for this would help
out future readers.
> +};
> +
>  struct bpf_map_value_off_desc {
>         u32 offset;
>         u32 btf_id;
>         struct btf *btf;
> +       int flags;
It's unclear from reading this what flags refers to. Maybe adding a comment
here noting that flags holds values from the enum above (and maybe we should
give the enum a more descriptive name?) would make it clearer?
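
Something like this, as a sketch:

	int flags; /* combination of BPF_MAP_VALUE_OFF_F_* flags */
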
>  };
>
>  struct bpf_map_value_off {
> @@ -416,6 +421,7 @@ enum bpf_arg_type {
>         ARG_PTR_TO_STACK,       /* pointer to stack */
>         ARG_PTR_TO_CONST_STR,   /* pointer to a null terminated read-only string */
>         ARG_PTR_TO_TIMER,       /* pointer to bpf_timer */
> +       ARG_PTR_TO_KPTR,        /* pointer to kptr */
This is only a "pointer to a referenced kptr", correct?
>         __BPF_ARG_TYPE_MAX,
>
>         /* Extended arg_types. */
> @@ -425,6 +431,7 @@ enum bpf_arg_type {
>         ARG_PTR_TO_SOCKET_OR_NULL       = PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
>         ARG_PTR_TO_ALLOC_MEM_OR_NULL    = PTR_MAYBE_NULL | ARG_PTR_TO_ALLOC_MEM,
>         ARG_PTR_TO_STACK_OR_NULL        = PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
> +       ARG_PTR_TO_BTF_ID_OR_NULL       = PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
>
>         /* This must be the last entry. Its purpose is to ensure the enum is
>          * wide enough to hold the higher bits reserved for bpf_type_flag.
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d14b10b85e51..444fe6f1cf35 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5143,6 +5143,17 @@ union bpf_attr {
>   *             The **hash_algo** is returned on success,
>   *             **-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
>   *             invalid arguments are passed.
> + *
> + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> + *     Description
> + *             Exchange kptr at pointer *map_value* with *ptr*, and return the
> + *             old value. *ptr* can be NULL, otherwise it must be a referenced
> + *             pointer which will be released when this helper is called.
> + *     Return
> + *             The old value of kptr (which can be NULL). The returned pointer
> + *             if not NULL, is a reference which must be released using its
> + *             corresponding release function, or moved into a BPF map before
> + *             program exit.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -5339,6 +5350,7 @@ union bpf_attr {
>         FN(copy_from_user_task),        \
>         FN(skb_set_tstamp),             \
>         FN(ima_file_hash),              \
> +       FN(kptr_xchg),                  \
>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 28b1d9e9124e..43ea9ed5652e 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3177,6 +3177,7 @@ enum {
>  struct btf_field_info {
>         u32 type_id;
>         u32 off;
> +       int flags;
>  };
>
>  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> @@ -3194,6 +3195,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
>                                u32 off, int sz, struct btf_field_info *info)
>  {
>         u32 res_id;
> +       int flags;
>
>         /* For PTR, sz is always == 8 */
>         if (!btf_type_is_ptr(t))
> @@ -3205,7 +3207,11 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
>         /* Reject extra tags */
>         if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
>                 return -EINVAL;
> -       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> +       if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> +               flags = 0;
> +       else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off)))
> +               flags = BPF_MAP_VALUE_OFF_F_REF;
> +       else
>                 return -EINVAL;
>
>         /* Get the base type */
> @@ -3216,6 +3222,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
>
>         info->type_id = res_id;
>         info->off = off;
> +       info->flags = flags;
>         return BTF_FIELD_FOUND;
>  }
>
> @@ -3420,6 +3427,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
>                 tab->off[i].offset = info_arr[i].off;
>                 tab->off[i].btf_id = id;
>                 tab->off[i].btf = off_btf;
> +               tab->off[i].flags = info_arr[i].flags;
>                 tab->nr_off = i + 1;
>         }
>         return tab;
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 315053ef6a75..a437d0f0458a 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1374,6 +1374,25 @@ void bpf_timer_cancel_and_free(void *val)
>         kfree(t);
>  }
>
> +BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
> +{
> +       unsigned long *kptr = map_value;
> +
> +       return xchg(kptr, (unsigned long)ptr);
> +}
> +
> +static u32 bpf_kptr_xchg_btf_id;
> +
> +const struct bpf_func_proto bpf_kptr_xchg_proto = {
> +       .func         = bpf_kptr_xchg,
> +       .gpl_only     = false,
> +       .ret_type     = RET_PTR_TO_BTF_ID_OR_NULL,
> +       .ret_btf_id   = &bpf_kptr_xchg_btf_id,
> +       .arg1_type    = ARG_PTR_TO_KPTR,
> +       .arg2_type    = ARG_PTR_TO_BTF_ID_OR_NULL | PTR_RELEASE,
> +       .arg2_btf_id  = &bpf_kptr_xchg_btf_id,
> +};
> +
>  const struct bpf_func_proto bpf_get_current_task_proto __weak;
>  const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
>  const struct bpf_func_proto bpf_probe_read_user_proto __weak;
> @@ -1452,6 +1471,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>                 return &bpf_timer_start_proto;
>         case BPF_FUNC_timer_cancel:
>                 return &bpf_timer_cancel_proto;
> +       case BPF_FUNC_kptr_xchg:
> +               return &bpf_kptr_xchg_proto;
>         default:
>                 break;
>         }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 6cc08526e049..92efe6c3999c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -258,6 +258,7 @@ struct bpf_call_arg_meta {
>         struct btf *ret_btf;
>         u32 ret_btf_id;
>         u32 subprogno;
> +       struct bpf_map_value_off_desc *kptr_off_desc;
>  };
>
>  struct btf *btf_vmlinux;
> @@ -480,7 +481,8 @@ static bool is_release_function(enum bpf_func_id func_id)
>  {
>         return func_id == BPF_FUNC_sk_release ||
>                func_id == BPF_FUNC_ringbuf_submit ||
> -              func_id == BPF_FUNC_ringbuf_discard;
> +              func_id == BPF_FUNC_ringbuf_discard ||
> +              func_id == BPF_FUNC_kptr_xchg;
>  }
>
>  static bool may_be_acquire_function(enum bpf_func_id func_id)
> @@ -500,7 +502,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
>         if (func_id == BPF_FUNC_sk_lookup_tcp ||
>             func_id == BPF_FUNC_sk_lookup_udp ||
>             func_id == BPF_FUNC_skc_lookup_tcp ||
> -           func_id == BPF_FUNC_ringbuf_reserve)
> +           func_id == BPF_FUNC_ringbuf_reserve ||
> +           func_id == BPF_FUNC_kptr_xchg)
>                 return true;
>
>         if (func_id == BPF_FUNC_map_lookup_elem &&
> @@ -3525,6 +3528,12 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
>         /* We need to verify reg->type and reg->btf, before accessing reg->btf */
>         reg_name = kernel_type_name(reg->btf, reg->btf_id);
>
> +       /* For ref_ptr case, release function check should ensure we get one
> +        * referenced PTR_TO_BTF_ID, and that its fixed offset is 0. For the
I don't fully understand why the first sentence in this comment is
relevant to this function - this seems like it belongs more to
check_func_arg_reg_off() for the PTR_TO_BTF_ID case?

> +        * normal store of unreferenced kptr, we must ensure var_off is zero.
> +        * Since ref_ptr cannot be accessed directly by BPF insns, checks for
> +        * reg->off and reg->ref_obj_id are not needed here.
> +        */

>         if (__check_ptr_off_reg(env, reg, regno, true))
>                 return -EACCES;
>
> @@ -3557,6 +3566,12 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
>         if (BPF_MODE(insn->code) != BPF_MEM)
>                 goto end;
>
> +       /* We cannot directly access kptr_ref */
> +       if (off_desc->flags & BPF_MAP_VALUE_OFF_F_REF) {
> +               verbose(env, "accessing referenced kptr disallowed\n");
> +               return -EACCES;
> +       }
> +
>         if (class == BPF_LDX) {
>                 val_reg = reg_state(env, value_regno);
>                 /* We can simply mark the value_regno receiving the pointer
> @@ -5278,6 +5293,59 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
>         return 0;
>  }
>
> +static int process_kptr_func(struct bpf_verifier_env *env, int regno,
> +                            struct bpf_call_arg_meta *meta)
> +{
> +       struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> +       struct bpf_map_value_off_desc *off_desc;
> +       struct bpf_map *map_ptr = reg->map_ptr;
> +       u32 kptr_off;
> +       int ret;
> +
> +       if (!tnum_is_const(reg->var_off)) {
> +               verbose(env,
> +                       "R%d doesn't have constant offset. kptr has to be at the constant offset\n",
> +                       regno);
> +               return -EINVAL;
> +       }
> +       if (!map_ptr->btf) {
> +               verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
> +                       map_ptr->name);
> +               return -EINVAL;
> +       }
> +       if (!map_value_has_kptrs(map_ptr)) {
> +               ret = PTR_ERR(map_ptr->kptr_off_tab);
> +               if (ret == -E2BIG)
> +                       verbose(env, "map '%s' has more than %d kptr\n", map_ptr->name,
> +                               BPF_MAP_VALUE_OFF_MAX);
> +               else if (ret == -EEXIST)
> +                       verbose(env, "map '%s' has repeating kptr BTF tags\n", map_ptr->name);
> +               else
> +                       verbose(env, "map '%s' has no valid kptr\n", map_ptr->name);
> +               return -EINVAL;
> +       }
> +
> +       meta->map_ptr = map_ptr;
> +       /* Check access for BPF_WRITE */
> +       meta->raw_mode = true;
> +       ret = check_helper_mem_access(env, regno, sizeof(u64), false, meta);
Do you need to check access here for both BPF_WRITE and BPF_READ since
you are also reading the map value when you do the xchg?
> +       if (ret < 0)
> +               return ret;
> +
> +       kptr_off = reg->off + reg->var_off.value;
> +       off_desc = bpf_map_kptr_off_contains(map_ptr, kptr_off);
> +       if (!off_desc) {
> +               verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
> +               return -EACCES;
> +       }
> +       if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
> +               verbose(env, "off=%d kptr isn't referenced kptr\n", kptr_off);
> +               return -EACCES;
> +       }
> +       meta->kptr_off_desc = off_desc;
> +       return 0;
> +}
> +
>  static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
>  {
>         return base_type(type) == ARG_PTR_TO_MEM ||
> @@ -5418,6 +5486,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
>  static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
>  static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
>  static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
> +static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } };
>
>  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
>         [ARG_PTR_TO_MAP_KEY]            = &map_key_value_types,
> @@ -5445,11 +5514,13 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
>         [ARG_PTR_TO_STACK]              = &stack_ptr_types,
>         [ARG_PTR_TO_CONST_STR]          = &const_str_ptr_types,
>         [ARG_PTR_TO_TIMER]              = &timer_types,
> +       [ARG_PTR_TO_KPTR]               = &kptr_types,
>  };
>
>  static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
>                           enum bpf_arg_type arg_type,
> -                         const u32 *arg_btf_id)
> +                         const u32 *arg_btf_id,
> +                         struct bpf_call_arg_meta *meta)
>  {
>         struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
>         enum bpf_reg_type expected, type = reg->type;
> @@ -5502,8 +5573,11 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
>                         arg_btf_id = compatible->btf_id;
>                 }
>
> -               if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> -                                         btf_vmlinux, *arg_btf_id)) {
> +               if (meta->func_id == BPF_FUNC_kptr_xchg) {
> +                       if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno))
> +                               return -EACCES;
> +               } else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> +                                                btf_vmlinux, *arg_btf_id)) {
>                         verbose(env, "R%d is of type %s but %s is expected\n",
>                                 regno, kernel_type_name(reg->btf, reg->btf_id),
>                                 kernel_type_name(btf_vmlinux, *arg_btf_id));
> @@ -5613,7 +5687,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>                  */
>                 goto skip_type_check;
>
> -       err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg]);
> +       err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg], meta);
>         if (err)
>                 return err;
>
> @@ -5778,6 +5852,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>                         verbose(env, "string is not zero-terminated\n");
>                         return -EINVAL;
>                 }
> +       } else if (arg_type == ARG_PTR_TO_KPTR) {
> +               if (process_kptr_func(env, regno, meta))
> +                       return -EACCES;
>         }
>
>         return err;
> @@ -6120,10 +6197,10 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
>         int i;
>
>         for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> -               if (fn->arg_type[i] == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
> +               if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
>                         return false;
>
> -               if (fn->arg_type[i] != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
> +               if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
>                         return false;
>         }
>
> @@ -7007,21 +7084,25 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>                         regs[BPF_REG_0].btf_id = meta.ret_btf_id;
>                 }
>         } else if (base_type(ret_type) == RET_PTR_TO_BTF_ID) {
> +               struct btf *ret_btf;
>                 int ret_btf_id;
>
>                 mark_reg_known_zero(env, regs, BPF_REG_0);
>                 regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
> -               ret_btf_id = *fn->ret_btf_id;
> +               if (func_id == BPF_FUNC_kptr_xchg) {
> +                       ret_btf = meta.kptr_off_desc->btf;
> +                       ret_btf_id = meta.kptr_off_desc->btf_id;
> +               } else {
> +                       ret_btf = btf_vmlinux;
> +                       ret_btf_id = *fn->ret_btf_id;
> +               }
>                 if (ret_btf_id == 0) {
>                         verbose(env, "invalid return type %u of func %s#%d\n",
>                                 base_type(ret_type), func_id_name(func_id),
>                                 func_id);
>                         return -EINVAL;
>                 }
> -               /* current BPF helper definitions are only coming from
> -                * built-in code with type IDs from  vmlinux BTF
> -                */
> -               regs[BPF_REG_0].btf = btf_vmlinux;
> +               regs[BPF_REG_0].btf = ret_btf;
>                 regs[BPF_REG_0].btf_id = ret_btf_id;
>         } else {
>                 verbose(env, "unknown return type %u of func %s#%d\n",
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index d14b10b85e51..444fe6f1cf35 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -5143,6 +5143,17 @@ union bpf_attr {
>   *             The **hash_algo** is returned on success,
>   *             **-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
>   *             invalid arguments are passed.
> + *
> + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> + *     Description
> + *             Exchange kptr at pointer *map_value* with *ptr*, and return the
> + *             old value. *ptr* can be NULL, otherwise it must be a referenced
> + *             pointer which will be released when this helper is called.
> + *     Return
> + *             The old value of kptr (which can be NULL). The returned pointer
> + *             if not NULL, is a reference which must be released using its
> + *             corresponding release function, or moved into a BPF map before
> + *             program exit.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -5339,6 +5350,7 @@ union bpf_attr {
>         FN(copy_from_user_task),        \
>         FN(skb_set_tstamp),             \
>         FN(ima_file_hash),              \
> +       FN(kptr_xchg),                  \
>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> --
> 2.35.1
>
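To make sure I'm reading the semantics right, here is a rough sketch of how
I understand the helper is meant to be used from BPF C. Hedged illustration
only: the "kptr_ref" type tag and bpf_kptr_xchg() come from this series,
while the map shape, the attach point, and the task_acquire()/task_release()
kfuncs are hypothetical stand-ins.

        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>

        #define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))

        struct map_value {
                struct task_struct __kptr_ref *task;
        };

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(max_entries, 1);
                __type(key, int);
                __type(value, struct map_value);
        } tasks SEC(".maps");

        /* hypothetical acquire/release kfuncs, for illustration only */
        extern struct task_struct *task_acquire(struct task_struct *p) __ksym;
        extern void task_release(struct task_struct *p) __ksym;

        SEC("tp_btf/task_newtask") /* attach point is illustrative */
        int store_current(void *ctx)
        {
                struct task_struct *task = bpf_get_current_task_btf();
                struct task_struct *old;
                struct map_value *v;
                int key = 0;

                v = bpf_map_lookup_elem(&tasks, &key);
                if (!v)
                        return 0;

                task = task_acquire(task); /* take a reference */
                if (!task)
                        return 0;
                /* move the reference into the map, get the old one back */
                old = bpf_kptr_xchg(&v->task, task);
                if (old)
                        task_release(old); /* returned ref must be released */
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";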


* Re: [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map
  2022-04-12 19:16     ` Kumar Kartikeya Dwivedi
@ 2022-04-12 23:56       ` Joanne Koong
  2022-04-13  5:50         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2022-04-12 23:56 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Tue, Apr 12, 2022 at 12:16 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, Apr 12, 2022 at 06:02:11AM IST, Joanne Koong wrote:
> > On Sat, Apr 9, 2022 at 6:18 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> > > its invocation. When storing to such a kptr, the BPF program's
> > > PTR_TO_BTF_ID register must have the same type as in the map value's
> > > BTF, and loading a kptr marks the destination register as PTR_TO_BTF_ID
> > > with the correct kernel BTF and BTF ID.
> > >
> > > Such kptrs are unreferenced, i.e. by the time another invocation of the
> > > BPF program loads this pointer, the object which the pointer points to
> > > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > patched to PROBE_MEM loads by the verifier, it is safe to allow the user
> > > to still access such a possibly invalid pointer, but passing such
> > > pointers into BPF helpers and kfuncs should not be permitted. A future
> > > patch in this series will close this gap.
> > >
> > > The flexibility offered by allowing programs to dereference such invalid
> > > pointers while being safe at runtime frees the verifier from doing
> > > complex lifetime tracking. As long as the user can ensure that the
> > > object remains valid, the data it reads from the kernel object will
> > > also be valid.
> > >
> > > The user indicates that a certain pointer must be treated as a kptr
> > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > the BTF type tag 'kptr' on the pointed-to type of the pointer. Then, this
> > > information is recorded in the object BTF which will be passed into the
> > > kernel by way of the map's BTF information. The name and kind from the
> > > map value BTF are used to look up the in-kernel type, and the actual BTF
> > > and BTF ID are recorded in the map struct in a new kptr_off_tab member.
> > > For now, only storing pointers to structs is permitted.
> > >
> > > An example of this specification is shown below:
> > >
> > >         #define __kptr __attribute__((btf_type_tag("kptr")))
> > >
> > >         struct map_value {
> > >                 ...
> > >                 struct task_struct __kptr *task;
> > >                 ...
> > >         };
> > >
> > > Then, in a BPF program, the user may store a PTR_TO_BTF_ID of type
> > > task_struct into the map, and load it back later.
> > >
> > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > > the verifier cannot statically know whether the value is NULL, and must
> > > treat all potential loads at that map value offset as loading a
> > > possibly NULL pointer.
> > >
> > > Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL)
> > > are the instructions allowed to access such a pointer. On BPF_LDX, the
> > > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > > the same BTF type as specified in the map BTF. The access size must
> > > always be BPF_DW.
> > >
> > > For the map in map support, the kptr_off_tab for the outer map is copied
> > > from the inner map's kptr_off_tab. A deep copy was chosen instead of
> > > introducing a refcount to kptr_off_tab, because the copy only needs to
> > > be done when parameterizing using inner_map_fd in the map in map case,
> > > so it would be unnecessary for all other users.
> > >
> > > It is not permitted to use the MAP_FREEZE command or mmap for a BPF map
> > > having kptrs, similar to the bpf_timer case.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  include/linux/bpf.h     |  29 +++++++-
> > >  include/linux/btf.h     |   2 +
> > >  kernel/bpf/btf.c        | 160 ++++++++++++++++++++++++++++++++++------
> > >  kernel/bpf/map_in_map.c |   5 +-
> > >  kernel/bpf/syscall.c    | 114 +++++++++++++++++++++++++++-
> > >  kernel/bpf/verifier.c   | 116 ++++++++++++++++++++++++++++-
> > >  6 files changed, 399 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index bdb5298735ce..e267db260cb7 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -155,6 +155,22 @@ struct bpf_map_ops {
> > >         const struct bpf_iter_seq_info *iter_seq_info;
> > >  };
> > >
> > > +enum {
> > > +       /* Support at most 8 pointers in a BPF map value */
> > > +       BPF_MAP_VALUE_OFF_MAX = 8,
> > > +};
> > nit: should this be a typedef instead of an enum?
>
> typedef? Do you mean #define? I prefer enum constants as they get emitted to
> BTF.
Yeah I meant #define, not typedef :) Oh I see - out of curiosity since
I'm still getting acquainted with BTF, what is the utility of the enum
constant getting emitted to the vmlinux BTF? For more detailed
debuggability?
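Just to check my understanding of the difference:

        /* A #define vanishes during preprocessing, so it never reaches BTF:
         *
         *      #define BPF_MAP_VALUE_OFF_MAX 8
         *
         * whereas an enumerator is a real C-level entity, emitted as a
         * BTF_KIND_ENUM value that shows up in
         * `bpftool btf dump file /sys/kernel/btf/vmlinux` and can be
         * discovered by CO-RE-enabled programs and tools at runtime:
         */
        enum {
                BPF_MAP_VALUE_OFF_MAX = 8,
        };

So tooling can pick up the kernel's current limit instead of hardcoding it?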

>
> > > +
> > > +struct bpf_map_value_off_desc {
> > > +       u32 offset;
> > > +       u32 btf_id;
> > > +       struct btf *btf;
> > nit: Since bpf_map_value_off_desc is generic and will support
> > non-kptrs as well, I think embedding "btf_id" and "btf" in a "union {
> > } kptr;" would make it more clear that only kptrs will have these
> > fields used
> >
>
> Ok, will do.
>
> > > +};
> > > +
> > > +struct bpf_map_value_off {
> > > +       u32 nr_off;
> > > +       struct bpf_map_value_off_desc off[];
> > > +};
> > > +
> > >  struct bpf_map {
> > >         /* The first two cachelines with read-mostly members of which some
> > >          * are also accessed in fast-path (e.g. ops, max_entries).
> > > @@ -171,6 +187,7 @@ struct bpf_map {
> > >         u64 map_extra; /* any per-map-type extra fields */
> > >         u32 map_flags;
> > >         int spin_lock_off; /* >=0 valid offset, <0 error */
> > > +       struct bpf_map_value_off *kptr_off_tab;
> > >         int timer_off; /* >=0 valid offset, <0 error */
> > >         u32 id;
> > >         int numa_node;
> > > @@ -184,7 +201,7 @@ struct bpf_map {
> > >         char name[BPF_OBJ_NAME_LEN];
> > >         bool bypass_spec_v1;
> > >         bool frozen; /* write-once; write-protected by freeze_mutex */
> > > -       /* 14 bytes hole */
> > > +       /* 6 bytes hole */
> > >
> > >         /* The 3rd and 4th cacheline with misc members to avoid false sharing
> > >          * particularly with refcounting.
> > > @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
> > >         return map->timer_off >= 0;
> > >  }
> > >
> > > +static inline bool map_value_has_kptrs(const struct bpf_map *map)
> > > +{
> > > +       return !IS_ERR_OR_NULL(map->kptr_off_tab);
> > > +}
> > > +
> > >  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > >  {
> > >         if (unlikely(map_value_has_spin_lock(map)))
> > > @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
> > >  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
> > >  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
> > >
> > > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> > > +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> > > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> > > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > > +
> > >  struct bpf_map *bpf_map_get(u32 ufd);
> > >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> > >  struct bpf_map *__bpf_map_get(struct fd f);
> > > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > > index 36bc09b8e890..19c297f9a52f 100644
> > > --- a/include/linux/btf.h
> > > +++ b/include/linux/btf.h
> > > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> > >                            u32 expected_offset, u32 expected_size);
> > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > > +struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > > +                                         const struct btf_type *t);
> > >  bool btf_type_is_void(const struct btf_type *t);
> > >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> > >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index db7bf05adfc5..28b1d9e9124e 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -3166,9 +3166,16 @@ static void btf_struct_log(struct btf_verifier_env *env,
> > >  enum {
> > >         BTF_FIELD_SPIN_LOCK,
> > >         BTF_FIELD_TIMER,
> > > +       BTF_FIELD_KPTR,
> > > +};
> > > +
> > > +enum {
> > > +       BTF_FIELD_IGNORE = 0,
> > > +       BTF_FIELD_FOUND  = 1,
> > >  };
> > >
> > >  struct btf_field_info {
> > > +       u32 type_id;
> > >         u32 off;
> > >  };
> > >
> > > @@ -3176,23 +3183,50 @@ static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
> > >                                  u32 off, int sz, struct btf_field_info *info)
> > >  {
> > >         if (!__btf_type_is_struct(t))
> > > -               return 0;
> > > +               return BTF_FIELD_IGNORE;
> > >         if (t->size != sz)
> > > -               return 0;
> > > -       if (info->off != -ENOENT)
> > > -               /* only one such field is allowed */
> > > -               return -E2BIG;
> > > +               return BTF_FIELD_IGNORE;
> > >         info->off = off;
> > > -       return 0;
> > > +       return BTF_FIELD_FOUND;
> > > +}
> > > +
> > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > +                              u32 off, int sz, struct btf_field_info *info)
> > > +{
> > > +       u32 res_id;
> > > +
> > > +       /* For PTR, sz is always == 8 */
> > > +       if (!btf_type_is_ptr(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       t = btf_type_by_id(btf, t->type);
> > > +
> > > +       if (!btf_type_is_type_tag(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       /* Reject extra tags */
> > > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > > +               return -EINVAL;
> > > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > +               return -EINVAL;
> > > +
> > > +       /* Get the base type */
> > > +       t = btf_type_skip_modifiers(btf, t->type, &res_id);
> > > +       /* Only pointer to struct is allowed */
> > > +       if (!__btf_type_is_struct(t))
> > > +               return -EINVAL;
> > > +
> > > +       info->type_id = res_id;
> > > +       info->off = off;
> > > +       return BTF_FIELD_FOUND;
> > >  }
> > >
> > >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> > >                                  const char *name, int sz, int align, int field_type,
> > > -                                struct btf_field_info *info)
> > > +                                struct btf_field_info *info, int info_cnt)
> > Ah okay. I should have read this patch first before commenting on the
> > previous one :) I see now why you are passing in info instead of just
> > returning the offset.
> > >  {
> > >         const struct btf_member *member;
> > > +       struct btf_field_info tmp;
> > > +       int ret, idx = 0;
> > >         u32 i, off;
> > > -       int ret;
> > >
> > >         for_each_member(i, t, member) {
> > >                 const struct btf_type *member_type = btf_type_by_id(btf,
> > > @@ -3212,24 +3246,38 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
> > >                 switch (field_type) {
> > >                 case BTF_FIELD_SPIN_LOCK:
> > >                 case BTF_FIELD_TIMER:
> > > -                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> > > +                       ret = btf_find_field_struct(btf, member_type, off, sz, idx < info_cnt ?
> > > +                                                   &info[idx] : &tmp);
> > > +                       if (ret < 0)
> > > +                               return ret;
> > > +                       break;
> > > +               case BTF_FIELD_KPTR:
> > > +                       ret = btf_find_field_kptr(btf, member_type, off, sz, idx < info_cnt ?
> > > +                                                 &info[idx] : &tmp);
> > >                         if (ret < 0)
> > >                                 return ret;
> > >                         break;
> > >                 default:
> > >                         return -EFAULT;
> > >                 }
> > > +
> > > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> > > +                       return -E2BIG;
> > > +               else if (ret == BTF_FIELD_IGNORE)
> > > +                       continue;
> > nit: I think if you check the "ret == BTF_FIELD_IGNORE" first, then
> > you just need to check idx >= info_cnt instead of "ret ==
> > BTF_FIELD_FOUND && idx >= info_cnt"
>
> Ok, I'll switch the order.
>
> > > +               ++idx;
> > >         }
> > > -       return 0;
> > > +       return idx;
> > >  }
> > >
> > >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> > >                                 const char *name, int sz, int align, int field_type,
> > > -                               struct btf_field_info *info)
> > > +                               struct btf_field_info *info, int info_cnt)
> > >  {
> > >         const struct btf_var_secinfo *vsi;
> > > +       struct btf_field_info tmp;
> > > +       int ret, idx = 0;
> > >         u32 i, off;
> > > -       int ret;
> > >
> > >         for_each_vsi(i, t, vsi) {
> > >                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> > > @@ -3247,19 +3295,32 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> > >                 switch (field_type) {
> > >                 case BTF_FIELD_SPIN_LOCK:
> > >                 case BTF_FIELD_TIMER:
> > > -                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > > +                       ret = btf_find_field_struct(btf, var_type, off, sz, idx < info_cnt ?
> > > +                                                   &info[idx] : &tmp);
> > > +                       if (ret < 0)
> > > +                               return ret;
> > > +                       break;
> > > +               case BTF_FIELD_KPTR:
> > > +                       ret = btf_find_field_kptr(btf, var_type, off, sz, idx < info_cnt ?
> > > +                                                 &info[idx] : &tmp);
> > >                         if (ret < 0)
> > >                                 return ret;
> > >                         break;
> > >                 default:
> > >                         return -EFAULT;
> > >                 }
> > > +
> > > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> > > +                       return -E2BIG;
> > > +               if (ret == BTF_FIELD_IGNORE)
> > > +                       continue;
> > > +               ++idx;
> > >         }
> > > -       return 0;
> > > +       return idx;
> > >  }
> > >
> > >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > > -                         int field_type, struct btf_field_info *info)
> > > +                         int field_type, struct btf_field_info *info, int info_cnt)
> > >  {
> > >         const char *name;
> > >         int sz, align;
> > > @@ -3275,14 +3336,19 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > >                 sz = sizeof(struct bpf_timer);
> > >                 align = __alignof__(struct bpf_timer);
> > >                 break;
> > > +       case BTF_FIELD_KPTR:
> > > +               name = NULL;
> > I see now why you added the if (name) check in the previous patch.
> > Maybe that should be part of this patch instead to make it more clear?
>
> Yes, I'll move it to this patch.
>
> > > +               sz = sizeof(u64);
> > > +               align = 8;
> > > +               break;
> > >         default:
> > >                 return -EFAULT;
> > >         }
> > >
> > >         if (__btf_type_is_struct(t))
> > > -               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> > > +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt);
> > >         else if (btf_type_is_datasec(t))
> > > -               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> > > +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt);
> > >         return -EINVAL;
> > >  }
> > >
> > > @@ -3292,26 +3358,78 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > >   */
> > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> > >  {
> > > -       struct btf_field_info info = { .off = -ENOENT };
> > > +       struct btf_field_info info;
> > >         int ret;
> > >
> > > -       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> > > +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info, 1);
> > >         if (ret < 0)
> > >                 return ret;
> > > +       if (!ret)
> > > +               return -ENOENT;
> > >         return info.off;
> > >  }
> > >
> > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> > >  {
> > > -       struct btf_field_info info = { .off = -ENOENT };
> > > +       struct btf_field_info info;
> > >         int ret;
> > >
> > > -       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> > > +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info, 1);
> > >         if (ret < 0)
> > >                 return ret;
> > > +       if (!ret)
> > > +               return -ENOENT;
> > >         return info.off;
> > >  }
> > >
> > > +struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > > +                                         const struct btf_type *t)
> > > +{
> > > +       struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
> > > +       struct bpf_map_value_off *tab;
> > > +       int ret, i, nr_off;
> > > +
> > > +       /* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> > > +       BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
> > > +
> > > +       ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> > > +       if (ret < 0)
> > > +               return ERR_PTR(ret);
> > > +       if (!ret)
> > > +               return NULL;
> > > +
> > > +       nr_off = ret;
> > > +       tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> > > +       if (!tab)
> > > +               return ERR_PTR(-ENOMEM);
> > > +
> > > +       tab->nr_off = 0;
> > tab is kzalloced - I think we can just remove this line
>
> Right, will drop this.
>
> > > +       for (i = 0; i < nr_off; i++) {
> > > +               const struct btf_type *t;
> > > +               struct btf *off_btf;
> > > +               s32 id;
> > > +
> > > +               t = btf_type_by_id(btf, info_arr[i].type_id);
> > > +               id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> > > +                                    &off_btf);
> > > +               if (id < 0) {
> > > +                       ret = id;
> > > +                       goto end;
> > > +               }
> > > +
> > > +               tab->off[i].offset = info_arr[i].off;
> > > +               tab->off[i].btf_id = id;
> > > +               tab->off[i].btf = off_btf;
> > > +               tab->nr_off = i + 1;
> > > +       }
> > Instead of incrementing tab->nr_off in every loop iteration, why not
> > just set "tab->nr_off = nr_off" here after the loop? And then in the
> > end: case where there's an error, we could just do
> >
> > while (i--)
> >   btf_put(tab->off[i].btf);
> >
>
> That works too, will change.
>
> > > +       return tab;
> > > +end:
> > > +       while (tab->nr_off--)
> > > +               btf_put(tab->off[tab->nr_off].btf);
> > > +       kfree(tab);
> > > +       return ERR_PTR(ret);
> > > +}
> > > +
> > >  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> > >                               u32 type_id, void *data, u8 bits_offset,
> > >                               struct btf_show *show)
> > > diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> > > index 5cd8f5277279..135205d0d560 100644
> > > --- a/kernel/bpf/map_in_map.c
> > > +++ b/kernel/bpf/map_in_map.c
> > > @@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> > >         inner_map_meta->max_entries = inner_map->max_entries;
> > >         inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
> > >         inner_map_meta->timer_off = inner_map->timer_off;
> > > +       inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
> > >         if (inner_map->btf) {
> > >                 btf_get(inner_map->btf);
> > >                 inner_map_meta->btf = inner_map->btf;
> > > @@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> > >
> > >  void bpf_map_meta_free(struct bpf_map *map_meta)
> > >  {
> > > +       bpf_map_free_kptr_off_tab(map_meta);
> > >         btf_put(map_meta->btf);
> > >         kfree(map_meta);
> > >  }
> > > @@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
> > >                 meta0->key_size == meta1->key_size &&
> > >                 meta0->value_size == meta1->value_size &&
> > >                 meta0->timer_off == meta1->timer_off &&
> > > -               meta0->map_flags == meta1->map_flags;
> > > +               meta0->map_flags == meta1->map_flags &&
> > > +               bpf_map_equal_kptr_off_tab(meta0, meta1);
> > >  }
> > >
> > >  void *bpf_map_fd_get_ptr(struct bpf_map *map,
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index cdaa1152436a..edfe691284b0 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -6,6 +6,7 @@
> > >  #include <linux/bpf_trace.h>
> > >  #include <linux/bpf_lirc.h>
> > >  #include <linux/bpf_verifier.h>
> > > +#include <linux/bsearch.h>
> > >  #include <linux/btf.h>
> > >  #include <linux/syscalls.h>
> > >  #include <linux/slab.h>
> > > @@ -473,12 +474,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
> > >  }
> > >  #endif
> > >
> > > +static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> > > +{
> > > +       const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
> > > +
> > > +       if (off_desc1->offset < off_desc2->offset)
> > > +               return -1;
> > > +       else if (off_desc1->offset > off_desc2->offset)
> > > +               return 1;
> > > +       return 0;
> > > +}
> > > +
> > > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> > > +{
> > > +       /* Since members are iterated in btf_find_field in increasing order,
> > > +        * offsets appended to kptr_off_tab are in increasing order, so we can
> > > +        * do bsearch to find exact match.
> > > +        */
> > > +       struct bpf_map_value_off *tab;
> > > +
> > > +       if (!map_value_has_kptrs(map))
> > > +               return NULL;
> > > +       tab = map->kptr_off_tab;
> > > +       return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
> > > +}
> > > +
> > > +void bpf_map_free_kptr_off_tab(struct bpf_map *map)
> > > +{
> > > +       struct bpf_map_value_off *tab = map->kptr_off_tab;
> > > +       int i;
> > > +
> > > +       if (!map_value_has_kptrs(map))
> > > +               return;
> > > +       for (i = 0; i < tab->nr_off; i++) {
> > > +               struct btf *btf = tab->off[i].btf;
> > > +
> > > +               btf_put(btf);
> > > +       }
> > > +       kfree(tab);
> > > +       map->kptr_off_tab = NULL;
> > > +}
> > > +
> > > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
> > > +{
> > > +       struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
> > > +       int size, i, ret;
> > > +
> > > +       if (!map_value_has_kptrs(map))
> > > +               return ERR_PTR(-ENOENT);
> > > +       /* Do a deep copy of the kptr_off_tab */
> > > +       for (i = 0; i < tab->nr_off; i++)
> > > +               btf_get(tab->off[i].btf);
> > > +
> > > +       size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
> > > +       new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> > I think we can get away with not zero-ing out the memory, since we're
> > going to be memcpying over its contents right after
>
> Right, kmalloc should be fine.
>
> > > +       if (!new_tab) {
> > > +               ret = -ENOMEM;
> > > +               goto end;
> > > +       }
> > > +       memcpy(new_tab, tab, size);
> > > +       return new_tab;
> > > +end:
> > > +       while (i--)
> > > +               btf_put(tab->off[i].btf);
> > > +       return ERR_PTR(ret);
> > > +}
> > > +
> > > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> > > +{
> > > +       struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> > > +       bool a_has_kptr = map_value_has_kptrs(map_a), b_has_kptr = map_value_has_kptrs(map_b);
> > > +       int size;
> > > +
> > > +       if (!a_has_kptr && !b_has_kptr)
> > > +               return true;
> > > +       if (a_has_kptr != b_has_kptr)
> > > +               return false;
> > > +       if (tab_a->nr_off != tab_b->nr_off)
> > > +               return false;
> > > +       size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> > > +       return !memcmp(tab_a, tab_b, size);
> > > +}
> > > +
> > >  /* called from workqueue */
> > >  static void bpf_map_free_deferred(struct work_struct *work)
> > >  {
> > >         struct bpf_map *map = container_of(work, struct bpf_map, work);
> > >
> > >         security_bpf_map_free(map);
> > > +       bpf_map_free_kptr_off_tab(map);
> > >         bpf_map_release_memcg(map);
> > >         /* implementation dependent freeing */
> > >         map->ops->map_free(map);
> > > @@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> > >         int err;
> > >
> > >         if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> > > -           map_value_has_timer(map))
> > > +           map_value_has_timer(map) || map_value_has_kptrs(map))
> > >                 return -ENOTSUPP;
> > >
> > >         if (!(vma->vm_flags & VM_SHARED))
> > > @@ -820,9 +904,33 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > >                         return -EOPNOTSUPP;
> > >         }
> > >
> > > -       if (map->ops->map_check_btf)
> > > +       map->kptr_off_tab = btf_parse_kptrs(btf, value_type);
> > Since btf_parse_kptrs can return an ERR_PTR, I think we need to
> > check here whether map->kptr_off_tab is an ERR_PTR.
> >
>
> This is already checked by map_value_has_kptrs (which is
> !IS_ERR_OR_NULL(map->kptr_off_tab)). We store ERR_PTR to distinguish the error
> message given to user in process_kptr_func (later in ref kptr patch).
>
> > > +       if (map_value_has_kptrs(map)) {
> > > +               if (!bpf_capable()) {
> > > +                       ret = -EPERM;
> > > +                       goto free_map_tab;
> > > +               }
> > > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > Why is it an error if BPF_F_RDONLY_PROG is set? Maybe I'm
> > misunderstanding what BPF_F_RDONLY_PROG means, but why can't a program
> > have read-only access to the kptr value?
>
> It would be useless, kptr can only be set from inside a BPF program.
If the kptr is embedded inside a larger struct, couldn't there be use
cases where a program wants to read the other fields in this struct
that have been updated by the userspace application?
>
> > > +                       ret = -EACCES;
> > > +                       goto free_map_tab;
> > > +               }
> > > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> > Out of curiosity, do you also plan to add kptr support in the future
> > to local storage maps as well?
>
> Yes, those and percpu maps are on the TODO list.
Awesome!!
>
> > > +                       ret = -EOPNOTSUPP;
> > > +                       goto free_map_tab;
> > > +               }
> > > +       }
> > > +
> > > +       if (map->ops->map_check_btf) {
> > >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > > +               if (ret < 0)
> > > +                       goto free_map_tab;
> > > +       }
> > >
> > > +       return ret;
> > > +free_map_tab:
> > > +       bpf_map_free_kptr_off_tab(map);
> > >         return ret;
> > >  }
> > >
> > > @@ -1639,7 +1747,7 @@ static int map_freeze(const union bpf_attr *attr)
> > >                 return PTR_ERR(map);
> > >
> > >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > > -           map_value_has_timer(map)) {
> > > +           map_value_has_timer(map) || map_value_has_kptrs(map)) {
> > >                 fdput(f);
> > >                 return -ENOTSUPP;
> > >         }
> >
> > Maybe I'm missing something, but I'm not seeing it in this patch - do
> > we also need to add checks that prohibit userspace programs from
> > trying to do bpf_map_update_elem syscalls that manipulate kptr map
> > values?
>
> Userspace should be allowed to do bpf_map_update_elem, whether map value has
> timers, spin_lock, kptrs, or dynptrs in the future. copy_map_value will skip
> over these fields when updating map value. See patch 7.

Within the context of this patch, a userspace program can call
bpf_map_update_elem and put some unsafe value in the kptr, which will
cause the bpf program to crash the kernel when it accesses that value.
That was my main concern, but since this is going to be addressed in
your patch 7, I don't think this matters then.
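For other readers of this thread: my rough understanding of the patch-7
approach is that the copy helper skips the kptr slots entirely, something
along these lines (illustrative sketch only, not the kernel code):

        #include <stdint.h>
        #include <string.h>

        /* Copy a map value but skip over the kptr fields, so a userspace
         * bpf_map_update_elem() can never clobber a kernel pointer.
         * Assumes offs[] is sorted ascending, as the kptr_off_tab
         * offsets are.
         */
        static void copy_map_value_sketch(char *dst, const char *src,
                                          uint32_t size, const uint32_t *offs,
                                          uint32_t nr_off)
        {
                uint32_t cur = 0, i;

                for (i = 0; i < nr_off; i++) {
                        /* copy the bytes before this kptr, then jump past it */
                        memcpy(dst + cur, src + cur, offs[i] - cur);
                        cur = offs[i] + sizeof(uint64_t);
                }
                memcpy(dst + cur, src + cur, size - cur);
        }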

>
> >
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 71827d14724a..01d45c5010f9 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -3507,6 +3507,83 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> > >         return __check_ptr_off_reg(env, reg, regno, false);
> > >  }
> > >
> > > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > > +                              struct bpf_map_value_off_desc *off_desc,
> > > +                              struct bpf_reg_state *reg, u32 regno)
> > > +{
> > > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > > +       const char *reg_name = "";
> > > +
> > > +       if (base_type(reg->type) != PTR_TO_BTF_ID || type_flag(reg->type) != PTR_MAYBE_NULL)
> > > +               goto bad_type;
> > > +
> > > +       if (!btf_is_kernel(reg->btf)) {
> > > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > > +               return -EINVAL;
> > > +       }
> > > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > > +
> > > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > > +               return -EACCES;
> > > +
> > > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > > +                                 off_desc->btf, off_desc->btf_id))
> > > +               goto bad_type;
> > > +       return 0;
> > > +bad_type:
> > > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > > +               reg_type_str(env, reg->type), reg_name);
> > > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> > > +       return -EINVAL;
> > > +}
> > > +
> > > +static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> > > +                                int value_regno, int insn_idx,
> > > +                                struct bpf_map_value_off_desc *off_desc)
> > > +{
> > > +       struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
> > > +       int class = BPF_CLASS(insn->code);
> > > +       struct bpf_reg_state *val_reg;
> > > +
> > > +       /* Things we already checked for in check_map_access and caller:
> > > +        *  - Reject cases where variable offset may touch kptr
> > > +        *  - size of access (must be BPF_DW)
> > > +        *  - tnum_is_const(reg->var_off)
> > > +        *  - off_desc->offset == off + reg->var_off.value
> > > +        */
> > > +       /* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
> > > +       if (BPF_MODE(insn->code) != BPF_MEM)
> > > +               goto end;
> > I think this needs its own verbose statement - the one in end: doesn't
> > seem to match this error
>
> Maybe we should say BPF_LDX_MEM, BPF_STX_MEM, BPF_ST_MEM?
I think it'd be clearest if there were separate error messages for the
case where the program is using a different mode than BPF_MEM vs. the
program using an unsupported instruction class.
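Something like this is what I have in mind (sketch only, exact wording up
to you):

        /* give the BPF_MODE check its own diagnostic instead of falling
         * through to the shared message at end:
         */
        if (BPF_MODE(insn->code) != BPF_MEM) {
                verbose(env, "kptr in map can only be accessed with BPF_MEM mode\n");
                return -EACCES;
        }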
>
> > > +
> > > +       if (class == BPF_LDX) {
> > > +               val_reg = reg_state(env, value_regno);
> > > +               /* We can simply mark the value_regno receiving the pointer
> > > +                * value from map as PTR_TO_BTF_ID, with the correct type.
> > > +                */
> > > +               mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
> > > +                               off_desc->btf_id, PTR_MAYBE_NULL);
> > > +               val_reg->id = ++env->id_gen;
> > > +       } else if (class == BPF_STX) {
> > > +               val_reg = reg_state(env, value_regno);
> > > +               if (!register_is_null(val_reg) &&
> > > +                   map_kptr_match_type(env, off_desc, val_reg, value_regno))
> > > +                       return -EACCES;
> > > +       } else if (class == BPF_ST) {
> > > +               if (insn->imm) {
> > > +                       verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> > > +                               off_desc->offset);
> > > +                       return -EACCES;
> > > +               }
> > > +       } else {
> > > +               goto end;
> > > +       }
> > > +       return 0;
> > > +end:
> > > +       verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
> > > +       return -EACCES;
> > > +}
> > > +
> > >  /* check read/write into a map element with possible variable offset */
> > >  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> > >                             int off, int size, bool zero_size_allowed)
> > > @@ -3545,6 +3622,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> > >                         return -EACCES;
> > >                 }
> > >         }
> > > +       if (map_value_has_kptrs(map)) {
> > > +               struct bpf_map_value_off *tab = map->kptr_off_tab;
> > > +               int i;
> > > +
> > > +               for (i = 0; i < tab->nr_off; i++) {
> > > +                       u32 p = tab->off[i].offset;
> > > +
> > > +                       if (reg->smin_value + off < p + sizeof(u64) &&
> > > +                           p < reg->umax_value + off + size) {
> > > +                               if (!tnum_is_const(reg->var_off)) {
> > > +                                       verbose(env, "kptr access cannot have variable offset\n");
> > > +                                       return -EACCES;
> > > +                               }
> > > +                               if (p != off + reg->var_off.value) {
> > > +                                       verbose(env, "kptr access misaligned expected=%u off=%llu\n",
> > > +                                               p, off + reg->var_off.value);
> > > +                                       return -EACCES;
> > > +                               }
> > > +                               if (size != bpf_size_to_bytes(BPF_DW)) {
> > > +                                       verbose(env, "kptr access size must be BPF_DW\n");
> > > +                                       return -EACCES;
> > > +                               }
> > > +                               break;
> > > +                       }
> > > +               }
> > > +       }
> > >         return err;
> > >  }
> > >
> > > @@ -4412,6 +4515,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > >                 if (value_regno >= 0)
> > >                         mark_reg_unknown(env, regs, value_regno);
> > >         } else if (reg->type == PTR_TO_MAP_VALUE) {
> > > +               struct bpf_map_value_off_desc *off_desc = NULL;
> > > +
> > >                 if (t == BPF_WRITE && value_regno >= 0 &&
> > >                     is_pointer_value(env, value_regno)) {
> > >                         verbose(env, "R%d leaks addr into map\n", value_regno);
> > > @@ -4421,7 +4526,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > >                 if (err)
> > >                         return err;
> > >                 err = check_map_access(env, regno, off, size, false);
> > > -               if (!err && t == BPF_READ && value_regno >= 0) {
> > > +               if (err)
> > > +                       return err;
> > > +               if (tnum_is_const(reg->var_off))
> > > +                       off_desc = bpf_map_kptr_off_contains(reg->map_ptr,
> > > +                                                            off + reg->var_off.value);
> > > +               if (off_desc) {
> > I think this logic would be a little clearer if you renamed off_desc
> > to kptr_off_desc to denote that this only applies to kptrs.
>
> Ok, will change.
>
> > > +                       err = check_map_kptr_access(env, regno, value_regno, insn_idx, off_desc);
> > > +                       if (err)
> > > +                               return err;
> > I don't think you need this if check - it'll return err by default at
> > the end of the function.
>
> Right, will drop this.
>
> > > +               } else if (t == BPF_READ && value_regno >= 0) {
> > >                         struct bpf_map *map = reg->map_ptr;
> > >
> > >                         /* if map is read-only, track its contents as scalars */
> > > --
> > > 2.35.1
> > >
>
> --
> Kartikeya
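
Putting this patch's pieces together, a minimal end-to-end sketch of the
unreferenced-kptr flow looks roughly as below. Map shape and attach point
are illustrative; only the __kptr tag and the load/store semantics come
from the patch.

        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>

        #define __kptr __attribute__((btf_type_tag("kptr")))

        struct map_value {
                struct task_struct __kptr *task;
        };

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(max_entries, 1);
                __type(key, int);
                __type(value, struct map_value);
        } m SEC(".maps");

        SEC("tp_btf/task_newtask") /* attach point is illustrative */
        int prog(void *ctx)
        {
                struct task_struct *cur = bpf_get_current_task_btf();
                struct map_value *v;
                int key = 0;

                v = bpf_map_lookup_elem(&m, &key);
                if (!v)
                        return 0;

                v->task = cur; /* BPF_STX: type-checked against map BTF */

                cur = v->task; /* BPF_LDX: dst reg is PTR_TO_BTF_ID_OR_NULL */
                if (cur)
                        /* dereference is patched to a PROBE_MEM load, so
                         * this is safe even if the task has since exited
                         */
                        bpf_printk("pid=%d", cur->pid);
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";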


* Re: [PATCH bpf-next v4 05/13] bpf: Allow storing referenced kptr in map
  2022-04-12 23:05   ` Joanne Koong
@ 2022-04-13  5:36     ` Kumar Kartikeya Dwivedi
  2022-04-13  5:54       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-13  5:36 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Wed, Apr 13, 2022 at 04:35:11AM IST, Joanne Koong wrote:
> On Mon, Apr 11, 2022 at 12:25 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > Extending the code in the previous commit, introduce referenced kptr
> > support, which needs to be tagged using the 'kptr_ref' tag instead.
> > Unlike unreferenced kptrs, referenced kptrs have a lot more restrictions.
> > In addition to the type matching, only the newly introduced bpf_kptr_xchg
> > helper is allowed to modify the map value at that offset. This transfers
> > the referenced pointer being stored into the map, releasing the reference
> > state for the program, returning the old value, and creating new
> > reference state for the returned pointer.
> >
> > Similar to the unreferenced pointer case, the return value for this case
> > will also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned
> > pointer must eventually either be released by calling the corresponding
> > release function, or be transferred into another map.
> >
> > It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
> > the value, and obtain the old value if any.
> >
> > BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future
> > commit will permit using BPF_LDX for such pointers, while attempting
> > to make it safe, since the lifetime of the object won't be guaranteed.
> >
> > There are valid reasons to enforce the restriction of permitting only
> > bpf_kptr_xchg to operate on referenced kptr. The pointer value must be
> > consistent in the face of concurrent modification, and any prior values
> > contained in the map must also be released before a new one is moved
> > into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
> > returns the old value, which the verifier would require the user to
> > either free or move into another map, and releases the reference held
> > for the pointer being moved in.
> >
> > In the future, direct BPF_XCHG instruction may also be permitted to work
> > like bpf_kptr_xchg helper.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h            |   7 +++
> >  include/uapi/linux/bpf.h       |  12 ++++
> >  kernel/bpf/btf.c               |  10 ++-
> >  kernel/bpf/helpers.c           |  21 +++++++
> >  kernel/bpf/verifier.c          | 107 +++++++++++++++++++++++++++++----
> >  tools/include/uapi/linux/bpf.h |  12 ++++
> >  6 files changed, 155 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index a6d1982e8118..bd682c29883a 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -160,10 +160,15 @@ enum {
> >         BPF_MAP_VALUE_OFF_MAX = 8,
> >  };
> >
> > +enum {
> > +       BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> I personally find this name very confusing. What does the "f" in
> "OFF_F_REF" stand for? Also, I think adding in a comment for this
> would help out future readers.

The f is for 'flag', similar to e.g. BPF_F_LOCK. I'll add a comment.
Also it can probably be type rather than flag.
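
Roughly what I have in mind (naming still open):

        /* sketch of the named and commented version */
        enum bpf_kptr_type {
                BPF_MAP_VALUE_OFF_F_REF = (1U << 0), /* reference-counted kptr ("kptr_ref") */
        };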

> > +};
> > +
> >  struct bpf_map_value_off_desc {
> >         u32 offset;
> >         u32 btf_id;
> >         struct btf *btf;
> > +       int flags;
> It's unclear from reading this what flags refers to. Maybe adding a
> comment here that flags holds values from the enum above (and maybe we
> should give the enum a more descriptive name?) would make it clearer?

I'll give the enum a name and rename it to 'type'. Then, for the dynptr
case, you can add other types of dynptr that can be embedded in a map value.

> >  };
> >
> >  struct bpf_map_value_off {
> > @@ -416,6 +421,7 @@ enum bpf_arg_type {
> >         ARG_PTR_TO_STACK,       /* pointer to stack */
> >         ARG_PTR_TO_CONST_STR,   /* pointer to a null terminated read-only string */
> >         ARG_PTR_TO_TIMER,       /* pointer to bpf_timer */
> > +       ARG_PTR_TO_KPTR,        /* pointer to kptr */
> This is only a "pointer to a referenced kptr", correct?

Yes. I'll update the comment.

> >         __BPF_ARG_TYPE_MAX,
> >
> >         /* Extended arg_types. */
> > @@ -425,6 +431,7 @@ enum bpf_arg_type {
> >         ARG_PTR_TO_SOCKET_OR_NULL       = PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
> >         ARG_PTR_TO_ALLOC_MEM_OR_NULL    = PTR_MAYBE_NULL | ARG_PTR_TO_ALLOC_MEM,
> >         ARG_PTR_TO_STACK_OR_NULL        = PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
> > +       ARG_PTR_TO_BTF_ID_OR_NULL       = PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
> >
> >         /* This must be the last entry. Its purpose is to ensure the enum is
> >          * wide enough to hold the higher bits reserved for bpf_type_flag.
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index d14b10b85e51..444fe6f1cf35 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5143,6 +5143,17 @@ union bpf_attr {
> >   *             The **hash_algo** is returned on success,
> >   *             **-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
> >   *             invalid arguments are passed.
> > + *
> > + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> > + *     Description
> > + *             Exchange kptr at pointer *map_value* with *ptr*, and return the
> > + *             old value. *ptr* can be NULL, otherwise it must be a referenced
> > + *             pointer which will be released when this helper is called.
> > + *     Return
> > + *             The old value of kptr (which can be NULL). The returned pointer
> > + *             if not NULL, is a reference which must be released using its
> > + *             corresponding release function, or moved into a BPF map before
> > + *             program exit.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -5339,6 +5350,7 @@ union bpf_attr {
> >         FN(copy_from_user_task),        \
> >         FN(skb_set_tstamp),             \
> >         FN(ima_file_hash),              \
> > +       FN(kptr_xchg),                  \
> >         /* */
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 28b1d9e9124e..43ea9ed5652e 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -3177,6 +3177,7 @@ enum {
> >  struct btf_field_info {
> >         u32 type_id;
> >         u32 off;
> > +       int flags;
> >  };
> >
> >  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > @@ -3194,6 +3195,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> >                                u32 off, int sz, struct btf_field_info *info)
> >  {
> >         u32 res_id;
> > +       int flags;
> >
> >         /* For PTR, sz is always == 8 */
> >         if (!btf_type_is_ptr(t))
> > @@ -3205,7 +3207,11 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> >         /* Reject extra tags */
> >         if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> >                 return -EINVAL;
> > -       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > +       if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > +               flags = 0;
> > +       else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off)))
> > +               flags = BPF_MAP_VALUE_OFF_F_REF;
> > +       else
> >                 return -EINVAL;
> >
> >         /* Get the base type */
> > @@ -3216,6 +3222,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> >
> >         info->type_id = res_id;
> >         info->off = off;
> > +       info->flags = flags;
> >         return BTF_FIELD_FOUND;
> >  }
> >
> > @@ -3420,6 +3427,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> >                 tab->off[i].offset = info_arr[i].off;
> >                 tab->off[i].btf_id = id;
> >                 tab->off[i].btf = off_btf;
> > +               tab->off[i].flags = info_arr[i].flags;
> >                 tab->nr_off = i + 1;
> >         }
> >         return tab;
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 315053ef6a75..a437d0f0458a 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1374,6 +1374,25 @@ void bpf_timer_cancel_and_free(void *val)
> >         kfree(t);
> >  }
> >
> > +BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
> > +{
> > +       unsigned long *kptr = map_value;
> > +
> > +       return xchg(kptr, (unsigned long)ptr);
> > +}
> > +
> > +static u32 bpf_kptr_xchg_btf_id;
> > +
> > +const struct bpf_func_proto bpf_kptr_xchg_proto = {
> > +       .func         = bpf_kptr_xchg,
> > +       .gpl_only     = false,
> > +       .ret_type     = RET_PTR_TO_BTF_ID_OR_NULL,
> > +       .ret_btf_id   = &bpf_kptr_xchg_btf_id,
> > +       .arg1_type    = ARG_PTR_TO_KPTR,
> > +       .arg2_type    = ARG_PTR_TO_BTF_ID_OR_NULL | PTR_RELEASE,
> > +       .arg2_btf_id  = &bpf_kptr_xchg_btf_id,
> > +};
> > +
> >  const struct bpf_func_proto bpf_get_current_task_proto __weak;
> >  const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
> >  const struct bpf_func_proto bpf_probe_read_user_proto __weak;
> > @@ -1452,6 +1471,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> >                 return &bpf_timer_start_proto;
> >         case BPF_FUNC_timer_cancel:
> >                 return &bpf_timer_cancel_proto;
> > +       case BPF_FUNC_kptr_xchg:
> > +               return &bpf_kptr_xchg_proto;
> >         default:
> >                 break;
> >         }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 6cc08526e049..92efe6c3999c 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -258,6 +258,7 @@ struct bpf_call_arg_meta {
> >         struct btf *ret_btf;
> >         u32 ret_btf_id;
> >         u32 subprogno;
> > +       struct bpf_map_value_off_desc *kptr_off_desc;
> >  };
> >
> >  struct btf *btf_vmlinux;
> > @@ -480,7 +481,8 @@ static bool is_release_function(enum bpf_func_id func_id)
> >  {
> >         return func_id == BPF_FUNC_sk_release ||
> >                func_id == BPF_FUNC_ringbuf_submit ||
> > -              func_id == BPF_FUNC_ringbuf_discard;
> > +              func_id == BPF_FUNC_ringbuf_discard ||
> > +              func_id == BPF_FUNC_kptr_xchg;
> >  }
> >
> >  static bool may_be_acquire_function(enum bpf_func_id func_id)
> > @@ -500,7 +502,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
> >         if (func_id == BPF_FUNC_sk_lookup_tcp ||
> >             func_id == BPF_FUNC_sk_lookup_udp ||
> >             func_id == BPF_FUNC_skc_lookup_tcp ||
> > -           func_id == BPF_FUNC_ringbuf_reserve)
> > +           func_id == BPF_FUNC_ringbuf_reserve ||
> > +           func_id == BPF_FUNC_kptr_xchg)
> >                 return true;
> >
> >         if (func_id == BPF_FUNC_map_lookup_elem &&
> > @@ -3525,6 +3528,12 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
> >         /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> >         reg_name = kernel_type_name(reg->btf, reg->btf_id);
> >
> > +       /* For ref_ptr case, release function check should ensure we get one
> > +        * referenced PTR_TO_BTF_ID, and that its fixed offset is 0. For the
> I don't fully understand why the first sentence in this comment is
> relevant to this function - this seems like it belongs more to
> check_func_arg_reg_off() for the PTR_TO_BTF_ID case?
>

It is just meant to say that some of the checks required for the referenced
kptr case are already covered by the checks done for normal release functions,
so there is no need to redo them here. This function is called both for
bpf_kptr_xchg and for normal BPF insns.
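
To make that concrete, here is roughly how both paths reach
map_kptr_match_type in this series (call chain only):

        check_func_arg()                  /* bpf_kptr_xchg argument */
          -> check_reg_type()
            -> map_kptr_match_type()

        check_mem_access()                /* BPF_STX to a map value */
          -> check_map_kptr_access()
            -> map_kptr_match_type()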

> > +        * normal store of unreferenced kptr, we must ensure var_off is zero.
> > +        * Since ref_ptr cannot be accessed directly by BPF insns, checks for
> > +        * reg->off and reg->ref_obj_id are not needed here.
> > +        */
>
> >         if (__check_ptr_off_reg(env, reg, regno, true))
> >                 return -EACCES;
> >
> > @@ -3557,6 +3566,12 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> >         if (BPF_MODE(insn->code) != BPF_MEM)
> >                 goto end;
> >
> > +       /* We cannot directly access kptr_ref */
> > +       if (off_desc->flags & BPF_MAP_VALUE_OFF_F_REF) {
> > +               verbose(env, "accessing referenced kptr disallowed\n");
> > +               return -EACCES;
> > +       }
> > +
> >         if (class == BPF_LDX) {
> >                 val_reg = reg_state(env, value_regno);
> >                 /* We can simply mark the value_regno receiving the pointer
> > @@ -5278,6 +5293,59 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
> >         return 0;
> >  }
> >
> > +static int process_kptr_func(struct bpf_verifier_env *env, int regno,
> > +                            struct bpf_call_arg_meta *meta)
> > +{
> > +       struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> > +       struct bpf_map_value_off_desc *off_desc;
> > +       struct bpf_map *map_ptr = reg->map_ptr;
> > +       u32 kptr_off;
> > +       int ret;
> > +
> > +       if (!tnum_is_const(reg->var_off)) {
> > +               verbose(env,
> > +                       "R%d doesn't have constant offset. kptr has to be at the constant offset\n",
> > +                       regno);
> > +               return -EINVAL;
> > +       }
> > +       if (!map_ptr->btf) {
> > +               verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
> > +                       map_ptr->name);
> > +               return -EINVAL;
> > +       }
> > +       if (!map_value_has_kptrs(map_ptr)) {
> > +               ret = PTR_ERR(map_ptr->kptr_off_tab);
> > +               if (ret == -E2BIG)
> > +                       verbose(env, "map '%s' has more than %d kptr\n", map_ptr->name,
> > +                               BPF_MAP_VALUE_OFF_MAX);
> > +               else if (ret == -EEXIST)
> > +                       verbose(env, "map '%s' has repeating kptr BTF tags\n", map_ptr->name);
> > +               else
> > +                       verbose(env, "map '%s' has no valid kptr\n", map_ptr->name);
> > +               return -EINVAL;
> > +       }
> > +
> > +       meta->map_ptr = map_ptr;
> > +       /* Check access for BPF_WRITE */
> > +       meta->raw_mode = true;
> > +       ret = check_helper_mem_access(env, regno, sizeof(u64), false, meta);
> Do you need to check access here for both BPF_WRITE and BPF_READ since
> you are also reading the map value when you do the xchg?

I have a fix for this for the next version :). Actually it should be the
opposite: we just need to check that the map is both readable and writable, so
the check should be done with meta->raw_mode = false. We already prevent
BPF_F_RDONLY_PROG from being set, so the map is either BPF_F_WRONLY_PROG or
read/write. In the first case, BPF_READ won't work, which is needed to support
xchg, so we only need to reject that case.
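
In other words, a rough sketch of the intended fix (not the posted code):

        /* Check that the map value slot is readable and writable, i.e. do
         * a plain (non raw-mode) access check.
         */
        meta->raw_mode = false;
        ret = check_helper_mem_access(env, regno, sizeof(u64), false, meta);
        if (ret < 0)
                return ret;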

> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       kptr_off = reg->off + reg->var_off.value;
> > +       off_desc = bpf_map_kptr_off_contains(map_ptr, kptr_off);
> > +       if (!off_desc) {
> > +               verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
> > +               return -EACCES;
> > +       }
> > +       if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
> > +               verbose(env, "off=%d kptr isn't referenced kptr\n", kptr_off);
> > +               return -EACCES;
> > +       }
> > +       meta->kptr_off_desc = off_desc;
> > +       return 0;
> > +}
> > +
> >  static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
> >  {
> >         return base_type(type) == ARG_PTR_TO_MEM ||
> > @@ -5418,6 +5486,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
> >  static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
> >  static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
> >  static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
> > +static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } };
> >
> >  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
> >         [ARG_PTR_TO_MAP_KEY]            = &map_key_value_types,
> > @@ -5445,11 +5514,13 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
> >         [ARG_PTR_TO_STACK]              = &stack_ptr_types,
> >         [ARG_PTR_TO_CONST_STR]          = &const_str_ptr_types,
> >         [ARG_PTR_TO_TIMER]              = &timer_types,
> > +       [ARG_PTR_TO_KPTR]               = &kptr_types,
> >  };
> >
> >  static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> >                           enum bpf_arg_type arg_type,
> > -                         const u32 *arg_btf_id)
> > +                         const u32 *arg_btf_id,
> > +                         struct bpf_call_arg_meta *meta)
> >  {
> >         struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> >         enum bpf_reg_type expected, type = reg->type;
> > @@ -5502,8 +5573,11 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> >                         arg_btf_id = compatible->btf_id;
> >                 }
> >
> > -               if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > -                                         btf_vmlinux, *arg_btf_id)) {
> > +               if (meta->func_id == BPF_FUNC_kptr_xchg) {
> > +                       if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno))
> > +                               return -EACCES;
> > +               } else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > +                                                btf_vmlinux, *arg_btf_id)) {
> >                         verbose(env, "R%d is of type %s but %s is expected\n",
> >                                 regno, kernel_type_name(reg->btf, reg->btf_id),
> >                                 kernel_type_name(btf_vmlinux, *arg_btf_id));
> > @@ -5613,7 +5687,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> >                  */
> >                 goto skip_type_check;
> >
> > -       err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg]);
> > +       err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg], meta);
> >         if (err)
> >                 return err;
> >
> > @@ -5778,6 +5852,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> >                         verbose(env, "string is not zero-terminated\n");
> >                         return -EINVAL;
> >                 }
> > +       } else if (arg_type == ARG_PTR_TO_KPTR) {
> > +               if (process_kptr_func(env, regno, meta))
> > +                       return -EACCES;
> >         }
> >
> >         return err;
> > @@ -6120,10 +6197,10 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
> >         int i;
> >
> >         for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> > -               if (fn->arg_type[i] == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
> > +               if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
> >                         return false;
> >
> > -               if (fn->arg_type[i] != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
> > +               if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
> >                         return false;
> >         }
> >
> > @@ -7007,21 +7084,25 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >                         regs[BPF_REG_0].btf_id = meta.ret_btf_id;
> >                 }
> >         } else if (base_type(ret_type) == RET_PTR_TO_BTF_ID) {
> > +               struct btf *ret_btf;
> >                 int ret_btf_id;
> >
> >                 mark_reg_known_zero(env, regs, BPF_REG_0);
> >                 regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
> > -               ret_btf_id = *fn->ret_btf_id;
> > +               if (func_id == BPF_FUNC_kptr_xchg) {
> > +                       ret_btf = meta.kptr_off_desc->btf;
> > +                       ret_btf_id = meta.kptr_off_desc->btf_id;
> > +               } else {
> > +                       ret_btf = btf_vmlinux;
> > +                       ret_btf_id = *fn->ret_btf_id;
> > +               }
> >                 if (ret_btf_id == 0) {
> >                         verbose(env, "invalid return type %u of func %s#%d\n",
> >                                 base_type(ret_type), func_id_name(func_id),
> >                                 func_id);
> >                         return -EINVAL;
> >                 }
> > -               /* current BPF helper definitions are only coming from
> > -                * built-in code with type IDs from  vmlinux BTF
> > -                */
> > -               regs[BPF_REG_0].btf = btf_vmlinux;
> > +               regs[BPF_REG_0].btf = ret_btf;
> >                 regs[BPF_REG_0].btf_id = ret_btf_id;
> >         } else {
> >                 verbose(env, "unknown return type %u of func %s#%d\n",
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index d14b10b85e51..444fe6f1cf35 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -5143,6 +5143,17 @@ union bpf_attr {
> >   *             The **hash_algo** is returned on success,
> >   *             **-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
> >   *             invalid arguments are passed.
> > + *
> > + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> > + *     Description
> > + *             Exchange kptr at pointer *map_value* with *ptr*, and return the
> > + *             old value. *ptr* can be NULL, otherwise it must be a referenced
> > + *             pointer which will be released when this helper is called.
> > + *     Return
> > + *             The old value of kptr (which can be NULL). The returned pointer,
> > + *             if not NULL, is a reference which must be released using its
> > + *             corresponding release function, or moved into a BPF map before
> > + *             program exit.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -5339,6 +5350,7 @@ union bpf_attr {
> >         FN(copy_from_user_task),        \
> >         FN(skb_set_tstamp),             \
> >         FN(ima_file_hash),              \
> > +       FN(kptr_xchg),                  \
> >         /* */
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > --
> > 2.35.1
> >

--
Kartikeya

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map
  2022-04-09  9:32 ` [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
  2022-04-12  0:32   ` Joanne Koong
@ 2022-04-13  5:41   ` kernel test robot
  1 sibling, 0 replies; 29+ messages in thread
From: kernel test robot @ 2022-04-13  5:41 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: kbuild-all, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Hi Kumar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Kumar-Kartikeya-Dwivedi/Introduce-typed-pointer-support-in-BPF-maps/20220409-173513
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: microblaze-randconfig-c024-20220408 (https://download.01.org/0day-ci/archive/20220413/202204131252.o56DuHxd-lkp@intel.com/config)
compiler: microblaze-linux-gcc (GCC) 11.2.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


cocci warnings: (new ones prefixed by >>)
>> kernel/bpf/syscall.c:530:11-18: WARNING opportunity for kmemdup

Please review and possibly fold the followup patch.

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map
  2022-04-12 23:56       ` Joanne Koong
@ 2022-04-13  5:50         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-13  5:50 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Wed, Apr 13, 2022 at 05:26:12AM IST, Joanne Koong wrote:
> On Tue, Apr 12, 2022 at 12:16 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Tue, Apr 12, 2022 at 06:02:11AM IST, Joanne Koong wrote:
> > > On Sat, Apr 9, 2022 at 6:18 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > >
> > > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > > in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> > > > its invocation. When storing to such a kptr, the BPF program's
> > > > PTR_TO_BTF_ID register must have the same type as in the map value's
> > > > BTF, and loading a kptr marks the destination register as PTR_TO_BTF_ID
> > > > with the correct kernel BTF and BTF ID.
> > > >
> > > > Such kptrs are unreferenced, i.e. by the time another invocation of the
> > > > BPF program loads this pointer, the object which the pointer points to
> > > > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > > patched to PROBE_MEM loads by the verifier, it is safe to allow the
> > > > user to still access such an invalid pointer, but passing such pointers
> > > > into BPF helpers and kfuncs should not be permitted. A future patch in
> > > > this series will close this gap.
> > > >
> > > > The flexibility offered by allowing programs to dereference such invalid
> > > > pointers while being safe at runtime frees the verifier from doing
> > > > complex lifetime tracking. As long as the user ensures that the object
> > > > remains valid, the data the program reads from the kernel object is
> > > > valid.
> > > >
> > > > The user indicates that a certain pointer must be treated as a kptr
> > > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > > a BTF type tag 'kptr' on the pointed-to type of the pointer. Then, this
> > > > information is recorded in the object BTF which will be passed into the
> > > > kernel by way of the map's BTF information. The name and kind from the
> > > > map value BTF are used to look up the in-kernel type, and the actual BTF
> > > > and BTF ID are recorded in the map struct in a new kptr_off_tab member.
> > > > For now, only storing pointers to structs is permitted.
> > > >
> > > > An example of this specification is shown below:
> > > >
> > > >         #define __kptr __attribute__((btf_type_tag("kptr")))
> > > >
> > > >         struct map_value {
> > > >                 ...
> > > >                 struct task_struct __kptr *task;
> > > >                 ...
> > > >         };
> > > >
> > > > Then, in a BPF program, the user may store a PTR_TO_BTF_ID with the
> > > > type task_struct into the map, and then load it later.
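
For illustration, usage could look roughly like this (a sketch reusing the
map_value definition above; the array map definition and program name are
made up):

        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        SEC("tp_btf/task_newtask")
        int BPF_PROG(store_task, struct task_struct *task, u64 clone_flags)
        {
                struct map_value *v;
                int key = 0;

                v = bpf_map_lookup_elem(&array_map, &key);
                if (!v)
                        return 0;
                /* store: src reg must be a PTR_TO_BTF_ID of task_struct */
                v->task = task;
                /* load: dst reg is marked PTR_TO_BTF_ID_OR_NULL */
                task = v->task;
                return 0;
        }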
> > > >
> > > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL; as
> > > > the verifier cannot statically know whether the value is NULL, it must
> > > > treat all potential loads at that map value offset as loading a
> > > > possibly NULL pointer.
> > > >
> > > > BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL) are
> > > > the only instructions allowed to access such a pointer. On BPF_LDX, the
> > > > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > > > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > > > the same BTF type as specified in the map BTF. The access size must
> > > > always be BPF_DW.
> > > >
> > > > For the map in map support, the kptr_off_tab for the outer map is
> > > > copied from the inner map's kptr_off_tab. A deep copy was chosen
> > > > instead of introducing a refcount to kptr_off_tab, because the copy
> > > > only needs to be done when parameterizing using inner_map_fd in the map
> > > > in map case, hence a refcount would be unnecessary for all other users.
> > > >
> > > > It is not permitted to use the MAP_FREEZE command or mmap for a BPF map
> > > > having kptrs, similar to the bpf_timer case.
> > > >
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > >  include/linux/bpf.h     |  29 +++++++-
> > > >  include/linux/btf.h     |   2 +
> > > >  kernel/bpf/btf.c        | 160 ++++++++++++++++++++++++++++++++++------
> > > >  kernel/bpf/map_in_map.c |   5 +-
> > > >  kernel/bpf/syscall.c    | 114 +++++++++++++++++++++++++++-
> > > >  kernel/bpf/verifier.c   | 116 ++++++++++++++++++++++++++++-
> > > >  6 files changed, 399 insertions(+), 27 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index bdb5298735ce..e267db260cb7 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -155,6 +155,22 @@ struct bpf_map_ops {
> > > >         const struct bpf_iter_seq_info *iter_seq_info;
> > > >  };
> > > >
> > > > +enum {
> > > > +       /* Support at most 8 pointers in a BPF map value */
> > > > +       BPF_MAP_VALUE_OFF_MAX = 8,
> > > > +};
> > > nit: should this be a typedef instead of an enum?
> >
> > typedef? Do you mean #define? I prefer enum constants as they get emitted to
> > BTF.
> Yeah I meant #define, not typedef :) Oh I see - out of curiosity since
> I'm still getting acquainted with BTF, what is the utility of the enum
> constant getting emitted to the vmlinux BTF? For more detailed
> debuggability?
>

Yes, once it becomes visible in BTF it will be available from vmlinux.h, so
instead of hardcoding the value you can refer to the enum constant by name.
libbpf also relocates it to the correct value on load, so overall it is more
convenient.
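
For example, in a BPF program (a sketch; assumes a vmlinux.h generated with
bpftool gen):

        #include "vmlinux.h"

        /* no hardcoded 8, the constant comes from kernel BTF */
        __u32 kptr_offsets[BPF_MAP_VALUE_OFF_MAX];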

> > [...]
> > > > +       if (map_value_has_kptrs(map)) {
> > > > +               if (!bpf_capable()) {
> > > > +                       ret = -EPERM;
> > > > +                       goto free_map_tab;
> > > > +               }
> > > > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > > Why is it an error if BPF_F_RDONLY_PROG is set? Maybe I'm
> > > misunderstanding what BPF_F_RDONLY_PROG means, but why can't a program
> > > have read-only access to the kptr value?
> >
> > It would be useless; a kptr can only be set from inside a BPF program.
> If the kptr is embedded inside a larger struct, couldn't there be use
> cases where a program wants to read the other fields in this struct
> that have been updated by the userspace application?

It should already be able to read such fields without this flag, right? This
flag removes write permissions to the map from the BPF program side, which
would make the map useless for the purposes of kptr, since setting a kptr
requires either BPF_STX/BPF_ST for the unreferenced case, or bpf_kptr_xchg for
the referenced case.
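
Both ways of setting a kptr boil down to a write to the map value, e.g.
(field names made up):

        v->unref_task = task;                    /* BPF_STX, unreferenced */
        old = bpf_kptr_xchg(&v->ref_task, task); /* referenced */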

> >
> > > > +                       ret = -EACCES;
> > > > +                       goto free_map_tab;
> > > > +               }
> > > > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > > > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > > > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> > > Out of curiosity, do you also plan to add kptr support in the future
> > > to local storage maps as well?
> >
> > Yes, those and percpu maps are on the TODO list.
> Awesome!!
> >
> > > > +                       ret = -EOPNOTSUPP;
> > > > +                       goto free_map_tab;
> > > > +               }
> > > > +       }
> > > > +
> > > > +       if (map->ops->map_check_btf) {
> > > >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > > > +               if (ret < 0)
> > > > +                       goto free_map_tab;
> > > > +       }
> > > >
> > > > +       return ret;
> > > > +free_map_tab:
> > > > +       bpf_map_free_kptr_off_tab(map);
> > > >         return ret;
> > > >  }
> > > >
> > > > @@ -1639,7 +1747,7 @@ static int map_freeze(const union bpf_attr *attr)
> > > >                 return PTR_ERR(map);
> > > >
> > > >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > > > -           map_value_has_timer(map)) {
> > > > +           map_value_has_timer(map) || map_value_has_kptrs(map)) {
> > > >                 fdput(f);
> > > >                 return -ENOTSUPP;
> > > >         }
> > >
> > > Maybe I'm missing something, but I'm not seeing it in this patch - do
> > > we also need to add checks that prohibit userspace programs from
> > > trying to do bpf_map_update_elem syscalls that manipulate kptr map
> > > values?
> >
> > Userspace should be allowed to do bpf_map_update_elem, whether the map value
> > has timers, spin_lock, kptrs, or dynptrs in the future. copy_map_value will
> > skip over these fields when updating the map value. See patch 7.
>
> Within the context of this patch, a userspace program can do
> bpf_map_update_elem, put in some unsafe value for the kptr, which will
> cause the bpf program to crash the kernel when it accesses that value.
> That was my main concern, but since this is going to be addressed in
> your patch 7, I don't think this matters then.
>

Yep, patch 7 is already doing a lot, so I thought it would be best to split
this out into its own patch to make review easier.

> >
> > >
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index 71827d14724a..01d45c5010f9 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -3507,6 +3507,83 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> > > >         return __check_ptr_off_reg(env, reg, regno, false);
> > > >  }
> > > >
> > > > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > > > +                              struct bpf_map_value_off_desc *off_desc,
> > > > +                              struct bpf_reg_state *reg, u32 regno)
> > > > +{
> > > > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > > > +       const char *reg_name = "";
> > > > +
> > > > +       if (base_type(reg->type) != PTR_TO_BTF_ID || type_flag(reg->type) != PTR_MAYBE_NULL)
> > > > +               goto bad_type;
> > > > +
> > > > +       if (!btf_is_kernel(reg->btf)) {
> > > > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > > > +               return -EINVAL;
> > > > +       }
> > > > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > > > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > > > +
> > > > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > > > +               return -EACCES;
> > > > +
> > > > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > > > +                                 off_desc->btf, off_desc->btf_id))
> > > > +               goto bad_type;
> > > > +       return 0;
> > > > +bad_type:
> > > > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > > > +               reg_type_str(env, reg->type), reg_name);
> > > > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> > > > +       return -EINVAL;
> > > > +}
> > > > +
> > > > +static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> > > > +                                int value_regno, int insn_idx,
> > > > +                                struct bpf_map_value_off_desc *off_desc)
> > > > +{
> > > > +       struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
> > > > +       int class = BPF_CLASS(insn->code);
> > > > +       struct bpf_reg_state *val_reg;
> > > > +
> > > > +       /* Things we already checked for in check_map_access and caller:
> > > > +        *  - Reject cases where variable offset may touch kptr
> > > > +        *  - size of access (must be BPF_DW)
> > > > +        *  - tnum_is_const(reg->var_off)
> > > > +        *  - off_desc->offset == off + reg->var_off.value
> > > > +        */
> > > > +       /* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
> > > > +       if (BPF_MODE(insn->code) != BPF_MEM)
> > > > +               goto end;
> > > I think this needs its own verbose statement - the one in end: doesn't
> > > seem to match this error
> >
> > Maybe we should say BPF_LDX_MEM, BPF_STX_MEM, BPF_ST_MEM?
> I think it'd be clearest if there were separate error messages for the
> case where the program is using a different mode than BPF_MEM vs. the
> program using an unsupported instruction class.

Ok, I'll add it.
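
Roughly something like this (wording tentative):

        /* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
        if (BPF_MODE(insn->code) != BPF_MEM) {
                verbose(env, "kptr in map can only be accessed with BPF_MEM mode\n");
                return -EACCES;
        }
        ...
        /* an unsupported instruction class falls through to: */
        verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
        return -EACCES;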

> >
> > > > +
> > > > +       if (class == BPF_LDX) {
> > > > +               val_reg = reg_state(env, value_regno);
> > > > +               /* We can simply mark the value_regno receiving the pointer
> > > > +                * value from map as PTR_TO_BTF_ID, with the correct type.
> > > > +                */
> > > > +               mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
> > > > +                               off_desc->btf_id, PTR_MAYBE_NULL);
> > > > +               val_reg->id = ++env->id_gen;
> > > > +       } else if (class == BPF_STX) {
> > > > +               val_reg = reg_state(env, value_regno);
> > > > +               if (!register_is_null(val_reg) &&
> > > > +                   map_kptr_match_type(env, off_desc, val_reg, value_regno))
> > > > +                       return -EACCES;
> > > > +       } else if (class == BPF_ST) {
> > > > +               if (insn->imm) {
> > > > +                       verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> > > > +                               off_desc->offset);
> > > > +                       return -EACCES;
> > > > +               }
> > > > +       } else {
> > > > +               goto end;
> > > > +       }
> > > > +       return 0;
> > > > +end:
> > > > +       verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
> > > > +       return -EACCES;
> > > > +}
> > > > +
> > > >  /* check read/write into a map element with possible variable offset */
> > > >  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> > > >                             int off, int size, bool zero_size_allowed)
> > > > @@ -3545,6 +3622,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> > > >                         return -EACCES;
> > > >                 }
> > > >         }
> > > > +       if (map_value_has_kptrs(map)) {
> > > > +               struct bpf_map_value_off *tab = map->kptr_off_tab;
> > > > +               int i;
> > > > +
> > > > +               for (i = 0; i < tab->nr_off; i++) {
> > > > +                       u32 p = tab->off[i].offset;
> > > > +
> > > > +                       if (reg->smin_value + off < p + sizeof(u64) &&
> > > > +                           p < reg->umax_value + off + size) {
> > > > +                               if (!tnum_is_const(reg->var_off)) {
> > > > +                                       verbose(env, "kptr access cannot have variable offset\n");
> > > > +                                       return -EACCES;
> > > > +                               }
> > > > +                               if (p != off + reg->var_off.value) {
> > > > +                                       verbose(env, "kptr access misaligned expected=%u off=%llu\n",
> > > > +                                               p, off + reg->var_off.value);
> > > > +                                       return -EACCES;
> > > > +                               }
> > > > +                               if (size != bpf_size_to_bytes(BPF_DW)) {
> > > > +                                       verbose(env, "kptr access size must be BPF_DW\n");
> > > > +                                       return -EACCES;
> > > > +                               }
> > > > +                               break;
> > > > +                       }
> > > > +               }
> > > > +       }
> > > >         return err;
> > > >  }
> > > >
> > > > @@ -4412,6 +4515,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > > >                 if (value_regno >= 0)
> > > >                         mark_reg_unknown(env, regs, value_regno);
> > > >         } else if (reg->type == PTR_TO_MAP_VALUE) {
> > > > +               struct bpf_map_value_off_desc *off_desc = NULL;
> > > > +
> > > >                 if (t == BPF_WRITE && value_regno >= 0 &&
> > > >                     is_pointer_value(env, value_regno)) {
> > > >                         verbose(env, "R%d leaks addr into map\n", value_regno);
> > > > @@ -4421,7 +4526,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > > >                 if (err)
> > > >                         return err;
> > > >                 err = check_map_access(env, regno, off, size, false);
> > > > -               if (!err && t == BPF_READ && value_regno >= 0) {
> > > > +               if (err)
> > > > +                       return err;
> > > > +               if (tnum_is_const(reg->var_off))
> > > > +                       off_desc = bpf_map_kptr_off_contains(reg->map_ptr,
> > > > +                                                            off + reg->var_off.value);
> > > > +               if (off_desc) {
> > > I think this logic would be a little clearer if you renamed off_desc
> > > to kptr_off_desc to denote that this only applies to kptrs.
> >
> > Ok, will change.
> >
> > > > +                       err = check_map_kptr_access(env, regno, value_regno, insn_idx, off_desc);
> > > > +                       if (err)
> > > > +                               return err;
> > > I don't think you need this if check - it'll return err by default at
> > > the end of the function.
> >
> > Right, will drop this.
> >
> > > > +               } else if (t == BPF_READ && value_regno >= 0) {
> > > >                         struct bpf_map *map = reg->map_ptr;
> > > >
> > > >                         /* if map is read-only, track its contents as scalars */
> > > > --
> > > > 2.35.1
> > > >
> >
> > --
> > Kartikeya

--
Kartikeya

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v4 05/13] bpf: Allow storing referenced kptr in map
  2022-04-13  5:36     ` Kumar Kartikeya Dwivedi
@ 2022-04-13  5:54       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 29+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-04-13  5:54 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Wed, Apr 13, 2022 at 11:06:04AM IST, Kumar Kartikeya Dwivedi wrote:
> On Wed, Apr 13, 2022 at 04:35:11AM IST, Joanne Koong wrote:
> > On Mon, Apr 11, 2022 at 12:25 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > Extending the code in the previous commit, introduce referenced kptr
> > > support, which needs to be tagged using the 'kptr_ref' tag instead.
> > > Unlike unreferenced kptrs, referenced kptrs have a lot more restrictions.
> > > In addition to the type matching, only the newly introduced bpf_kptr_xchg
> > > helper is allowed to modify the map value at that offset. This transfers
> > > the referenced pointer being stored into the map, releases the reference
> > > state for the program, returns the old value, and creates new reference
> > > state for the returned pointer.
> > >
> > > Similar to the unreferenced pointer case, the return value for this case
> > > will also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned
> > > pointer must either be eventually released by calling the corresponding
> > > release function, or be transferred into another map.
> > >
> > > It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
> > > the value, and obtain the old value if any.
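
For example (a sketch; the release function name here is hypothetical):

        old = bpf_kptr_xchg(&v->ref_task, NULL); /* clear, take old ref if any */
        if (old)
                release_task_ref(old); /* or move it into another map */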
> > >
> > > BPF_LDX, BPF_STX, and BPF_ST cannot access a referenced kptr. A future
> > > commit will permit using BPF_LDX for such pointers, while attempting to
> > > make it safe, since the lifetime of the object won't be guaranteed.
> > >
> > > There are valid reasons to enforce the restriction of permitting only
> > > bpf_kptr_xchg to operate on a referenced kptr. The pointer value must be
> > > consistent in the face of concurrent modification, and any prior value
> > > contained in the map must be released before a new one is moved into the
> > > map. To ensure proper transfer of this ownership, bpf_kptr_xchg returns
> > > the old value, which the verifier requires the user to either free or
> > > move into another map, and releases the reference held for the pointer
> > > being moved in.
> > >
> > > In the future, the direct BPF_XCHG instruction may also be permitted to
> > > work like the bpf_kptr_xchg helper.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  include/linux/bpf.h            |   7 +++
> > >  include/uapi/linux/bpf.h       |  12 ++++
> > >  kernel/bpf/btf.c               |  10 ++-
> > >  kernel/bpf/helpers.c           |  21 +++++++
> > >  kernel/bpf/verifier.c          | 107 +++++++++++++++++++++++++++++----
> > >  tools/include/uapi/linux/bpf.h |  12 ++++
> > >  6 files changed, 155 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index a6d1982e8118..bd682c29883a 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -160,10 +160,15 @@ enum {
> > >         BPF_MAP_VALUE_OFF_MAX = 8,
> > >  };
> > >
> > > +enum {
> > > +       BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> > I personally find this name very confusing. What does the "f" in
> > "OFF_F_REF" stand for? Also, I think adding in a comment for this
> > would help out future readers.
>
> The f is for 'flag', similar to e.g. BPF_F_LOCK. I'll add a comment.
> Also, it can probably be a type rather than a flag.
>
> > > +};
> > > +
> > >  struct bpf_map_value_off_desc {
> > >         u32 offset;
> > >         u32 btf_id;
> > >         struct btf *btf;
> > > +       int flags;
> > It's unclear from reading this what flags refers to. Maybe adding a
> > comment here that flags holds values from the enum above (and maybe we
> > should give the enum a more descriptive name?) would make it clearer?
>
> I'll give the enum a name and rename the field to 'type'. Then, for the
> dynptr case, you can add other types of dynptr that can be embedded in the
> map value.
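
Something along these lines (names tentative):

        enum bpf_kptr_type {
                BPF_KPTR_UNREF, /* unreferenced kptr, BTF tag "kptr" */
                BPF_KPTR_REF,   /* referenced kptr, BTF tag "kptr_ref" */
        };

        struct bpf_map_value_off_desc {
                u32 offset;
                u32 btf_id;
                struct btf *btf;
                enum bpf_kptr_type type;
        };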
>
> > >  };
> > >
> > >  struct bpf_map_value_off {
> > > @@ -416,6 +421,7 @@ enum bpf_arg_type {
> > >         ARG_PTR_TO_STACK,       /* pointer to stack */
> > >         ARG_PTR_TO_CONST_STR,   /* pointer to a null terminated read-only string */
> > >         ARG_PTR_TO_TIMER,       /* pointer to bpf_timer */
> > > +       ARG_PTR_TO_KPTR,        /* pointer to kptr */
> > This is only a "pointer to a referenced kptr", correct?
>
> Yes. I'll update the comment.
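
i.e. something like:

        ARG_PTR_TO_KPTR,        /* pointer to referenced kptr */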
>
> > >         __BPF_ARG_TYPE_MAX,
> > >
> > >         /* Extended arg_types. */
> > > @@ -425,6 +431,7 @@ enum bpf_arg_type {
> > >         ARG_PTR_TO_SOCKET_OR_NULL       = PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
> > >         ARG_PTR_TO_ALLOC_MEM_OR_NULL    = PTR_MAYBE_NULL | ARG_PTR_TO_ALLOC_MEM,
> > >         ARG_PTR_TO_STACK_OR_NULL        = PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
> > > +       ARG_PTR_TO_BTF_ID_OR_NULL       = PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
> > >
> > >         /* This must be the last entry. Its purpose is to ensure the enum is
> > >          * wide enough to hold the higher bits reserved for bpf_type_flag.
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index d14b10b85e51..444fe6f1cf35 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -5143,6 +5143,17 @@ union bpf_attr {
> > >   *             The **hash_algo** is returned on success,
> > >   *             **-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
> > >   *             invalid arguments are passed.
> > > + *
> > > + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> > > + *     Description
> > > + *             Exchange kptr at pointer *map_value* with *ptr*, and return the
> > > + *             old value. *ptr* can be NULL, otherwise it must be a referenced
> > > + *             pointer which will be released when this helper is called.
> > > + *     Return
> > > + *             The old value of kptr (which can be NULL). The returned pointer,
> > > + *             if not NULL, is a reference which must be released using its
> > > + *             corresponding release function, or moved into a BPF map before
> > > + *             program exit.
> > >   */
> > >  #define __BPF_FUNC_MAPPER(FN)          \
> > >         FN(unspec),                     \
> > > @@ -5339,6 +5350,7 @@ union bpf_attr {
> > >         FN(copy_from_user_task),        \
> > >         FN(skb_set_tstamp),             \
> > >         FN(ima_file_hash),              \
> > > +       FN(kptr_xchg),                  \
> > >         /* */
> > >
> > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index 28b1d9e9124e..43ea9ed5652e 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -3177,6 +3177,7 @@ enum {
> > >  struct btf_field_info {
> > >         u32 type_id;
> > >         u32 off;
> > > +       int flags;
> > >  };
> > >
> > >  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > > @@ -3194,6 +3195,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > >                                u32 off, int sz, struct btf_field_info *info)
> > >  {
> > >         u32 res_id;
> > > +       int flags;
> > >
> > >         /* For PTR, sz is always == 8 */
> > >         if (!btf_type_is_ptr(t))
> > > @@ -3205,7 +3207,11 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > >         /* Reject extra tags */
> > >         if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > >                 return -EINVAL;
> > > -       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > +       if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > +               flags = 0;
> > > +       else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off)))
> > > +               flags = BPF_MAP_VALUE_OFF_F_REF;
> > > +       else
> > >                 return -EINVAL;
> > >
> > >         /* Get the base type */
> > > @@ -3216,6 +3222,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > >
> > >         info->type_id = res_id;
> > >         info->off = off;
> > > +       info->flags = flags;
> > >         return BTF_FIELD_FOUND;
> > >  }
> > >
> > > @@ -3420,6 +3427,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > >                 tab->off[i].offset = info_arr[i].off;
> > >                 tab->off[i].btf_id = id;
> > >                 tab->off[i].btf = off_btf;
> > > +               tab->off[i].flags = info_arr[i].flags;
> > >                 tab->nr_off = i + 1;
> > >         }
> > >         return tab;
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index 315053ef6a75..a437d0f0458a 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> > > @@ -1374,6 +1374,25 @@ void bpf_timer_cancel_and_free(void *val)
> > >         kfree(t);
> > >  }
> > >
> > > +BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
> > > +{
> > > +       unsigned long *kptr = map_value;
> > > +
> > > +       return xchg(kptr, (unsigned long)ptr);
> > > +}
> > > +
> > > +static u32 bpf_kptr_xchg_btf_id;
> > > +
> > > +const struct bpf_func_proto bpf_kptr_xchg_proto = {
> > > +       .func         = bpf_kptr_xchg,
> > > +       .gpl_only     = false,
> > > +       .ret_type     = RET_PTR_TO_BTF_ID_OR_NULL,
> > > +       .ret_btf_id   = &bpf_kptr_xchg_btf_id,
> > > +       .arg1_type    = ARG_PTR_TO_KPTR,
> > > +       .arg2_type    = ARG_PTR_TO_BTF_ID_OR_NULL | PTR_RELEASE,
> > > +       .arg2_btf_id  = &bpf_kptr_xchg_btf_id,
> > > +};
> > > +
> > >  const struct bpf_func_proto bpf_get_current_task_proto __weak;
> > >  const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
> > >  const struct bpf_func_proto bpf_probe_read_user_proto __weak;
> > > @@ -1452,6 +1471,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > >                 return &bpf_timer_start_proto;
> > >         case BPF_FUNC_timer_cancel:
> > >                 return &bpf_timer_cancel_proto;
> > > +       case BPF_FUNC_kptr_xchg:
> > > +               return &bpf_kptr_xchg_proto;
> > >         default:
> > >                 break;
> > >         }
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 6cc08526e049..92efe6c3999c 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -258,6 +258,7 @@ struct bpf_call_arg_meta {
> > >         struct btf *ret_btf;
> > >         u32 ret_btf_id;
> > >         u32 subprogno;
> > > +       struct bpf_map_value_off_desc *kptr_off_desc;
> > >  };
> > >
> > >  struct btf *btf_vmlinux;
> > > @@ -480,7 +481,8 @@ static bool is_release_function(enum bpf_func_id func_id)
> > >  {
> > >         return func_id == BPF_FUNC_sk_release ||
> > >                func_id == BPF_FUNC_ringbuf_submit ||
> > > -              func_id == BPF_FUNC_ringbuf_discard;
> > > +              func_id == BPF_FUNC_ringbuf_discard ||
> > > +              func_id == BPF_FUNC_kptr_xchg;
> > >  }
> > >
> > >  static bool may_be_acquire_function(enum bpf_func_id func_id)
> > > @@ -500,7 +502,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
> > >         if (func_id == BPF_FUNC_sk_lookup_tcp ||
> > >             func_id == BPF_FUNC_sk_lookup_udp ||
> > >             func_id == BPF_FUNC_skc_lookup_tcp ||
> > > -           func_id == BPF_FUNC_ringbuf_reserve)
> > > +           func_id == BPF_FUNC_ringbuf_reserve ||
> > > +           func_id == BPF_FUNC_kptr_xchg)
> > >                 return true;
> > >
> > >         if (func_id == BPF_FUNC_map_lookup_elem &&
> > > @@ -3525,6 +3528,12 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
> > >         /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > >         reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > >
> > > +       /* For ref_ptr case, release function check should ensure we get one
> > > +        * referenced PTR_TO_BTF_ID, and that its fixed offset is 0. For the
> > I don't fully understand why the first sentence in this comment is
> > relevant to this function - this seems like it belongs more to
> > check_func_arg_reg_off() for the PTR_TO_BTF_ID case?
> >
>
> It is just meant to say that some of the checks that are required for referenced
> kptr case are already covered by those for normal release functions, so it is
> not required to redo them here. This function is called for both bpf_kptr_xchg
> and normal BPF insns.
>
> > > +        * normal store of unreferenced kptr, we must ensure var_off is zero.
> > > +        * Since ref_ptr cannot be accessed directly by BPF insns, checks for
> > > +        * reg->off and reg->ref_obj_id are not needed here.
> > > +        */
> >
> > >         if (__check_ptr_off_reg(env, reg, regno, true))
> > >                 return -EACCES;
> > >
> > > @@ -3557,6 +3566,12 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> > >         if (BPF_MODE(insn->code) != BPF_MEM)
> > >                 goto end;
> > >
> > > +       /* We cannot directly access kptr_ref */
> > > +       if (off_desc->flags & BPF_MAP_VALUE_OFF_F_REF) {
> > > +               verbose(env, "accessing referenced kptr disallowed\n");
> > > +               return -EACCES;
> > > +       }
> > > +
> > >         if (class == BPF_LDX) {
> > >                 val_reg = reg_state(env, value_regno);
> > >                 /* We can simply mark the value_regno receiving the pointer
> > > @@ -5278,6 +5293,59 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
> > >         return 0;
> > >  }
> > >
> > > +static int process_kptr_func(struct bpf_verifier_env *env, int regno,
> > > +                            struct bpf_call_arg_meta *meta)
> > > +{
> > > +       struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> > > +       struct bpf_map_value_off_desc *off_desc;
> > > +       struct bpf_map *map_ptr = reg->map_ptr;
> > > +       u32 kptr_off;
> > > +       int ret;
> > > +
> > > +       if (!tnum_is_const(reg->var_off)) {
> > > +               verbose(env,
> > > +                       "R%d doesn't have constant offset. kptr has to be at the constant offset\n",
> > > +                       regno);
> > > +               return -EINVAL;
> > > +       }
> > > +       if (!map_ptr->btf) {
> > > +               verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
> > > +                       map_ptr->name);
> > > +               return -EINVAL;
> > > +       }
> > > +       if (!map_value_has_kptrs(map_ptr)) {
> > > +               ret = PTR_ERR(map_ptr->kptr_off_tab);
> > > +               if (ret == -E2BIG)
> > > +                       verbose(env, "map '%s' has more than %d kptr\n", map_ptr->name,
> > > +                               BPF_MAP_VALUE_OFF_MAX);
> > > +               else if (ret == -EEXIST)
> > > +                       verbose(env, "map '%s' has repeating kptr BTF tags\n", map_ptr->name);
> > > +               else
> > > +                       verbose(env, "map '%s' has no valid kptr\n", map_ptr->name);
> > > +               return -EINVAL;
> > > +       }
> > > +
> > > +       meta->map_ptr = map_ptr;
> > > +       /* Check access for BPF_WRITE */
> > > +       meta->raw_mode = true;
> > > +       ret = check_helper_mem_access(env, regno, sizeof(u64), false, meta);
> > Do you need to check access here for both BPF_WRITE and BPF_READ since
> > you are also reading the map value when you do the xchg?
>
> I have a fix for this for the next version :). Actually it should be the
> opposite, that is we just need to check map is both read/write. So it should do
> a check with meta->raw_mode = false. We already prevent BPF_F_RDONLY_PROG from
> being set, so map is either BPF_F_WRONLY_PROG, or both read/write. In the first
> case, BPF_READ won't work, which is needed to support xchg, so we only need to
> reject that case.
>

Easiest would be to just reject BPF_F_WRONLY_PROG at MAP_CREATE, and drop the
check_helper_mem_access call here completely.
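
i.e. roughly, extending the existing BPF_F_RDONLY_PROG rejection in
map_check_btf (a sketch):

        if (map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) {
                ret = -EACCES;
                goto free_map_tab;
        }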

> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       kptr_off = reg->off + reg->var_off.value;
> > > +       off_desc = bpf_map_kptr_off_contains(map_ptr, kptr_off);
> > > +       if (!off_desc) {
> > > +               verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
> > > +               return -EACCES;
> > > +       }
> > > +       if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
> > > +               verbose(env, "off=%d kptr isn't referenced kptr\n", kptr_off);
> > > +               return -EACCES;
> > > +       }
> > > +       meta->kptr_off_desc = off_desc;
> > > +       return 0;
> > > +}
> > > +
> > >  static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
> > >  {
> > >         return base_type(type) == ARG_PTR_TO_MEM ||
> > > @@ -5418,6 +5486,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
> > >  static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
> > >  static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
> > >  static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
> > > +static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } };
> > >
> > >  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
> > >         [ARG_PTR_TO_MAP_KEY]            = &map_key_value_types,
> > > @@ -5445,11 +5514,13 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
> > >         [ARG_PTR_TO_STACK]              = &stack_ptr_types,
> > >         [ARG_PTR_TO_CONST_STR]          = &const_str_ptr_types,
> > >         [ARG_PTR_TO_TIMER]              = &timer_types,
> > > +       [ARG_PTR_TO_KPTR]               = &kptr_types,
> > >  };
> > >
> > >  static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> > >                           enum bpf_arg_type arg_type,
> > > -                         const u32 *arg_btf_id)
> > > +                         const u32 *arg_btf_id,
> > > +                         struct bpf_call_arg_meta *meta)
> > >  {
> > >         struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> > >         enum bpf_reg_type expected, type = reg->type;
> > > @@ -5502,8 +5573,11 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> > >                         arg_btf_id = compatible->btf_id;
> > >                 }
> > >
> > > -               if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > > -                                         btf_vmlinux, *arg_btf_id)) {
> > > +               if (meta->func_id == BPF_FUNC_kptr_xchg) {
> > > +                       if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno))
> > > +                               return -EACCES;
> > > +               } else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > > +                                                btf_vmlinux, *arg_btf_id)) {
> > >                         verbose(env, "R%d is of type %s but %s is expected\n",
> > >                                 regno, kernel_type_name(reg->btf, reg->btf_id),
> > >                                 kernel_type_name(btf_vmlinux, *arg_btf_id));
> > > @@ -5613,7 +5687,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > >                  */
> > >                 goto skip_type_check;
> > >
> > > -       err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg]);
> > > +       err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg], meta);
> > >         if (err)
> > >                 return err;
> > >
> > > @@ -5778,6 +5852,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > >                         verbose(env, "string is not zero-terminated\n");
> > >                         return -EINVAL;
> > >                 }
> > > +       } else if (arg_type == ARG_PTR_TO_KPTR) {
> > > +               if (process_kptr_func(env, regno, meta))
> > > +                       return -EACCES;
> > >         }
> > >
> > >         return err;
> > > @@ -6120,10 +6197,10 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
> > >         int i;
> > >
> > >         for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> > > -               if (fn->arg_type[i] == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
> > > +               if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
> > >                         return false;
> > >
> > > -               if (fn->arg_type[i] != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
> > > +               if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
> > >                         return false;
> > >         }
> > >
> > > @@ -7007,21 +7084,25 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >                         regs[BPF_REG_0].btf_id = meta.ret_btf_id;
> > >                 }
> > >         } else if (base_type(ret_type) == RET_PTR_TO_BTF_ID) {
> > > +               struct btf *ret_btf;
> > >                 int ret_btf_id;
> > >
> > >                 mark_reg_known_zero(env, regs, BPF_REG_0);
> > >                 regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
> > > -               ret_btf_id = *fn->ret_btf_id;
> > > +               if (func_id == BPF_FUNC_kptr_xchg) {
> > > +                       ret_btf = meta.kptr_off_desc->btf;
> > > +                       ret_btf_id = meta.kptr_off_desc->btf_id;
> > > +               } else {
> > > +                       ret_btf = btf_vmlinux;
> > > +                       ret_btf_id = *fn->ret_btf_id;
> > > +               }
> > >                 if (ret_btf_id == 0) {
> > >                         verbose(env, "invalid return type %u of func %s#%d\n",
> > >                                 base_type(ret_type), func_id_name(func_id),
> > >                                 func_id);
> > >                         return -EINVAL;
> > >                 }
> > > -               /* current BPF helper definitions are only coming from
> > > -                * built-in code with type IDs from  vmlinux BTF
> > > -                */
> > > -               regs[BPF_REG_0].btf = btf_vmlinux;
> > > +               regs[BPF_REG_0].btf = ret_btf;
> > >                 regs[BPF_REG_0].btf_id = ret_btf_id;
> > >         } else {
> > >                 verbose(env, "unknown return type %u of func %s#%d\n",
> > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > index d14b10b85e51..444fe6f1cf35 100644
> > > --- a/tools/include/uapi/linux/bpf.h
> > > +++ b/tools/include/uapi/linux/bpf.h
> > > @@ -5143,6 +5143,17 @@ union bpf_attr {
> > >   *             The **hash_algo** is returned on success,
> > >   *             **-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
> > >   *             invalid arguments are passed.
> > > + *
> > > + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> > > + *     Description
> > > + *             Exchange kptr at pointer *map_value* with *ptr*, and return the
> > > + *             old value. *ptr* can be NULL, otherwise it must be a referenced
> > > + *             pointer which will be released when this helper is called.
> > > + *     Return
> > > + *             The old value of kptr (which can be NULL). The returned pointer,
> > > + *             if not NULL, is a reference which must be released using its
> > > + *             corresponding release function, or moved into a BPF map before
> > > + *             program exit.
> > >   */
> > >  #define __BPF_FUNC_MAPPER(FN)          \
> > >         FN(unspec),                     \
> > > @@ -5339,6 +5350,7 @@ union bpf_attr {
> > >         FN(copy_from_user_task),        \
> > >         FN(skb_set_tstamp),             \
> > >         FN(ima_file_hash),              \
> > > +       FN(kptr_xchg),                  \
> > >         /* */
> > >
> > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > --
> > > 2.35.1
> > >
>
> --
> Kartikeya

--
Kartikeya

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto
  2022-04-12 20:11     ` Kumar Kartikeya Dwivedi
@ 2022-04-13 18:33       ` Joanne Koong
  2022-04-13 18:39         ` Joanne Koong
  0 siblings, 1 reply; 29+ messages in thread
From: Joanne Koong @ 2022-04-13 18:33 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Tue, Apr 12, 2022 at 1:11 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, Apr 12, 2022 at 11:46:14PM IST, Joanne Koong wrote:
> > On Sun, Apr 10, 2022 at 11:58 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > Add a new type flag for bpf_arg_type that, when set, tells the verifier
> > > that for a release function, that argument's register will be the one
> > > for which meta.ref_obj_id will be set, and which will then be released
> > > using release_reference. To capture the regno, introduce a new field
> > > release_regno in bpf_call_arg_meta.
> > >
> > > This will be required in the next patch, where we may pass either NULL
> > > or a refcounted pointer as an argument to the release function
> > > bpf_kptr_xchg. Releasing only when meta.ref_obj_id is set is not
> > > enough, as there is a case where the type of the argument matches, but
> > > the ref_obj_id is set to 0. Hence, we must enforce that whenever
> > > meta.ref_obj_id is zero, the register that is to be released can only
> > > be NULL for a release function.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  include/linux/bpf.h   |  5 ++++-
> > >  kernel/bpf/ringbuf.c  |  4 ++--
> > >  kernel/bpf/verifier.c | 46 ++++++++++++++++++++++++++++++++++++-------
> > >  net/core/filter.c     |  2 +-
> > >  4 files changed, 46 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index e267db260cb7..a6d1982e8118 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -364,7 +364,10 @@ enum bpf_type_flag {
> > >          */
> > >         MEM_PERCPU              = BIT(4 + BPF_BASE_TYPE_BITS),
> > >
> > > -       __BPF_TYPE_LAST_FLAG    = MEM_PERCPU,
> > > +       /* Indicates that the pointer argument will be released. */
> > > +       PTR_RELEASE             = BIT(5 + BPF_BASE_TYPE_BITS),
> > > +
> > > +       __BPF_TYPE_LAST_FLAG    = PTR_RELEASE,
> > >  };
> > >
> > >  /* Max number of base types. */
> > > diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> > > index 710ba9de12ce..a22c21c0a7ef 100644
> > > --- a/kernel/bpf/ringbuf.c
> > > +++ b/kernel/bpf/ringbuf.c
> > > @@ -404,7 +404,7 @@ BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags)
> > >  const struct bpf_func_proto bpf_ringbuf_submit_proto = {
> > >         .func           = bpf_ringbuf_submit,
> > >         .ret_type       = RET_VOID,
> > > -       .arg1_type      = ARG_PTR_TO_ALLOC_MEM,
> > > +       .arg1_type      = ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
> > >         .arg2_type      = ARG_ANYTHING,
> > >  };
> > >
> > > @@ -417,7 +417,7 @@ BPF_CALL_2(bpf_ringbuf_discard, void *, sample, u64, flags)
> > >  const struct bpf_func_proto bpf_ringbuf_discard_proto = {
> > >         .func           = bpf_ringbuf_discard,
> > >         .ret_type       = RET_VOID,
> > > -       .arg1_type      = ARG_PTR_TO_ALLOC_MEM,
> > > +       .arg1_type      = ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
> > >         .arg2_type      = ARG_ANYTHING,
> > >  };
> > >
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 01d45c5010f9..6cc08526e049 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -245,6 +245,7 @@ struct bpf_call_arg_meta {
> > >         struct bpf_map *map_ptr;
> > >         bool raw_mode;
> > >         bool pkt_access;
> > > +       u8 release_regno;
> > >         int regno;
> > >         int access_size;
> > >         int mem_size;
> > > @@ -5300,6 +5301,11 @@ static bool arg_type_is_int_ptr(enum bpf_arg_type type)
> > >                type == ARG_PTR_TO_LONG;
> > >  }
> > >
> > > +static bool arg_type_is_release_ptr(enum bpf_arg_type type)
> > > +{
> > > +       return type & PTR_RELEASE;
> > > +}
> > > +
> > Now that we have PTR_RELEASE as a bpf arg type descriptor, why do we
> > still need is_release_function() in the verifier? I think we should
> > just remove is_release_function() altogether - is_release_function()
> > isn't functionally necessary now that we have PTR_RELEASE, and I don't
> > think it's great that is_release_function() hardcodes specific
> > functions into the verifier. What are your thoughts?
>
> We need it to (at least) guard the meta.ref_obj_id release; otherwise you have to
> check for PTR_RELEASE in all arguments to determine that it is a release function.
> I guess we could record whether the function is a release function in meta, so
> that looping over the arguments isn't needed each time (probably best to do it
> in check_release_regno, and set it there).
>
I elaborated a bit more on this in my next comment, but I think we
should just get rid of is_release_function() and instead set
meta.release_regno in check_func_arg() to track whether the function
is a release function.
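
Concretely, the release path in check_helper_call() could then key off
meta.release_regno instead of is_release_function(). A rough sketch,
reusing the logic from the hunk below:

	if (meta.release_regno) {
		err = -EINVAL;
		if (meta.ref_obj_id)
			err = release_reference(env, meta.ref_obj_id);
		else if (register_is_null(&regs[meta.release_regno]))
			err = 0;
		if (err) {
			verbose(env, "func %s#%d reference has not been acquired before\n",
				func_id_name(func_id), func_id);
			return err;
		}
	}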
> >
> > >  static int int_ptr_type_to_size(enum bpf_arg_type type)
> > >  {
> > >         if (type == ARG_PTR_TO_INT)
> > > @@ -5532,7 +5538,7 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
> > >                 /* Some of the argument types nevertheless require a
> > >                  * zero register offset.
> > >                  */
> > > -               if (arg_type != ARG_PTR_TO_ALLOC_MEM)
> > > +               if (base_type(arg_type) != ARG_PTR_TO_ALLOC_MEM)
> > >                         return 0;
> > >                 break;
> > >         /* All the rest must be rejected, except PTR_TO_BTF_ID which allows
> >
> > Later on in this check_func_arg_reg_off() function, I think we can get
> > rid of the hacky workaround for the PTR_TO_BTF_ID case where it relies
> > on whether the function is a release function and reg->ref_obj_id is
> > set, to determine whether the argument is a release arg or not. The
> > arg type is passed directly to check_func_arg_reg_off(), so I think we
> > could just use arg_type_is_release_ptr(arg_type) instead, which will
> > also be more robust when/if we support having multiple release args in
> > the future.
>
> Ok, sounds good.
>
> >
> > > @@ -6124,12 +6130,31 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
> > >         return true;
> > >  }
> > >
> > > -static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
> > > +static bool check_release_regno(const struct bpf_func_proto *fn, int func_id,
> > > +                               struct bpf_call_arg_meta *meta)
> > > +{
> > > +       int i;
> > > +
> > > +       for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> > > +               if (arg_type_is_release_ptr(fn->arg_type[i])) {
> > > +                       if (!is_release_function(func_id))
> > > +                               return false;
> > > +                       if (meta->release_regno)
> > > +                               return false;
> > > +                       meta->release_regno = i + 1;
> > > +               }
> > > +       }
> > > +       return !is_release_function(func_id) || meta->release_regno;
> > > +}
> > Is this check needed? There's already a check in check_func_arg that
> > there can't be two arg registers with ref_obj_ids set. I think this
> > already checks against the case where the user tries to pass in two
> > release registers as arguments.
>
> This is different; this is about preventing the case where some func_id is
> listed as a release function, but none of its arguments is tagged as
> PTR_RELEASE. It also doubles as a way to record the regno being released,
> since we need to loop over the arguments anyway.
Why do we need to prevent the case where a release kernel helper
function doesn't have any of its arguments tagged as PTR_RELEASE or,
conversely, where a non-release helper function has one of its
arguments tagged with PTR_RELEASE? That would be a bug in the kernel
then. I think we can just assume that this will never be the case.

Given that, I'm in favor of just removing check_release_regno()
altogether, and doing the meta->release_regno assignment + check for
multiple PTR_RELEASE args in check_func_arg() right after the
skip_type_check: label. We already do the assignment + multiple
instances check there for meta->ref_obj_id. That to me looks like the
cleanest approach.
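
That is, something along these lines right after the skip_type_check:
label in check_func_arg() (untested sketch):

	if (arg_type_is_release_ptr(arg_type)) {
		if (meta->release_regno) {
			verbose(env, "verifier internal error: more than one release argument\n");
			return -EFAULT;
		}
		meta->release_regno = regno;
	}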
>
> If we are removing is_release_function, we can just make sure PTR_RELEASE is
> only seen once, and consider such functions as release functions (and set
> meta.release_function to true).
I don't think you even need meta->release_function, since you already
have meta->release_regno, no? You can just check whether
meta->release_regno is non-zero.
>
> > > +
> > > +static int check_func_proto(const struct bpf_func_proto *fn, int func_id,
> > > +                           struct bpf_call_arg_meta *meta)
> > >  {
> > >         return check_raw_mode_ok(fn) &&
> > >                check_arg_pair_ok(fn) &&
> > >                check_btf_id_ok(fn) &&
> > > -              check_refcount_ok(fn, func_id) ? 0 : -EINVAL;
> > > +              check_refcount_ok(fn, func_id) &&
> > > +              check_release_regno(fn, func_id, meta) ? 0 : -EINVAL;
> > >  }
> > >
> > >  /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
> > > @@ -6808,7 +6833,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >         memset(&meta, 0, sizeof(meta));
> > >         meta.pkt_access = fn->pkt_access;
> > >
> > > -       err = check_func_proto(fn, func_id);
> > > +       err = check_func_proto(fn, func_id, &meta);
> > >         if (err) {
> > >                 verbose(env, "kernel subsystem misconfigured func %s#%d\n",
> > >                         func_id_name(func_id), func_id);
> > > @@ -6841,8 +6866,17 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >                         return err;
> > >         }
> > >
> > > +       regs = cur_regs(env);
> > > +
> > >         if (is_release_function(func_id)) {
> > > -               err = release_reference(env, meta.ref_obj_id);
> > > +               err = -EINVAL;
> > > +               if (meta.ref_obj_id)
> > > +                       err = release_reference(env, meta.ref_obj_id);
> > > +               /* meta.ref_obj_id can only be 0 if register that is meant to be
> > > +                * released is NULL, which must be > R0.
> > > +                */
> > > +               else if (meta.release_regno && register_is_null(&regs[meta.release_regno]))
> > > +                       err = 0;
> > >                 if (err) {
> > >                         verbose(env, "func %s#%d reference has not been acquired before\n",
> > >                                 func_id_name(func_id), func_id);

Also, I forgot to mention this earlier, but I think we also need to
check here that meta.release_regno == meta.ref_obj_id; otherwise, if a
helper function takes in at least two parameters, one of which is
PTR_RELEASE, the program could pass in something with no ref_obj_id as
the PTR_RELEASE arg, and a ref_obj_id arg as one of the other args.
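
To illustrate with a hypothetical proto (sketch):

	/* arg1 is the release arg, arg2 an ordinary refcounted arg */
	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON | PTR_RELEASE,
	.arg2_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON,

A program could pass an unreferenced pointer as arg1 and an acquired
reference as arg2; meta.ref_obj_id would then be set from arg2, and
the release path would wrongly release the arg2 reference on arg1's
behalf.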

> > > @@ -6850,8 +6884,6 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >                 }
> > >         }
> > >
> > > -       regs = cur_regs(env);
> > > -
> > >         switch (func_id) {
> > >         case BPF_FUNC_tail_call:
> > >                 err = check_reference_leak(env);
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 143f442a9505..8eb01a997476 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -6621,7 +6621,7 @@ static const struct bpf_func_proto bpf_sk_release_proto = {
> > >         .func           = bpf_sk_release,
> > >         .gpl_only       = false,
> > >         .ret_type       = RET_INTEGER,
> > > -       .arg1_type      = ARG_PTR_TO_BTF_ID_SOCK_COMMON,
> > > +       .arg1_type      = ARG_PTR_TO_BTF_ID_SOCK_COMMON | PTR_RELEASE,
> > >  };
> > >
> > >  BPF_CALL_5(bpf_xdp_sk_lookup_udp, struct xdp_buff *, ctx,
> > > --
> > > 2.35.1
> > >
>
> --
> Kartikeya

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto
  2022-04-13 18:33       ` Joanne Koong
@ 2022-04-13 18:39         ` Joanne Koong
  0 siblings, 0 replies; 29+ messages in thread
From: Joanne Koong @ 2022-04-13 18:39 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Wed, Apr 13, 2022 at 11:33 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Tue, Apr 12, 2022 at 1:11 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Tue, Apr 12, 2022 at 11:46:14PM IST, Joanne Koong wrote:
> > > On Sun, Apr 10, 2022 at 11:58 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > Add a new type flag for bpf_arg_type which, when set, tells the
> > > > verifier that for a release function, that argument's register will
> > > > be the one for which meta.ref_obj_id is set, and which will then be
> > > > released using release_reference. To capture the regno, introduce a
> > > > new field release_regno in bpf_call_arg_meta.
> > > >
> > > > This will be required in the next patch, where we may either pass NULL
> > > > or a refcounted pointer as an argument to the release function
> > > > bpf_kptr_xchg. Releasing only when meta.ref_obj_id is set is not
> > > > enough, as there is a case where the type of the argument matches,
> > > > but the ref_obj_id is set to 0. Hence, we must enforce that whenever
> > > > meta.ref_obj_id is zero, the register that is to be released can only
> > > > be NULL for a release function.
> > > >
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > >  include/linux/bpf.h   |  5 ++++-
> > > >  kernel/bpf/ringbuf.c  |  4 ++--
> > > >  kernel/bpf/verifier.c | 46 ++++++++++++++++++++++++++++++++++++-------
> > > >  net/core/filter.c     |  2 +-
> > > >  4 files changed, 46 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index e267db260cb7..a6d1982e8118 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -364,7 +364,10 @@ enum bpf_type_flag {
> > > >          */
> > > >         MEM_PERCPU              = BIT(4 + BPF_BASE_TYPE_BITS),
> > > >
> > > > -       __BPF_TYPE_LAST_FLAG    = MEM_PERCPU,
> > > > +       /* Indicates that the pointer argument will be released. */
> > > > +       PTR_RELEASE             = BIT(5 + BPF_BASE_TYPE_BITS),
> > > > +
> > > > +       __BPF_TYPE_LAST_FLAG    = PTR_RELEASE,
> > > >  };
> > > >
> > > >  /* Max number of base types. */
> > > > diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> > > > index 710ba9de12ce..a22c21c0a7ef 100644
> > > > --- a/kernel/bpf/ringbuf.c
> > > > +++ b/kernel/bpf/ringbuf.c
> > > > @@ -404,7 +404,7 @@ BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags)
> > > >  const struct bpf_func_proto bpf_ringbuf_submit_proto = {
> > > >         .func           = bpf_ringbuf_submit,
> > > >         .ret_type       = RET_VOID,
> > > > -       .arg1_type      = ARG_PTR_TO_ALLOC_MEM,
> > > > +       .arg1_type      = ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
> > > >         .arg2_type      = ARG_ANYTHING,
> > > >  };
> > > >
> > > > @@ -417,7 +417,7 @@ BPF_CALL_2(bpf_ringbuf_discard, void *, sample, u64, flags)
> > > >  const struct bpf_func_proto bpf_ringbuf_discard_proto = {
> > > >         .func           = bpf_ringbuf_discard,
> > > >         .ret_type       = RET_VOID,
> > > > -       .arg1_type      = ARG_PTR_TO_ALLOC_MEM,
> > > > +       .arg1_type      = ARG_PTR_TO_ALLOC_MEM | PTR_RELEASE,
> > > >         .arg2_type      = ARG_ANYTHING,
> > > >  };
> > > >
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index 01d45c5010f9..6cc08526e049 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -245,6 +245,7 @@ struct bpf_call_arg_meta {
> > > >         struct bpf_map *map_ptr;
> > > >         bool raw_mode;
> > > >         bool pkt_access;
> > > > +       u8 release_regno;
> > > >         int regno;
> > > >         int access_size;
> > > >         int mem_size;
> > > > @@ -5300,6 +5301,11 @@ static bool arg_type_is_int_ptr(enum bpf_arg_type type)
> > > >                type == ARG_PTR_TO_LONG;
> > > >  }
> > > >
> > > > +static bool arg_type_is_release_ptr(enum bpf_arg_type type)
> > > > +{
> > > > +       return type & PTR_RELEASE;
> > > > +}
> > > > +
> > > Now that we have PTR_RELEASE as a bpf arg type descriptor, why do we
> > > still need is_release_function() in the verifier? I think we should
> > > just remove is_release_function() altogether - is_release_function()
> > > isn't functionally necessary now that we have PTR_RELEASE, and I don't
> > > think it's great that is_release_function() hardcodes specific
> > > functions into the verifier. What are your thoughts?
> >
> > We need it to (at least) guard the meta.ref_obj_id release; otherwise you have to
> > check for PTR_RELEASE in all arguments to determine that it is a release function.
> > I guess we could record whether the function is a release function in meta, so
> > that looping over the arguments isn't needed each time (probably best to do it
> > in check_release_regno, and set it there).
> >
> I elaborated a bit more on this in my next comment, but I think we
> should just get rid of is_release_function() and instead set
> meta.release_regno in check_func_arg() to track whether the function
> is a release function.
> > >
> > > >  static int int_ptr_type_to_size(enum bpf_arg_type type)
> > > >  {
> > > >         if (type == ARG_PTR_TO_INT)
> > > > @@ -5532,7 +5538,7 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
> > > >                 /* Some of the argument types nevertheless require a
> > > >                  * zero register offset.
> > > >                  */
> > > > -               if (arg_type != ARG_PTR_TO_ALLOC_MEM)
> > > > +               if (base_type(arg_type) != ARG_PTR_TO_ALLOC_MEM)
> > > >                         return 0;
> > > >                 break;
> > > >         /* All the rest must be rejected, except PTR_TO_BTF_ID which allows
> > >
> > > Later on in this check_func_arg_reg_off() function, I think we can get
> > > rid of the hacky workaround for the PTR_TO_BTF_ID case where it relies
> > > on whether the function is a release function and reg->ref_obj_id is
> > > set, to determine whether the argument is a release arg or not. The
> > > arg type is passed directly to check_func_arg_reg_off(), so I think we
> > > could just use arg_type_is_release_ptr(arg_type) instead, which will
> > > also be more robust when/if we support having multiple release args in
> > > the future.
> >
> > Ok, sounds good.
> >
> > >
> > > > @@ -6124,12 +6130,31 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
> > > >         return true;
> > > >  }
> > > >
> > > > -static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
> > > > +static bool check_release_regno(const struct bpf_func_proto *fn, int func_id,
> > > > +                               struct bpf_call_arg_meta *meta)
> > > > +{
> > > > +       int i;
> > > > +
> > > > +       for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> > > > +               if (arg_type_is_release_ptr(fn->arg_type[i])) {
> > > > +                       if (!is_release_function(func_id))
> > > > +                               return false;
> > > > +                       if (meta->release_regno)
> > > > +                               return false;
> > > > +                       meta->release_regno = i + 1;
> > > > +               }
> > > > +       }
> > > > +       return !is_release_function(func_id) || meta->release_regno;
> > > > +}
> > > Is this check needed? There's already a check in check_func_arg that
> > > there can't be two arg registers with ref_obj_ids set. I think this
> > > already checks against the case where the user tries to pass in two
> > > release registers as arguments.
> >
> > This is different; this is about preventing the case where some func_id is
> > listed as a release function, but none of its arguments is tagged as
> > PTR_RELEASE. It also doubles as a way to record the regno being released,
> > since we need to loop over the arguments anyway.
> Why do we need to prevent the case where a release kernel helper
> function doesn't have any of its arguments tagged as PTR_RELEASE or,
> conversely, where a non-release helper function has one of its
> arguments tagged with PTR_RELEASE? That would be a bug in the kernel
> then. I think we can just assume that this will never be the case.
>
> Given that, I'm in favor of just removing check_release_regno()
> altogether, and doing the meta->release_regno assignment + check for
> multiple PTR_RELEASE args in check_func_arg() right after the
> skip_type_check: label. We already do the assignment + multiple
> instances check there for meta->ref_obj_id. That to me looks like the
> cleanest approach.
> >
> > If we are removing is_release_function, we can just make sure PTR_RELEASE is
> > only seen once, and consider such functions as release functions (and set
> > meta.release_function to true).
> I don't think you even need meta->release_function, since you already
> have meta->release_regno, no? You can just check whether
> meta->release_regno is non-zero.
> >
> > > > +
> > > > +static int check_func_proto(const struct bpf_func_proto *fn, int func_id,
> > > > +                           struct bpf_call_arg_meta *meta)
> > > >  {
> > > >         return check_raw_mode_ok(fn) &&
> > > >                check_arg_pair_ok(fn) &&
> > > >                check_btf_id_ok(fn) &&
> > > > -              check_refcount_ok(fn, func_id) ? 0 : -EINVAL;
> > > > +              check_refcount_ok(fn, func_id) &&
> > > > +              check_release_regno(fn, func_id, meta) ? 0 : -EINVAL;
> > > >  }
> > > >
> > > >  /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
> > > > @@ -6808,7 +6833,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >         memset(&meta, 0, sizeof(meta));
> > > >         meta.pkt_access = fn->pkt_access;
> > > >
> > > > -       err = check_func_proto(fn, func_id);
> > > > +       err = check_func_proto(fn, func_id, &meta);
> > > >         if (err) {
> > > >                 verbose(env, "kernel subsystem misconfigured func %s#%d\n",
> > > >                         func_id_name(func_id), func_id);
> > > > @@ -6841,8 +6866,17 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >                         return err;
> > > >         }
> > > >
> > > > +       regs = cur_regs(env);
> > > > +
> > > >         if (is_release_function(func_id)) {
> > > > -               err = release_reference(env, meta.ref_obj_id);
> > > > +               err = -EINVAL;
> > > > +               if (meta.ref_obj_id)
> > > > +                       err = release_reference(env, meta.ref_obj_id);
> > > > +               /* meta.ref_obj_id can only be 0 if register that is meant to be
> > > > +                * released is NULL, which must be > R0.
> > > > +                */
> > > > +               else if (meta.release_regno && register_is_null(&regs[meta.release_regno]))
> > > > +                       err = 0;
> > > >                 if (err) {
> > > >                         verbose(env, "func %s#%d reference has not been acquired before\n",
> > > >                                 func_id_name(func_id), func_id);
>
> Also, I forgot to mention this earlier, but I think we also need to
> check here that meta.release_regno == meta.ref_obj_id; otherwise, if a
> helper function takes in at least two parameters, one of which is
> PTR_RELEASE, the program could pass in something with no ref_obj_id as
> the PTR_RELEASE arg, and a ref_obj_id arg as one of the other args.
Not meta.release_regno == meta.ref_obj_id, but some way of checking
that the release arg is the one that has the ref_obj_id set. I think
the easiest way to do this is to just check that the reg also has a
valid ref_obj_id when we do the meta.release_regno assignment.
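
Extending the earlier sketch, at assignment time (untested):

	if (arg_type_is_release_ptr(arg_type)) {
		if (!reg->ref_obj_id && !register_is_null(reg)) {
			verbose(env, "R%d must be referenced when passed to release function\n",
				regno);
			return -EINVAL;
		}
		if (meta->release_regno) {
			verbose(env, "verifier internal error: more than one release argument\n");
			return -EFAULT;
		}
		meta->release_regno = regno;
	}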
>
> > > > @@ -6850,8 +6884,6 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >                 }
> > > >         }
> > > >
> > > > -       regs = cur_regs(env);
> > > > -
> > > >         switch (func_id) {
> > > >         case BPF_FUNC_tail_call:
> > > >                 err = check_reference_leak(env);
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index 143f442a9505..8eb01a997476 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -6621,7 +6621,7 @@ static const struct bpf_func_proto bpf_sk_release_proto = {
> > > >         .func           = bpf_sk_release,
> > > >         .gpl_only       = false,
> > > >         .ret_type       = RET_INTEGER,
> > > > -       .arg1_type      = ARG_PTR_TO_BTF_ID_SOCK_COMMON,
> > > > +       .arg1_type      = ARG_PTR_TO_BTF_ID_SOCK_COMMON | PTR_RELEASE,
> > > >  };
> > > >
> > > >  BPF_CALL_5(bpf_xdp_sk_lookup_udp, struct xdp_buff *, ctx,
> > > > --
> > > > 2.35.1
> > > >
> >
> > --
> > Kartikeya

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-04-13 18:39 UTC | newest]

Thread overview: 29+ messages
2022-04-09  9:32 [PATCH bpf-next v4 00/13] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
2022-04-09  9:32 ` [PATCH bpf-next v4 01/13] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
2022-04-11 20:20   ` Joanne Koong
2022-04-12 19:48     ` Kumar Kartikeya Dwivedi
2022-04-09  9:32 ` [PATCH bpf-next v4 02/13] bpf: Move check_ptr_off_reg before check_map_access Kumar Kartikeya Dwivedi
2022-04-11 20:28   ` Joanne Koong
2022-04-09  9:32 ` [PATCH bpf-next v4 03/13] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
2022-04-12  0:32   ` Joanne Koong
2022-04-12 19:16     ` Kumar Kartikeya Dwivedi
2022-04-12 23:56       ` Joanne Koong
2022-04-13  5:50         ` Kumar Kartikeya Dwivedi
2022-04-13  5:41   ` kernel test robot
2022-04-09  9:32 ` [PATCH bpf-next v4 04/13] bpf: Tag argument to be released in bpf_func_proto Kumar Kartikeya Dwivedi
2022-04-12 18:16   ` Joanne Koong
2022-04-12 20:11     ` Kumar Kartikeya Dwivedi
2022-04-13 18:33       ` Joanne Koong
2022-04-13 18:39         ` Joanne Koong
2022-04-09  9:32 ` [PATCH bpf-next v4 05/13] bpf: Allow storing referenced kptr in map Kumar Kartikeya Dwivedi
2022-04-12 23:05   ` Joanne Koong
2022-04-13  5:36     ` Kumar Kartikeya Dwivedi
2022-04-13  5:54       ` Kumar Kartikeya Dwivedi
2022-04-09  9:32 ` [PATCH bpf-next v4 06/13] bpf: Prevent escaping of kptr loaded from maps Kumar Kartikeya Dwivedi
2022-04-09  9:32 ` [PATCH bpf-next v4 07/13] bpf: Adapt copy_map_value for multiple offset case Kumar Kartikeya Dwivedi
2022-04-09  9:32 ` [PATCH bpf-next v4 08/13] bpf: Populate pairs of btf_id and destructor kfunc in btf Kumar Kartikeya Dwivedi
2022-04-09  9:32 ` [PATCH bpf-next v4 09/13] bpf: Wire up freeing of referenced kptr Kumar Kartikeya Dwivedi
2022-04-09  9:33 ` [PATCH bpf-next v4 10/13] bpf: Teach verifier about kptr_get kfunc helpers Kumar Kartikeya Dwivedi
2022-04-09  9:33 ` [PATCH bpf-next v4 11/13] libbpf: Add kptr type tag macros to bpf_helpers.h Kumar Kartikeya Dwivedi
2022-04-09  9:33 ` [PATCH bpf-next v4 12/13] selftests/bpf: Add C tests for kptr Kumar Kartikeya Dwivedi
2022-04-09  9:33 ` [PATCH bpf-next v4 13/13] selftests/bpf: Add verifier " Kumar Kartikeya Dwivedi
