* [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps
@ 2022-03-17 11:59 Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 01/15] bpf: Factor out fd returning from bpf_btf_find_by_name_kind Kumar Kartikeya Dwivedi
                   ` (15 more replies)
  0 siblings, 16 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

This set enables storing pointers of a certain type in a BPF map, and extends
the verifier to enforce type safety and lifetime correctness properties.

The infrastructure being added is generic enough to allow storing any kind of
pointer whose type is available in BTF (user or kernel) in the future (e.g.
strongly typed memory allocation in a BPF program). Such pointers are tracked
internally in the verifier as PTR_TO_BTF_ID, but for now the series limits them
to four kinds of pointers obtained from the kernel.

Obviously, use of this feature requires the map to have BTF information.

1. Unreferenced kernel pointer

In this case, there are very few restrictions. The pointer type being stored
must match the type declared in the map value. However, such a pointer, when
loaded from the map, can only be dereferenced; it cannot be passed to any
in-kernel helpers or kernel functions available to the program. This is because
while the verifier's exception handling mechanism converts BPF_LDX to PROBE_MEM
loads, which are then handled specially by the JIT implementation, the same
liberty is not available to accesses inside the kernel. By the time the pointer
is passed into a helper, it has no lifetime related guarantees about the object
it is pointing to, and may well be referencing invalid memory.
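
As a rough sketch of the usage (map layout and names are illustrative and
assume the usual vmlinux.h/bpf_helpers.h includes; the __kptr macro expansion
matches the example in patch 3, and compiling the tag needs the LLVM support
mentioned in the Notes section):

	#define __kptr __attribute__((btf_type_tag("kptr")))

	struct map_value {
		struct task_struct __kptr *task;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__uint(max_entries, 1);
		__type(key, int);
		__type(value, struct map_value);
	} array_map SEC(".maps");

	SEC("tc")
	int read_unref_kptr(struct __sk_buff *ctx)
	{
		struct map_value *v;
		struct task_struct *t;
		int key = 0;

		v = bpf_map_lookup_elem(&array_map, &key);
		if (!v)
			return 0;
		t = v->task;	/* BPF_LDX, marked PTR_TO_BTF_ID_OR_NULL */
		if (!t)
			return 0;
		/* Plain dereference is allowed (converted to a PROBE_MEM
		 * load), but passing t to helpers/kfuncs is not.
		 */
		bpf_printk("pid=%d", t->pid);
		return 0;
	}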

2. Referenced kernel pointer

This case imposes a lot of restrictions on the programmer, to ensure safety. To
transfer the ownership of a reference in the BPF program to the map, the user
must use the bpf_kptr_xchg helper, which returns the old pointer contained in
the map as an acquired reference, and releases the verifier's reference state
for the pointer being exchanged, as it moves into the map.

This is a normal PTR_TO_BTF_ID that can be used with in-kernel helpers and
kernel functions callable by the program.
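
A minimal sketch of the exchange pattern (the map value field ref_task and the
release kfunc are placeholders, not defined by this set):

	struct task_struct *old;

	/* Ownership of whatever was stored at v->ref_task transfers to the
	 * program; NULL is written into the map slot in its place.
	 */
	old = bpf_kptr_xchg(&v->ref_task, NULL);
	if (old)
		/* old is an acquired reference: release it with the type's
		 * release kfunc, or move it into a map, before BPF_EXIT.
		 */
		task_struct_release(old);	/* placeholder release kfunc */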

However, if BPF_LDX is used to load a referenced pointer from the map, it is
still not permitted to pass it to in-kernel helpers or kernel functions. To
obtain a reference usable with helpers, the user must invoke a kfunc helper
which returns a usable reference (which also must be eventually released before
BPF_EXIT, or moved into a map).

Since the load of the pointer (preserving data dependency ordering) must happen
inside the RCU read section, the kfunc helper takes a pointer into the map
value, which must point to the actual pointer of the object whose reference is
to be raised. The type is verified from the BTF information of the kfunc, as
the prototype must be:

	T *func(T **, ... /* other arguments */);

Then, the verifier checks whether the pointer at that offset in the map value
points to the type T, and permits the call.

This convention is followed so that such helpers may also be called from
sleepable BPF programs, where the RCU read lock is not necessarily held in the
BPF program context, hence the need to pass in a pointer to the actual pointer
so that the load can be performed inside the RCU read section.
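
For illustration, a kernel-side kfunc following this convention could look like
the sketch below; struct foo and its refcnt member are hypothetical, and only
the general pattern of loading under RCU and conditionally raising the refcount
is implied by the description above:

	struct foo *bpf_foo_kptr_get(struct foo **ptrp)
	{
		struct foo *p;

		rcu_read_lock();
		/* Load the pointer stored in the map value... */
		p = READ_ONCE(*ptrp);
		/* ...and only hand out a reference if the object is still
		 * alive at this point.
		 */
		if (p && !refcount_inc_not_zero(&p->refcnt))
			p = NULL;
		rcu_read_unlock();
		return p;
	}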

3. per-CPU kernel pointer

These have very few restrictions. The user can store a PTR_TO_PERCPU_BTF_ID
into the map, and when loading from the map, they must NULL check it before
use, because while a non-zero value stored into the map should always be valid,
it can still be reset to zero on updates. After checking it to be non-NULL, it
can be passed to the bpf_per_cpu_ptr and bpf_this_cpu_ptr helpers to obtain a
PTR_TO_BTF_ID to the underlying per-CPU object.

It is also permitted to write 0 and reset the value.
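
A sketch of the flow (the __kptr_percpu macro name and the per-CPU symbol are
assumptions for illustration; only the "kptr_percpu" type tag string comes from
this set's changelog):

	#define __kptr_percpu __attribute__((btf_type_tag("kptr_percpu")))

	/* Hypothetical per-CPU kernel variable, resolved via BTF as a ksym */
	extern const struct some_struct some_pcpu_var __ksym;

	struct map_value {
		struct some_struct __kptr_percpu *pc;
	};

	/* In one invocation: store the PTR_TO_PERCPU_BTF_ID */
	v->pc = &some_pcpu_var;

	/* In another invocation: load, NULL check, then convert */
	struct some_struct *p = v->pc;

	if (!p)	/* may have been reset to zero by an update */
		return 0;
	p = bpf_this_cpu_ptr(p);	/* PTR_TO_BTF_ID to this CPU's copy */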

4. Userspace pointer

The verifier recently gained support for annotating BTF with the __user type
tag. This indicates pointers pointing to memory which must be read using the
bpf_probe_read_user helper to ensure correct results. This set also permits
storing them into a BPF map, and ensures that a user pointer cannot be stored
into the other kinds of pointers mentioned above.

When loaded from the map, the only thing that can be done with such a pointer
is to pass it to bpf_probe_read_user. No dereference is allowed.
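
On the program side this might look as follows (the __kptr_user macro name and
the struct are assumptions for illustration; only the "kptr_user" type tag
string comes from this set's changelog):

	#define __kptr_user __attribute__((btf_type_tag("kptr_user")))

	struct user_datum { long val; };	/* hypothetical layout */

	struct map_value {
		struct user_datum __kptr_user *uptr;
	};

	struct user_datum tmp;

	if (!v->uptr)
		return 0;
	/* No direct dereference; copy the bytes in from userspace first. */
	if (bpf_probe_read_user(&tmp, sizeof(tmp), v->uptr))
		return 0;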

Notes
-----

 * C selftests require https://reviews.llvm.org/D119799 to pass.
 * Unlike BPF timers, kptr is not reset or freed on map_release_uref.
 * Referenced kptr storage is always treated as unsigned long * on kernel side,
   as BPF side cannot mutate it. The storage (8 bytes) is sufficient for both
   32-bit and 64-bit platforms.
 * Use of WRITE_ONCE to reset unreferenced kptr on 32-bit systems is fine, as
   the actual pointer is always word sized, so the store tearing into two 32-bit
   stores won't be a problem as the other half is always zeroed out.

Changelog:
----------
v1 -> v2
v1: https://lore.kernel.org/bpf/20220220134813.3411982-1-memxor@gmail.com

 * Address comments from Alexei
   * Rename bpf_btf_find_by_name_kind_all to bpf_find_btf_id
   * Reduce indentation level in that function
   * Always take reference regardless of module or vmlinux BTF
   * Also made it the same for btf_get_module_btf
   * Use kptr, kptr_ref, kptr_percpu, kptr_user type tags
   * Don't reserve tag namespace
   * Refactor btf_find_field to be side effect free, allocate and populate
     kptr_off_tab in caller
   * Move module reference to dtor patch
   * Remove support for BPF_XCHG, BPF_CMPXCHG insn
   * Introduce bpf_kptr_xchg helper
   * Embed offset array in struct bpf_map, populate and sort it once
   * Adjust copy_map_value to memcpy directly using this offset array
   * Removed size member from offset array to save space
 * Fix some problems pointed out by kernel test robot
 * Tidy selftests
 * Lots of other minor fixes

Kumar Kartikeya Dwivedi (15):
  bpf: Factor out fd returning from bpf_btf_find_by_name_kind
  bpf: Make btf_find_field more generic
  bpf: Allow storing unreferenced kptr in map
  bpf: Allow storing referenced kptr in map
  bpf: Allow storing percpu kptr in map
  bpf: Allow storing user kptr in map
  bpf: Prevent escaping of kptr loaded from maps
  bpf: Adapt copy_map_value for multiple offset case
  bpf: Always raise reference in btf_get_module_btf
  bpf: Populate pairs of btf_id and destructor kfunc in btf
  bpf: Wire up freeing of referenced kptr
  bpf: Teach verifier about kptr_get kfunc helpers
  libbpf: Add kptr type tag macros to bpf_helpers.h
  selftests/bpf: Add C tests for kptr
  selftests/bpf: Add verifier tests for kptr

 include/linux/bpf.h                           | 109 ++-
 include/linux/btf.h                           |  23 +
 include/uapi/linux/bpf.h                      |  12 +
 kernel/bpf/arraymap.c                         |  14 +-
 kernel/bpf/btf.c                              | 623 ++++++++++++--
 kernel/bpf/hashtab.c                          |  28 +-
 kernel/bpf/helpers.c                          |  21 +
 kernel/bpf/map_in_map.c                       |   5 +-
 kernel/bpf/syscall.c                          | 204 ++++-
 kernel/bpf/verifier.c                         | 412 ++++++++--
 net/bpf/test_run.c                            |  39 +-
 tools/include/uapi/linux/bpf.h                |  12 +
 tools/lib/bpf/bpf_helpers.h                   |   4 +
 .../selftests/bpf/prog_tests/map_kptr.c       |  20 +
 tools/testing/selftests/bpf/progs/map_kptr.c  | 236 ++++++
 tools/testing/selftests/bpf/test_verifier.c   |  60 +-
 .../testing/selftests/bpf/verifier/map_kptr.c | 763 ++++++++++++++++++
 17 files changed, 2394 insertions(+), 191 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/map_kptr.c
 create mode 100644 tools/testing/selftests/bpf/verifier/map_kptr.c

-- 
2.35.1



* [PATCH bpf-next v2 01/15] bpf: Factor out fd returning from bpf_btf_find_by_name_kind
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

In the next few patches, we need a helper that searches all kernel BTFs
(vmlinux and module BTFs), and finds the type denoted by 'name' and 'kind'.
It turns out bpf_btf_find_by_name_kind already does the same thing, but it
instead returns a BTF ID and optionally an fd (for module BTF). This is used
for relocating ksyms in BPF loader code (bpftool gen skel -L).

We extract the core code out into a new helper bpf_find_btf_id, which returns
the BTF ID in the return value, and the BTF pointer in an out parameter. The
reference for the returned BTF pointer is always raised, hence the caller must
either transfer it (e.g. to an fd), or release it after use.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/btf.c | 90 ++++++++++++++++++++++++++++--------------------
 1 file changed, 53 insertions(+), 37 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8b34563a832e..17b9adcd88d3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -525,6 +525,48 @@ s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
 	return -ENOENT;
 }
 
+static s32 bpf_find_btf_id(const char *name, u32 kind, struct btf **btf_p)
+{
+	struct btf *btf;
+	s32 ret;
+	int id;
+
+	btf = bpf_get_btf_vmlinux();
+	if (IS_ERR(btf))
+		return PTR_ERR(btf);
+
+	ret = btf_find_by_name_kind(btf, name, kind);
+	/* ret is never zero, since btf_find_by_name_kind returns
+	 * positive btf_id or negative error.
+	 */
+	if (ret > 0) {
+		btf_get(btf);
+		*btf_p = btf;
+		return ret;
+	}
+
+	/* If name is not found in vmlinux's BTF then search in module's BTFs */
+	spin_lock_bh(&btf_idr_lock);
+	idr_for_each_entry(&btf_idr, btf, id) {
+		if (!btf_is_module(btf))
+			continue;
+		/* linear search could be slow hence unlock/lock
+		 * the IDR to avoiding holding it for too long
+		 */
+		btf_get(btf);
+		spin_unlock_bh(&btf_idr_lock);
+		ret = btf_find_by_name_kind(btf, name, kind);
+		if (ret > 0) {
+			*btf_p = btf;
+			return ret;
+		}
+		spin_lock_bh(&btf_idr_lock);
+		btf_put(btf);
+	}
+	spin_unlock_bh(&btf_idr_lock);
+	return ret;
+}
+
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
 					       u32 id, u32 *res_id)
 {
@@ -6562,7 +6604,8 @@ static struct btf *btf_get_module_btf(const struct module *module)
 
 BPF_CALL_4(bpf_btf_find_by_name_kind, char *, name, int, name_sz, u32, kind, int, flags)
 {
-	struct btf *btf;
+	struct btf *btf = NULL;
+	int btf_obj_fd = 0;
 	long ret;
 
 	if (flags)
@@ -6571,44 +6614,17 @@ BPF_CALL_4(bpf_btf_find_by_name_kind, char *, name, int, name_sz, u32, kind, int
 	if (name_sz <= 1 || name[name_sz - 1])
 		return -EINVAL;
 
-	btf = bpf_get_btf_vmlinux();
-	if (IS_ERR(btf))
-		return PTR_ERR(btf);
-
-	ret = btf_find_by_name_kind(btf, name, kind);
-	/* ret is never zero, since btf_find_by_name_kind returns
-	 * positive btf_id or negative error.
-	 */
-	if (ret < 0) {
-		struct btf *mod_btf;
-		int id;
-
-		/* If name is not found in vmlinux's BTF then search in module's BTFs */
-		spin_lock_bh(&btf_idr_lock);
-		idr_for_each_entry(&btf_idr, mod_btf, id) {
-			if (!btf_is_module(mod_btf))
-				continue;
-			/* linear search could be slow hence unlock/lock
-			 * the IDR to avoiding holding it for too long
-			 */
-			btf_get(mod_btf);
-			spin_unlock_bh(&btf_idr_lock);
-			ret = btf_find_by_name_kind(mod_btf, name, kind);
-			if (ret > 0) {
-				int btf_obj_fd;
-
-				btf_obj_fd = __btf_new_fd(mod_btf);
-				if (btf_obj_fd < 0) {
-					btf_put(mod_btf);
-					return btf_obj_fd;
-				}
-				return ret | (((u64)btf_obj_fd) << 32);
-			}
-			spin_lock_bh(&btf_idr_lock);
-			btf_put(mod_btf);
+	ret = bpf_find_btf_id(name, kind, &btf);
+	if (ret > 0 && btf_is_module(btf)) {
+		btf_obj_fd = __btf_new_fd(btf);
+		if (btf_obj_fd < 0) {
+			btf_put(btf);
+			return btf_obj_fd;
 		}
-		spin_unlock_bh(&btf_idr_lock);
+		return ret | (((u64)btf_obj_fd) << 32);
 	}
+	if (ret > 0)
+		btf_put(btf);
 	return ret;
 }
 
-- 
2.35.1



* [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 01/15] bpf: Factor out fd returning from bpf_btf_find_by_name_kind Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-19 17:55   ` Alexei Starovoitov
  2022-03-17 11:59 ` [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

The next commit's field type will not be a struct, but a pointer, and it will
not be limited to one offset, but multiple ones. Make the existing
btf_find_struct_field and btf_find_datasec_var functions amenable to use for
finding BTF ID pointers in a map value, by moving the spin_lock and timer
specific checks into their own function.

The alignment and name are checked before the function is called, so it is
the last point where we can skip a field or return an error before the next
loop iteration happens. This is important, because we'll potentially be
reallocating memory inside this function in the next commit, so being able to
do that when everything else is in order is going to be more convenient.

The name parameter is now optional, and only checked if it is not NULL.

The size must be checked inside the function, because in the case of a PTR it
will instead point to the underlying BTF ID it is pointing to (or modifiers),
so the check becomes wrong to do outside of the function, and the base type
has to be obtained by skipping modifiers.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/btf.c | 120 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 86 insertions(+), 34 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 17b9adcd88d3..5b2824332880 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3161,71 +3161,109 @@ static void btf_struct_log(struct btf_verifier_env *env,
 	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
 }
 
+enum {
+	BTF_FIELD_SPIN_LOCK,
+	BTF_FIELD_TIMER,
+};
+
+struct btf_field_info {
+	u32 off;
+};
+
+static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
+				 u32 off, int sz, struct btf_field_info *info)
+{
+	if (!__btf_type_is_struct(t))
+		return 0;
+	if (t->size != sz)
+		return 0;
+	if (info->off != -ENOENT)
+		/* only one such field is allowed */
+		return -E2BIG;
+	info->off = off;
+	return 0;
+}
+
 static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
-				 const char *name, int sz, int align)
+				 const char *name, int sz, int align, int field_type,
+				 struct btf_field_info *info)
 {
 	const struct btf_member *member;
-	u32 i, off = -ENOENT;
+	u32 i, off;
+	int ret;
 
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
 								    member->type);
-		if (!__btf_type_is_struct(member_type))
-			continue;
-		if (member_type->size != sz)
-			continue;
-		if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
-			continue;
-		if (off != -ENOENT)
-			/* only one such field is allowed */
-			return -E2BIG;
+
 		off = __btf_member_bit_offset(t, member);
+
+		if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
+			continue;
 		if (off % 8)
 			/* valid C code cannot generate such BTF */
 			return -EINVAL;
 		off /= 8;
 		if (off % align)
 			return -EINVAL;
+
+		switch (field_type) {
+		case BTF_FIELD_SPIN_LOCK:
+		case BTF_FIELD_TIMER:
+			ret = btf_find_field_struct(btf, member_type, off, sz, info);
+			if (ret < 0)
+				return ret;
+			break;
+		default:
+			return -EFAULT;
+		}
 	}
-	return off;
+	return 0;
 }
 
 static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
-				const char *name, int sz, int align)
+				const char *name, int sz, int align, int field_type,
+				struct btf_field_info *info)
 {
 	const struct btf_var_secinfo *vsi;
-	u32 i, off = -ENOENT;
+	u32 i, off;
+	int ret;
 
 	for_each_vsi(i, t, vsi) {
 		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
 		const struct btf_type *var_type = btf_type_by_id(btf, var->type);
 
-		if (!__btf_type_is_struct(var_type))
-			continue;
-		if (var_type->size != sz)
+		off = vsi->offset;
+
+		if (name && strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
 			continue;
 		if (vsi->size != sz)
 			continue;
-		if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
-			continue;
-		if (off != -ENOENT)
-			/* only one such field is allowed */
-			return -E2BIG;
-		off = vsi->offset;
 		if (off % align)
 			return -EINVAL;
+
+		switch (field_type) {
+		case BTF_FIELD_SPIN_LOCK:
+		case BTF_FIELD_TIMER:
+			ret = btf_find_field_struct(btf, var_type, off, sz, info);
+			if (ret < 0)
+				return ret;
+			break;
+		default:
+			return -EFAULT;
+		}
 	}
-	return off;
+	return 0;
 }
 
 static int btf_find_field(const struct btf *btf, const struct btf_type *t,
-			  const char *name, int sz, int align)
+			  const char *name, int sz, int align, int field_type,
+			  struct btf_field_info *info)
 {
-
 	if (__btf_type_is_struct(t))
-		return btf_find_struct_field(btf, t, name, sz, align);
+		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
 	else if (btf_type_is_datasec(t))
-		return btf_find_datasec_var(btf, t, name, sz, align);
+		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
 	return -EINVAL;
 }
 
@@ -3235,16 +3273,30 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
  */
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
 {
-	return btf_find_field(btf, t, "bpf_spin_lock",
-			      sizeof(struct bpf_spin_lock),
-			      __alignof__(struct bpf_spin_lock));
+	struct btf_field_info info = { .off = -ENOENT };
+	int ret;
+
+	ret = btf_find_field(btf, t, "bpf_spin_lock",
+			     sizeof(struct bpf_spin_lock),
+			     __alignof__(struct bpf_spin_lock),
+			     BTF_FIELD_SPIN_LOCK, &info);
+	if (ret < 0)
+		return ret;
+	return info.off;
 }
 
 int btf_find_timer(const struct btf *btf, const struct btf_type *t)
 {
-	return btf_find_field(btf, t, "bpf_timer",
-			      sizeof(struct bpf_timer),
-			      __alignof__(struct bpf_timer));
+	struct btf_field_info info = { .off = -ENOENT };
+	int ret;
+
+	ret = btf_find_field(btf, t, "bpf_timer",
+			     sizeof(struct bpf_timer),
+			     __alignof__(struct bpf_timer),
+			     BTF_FIELD_TIMER, &info);
+	if (ret < 0)
+		return ret;
+	return info.off;
 }
 
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
-- 
2.35.1



* [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 01/15] bpf: Factor out fd returning from bpf_btf_find_by_name_kind Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-19 18:15   ` Alexei Starovoitov
  2022-03-17 11:59 ` [PATCH bpf-next v2 04/15] bpf: Allow storing referenced " Kumar Kartikeya Dwivedi
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

This commit introduces a new pointer type 'kptr' which can be embedded in a
map value and holds a PTR_TO_BTF_ID stored by a BPF program during its
invocation. When storing to such a kptr, the BPF program's PTR_TO_BTF_ID
register must have the same type as in the map value's BTF, and loading a kptr
marks the destination register as PTR_TO_BTF_ID with the correct kernel BTF
and BTF ID.

Such kptrs are unreferenced, i.e. by the time another invocation of the BPF
program loads this pointer, the object which the pointer points to may no
longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are patched to
PROBE_MEM loads by the verifier, it is safe to allow the user to still access
such an invalid pointer, but passing such pointers into BPF helpers and kfuncs
should not be permitted. A future patch in this series will close this gap.

The flexibility offered by allowing programs to dereference such invalid
pointers while being safe at runtime frees the verifier from doing complex
lifetime tracking. As long as the user can ensure that the object remains
valid, the data it reads from the kernel object will also be valid.

The user indicates that a certain pointer must be treated as a kptr capable
of accepting stores of PTR_TO_BTF_ID of a certain type, by using a BTF type
tag 'kptr' on the pointed to type of the pointer. This information is then
recorded in the object BTF which will be passed into the kernel by way of the
map's BTF information. The name and kind from the map value BTF are used to
look up the in-kernel type, and the actual BTF and BTF ID are recorded in the
map struct in a new kptr_off_tab member. For now, only storing pointers to
structs is permitted.

An example of this specification is shown below:

	#define __kptr __attribute__((btf_type_tag("kptr")))

	struct map_value {
		...
		struct task_struct __kptr *task;
		...
	};

Then, in a BPF program, the user may store a PTR_TO_BTF_ID with the type
task_struct into the map, and load it back later.

Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as the
verifier cannot know statically whether the value is NULL or not; it must
treat all potential loads at that map value offset as loading a possibly NULL
pointer.

Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL) are
allowed to access such a pointer. On BPF_LDX, the destination register is
updated to be a PTR_TO_BTF_ID, and on BPF_STX, it is checked whether the
source register type is a PTR_TO_BTF_ID with the same BTF type as specified
in the map BTF. The access size must always be BPF_DW.
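
Roughly, in terms of the C example above (whether a NULL store compiles down
to BPF_ST with imm=0 or a BPF_STX of a register known to be zero depends on
the compiler; both forms are accepted):

	struct task_struct *t;

	t = v->task;	/* BPF_LDX: t becomes PTR_TO_BTF_ID_OR_NULL */
	v->task = t;	/* BPF_STX: source must be a matching PTR_TO_BTF_ID or NULL */
	v->task = NULL;	/* clears the kptr */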

For the map in map support, the kptr_off_tab for the outer map is copied from
the inner map's kptr_off_tab. A deep copy was chosen instead of introducing a
refcount to kptr_off_tab, because the copy only needs to be done when
parameterizing using inner_map_fd in the map in map case, hence it would be
unnecessary for all other users.

It is not permitted to use the MAP_FREEZE command or mmap for a BPF map
having a kptr, similar to the bpf_timer case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h     |  29 +++++-
 include/linux/btf.h     |   2 +
 kernel/bpf/btf.c        | 151 +++++++++++++++++++++++++----
 kernel/bpf/map_in_map.c |   5 +-
 kernel/bpf/syscall.c    | 110 ++++++++++++++++++++-
 kernel/bpf/verifier.c   | 207 ++++++++++++++++++++++++++++++++--------
 6 files changed, 442 insertions(+), 62 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 88449fbbe063..f35920d279dd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -155,6 +155,22 @@ struct bpf_map_ops {
 	const struct bpf_iter_seq_info *iter_seq_info;
 };
 
+enum {
+	/* Support at most 8 pointers in a BPF map value */
+	BPF_MAP_VALUE_OFF_MAX = 8,
+};
+
+struct bpf_map_value_off_desc {
+	u32 offset;
+	u32 btf_id;
+	struct btf *btf;
+};
+
+struct bpf_map_value_off {
+	u32 nr_off;
+	struct bpf_map_value_off_desc off[];
+};
+
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -171,6 +187,7 @@ struct bpf_map {
 	u64 map_extra; /* any per-map-type extra fields */
 	u32 map_flags;
 	int spin_lock_off; /* >=0 valid offset, <0 error */
+	struct bpf_map_value_off *kptr_off_tab;
 	int timer_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
@@ -184,7 +201,7 @@ struct bpf_map {
 	char name[BPF_OBJ_NAME_LEN];
 	bool bypass_spec_v1;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
-	/* 14 bytes hole */
+	/* 6 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
@@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
 	return map->timer_off >= 0;
 }
 
+static inline bool map_value_has_kptr(const struct bpf_map *map)
+{
+	return !IS_ERR_OR_NULL(map->kptr_off_tab);
+}
+
 static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 {
 	if (unlikely(map_value_has_spin_lock(map)))
@@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
 void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
 
+struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
+void bpf_map_free_kptr_off_tab(struct bpf_map *map);
+struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
+bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
+
 struct bpf_map *bpf_map_get(u32 ufd);
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 36bc09b8e890..5b578dc81c04 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 int btf_find_timer(const struct btf *btf, const struct btf_type *t);
+struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
+					const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 5b2824332880..9ac9364ef533 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3164,33 +3164,79 @@ static void btf_struct_log(struct btf_verifier_env *env,
 enum {
 	BTF_FIELD_SPIN_LOCK,
 	BTF_FIELD_TIMER,
+	BTF_FIELD_KPTR,
+};
+
+enum {
+	BTF_FIELD_IGNORE = 0,
+	BTF_FIELD_FOUND  = 1,
 };
 
 struct btf_field_info {
+	const struct btf_type *type;
 	u32 off;
 };
 
 static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
-				 u32 off, int sz, struct btf_field_info *info)
+				 u32 off, int sz, struct btf_field_info *info,
+				 int info_cnt, int idx)
 {
 	if (!__btf_type_is_struct(t))
-		return 0;
+		return BTF_FIELD_IGNORE;
 	if (t->size != sz)
-		return 0;
-	if (info->off != -ENOENT)
-		/* only one such field is allowed */
+		return BTF_FIELD_IGNORE;
+	if (idx >= info_cnt)
 		return -E2BIG;
+	info[idx].off = off;
 	info->off = off;
-	return 0;
+	return BTF_FIELD_FOUND;
+}
+
+static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
+			       u32 off, int sz, struct btf_field_info *info,
+			       int info_cnt, int idx)
+{
+	bool kptr_tag = false;
+
+	/* For PTR, sz is always == 8 */
+	if (!btf_type_is_ptr(t))
+		return BTF_FIELD_IGNORE;
+	t = btf_type_by_id(btf, t->type);
+
+	while (btf_type_is_type_tag(t)) {
+		if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off))) {
+			/* repeated tag */
+			if (kptr_tag)
+				return -EEXIST;
+			kptr_tag = true;
+		}
+		/* Look for next tag */
+		t = btf_type_by_id(btf, t->type);
+	}
+	if (!kptr_tag)
+		return BTF_FIELD_IGNORE;
+
+	/* Get the base type */
+	if (btf_type_is_modifier(t))
+		t = btf_type_skip_modifiers(btf, t->type, NULL);
+	/* Only pointer to struct is allowed */
+	if (!__btf_type_is_struct(t))
+		return -EINVAL;
+
+	if (idx >= info_cnt)
+		return -E2BIG;
+	info[idx].type = t;
+	info[idx].off = off;
+	return BTF_FIELD_FOUND;
 }
 
 static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
 				 const char *name, int sz, int align, int field_type,
-				 struct btf_field_info *info)
+				 struct btf_field_info *info, int info_cnt)
 {
 	const struct btf_member *member;
+	int ret, idx = 0;
 	u32 i, off;
-	int ret;
 
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
@@ -3210,24 +3256,33 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
 		switch (field_type) {
 		case BTF_FIELD_SPIN_LOCK:
 		case BTF_FIELD_TIMER:
-			ret = btf_find_field_struct(btf, member_type, off, sz, info);
+			ret = btf_find_field_struct(btf, member_type, off, sz, info, info_cnt, idx);
+			if (ret < 0)
+				return ret;
+			break;
+		case BTF_FIELD_KPTR:
+			ret = btf_find_field_kptr(btf, member_type, off, sz, info, info_cnt, idx);
 			if (ret < 0)
 				return ret;
 			break;
 		default:
 			return -EFAULT;
 		}
+
+		if (ret == BTF_FIELD_IGNORE)
+			continue;
+		++idx;
 	}
-	return 0;
+	return idx;
 }
 
 static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
 				const char *name, int sz, int align, int field_type,
-				struct btf_field_info *info)
+				struct btf_field_info *info, int info_cnt)
 {
 	const struct btf_var_secinfo *vsi;
+	int ret, idx = 0;
 	u32 i, off;
-	int ret;
 
 	for_each_vsi(i, t, vsi) {
 		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
@@ -3245,25 +3300,34 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
 		switch (field_type) {
 		case BTF_FIELD_SPIN_LOCK:
 		case BTF_FIELD_TIMER:
-			ret = btf_find_field_struct(btf, var_type, off, sz, info);
+			ret = btf_find_field_struct(btf, var_type, off, sz, info, info_cnt, idx);
+			if (ret < 0)
+				return ret;
+			break;
+		case BTF_FIELD_KPTR:
+			ret = btf_find_field_kptr(btf, var_type, off, sz, info, info_cnt, idx);
 			if (ret < 0)
 				return ret;
 			break;
 		default:
 			return -EFAULT;
 		}
+
+		if (ret == BTF_FIELD_IGNORE)
+			continue;
+		++idx;
 	}
-	return 0;
+	return idx;
 }
 
 static int btf_find_field(const struct btf *btf, const struct btf_type *t,
 			  const char *name, int sz, int align, int field_type,
-			  struct btf_field_info *info)
+			  struct btf_field_info *info, int info_cnt)
 {
 	if (__btf_type_is_struct(t))
-		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
+		return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt);
 	else if (btf_type_is_datasec(t))
-		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
+		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt);
 	return -EINVAL;
 }
 
@@ -3279,7 +3343,7 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
 	ret = btf_find_field(btf, t, "bpf_spin_lock",
 			     sizeof(struct bpf_spin_lock),
 			     __alignof__(struct bpf_spin_lock),
-			     BTF_FIELD_SPIN_LOCK, &info);
+			     BTF_FIELD_SPIN_LOCK, &info, 1);
 	if (ret < 0)
 		return ret;
 	return info.off;
@@ -3293,12 +3357,61 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t)
 	ret = btf_find_field(btf, t, "bpf_timer",
 			     sizeof(struct bpf_timer),
 			     __alignof__(struct bpf_timer),
-			     BTF_FIELD_TIMER, &info);
+			     BTF_FIELD_TIMER, &info, 1);
 	if (ret < 0)
 		return ret;
 	return info.off;
 }
 
+struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
+					const struct btf_type *t)
+{
+	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
+	struct bpf_map_value_off *tab;
+	int ret, i, nr_off;
+
+	/* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
+	BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
+
+	ret = btf_find_field(btf, t, NULL, sizeof(u64), __alignof__(u64),
+			     BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (!ret)
+		return 0;
+
+	nr_off = ret;
+	tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
+	if (!tab)
+		return ERR_PTR(-ENOMEM);
+
+	tab->nr_off = 0;
+	for (i = 0; i < nr_off; i++) {
+		const struct btf_type *t;
+		struct btf *off_btf;
+		s32 id;
+
+		t = info_arr[i].type;
+		id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
+				     &off_btf);
+		if (id < 0) {
+			ret = id;
+			goto end;
+		}
+
+		tab->off[i].offset = info_arr[i].off;
+		tab->off[i].btf_id = id;
+		tab->off[i].btf = off_btf;
+		tab->nr_off = i + 1;
+	}
+	return tab;
+end:
+	while (tab->nr_off--)
+		btf_put(tab->off[tab->nr_off].btf);
+	kfree(tab);
+	return ERR_PTR(ret);
+}
+
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
 			      u32 type_id, void *data, u8 bits_offset,
 			      struct btf_show *show)
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 5cd8f5277279..135205d0d560 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	inner_map_meta->max_entries = inner_map->max_entries;
 	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
 	inner_map_meta->timer_off = inner_map->timer_off;
+	inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
 	if (inner_map->btf) {
 		btf_get(inner_map->btf);
 		inner_map_meta->btf = inner_map->btf;
@@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 
 void bpf_map_meta_free(struct bpf_map *map_meta)
 {
+	bpf_map_free_kptr_off_tab(map_meta);
 	btf_put(map_meta->btf);
 	kfree(map_meta);
 }
@@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
 		meta0->key_size == meta1->key_size &&
 		meta0->value_size == meta1->value_size &&
 		meta0->timer_off == meta1->timer_off &&
-		meta0->map_flags == meta1->map_flags;
+		meta0->map_flags == meta1->map_flags &&
+		bpf_map_equal_kptr_off_tab(meta0, meta1);
 }
 
 void *bpf_map_fd_get_ptr(struct bpf_map *map,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9beb585be5a6..87263b07f40b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -6,6 +6,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
 #include <linux/bpf_verifier.h>
+#include <linux/bsearch.h>
 #include <linux/btf.h>
 #include <linux/syscalls.h>
 #include <linux/slab.h>
@@ -472,12 +473,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 }
 #endif
 
+static int bpf_map_kptr_off_cmp(const void *a, const void *b)
+{
+	const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
+
+	if (off_desc1->offset < off_desc2->offset)
+		return -1;
+	else if (off_desc1->offset > off_desc2->offset)
+		return 1;
+	return 0;
+}
+
+struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
+{
+	/* Since members are iterated in btf_find_field in increasing order,
+	 * offsets appended to kptr_off_tab are in increasing order, so we can
+	 * do bsearch to find exact match.
+	 */
+	struct bpf_map_value_off *tab;
+
+	if (!map_value_has_kptr(map))
+		return NULL;
+	tab = map->kptr_off_tab;
+	return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
+}
+
+void bpf_map_free_kptr_off_tab(struct bpf_map *map)
+{
+	struct bpf_map_value_off *tab = map->kptr_off_tab;
+	int i;
+
+	if (!map_value_has_kptr(map))
+		return;
+	for (i = 0; i < tab->nr_off; i++) {
+		struct btf *btf = tab->off[i].btf;
+
+		btf_put(btf);
+	}
+	kfree(tab);
+	map->kptr_off_tab = NULL;
+}
+
+struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
+{
+	struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
+	int size, i, ret;
+
+	if (!map_value_has_kptr(map))
+		return ERR_PTR(-ENOENT);
+	/* Do a deep copy of the kptr_off_tab */
+	for (i = 0; i < tab->nr_off; i++)
+		btf_get(tab->off[i].btf);
+
+	size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
+	new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!new_tab) {
+		ret = -ENOMEM;
+		goto end;
+	}
+	memcpy(new_tab, tab, size);
+	return new_tab;
+end:
+	while (i--)
+		btf_put(tab->off[i].btf);
+	return ERR_PTR(ret);
+}
+
+bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
+{
+	struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
+	bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
+	int size;
+
+	if (!a_has_kptr && !b_has_kptr)
+		return true;
+	if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))
+		return false;
+	if (tab_a->nr_off != tab_b->nr_off)
+		return false;
+	size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
+	return !memcmp(tab_a, tab_b, size);
+}
+
 /* called from workqueue */
 static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 
 	security_bpf_map_free(map);
+	bpf_map_free_kptr_off_tab(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
@@ -639,7 +723,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 	int err;
 
 	if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
-	    map_value_has_timer(map))
+	    map_value_has_timer(map) || map_value_has_kptr(map))
 		return -ENOTSUPP;
 
 	if (!(vma->vm_flags & VM_SHARED))
@@ -819,9 +903,29 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 			return -EOPNOTSUPP;
 	}
 
-	if (map->ops->map_check_btf)
+	map->kptr_off_tab = btf_find_kptr(btf, value_type);
+	if (map_value_has_kptr(map)) {
+		if (map->map_flags & BPF_F_RDONLY_PROG) {
+			ret = -EACCES;
+			goto free_map_tab;
+		}
+		if (map->map_type != BPF_MAP_TYPE_HASH &&
+		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY) {
+			ret = -EOPNOTSUPP;
+			goto free_map_tab;
+		}
+	}
+
+	if (map->ops->map_check_btf) {
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
+		if (ret < 0)
+			goto free_map_tab;
+	}
 
+	return ret;
+free_map_tab:
+	bpf_map_free_kptr_off_tab(map);
 	return ret;
 }
 
@@ -1638,7 +1742,7 @@ static int map_freeze(const union bpf_attr *attr)
 		return PTR_ERR(map);
 
 	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
-	    map_value_has_timer(map)) {
+	    map_value_has_timer(map) || map_value_has_kptr(map)) {
 		fdput(f);
 		return -ENOTSUPP;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cf92f9c01556..881d1381757b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3469,6 +3469,143 @@ static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno,
 	return 0;
 }
 
+static int __check_ptr_off_reg(struct bpf_verifier_env *env,
+			       const struct bpf_reg_state *reg, int regno,
+			       bool fixed_off_ok)
+{
+	/* Access to this pointer-typed register or passing it to a helper
+	 * is only allowed in its original, unmodified form.
+	 */
+
+	if (reg->off < 0) {
+		verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
+			reg_type_str(env, reg->type), regno, reg->off);
+		return -EACCES;
+	}
+
+	if (!fixed_off_ok && reg->off) {
+		verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
+			reg_type_str(env, reg->type), regno, reg->off);
+		return -EACCES;
+	}
+
+	if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
+		char tn_buf[48];
+
+		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
+		verbose(env, "variable %s access var_off=%s disallowed\n",
+			reg_type_str(env, reg->type), tn_buf);
+		return -EACCES;
+	}
+
+	return 0;
+}
+
+int check_ptr_off_reg(struct bpf_verifier_env *env,
+		      const struct bpf_reg_state *reg, int regno)
+{
+	return __check_ptr_off_reg(env, reg, regno, false);
+}
+
+static int map_kptr_match_type(struct bpf_verifier_env *env,
+			       struct bpf_map_value_off_desc *off_desc,
+			       struct bpf_reg_state *reg, u32 regno)
+{
+	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
+	const char *reg_name = "";
+
+	if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
+		goto bad_type;
+
+	if (!btf_is_kernel(reg->btf)) {
+		verbose(env, "R%d must point to kernel BTF\n", regno);
+		return -EINVAL;
+	}
+	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
+	reg_name = kernel_type_name(reg->btf, reg->btf_id);
+
+	if (__check_ptr_off_reg(env, reg, regno, true))
+		return -EACCES;
+
+	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
+				  off_desc->btf, off_desc->btf_id))
+		goto bad_type;
+	return 0;
+bad_type:
+	verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
+		reg_type_str(env, reg->type), reg_name);
+	verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
+	return -EINVAL;
+}
+
+/* Returns an error, or 0 if ignoring the access, or 1 if register state was
+ * updated, in which case later updates must be skipped.
+ */
+static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
+				 int off, int size, int value_regno,
+				 enum bpf_access_type t, int insn_idx)
+{
+	struct bpf_reg_state *reg = reg_state(env, regno), *val_reg;
+	struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
+	struct bpf_map_value_off_desc *off_desc;
+	int insn_class = BPF_CLASS(insn->code);
+	struct bpf_map *map = reg->map_ptr;
+
+	/* Things we already checked for in check_map_access:
+	 *  - Reject cases where variable offset may touch BTF ID pointer
+	 *  - size of access (must be BPF_DW)
+	 *  - off_desc->offset == off + reg->var_off.value
+	 */
+	if (!tnum_is_const(reg->var_off))
+		return 0;
+
+	off_desc = bpf_map_kptr_off_contains(map, off + reg->var_off.value);
+	if (!off_desc)
+		return 0;
+
+	if (WARN_ON_ONCE(size != bpf_size_to_bytes(BPF_DW)))
+		return -EACCES;
+
+	if (BPF_MODE(insn->code) != BPF_MEM)
+		goto end;
+
+	if (!env->bpf_capable) {
+		verbose(env, "kptr access only allowed for CAP_BPF and CAP_SYS_ADMIN\n");
+		return -EPERM;
+	}
+
+	if (insn_class == BPF_LDX) {
+		if (WARN_ON_ONCE(value_regno < 0))
+			return -EFAULT;
+		val_reg = reg_state(env, value_regno);
+		/* We can simply mark the value_regno receiving the pointer
+		 * value from map as PTR_TO_BTF_ID, with the correct type.
+		 */
+		mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
+				off_desc->btf_id, PTR_MAYBE_NULL);
+		val_reg->id = ++env->id_gen;
+	} else if (insn_class == BPF_STX) {
+		if (WARN_ON_ONCE(value_regno < 0))
+			return -EFAULT;
+		val_reg = reg_state(env, value_regno);
+		if (!register_is_null(val_reg) &&
+		    map_kptr_match_type(env, off_desc, val_reg, value_regno))
+			return -EACCES;
+	} else if (insn_class == BPF_ST) {
+		if (insn->imm) {
+			verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
+				off_desc->offset);
+			return -EACCES;
+		}
+	} else {
+		goto end;
+	}
+	return 1;
+end:
+	verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
+	return -EACCES;
+}
+
 /* check read/write into a map element with possible variable offset */
 static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			    int off, int size, bool zero_size_allowed)
@@ -3507,6 +3644,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 	}
+	if (map_value_has_kptr(map)) {
+		struct bpf_map_value_off *tab = map->kptr_off_tab;
+		int i;
+
+		for (i = 0; i < tab->nr_off; i++) {
+			u32 p = tab->off[i].offset;
+
+			if (reg->smin_value + off < p + sizeof(u64) &&
+			    p < reg->umax_value + off + size) {
+				if (!tnum_is_const(reg->var_off)) {
+					verbose(env, "kptr access cannot have variable offset\n");
+					return -EACCES;
+				}
+				if (p != off + reg->var_off.value) {
+					verbose(env, "kptr access misaligned expected=%u off=%llu\n",
+						p, off + reg->var_off.value);
+					return -EACCES;
+				}
+				if (size != bpf_size_to_bytes(BPF_DW)) {
+					verbose(env, "kptr access size must be BPF_DW\n");
+					return -EACCES;
+				}
+				break;
+			}
+		}
+	}
 	return err;
 }
 
@@ -3980,44 +4143,6 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
 }
 #endif
 
-static int __check_ptr_off_reg(struct bpf_verifier_env *env,
-			       const struct bpf_reg_state *reg, int regno,
-			       bool fixed_off_ok)
-{
-	/* Access to this pointer-typed register or passing it to a helper
-	 * is only allowed in its original, unmodified form.
-	 */
-
-	if (reg->off < 0) {
-		verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
-			reg_type_str(env, reg->type), regno, reg->off);
-		return -EACCES;
-	}
-
-	if (!fixed_off_ok && reg->off) {
-		verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
-			reg_type_str(env, reg->type), regno, reg->off);
-		return -EACCES;
-	}
-
-	if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
-		char tn_buf[48];
-
-		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
-		verbose(env, "variable %s access var_off=%s disallowed\n",
-			reg_type_str(env, reg->type), tn_buf);
-		return -EACCES;
-	}
-
-	return 0;
-}
-
-int check_ptr_off_reg(struct bpf_verifier_env *env,
-		      const struct bpf_reg_state *reg, int regno)
-{
-	return __check_ptr_off_reg(env, reg, regno, false);
-}
-
 static int __check_buffer_access(struct bpf_verifier_env *env,
 				 const char *buf_info,
 				 const struct bpf_reg_state *reg,
@@ -4421,6 +4546,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		if (err)
 			return err;
 		err = check_map_access(env, regno, off, size, false);
+		err = err ?: check_map_kptr_access(env, regno, off, size, value_regno, t, insn_idx);
+		if (err < 0)
+			return err;
+		/* if err == 0, check_map_kptr_access ignored the access */
 		if (!err && t == BPF_READ && value_regno >= 0) {
 			struct bpf_map *map = reg->map_ptr;
 
@@ -4442,6 +4571,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 				mark_reg_unknown(env, regs, value_regno);
 			}
 		}
+		/* clear err == 1 */
+		err = err < 0 ? err : 0;
 	} else if (base_type(reg->type) == PTR_TO_MEM) {
 		bool rdonly_mem = type_is_rdonly_mem(reg->type);
 
-- 
2.35.1



* [PATCH bpf-next v2 04/15] bpf: Allow storing referenced kptr in map
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (2 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-19 18:24   ` Alexei Starovoitov
  2022-03-17 11:59 ` [PATCH bpf-next v2 05/15] bpf: Allow storing percpu " Kumar Kartikeya Dwivedi
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Extending the code in the previous commit, introduce referenced kptr support,
which needs to be tagged using the 'kptr_ref' tag instead. Unlike unreferenced
kptr, referenced kptr have a lot more restrictions. In addition to the type
matching, only the newly introduced bpf_kptr_xchg helper is allowed to modify
the map value at that offset. This transfers the referenced pointer being
stored into the map, releases the program's reference state for it, and
returns the old value, creating new reference state for the returned pointer.

Similar to the unreferenced pointer case, the return value for this case will
also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer must
either be eventually released by calling the corresponding release function,
or be transferred into another map.

It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
the value, and obtain the old value if any.

BPF_LDX, BPF_STX, and BPF_ST cannot access a referenced kptr. A future commit
will permit using BPF_LDX for such pointers, while attempting to make it safe,
since the lifetime of the object won't be guaranteed.

There are valid reasons to enforce the restriction of permitting only
bpf_kptr_xchg to operate on a referenced kptr. The pointer value must be
consistent in the face of concurrent modification, and any prior value
contained in the map must also be released before a new one is moved into the
map. To ensure proper transfer of this ownership, bpf_kptr_xchg returns the
old value, which the verifier requires the user to either free or move into
another map, and releases the reference held for the pointer being moved in.

In the future, direct BPF_XCHG instruction may also be permitted to work
like bpf_kptr_xchg helper.
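
For example, a referenced field might be declared and manipulated as sketched
below (the __kptr_ref macro name, the acquired pointer and the release kfunc
are placeholders; only the 'kptr_ref' tag string is defined by this patch):

	#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))

	struct map_value {
		struct task_struct __kptr_ref *ref_task;
	};

	struct task_struct *old;

	/* Direct loads/stores of v->ref_task are rejected; the only way to
	 * read or update the field is through the exchange helper.
	 */
	old = bpf_kptr_xchg(&v->ref_task, acquired_task);
	if (old)
		task_struct_release(old);	/* placeholder release kfunc */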

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h            |   7 ++
 include/uapi/linux/bpf.h       |  12 +++
 kernel/bpf/btf.c               |  20 +++-
 kernel/bpf/helpers.c           |  21 +++++
 kernel/bpf/verifier.c          | 167 +++++++++++++++++++++++++++++----
 tools/include/uapi/linux/bpf.h |  12 +++
 6 files changed, 219 insertions(+), 20 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f35920d279dd..702aa882e4a3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -160,10 +160,15 @@ enum {
 	BPF_MAP_VALUE_OFF_MAX = 8,
 };
 
+enum {
+	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
+};
+
 struct bpf_map_value_off_desc {
 	u32 offset;
 	u32 btf_id;
 	struct btf *btf;
+	int flags;
 };
 
 struct bpf_map_value_off {
@@ -413,6 +418,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_STACK,	/* pointer to stack */
 	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
 	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
+	ARG_PTR_TO_KPTR,	/* pointer to kptr */
 	__BPF_ARG_TYPE_MAX,
 
 	/* Extended arg_types. */
@@ -422,6 +428,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_SOCKET_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
 	ARG_PTR_TO_ALLOC_MEM_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_ALLOC_MEM,
 	ARG_PTR_TO_STACK_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
+	ARG_PTR_TO_BTF_ID_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
 
 	/* This must be the last entry. Its purpose is to ensure the enum is
 	 * wide enough to hold the higher bits reserved for bpf_type_flag.
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 99fab54ae9c0..d45568746e79 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5129,6 +5129,17 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void *bpf_kptr_xchg(void *map_value, void *ptr)
+ *	Description
+ *		Exchange kptr at pointer *map_value* with *ptr*, and return the
+ *		old value. *ptr* can be NULL, otherwise it must be a referenced
+ *		pointer which will be released when this helper is called.
+ *	Return
+ *		The old value of kptr (which can be NULL). The returned pointer
+ *		if not NULL, is a reference which must be released using its
+ *		corresponding release function, or moved into a BPF map before
+ *		program exit.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5325,6 +5336,7 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(kptr_xchg),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 9ac9364ef533..7b4179667bf1 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3175,6 +3175,7 @@ enum {
 struct btf_field_info {
 	const struct btf_type *type;
 	u32 off;
+	int flags;
 };
 
 static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
@@ -3196,7 +3197,8 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 			       u32 off, int sz, struct btf_field_info *info,
 			       int info_cnt, int idx)
 {
-	bool kptr_tag = false;
+	bool kptr_tag = false, kptr_ref_tag = false;
+	int tags;
 
 	/* For PTR, sz is always == 8 */
 	if (!btf_type_is_ptr(t))
@@ -3209,12 +3211,21 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 			if (kptr_tag)
 				return -EEXIST;
 			kptr_tag = true;
+		} else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off))) {
+			/* repeated tag */
+			if (kptr_ref_tag)
+				return -EEXIST;
+			kptr_ref_tag = true;
 		}
 		/* Look for next tag */
 		t = btf_type_by_id(btf, t->type);
 	}
-	if (!kptr_tag)
+
+	tags = kptr_tag + kptr_ref_tag;
+	if (!tags)
 		return BTF_FIELD_IGNORE;
+	else if (tags > 1)
+		return -EINVAL;
 
 	/* Get the base type */
 	if (btf_type_is_modifier(t))
@@ -3225,6 +3236,10 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 
 	if (idx >= info_cnt)
 		return -E2BIG;
+	if (kptr_ref_tag)
+		info[idx].flags = BPF_MAP_VALUE_OFF_F_REF;
+	else
+		info[idx].flags = 0;
 	info[idx].type = t;
 	info[idx].off = off;
 	return BTF_FIELD_FOUND;
@@ -3402,6 +3417,7 @@ struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
 		tab->off[i].offset = info_arr[i].off;
 		tab->off[i].btf_id = id;
 		tab->off[i].btf = off_btf;
+		tab->off[i].flags = info_arr[i].flags;
 		tab->nr_off = i + 1;
 	}
 	return tab;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 315053ef6a75..cb717bfbda19 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1374,6 +1374,25 @@ void bpf_timer_cancel_and_free(void *val)
 	kfree(t);
 }
 
+BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
+{
+	unsigned long *kptr = map_value;
+
+	return xchg(kptr, (unsigned long)ptr);
+}
+
+static u32 bpf_kptr_xchg_btf_id;
+
+const struct bpf_func_proto bpf_kptr_xchg_proto = {
+	.func        = bpf_kptr_xchg,
+	.gpl_only    = false,
+	.ret_type    = RET_PTR_TO_BTF_ID_OR_NULL,
+	.ret_btf_id  = &bpf_kptr_xchg_btf_id,
+	.arg1_type   = ARG_PTR_TO_KPTR,
+	.arg2_type   = ARG_PTR_TO_BTF_ID_OR_NULL,
+	.arg2_btf_id = &bpf_kptr_xchg_btf_id,
+};
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
@@ -1452,6 +1471,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_timer_start_proto;
 	case BPF_FUNC_timer_cancel:
 		return &bpf_timer_cancel_proto;
+	case BPF_FUNC_kptr_xchg:
+		return &bpf_kptr_xchg_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 881d1381757b..f8738054aa52 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -257,6 +257,7 @@ struct bpf_call_arg_meta {
 	struct btf *ret_btf;
 	u32 ret_btf_id;
 	u32 subprogno;
+	struct bpf_map_value_off_desc *kptr_off_desc;
 };
 
 struct btf *btf_vmlinux;
@@ -479,7 +480,8 @@ static bool is_release_function(enum bpf_func_id func_id)
 {
 	return func_id == BPF_FUNC_sk_release ||
 	       func_id == BPF_FUNC_ringbuf_submit ||
-	       func_id == BPF_FUNC_ringbuf_discard;
+	       func_id == BPF_FUNC_ringbuf_discard ||
+	       func_id == BPF_FUNC_kptr_xchg;
 }
 
 static bool may_be_acquire_function(enum bpf_func_id func_id)
@@ -488,7 +490,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id)
 		func_id == BPF_FUNC_sk_lookup_udp ||
 		func_id == BPF_FUNC_skc_lookup_tcp ||
 		func_id == BPF_FUNC_map_lookup_elem ||
-	        func_id == BPF_FUNC_ringbuf_reserve;
+		func_id == BPF_FUNC_ringbuf_reserve ||
+		func_id == BPF_FUNC_kptr_xchg;
 }
 
 static bool is_acquire_function(enum bpf_func_id func_id,
@@ -499,7 +502,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
 	if (func_id == BPF_FUNC_sk_lookup_tcp ||
 	    func_id == BPF_FUNC_sk_lookup_udp ||
 	    func_id == BPF_FUNC_skc_lookup_tcp ||
-	    func_id == BPF_FUNC_ringbuf_reserve)
+	    func_id == BPF_FUNC_ringbuf_reserve ||
+	    func_id == BPF_FUNC_kptr_xchg)
 		return true;
 
 	if (func_id == BPF_FUNC_map_lookup_elem &&
@@ -3509,10 +3513,12 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
 
 static int map_kptr_match_type(struct bpf_verifier_env *env,
 			       struct bpf_map_value_off_desc *off_desc,
-			       struct bpf_reg_state *reg, u32 regno)
+			       struct bpf_reg_state *reg, u32 regno,
+			       bool ref_ptr)
 {
 	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
 	const char *reg_name = "";
+	bool fixed_off_ok = true;
 
 	if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
 		goto bad_type;
@@ -3524,7 +3530,26 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
 	reg_name = kernel_type_name(reg->btf, reg->btf_id);
 
-	if (__check_ptr_off_reg(env, reg, regno, true))
+	if (ref_ptr) {
+		if (!reg->ref_obj_id) {
+			verbose(env, "R%d must be referenced %s%s\n", regno,
+				reg_type_str(env, PTR_TO_BTF_ID), targ_name);
+			return -EACCES;
+		}
+		/* reg->off can be used to store pointer to a certain type formed by
+		 * incrementing pointer of a parent structure the object is embedded in,
+		 * e.g. map may expect unreferenced struct path *, and user should be
+		 * allowed a store using &file->f_path. However, in the case of
+		 * referenced pointer, we cannot do this, because the reference is only
+		 * for the parent structure, not its embedded object(s), and because
+		 * the transfer of ownership happens for the original pointer to and
+		 * from the map (before its eventual release).
+		 */
+		if (reg->off)
+			fixed_off_ok = false;
+	}
+	/* var_off is rejected by __check_ptr_off_reg for PTR_TO_BTF_ID */
+	if (__check_ptr_off_reg(env, reg, regno, fixed_off_ok))
 		return -EACCES;
 
 	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
@@ -3550,6 +3575,7 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 	struct bpf_map_value_off_desc *off_desc;
 	int insn_class = BPF_CLASS(insn->code);
 	struct bpf_map *map = reg->map_ptr;
+	bool ref_ptr = false;
 
 	/* Things we already checked for in check_map_access:
 	 *  - Reject cases where variable offset may touch BTF ID pointer
@@ -3574,9 +3600,15 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 		return -EPERM;
 	}
 
+	ref_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_REF;
+
 	if (insn_class == BPF_LDX) {
 		if (WARN_ON_ONCE(value_regno < 0))
 			return -EFAULT;
+		if (ref_ptr) {
+			verbose(env, "accessing referenced kptr disallowed\n");
+			return -EACCES;
+		}
 		val_reg = reg_state(env, value_regno);
 		/* We can simply mark the value_regno receiving the pointer
 		 * value from map as PTR_TO_BTF_ID, with the correct type.
@@ -3587,11 +3619,19 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 	} else if (insn_class == BPF_STX) {
 		if (WARN_ON_ONCE(value_regno < 0))
 			return -EFAULT;
+		if (ref_ptr) {
+			verbose(env, "accessing referenced kptr disallowed\n");
+			return -EACCES;
+		}
 		val_reg = reg_state(env, value_regno);
 		if (!register_is_null(val_reg) &&
-		    map_kptr_match_type(env, off_desc, val_reg, value_regno))
+		    map_kptr_match_type(env, off_desc, val_reg, value_regno, false))
 			return -EACCES;
 	} else if (insn_class == BPF_ST) {
+		if (ref_ptr) {
+			verbose(env, "accessing referenced kptr disallowed\n");
+			return -EACCES;
+		}
 		if (insn->imm) {
 			verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
 				off_desc->offset);
@@ -5265,6 +5305,63 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_kptr_func(struct bpf_verifier_env *env, int regno,
+			     struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	struct bpf_map_value_off_desc *off_desc;
+	struct bpf_map *map_ptr = reg->map_ptr;
+	u32 kptr_off;
+	int ret;
+
+	if (!env->bpf_capable) {
+		verbose(env, "kptr access only allowed for CAP_BPF and CAP_SYS_ADMIN\n");
+		return -EPERM;
+	}
+	if (!tnum_is_const(reg->var_off)) {
+		verbose(env,
+			"R%d doesn't have constant offset. kptr has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map_ptr->btf) {
+		verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
+			map_ptr->name);
+		return -EINVAL;
+	}
+	if (!map_value_has_kptr(map_ptr)) {
+		ret = PTR_ERR(map_ptr->kptr_off_tab);
+		if (ret == -E2BIG)
+			verbose(env, "map '%s' has more than %d kptr\n", map_ptr->name,
+				BPF_MAP_VALUE_OFF_MAX);
+		else if (ret == -EEXIST)
+			verbose(env, "map '%s' has repeating kptr BTF tags\n", map_ptr->name);
+		else
+			verbose(env, "map '%s' has no valid kptr\n", map_ptr->name);
+		return -EINVAL;
+	}
+
+	meta->map_ptr = map_ptr;
+	/* Check access for BPF_WRITE */
+	meta->raw_mode = true;
+	ret = check_helper_mem_access(env, regno, sizeof(u64), false, meta);
+	if (ret < 0)
+		return ret;
+
+	kptr_off = reg->off + reg->var_off.value;
+	off_desc = bpf_map_kptr_off_contains(map_ptr, kptr_off);
+	if (!off_desc) {
+		verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
+		return -EACCES;
+	}
+	if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
+		verbose(env, "off=%d kptr isn't referenced kptr\n", kptr_off);
+		return -EACCES;
+	}
+	meta->kptr_off_desc = off_desc;
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return base_type(type) == ARG_PTR_TO_MEM ||
@@ -5400,6 +5497,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
 static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
 static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -5427,11 +5525,13 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_STACK]		= &stack_ptr_types,
 	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
 	[ARG_PTR_TO_TIMER]		= &timer_types,
+	[ARG_PTR_TO_KPTR]		= &kptr_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
 			  enum bpf_arg_type arg_type,
-			  const u32 *arg_btf_id)
+			  const u32 *arg_btf_id,
+			  struct bpf_call_arg_meta *meta)
 {
 	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
 	enum bpf_reg_type expected, type = reg->type;
@@ -5484,8 +5584,15 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
 			arg_btf_id = compatible->btf_id;
 		}
 
-		if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
-					  btf_vmlinux, *arg_btf_id)) {
+		if (meta->func_id == BPF_FUNC_kptr_xchg) {
+			if (!meta->kptr_off_desc) {
+				verbose(env, "verifier internal error: meta.kptr_off_desc unset\n");
+				return -EFAULT;
+			}
+			if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno, true))
+				return -EACCES;
+		} else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
+						 btf_vmlinux, *arg_btf_id)) {
 			verbose(env, "R%d is of type %s but %s is expected\n",
 				regno, kernel_type_name(reg->btf, reg->btf_id),
 				kernel_type_name(btf_vmlinux, *arg_btf_id));
@@ -5595,7 +5702,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 		 */
 		goto skip_type_check;
 
-	err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg]);
+	err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg], meta);
 	if (err)
 		return err;
 
@@ -5760,6 +5867,14 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "string is not zero-terminated\n");
 			return -EINVAL;
 		}
+	} else if (arg_type == ARG_PTR_TO_KPTR) {
+		if (meta->func_id == BPF_FUNC_kptr_xchg) {
+			if (process_kptr_func(env, regno, meta))
+				return -EACCES;
+		} else {
+			verbose(env, "verifier internal error\n");
+			return -EFAULT;
+		}
 	}
 
 	return err;
@@ -6102,10 +6217,10 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
-		if (fn->arg_type[i] == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
+		if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
 			return false;
 
-		if (fn->arg_type[i] != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
+		if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
 			return false;
 	}
 
@@ -6830,7 +6945,15 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 	}
 
 	if (is_release_function(func_id)) {
-		err = release_reference(env, meta.ref_obj_id);
+		err = -EINVAL;
+		if (meta.ref_obj_id)
+			err = release_reference(env, meta.ref_obj_id);
+		/* Only bpf_kptr_xchg is a release function that accepts a
+		 * possibly NULL reg, hence meta.ref_obj_id can only be unset
+		 * for it.
+		 */
+		else if (func_id == BPF_FUNC_kptr_xchg)
+			err = 0;
 		if (err) {
 			verbose(env, "func %s#%d reference has not been acquired before\n",
 				func_id_name(func_id), func_id);
@@ -6963,21 +7086,29 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			regs[BPF_REG_0].btf_id = meta.ret_btf_id;
 		}
 	} else if (base_type(ret_type) == RET_PTR_TO_BTF_ID) {
+		struct btf *ret_btf;
 		int ret_btf_id;
 
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
-		ret_btf_id = *fn->ret_btf_id;
+		if (func_id == BPF_FUNC_kptr_xchg) {
+			if (!meta.kptr_off_desc) {
+				verbose(env, "verifier internal error: meta.kptr_off_desc unset\n");
+				return -EFAULT;
+			}
+			ret_btf = meta.kptr_off_desc->btf;
+			ret_btf_id = meta.kptr_off_desc->btf_id;
+		} else {
+			ret_btf = btf_vmlinux;
+			ret_btf_id = *fn->ret_btf_id;
+		}
 		if (ret_btf_id == 0) {
 			verbose(env, "invalid return type %u of func %s#%d\n",
 				base_type(ret_type), func_id_name(func_id),
 				func_id);
 			return -EINVAL;
 		}
-		/* current BPF helper definitions are only coming from
-		 * built-in code with type IDs from  vmlinux BTF
-		 */
-		regs[BPF_REG_0].btf = btf_vmlinux;
+		regs[BPF_REG_0].btf = ret_btf;
 		regs[BPF_REG_0].btf_id = ret_btf_id;
 	} else {
 		verbose(env, "unknown return type %u of func %s#%d\n",
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 99fab54ae9c0..d45568746e79 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5129,6 +5129,17 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void *bpf_kptr_xchg(void *map_value, void *ptr)
+ *	Description
+ *		Exchange kptr at pointer *map_value* with *ptr*, and return the
+ *		old value. *ptr* can be NULL, otherwise it must be a referenced
+ *		pointer which will be released when this helper is called.
+ *	Return
+ *		The old value of kptr (which can be NULL). If not NULL, the
+ *		returned pointer is a reference which must be released using its
+ *		corresponding release function, or moved into a BPF map before
+ *		program exit.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5325,6 +5336,7 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(kptr_xchg),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.35.1
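
A minimal usage sketch of the new helper from a BPF program follows; the
'struct foo' type and foo_release() kfunc are illustrative stand-ins, not
part of this patch. The map value declares the referenced kptr with the
'kptr_ref' BTF type tag:

	struct map_value {
		struct foo __attribute__((btf_type_tag("kptr_ref"))) *ptr;
	};

	/* v = bpf_map_lookup_elem(&my_map, &key); new = acquired struct foo * */
	old = bpf_kptr_xchg(&v->ptr, new);	/* reference to 'new' moves into the map */
	if (old)
		foo_release(old);		/* returned reference must be released or stored */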


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 05/15] bpf: Allow storing percpu kptr in map
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (3 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 04/15] bpf: Allow storing referenced " Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-19 18:30   ` Alexei Starovoitov
  2022-03-17 11:59 ` [PATCH bpf-next v2 06/15] bpf: Allow storing user " Kumar Kartikeya Dwivedi
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Make adjustments to the code to allow storing percpu PTR_TO_BTF_ID in a
map. Similar to the 'kptr_ref' tag, a new 'kptr_percpu' tag allows
marking pointer types that accept stores of such register types. On load,
the verifier marks the destination register as having type PTR_TO_BTF_ID
| MEM_PERCPU | PTR_MAYBE_NULL.
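
As an illustration (not part of this patch), a map value embedding a
percpu kptr would be tagged roughly as follows, with 'struct foo'
standing in for a supported percpu object type:

	struct map_value {
		struct foo __attribute__((btf_type_tag("kptr_percpu"))) *pc;
	};

A BPF_LDX of such a field then yields a percpu pointer in the destination
register, marked as described above.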

Cc: Hao Luo <haoluo@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h   |  3 ++-
 kernel/bpf/btf.c      | 13 ++++++++++---
 kernel/bpf/verifier.c | 26 +++++++++++++++++++++-----
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 702aa882e4a3..433f5cb161cf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -161,7 +161,8 @@ enum {
 };
 
 enum {
-	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
+	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
+	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),
 };
 
 struct bpf_map_value_off_desc {
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 7b4179667bf1..04d604931f59 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3197,7 +3197,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 			       u32 off, int sz, struct btf_field_info *info,
 			       int info_cnt, int idx)
 {
-	bool kptr_tag = false, kptr_ref_tag = false;
+	bool kptr_tag = false, kptr_ref_tag = false, kptr_percpu_tag = false;
 	int tags;
 
 	/* For PTR, sz is always == 8 */
@@ -3216,12 +3216,17 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 			if (kptr_ref_tag)
 				return -EEXIST;
 			kptr_ref_tag = true;
+		} else if (!strcmp("kptr_percpu", __btf_name_by_offset(btf, t->name_off))) {
+			/* repeated tag */
+			if (kptr_percpu_tag)
+				return -EEXIST;
+			kptr_percpu_tag = true;
 		}
 		/* Look for next tag */
 		t = btf_type_by_id(btf, t->type);
 	}
 
-	tags = kptr_tag + kptr_ref_tag;
+	tags = kptr_tag + kptr_ref_tag + kptr_percpu_tag;
 	if (!tags)
 		return BTF_FIELD_IGNORE;
 	else if (tags > 1)
@@ -3236,7 +3241,9 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 
 	if (idx >= info_cnt)
 		return -E2BIG;
-	if (kptr_ref_tag)
+	if (kptr_percpu_tag)
+		info[idx].flags = BPF_MAP_VALUE_OFF_F_PERCPU;
+	else if (kptr_ref_tag)
 		info[idx].flags = BPF_MAP_VALUE_OFF_F_REF;
 	else
 		info[idx].flags = 0;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f8738054aa52..cc8f7250e43e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3517,11 +3517,19 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 			       bool ref_ptr)
 {
 	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
+	enum bpf_reg_type reg_type;
 	const char *reg_name = "";
 	bool fixed_off_ok = true;
 
-	if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
-		goto bad_type;
+	if (off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU) {
+		if (reg->type != (PTR_TO_BTF_ID | MEM_PERCPU) &&
+		    reg->type != (PTR_TO_BTF_ID | PTR_MAYBE_NULL | MEM_PERCPU))
+			goto bad_type;
+	} else { /* referenced and unreferenced case */
+		if (reg->type != PTR_TO_BTF_ID &&
+		    reg->type != (PTR_TO_BTF_ID | PTR_MAYBE_NULL))
+			goto bad_type;
+	}
 
 	if (!btf_is_kernel(reg->btf)) {
 		verbose(env, "R%d must point to kernel BTF\n", regno);
@@ -3557,9 +3565,13 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 		goto bad_type;
 	return 0;
 bad_type:
+	if (off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU)
+		reg_type = PTR_TO_BTF_ID | PTR_MAYBE_NULL | MEM_PERCPU;
+	else
+		reg_type = PTR_TO_BTF_ID | PTR_MAYBE_NULL;
 	verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
 		reg_type_str(env, reg->type), reg_name);
-	verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
+	verbose(env, "expected=%s%s\n", reg_type_str(env, reg_type), targ_name);
 	return -EINVAL;
 }
 
@@ -3572,10 +3584,11 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 {
 	struct bpf_reg_state *reg = reg_state(env, regno), *val_reg;
 	struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
+	enum bpf_type_flag reg_flags = PTR_MAYBE_NULL;
+	bool ref_ptr = false, percpu_ptr = false;
 	struct bpf_map_value_off_desc *off_desc;
 	int insn_class = BPF_CLASS(insn->code);
 	struct bpf_map *map = reg->map_ptr;
-	bool ref_ptr = false;
 
 	/* Things we already checked for in check_map_access:
 	 *  - Reject cases where variable offset may touch BTF ID pointer
@@ -3601,6 +3614,9 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 	}
 
 	ref_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_REF;
+	percpu_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU;
+	if (percpu_ptr)
+		reg_flags |= MEM_PERCPU;
 
 	if (insn_class == BPF_LDX) {
 		if (WARN_ON_ONCE(value_regno < 0))
@@ -3614,7 +3630,7 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 		 * value from map as PTR_TO_BTF_ID, with the correct type.
 		 */
 		mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
-				off_desc->btf_id, PTR_MAYBE_NULL);
+				off_desc->btf_id, reg_flags);
 		val_reg->id = ++env->id_gen;
 	} else if (insn_class == BPF_STX) {
 		if (WARN_ON_ONCE(value_regno < 0))
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 06/15] bpf: Allow storing user kptr in map
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (4 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 05/15] bpf: Allow storing percpu " Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-19 18:28   ` Alexei Starovoitov
  2022-03-17 11:59 ` [PATCH bpf-next v2 07/15] bpf: Prevent escaping of kptr loaded from maps Kumar Kartikeya Dwivedi
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Recently, the verifier gained __user annotation support [0], which
prevents a BPF program from directly dereferencing a user memory pointer
in the kernel and instead requires use of bpf_probe_read_user. We can
allow the user to also store these pointers in BPF maps, with the logic
that whenever the user loads such a pointer from a BPF map, it gets
marked as MEM_USER. The tag 'kptr_user' is used to tag such pointers.

  [0]: https://lore.kernel.org/bpf/20220127154555.650886-1-yhs@fb.com
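
For illustration (the struct and field names are hypothetical, not part
of this patch), such a field and its use from a BPF program would look
roughly like:

	struct map_value {
		struct foo __attribute__((btf_type_tag("kptr_user"))) *uptr;
	};

	p = v->uptr;	/* verifier marks the destination register as MEM_USER */
	if (p)
		bpf_probe_read_user(&data, sizeof(data), p);	/* direct dereference is rejected */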

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/btf.c      | 13 ++++++++++---
 kernel/bpf/verifier.c | 15 ++++++++++++---
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 433f5cb161cf..989f47334215 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -163,6 +163,7 @@ enum {
 enum {
 	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
 	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),
+	BPF_MAP_VALUE_OFF_F_USER   = (1U << 2),
 };
 
 struct bpf_map_value_off_desc {
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 04d604931f59..12a89e55e77b 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3197,7 +3197,7 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 			       u32 off, int sz, struct btf_field_info *info,
 			       int info_cnt, int idx)
 {
-	bool kptr_tag = false, kptr_ref_tag = false, kptr_percpu_tag = false;
+	bool kptr_tag = false, kptr_ref_tag = false, kptr_percpu_tag = false, kptr_user_tag = false;
 	int tags;
 
 	/* For PTR, sz is always == 8 */
@@ -3221,12 +3221,17 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 			if (kptr_percpu_tag)
 				return -EEXIST;
 			kptr_percpu_tag = true;
+		} else if (!strcmp("kptr_user", __btf_name_by_offset(btf, t->name_off))) {
+			/* repeated tag */
+			if (kptr_user_tag)
+				return -EEXIST;
+			kptr_user_tag = true;
 		}
 		/* Look for next tag */
 		t = btf_type_by_id(btf, t->type);
 	}
 
-	tags = kptr_tag + kptr_ref_tag + kptr_percpu_tag;
+	tags = kptr_tag + kptr_ref_tag + kptr_percpu_tag + kptr_user_tag;
 	if (!tags)
 		return BTF_FIELD_IGNORE;
 	else if (tags > 1)
@@ -3241,7 +3246,9 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
 
 	if (idx >= info_cnt)
 		return -E2BIG;
-	if (kptr_percpu_tag)
+	if (kptr_user_tag)
+		info[idx].flags = BPF_MAP_VALUE_OFF_F_USER;
+	else if (kptr_percpu_tag)
 		info[idx].flags = BPF_MAP_VALUE_OFF_F_PERCPU;
 	else if (kptr_ref_tag)
 		info[idx].flags = BPF_MAP_VALUE_OFF_F_REF;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cc8f7250e43e..5325cc37797a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3521,7 +3521,11 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 	const char *reg_name = "";
 	bool fixed_off_ok = true;
 
-	if (off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU) {
+	if (off_desc->flags & BPF_MAP_VALUE_OFF_F_USER) {
+		if (reg->type != (PTR_TO_BTF_ID | MEM_USER) &&
+		    reg->type != (PTR_TO_BTF_ID | PTR_MAYBE_NULL | MEM_USER))
+			goto bad_type;
+	} else if (off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU) {
 		if (reg->type != (PTR_TO_BTF_ID | MEM_PERCPU) &&
 		    reg->type != (PTR_TO_BTF_ID | PTR_MAYBE_NULL | MEM_PERCPU))
 			goto bad_type;
@@ -3565,7 +3569,9 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 		goto bad_type;
 	return 0;
 bad_type:
-	if (off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU)
+	if (off_desc->flags & BPF_MAP_VALUE_OFF_F_USER)
+		reg_type = PTR_TO_BTF_ID | PTR_MAYBE_NULL | MEM_USER;
+	else if (off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU)
 		reg_type = PTR_TO_BTF_ID | PTR_MAYBE_NULL | MEM_PERCPU;
 	else
 		reg_type = PTR_TO_BTF_ID | PTR_MAYBE_NULL;
@@ -3583,9 +3589,9 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 				 enum bpf_access_type t, int insn_idx)
 {
 	struct bpf_reg_state *reg = reg_state(env, regno), *val_reg;
+	bool ref_ptr = false, percpu_ptr = false, user_ptr = false;
 	struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
 	enum bpf_type_flag reg_flags = PTR_MAYBE_NULL;
-	bool ref_ptr = false, percpu_ptr = false;
 	struct bpf_map_value_off_desc *off_desc;
 	int insn_class = BPF_CLASS(insn->code);
 	struct bpf_map *map = reg->map_ptr;
@@ -3615,8 +3621,11 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 
 	ref_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_REF;
 	percpu_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU;
+	user_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_USER;
 	if (percpu_ptr)
 		reg_flags |= MEM_PERCPU;
+	else if (user_ptr)
+		reg_flags |= MEM_USER;
 
 	if (insn_class == BPF_LDX) {
 		if (WARN_ON_ONCE(value_regno < 0))
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 07/15] bpf: Prevent escaping of kptr loaded from maps
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (5 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 06/15] bpf: Allow storing user " Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 08/15] bpf: Adapt copy_map_value for multiple offset case Kumar Kartikeya Dwivedi
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

While we can guarantee that even for an unreferenced kptr, cases such as
the pointed-to object being freed can be handled by the verifier's
exception handling (normal loads are patched to PROBE_MEM loads), we
still cannot allow the user to pass these pointers to BPF helpers and
kfuncs, because the same exception handling won't be done for accesses
inside the kernel. The same is true if a referenced pointer is loaded
using a normal load instruction: since the reference is not guaranteed to
be held while the pointer is used, it must be marked as untrusted.

Hence introduce a new type flag, PTR_UNTRUSTED, which is used to mark
all registers loading unreferenced and referenced kptr from BPF maps,
and ensure they can never escape the BPF program and into the kernel by
way of calling stable/unstable helpers.

In check_ptr_to_btf_access, the !type_may_be_null check to reject type
flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER,
MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first two
are checked inside the function and rejected with a proper error message,
but we still want to allow dereference in the untrusted case.

Also, we make sure to inherit PTR_UNTRUSTED when a chain of pointers is
walked, so that this flag is never dropped once it has been set on a
PTR_TO_BTF_ID (i.e. the trusted to untrusted transition can only go in
one direction).

In convert_ctx_accesses, extend the switch case to consider untrusted
PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM
conversion for BPF_LDX.
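
Sketched with hypothetical names, the effect on a program is that a
plainly loaded kptr may only be dereferenced, never passed back into the
kernel:

	p = v->ptr;		/* PTR_TO_BTF_ID | PTR_MAYBE_NULL | PTR_UNTRUSTED */
	if (p) {
		x = p->field;	/* ok, load is converted to a PROBE_MEM access */
		foo_release(p);	/* rejected, untrusted pointer cannot be passed to helpers/kfuncs */
	}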

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h   | 10 +++++++++-
 kernel/bpf/verifier.c | 29 +++++++++++++++++++++++------
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 989f47334215..8ac3070aa5e6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -371,7 +371,15 @@ enum bpf_type_flag {
 	 */
 	MEM_PERCPU		= BIT(4 + BPF_BASE_TYPE_BITS),
 
-	__BPF_TYPE_LAST_FLAG	= MEM_PERCPU,
+	/* PTR is not trusted. This is only used with PTR_TO_BTF_ID, to mark
+	 * unreferenced and referenced kptr loaded from map value using a load
+	 * instruction, so that they can only be dereferenced but not escape the
+	 * BPF program into the kernel (i.e. cannot be passed as arguments to
+	 * kfunc or bpf helpers).
+	 */
+	PTR_UNTRUSTED		= BIT(5 + BPF_BASE_TYPE_BITS),
+
+	__BPF_TYPE_LAST_FLAG	= PTR_UNTRUSTED,
 };
 
 /* Max number of base types. */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5325cc37797a..1130a74ad864 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -579,6 +579,8 @@ static const char *reg_type_str(struct bpf_verifier_env *env,
 		strncpy(prefix, "user_", 32);
 	if (type & MEM_PERCPU)
 		strncpy(prefix, "percpu_", 32);
+	if (type & PTR_UNTRUSTED)
+		strncpy(prefix, "untrusted_", 32);
 
 	snprintf(env->type_str_buf, TYPE_STR_BUF_LEN, "%s%s%s",
 		 prefix, str[base_type(type)], postfix);
@@ -3529,10 +3531,16 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 		if (reg->type != (PTR_TO_BTF_ID | MEM_PERCPU) &&
 		    reg->type != (PTR_TO_BTF_ID | PTR_MAYBE_NULL | MEM_PERCPU))
 			goto bad_type;
-	} else { /* referenced and unreferenced case */
+	} else if (off_desc->flags & BPF_MAP_VALUE_OFF_F_REF) {
 		if (reg->type != PTR_TO_BTF_ID &&
 		    reg->type != (PTR_TO_BTF_ID | PTR_MAYBE_NULL))
 			goto bad_type;
+	} else { /* only unreferenced case accepts untrusted pointers */
+		if (reg->type != PTR_TO_BTF_ID &&
+		    reg->type != (PTR_TO_BTF_ID | PTR_MAYBE_NULL) &&
+		    reg->type != (PTR_TO_BTF_ID | PTR_UNTRUSTED) &&
+		    reg->type != (PTR_TO_BTF_ID | PTR_MAYBE_NULL | PTR_UNTRUSTED))
+			goto bad_type;
 	}
 
 	if (!btf_is_kernel(reg->btf)) {
@@ -3622,18 +3630,20 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 	ref_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_REF;
 	percpu_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_PERCPU;
 	user_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_USER;
+
 	if (percpu_ptr)
 		reg_flags |= MEM_PERCPU;
 	else if (user_ptr)
 		reg_flags |= MEM_USER;
+	else
+		reg_flags |= PTR_UNTRUSTED;
 
 	if (insn_class == BPF_LDX) {
 		if (WARN_ON_ONCE(value_regno < 0))
 			return -EFAULT;
-		if (ref_ptr) {
-			verbose(env, "accessing referenced kptr disallowed\n");
-			return -EACCES;
-		}
+		/* We allow loading referenced kptr, since it will be marked as
+		 * untrusted, similar to unreferenced kptr.
+		 */
 		val_reg = reg_state(env, value_regno);
 		/* We can simply mark the value_regno receiving the pointer
 		 * value from map as PTR_TO_BTF_ID, with the correct type.
@@ -4414,6 +4424,12 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 	if (ret < 0)
 		return ret;
 
+	/* If this is an untrusted pointer, all pointers formed by walking it
+	 * also inherit the untrusted flag.
+	 */
+	if (type_flag(reg->type) & PTR_UNTRUSTED)
+		flag |= PTR_UNTRUSTED;
+
 	if (atype == BPF_READ && value_regno >= 0)
 		mark_btf_ld_reg(env, regs, value_regno, ret, reg->btf, btf_id, flag);
 
@@ -13109,7 +13125,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		if (!ctx_access)
 			continue;
 
-		switch (env->insn_aux_data[i + delta].ptr_type) {
+		switch ((int)env->insn_aux_data[i + delta].ptr_type) {
 		case PTR_TO_CTX:
 			if (!ops->convert_ctx_access)
 				continue;
@@ -13126,6 +13142,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			convert_ctx_access = bpf_xdp_sock_convert_ctx_access;
 			break;
 		case PTR_TO_BTF_ID:
+		case PTR_TO_BTF_ID | PTR_UNTRUSTED:
 			if (type == BPF_READ) {
 				insn->code = BPF_LDX | BPF_PROBE_MEM |
 					BPF_SIZE((insn)->code);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 08/15] bpf: Adapt copy_map_value for multiple offset case
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (6 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 07/15] bpf: Prevent escaping of kptr loaded from maps Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-19 18:34   ` Alexei Starovoitov
  2022-03-17 11:59 ` [PATCH bpf-next v2 09/15] bpf: Always raise reference in btf_get_module_btf Kumar Kartikeya Dwivedi
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Since there can now be at most 10 offsets that need handling in
copy_map_value, the manual shuffling and special casing is no longer
going to work. Hence, let's generalise the copy_map_value function by
using a sorted array of offsets to skip regions that must be avoided
while copying into and out of a map value.

When the map is created, we populate the offset array in struct map,
with one extra element for map->value_size, which is used as the final
offset to subtract the previous offset from. Since there can only be
three sizes, we can avoid recording the size in the struct map, and only
store sorted offsets. Later we can determine the size for each offset by
comparing it to timer_off and spin_lock_off, otherwise it must be
sizeof(u64) for kptr.

Then, copy_map_value uses this sorted offset array to memcpy while
skipping the timer, spin lock, and kptr fields.
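
As a concrete illustration (offsets chosen arbitrarily, and assuming
sizeof(struct bpf_spin_lock) == 4 and sizeof(struct bpf_timer) == 16):
for a value with bpf_spin_lock at offset 0, bpf_timer at 24, one kptr at
48 and value_size 64, off_arr.off becomes {0, 24, 48, 64}, the initial
memcpy of off[0] bytes copies nothing, and the loop ends up doing:

	memcpy(dst + 4,  src + 4,  20);	/* skip bpf_spin_lock at 0 */
	memcpy(dst + 40, src + 40, 8);	/* skip bpf_timer at 24    */
	memcpy(dst + 56, src + 56, 8);	/* skip kptr at 48         */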

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h  | 59 +++++++++++++++++++++++++-------------------
 kernel/bpf/syscall.c | 47 +++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8ac3070aa5e6..f0f1e0d3bb2e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -158,6 +158,10 @@ struct bpf_map_ops {
 enum {
 	/* Support at most 8 pointers in a BPF map value */
 	BPF_MAP_VALUE_OFF_MAX = 8,
+	BPF_MAP_OFF_ARR_MAX   = BPF_MAP_VALUE_OFF_MAX +
+				1 + /* for bpf_spin_lock */
+				1 + /* for bpf_timer */
+				1,  /* for map->value_size sentinel */
 };
 
 enum {
@@ -208,7 +212,12 @@ struct bpf_map {
 	char name[BPF_OBJ_NAME_LEN];
 	bool bypass_spec_v1;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
-	/* 6 bytes hole */
+	/* 2 bytes hole */
+	struct {
+		u32 off[BPF_MAP_OFF_ARR_MAX];
+		u32 cnt;
+	} off_arr;
+	/* 20 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
@@ -252,36 +261,34 @@ static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 		memset(dst + map->spin_lock_off, 0, sizeof(struct bpf_spin_lock));
 	if (unlikely(map_value_has_timer(map)))
 		memset(dst + map->timer_off, 0, sizeof(struct bpf_timer));
+	if (unlikely(map_value_has_kptr(map))) {
+		struct bpf_map_value_off *tab = map->kptr_off_tab;
+		int i;
+
+		for (i = 0; i < tab->nr_off; i++)
+			*(u64 *)(dst + tab->off[i].offset) = 0;
+	}
 }
 
 /* copy everything but bpf_spin_lock and bpf_timer. There could be one of each. */
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
-	u32 s_off = 0, s_sz = 0, t_off = 0, t_sz = 0;
-
-	if (unlikely(map_value_has_spin_lock(map))) {
-		s_off = map->spin_lock_off;
-		s_sz = sizeof(struct bpf_spin_lock);
-	}
-	if (unlikely(map_value_has_timer(map))) {
-		t_off = map->timer_off;
-		t_sz = sizeof(struct bpf_timer);
-	}
-
-	if (unlikely(s_sz || t_sz)) {
-		if (s_off < t_off || !s_sz) {
-			swap(s_off, t_off);
-			swap(s_sz, t_sz);
-		}
-		memcpy(dst, src, t_off);
-		memcpy(dst + t_off + t_sz,
-		       src + t_off + t_sz,
-		       s_off - t_off - t_sz);
-		memcpy(dst + s_off + s_sz,
-		       src + s_off + s_sz,
-		       map->value_size - s_off - s_sz);
-	} else {
-		memcpy(dst, src, map->value_size);
+	int i;
+
+	memcpy(dst, src, map->off_arr.off[0]);
+	for (i = 1; i < map->off_arr.cnt; i++) {
+		u32 curr_off = map->off_arr.off[i - 1];
+		u32 next_off = map->off_arr.off[i];
+		u32 curr_sz;
+
+		if (map_value_has_spin_lock(map) && map->spin_lock_off == curr_off)
+			curr_sz = sizeof(struct bpf_spin_lock);
+		else if (map_value_has_timer(map) && map->timer_off == curr_off)
+			curr_sz = sizeof(struct bpf_timer);
+		else
+			curr_sz = sizeof(u64);
+		curr_off += curr_sz;
+		memcpy(dst + curr_off, src + curr_off, next_off - curr_off);
 	}
 }
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 87263b07f40b..69e8ea1be432 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -30,6 +30,7 @@
 #include <linux/pgtable.h>
 #include <linux/bpf_lsm.h>
 #include <linux/poll.h>
+#include <linux/sort.h>
 #include <linux/bpf-netns.h>
 #include <linux/rcupdate_trace.h>
 #include <linux/memcontrol.h>
@@ -850,6 +851,50 @@ int map_check_no_btf(const struct bpf_map *map,
 	return -ENOTSUPP;
 }
 
+static int map_off_arr_cmp(const void *_a, const void *_b)
+{
+	const u32 a = *(const u32 *)_a;
+	const u32 b = *(const u32 *)_b;
+
+	if (a < b)
+		return -1;
+	else if (a > b)
+		return 1;
+	return 0;
+}
+
+static void map_populate_off_arr(struct bpf_map *map)
+{
+	u32 i;
+
+	map->off_arr.cnt = 0;
+	if (map_value_has_spin_lock(map)) {
+		i = map->off_arr.cnt;
+
+		map->off_arr.off[i] = map->spin_lock_off;
+		map->off_arr.cnt++;
+	}
+	if (map_value_has_timer(map)) {
+		i = map->off_arr.cnt;
+
+		map->off_arr.off[i] = map->timer_off;
+		map->off_arr.cnt++;
+	}
+	if (map_value_has_kptr(map)) {
+		struct bpf_map_value_off *tab = map->kptr_off_tab;
+		u32 j = map->off_arr.cnt;
+
+		for (i = 0; i < tab->nr_off; i++)
+			map->off_arr.off[j + i] = tab->off[i].offset;
+		map->off_arr.cnt += tab->nr_off;
+	}
+
+	map->off_arr.off[map->off_arr.cnt++] = map->value_size;
+	if (map->off_arr.cnt == 1)
+		return;
+	sort(map->off_arr.off, map->off_arr.cnt, sizeof(map->off_arr.off[0]), map_off_arr_cmp, NULL);
+}
+
 static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 			 u32 btf_key_id, u32 btf_value_id)
 {
@@ -1015,6 +1060,8 @@ static int map_create(union bpf_attr *attr)
 			attr->btf_vmlinux_value_type_id;
 	}
 
+	map_populate_off_arr(map);
+
 	err = security_bpf_map_alloc(map);
 	if (err)
 		goto free_map;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 09/15] bpf: Always raise reference in btf_get_module_btf
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (7 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 08/15] bpf: Adapt copy_map_value for multiple offset case Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-19 18:43   ` Alexei Starovoitov
  2022-03-17 11:59 ` [PATCH bpf-next v2 10/15] bpf: Populate pairs of btf_id and destructor kfunc in btf Kumar Kartikeya Dwivedi
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Align it with helpers like bpf_find_btf_id, so all functions returning
BTF in out parameter follow the same rule of raising reference
consistently, regardless of module or vmlinux BTF.

Adjust existing callers to handle the change accordingly.
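
With this change, a caller can follow the usual pattern regardless of
whether it is dealing with vmlinux or module BTF (sketch):

	btf = btf_get_module_btf(module);
	if (IS_ERR(btf))
		return PTR_ERR(btf);
	if (btf) {
		/* ... use btf ... */
		btf_put(btf);	/* always drop the reference */
	}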

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/btf.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 12a89e55e77b..2e51b391b684 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6767,20 +6767,23 @@ struct module *btf_try_get_module(const struct btf *btf)
 	return res;
 }
 
-/* Returns struct btf corresponding to the struct module
- *
- * This function can return NULL or ERR_PTR. Note that caller must
- * release reference for struct btf iff btf_is_module is true.
+/* Returns struct btf corresponding to the struct module.
+ * This function can return NULL or ERR_PTR.
  */
 static struct btf *btf_get_module_btf(const struct module *module)
 {
-	struct btf *btf = NULL;
 #ifdef CONFIG_DEBUG_INFO_BTF_MODULES
 	struct btf_module *btf_mod, *tmp;
 #endif
+	struct btf *btf = NULL;
+
+	if (!module) {
+		btf = bpf_get_btf_vmlinux();
+		if (!IS_ERR(btf))
+			btf_get(btf);
+		return btf;
+	}
 
-	if (!module)
-		return bpf_get_btf_vmlinux();
 #ifdef CONFIG_DEBUG_INFO_BTF_MODULES
 	mutex_lock(&btf_module_mutex);
 	list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) {
@@ -7018,9 +7021,7 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
 
 	hook = bpf_prog_type_to_kfunc_hook(prog_type);
 	ret = btf_populate_kfunc_set(btf, hook, kset);
-	/* reference is only taken for module BTF */
-	if (btf_is_module(btf))
-		btf_put(btf);
+	btf_put(btf);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(register_btf_kfunc_id_set);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 10/15] bpf: Populate pairs of btf_id and destructor kfunc in btf
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (8 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 09/15] bpf: Always raise reference in btf_get_module_btf Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 11/15] bpf: Wire up freeing of referenced kptr Kumar Kartikeya Dwivedi
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

To support storing referenced PTR_TO_BTF_ID in maps, we require
associating a specific BTF ID with a 'destructor' kfunc. This is because
we need to release a live referenced pointer at a certain offset in map
value from the map destruction path, otherwise we end up leaking
resources.

Hence, introduce support for passing an array of btf_id, kfunc_btf_id
pairs that denote a BTF ID and its associated release function. Then,
add an accessor 'btf_find_dtor_kfunc' which can be used to look up the
destructor kfunc of a certain BTF ID. If found, we can use it to free
the object from the map free path.

The registration of these pairs also serves as a whitelist of structures
which are allowed as referenced PTR_TO_BTF_ID in a BPF map, because
without finding the destructor kfunc, we will bail and return an error.
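
A sketch of how a built-in user of this interface might register a pair
from its initcall, using the existing BTF_ID_LIST()/BTF_ID() macros; the
'foo' type and foo_release() kfunc are illustrative:

	BTF_ID_LIST(foo_dtor_ids)
	BTF_ID(struct, foo)
	BTF_ID(func, foo_release)

	static int __init foo_dtor_init(void)
	{
		const struct btf_id_dtor_kfunc foo_dtors[] = {
			{ .btf_id = foo_dtor_ids[0], .kfunc_btf_id = foo_dtor_ids[1] },
		};

		return register_btf_id_dtor_kfuncs(foo_dtors, ARRAY_SIZE(foo_dtors), THIS_MODULE);
	}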

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/btf.h |  17 +++++++
 kernel/bpf/btf.c    | 108 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 5b578dc81c04..ff4be49b7a26 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -40,6 +40,11 @@ struct btf_kfunc_id_set {
 	};
 };
 
+struct btf_id_dtor_kfunc {
+	u32 btf_id;
+	u32 kfunc_btf_id;
+};
+
 extern const struct file_operations btf_fops;
 
 void btf_get(struct btf *btf);
@@ -346,6 +351,9 @@ bool btf_kfunc_id_set_contains(const struct btf *btf,
 			       enum btf_kfunc_type type, u32 kfunc_btf_id);
 int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
 			      const struct btf_kfunc_id_set *s);
+s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id);
+int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt,
+				struct module *owner);
 #else
 static inline const struct btf_type *btf_type_by_id(const struct btf *btf,
 						    u32 type_id)
@@ -369,6 +377,15 @@ static inline int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
 {
 	return 0;
 }
+static inline s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id)
+{
+	return -ENOENT;
+}
+static inline int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors,
+					      u32 add_cnt, struct module *owner)
+{
+	return 0;
+}
 #endif
 
 #endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 2e51b391b684..275db109a470 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -207,12 +207,18 @@ enum btf_kfunc_hook {
 
 enum {
 	BTF_KFUNC_SET_MAX_CNT = 32,
+	BTF_DTOR_KFUNC_MAX_CNT = 256,
 };
 
 struct btf_kfunc_set_tab {
 	struct btf_id_set *sets[BTF_KFUNC_HOOK_MAX][BTF_KFUNC_TYPE_MAX];
 };
 
+struct btf_id_dtor_kfunc_tab {
+	u32 cnt;
+	struct btf_id_dtor_kfunc dtors[];
+};
+
 struct btf {
 	void *data;
 	struct btf_type **types;
@@ -228,6 +234,7 @@ struct btf {
 	u32 id;
 	struct rcu_head rcu;
 	struct btf_kfunc_set_tab *kfunc_set_tab;
+	struct btf_id_dtor_kfunc_tab *dtor_kfunc_tab;
 
 	/* split BTF support */
 	struct btf *base_btf;
@@ -1614,8 +1621,19 @@ static void btf_free_kfunc_set_tab(struct btf *btf)
 	btf->kfunc_set_tab = NULL;
 }
 
+static void btf_free_dtor_kfunc_tab(struct btf *btf)
+{
+	struct btf_id_dtor_kfunc_tab *tab = btf->dtor_kfunc_tab;
+
+	if (!tab)
+		return;
+	kfree(tab);
+	btf->dtor_kfunc_tab = NULL;
+}
+
 static void btf_free(struct btf *btf)
 {
+	btf_free_dtor_kfunc_tab(btf);
 	btf_free_kfunc_set_tab(btf);
 	kvfree(btf->types);
 	kvfree(btf->resolved_sizes);
@@ -7026,6 +7044,96 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
 }
 EXPORT_SYMBOL_GPL(register_btf_kfunc_id_set);
 
+s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id)
+{
+	struct btf_id_dtor_kfunc_tab *tab = btf->dtor_kfunc_tab;
+	struct btf_id_dtor_kfunc *dtor;
+
+	if (!tab)
+		return -ENOENT;
+	/* Even though the size of tab->dtors[0] is > sizeof(u32), we only need
+	 * to compare the first u32 with btf_id, so we can reuse btf_id_cmp_func.
+	 */
+	BUILD_BUG_ON(offsetof(struct btf_id_dtor_kfunc, btf_id) != 0);
+	dtor = bsearch(&btf_id, tab->dtors, tab->cnt, sizeof(tab->dtors[0]), btf_id_cmp_func);
+	if (!dtor)
+		return -ENOENT;
+	return dtor->kfunc_btf_id;
+}
+
+/* This function must be invoked only from initcalls/module init functions */
+int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt,
+				struct module *owner)
+{
+	struct btf_id_dtor_kfunc_tab *tab;
+	struct btf *btf;
+	u32 tab_cnt;
+	int ret;
+
+	btf = btf_get_module_btf(owner);
+	if (!btf) {
+		if (!owner && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) {
+			pr_err("missing vmlinux BTF, cannot register dtor kfuncs\n");
+			return -ENOENT;
+		}
+		if (owner && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES)) {
+			pr_err("missing module BTF, cannot register dtor kfuncs\n");
+			return -ENOENT;
+		}
+		return 0;
+	}
+	if (IS_ERR(btf))
+		return PTR_ERR(btf);
+
+	if (add_cnt >= BTF_DTOR_KFUNC_MAX_CNT) {
+		pr_err("cannot register more than %d kfunc destructors\n", BTF_DTOR_KFUNC_MAX_CNT);
+		ret = -E2BIG;
+		goto end;
+	}
+
+	tab = btf->dtor_kfunc_tab;
+	/* Only one call allowed for modules */
+	if (WARN_ON_ONCE(tab && btf_is_module(btf))) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	tab_cnt = tab ? tab->cnt : 0;
+	if (tab_cnt > U32_MAX - add_cnt) {
+		ret = -EOVERFLOW;
+		goto end;
+	}
+	if (tab_cnt + add_cnt >= BTF_DTOR_KFUNC_MAX_CNT) {
+		pr_err("cannot register more than %d kfunc destructors\n", BTF_DTOR_KFUNC_MAX_CNT);
+		ret = -E2BIG;
+		goto end;
+	}
+
+	tab = krealloc(btf->dtor_kfunc_tab,
+		       offsetof(struct btf_id_dtor_kfunc_tab, dtors[tab_cnt + add_cnt]),
+		       GFP_KERNEL | __GFP_NOWARN);
+	if (!tab) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	if (!btf->dtor_kfunc_tab)
+		tab->cnt = 0;
+	btf->dtor_kfunc_tab = tab;
+
+	memcpy(tab->dtors + tab->cnt, dtors, add_cnt * sizeof(tab->dtors[0]));
+	tab->cnt += add_cnt;
+
+	sort(tab->dtors, tab->cnt, sizeof(tab->dtors[0]), btf_id_cmp_func, NULL);
+
+	return 0;
+end:
+	btf_free_dtor_kfunc_tab(btf);
+	btf_put(btf);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(register_btf_id_dtor_kfuncs);
+
 #define MAX_TYPES_ARE_COMPAT_DEPTH 2
 
 static
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 11/15] bpf: Wire up freeing of referenced kptr
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (9 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 10/15] bpf: Populate pairs of btf_id and destructor kfunc in btf Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 12/15] bpf: Teach verifier about kptr_get kfunc helpers Kumar Kartikeya Dwivedi
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
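
For example, both of the following (with illustrative names) are
acceptable destructor prototypes:

	void foo_release(struct foo *p);
	void foo_release(void *p);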

In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.

Note that only BTF IDs whose destructor kfunc is registered become the
allowed BTF IDs for embedding as a referenced kptr. Hence, the lookup
serves the purpose of finding the dtor kfunc BTF ID, as well as acting as
a check against the whitelist of BTF IDs allowed for this purpose.

Finally, wire up the actual freeing of the referenced pointer, if any, at
all available offsets, so that no references are leaked after the BPF map
goes away, even if the BPF program previously moved ownership of a
referenced pointer into it.

The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same holds for the LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced kptrs and free referenced ones.

Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h   |  4 ++
 include/linux/btf.h   |  2 +
 kernel/bpf/arraymap.c | 14 ++++++-
 kernel/bpf/btf.c      | 86 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/hashtab.c  | 28 +++++++++-----
 kernel/bpf/syscall.c  | 57 +++++++++++++++++++++++++---
 6 files changed, 172 insertions(+), 19 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f0f1e0d3bb2e..1abb8d3aa6c5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/percpu-refcount.h>
 #include <linux/bpfptr.h>
+#include <linux/btf.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -174,6 +175,8 @@ struct bpf_map_value_off_desc {
 	u32 offset;
 	u32 btf_id;
 	struct btf *btf;
+	struct module *module;
+	btf_dtor_kfunc_t dtor;
 	int flags;
 };
 
@@ -1547,6 +1550,7 @@ struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u3
 void bpf_map_free_kptr_off_tab(struct bpf_map *map);
 struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
 bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
+void bpf_map_free_kptr(struct bpf_map *map, void *map_value);
 
 struct bpf_map *bpf_map_get(u32 ufd);
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
diff --git a/include/linux/btf.h b/include/linux/btf.h
index ff4be49b7a26..8acf728c8616 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -45,6 +45,8 @@ struct btf_id_dtor_kfunc {
 	u32 kfunc_btf_id;
 };
 
+typedef void (*btf_dtor_kfunc_t)(void *);
+
 extern const struct file_operations btf_fops;
 
 void btf_get(struct btf *btf);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7f145aefbff8..3cc2884321e7 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -287,10 +287,12 @@ static int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key
 	return 0;
 }
 
-static void check_and_free_timer_in_array(struct bpf_array *arr, void *val)
+static void check_and_free_timer_and_kptr(struct bpf_array *arr, void *val)
 {
 	if (unlikely(map_value_has_timer(&arr->map)))
 		bpf_timer_cancel_and_free(val + arr->map.timer_off);
+	if (unlikely(map_value_has_kptr(&arr->map)))
+		bpf_map_free_kptr(&arr->map, val);
 }
 
 /* Called from syscall or from eBPF program */
@@ -327,7 +329,7 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 			copy_map_value_locked(map, val, value, false);
 		else
 			copy_map_value(map, val, value);
-		check_and_free_timer_in_array(array, val);
+		check_and_free_timer_and_kptr(array, val);
 	}
 	return 0;
 }
@@ -386,6 +388,7 @@ static void array_map_free_timers(struct bpf_map *map)
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
 
+	/* We don't reset or free kptr on uref dropping to zero. */
 	if (likely(!map_value_has_timer(map)))
 		return;
 
@@ -398,6 +401,13 @@ static void array_map_free_timers(struct bpf_map *map)
 static void array_map_free(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	int i;
+
+	if (unlikely(map_value_has_kptr(map))) {
+		for (i = 0; i < array->map.max_entries; i++)
+			bpf_map_free_kptr(map, array->value + array->elem_size * i);
+		bpf_map_free_kptr_off_tab(map);
+	}
 
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 275db109a470..b561d807c9a1 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3415,6 +3415,7 @@ struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
 {
 	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
 	struct bpf_map_value_off *tab;
+	struct module *mod = NULL;
 	int ret, i, nr_off;
 
 	/* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
@@ -3446,16 +3447,99 @@ struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
 			goto end;
 		}
 
+		/* Find and stash the function pointer for the destruction function that
+		 * needs to be eventually invoked from the map free path.
+		 */
+		if (info_arr[i].flags & BPF_MAP_VALUE_OFF_F_REF) {
+			const struct btf_type *dtor_func, *dtor_func_proto;
+			const struct btf_param *args;
+			const char *dtor_func_name;
+			unsigned long addr;
+			s32 dtor_btf_id;
+			u32 nr_args;
+
+			/* This call also serves as a whitelist of allowed objects that
+			 * can be used as a referenced pointer and be stored in a map at
+			 * the same time.
+			 */
+			dtor_btf_id = btf_find_dtor_kfunc(off_btf, id);
+			if (dtor_btf_id < 0) {
+				ret = dtor_btf_id;
+				btf_put(off_btf);
+				goto end;
+			}
+
+			dtor_func = btf_type_by_id(off_btf, dtor_btf_id);
+			if (!dtor_func || !btf_type_is_func(dtor_func)) {
+				ret = -EINVAL;
+				btf_put(off_btf);
+				goto end;
+			}
+
+			dtor_func_proto = btf_type_by_id(off_btf, dtor_func->type);
+			if (!dtor_func_proto || !btf_type_is_func_proto(dtor_func_proto)) {
+				ret = -EINVAL;
+				btf_put(off_btf);
+				goto end;
+			}
+
+			/* Make sure the prototype of the destructor kfunc is 'void func(type *)' */
+			t = btf_type_by_id(off_btf, dtor_func_proto->type);
+			if (!t || !btf_type_is_void(t)) {
+				ret = -EINVAL;
+				btf_put(off_btf);
+				goto end;
+			}
+
+			nr_args = btf_type_vlen(dtor_func_proto);
+			args = btf_params(dtor_func_proto);
+
+			t = NULL;
+			if (nr_args)
+				t = btf_type_by_id(off_btf, args[0].type);
+			/* Allow any pointer type, as width on targets Linux supports
+			 * will be same for all pointer types (i.e. sizeof(void *))
+			 */
+			if (nr_args != 1 || !t || !btf_type_is_ptr(t)) {
+				ret = -EINVAL;
+				btf_put(off_btf);
+				goto end;
+			}
+
+			if (btf_is_module(btf)) {
+				mod = btf_try_get_module(off_btf);
+				if (!mod) {
+					ret = -ENXIO;
+					btf_put(off_btf);
+					goto end;
+				}
+			}
+
+			dtor_func_name = __btf_name_by_offset(off_btf, dtor_func->name_off);
+			addr = kallsyms_lookup_name(dtor_func_name);
+			if (!addr) {
+				ret = -EINVAL;
+				module_put(mod);
+				btf_put(off_btf);
+				goto end;
+			}
+			tab->off[i].dtor = (void *)addr;
+		}
+
 		tab->off[i].offset = info_arr[i].off;
 		tab->off[i].btf_id = id;
 		tab->off[i].btf = off_btf;
+		tab->off[i].module = mod;
 		tab->off[i].flags = info_arr[i].flags;
 		tab->nr_off = i + 1;
 	}
 	return tab;
 end:
-	while (tab->nr_off--)
+	while (tab->nr_off--) {
 		btf_put(tab->off[tab->nr_off].btf);
+		if (tab->off[tab->nr_off].module)
+			module_put(tab->off[tab->nr_off].module);
+	}
 	kfree(tab);
 	return ERR_PTR(ret);
 }
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 65877967f414..3d0698e19958 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -725,12 +725,16 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
 	return insn - insn_buf;
 }
 
-static void check_and_free_timer(struct bpf_htab *htab, struct htab_elem *elem)
+static void check_and_free_timer_and_kptr(struct bpf_htab *htab,
+					  struct htab_elem *elem,
+					  bool free_kptr)
 {
+	void *map_value = elem->key + round_up(htab->map.key_size, 8);
+
 	if (unlikely(map_value_has_timer(&htab->map)))
-		bpf_timer_cancel_and_free(elem->key +
-					  round_up(htab->map.key_size, 8) +
-					  htab->map.timer_off);
+		bpf_timer_cancel_and_free(map_value + htab->map.timer_off);
+	if (unlikely(map_value_has_kptr(&htab->map)) && free_kptr)
+		bpf_map_free_kptr(&htab->map, map_value);
 }
 
 /* It is called from the bpf_lru_list when the LRU needs to delete
@@ -757,7 +761,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
 		if (l == tgt_l) {
 			hlist_nulls_del_rcu(&l->hash_node);
-			check_and_free_timer(htab, l);
+			check_and_free_timer_and_kptr(htab, l, true);
 			break;
 		}
 
@@ -829,7 +833,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
-	check_and_free_timer(htab, l);
+	check_and_free_timer_and_kptr(htab, l, true);
 	kfree(l);
 }
 
@@ -857,7 +861,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 	htab_put_fd_value(htab, l);
 
 	if (htab_is_prealloc(htab)) {
-		check_and_free_timer(htab, l);
+		check_and_free_timer_and_kptr(htab, l, true);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		atomic_dec(&htab->count);
@@ -1104,7 +1108,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		if (!htab_is_prealloc(htab))
 			free_htab_elem(htab, l_old);
 		else
-			check_and_free_timer(htab, l_old);
+			check_and_free_timer_and_kptr(htab, l_old, true);
 	}
 	ret = 0;
 err:
@@ -1114,7 +1118,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 
 static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
 {
-	check_and_free_timer(htab, elem);
+	check_and_free_timer_and_kptr(htab, elem, true);
 	bpf_lru_push_free(&htab->lru, &elem->lru_node);
 }
 
@@ -1420,7 +1424,10 @@ static void htab_free_malloced_timers(struct bpf_htab *htab)
 		struct htab_elem *l;
 
 		hlist_nulls_for_each_entry(l, n, head, hash_node)
-			check_and_free_timer(htab, l);
+			/* We don't reset or free kptr on uref dropping to zero,
+			 * hence set free_kptr to false.
+			 */
+			check_and_free_timer_and_kptr(htab, l, false);
 		cond_resched_rcu();
 	}
 	rcu_read_unlock();
@@ -1458,6 +1465,7 @@ static void htab_map_free(struct bpf_map *map)
 	else
 		prealloc_destroy(htab);
 
+	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 69e8ea1be432..e636f43cc3b9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -507,8 +507,11 @@ void bpf_map_free_kptr_off_tab(struct bpf_map *map)
 	if (!map_value_has_kptr(map))
 		return;
 	for (i = 0; i < tab->nr_off; i++) {
+		struct module *mod = tab->off[i].module;
 		struct btf *btf = tab->off[i].btf;
 
+		if (mod)
+			module_put(mod);
 		btf_put(btf);
 	}
 	kfree(tab);
@@ -523,8 +526,16 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
 	if (!map_value_has_kptr(map))
 		return ERR_PTR(-ENOENT);
 	/* Do a deep copy of the kptr_off_tab */
-	for (i = 0; i < tab->nr_off; i++)
-		btf_get(tab->off[i].btf);
+	for (i = 0; i < tab->nr_off; i++) {
+		struct module *mod = tab->off[i].module;
+		struct btf *btf = tab->off[i].btf;
+
+		if (mod && !try_module_get(mod)) {
+			ret = -ENXIO;
+			goto end;
+		}
+		btf_get(btf);
+	}
 
 	size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
 	new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
@@ -535,8 +546,14 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
 	memcpy(new_tab, tab, size);
 	return new_tab;
 end:
-	while (i--)
-		btf_put(tab->off[i].btf);
+	while (i--) {
+		struct module *mod = tab->off[i].module;
+		struct btf *btf = tab->off[i].btf;
+
+		if (mod)
+			module_put(mod);
+		btf_put(btf);
+	}
 	return ERR_PTR(ret);
 }
 
@@ -556,15 +573,43 @@ bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_ma
 	return !memcmp(tab_a, tab_b, size);
 }
 
+/* Caller must ensure map_value_has_kptr is true. Note that this function can be
+ * called on a map value while it is visible to BPF programs, as it performs the
+ * correct synchronization (xchg for referenced kptrs), and the verifier already
+ * enforces the same on the BPF program side, esp. for referenced pointers.
+ */
+void bpf_map_free_kptr(struct bpf_map *map, void *map_value)
+{
+	struct bpf_map_value_off *tab = map->kptr_off_tab;
+	unsigned long *btf_id_ptr;
+	int i;
+
+	for (i = 0; i < tab->nr_off; i++) {
+		struct bpf_map_value_off_desc *off_desc = &tab->off[i];
+		unsigned long old_ptr;
+
+		btf_id_ptr = map_value + off_desc->offset;
+		if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
+			u64 *p = (u64 *)btf_id_ptr;
+
+			WRITE_ONCE(*p, 0);
+			continue;
+		}
+		old_ptr = xchg(btf_id_ptr, 0);
+		off_desc->dtor((void *)old_ptr);
+	}
+}
+
 /* called from workqueue */
 static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 
 	security_bpf_map_free(map);
-	bpf_map_free_kptr_off_tab(map);
 	bpf_map_release_memcg(map);
-	/* implementation dependent freeing */
+	/* implementation dependent freeing, map_free callback also does
+	 * bpf_map_free_kptr_off_tab, if needed.
+	 */
 	map->ops->map_free(map);
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 12/15] bpf: Teach verifier about kptr_get kfunc helpers
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (10 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 11/15] bpf: Wire up freeing of referenced kptr Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 13/15] libbpf: Add kptr type tag macros to bpf_helpers.h Kumar Kartikeya Dwivedi
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

We introduce a new style of kfunc helpers, namely *_kptr_get, which take
a pointer to the map value that in turn points to a referenced kernel
pointer contained in the map. Since this pointer is referenced, only
bpf_kptr_xchg from the BPF side and xchg from the kernel side are
allowed to change the current value, and each pointer that resides in
that location would be referenced and RCU protected (this must be kept
in mind while adding kernel types embeddable as referenced kptrs in BPF
maps).

This means that if we do the load of the pointer value in an RCU read
section, and find a live pointer, then as long as we hold the RCU read
lock, it won't be freed by a parallel xchg + release operation. This
allows us to implement a safe refcount increment scheme. Hence, enforce
that the first argument of all such kfuncs is a proper PTR_TO_MAP_VALUE
pointing at the right offset to the referenced pointer.

The rest of the arguments are subjected to the typical kfunc argument
checks, allowing some flexibility in passing more intent into how the
reference should be taken.

For instance, in the case of struct nf_conn, the object is not freed
until the RCU grace period ends, but it can still be reused for another
tuple once its refcount has dropped to zero. Hence, a bpf_ct_kptr_get
helper not only needs to call refcount_inc_not_zero, but must also do a
tuple match after incrementing the reference, and when the match fails,
put the reference again and return NULL.

This can be implemented easily if we allow passing additional parameters
to the bpf_ct_kptr_get kfunc, like a struct bpf_sock_tuple * and a
tuple__sz pair.
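
As an aside, and purely for illustration (none of this is part of the
patch), the pattern such a kptr_get kfunc follows could be sketched for
a hypothetical refcounted type 'struct foo' with a 'refcount_t ref'
member roughly as below; the load happens under RCU and a pointer is
only returned when refcount_inc_not_zero succeeds:

	struct foo *bpf_foo_kptr_get(struct foo **pp)
	{
		struct foo *p;

		rcu_read_lock();
		/* Load the pointer currently stored in the map value. */
		p = READ_ONCE(*pp);
		/* The object may be released in parallel; only hand it
		 * out if we manage to take a reference.
		 */
		if (p && !refcount_inc_not_zero(&p->ref))
			p = NULL;
		rcu_read_unlock();
		return p;
	}

Such a kfunc would also have to be listed in the acquire, ret_null, and
kptr_acquire kfunc id sets, as is done for bpf_kfunc_call_test_kptr_get
in the selftests later in this series.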

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/btf.h |  2 ++
 kernel/bpf/btf.c    | 57 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 8acf728c8616..d5d37bfde8df 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -17,6 +17,7 @@ enum btf_kfunc_type {
 	BTF_KFUNC_TYPE_ACQUIRE,
 	BTF_KFUNC_TYPE_RELEASE,
 	BTF_KFUNC_TYPE_RET_NULL,
+	BTF_KFUNC_TYPE_KPTR_ACQUIRE,
 	BTF_KFUNC_TYPE_MAX,
 };
 
@@ -35,6 +36,7 @@ struct btf_kfunc_id_set {
 			struct btf_id_set *acquire_set;
 			struct btf_id_set *release_set;
 			struct btf_id_set *ret_null_set;
+			struct btf_id_set *kptr_acquire_set;
 		};
 		struct btf_id_set *sets[BTF_KFUNC_TYPE_MAX];
 	};
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index b561d807c9a1..34841e90a128 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6068,11 +6068,11 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 	struct bpf_verifier_log *log = &env->log;
 	u32 i, nargs, ref_id, ref_obj_id = 0;
 	bool is_kfunc = btf_is_kernel(btf);
+	bool rel = false, kptr_get = false;
 	const char *func_name, *ref_tname;
 	const struct btf_type *t, *ref_t;
 	const struct btf_param *args;
 	int ref_regno = 0, ret;
-	bool rel = false;
 
 	t = btf_type_by_id(btf, func_id);
 	if (!t || !btf_type_is_func(t)) {
@@ -6098,10 +6098,14 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
-	/* Only kfunc can be release func */
-	if (is_kfunc)
+	if (is_kfunc) {
+		/* Only kfunc can be release func */
 		rel = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog),
 						BTF_KFUNC_TYPE_RELEASE, func_id);
+		kptr_get = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog),
+						     BTF_KFUNC_TYPE_KPTR_ACQUIRE, func_id);
+	}
+
 	/* check that BTF function arguments match actual types that the
 	 * verifier sees.
 	 */
@@ -6130,8 +6134,51 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 		if (ret < 0)
 			return ret;
 
-		if (btf_get_prog_ctx_type(log, btf, t,
-					  env->prog->type, i)) {
+		/* kptr_get is only true for kfunc */
+		if (i == 0 && kptr_get) {
+			struct bpf_map_value_off_desc *off_desc;
+
+			if (reg->type != PTR_TO_MAP_VALUE) {
+				bpf_log(log, "arg#0 expected pointer to map value\n");
+				return -EINVAL;
+			}
+
+			/* check_func_arg_reg_off allows var_off for
+			 * PTR_TO_MAP_VALUE, but we need fixed offset to find
+			 * off_desc.
+			 */
+			if (!tnum_is_const(reg->var_off)) {
+				bpf_log(log, "arg#0 must have constant offset\n");
+				return -EINVAL;
+			}
+
+			off_desc = bpf_map_kptr_off_contains(reg->map_ptr, reg->off + reg->var_off.value);
+			if (!off_desc || !(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
+				bpf_log(log, "arg#0 no referenced kptr at map value offset=%llu\n",
+					reg->off + reg->var_off.value);
+				return -EINVAL;
+			}
+
+			if (!btf_type_is_ptr(ref_t)) {
+				bpf_log(log, "arg#0 BTF type must be a double pointer\n");
+				return -EINVAL;
+			}
+
+			ref_t = btf_type_skip_modifiers(btf, ref_t->type, &ref_id);
+			ref_tname = btf_name_by_offset(btf, ref_t->name_off);
+
+			if (!btf_type_is_struct(ref_t)) {
+				bpf_log(log, "kernel function %s args#%d pointer type %s %s is not supported\n",
+					func_name, i, btf_type_str(ref_t), ref_tname);
+				return -EINVAL;
+			}
+			if (!btf_struct_ids_match(log, btf, ref_id, 0, off_desc->btf, off_desc->btf_id)) {
+				bpf_log(log, "kernel function %s args#%d expected pointer to %s %s\n",
+					func_name, i, btf_type_str(ref_t), ref_tname);
+				return -EINVAL;
+			}
+			/* rest of the arguments can be anything, like normal kfunc */
+		} else if (btf_get_prog_ctx_type(log, btf, t, env->prog->type, i)) {
 			/* If function expects ctx type in BTF check that caller
 			 * is passing PTR_TO_CTX.
 			 */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 13/15] libbpf: Add kptr type tag macros to bpf_helpers.h
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (11 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 12/15] bpf: Teach verifier about kptr_get kfunc helpers Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 14/15] selftests/bpf: Add C tests for kptr Kumar Kartikeya Dwivedi
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Include convenience definitions:
__kptr:		Unreferenced BTF ID pointer
__kptr_ref:	Referenced BTF ID pointer
__kptr_percpu:	per-CPU BTF ID pointer
__kptr_user:	Userspace BTF ID pointer

Users can use them to tag the pointer type meant to be used with the new
support directly in the map value definition. Note that these attributes
require https://reviews.llvm.org/D119799 for the type tags to be emitted
correctly into BPF object BTF when applied to a non-builtin type.
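
For reference, a map value using these tags (adapted from the selftest
added later in this series) looks like the sketch below; a clang
containing the above LLVM change is needed for the tags to show up in
the program BTF:

	struct map_value {
		struct prog_test_ref_kfunc __kptr *unref_ptr;
		struct prog_test_ref_kfunc __kptr_ref *ref_ptr;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__type(key, int);
		__type(value, struct map_value);
		__uint(max_entries, 1);
	} array_map SEC(".maps");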

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 tools/lib/bpf/bpf_helpers.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 44df982d2a5c..f27690110eb5 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -149,6 +149,10 @@ enum libbpf_tristate {
 
 #define __kconfig __attribute__((section(".kconfig")))
 #define __ksym __attribute__((section(".ksyms")))
+#define __kptr __attribute__((btf_type_tag("kptr")))
+#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))
+#define __kptr_percpu __attribute__((btf_type_tag("kptr_percpu")))
+#define __kptr_user __attribute__((btf_type_tag("kptr_user")))
 
 #ifndef ___bpf_concat
 #define ___bpf_concat(a, b) a ## b
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 14/15] selftests/bpf: Add C tests for kptr
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (12 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 13/15] libbpf: Add kptr type tag macros to bpf_helpers.h Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-17 11:59 ` [PATCH bpf-next v2 15/15] selftests/bpf: Add verifier " Kumar Kartikeya Dwivedi
  2022-03-19 18:50 ` [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps patchwork-bot+netdevbpf
  15 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

This uses the __kptr* macros as well, and exercises the cases that are
supposed to work, since the negative tests are covered by the
test_verifier suite. Also include some code to test map-in-map support,
such that the inner_map_meta matches the map value of the map added as
an element.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/map_kptr.c       |  20 ++
 tools/testing/selftests/bpf/progs/map_kptr.c  | 236 ++++++++++++++++++
 2 files changed, 256 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/map_kptr.c

diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr.c b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
new file mode 100644
index 000000000000..688732295ce9
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+
+#include "map_kptr.skel.h"
+
+void test_map_kptr(void)
+{
+	struct map_kptr *skel;
+	char buf[24];
+	int key = 0;
+
+	skel = map_kptr__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "map_kptr__open_and_load"))
+		return;
+	ASSERT_OK(bpf_map_update_elem(bpf_map__fd(skel->maps.hash_map), &key, buf, 0),
+		  "bpf_map_update_elem hash_map");
+	ASSERT_OK(bpf_map_update_elem(bpf_map__fd(skel->maps.hash_malloc_map), &key, buf, 0),
+		  "bpf_map_update_elem hash_malloc_map");
+	map_kptr__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/map_kptr.c b/tools/testing/selftests/bpf/progs/map_kptr.c
new file mode 100644
index 000000000000..0b6f60423219
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/map_kptr.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+struct map_value {
+	struct prog_test_ref_kfunc __kptr *unref_ptr;
+	struct prog_test_ref_kfunc __kptr_ref *ref_ptr;
+	struct prog_test_ref_kfunc __kptr_percpu *percpu_ptr;
+	struct prog_test_ref_kfunc __kptr_user *user_ptr;
+};
+
+struct array_map {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+} array_map SEC(".maps");
+
+struct hash_map {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+} hash_map SEC(".maps");
+
+struct hash_malloc_map {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} hash_malloc_map SEC(".maps");
+
+struct lru_hash_map {
+	__uint(type, BPF_MAP_TYPE_LRU_HASH);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+} lru_hash_map SEC(".maps");
+
+#define DEFINE_MAP_OF_MAP(map_type, inner_map_type, name)       \
+	struct {                                                \
+		__uint(type, map_type);                         \
+		__uint(max_entries, 1);                         \
+		__uint(key_size, sizeof(int));                  \
+		__uint(value_size, sizeof(int));                \
+		__array(values, struct inner_map_type);         \
+	} name SEC(".maps") = {                                 \
+		.values = { [0] = &inner_map_type },            \
+	}
+
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_map, array_of_array_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_ARRAY_OF_MAPS, hash_map, array_of_hash_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_ARRAY_OF_MAPS, hash_malloc_map, array_of_hash_malloc_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_ARRAY_OF_MAPS, lru_hash_map, array_of_lru_hash_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_HASH_OF_MAPS, array_map, hash_of_array_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_HASH_OF_MAPS, hash_map, hash_of_hash_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_HASH_OF_MAPS, hash_malloc_map, hash_of_hash_malloc_maps);
+DEFINE_MAP_OF_MAP(BPF_MAP_TYPE_HASH_OF_MAPS, lru_hash_map, hash_of_lru_hash_maps);
+
+extern struct prog_test_ref_kfunc *bpf_kfunc_call_test_acquire(unsigned long *sp) __ksym;
+extern struct prog_test_ref_kfunc *
+bpf_kfunc_call_test_kptr_get(struct prog_test_ref_kfunc **p, int a, int b) __ksym;
+extern void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) __ksym;
+
+static __always_inline
+void test_kptr_unref(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p;
+
+	p = v->unref_ptr;
+	/* store untrusted_ptr_or_null_ */
+	v->unref_ptr = p;
+	if (!p)
+		return;
+	if (p->a + p->b > 100)
+		return;
+	/* store untrusted_ptr_ */
+	v->unref_ptr = p;
+	/* store NULL */
+	v->unref_ptr = NULL;
+}
+
+static __always_inline
+void test_kptr_ref(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p;
+
+	p = v->ref_ptr;
+	/* store ptr_or_null_ */
+	v->unref_ptr = p;
+	if (!p)
+		return;
+	if (p->a + p->b > 100)
+		return;
+	/* store NULL */
+	p = bpf_kptr_xchg(&v->ref_ptr, NULL);
+	if (!p)
+		return;
+	if (p->a + p->b > 100) {
+		bpf_kfunc_call_test_release(p);
+		return;
+	}
+	/* store ptr_ */
+	v->unref_ptr = p;
+	bpf_kfunc_call_test_release(p);
+
+	p = bpf_kfunc_call_test_acquire(&(unsigned long){0});
+	if (!p)
+		return;
+	/* store ptr_ */
+	p = bpf_kptr_xchg(&v->ref_ptr, p);
+	if (!p)
+		return;
+	if (p->a + p->b > 100) {
+		bpf_kfunc_call_test_release(p);
+		return;
+	}
+	bpf_kfunc_call_test_release(p);
+}
+
+static __always_inline
+void test_kptr_percpu(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p;
+
+	p = v->percpu_ptr;
+	/* store percpu_ptr_or_null_ */
+	v->percpu_ptr = p;
+	if (!p)
+		return;
+	p = bpf_this_cpu_ptr(p);
+	if (p->a + p->b > 100)
+		return;
+	/* store percpu_ptr_ */
+	v->percpu_ptr = p;
+	/* store NULL */
+	v->percpu_ptr = NULL;
+}
+
+static __always_inline
+void test_kptr_user(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p;
+	char buf[sizeof(*p)];
+
+	p = v->user_ptr;
+	/* store user_ptr_or_null_ */
+	v->user_ptr = p;
+	if (!p)
+		return;
+	bpf_probe_read_user(buf, sizeof(buf), p);
+	/* store user_ptr_ */
+	v->user_ptr = p;
+	/* store NULL */
+	v->user_ptr = NULL;
+}
+
+static __always_inline
+void test_kptr_get(struct map_value *v)
+{
+	struct prog_test_ref_kfunc *p;
+
+	p = bpf_kfunc_call_test_kptr_get(&v->ref_ptr, 0, 0);
+	if (!p)
+		return;
+	if (p->a + p->b > 100) {
+		bpf_kfunc_call_test_release(p);
+		return;
+	}
+	bpf_kfunc_call_test_release(p);
+}
+
+static __always_inline
+void test_kptr(struct map_value *v)
+{
+	test_kptr_unref(v);
+	test_kptr_ref(v);
+	test_kptr_percpu(v);
+	test_kptr_user(v);
+	test_kptr_get(v);
+}
+
+SEC("tc")
+int test_map_kptr(struct __sk_buff *ctx)
+{
+	void *maps[] = {
+		&array_map,
+		&hash_map,
+		&hash_malloc_map,
+		&lru_hash_map,
+	};
+	struct map_value *v;
+	int i, key = 0;
+
+	for (i = 0; i < sizeof(maps) / sizeof(*maps); i++) {
+		v = bpf_map_lookup_elem(maps[i], &key);
+		if (!v)
+			return 0;
+		test_kptr(v);
+	}
+	return 0;
+}
+
+SEC("tc")
+int test_map_in_map_kptr(struct __sk_buff *ctx)
+{
+	void *map_of_maps[] = {
+		&array_of_array_maps,
+		&array_of_hash_maps,
+		&array_of_hash_malloc_maps,
+		&array_of_lru_hash_maps,
+		&hash_of_array_maps,
+		&hash_of_hash_maps,
+		&hash_of_hash_malloc_maps,
+		&hash_of_lru_hash_maps,
+	};
+	struct map_value *v;
+	int i, key = 0;
+	void *map;
+
+	for (i = 0; i < sizeof(map_of_maps) / sizeof(*map_of_maps); i++) {
+		map = bpf_map_lookup_elem(map_of_maps[i], &key);
+		if (!map)
+			return 0;
+		v = bpf_map_lookup_elem(map, &key);
+		if (!v)
+			return 0;
+		test_kptr(v);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v2 15/15] selftests/bpf: Add verifier tests for kptr
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (13 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 14/15] selftests/bpf: Add C tests for kptr Kumar Kartikeya Dwivedi
@ 2022-03-17 11:59 ` Kumar Kartikeya Dwivedi
  2022-03-19 18:50 ` [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps patchwork-bot+netdevbpf
  15 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-17 11:59 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

Reuse the bpf_prog_test functions to test the support for PTR_TO_BTF_ID
in the BPF map case, including some tests that verify implementation
sanity and corner cases.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 net/bpf/test_run.c                            |  39 +-
 tools/testing/selftests/bpf/test_verifier.c   |  60 +-
 .../testing/selftests/bpf/verifier/map_kptr.c | 763 ++++++++++++++++++
 3 files changed, 855 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/map_kptr.c

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index e7b9c2636d10..be1cd7498a4e 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -584,6 +584,12 @@ noinline void bpf_kfunc_call_memb_release(struct prog_test_member *p)
 {
 }
 
+noinline struct prog_test_ref_kfunc *
+bpf_kfunc_call_test_kptr_get(struct prog_test_ref_kfunc **p, int a, int b)
+{
+	return &prog_test_struct;
+}
+
 struct prog_test_pass1 {
 	int x0;
 	struct {
@@ -669,6 +675,7 @@ BTF_ID(func, bpf_kfunc_call_test3)
 BTF_ID(func, bpf_kfunc_call_test_acquire)
 BTF_ID(func, bpf_kfunc_call_test_release)
 BTF_ID(func, bpf_kfunc_call_memb_release)
+BTF_ID(func, bpf_kfunc_call_test_kptr_get)
 BTF_ID(func, bpf_kfunc_call_test_pass_ctx)
 BTF_ID(func, bpf_kfunc_call_test_pass1)
 BTF_ID(func, bpf_kfunc_call_test_pass2)
@@ -682,6 +689,7 @@ BTF_SET_END(test_sk_check_kfunc_ids)
 
 BTF_SET_START(test_sk_acquire_kfunc_ids)
 BTF_ID(func, bpf_kfunc_call_test_acquire)
+BTF_ID(func, bpf_kfunc_call_test_kptr_get)
 BTF_SET_END(test_sk_acquire_kfunc_ids)
 
 BTF_SET_START(test_sk_release_kfunc_ids)
@@ -691,8 +699,13 @@ BTF_SET_END(test_sk_release_kfunc_ids)
 
 BTF_SET_START(test_sk_ret_null_kfunc_ids)
 BTF_ID(func, bpf_kfunc_call_test_acquire)
+BTF_ID(func, bpf_kfunc_call_test_kptr_get)
 BTF_SET_END(test_sk_ret_null_kfunc_ids)
 
+BTF_SET_START(test_sk_kptr_acquire_kfunc_ids)
+BTF_ID(func, bpf_kfunc_call_test_kptr_get)
+BTF_SET_END(test_sk_kptr_acquire_kfunc_ids)
+
 static void *bpf_test_init(const union bpf_attr *kattr, u32 user_size,
 			   u32 size, u32 headroom, u32 tailroom)
 {
@@ -1579,14 +1592,30 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
 
 static const struct btf_kfunc_id_set bpf_prog_test_kfunc_set = {
 	.owner        = THIS_MODULE,
-	.check_set    = &test_sk_check_kfunc_ids,
-	.acquire_set  = &test_sk_acquire_kfunc_ids,
-	.release_set  = &test_sk_release_kfunc_ids,
-	.ret_null_set = &test_sk_ret_null_kfunc_ids,
+	.check_set        = &test_sk_check_kfunc_ids,
+	.acquire_set      = &test_sk_acquire_kfunc_ids,
+	.release_set      = &test_sk_release_kfunc_ids,
+	.ret_null_set     = &test_sk_ret_null_kfunc_ids,
+	.kptr_acquire_set = &test_sk_kptr_acquire_kfunc_ids
 };
 
+BTF_ID_LIST(bpf_prog_test_dtor_kfunc_ids)
+BTF_ID(struct, prog_test_ref_kfunc)
+BTF_ID(func, bpf_kfunc_call_test_release)
+
 static int __init bpf_prog_test_run_init(void)
 {
-	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_prog_test_kfunc_set);
+	const struct btf_id_dtor_kfunc bpf_prog_test_dtor_kfunc[] = {
+		{
+		  .btf_id       = bpf_prog_test_dtor_kfunc_ids[0],
+		  .kfunc_btf_id = bpf_prog_test_dtor_kfunc_ids[1]
+		},
+	};
+	int ret;
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_prog_test_kfunc_set);
+	return ret ?: register_btf_id_dtor_kfuncs(bpf_prog_test_dtor_kfunc,
+						  ARRAY_SIZE(bpf_prog_test_dtor_kfunc),
+						  THIS_MODULE);
 }
 late_initcall(bpf_prog_test_run_init);
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index a2cd236c32eb..71c986d65e39 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -53,7 +53,7 @@
 #define MAX_INSNS	BPF_MAXINSNS
 #define MAX_TEST_INSNS	1000000
 #define MAX_FIXUPS	8
-#define MAX_NR_MAPS	22
+#define MAX_NR_MAPS	23
 #define MAX_TEST_RUNS	8
 #define POINTER_VALUE	0xcafe4all
 #define TEST_DATA_LEN	64
@@ -101,6 +101,7 @@ struct bpf_test {
 	int fixup_map_reuseport_array[MAX_FIXUPS];
 	int fixup_map_ringbuf[MAX_FIXUPS];
 	int fixup_map_timer[MAX_FIXUPS];
+	int fixup_map_kptr[MAX_FIXUPS];
 	struct kfunc_btf_id_pair fixup_kfunc_btf_id[MAX_FIXUPS];
 	/* Expected verifier log output for result REJECT or VERBOSE_ACCEPT.
 	 * Can be a tab-separated sequence of expected strings. An empty string
@@ -621,8 +622,16 @@ static int create_cgroup_storage(bool percpu)
  * struct timer {
  *   struct bpf_timer t;
  * };
+ * struct btf_ptr {
+ *   struct prog_test_ref_kfunc __kptr *ptr;
+ *   struct prog_test_ref_kfunc __kptr_ref *ptr;
+ *   struct prog_test_ref_kfunc __kptr_percpu *ptr;
+ *   struct prog_test_ref_kfunc __kptr_user *ptr;
+ * }
  */
-static const char btf_str_sec[] = "\0bpf_spin_lock\0val\0cnt\0l\0bpf_timer\0timer\0t";
+static const char btf_str_sec[] = "\0bpf_spin_lock\0val\0cnt\0l\0bpf_timer\0timer\0t"
+				  "\0btf_ptr\0prog_test_ref_kfunc\0ptr\0kptr\0kptr_ref"
+				  "\0kptr_percpu\0kptr_user";
 static __u32 btf_raw_types[] = {
 	/* int */
 	BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
@@ -638,6 +647,26 @@ static __u32 btf_raw_types[] = {
 	/* struct timer */                              /* [5] */
 	BTF_TYPE_ENC(35, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 1), 16),
 	BTF_MEMBER_ENC(41, 4, 0), /* struct bpf_timer t; */
+	/* struct prog_test_ref_kfunc */		/* [6] */
+	BTF_STRUCT_ENC(51, 0, 0),
+	/* type tag "kptr" */
+	BTF_TYPE_TAG_ENC(75, 6),			/* [7] */
+	/* type tag "kptr_ref" */
+	BTF_TYPE_TAG_ENC(80, 6),			/* [8] */
+	/* type tag "kptr_percpu" */
+	BTF_TYPE_TAG_ENC(89, 6),			/* [9] */
+	/* type tag "kptr_user" */
+	BTF_TYPE_TAG_ENC(101, 6),			/* [10] */
+	BTF_PTR_ENC(7),					/* [11] */
+	BTF_PTR_ENC(8),					/* [12] */
+	BTF_PTR_ENC(9),					/* [13] */
+	BTF_PTR_ENC(10),				/* [14] */
+	/* struct btf_ptr */				/* [15] */
+	BTF_STRUCT_ENC(43, 4, 32),
+	BTF_MEMBER_ENC(71, 11, 0), /* struct prog_test_ref_kfunc __kptr *ptr; */
+	BTF_MEMBER_ENC(71, 12, 64), /* struct prog_test_ref_kfunc __kptr_ref *ptr; */
+	BTF_MEMBER_ENC(71, 13, 128), /* struct prog_test_ref_kfunc __kptr_percpu *ptr; */
+	BTF_MEMBER_ENC(71, 14, 192), /* struct prog_test_ref_kfunc __kptr_user *ptr; */
 };
 
 static int load_btf(void)
@@ -727,6 +756,25 @@ static int create_map_timer(void)
 	return fd;
 }
 
+static int create_map_kptr(void)
+{
+	LIBBPF_OPTS(bpf_map_create_opts, opts,
+		.btf_key_type_id = 1,
+		.btf_value_type_id = 15,
+	);
+	int fd, btf_fd;
+
+	btf_fd = load_btf();
+	if (btf_fd < 0)
+		return -1;
+
+	opts.btf_fd = btf_fd;
+	fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "test_map", 4, 32, 1, &opts);
+	if (fd < 0)
+		printf("Failed to create map with btf_id pointer\n");
+	return fd;
+}
+
 static char bpf_vlog[UINT_MAX >> 8];
 
 static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
@@ -754,6 +802,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	int *fixup_map_reuseport_array = test->fixup_map_reuseport_array;
 	int *fixup_map_ringbuf = test->fixup_map_ringbuf;
 	int *fixup_map_timer = test->fixup_map_timer;
+	int *fixup_map_kptr = test->fixup_map_kptr;
 	struct kfunc_btf_id_pair *fixup_kfunc_btf_id = test->fixup_kfunc_btf_id;
 
 	if (test->fill_helper) {
@@ -947,6 +996,13 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 			fixup_map_timer++;
 		} while (*fixup_map_timer);
 	}
+	if (*fixup_map_kptr) {
+		map_fds[22] = create_map_kptr();
+		do {
+			prog[*fixup_map_kptr].imm = map_fds[22];
+			fixup_map_kptr++;
+		} while (*fixup_map_kptr);
+	}
 
 	/* Patch in kfunc BTF IDs */
 	if (fixup_kfunc_btf_id->kfunc) {
diff --git a/tools/testing/selftests/bpf/verifier/map_kptr.c b/tools/testing/selftests/bpf/verifier/map_kptr.c
new file mode 100644
index 000000000000..b97b56e4a08b
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/map_kptr.c
@@ -0,0 +1,763 @@
+/* Common tests */
+{
+	"map_kptr: BPF_ST imm != 0",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "BPF_ST imm must be 0 when storing to kptr at off=0",
+},
+{
+	"map_kptr: size != bpf_size_to_bytes(BPF_DW)",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ST_MEM(BPF_W, BPF_REG_0, 0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "kptr access size must be BPF_DW",
+},
+{
+	"map_kptr: map_value non-const var_off",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2, 0),
+	BPF_JMP_IMM(BPF_JLE, BPF_REG_2, 4, 1),
+	BPF_EXIT_INSN(),
+	BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_2),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_3, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "kptr access cannot have variable offset",
+},
+{
+	"map_kptr: bpf_kptr_xchg non-const var_off",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2, 0),
+	BPF_JMP_IMM(BPF_JLE, BPF_REG_2, 4, 1),
+	BPF_EXIT_INSN(),
+	BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_2),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_3),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 doesn't have constant offset. kptr has to be at the constant offset",
+},
+{
+	"map_kptr: unaligned boundary load/store",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 7),
+	BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "kptr access misaligned expected=0 off=7",
+},
+{
+	"map_kptr: reject var_off != 0",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0),
+	BPF_JMP_IMM(BPF_JLE, BPF_REG_2, 4, 1),
+	BPF_EXIT_INSN(),
+	BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "variable untrusted_ptr_ access var_off=(0x0; 0x7) disallowed",
+},
+/* Tests for unreferened PTR_TO_BTF_ID */
+{
+	"map_kptr: unref: reject btf_struct_ids_match == false",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "invalid kptr access, R1 type=untrusted_ptr_prog_test_ref_kfunc expected=ptr_or_null_prog_test",
+},
+{
+	"map_kptr: unref: loaded pointer marked as untrusted",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R0 invalid mem access 'untrusted_ptr_or_null_'",
+},
+{
+	"map_kptr: unref: correct in kernel type size",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 24),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "access beyond struct prog_test_ref_kfunc at off 24 size 8",
+},
+{
+	"map_kptr: unref: inherit PTR_UNTRUSTED on struct walk",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 16),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_this_cpu_ptr),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 type=untrusted_ptr_ expected=percpu_ptr_",
+},
+{
+	"map_kptr: unref: no reference state created",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = ACCEPT,
+},
+{
+	"map_kptr: unref: bpf_kptr_xchg rejected",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "off=0 kptr isn't referenced kptr",
+},
+{
+	"map_kptr: unref: bpf_kfunc_call_test_kptr_get rejected",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "arg#0 no referenced kptr at map value offset=0",
+	.fixup_kfunc_btf_id = {
+		{ "bpf_kfunc_call_test_kptr_get", 13 },
+	}
+},
+/* Tests for referenced PTR_TO_BTF_ID */
+{
+	"map_kptr: ref: loaded pointer marked as untrusted",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_IMM(BPF_REG_1, 0),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 8),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_this_cpu_ptr),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 type=untrusted_ptr_or_null_ expected=percpu_ptr_",
+},
+{
+	"map_kptr: ref: reject off != 0",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 4),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "dereference of modified ptr_ ptr R2 off=4 disallowed",
+},
+{
+	"map_kptr: ref: reference state created and released on xchg",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
+	BPF_ST_MEM(BPF_DW, BPF_REG_1, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "Unreleased reference id=5 alloc_insn=20",
+	.fixup_kfunc_btf_id = {
+		{ "bpf_kfunc_call_test_acquire", 15 },
+	}
+},
+{
+	"map_kptr: ref: reject STX",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, 0),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "accessing referenced kptr disallowed",
+},
+{
+	"map_kptr: ref: reject ST",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ST_MEM(BPF_DW, BPF_REG_0, 8, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "accessing referenced kptr disallowed",
+},
+/* Tests for PTR_TO_PERCPU_BTF_ID */
+{
+	"map_kptr: percpu: loaded pointer marked as percpu",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 16),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_this_cpu_ptr),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 type=percpu_ptr_or_null_ expected=percpu_ptr_",
+},
+{
+	"map_kptr: percpu: reject store of untrusted_ptr_",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 8),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 16),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "invalid kptr access, R1 type=untrusted_ptr_ expected=percpu_ptr_or_null_",
+},
+{
+	"map_kptr: percpu: reject store of ptr_",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 8),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 16),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "invalid kptr access, R0 type=ptr_ expected=percpu_ptr_or_null_",
+},
+{
+	"map_kptr: percpu: reject store of user_ptr_",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 24),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 16),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "invalid kptr access, R1 type=user_ptr_ expected=percpu_ptr_or_null_",
+},
+{
+	"map_kptr: percpu: bpf_kptr_xchg rejected",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 16),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "off=16 kptr isn't referenced kptr",
+},
+{
+	"map_kptr: percpu: bpf_kfunc_call_test_kptr_get rejected",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 16),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "arg#0 no referenced kptr at map value offset=16",
+	.fixup_kfunc_btf_id = {
+		{ "bpf_kfunc_call_test_kptr_get", 14 },
+	}
+},
+/* Tests for PTR_TO_BTF_ID | MEM_USER */
+{
+	"map_kptr: user: loaded pointer marked as user",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 24),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_this_cpu_ptr),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 type=user_ptr_or_null_ expected=percpu_ptr_",
+},
+{
+	"map_kptr: user: reject user pointer deref",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 24),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 8),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "R1 is ptr_prog_test_ref_kfunc access user memory: off=8",
+},
+{
+	"map_kptr: user: reject store of untrusted_ptr_",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 8),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 24),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "invalid kptr access, R1 type=untrusted_ptr_ expected=user_ptr_or_null_",
+},
+{
+	"map_kptr: user: reject store of ptr_",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 8),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 24),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "invalid kptr access, R0 type=ptr_ expected=user_ptr_or_null_",
+},
+{
+	"map_kptr: user: reject store of percpu_ptr_",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 16),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 24),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "invalid kptr access, R1 type=percpu_ptr_ expected=user_ptr_or_null_",
+},
+{
+	"map_kptr: user: bpf_kptr_xchg rejected",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 24),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_kptr_xchg),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "off=24 kptr isn't referenced kptr",
+},
+{
+	"map_kptr: user: bpf_kfunc_call_test_kptr_get rejected",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_LD_MAP_FD(BPF_REG_6, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 24),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_kptr = { 1 },
+	.result = REJECT,
+	.errstr = "arg#0 no referenced kptr at map value offset=24",
+	.fixup_kfunc_btf_id = {
+		{ "bpf_kfunc_call_test_kptr_get", 14 },
+	}
+},
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic
  2022-03-17 11:59 ` [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
@ 2022-03-19 17:55   ` Alexei Starovoitov
  2022-03-19 19:31     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 17:55 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Thu, Mar 17, 2022 at 05:29:44PM +0530, Kumar Kartikeya Dwivedi wrote:
> Next commit's field type will not be struct, but pointer, and it will
> not be limited to one offset, but multiple ones. Make existing
> btf_find_struct_field and btf_find_datasec_var functions amenable to use
> for finding BTF ID pointers in map value, by taking a moving spin_lock
> and timer specific checks into their own function.
> 
> The alignment, and name are checked before the function is called, so it
> is the last point where we can skip field or return an error before the
> next loop iteration happens. This is important, because we'll be
> potentially reallocating memory inside this function in next commit, so
> being able to do that when everything else is in order is going to be
> more convenient.
> 
> The name parameter is now optional, and only checked if it is not NULL.
> 
> The size must be checked inside the function, because in the case of PTR
> the type will instead refer to the underlying BTF ID it points to (or to
> modifiers), so doing the check outside the function would be wrong, and
> the base type has to be obtained by skipping the modifiers.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  kernel/bpf/btf.c | 120 +++++++++++++++++++++++++++++++++--------------
>  1 file changed, 86 insertions(+), 34 deletions(-)
> 
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 17b9adcd88d3..5b2824332880 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3161,71 +3161,109 @@ static void btf_struct_log(struct btf_verifier_env *env,
>  	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
>  }
>  
> +enum {
> +	BTF_FIELD_SPIN_LOCK,
> +	BTF_FIELD_TIMER,
> +};
> +
> +struct btf_field_info {
> +	u32 off;
> +};
> +
> +static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> +				 u32 off, int sz, struct btf_field_info *info)
> +{
> +	if (!__btf_type_is_struct(t))
> +		return 0;
> +	if (t->size != sz)
> +		return 0;
> +	if (info->off != -ENOENT)
> +		/* only one such field is allowed */
> +		return -E2BIG;
> +	info->off = off;
> +	return 0;
> +}
> +
>  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> -				 const char *name, int sz, int align)
> +				 const char *name, int sz, int align, int field_type,
> +				 struct btf_field_info *info)
>  {
>  	const struct btf_member *member;
> -	u32 i, off = -ENOENT;
> +	u32 i, off;
> +	int ret;
>  
>  	for_each_member(i, t, member) {
>  		const struct btf_type *member_type = btf_type_by_id(btf,
>  								    member->type);
> -		if (!__btf_type_is_struct(member_type))
> -			continue;
> -		if (member_type->size != sz)
> -			continue;
> -		if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> -			continue;
> -		if (off != -ENOENT)
> -			/* only one such field is allowed */
> -			return -E2BIG;
> +
>  		off = __btf_member_bit_offset(t, member);
> +
> +		if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> +			continue;
>  		if (off % 8)
>  			/* valid C code cannot generate such BTF */
>  			return -EINVAL;
>  		off /= 8;
>  		if (off % align)
>  			return -EINVAL;
> +
> +		switch (field_type) {
> +		case BTF_FIELD_SPIN_LOCK:
> +		case BTF_FIELD_TIMER:

Since spin_lock vs timer is passed into btf_find_struct_field() as field_type
argument there is no need to pass name, sz, align from the caller.
Pls make btf_find_spin_lock() to pass BTF_FIELD_SPIN_LOCK only
and in the above code do something like:
 switch (field_type) {
 case BTF_FIELD_SPIN_LOCK:
     name = "bpf_spin_lock";
     sz = ...
     break;
 case BTF_FIELD_TIMER:
     name = "bpf_timer";
     sz = ...
     break;
 }
 switch (field_type) {
 case BTF_FIELD_SPIN_LOCK:
 case BTF_FIELD_TIMER:
	if (!__btf_type_is_struct(member_type))
		continue;
	if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
        ...
        btf_find_field_struct(btf, member_type, off, sz, info);
 }

It will clean up the later patch, which passes NULL, sizeof(u64), alignof(u64)
only to pass something into the function.
With the above suggestion it wouldn't need to pass dummy args. BTF_FIELD_KPTR will be enough.
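
For illustration, the callers could then reduce to something like this
(just a sketch of the suggestion above; the exact reduced signature of
btf_find_field is up to you):

 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
 {
	struct btf_field_info info = { .off = -ENOENT };
	int ret;

	/* name, sz, align are derived internally from BTF_FIELD_SPIN_LOCK */
	ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
	if (ret < 0)
		return ret;
	return info.off;
 }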

> +			ret = btf_find_field_struct(btf, member_type, off, sz, info);
> +			if (ret < 0)
> +				return ret;
> +			break;
> +		default:
> +			return -EFAULT;
> +		}
>  	}
> -	return off;
> +	return 0;
>  }
>  
>  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> -				const char *name, int sz, int align)
> +				const char *name, int sz, int align, int field_type,
> +				struct btf_field_info *info)
>  {
>  	const struct btf_var_secinfo *vsi;
> -	u32 i, off = -ENOENT;
> +	u32 i, off;
> +	int ret;
>  
>  	for_each_vsi(i, t, vsi) {
>  		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
>  		const struct btf_type *var_type = btf_type_by_id(btf, var->type);
>  
> -		if (!__btf_type_is_struct(var_type))
> -			continue;
> -		if (var_type->size != sz)
> +		off = vsi->offset;
> +
> +		if (name && strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
>  			continue;
>  		if (vsi->size != sz)
>  			continue;
> -		if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
> -			continue;
> -		if (off != -ENOENT)
> -			/* only one such field is allowed */
> -			return -E2BIG;
> -		off = vsi->offset;
>  		if (off % align)
>  			return -EINVAL;
> +
> +		switch (field_type) {
> +		case BTF_FIELD_SPIN_LOCK:
> +		case BTF_FIELD_TIMER:
> +			ret = btf_find_field_struct(btf, var_type, off, sz, info);
> +			if (ret < 0)
> +				return ret;
> +			break;
> +		default:
> +			return -EFAULT;
> +		}
>  	}
> -	return off;
> +	return 0;
>  }
>  
>  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> -			  const char *name, int sz, int align)
> +			  const char *name, int sz, int align, int field_type,
> +			  struct btf_field_info *info)
>  {
> -
>  	if (__btf_type_is_struct(t))
> -		return btf_find_struct_field(btf, t, name, sz, align);
> +		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
>  	else if (btf_type_is_datasec(t))
> -		return btf_find_datasec_var(btf, t, name, sz, align);
> +		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
>  	return -EINVAL;
>  }
>  
> @@ -3235,16 +3273,30 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>   */
>  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
>  {
> -	return btf_find_field(btf, t, "bpf_spin_lock",
> -			      sizeof(struct bpf_spin_lock),
> -			      __alignof__(struct bpf_spin_lock));
> +	struct btf_field_info info = { .off = -ENOENT };
> +	int ret;
> +
> +	ret = btf_find_field(btf, t, "bpf_spin_lock",
> +			     sizeof(struct bpf_spin_lock),
> +			     __alignof__(struct bpf_spin_lock),
> +			     BTF_FIELD_SPIN_LOCK, &info);
> +	if (ret < 0)
> +		return ret;
> +	return info.off;
>  }
>  
>  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
>  {
> -	return btf_find_field(btf, t, "bpf_timer",
> -			      sizeof(struct bpf_timer),
> -			      __alignof__(struct bpf_timer));
> +	struct btf_field_info info = { .off = -ENOENT };
> +	int ret;
> +
> +	ret = btf_find_field(btf, t, "bpf_timer",
> +			     sizeof(struct bpf_timer),
> +			     __alignof__(struct bpf_timer),
> +			     BTF_FIELD_TIMER, &info);
> +	if (ret < 0)
> +		return ret;
> +	return info.off;
>  }
>  
>  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> -- 
> 2.35.1
> 

-- 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map
  2022-03-17 11:59 ` [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
@ 2022-03-19 18:15   ` Alexei Starovoitov
  2022-03-19 18:52     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 18:15 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Thu, Mar 17, 2022 at 05:29:45PM +0530, Kumar Kartikeya Dwivedi wrote:
> This commit introduces a new pointer type 'kptr' which can be embedded
> in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> its invocation. When storing to such a kptr, the BPF program's
> PTR_TO_BTF_ID register must have the same type as in the map value's
> BTF, and loading a kptr marks the destination register as PTR_TO_BTF_ID
> with the correct kernel BTF and BTF ID.
> 
> Such kptrs are unreferenced, i.e. by the time another invocation of the
> BPF program loads this pointer, the object which the pointer points to
> may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> patched to PROBE_MEM loads by the verifier, it would be safe to allow
> the user to still access such an invalid pointer, but passing such
> pointers into BPF helpers and kfuncs should not be permitted. A future
> patch in this series will close this gap.
> 
> The flexibility offered by allowing programs to dereference such invalid
> pointers while being safe at runtime frees the verifier from doing
> complex lifetime tracking. As long as the user can ensure that the
> object remains valid, the data it reads from the kernel object will be
> valid.
> 
> The user indicates that a certain pointer must be treated as kptr
> capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> information is recorded in the object BTF which will be passed into the
> kernel by way of the map's BTF information. The name and kind from the
> map value BTF are used to look up the in-kernel type, and the actual BTF
> and BTF ID are recorded in the map struct in a new kptr_off_tab member.
> For now, only storing pointers to structs is permitted.
> 
> An example of this specification is shown below:
> 
> 	#define __kptr __attribute__((btf_type_tag("kptr")))
> 
> 	struct map_value {
> 		...
> 		struct task_struct __kptr *task;
> 		...
> 	};
> 
> Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> task_struct into the map, and then load it later.
> 
> Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL; as
> the verifier cannot statically know whether the value is NULL or not, it
> must treat all potential loads at that map value offset as loading a
> possibly NULL pointer.
> 
> Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> are allowed instructions that can access such a pointer. On BPF_LDX, the
> destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> it is checked whether the source register type is a PTR_TO_BTF_ID with
> same BTF type as specified in the map BTF. The access size must always
> be BPF_DW.
> 
> For the map in map support, the kptr_off_tab for the outer map is copied
> from the inner map's kptr_off_tab. A deep copy was chosen over
> introducing a refcount to kptr_off_tab, because the copy only needs to
> be done when parameterizing using inner_map_fd in the map in map case,
> and hence would be unnecessary for all other users.
> 
> It is not permitted to use MAP_FREEZE command and mmap for BPF map
> having kptr, similar to the bpf_timer case.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h     |  29 +++++-
>  include/linux/btf.h     |   2 +
>  kernel/bpf/btf.c        | 151 +++++++++++++++++++++++++----
>  kernel/bpf/map_in_map.c |   5 +-
>  kernel/bpf/syscall.c    | 110 ++++++++++++++++++++-
>  kernel/bpf/verifier.c   | 207 ++++++++++++++++++++++++++++++++--------
>  6 files changed, 442 insertions(+), 62 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 88449fbbe063..f35920d279dd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -155,6 +155,22 @@ struct bpf_map_ops {
>  	const struct bpf_iter_seq_info *iter_seq_info;
>  };
>  
> +enum {
> +	/* Support at most 8 pointers in a BPF map value */
> +	BPF_MAP_VALUE_OFF_MAX = 8,
> +};
> +
> +struct bpf_map_value_off_desc {
> +	u32 offset;
> +	u32 btf_id;
> +	struct btf *btf;
> +};
> +
> +struct bpf_map_value_off {
> +	u32 nr_off;
> +	struct bpf_map_value_off_desc off[];
> +};
> +
>  struct bpf_map {
>  	/* The first two cachelines with read-mostly members of which some
>  	 * are also accessed in fast-path (e.g. ops, max_entries).
> @@ -171,6 +187,7 @@ struct bpf_map {
>  	u64 map_extra; /* any per-map-type extra fields */
>  	u32 map_flags;
>  	int spin_lock_off; /* >=0 valid offset, <0 error */
> +	struct bpf_map_value_off *kptr_off_tab;
>  	int timer_off; /* >=0 valid offset, <0 error */
>  	u32 id;
>  	int numa_node;
> @@ -184,7 +201,7 @@ struct bpf_map {
>  	char name[BPF_OBJ_NAME_LEN];
>  	bool bypass_spec_v1;
>  	bool frozen; /* write-once; write-protected by freeze_mutex */
> -	/* 14 bytes hole */
> +	/* 6 bytes hole */
>  
>  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
>  	 * particularly with refcounting.
> @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
>  	return map->timer_off >= 0;
>  }
>  
> +static inline bool map_value_has_kptr(const struct bpf_map *map)
> +{
> +	return !IS_ERR_OR_NULL(map->kptr_off_tab);
> +}
> +
>  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>  {
>  	if (unlikely(map_value_has_spin_lock(map)))
> @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
>  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
>  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
>  
> +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> +
>  struct bpf_map *bpf_map_get(u32 ufd);
>  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
>  struct bpf_map *__bpf_map_get(struct fd f);
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 36bc09b8e890..5b578dc81c04 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
>  			   u32 expected_offset, u32 expected_size);
>  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
>  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> +					const struct btf_type *t);
>  bool btf_type_is_void(const struct btf_type *t);
>  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
>  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 5b2824332880..9ac9364ef533 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3164,33 +3164,79 @@ static void btf_struct_log(struct btf_verifier_env *env,
>  enum {
>  	BTF_FIELD_SPIN_LOCK,
>  	BTF_FIELD_TIMER,
> +	BTF_FIELD_KPTR,
> +};
> +
> +enum {
> +	BTF_FIELD_IGNORE = 0,
> +	BTF_FIELD_FOUND  = 1,
>  };
>  
>  struct btf_field_info {
> +	const struct btf_type *type;
>  	u32 off;
>  };
>  
>  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> -				 u32 off, int sz, struct btf_field_info *info)
> +				 u32 off, int sz, struct btf_field_info *info,
> +				 int info_cnt, int idx)
>  {
>  	if (!__btf_type_is_struct(t))
> -		return 0;
> +		return BTF_FIELD_IGNORE;
>  	if (t->size != sz)
> -		return 0;
> -	if (info->off != -ENOENT)
> -		/* only one such field is allowed */
> +		return BTF_FIELD_IGNORE;
> +	if (idx >= info_cnt)

No need to pass info_cnt, idx into this function.
Move idx >= info_cnt check into the caller and let
caller do 'info++' and pass that.
This function will simply write into 'info'.
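
As a sketch (the caller bounds-checks idx, passes &info[idx], and bumps
idx on BTF_FIELD_FOUND):

 static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
				  u32 off, int sz, struct btf_field_info *info)
 {
	if (!__btf_type_is_struct(t))
		return BTF_FIELD_IGNORE;
	if (t->size != sz)
		return BTF_FIELD_IGNORE;
	info->off = off;
	return BTF_FIELD_FOUND;
 }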

>  		return -E2BIG;
> +	info[idx].off = off;
>  	info->off = off;

This can't be right: info[idx].off was just set on the previous line, so
this extra info->off store looks like a leftover.

> -	return 0;
> +	return BTF_FIELD_FOUND;
> +}
> +
> +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> +			       u32 off, int sz, struct btf_field_info *info,
> +			       int info_cnt, int idx)
> +{
> +	bool kptr_tag = false;
> +
> +	/* For PTR, sz is always == 8 */
> +	if (!btf_type_is_ptr(t))
> +		return BTF_FIELD_IGNORE;
> +	t = btf_type_by_id(btf, t->type);
> +
> +	while (btf_type_is_type_tag(t)) {
> +		if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off))) {
> +			/* repeated tag */
> +			if (kptr_tag)
> +				return -EEXIST;
> +			kptr_tag = true;
> +		}
> +		/* Look for next tag */
> +		t = btf_type_by_id(btf, t->type);
> +	}

There is no need for while() loop and 4 bool kptr_*_tag checks.
Just do:
  if (!btf_type_is_type_tag(t))
     return BTF_FIELD_IGNORE;
  /* check next tag */
  if (btf_type_is_type_tag(btf_type_by_id(btf, t->type))
     return -EINVAL;
  if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
     flag = 0;
  else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off)))
    flag = BPF_MAP_VALUE_OFF_F_REF;
  ...

> +	if (!kptr_tag)
> +		return BTF_FIELD_IGNORE;
> +
> +	/* Get the base type */
> +	if (btf_type_is_modifier(t))
> +		t = btf_type_skip_modifiers(btf, t->type, NULL);
> +	/* Only pointer to struct is allowed */
> +	if (!__btf_type_is_struct(t))
> +		return -EINVAL;
> +
> +	if (idx >= info_cnt)
> +		return -E2BIG;
> +	info[idx].type = t;
> +	info[idx].off = off;
> +	return BTF_FIELD_FOUND;
>  }
>  
>  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
>  				 const char *name, int sz, int align, int field_type,
> -				 struct btf_field_info *info)
> +				 struct btf_field_info *info, int info_cnt)
>  {
>  	const struct btf_member *member;
> +	int ret, idx = 0;
>  	u32 i, off;
> -	int ret;
>  
>  	for_each_member(i, t, member) {
>  		const struct btf_type *member_type = btf_type_by_id(btf,
> @@ -3210,24 +3256,33 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
>  		switch (field_type) {
>  		case BTF_FIELD_SPIN_LOCK:
>  		case BTF_FIELD_TIMER:
> -			ret = btf_find_field_struct(btf, member_type, off, sz, info);
> +			ret = btf_find_field_struct(btf, member_type, off, sz, info, info_cnt, idx);
> +			if (ret < 0)
> +				return ret;
> +			break;
> +		case BTF_FIELD_KPTR:
> +			ret = btf_find_field_kptr(btf, member_type, off, sz, info, info_cnt, idx);
>  			if (ret < 0)
>  				return ret;
>  			break;
>  		default:
>  			return -EFAULT;
>  		}
> +
> +		if (ret == BTF_FIELD_IGNORE)
> +			continue;
> +		++idx;
>  	}
> -	return 0;
> +	return idx;
>  }
>  
>  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>  				const char *name, int sz, int align, int field_type,
> -				struct btf_field_info *info)
> +				struct btf_field_info *info, int info_cnt)
>  {
>  	const struct btf_var_secinfo *vsi;
> +	int ret, idx = 0;
>  	u32 i, off;
> -	int ret;
>  
>  	for_each_vsi(i, t, vsi) {
>  		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> @@ -3245,25 +3300,34 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>  		switch (field_type) {
>  		case BTF_FIELD_SPIN_LOCK:
>  		case BTF_FIELD_TIMER:
> -			ret = btf_find_field_struct(btf, var_type, off, sz, info);
> +			ret = btf_find_field_struct(btf, var_type, off, sz, info, info_cnt, idx);
> +			if (ret < 0)
> +				return ret;
> +			break;
> +		case BTF_FIELD_KPTR:
> +			ret = btf_find_field_kptr(btf, var_type, off, sz, info, info_cnt, idx);
>  			if (ret < 0)
>  				return ret;
>  			break;
>  		default:
>  			return -EFAULT;
>  		}
> +
> +		if (ret == BTF_FIELD_IGNORE)
> +			continue;
> +		++idx;
>  	}
> -	return 0;
> +	return idx;
>  }
>  
>  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>  			  const char *name, int sz, int align, int field_type,
> -			  struct btf_field_info *info)
> +			  struct btf_field_info *info, int info_cnt)
>  {
>  	if (__btf_type_is_struct(t))
> -		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> +		return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt);
>  	else if (btf_type_is_datasec(t))
> -		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> +		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt);
>  	return -EINVAL;
>  }
>  
> @@ -3279,7 +3343,7 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
>  	ret = btf_find_field(btf, t, "bpf_spin_lock",
>  			     sizeof(struct bpf_spin_lock),
>  			     __alignof__(struct bpf_spin_lock),
> -			     BTF_FIELD_SPIN_LOCK, &info);
> +			     BTF_FIELD_SPIN_LOCK, &info, 1);
>  	if (ret < 0)
>  		return ret;
>  	return info.off;
> @@ -3293,12 +3357,61 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t)
>  	ret = btf_find_field(btf, t, "bpf_timer",
>  			     sizeof(struct bpf_timer),
>  			     __alignof__(struct bpf_timer),
> -			     BTF_FIELD_TIMER, &info);
> +			     BTF_FIELD_TIMER, &info, 1);
>  	if (ret < 0)
>  		return ret;
>  	return info.off;
>  }
>  
> +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> +					const struct btf_type *t)
> +{
> +	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
> +	struct bpf_map_value_off *tab;
> +	int ret, i, nr_off;
> +
> +	/* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> +	BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
> +
> +	ret = btf_find_field(btf, t, NULL, sizeof(u64), __alignof__(u64),

these pointless args will be gone with the suggestion in the previous patch.

> +			     BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +	if (!ret)
> +		return 0;
> +
> +	nr_off = ret;
> +	tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> +	if (!tab)
> +		return ERR_PTR(-ENOMEM);
> +
> +	tab->nr_off = 0;
> +	for (i = 0; i < nr_off; i++) {
> +		const struct btf_type *t;
> +		struct btf *off_btf;
> +		s32 id;
> +
> +		t = info_arr[i].type;
> +		id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> +				     &off_btf);
> +		if (id < 0) {
> +			ret = id;
> +			goto end;
> +		}
> +
> +		tab->off[i].offset = info_arr[i].off;
> +		tab->off[i].btf_id = id;
> +		tab->off[i].btf = off_btf;
> +		tab->nr_off = i + 1;
> +	}
> +	return tab;
> +end:
> +	while (tab->nr_off--)
> +		btf_put(tab->off[tab->nr_off].btf);
> +	kfree(tab);
> +	return ERR_PTR(ret);
> +}
> +
>  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
>  			      u32 type_id, void *data, u8 bits_offset,
>  			      struct btf_show *show)
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index 5cd8f5277279..135205d0d560 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>  	inner_map_meta->max_entries = inner_map->max_entries;
>  	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
>  	inner_map_meta->timer_off = inner_map->timer_off;
> +	inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
>  	if (inner_map->btf) {
>  		btf_get(inner_map->btf);
>  		inner_map_meta->btf = inner_map->btf;
> @@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>  
>  void bpf_map_meta_free(struct bpf_map *map_meta)
>  {
> +	bpf_map_free_kptr_off_tab(map_meta);
>  	btf_put(map_meta->btf);
>  	kfree(map_meta);
>  }
> @@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
>  		meta0->key_size == meta1->key_size &&
>  		meta0->value_size == meta1->value_size &&
>  		meta0->timer_off == meta1->timer_off &&
> -		meta0->map_flags == meta1->map_flags;
> +		meta0->map_flags == meta1->map_flags &&
> +		bpf_map_equal_kptr_off_tab(meta0, meta1);
>  }
>  
>  void *bpf_map_fd_get_ptr(struct bpf_map *map,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 9beb585be5a6..87263b07f40b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -6,6 +6,7 @@
>  #include <linux/bpf_trace.h>
>  #include <linux/bpf_lirc.h>
>  #include <linux/bpf_verifier.h>
> +#include <linux/bsearch.h>
>  #include <linux/btf.h>
>  #include <linux/syscalls.h>
>  #include <linux/slab.h>
> @@ -472,12 +473,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
>  }
>  #endif
>  
> +static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> +{
> +	const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
> +
> +	if (off_desc1->offset < off_desc2->offset)
> +		return -1;
> +	else if (off_desc1->offset > off_desc2->offset)
> +		return 1;
> +	return 0;
> +}
> +
> +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> +{
> +	/* Since members are iterated in btf_find_field in increasing order,
> +	 * offsets appended to kptr_off_tab are in increasing order, so we can
> +	 * do bsearch to find exact match.
> +	 */
> +	struct bpf_map_value_off *tab;
> +
> +	if (!map_value_has_kptr(map))
> +		return NULL;
> +	tab = map->kptr_off_tab;
> +	return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
> +}
> +
> +void bpf_map_free_kptr_off_tab(struct bpf_map *map)
> +{
> +	struct bpf_map_value_off *tab = map->kptr_off_tab;
> +	int i;
> +
> +	if (!map_value_has_kptr(map))
> +		return;
> +	for (i = 0; i < tab->nr_off; i++) {
> +		struct btf *btf = tab->off[i].btf;
> +
> +		btf_put(btf);
> +	}
> +	kfree(tab);
> +	map->kptr_off_tab = NULL;
> +}
> +
> +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
> +{
> +	struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
> +	int size, i, ret;
> +
> +	if (!map_value_has_kptr(map))
> +		return ERR_PTR(-ENOENT);
> +	/* Do a deep copy of the kptr_off_tab */
> +	for (i = 0; i < tab->nr_off; i++)
> +		btf_get(tab->off[i].btf);
> +
> +	size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
> +	new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> +	if (!new_tab) {
> +		ret = -ENOMEM;
> +		goto end;
> +	}
> +	memcpy(new_tab, tab, size);
> +	return new_tab;
> +end:
> +	while (i--)
> +		btf_put(tab->off[i].btf);
> +	return ERR_PTR(ret);
> +}
> +
> +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> +{
> +	struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> +	bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
> +	int size;
> +
> +	if (!a_has_kptr && !b_has_kptr)
> +		return true;
> +	if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))
> +		return false;
> +	if (tab_a->nr_off != tab_b->nr_off)
> +		return false;
> +	size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> +	return !memcmp(tab_a, tab_b, size);
> +}
> +
>  /* called from workqueue */
>  static void bpf_map_free_deferred(struct work_struct *work)
>  {
>  	struct bpf_map *map = container_of(work, struct bpf_map, work);
>  
>  	security_bpf_map_free(map);
> +	bpf_map_free_kptr_off_tab(map);
>  	bpf_map_release_memcg(map);
>  	/* implementation dependent freeing */
>  	map->ops->map_free(map);
> @@ -639,7 +723,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
>  	int err;
>  
>  	if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> -	    map_value_has_timer(map))
> +	    map_value_has_timer(map) || map_value_has_kptr(map))
>  		return -ENOTSUPP;
>  
>  	if (!(vma->vm_flags & VM_SHARED))
> @@ -819,9 +903,29 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
>  			return -EOPNOTSUPP;
>  	}
>  
> -	if (map->ops->map_check_btf)
> +	map->kptr_off_tab = btf_find_kptr(btf, value_type);
> +	if (map_value_has_kptr(map)) {
> +		if (map->map_flags & BPF_F_RDONLY_PROG) {
> +			ret = -EACCES;
> +			goto free_map_tab;
> +		}
> +		if (map->map_type != BPF_MAP_TYPE_HASH &&
> +		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> +		    map->map_type != BPF_MAP_TYPE_ARRAY) {
> +			ret = -EOPNOTSUPP;
> +			goto free_map_tab;
> +		}
> +	}
> +
> +	if (map->ops->map_check_btf) {
>  		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> +		if (ret < 0)
> +			goto free_map_tab;
> +	}
>  
> +	return ret;
> +free_map_tab:
> +	bpf_map_free_kptr_off_tab(map);
>  	return ret;
>  }
>  
> @@ -1638,7 +1742,7 @@ static int map_freeze(const union bpf_attr *attr)
>  		return PTR_ERR(map);
>  
>  	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> -	    map_value_has_timer(map)) {
> +	    map_value_has_timer(map) || map_value_has_kptr(map)) {
>  		fdput(f);
>  		return -ENOTSUPP;
>  	}
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index cf92f9c01556..881d1381757b 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3469,6 +3469,143 @@ static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno,
>  	return 0;
>  }
>  
> +static int __check_ptr_off_reg(struct bpf_verifier_env *env,
> +			       const struct bpf_reg_state *reg, int regno,
> +			       bool fixed_off_ok)
> +{
> +	/* Access to this pointer-typed register or passing it to a helper
> +	 * is only allowed in its original, unmodified form.
> +	 */
> +
> +	if (reg->off < 0) {
> +		verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
> +			reg_type_str(env, reg->type), regno, reg->off);
> +		return -EACCES;
> +	}
> +
> +	if (!fixed_off_ok && reg->off) {
> +		verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
> +			reg_type_str(env, reg->type), regno, reg->off);
> +		return -EACCES;
> +	}
> +
> +	if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
> +		char tn_buf[48];
> +
> +		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> +		verbose(env, "variable %s access var_off=%s disallowed\n",
> +			reg_type_str(env, reg->type), tn_buf);
> +		return -EACCES;
> +	}
> +
> +	return 0;
> +}
> +
> +int check_ptr_off_reg(struct bpf_verifier_env *env,
> +		      const struct bpf_reg_state *reg, int regno)
> +{
> +	return __check_ptr_off_reg(env, reg, regno, false);
> +}
> +
> +static int map_kptr_match_type(struct bpf_verifier_env *env,
> +			       struct bpf_map_value_off_desc *off_desc,
> +			       struct bpf_reg_state *reg, u32 regno)
> +{
> +	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> +	const char *reg_name = "";
> +
> +	if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> +		goto bad_type;
> +
> +	if (!btf_is_kernel(reg->btf)) {
> +		verbose(env, "R%d must point to kernel BTF\n", regno);
> +		return -EINVAL;
> +	}
> +	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
> +	reg_name = kernel_type_name(reg->btf, reg->btf_id);
> +
> +	if (__check_ptr_off_reg(env, reg, regno, true))
> +		return -EACCES;
> +
> +	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> +				  off_desc->btf, off_desc->btf_id))
> +		goto bad_type;
> +	return 0;
> +bad_type:
> +	verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> +		reg_type_str(env, reg->type), reg_name);
> +	verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> +	return -EINVAL;
> +}
> +
> +/* Returns an error, or 0 if ignoring the access, or 1 if register state was
> + * updated, in which case later updates must be skipped.
> + */
> +static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> +				 int off, int size, int value_regno,
> +				 enum bpf_access_type t, int insn_idx)
> +{
> +	struct bpf_reg_state *reg = reg_state(env, regno), *val_reg;
> +	struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
> +	struct bpf_map_value_off_desc *off_desc;
> +	int insn_class = BPF_CLASS(insn->code);
> +	struct bpf_map *map = reg->map_ptr;
> +
> +	/* Things we already checked for in check_map_access:
> +	 *  - Reject cases where variable offset may touch BTF ID pointer
> +	 *  - size of access (must be BPF_DW)
> +	 *  - off_desc->offset == off + reg->var_off.value
> +	 */
> +	if (!tnum_is_const(reg->var_off))
> +		return 0;
> +
> +	off_desc = bpf_map_kptr_off_contains(map, off + reg->var_off.value);
> +	if (!off_desc)
> +		return 0;
> +
> +	if (WARN_ON_ONCE(size != bpf_size_to_bytes(BPF_DW)))

since the check was already made, please avoid defensive programming.

> +		return -EACCES;
> +
> +	if (BPF_MODE(insn->code) != BPF_MEM)

what is this? a fancy way to filter ldx/stx/st insns?
Pls add a comment if so.
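
Assuming the intent is to reject non-BPF_MEM modes (e.g. BPF_ATOMIC),
something like:

	/* Only plain BPF_MEM loads/stores may access a kptr slot; other
	 * modes (e.g. BPF_ATOMIC) are rejected below.
	 */
	if (BPF_MODE(insn->code) != BPF_MEM)
		goto end;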

> +		goto end;
> +
> +	if (!env->bpf_capable) {
> +		verbose(env, "kptr access only allowed for CAP_BPF and CAP_SYS_ADMIN\n");
> +		return -EPERM;

Please move this check into map_check_btf().
That's the earliest place we can issue such an error.
Doing it here is too late and doesn't help users, but makes run-time slower,
since this function is called a lot more times than map_check_btf.
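
For example, on top of the map_check_btf() hunk in this patch (a sketch;
the exact error code is your call):

	map->kptr_off_tab = btf_find_kptr(btf, value_type);
	if (map_value_has_kptr(map)) {
		if (!bpf_capable()) {
			ret = -EPERM;
			goto free_map_tab;
		}
		if (map->map_flags & BPF_F_RDONLY_PROG) {
			ret = -EACCES;
			goto free_map_tab;
		}
		/* map_type checks from this patch follow */
	}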

> +	}
> +
> +	if (insn_class == BPF_LDX) {
> +		if (WARN_ON_ONCE(value_regno < 0))
> +			return -EFAULT;

defensive programming? Pls dont.

> +		val_reg = reg_state(env, value_regno);
> +		/* We can simply mark the value_regno receiving the pointer
> +		 * value from map as PTR_TO_BTF_ID, with the correct type.
> +		 */
> +		mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
> +				off_desc->btf_id, PTR_MAYBE_NULL);
> +		val_reg->id = ++env->id_gen;
> +	} else if (insn_class == BPF_STX) {
> +		if (WARN_ON_ONCE(value_regno < 0))
> +			return -EFAULT;
> +		val_reg = reg_state(env, value_regno);
> +		if (!register_is_null(val_reg) &&
> +		    map_kptr_match_type(env, off_desc, val_reg, value_regno))
> +			return -EACCES;
> +	} else if (insn_class == BPF_ST) {
> +		if (insn->imm) {
> +			verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> +				off_desc->offset);
> +			return -EACCES;
> +		}
> +	} else {
> +		goto end;
> +	}
> +	return 1;
> +end:
> +	verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
> +	return -EACCES;
> +}
> +
>  /* check read/write into a map element with possible variable offset */
>  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>  			    int off, int size, bool zero_size_allowed)
> @@ -3507,6 +3644,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>  			return -EACCES;
>  		}
>  	}
> +	if (map_value_has_kptr(map)) {
> +		struct bpf_map_value_off *tab = map->kptr_off_tab;
> +		int i;
> +
> +		for (i = 0; i < tab->nr_off; i++) {
> +			u32 p = tab->off[i].offset;
> +
> +			if (reg->smin_value + off < p + sizeof(u64) &&
> +			    p < reg->umax_value + off + size) {
> +				if (!tnum_is_const(reg->var_off)) {
> +					verbose(env, "kptr access cannot have variable offset\n");
> +					return -EACCES;
> +				}
> +				if (p != off + reg->var_off.value) {
> +					verbose(env, "kptr access misaligned expected=%u off=%llu\n",
> +						p, off + reg->var_off.value);
> +					return -EACCES;
> +				}
> +				if (size != bpf_size_to_bytes(BPF_DW)) {
> +					verbose(env, "kptr access size must be BPF_DW\n");
> +					return -EACCES;
> +				}
> +				break;
> +			}
> +		}
> +	}
>  	return err;
>  }
>  
> @@ -3980,44 +4143,6 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
>  }
>  #endif
>  
> -static int __check_ptr_off_reg(struct bpf_verifier_env *env,
> -			       const struct bpf_reg_state *reg, int regno,
> -			       bool fixed_off_ok)
> -{
> -	/* Access to this pointer-typed register or passing it to a helper
> -	 * is only allowed in its original, unmodified form.
> -	 */
> -
> -	if (reg->off < 0) {
> -		verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
> -			reg_type_str(env, reg->type), regno, reg->off);
> -		return -EACCES;
> -	}
> -
> -	if (!fixed_off_ok && reg->off) {
> -		verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
> -			reg_type_str(env, reg->type), regno, reg->off);
> -		return -EACCES;
> -	}
> -
> -	if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
> -		char tn_buf[48];
> -
> -		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> -		verbose(env, "variable %s access var_off=%s disallowed\n",
> -			reg_type_str(env, reg->type), tn_buf);
> -		return -EACCES;
> -	}
> -
> -	return 0;
> -}
> -
> -int check_ptr_off_reg(struct bpf_verifier_env *env,
> -		      const struct bpf_reg_state *reg, int regno)
> -{
> -	return __check_ptr_off_reg(env, reg, regno, false);
> -}
> -

please split the hunk that moves code around into a separate patch.
Don't mix it with actual changes.

>  static int __check_buffer_access(struct bpf_verifier_env *env,
>  				 const char *buf_info,
>  				 const struct bpf_reg_state *reg,
> @@ -4421,6 +4546,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>  		if (err)
>  			return err;
>  		err = check_map_access(env, regno, off, size, false);
> +		err = err ?: check_map_kptr_access(env, regno, off, size, value_regno, t, insn_idx);
> +		if (err < 0)
> +			return err;
> +		/* if err == 0, check_map_kptr_access ignored the access */
>  		if (!err && t == BPF_READ && value_regno >= 0) {
>  			struct bpf_map *map = reg->map_ptr;
>  
> @@ -4442,6 +4571,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>  				mark_reg_unknown(env, regs, value_regno);
>  			}
>  		}
> +		/* clear err == 1 */
> +		err = err < 0 ? err : 0;
>  	} else if (base_type(reg->type) == PTR_TO_MEM) {
>  		bool rdonly_mem = type_is_rdonly_mem(reg->type);
>  
> -- 
> 2.35.1
> 

-- 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 04/15] bpf: Allow storing referenced kptr in map
  2022-03-17 11:59 ` [PATCH bpf-next v2 04/15] bpf: Allow storing referenced " Kumar Kartikeya Dwivedi
@ 2022-03-19 18:24   ` Alexei Starovoitov
  2022-03-19 18:59     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 18:24 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Thu, Mar 17, 2022 at 05:29:46PM +0530, Kumar Kartikeya Dwivedi wrote:
> Extending the code in the previous commit, introduce referenced kptr
> support, which needs to be tagged using the 'kptr_ref' tag instead.
> Unlike unreferenced kptrs, referenced kptrs have a lot more restrictions.
> In addition to the type matching, only a newly introduced bpf_kptr_xchg
> helper is allowed to modify the map value at that offset. This transfers
> the referenced pointer being stored into the map, releasing the
> reference state for the program, returning the old value, and creating
> new reference state for the returned pointer.
> 
> Similar to the unreferenced pointer case, the return value for this case
> will also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned
> pointer must eventually either be released by calling the corresponding
> release function, or be transferred into another map.
> 
> It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
> the value, and obtain the old value if any.
> 
> BPF_LDX, BPF_STX, and BPF_ST cannot access a referenced kptr. A future
> commit will permit using BPF_LDX for such pointers, while attempting to
> make it safe, since the lifetime of the object won't be guaranteed.
> 
> There are valid reasons to enforce the restriction of permitting only
> bpf_kptr_xchg to operate on a referenced kptr. The pointer value must be
> consistent in the face of concurrent modification, and any prior values
> contained in the map must also be released before a new one is moved
> into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
> returns the old value, which the verifier would require the user to
> either free or move into another map, and releases the reference held
> for the pointer being moved in.
> 
> In the future, direct BPF_XCHG instruction may also be permitted to work
> like bpf_kptr_xchg helper.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h            |   7 ++
>  include/uapi/linux/bpf.h       |  12 +++
>  kernel/bpf/btf.c               |  20 +++-
>  kernel/bpf/helpers.c           |  21 +++++
>  kernel/bpf/verifier.c          | 167 +++++++++++++++++++++++++++++----
>  tools/include/uapi/linux/bpf.h |  12 +++
>  6 files changed, 219 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f35920d279dd..702aa882e4a3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -160,10 +160,15 @@ enum {
>  	BPF_MAP_VALUE_OFF_MAX = 8,
>  };
>  
> +enum {
> +	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> +};
> +
>  struct bpf_map_value_off_desc {
>  	u32 offset;
>  	u32 btf_id;
>  	struct btf *btf;
> +	int flags;
>  };
>  
>  struct bpf_map_value_off {
> @@ -413,6 +418,7 @@ enum bpf_arg_type {
>  	ARG_PTR_TO_STACK,	/* pointer to stack */
>  	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
>  	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
> +	ARG_PTR_TO_KPTR,	/* pointer to kptr */
>  	__BPF_ARG_TYPE_MAX,
>  
>  	/* Extended arg_types. */
> @@ -422,6 +428,7 @@ enum bpf_arg_type {
>  	ARG_PTR_TO_SOCKET_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
>  	ARG_PTR_TO_ALLOC_MEM_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_ALLOC_MEM,
>  	ARG_PTR_TO_STACK_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
> +	ARG_PTR_TO_BTF_ID_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
>  
>  	/* This must be the last entry. Its purpose is to ensure the enum is
>  	 * wide enough to hold the higher bits reserved for bpf_type_flag.
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 99fab54ae9c0..d45568746e79 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5129,6 +5129,17 @@ union bpf_attr {
>   *		The **hash_algo** is returned on success,
>   *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
>   *		invalid arguments are passed.
> + *
> + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> + *	Description
> + *		Exchange kptr at pointer *map_value* with *ptr*, and return the
> + *		old value. *ptr* can be NULL, otherwise it must be a referenced
> + *		pointer which will be released when this helper is called.
> + *	Return
> + *		The old value of kptr (which can be NULL). The returned pointer
> + *		if not NULL, is a reference which must be released using its
> + *		corresponding release function, or moved into a BPF map before
> + *		program exit.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -5325,6 +5336,7 @@ union bpf_attr {
>  	FN(copy_from_user_task),	\
>  	FN(skb_set_tstamp),		\
>  	FN(ima_file_hash),		\
> +	FN(kptr_xchg),			\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 9ac9364ef533..7b4179667bf1 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3175,6 +3175,7 @@ enum {
>  struct btf_field_info {
>  	const struct btf_type *type;
>  	u32 off;
> +	int flags;
>  };
>  
>  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> @@ -3196,7 +3197,8 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
>  			       u32 off, int sz, struct btf_field_info *info,
>  			       int info_cnt, int idx)
>  {
> -	bool kptr_tag = false;
> +	bool kptr_tag = false, kptr_ref_tag = false;
> +	int tags;
>  
>  	/* For PTR, sz is always == 8 */
>  	if (!btf_type_is_ptr(t))
> @@ -3209,12 +3211,21 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
>  			if (kptr_tag)
>  				return -EEXIST;
>  			kptr_tag = true;
> +		} else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off))) {
> +			/* repeated tag */
> +			if (kptr_ref_tag)
> +				return -EEXIST;
> +			kptr_ref_tag = true;
>  		}
>  		/* Look for next tag */
>  		t = btf_type_by_id(btf, t->type);
>  	}
> -	if (!kptr_tag)
> +
> +	tags = kptr_tag + kptr_ref_tag;
> +	if (!tags)
>  		return BTF_FIELD_IGNORE;
> +	else if (tags > 1)
> +		return -EINVAL;
>  
>  	/* Get the base type */
>  	if (btf_type_is_modifier(t))
> @@ -3225,6 +3236,10 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
>  
>  	if (idx >= info_cnt)
>  		return -E2BIG;
> +	if (kptr_ref_tag)
> +		info[idx].flags = BPF_MAP_VALUE_OFF_F_REF;
> +	else
> +		info[idx].flags = 0;
>  	info[idx].type = t;
>  	info[idx].off = off;
>  	return BTF_FIELD_FOUND;
> @@ -3402,6 +3417,7 @@ struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
>  		tab->off[i].offset = info_arr[i].off;
>  		tab->off[i].btf_id = id;
>  		tab->off[i].btf = off_btf;
> +		tab->off[i].flags = info_arr[i].flags;
>  		tab->nr_off = i + 1;
>  	}
>  	return tab;
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 315053ef6a75..cb717bfbda19 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1374,6 +1374,25 @@ void bpf_timer_cancel_and_free(void *val)
>  	kfree(t);
>  }
>  
> +BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
> +{
> +	unsigned long *kptr = map_value;
> +
> +	return xchg(kptr, (unsigned long)ptr);
> +}
> +
> +static u32 bpf_kptr_xchg_btf_id;
> +
> +const struct bpf_func_proto bpf_kptr_xchg_proto = {
> +	.func        = bpf_kptr_xchg,
> +	.gpl_only    = false,
> +	.ret_type    = RET_PTR_TO_BTF_ID_OR_NULL,
> +	.ret_btf_id  = &bpf_kptr_xchg_btf_id,
> +	.arg1_type   = ARG_PTR_TO_KPTR,
> +	.arg2_type   = ARG_PTR_TO_BTF_ID_OR_NULL,
> +	.arg2_btf_id = &bpf_kptr_xchg_btf_id,
> +};
> +
>  const struct bpf_func_proto bpf_get_current_task_proto __weak;
>  const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
>  const struct bpf_func_proto bpf_probe_read_user_proto __weak;
> @@ -1452,6 +1471,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>  		return &bpf_timer_start_proto;
>  	case BPF_FUNC_timer_cancel:
>  		return &bpf_timer_cancel_proto;
> +	case BPF_FUNC_kptr_xchg:
> +		return &bpf_kptr_xchg_proto;
>  	default:
>  		break;
>  	}
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 881d1381757b..f8738054aa52 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -257,6 +257,7 @@ struct bpf_call_arg_meta {
>  	struct btf *ret_btf;
>  	u32 ret_btf_id;
>  	u32 subprogno;
> +	struct bpf_map_value_off_desc *kptr_off_desc;
>  };
>  
>  struct btf *btf_vmlinux;
> @@ -479,7 +480,8 @@ static bool is_release_function(enum bpf_func_id func_id)
>  {
>  	return func_id == BPF_FUNC_sk_release ||
>  	       func_id == BPF_FUNC_ringbuf_submit ||
> -	       func_id == BPF_FUNC_ringbuf_discard;
> +	       func_id == BPF_FUNC_ringbuf_discard ||
> +	       func_id == BPF_FUNC_kptr_xchg;
>  }
>  
>  static bool may_be_acquire_function(enum bpf_func_id func_id)
> @@ -488,7 +490,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id)
>  		func_id == BPF_FUNC_sk_lookup_udp ||
>  		func_id == BPF_FUNC_skc_lookup_tcp ||
>  		func_id == BPF_FUNC_map_lookup_elem ||
> -	        func_id == BPF_FUNC_ringbuf_reserve;
> +		func_id == BPF_FUNC_ringbuf_reserve ||
> +		func_id == BPF_FUNC_kptr_xchg;
>  }
>  
>  static bool is_acquire_function(enum bpf_func_id func_id,
> @@ -499,7 +502,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
>  	if (func_id == BPF_FUNC_sk_lookup_tcp ||
>  	    func_id == BPF_FUNC_sk_lookup_udp ||
>  	    func_id == BPF_FUNC_skc_lookup_tcp ||
> -	    func_id == BPF_FUNC_ringbuf_reserve)
> +	    func_id == BPF_FUNC_ringbuf_reserve ||
> +	    func_id == BPF_FUNC_kptr_xchg)
>  		return true;
>  
>  	if (func_id == BPF_FUNC_map_lookup_elem &&
> @@ -3509,10 +3513,12 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
>  
>  static int map_kptr_match_type(struct bpf_verifier_env *env,
>  			       struct bpf_map_value_off_desc *off_desc,
> -			       struct bpf_reg_state *reg, u32 regno)
> +			       struct bpf_reg_state *reg, u32 regno,
> +			       bool ref_ptr)
>  {
>  	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
>  	const char *reg_name = "";
> +	bool fixed_off_ok = true;
>  
>  	if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
>  		goto bad_type;
> @@ -3524,7 +3530,26 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
>  	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
>  	reg_name = kernel_type_name(reg->btf, reg->btf_id);
>  
> -	if (__check_ptr_off_reg(env, reg, regno, true))
> +	if (ref_ptr) {
> +		if (!reg->ref_obj_id) {
> +			verbose(env, "R%d must be referenced %s%s\n", regno,
> +				reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> +			return -EACCES;
> +		}
> +		/* reg->off can be used to store pointer to a certain type formed by
> +		 * incrementing pointer of a parent structure the object is embedded in,
> +		 * e.g. map may expect unreferenced struct path *, and user should be
> +		 * allowed a store using &file->f_path. However, in the case of
> +		 * referenced pointer, we cannot do this, because the reference is only
> +		 * for the parent structure, not its embedded object(s), and because
> +		 * the transfer of ownership happens for the original pointer to and
> +		 * from the map (before its eventual release).
> +		 */
> +		if (reg->off)
> +			fixed_off_ok = false;
> +	}
> +	/* var_off is rejected by __check_ptr_off_reg for PTR_TO_BTF_ID */
> +	if (__check_ptr_off_reg(env, reg, regno, fixed_off_ok))
>  		return -EACCES;
>  
>  	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> @@ -3550,6 +3575,7 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
>  	struct bpf_map_value_off_desc *off_desc;
>  	int insn_class = BPF_CLASS(insn->code);
>  	struct bpf_map *map = reg->map_ptr;
> +	bool ref_ptr = false;
>  
>  	/* Things we already checked for in check_map_access:
>  	 *  - Reject cases where variable offset may touch BTF ID pointer
> @@ -3574,9 +3600,15 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
>  		return -EPERM;
>  	}
>  
> +	ref_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_REF;
> +
>  	if (insn_class == BPF_LDX) {
>  		if (WARN_ON_ONCE(value_regno < 0))
>  			return -EFAULT;
> +		if (ref_ptr) {
> +			verbose(env, "accessing referenced kptr disallowed\n");
> +			return -EACCES;
> +		}

Please do this check once instead of copy-pasting it 3 times.
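
e.g. hoist it above the insn_class handling (a sketch; adjust the message
for the non-LDX/STX/ST fallthrough if needed):

	ref_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_REF;
	if (ref_ptr) {
		verbose(env, "accessing referenced kptr disallowed\n");
		return -EACCES;
	}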

>  		val_reg = reg_state(env, value_regno);
>  		/* We can simply mark the value_regno receiving the pointer
>  		 * value from map as PTR_TO_BTF_ID, with the correct type.
> @@ -3587,11 +3619,19 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
>  	} else if (insn_class == BPF_STX) {
>  		if (WARN_ON_ONCE(value_regno < 0))
>  			return -EFAULT;
> +		if (ref_ptr) {
> +			verbose(env, "accessing referenced kptr disallowed\n");
> +			return -EACCES;
> +		}
>  		val_reg = reg_state(env, value_regno);
>  		if (!register_is_null(val_reg) &&
> -		    map_kptr_match_type(env, off_desc, val_reg, value_regno))
> +		    map_kptr_match_type(env, off_desc, val_reg, value_regno, false))
>  			return -EACCES;
>  	} else if (insn_class == BPF_ST) {
> +		if (ref_ptr) {
> +			verbose(env, "accessing referenced kptr disallowed\n");
> +			return -EACCES;
> +		}
>  		if (insn->imm) {
>  			verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
>  				off_desc->offset);
> @@ -5265,6 +5305,63 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
>  	return 0;
>  }
>  
> +static int process_kptr_func(struct bpf_verifier_env *env, int regno,
> +			     struct bpf_call_arg_meta *meta)
> +{
> +	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> +	struct bpf_map_value_off_desc *off_desc;
> +	struct bpf_map *map_ptr = reg->map_ptr;
> +	u32 kptr_off;
> +	int ret;
> +
> +	if (!env->bpf_capable) {
> +		verbose(env, "kptr access only allowed for CAP_BPF and CAP_SYS_ADMIN\n");
> +		return -EPERM;
> +	}

another check? pls drop.

> +	if (!tnum_is_const(reg->var_off)) {
> +		verbose(env,
> +			"R%d doesn't have constant offset. kptr has to be at the constant offset\n",
> +			regno);
> +		return -EINVAL;
> +	}
> +	if (!map_ptr->btf) {
> +		verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
> +			map_ptr->name);
> +		return -EINVAL;
> +	}
> +	if (!map_value_has_kptr(map_ptr)) {
> +		ret = PTR_ERR(map_ptr->kptr_off_tab);
> +		if (ret == -E2BIG)
> +			verbose(env, "map '%s' has more than %d kptr\n", map_ptr->name,
> +				BPF_MAP_VALUE_OFF_MAX);
> +		else if (ret == -EEXIST)
> +			verbose(env, "map '%s' has repeating kptr BTF tags\n", map_ptr->name);
> +		else
> +			verbose(env, "map '%s' has no valid kptr\n", map_ptr->name);
> +		return -EINVAL;
> +	}
> +
> +	meta->map_ptr = map_ptr;
> +	/* Check access for BPF_WRITE */
> +	meta->raw_mode = true;
> +	ret = check_helper_mem_access(env, regno, sizeof(u64), false, meta);
> +	if (ret < 0)
> +		return ret;
> +
> +	kptr_off = reg->off + reg->var_off.value;
> +	off_desc = bpf_map_kptr_off_contains(map_ptr, kptr_off);
> +	if (!off_desc) {
> +		verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
> +		return -EACCES;
> +	}
> +	if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
> +		verbose(env, "off=%d kptr isn't referenced kptr\n", kptr_off);
> +		return -EACCES;
> +	}
> +	meta->kptr_off_desc = off_desc;
> +	return 0;
> +}
> +
>  static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
>  {
>  	return base_type(type) == ARG_PTR_TO_MEM ||
> @@ -5400,6 +5497,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
>  static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
>  static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
>  static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
> +static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } };
>  
>  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
>  	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
> @@ -5427,11 +5525,13 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
>  	[ARG_PTR_TO_STACK]		= &stack_ptr_types,
>  	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
>  	[ARG_PTR_TO_TIMER]		= &timer_types,
> +	[ARG_PTR_TO_KPTR]		= &kptr_types,
>  };
>  
>  static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
>  			  enum bpf_arg_type arg_type,
> -			  const u32 *arg_btf_id)
> +			  const u32 *arg_btf_id,
> +			  struct bpf_call_arg_meta *meta)
>  {
>  	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
>  	enum bpf_reg_type expected, type = reg->type;
> @@ -5484,8 +5584,15 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
>  			arg_btf_id = compatible->btf_id;
>  		}
>  
> -		if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> -					  btf_vmlinux, *arg_btf_id)) {
> +		if (meta->func_id == BPF_FUNC_kptr_xchg) {
> +			if (!meta->kptr_off_desc) {
> +				verbose(env, "verifier internal error: meta.kptr_off_desc unset\n");
> +				return -EFAULT;
> +			}

please audit all patches and remove all instances of defensive programming.

> +			if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno, true))
> +				return -EACCES;
> +		} else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> +						 btf_vmlinux, *arg_btf_id)) {
>  			verbose(env, "R%d is of type %s but %s is expected\n",
>  				regno, kernel_type_name(reg->btf, reg->btf_id),
>  				kernel_type_name(btf_vmlinux, *arg_btf_id));
> @@ -5595,7 +5702,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>  		 */
>  		goto skip_type_check;
>  
> -	err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg]);
> +	err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg], meta);
>  	if (err)
>  		return err;
>  
> @@ -5760,6 +5867,14 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>  			verbose(env, "string is not zero-terminated\n");
>  			return -EINVAL;
>  		}
> +	} else if (arg_type == ARG_PTR_TO_KPTR) {
> +		if (meta->func_id == BPF_FUNC_kptr_xchg) {
> +			if (process_kptr_func(env, regno, meta))
> +				return -EACCES;
> +		} else {
> +			verbose(env, "verifier internal error\n");
> +			return -EFAULT;

remove.

> +		}
>  	}
>  
>  	return err;
> @@ -6102,10 +6217,10 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
>  	int i;
>  
>  	for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> -		if (fn->arg_type[i] == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
> +		if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
>  			return false;
>  
> -		if (fn->arg_type[i] != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
> +		if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
>  			return false;
>  	}
>  
> @@ -6830,7 +6945,15 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  	}
>  
>  	if (is_release_function(func_id)) {
> -		err = release_reference(env, meta.ref_obj_id);
> +		err = -EINVAL;
> +		if (meta.ref_obj_id)
> +			err = release_reference(env, meta.ref_obj_id);
> +		/* Only bpf_kptr_xchg is a release function that accepts a
> +		 * possibly NULL reg, hence meta.ref_obj_id can only be unset
> +		 * for it.

Could you rephrase the comment? I'm not following what it's trying to convey.
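
Perhaps something along these lines, assuming this is the intended meaning:

	/* bpf_kptr_xchg is the only release helper whose argument may be
	 * NULL, in which case there is no reference to release and
	 * meta.ref_obj_id stays 0.
	 */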

> +		 */
> +		else if (func_id == BPF_FUNC_kptr_xchg)
> +			err = 0;
>  		if (err) {
>  			verbose(env, "func %s#%d reference has not been acquired before\n",
>  				func_id_name(func_id), func_id);
> @@ -6963,21 +7086,29 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  			regs[BPF_REG_0].btf_id = meta.ret_btf_id;
>  		}
>  	} else if (base_type(ret_type) == RET_PTR_TO_BTF_ID) {
> +		struct btf *ret_btf;
>  		int ret_btf_id;
>  
>  		mark_reg_known_zero(env, regs, BPF_REG_0);
>  		regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
> -		ret_btf_id = *fn->ret_btf_id;
> +		if (func_id == BPF_FUNC_kptr_xchg) {
> +			if (!meta.kptr_off_desc) {
> +				verbose(env, "verifier internal error: meta.kptr_off_desc unset\n");
> +				return -EFAULT;

remove.

> +			}
> +			ret_btf = meta.kptr_off_desc->btf;
> +			ret_btf_id = meta.kptr_off_desc->btf_id;
> +		} else {
> +			ret_btf = btf_vmlinux;
> +			ret_btf_id = *fn->ret_btf_id;
> +		}
>  		if (ret_btf_id == 0) {
>  			verbose(env, "invalid return type %u of func %s#%d\n",
>  				base_type(ret_type), func_id_name(func_id),
>  				func_id);
>  			return -EINVAL;
>  		}
> -		/* current BPF helper definitions are only coming from
> -		 * built-in code with type IDs from  vmlinux BTF
> -		 */
> -		regs[BPF_REG_0].btf = btf_vmlinux;
> +		regs[BPF_REG_0].btf = ret_btf;
>  		regs[BPF_REG_0].btf_id = ret_btf_id;
>  	} else {
>  		verbose(env, "unknown return type %u of func %s#%d\n",
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 99fab54ae9c0..d45568746e79 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -5129,6 +5129,17 @@ union bpf_attr {
>   *		The **hash_algo** is returned on success,
>   *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
>   *		invalid arguments are passed.
> + *
> + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> + *	Description
> + *		Exchange kptr at pointer *map_value* with *ptr*, and return the
> + *		old value. *ptr* can be NULL, otherwise it must be a referenced
> + *		pointer which will be released when this helper is called.
> + *	Return
> + *		The old value of kptr (which can be NULL). The returned pointer,
> + *		if not NULL, is a reference which must be released using its
> + *		corresponding release function, or moved into a BPF map before
> + *		program exit.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -5325,6 +5336,7 @@ union bpf_attr {
>  	FN(copy_from_user_task),	\
>  	FN(skb_set_tstamp),		\
>  	FN(ima_file_hash),		\
> +	FN(kptr_xchg),			\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> -- 
> 2.35.1
> 

-- 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 06/15] bpf: Allow storing user kptr in map
  2022-03-17 11:59 ` [PATCH bpf-next v2 06/15] bpf: Allow storing user " Kumar Kartikeya Dwivedi
@ 2022-03-19 18:28   ` Alexei Starovoitov
  2022-03-19 19:02     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 18:28 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Thu, Mar 17, 2022 at 05:29:48PM +0530, Kumar Kartikeya Dwivedi wrote:
> Recently, verifier gained __user annotation support [0] where it
> prevents BPF program from normally derefering user memory pointer in the
> kernel, and instead requires use of bpf_probe_read_user. We can allow
> the user to also store these pointers in BPF maps, with the logic that
> whenever user loads it from the BPF map, it gets marked as MEM_USER. The
> tag 'kptr_user' is used to tag such pointers.
> 
>   [0]: https://lore.kernel.org/bpf/20220127154555.650886-1-yhs@fb.com
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h   |  1 +
>  kernel/bpf/btf.c      | 13 ++++++++++---
>  kernel/bpf/verifier.c | 15 ++++++++++++---
>  3 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 433f5cb161cf..989f47334215 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -163,6 +163,7 @@ enum {
>  enum {
>  	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
>  	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),
> +	BPF_MAP_VALUE_OFF_F_USER   = (1U << 2),
...
> +		} else if (!strcmp("kptr_user", __btf_name_by_offset(btf, t->name_off))) {

I don't see a use case where __user pointer would need to be stored into a map.
That pointer is valid in the user task context.
When bpf prog has such pointer it can read user mem through it,
but storing it for later makes little sense. The user context will certainly change.
Reading it later from the map is more or less reading random number.
Lets drop this patch until real use case arises.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 05/15] bpf: Allow storing percpu kptr in map
  2022-03-17 11:59 ` [PATCH bpf-next v2 05/15] bpf: Allow storing percpu " Kumar Kartikeya Dwivedi
@ 2022-03-19 18:30   ` Alexei Starovoitov
  2022-03-19 19:04     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 18:30 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Hao Luo, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

On Thu, Mar 17, 2022 at 05:29:47PM +0530, Kumar Kartikeya Dwivedi wrote:
> Make adjustments to the code to allow storing percpu PTR_TO_BTF_ID in a
> map. Similar to the 'kptr_ref' tag, a new 'kptr_percpu' tag allows tagging
> types of pointers accepting stores of such register types. On load, the
> verifier marks the destination register as having type PTR_TO_BTF_ID | MEM_PERCPU |
> PTR_MAYBE_NULL.
> 
> Cc: Hao Luo <haoluo@google.com>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h   |  3 ++-
>  kernel/bpf/btf.c      | 13 ++++++++++---
>  kernel/bpf/verifier.c | 26 +++++++++++++++++++++-----
>  3 files changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 702aa882e4a3..433f5cb161cf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -161,7 +161,8 @@ enum {
>  };
>  
>  enum {
> -	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> +	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
> +	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),

What is the use case for storing __percpu pointer into a map?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 08/15] bpf: Adapt copy_map_value for multiple offset case
  2022-03-17 11:59 ` [PATCH bpf-next v2 08/15] bpf: Adapt copy_map_value for multiple offset case Kumar Kartikeya Dwivedi
@ 2022-03-19 18:34   ` Alexei Starovoitov
  2022-03-19 19:06     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 18:34 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Thu, Mar 17, 2022 at 05:29:50PM +0530, Kumar Kartikeya Dwivedi wrote:
> Since there may now be up to 10 offsets that need handling in
> copy_map_value, the manual shuffling and special casing is no longer going
> to work. Hence, let's generalise the copy_map_value function by using
> a sorted array of offsets to skip regions that must be avoided while
> copying into and out of a map value.
> 
> When the map is created, we populate the offset array in struct map,
> with one extra element for map->value_size, which is used as the final
> offset to subtract the previous offset from. Since there can only be three
> sizes, we can avoid recording the size in the struct map, and only store
> sorted offsets. Later we can determine the size for each offset by
> comparing it to timer_off and spin_lock_off, otherwise it must be
> sizeof(u64) for kptr.
> 
> Then, copy_map_value uses this sorted offset array to memcpy while
> skipping the timer, spin lock, and kptr regions.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h  | 59 +++++++++++++++++++++++++-------------------
>  kernel/bpf/syscall.c | 47 +++++++++++++++++++++++++++++++++++
>  2 files changed, 80 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8ac3070aa5e6..f0f1e0d3bb2e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -158,6 +158,10 @@ struct bpf_map_ops {
>  enum {
>  	/* Support at most 8 pointers in a BPF map value */
>  	BPF_MAP_VALUE_OFF_MAX = 8,
> +	BPF_MAP_OFF_ARR_MAX   = BPF_MAP_VALUE_OFF_MAX +
> +				1 + /* for bpf_spin_lock */
> +				1 + /* for bpf_timer */
> +				1,  /* for map->value_size sentinel */
>  };
>  
>  enum {
> @@ -208,7 +212,12 @@ struct bpf_map {
>  	char name[BPF_OBJ_NAME_LEN];
>  	bool bypass_spec_v1;
>  	bool frozen; /* write-once; write-protected by freeze_mutex */
> -	/* 6 bytes hole */
> +	/* 2 bytes hole */
> +	struct {
> +		u32 off[BPF_MAP_OFF_ARR_MAX];
> +		u32 cnt;
> +	} off_arr;
> +	/* 20 bytes hole */
>  
>  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
>  	 * particularly with refcounting.
> @@ -252,36 +261,34 @@ static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>  		memset(dst + map->spin_lock_off, 0, sizeof(struct bpf_spin_lock));
>  	if (unlikely(map_value_has_timer(map)))
>  		memset(dst + map->timer_off, 0, sizeof(struct bpf_timer));
> +	if (unlikely(map_value_has_kptr(map))) {
> +		struct bpf_map_value_off *tab = map->kptr_off_tab;
> +		int i;
> +
> +		for (i = 0; i < tab->nr_off; i++)
> +			*(u64 *)(dst + tab->off[i].offset) = 0;
> +	}
>  }
>  
>  /* copy everything but bpf_spin_lock and bpf_timer. There could be one of each. */
>  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>  {
> -	u32 s_off = 0, s_sz = 0, t_off = 0, t_sz = 0;
> -
> -	if (unlikely(map_value_has_spin_lock(map))) {
> -		s_off = map->spin_lock_off;
> -		s_sz = sizeof(struct bpf_spin_lock);
> -	}
> -	if (unlikely(map_value_has_timer(map))) {
> -		t_off = map->timer_off;
> -		t_sz = sizeof(struct bpf_timer);
> -	}
> -
> -	if (unlikely(s_sz || t_sz)) {
> -		if (s_off < t_off || !s_sz) {
> -			swap(s_off, t_off);
> -			swap(s_sz, t_sz);
> -		}
> -		memcpy(dst, src, t_off);
> -		memcpy(dst + t_off + t_sz,
> -		       src + t_off + t_sz,
> -		       s_off - t_off - t_sz);
> -		memcpy(dst + s_off + s_sz,
> -		       src + s_off + s_sz,
> -		       map->value_size - s_off - s_sz);
> -	} else {
> -		memcpy(dst, src, map->value_size);
> +	int i;
> +
> +	memcpy(dst, src, map->off_arr.off[0]);
> +	for (i = 1; i < map->off_arr.cnt; i++) {
> +		u32 curr_off = map->off_arr.off[i - 1];
> +		u32 next_off = map->off_arr.off[i];
> +		u32 curr_sz;
> +
> +		if (map_value_has_spin_lock(map) && map->spin_lock_off == curr_off)
> +			curr_sz = sizeof(struct bpf_spin_lock);
> +		else if (map_value_has_timer(map) && map->timer_off == curr_off)
> +			curr_sz = sizeof(struct bpf_timer);
> +		else
> +			curr_sz = sizeof(u64);

Let's store the size in off_arr as well.
The memory consumption of a few u8-s is worth it.
A single load is faster than two if-s and a bunch of loads.
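
Roughly (untested sketch, the 'sz' member is illustrative):

	struct {
		u32 off[BPF_MAP_OFF_ARR_MAX];
		u8  sz[BPF_MAP_OFF_ARR_MAX];	/* size of the region to skip at off[i] */
		u32 cnt;
	} off_arr;

	memcpy(dst, src, map->off_arr.off[0]);
	for (i = 1; i < map->off_arr.cnt; i++) {
		u32 curr_off = map->off_arr.off[i - 1] + map->off_arr.sz[i - 1];

		memcpy(dst + curr_off, src + curr_off, map->off_arr.off[i] - curr_off);
	}

Then copy_map_value() doesn't need to re-derive the size from
spin_lock_off/timer_off at all.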

> +		curr_off += curr_sz;
> +		memcpy(dst + curr_off, src + curr_off, next_off - curr_off);
>  	}
>  }
>  void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 87263b07f40b..69e8ea1be432 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -30,6 +30,7 @@
>  #include <linux/pgtable.h>
>  #include <linux/bpf_lsm.h>
>  #include <linux/poll.h>
> +#include <linux/sort.h>
>  #include <linux/bpf-netns.h>
>  #include <linux/rcupdate_trace.h>
>  #include <linux/memcontrol.h>
> @@ -850,6 +851,50 @@ int map_check_no_btf(const struct bpf_map *map,
>  	return -ENOTSUPP;
>  }
>  
> +static int map_off_arr_cmp(const void *_a, const void *_b)
> +{
> +	const u32 a = *(const u32 *)_a;
> +	const u32 b = *(const u32 *)_b;
> +
> +	if (a < b)
> +		return -1;
> +	else if (a > b)
> +		return 1;
> +	return 0;
> +}
> +
> +static void map_populate_off_arr(struct bpf_map *map)
> +{
> +	u32 i;
> +
> +	map->off_arr.cnt = 0;
> +	if (map_value_has_spin_lock(map)) {
> +		i = map->off_arr.cnt;
> +
> +		map->off_arr.off[i] = map->spin_lock_off;
> +		map->off_arr.cnt++;
> +	}
> +	if (map_value_has_timer(map)) {
> +		i = map->off_arr.cnt;
> +
> +		map->off_arr.off[i] = map->timer_off;
> +		map->off_arr.cnt++;
> +	}
> +	if (map_value_has_kptr(map)) {
> +		struct bpf_map_value_off *tab = map->kptr_off_tab;
> +		u32 j = map->off_arr.cnt;
> +
> +		for (i = 0; i < tab->nr_off; i++)
> +			map->off_arr.off[j + i] = tab->off[i].offset;
> +		map->off_arr.cnt += tab->nr_off;
> +	}
> +
> +	map->off_arr.off[map->off_arr.cnt++] = map->value_size;
> +	if (map->off_arr.cnt == 1)
> +		return;
> +	sort(map->off_arr.off, map->off_arr.cnt, sizeof(map->off_arr.off[0]), map_off_arr_cmp, NULL);
> +}
> +
>  static int map_check_btf(struct bpf_map *map, const struct btf *btf,
>  			 u32 btf_key_id, u32 btf_value_id)
>  {
> @@ -1015,6 +1060,8 @@ static int map_create(union bpf_attr *attr)
>  			attr->btf_vmlinux_value_type_id;
>  	}
>  
> +	map_populate_off_arr(map);
> +
>  	err = security_bpf_map_alloc(map);
>  	if (err)
>  		goto free_map;
> -- 
> 2.35.1
> 

-- 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 09/15] bpf: Always raise reference in btf_get_module_btf
  2022-03-17 11:59 ` [PATCH bpf-next v2 09/15] bpf: Always raise reference in btf_get_module_btf Kumar Kartikeya Dwivedi
@ 2022-03-19 18:43   ` Alexei Starovoitov
  0 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 18:43 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Thu, Mar 17, 2022 at 05:29:51PM +0530, Kumar Kartikeya Dwivedi wrote:
> Align it with helpers like bpf_find_btf_id, so all functions returning
> BTF in an out parameter follow the same rule of raising a reference
> consistently, regardless of module or vmlinux BTF.
> 
> Adjust existing callers to handle the change accordingly.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Applied this and 1st patches.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps
  2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
                   ` (14 preceding siblings ...)
  2022-03-17 11:59 ` [PATCH bpf-next v2 15/15] selftests/bpf: Add verifier " Kumar Kartikeya Dwivedi
@ 2022-03-19 18:50 ` patchwork-bot+netdevbpf
  15 siblings, 0 replies; 42+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-03-19 18:50 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf, ast, andrii, daniel, toke, brouer

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Thu, 17 Mar 2022 17:29:42 +0530 you wrote:
> This set enables storing pointers of a certain type in BPF map, and extends the
> verifier to enforce type safety and lifetime correctness properties.
> 
> The infrastructure being added is generic enough for allowing storing any kind
> of pointers whose type is available using BTF (user or kernel) in the future
> (e.g. strongly typed memory allocation in BPF program), which are internally
> tracked in the verifier as PTR_TO_BTF_ID, but for now the series limits them to
> four kinds of pointers obtained from the kernel.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v2,01/15] bpf: Factor out fd returning from bpf_btf_find_by_name_kind
    https://git.kernel.org/bpf/bpf-next/c/edc3ec09ab70
  - [bpf-next,v2,02/15] bpf: Make btf_find_field more generic
    (no matching commit)
  - [bpf-next,v2,03/15] bpf: Allow storing unreferenced kptr in map
    (no matching commit)
  - [bpf-next,v2,04/15] bpf: Allow storing referenced kptr in map
    (no matching commit)
  - [bpf-next,v2,05/15] bpf: Allow storing percpu kptr in map
    (no matching commit)
  - [bpf-next,v2,06/15] bpf: Allow storing user kptr in map
    (no matching commit)
  - [bpf-next,v2,07/15] bpf: Prevent escaping of kptr loaded from maps
    (no matching commit)
  - [bpf-next,v2,08/15] bpf: Adapt copy_map_value for multiple offset case
    (no matching commit)
  - [bpf-next,v2,09/15] bpf: Always raise reference in btf_get_module_btf
    https://git.kernel.org/bpf/bpf-next/c/9492450fd287
  - [bpf-next,v2,10/15] bpf: Populate pairs of btf_id and destructor kfunc in btf
    (no matching commit)
  - [bpf-next,v2,11/15] bpf: Wire up freeing of referenced kptr
    (no matching commit)
  - [bpf-next,v2,12/15] bpf: Teach verifier about kptr_get kfunc helpers
    (no matching commit)
  - [bpf-next,v2,13/15] libbpf: Add kptr type tag macros to bpf_helpers.h
    (no matching commit)
  - [bpf-next,v2,14/15] selftests/bpf: Add C tests for kptr
    (no matching commit)
  - [bpf-next,v2,15/15] selftests/bpf: Add verifier tests for kptr
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map
  2022-03-19 18:15   ` Alexei Starovoitov
@ 2022-03-19 18:52     ` Kumar Kartikeya Dwivedi
  2022-03-19 21:17       ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 18:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sat, Mar 19, 2022 at 11:45:38PM IST, Alexei Starovoitov wrote:
> On Thu, Mar 17, 2022 at 05:29:45PM +0530, Kumar Kartikeya Dwivedi wrote:
> > This commit introduces a new pointer type 'kptr' which can be embedded
> > in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> > its invocation. When storing to such a kptr, the BPF program's PTR_TO_BTF_ID
> > register must have the same type as in the map value's BTF, and loading
> > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > kernel BTF and BTF ID.
> >
> > Such kptrs are unreferenced, i.e. by the time another invocation of the
> > BPF program loads this pointer, the object which the pointer points to
> > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > patched to PROBE_MEM loads by the verifier, it would be safe to allow the
> > user to still access such an invalid pointer, but passing such pointers into
> > BPF helpers and kfuncs should not be permitted. A future patch in this
> > series will close this gap.
> >
> > The flexibility offered by allowing programs to dereference such invalid
> > pointers while being safe at runtime frees the verifier from doing
> > complex lifetime tracking. As long as the user may ensure that the
> > object remains valid, it can ensure data read by it from the kernel
> > object is valid.
> >
> > The user indicates that a certain pointer must be treated as kptr
> > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > information is recorded in the object BTF which will be passed into the
> > kernel by way of map's BTF information. The name and kind from the map
> > value BTF is used to look up the in-kernel type, and the actual BTF and
> > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > now, only storing pointers to structs is permitted.
> >
> > An example of this specification is shown below:
> >
> > 	#define __kptr __attribute__((btf_type_tag("kptr")))
> >
> > 	struct map_value {
> > 		...
> > 		struct task_struct __kptr *task;
> > 		...
> > 	};
> >
> > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > task_struct into the map, and then load it later.
> >
> > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > the verifier cannot know whether the value is NULL or not statically, it
> > must treat all potential loads at that map value offset as loading a
> > possibly NULL pointer.
> >
> > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > same BTF type as specified in the map BTF. The access size must always
> > be BPF_DW.
> >
> > For the map in map support, the kptr_off_tab for outer map is copied
> > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > instead of introducing a refcount to kptr_off_tab, because the copy only
> > needs to be done when parameterizing using inner_map_fd in the map in map
> > case, hence would be unnecessary for all other users.
> >
> > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > having kptr, similar to the bpf_timer case.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h     |  29 +++++-
> >  include/linux/btf.h     |   2 +
> >  kernel/bpf/btf.c        | 151 +++++++++++++++++++++++++----
> >  kernel/bpf/map_in_map.c |   5 +-
> >  kernel/bpf/syscall.c    | 110 ++++++++++++++++++++-
> >  kernel/bpf/verifier.c   | 207 ++++++++++++++++++++++++++++++++--------
> >  6 files changed, 442 insertions(+), 62 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 88449fbbe063..f35920d279dd 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -155,6 +155,22 @@ struct bpf_map_ops {
> >  	const struct bpf_iter_seq_info *iter_seq_info;
> >  };
> >
> > +enum {
> > +	/* Support at most 8 pointers in a BPF map value */
> > +	BPF_MAP_VALUE_OFF_MAX = 8,
> > +};
> > +
> > +struct bpf_map_value_off_desc {
> > +	u32 offset;
> > +	u32 btf_id;
> > +	struct btf *btf;
> > +};
> > +
> > +struct bpf_map_value_off {
> > +	u32 nr_off;
> > +	struct bpf_map_value_off_desc off[];
> > +};
> > +
> >  struct bpf_map {
> >  	/* The first two cachelines with read-mostly members of which some
> >  	 * are also accessed in fast-path (e.g. ops, max_entries).
> > @@ -171,6 +187,7 @@ struct bpf_map {
> >  	u64 map_extra; /* any per-map-type extra fields */
> >  	u32 map_flags;
> >  	int spin_lock_off; /* >=0 valid offset, <0 error */
> > +	struct bpf_map_value_off *kptr_off_tab;
> >  	int timer_off; /* >=0 valid offset, <0 error */
> >  	u32 id;
> >  	int numa_node;
> > @@ -184,7 +201,7 @@ struct bpf_map {
> >  	char name[BPF_OBJ_NAME_LEN];
> >  	bool bypass_spec_v1;
> >  	bool frozen; /* write-once; write-protected by freeze_mutex */
> > -	/* 14 bytes hole */
> > +	/* 6 bytes hole */
> >
> >  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
> >  	 * particularly with refcounting.
> > @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
> >  	return map->timer_off >= 0;
> >  }
> >
> > +static inline bool map_value_has_kptr(const struct bpf_map *map)
> > +{
> > +	return !IS_ERR_OR_NULL(map->kptr_off_tab);
> > +}
> > +
> >  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> >  {
> >  	if (unlikely(map_value_has_spin_lock(map)))
> > @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
> >  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
> >  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
> >
> > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> > +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > +
> >  struct bpf_map *bpf_map_get(u32 ufd);
> >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> >  struct bpf_map *__bpf_map_get(struct fd f);
> > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > index 36bc09b8e890..5b578dc81c04 100644
> > --- a/include/linux/btf.h
> > +++ b/include/linux/btf.h
> > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> >  			   u32 expected_offset, u32 expected_size);
> >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > +					const struct btf_type *t);
> >  bool btf_type_is_void(const struct btf_type *t);
> >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 5b2824332880..9ac9364ef533 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -3164,33 +3164,79 @@ static void btf_struct_log(struct btf_verifier_env *env,
> >  enum {
> >  	BTF_FIELD_SPIN_LOCK,
> >  	BTF_FIELD_TIMER,
> > +	BTF_FIELD_KPTR,
> > +};
> > +
> > +enum {
> > +	BTF_FIELD_IGNORE = 0,
> > +	BTF_FIELD_FOUND  = 1,
> >  };
> >
> >  struct btf_field_info {
> > +	const struct btf_type *type;
> >  	u32 off;
> >  };
> >
> >  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > -				 u32 off, int sz, struct btf_field_info *info)
> > +				 u32 off, int sz, struct btf_field_info *info,
> > +				 int info_cnt, int idx)
> >  {
> >  	if (!__btf_type_is_struct(t))
> > -		return 0;
> > +		return BTF_FIELD_IGNORE;
> >  	if (t->size != sz)
> > -		return 0;
> > -	if (info->off != -ENOENT)
> > -		/* only one such field is allowed */
> > +		return BTF_FIELD_IGNORE;
> > +	if (idx >= info_cnt)
>
> No need to pass info_cnt, idx into this function.
> Move idx >= info_cnt check into the caller and let
> caller do 'info++' and pass that.

That was what I did initially, but this check actually needs to happen after we
see that the field is of interest (i.e. not ignored by btf_find_field_*). Doing
it in the caller limits the total number of fields to info_cnt. Moving those
checks out into the caller would be the other option, but I didn't like that.
I can add a comment if it makes things clearer.
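
The ordering inside the helper has to be roughly (sketch, other cases elided):

	if (!__btf_type_is_struct(t))
		return BTF_FIELD_IGNORE;	/* not a field we track, no error */
	if (t->size != sz)
		return BTF_FIELD_IGNORE;
	if (idx >= info_cnt)			/* only a relevant field can overflow */
		return -E2BIG;
	info[idx].off = off;
	return BTF_FIELD_FOUND;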

> This function will simply write into 'info'.
>
> >  		return -E2BIG;
> > +	info[idx].off = off;
> >  	info->off = off;
>
> This can't be right.
>

Ouch, thanks for catching this.

> > -	return 0;
> > +	return BTF_FIELD_FOUND;
> > +}
> > +
> > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > +			       u32 off, int sz, struct btf_field_info *info,
> > +			       int info_cnt, int idx)
> > +{
> > +	bool kptr_tag = false;
> > +
> > +	/* For PTR, sz is always == 8 */
> > +	if (!btf_type_is_ptr(t))
> > +		return BTF_FIELD_IGNORE;
> > +	t = btf_type_by_id(btf, t->type);
> > +
> > +	while (btf_type_is_type_tag(t)) {
> > +		if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off))) {
> > +			/* repeated tag */
> > +			if (kptr_tag)
> > +				return -EEXIST;
> > +			kptr_tag = true;
> > +		}
> > +		/* Look for next tag */
> > +		t = btf_type_by_id(btf, t->type);
> > +	}
>
> There is no need for the while() loop and 4 bool kptr_*_tag checks.
> Just do:
>   if (!btf_type_is_type_tag(t))
>      return BTF_FIELD_IGNORE;
>   /* check next tag */
>   if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
>      return -EINVAL;

But there may also be other tags in the future. Then older kernels would
return an error, instead of skipping over them and ignoring them.

>   if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
>      flag = 0;
>   else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off)))
>     flag = BPF_MAP_VALUE_OFF_F_REF;
>   ...
>
> > +	if (!kptr_tag)
> > +		return BTF_FIELD_IGNORE;
> > +
> > +	/* Get the base type */
> > +	if (btf_type_is_modifier(t))
> > +		t = btf_type_skip_modifiers(btf, t->type, NULL);
> > +	/* Only pointer to struct is allowed */
> > +	if (!__btf_type_is_struct(t))
> > +		return -EINVAL;
> > +
> > +	if (idx >= info_cnt)
> > +		return -E2BIG;
> > +	info[idx].type = t;
> > +	info[idx].off = off;
> > +	return BTF_FIELD_FOUND;
> >  }
> >
> >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> >  				 const char *name, int sz, int align, int field_type,
> > -				 struct btf_field_info *info)
> > +				 struct btf_field_info *info, int info_cnt)
> >  {
> >  	const struct btf_member *member;
> > +	int ret, idx = 0;
> >  	u32 i, off;
> > -	int ret;
> >
> >  	for_each_member(i, t, member) {
> >  		const struct btf_type *member_type = btf_type_by_id(btf,
> > @@ -3210,24 +3256,33 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
> >  		switch (field_type) {
> >  		case BTF_FIELD_SPIN_LOCK:
> >  		case BTF_FIELD_TIMER:
> > -			ret = btf_find_field_struct(btf, member_type, off, sz, info);
> > +			ret = btf_find_field_struct(btf, member_type, off, sz, info, info_cnt, idx);
> > +			if (ret < 0)
> > +				return ret;
> > +			break;
> > +		case BTF_FIELD_KPTR:
> > +			ret = btf_find_field_kptr(btf, member_type, off, sz, info, info_cnt, idx);
> >  			if (ret < 0)
> >  				return ret;
> >  			break;
> >  		default:
> >  			return -EFAULT;
> >  		}
> > +
> > +		if (ret == BTF_FIELD_IGNORE)
> > +			continue;
> > +		++idx;
> >  	}
> > -	return 0;
> > +	return idx;
> >  }
> >
> >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> >  				const char *name, int sz, int align, int field_type,
> > -				struct btf_field_info *info)
> > +				struct btf_field_info *info, int info_cnt)
> >  {
> >  	const struct btf_var_secinfo *vsi;
> > +	int ret, idx = 0;
> >  	u32 i, off;
> > -	int ret;
> >
> >  	for_each_vsi(i, t, vsi) {
> >  		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> > @@ -3245,25 +3300,34 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> >  		switch (field_type) {
> >  		case BTF_FIELD_SPIN_LOCK:
> >  		case BTF_FIELD_TIMER:
> > -			ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > +			ret = btf_find_field_struct(btf, var_type, off, sz, info, info_cnt, idx);
> > +			if (ret < 0)
> > +				return ret;
> > +			break;
> > +		case BTF_FIELD_KPTR:
> > +			ret = btf_find_field_kptr(btf, var_type, off, sz, info, info_cnt, idx);
> >  			if (ret < 0)
> >  				return ret;
> >  			break;
> >  		default:
> >  			return -EFAULT;
> >  		}
> > +
> > +		if (ret == BTF_FIELD_IGNORE)
> > +			continue;
> > +		++idx;
> >  	}
> > -	return 0;
> > +	return idx;
> >  }
> >
> >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >  			  const char *name, int sz, int align, int field_type,
> > -			  struct btf_field_info *info)
> > +			  struct btf_field_info *info, int info_cnt)
> >  {
> >  	if (__btf_type_is_struct(t))
> > -		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> > +		return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt);
> >  	else if (btf_type_is_datasec(t))
> > -		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> > +		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt);
> >  	return -EINVAL;
> >  }
> >
> > @@ -3279,7 +3343,7 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> >  	ret = btf_find_field(btf, t, "bpf_spin_lock",
> >  			     sizeof(struct bpf_spin_lock),
> >  			     __alignof__(struct bpf_spin_lock),
> > -			     BTF_FIELD_SPIN_LOCK, &info);
> > +			     BTF_FIELD_SPIN_LOCK, &info, 1);
> >  	if (ret < 0)
> >  		return ret;
> >  	return info.off;
> > @@ -3293,12 +3357,61 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> >  	ret = btf_find_field(btf, t, "bpf_timer",
> >  			     sizeof(struct bpf_timer),
> >  			     __alignof__(struct bpf_timer),
> > -			     BTF_FIELD_TIMER, &info);
> > +			     BTF_FIELD_TIMER, &info, 1);
> >  	if (ret < 0)
> >  		return ret;
> >  	return info.off;
> >  }
> >
> > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > +					const struct btf_type *t)
> > +{
> > +	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
> > +	struct bpf_map_value_off *tab;
> > +	int ret, i, nr_off;
> > +
> > +	/* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> > +	BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
> > +
> > +	ret = btf_find_field(btf, t, NULL, sizeof(u64), __alignof__(u64),
>
> these pointless args will be gone with the suggestion in the previous patch.
>

Agreed, will change.

> > +			     BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> > +	if (ret < 0)
> > +		return ERR_PTR(ret);
> > +	if (!ret)
> > +		return 0;
> > +
> > +	nr_off = ret;
> > +	tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> > +	if (!tab)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	tab->nr_off = 0;
> > +	for (i = 0; i < nr_off; i++) {
> > +		const struct btf_type *t;
> > +		struct btf *off_btf;
> > +		s32 id;
> > +
> > +		t = info_arr[i].type;
> > +		id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> > +				     &off_btf);
> > +		if (id < 0) {
> > +			ret = id;
> > +			goto end;
> > +		}
> > +
> > +		tab->off[i].offset = info_arr[i].off;
> > +		tab->off[i].btf_id = id;
> > +		tab->off[i].btf = off_btf;
> > +		tab->nr_off = i + 1;
> > +	}
> > +	return tab;
> > +end:
> > +	while (tab->nr_off--)
> > +		btf_put(tab->off[tab->nr_off].btf);
> > +	kfree(tab);
> > +	return ERR_PTR(ret);
> > +}
> > +
> >  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> >  			      u32 type_id, void *data, u8 bits_offset,
> >  			      struct btf_show *show)
> > diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> > index 5cd8f5277279..135205d0d560 100644
> > --- a/kernel/bpf/map_in_map.c
> > +++ b/kernel/bpf/map_in_map.c
> > @@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> >  	inner_map_meta->max_entries = inner_map->max_entries;
> >  	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
> >  	inner_map_meta->timer_off = inner_map->timer_off;
> > +	inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
> >  	if (inner_map->btf) {
> >  		btf_get(inner_map->btf);
> >  		inner_map_meta->btf = inner_map->btf;
> > @@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> >
> >  void bpf_map_meta_free(struct bpf_map *map_meta)
> >  {
> > +	bpf_map_free_kptr_off_tab(map_meta);
> >  	btf_put(map_meta->btf);
> >  	kfree(map_meta);
> >  }
> > @@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
> >  		meta0->key_size == meta1->key_size &&
> >  		meta0->value_size == meta1->value_size &&
> >  		meta0->timer_off == meta1->timer_off &&
> > -		meta0->map_flags == meta1->map_flags;
> > +		meta0->map_flags == meta1->map_flags &&
> > +		bpf_map_equal_kptr_off_tab(meta0, meta1);
> >  }
> >
> >  void *bpf_map_fd_get_ptr(struct bpf_map *map,
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 9beb585be5a6..87263b07f40b 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/bpf_trace.h>
> >  #include <linux/bpf_lirc.h>
> >  #include <linux/bpf_verifier.h>
> > +#include <linux/bsearch.h>
> >  #include <linux/btf.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/slab.h>
> > @@ -472,12 +473,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
> >  }
> >  #endif
> >
> > +static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> > +{
> > +	const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
> > +
> > +	if (off_desc1->offset < off_desc2->offset)
> > +		return -1;
> > +	else if (off_desc1->offset > off_desc2->offset)
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> > +{
> > +	/* Since members are iterated in btf_find_field in increasing order,
> > +	 * offsets appended to kptr_off_tab are in increasing order, so we can
> > +	 * do bsearch to find exact match.
> > +	 */
> > +	struct bpf_map_value_off *tab;
> > +
> > +	if (!map_value_has_kptr(map))
> > +		return NULL;
> > +	tab = map->kptr_off_tab;
> > +	return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
> > +}
> > +
> > +void bpf_map_free_kptr_off_tab(struct bpf_map *map)
> > +{
> > +	struct bpf_map_value_off *tab = map->kptr_off_tab;
> > +	int i;
> > +
> > +	if (!map_value_has_kptr(map))
> > +		return;
> > +	for (i = 0; i < tab->nr_off; i++) {
> > +		struct btf *btf = tab->off[i].btf;
> > +
> > +		btf_put(btf);
> > +	}
> > +	kfree(tab);
> > +	map->kptr_off_tab = NULL;
> > +}
> > +
> > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
> > +{
> > +	struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
> > +	int size, i, ret;
> > +
> > +	if (!map_value_has_kptr(map))
> > +		return ERR_PTR(-ENOENT);
> > +	/* Do a deep copy of the kptr_off_tab */
> > +	for (i = 0; i < tab->nr_off; i++)
> > +		btf_get(tab->off[i].btf);
> > +
> > +	size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
> > +	new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> > +	if (!new_tab) {
> > +		ret = -ENOMEM;
> > +		goto end;
> > +	}
> > +	memcpy(new_tab, tab, size);
> > +	return new_tab;
> > +end:
> > +	while (i--)
> > +		btf_put(tab->off[i].btf);
> > +	return ERR_PTR(ret);
> > +}
> > +
> > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> > +{
> > +	struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> > +	bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
> > +	int size;
> > +
> > +	if (!a_has_kptr && !b_has_kptr)
> > +		return true;
> > +	if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))
> > +		return false;
> > +	if (tab_a->nr_off != tab_b->nr_off)
> > +		return false;
> > +	size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> > +	return !memcmp(tab_a, tab_b, size);
> > +}
> > +
> >  /* called from workqueue */
> >  static void bpf_map_free_deferred(struct work_struct *work)
> >  {
> >  	struct bpf_map *map = container_of(work, struct bpf_map, work);
> >
> >  	security_bpf_map_free(map);
> > +	bpf_map_free_kptr_off_tab(map);
> >  	bpf_map_release_memcg(map);
> >  	/* implementation dependent freeing */
> >  	map->ops->map_free(map);
> > @@ -639,7 +723,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >  	int err;
> >
> >  	if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> > -	    map_value_has_timer(map))
> > +	    map_value_has_timer(map) || map_value_has_kptr(map))
> >  		return -ENOTSUPP;
> >
> >  	if (!(vma->vm_flags & VM_SHARED))
> > @@ -819,9 +903,29 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> >  			return -EOPNOTSUPP;
> >  	}
> >
> > -	if (map->ops->map_check_btf)
> > +	map->kptr_off_tab = btf_find_kptr(btf, value_type);
> > +	if (map_value_has_kptr(map)) {
> > +		if (map->map_flags & BPF_F_RDONLY_PROG) {
> > +			ret = -EACCES;
> > +			goto free_map_tab;
> > +		}
> > +		if (map->map_type != BPF_MAP_TYPE_HASH &&
> > +		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > +		    map->map_type != BPF_MAP_TYPE_ARRAY) {
> > +			ret = -EOPNOTSUPP;
> > +			goto free_map_tab;
> > +		}
> > +	}
> > +
> > +	if (map->ops->map_check_btf) {
> >  		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > +		if (ret < 0)
> > +			goto free_map_tab;
> > +	}
> >
> > +	return ret;
> > +free_map_tab:
> > +	bpf_map_free_kptr_off_tab(map);
> >  	return ret;
> >  }
> >
> > @@ -1638,7 +1742,7 @@ static int map_freeze(const union bpf_attr *attr)
> >  		return PTR_ERR(map);
> >
> >  	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > -	    map_value_has_timer(map)) {
> > +	    map_value_has_timer(map) || map_value_has_kptr(map)) {
> >  		fdput(f);
> >  		return -ENOTSUPP;
> >  	}
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index cf92f9c01556..881d1381757b 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -3469,6 +3469,143 @@ static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno,
> >  	return 0;
> >  }
> >
> > +static int __check_ptr_off_reg(struct bpf_verifier_env *env,
> > +			       const struct bpf_reg_state *reg, int regno,
> > +			       bool fixed_off_ok)
> > +{
> > +	/* Access to this pointer-typed register or passing it to a helper
> > +	 * is only allowed in its original, unmodified form.
> > +	 */
> > +
> > +	if (reg->off < 0) {
> > +		verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
> > +			reg_type_str(env, reg->type), regno, reg->off);
> > +		return -EACCES;
> > +	}
> > +
> > +	if (!fixed_off_ok && reg->off) {
> > +		verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
> > +			reg_type_str(env, reg->type), regno, reg->off);
> > +		return -EACCES;
> > +	}
> > +
> > +	if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
> > +		char tn_buf[48];
> > +
> > +		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> > +		verbose(env, "variable %s access var_off=%s disallowed\n",
> > +			reg_type_str(env, reg->type), tn_buf);
> > +		return -EACCES;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int check_ptr_off_reg(struct bpf_verifier_env *env,
> > +		      const struct bpf_reg_state *reg, int regno)
> > +{
> > +	return __check_ptr_off_reg(env, reg, regno, false);
> > +}
> > +
> > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > +			       struct bpf_map_value_off_desc *off_desc,
> > +			       struct bpf_reg_state *reg, u32 regno)
> > +{
> > +	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > +	const char *reg_name = "";
> > +
> > +	if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> > +		goto bad_type;
> > +
> > +	if (!btf_is_kernel(reg->btf)) {
> > +		verbose(env, "R%d must point to kernel BTF\n", regno);
> > +		return -EINVAL;
> > +	}
> > +	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > +	reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > +
> > +	if (__check_ptr_off_reg(env, reg, regno, true))
> > +		return -EACCES;
> > +
> > +	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > +				  off_desc->btf, off_desc->btf_id))
> > +		goto bad_type;
> > +	return 0;
> > +bad_type:
> > +	verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > +		reg_type_str(env, reg->type), reg_name);
> > +	verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> > +	return -EINVAL;
> > +}
> > +
> > +/* Returns an error, or 0 if ignoring the access, or 1 if register state was
> > + * updated, in which case later updates must be skipped.
> > + */
> > +static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> > +				 int off, int size, int value_regno,
> > +				 enum bpf_access_type t, int insn_idx)
> > +{
> > +	struct bpf_reg_state *reg = reg_state(env, regno), *val_reg;
> > +	struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
> > +	struct bpf_map_value_off_desc *off_desc;
> > +	int insn_class = BPF_CLASS(insn->code);
> > +	struct bpf_map *map = reg->map_ptr;
> > +
> > +	/* Things we already checked for in check_map_access:
> > +	 *  - Reject cases where variable offset may touch BTF ID pointer
> > +	 *  - size of access (must be BPF_DW)
> > +	 *  - off_desc->offset == off + reg->var_off.value
> > +	 */
> > +	if (!tnum_is_const(reg->var_off))
> > +		return 0;
> > +
> > +	off_desc = bpf_map_kptr_off_contains(map, off + reg->var_off.value);
> > +	if (!off_desc)
> > +		return 0;
> > +
> > +	if (WARN_ON_ONCE(size != bpf_size_to_bytes(BPF_DW)))
>
> since the check was made, please avoid defensive programming.
>

Ok.

> > +		return -EACCES;
> > +
> > +	if (BPF_MODE(insn->code) != BPF_MEM)
>
> what is this? a fancy way to filter ldx/stx/st insns?
> Pls add a comment if so.
>
> > +		goto end;
> > +
> > +	if (!env->bpf_capable) {
> > +		verbose(env, "kptr access only allowed for CAP_BPF and CAP_SYS_ADMIN\n");
> > +		return -EPERM;
>
> Please move this check into map_check_btf().
> That's the earliest place we can issue such an error.
> Doing it here is so late that it doesn't help users, and it makes run-time slower,
> since this function is called a lot more times than map_check_btf.
>

Ok, will do.
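
Something like this in map_check_btf() then (rough sketch, untested):

	map->kptr_off_tab = btf_find_kptr(btf, value_type);
	if (map_value_has_kptr(map)) {
		if (!bpf_capable()) {
			ret = -EPERM;
			goto free_map_tab;
		}
		if (map->map_flags & BPF_F_RDONLY_PROG) {
			ret = -EACCES;
			goto free_map_tab;
		}
		/* existing map_type checks stay as they are */
	}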

> > +	}
> > +
> > +	if (insn_class == BPF_LDX) {
> > +		if (WARN_ON_ONCE(value_regno < 0))
> > +			return -EFAULT;
>
> defensive programming? Pls don't.
>

Ok.

> > +		val_reg = reg_state(env, value_regno);
> > +		/* We can simply mark the value_regno receiving the pointer
> > +		 * value from map as PTR_TO_BTF_ID, with the correct type.
> > +		 */
> > +		mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
> > +				off_desc->btf_id, PTR_MAYBE_NULL);
> > +		val_reg->id = ++env->id_gen;
> > +	} else if (insn_class == BPF_STX) {
> > +		if (WARN_ON_ONCE(value_regno < 0))
> > +			return -EFAULT;
> > +		val_reg = reg_state(env, value_regno);
> > +		if (!register_is_null(val_reg) &&
> > +		    map_kptr_match_type(env, off_desc, val_reg, value_regno))
> > +			return -EACCES;
> > +	} else if (insn_class == BPF_ST) {
> > +		if (insn->imm) {
> > +			verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> > +				off_desc->offset);
> > +			return -EACCES;
> > +		}
> > +	} else {
> > +		goto end;
> > +	}
> > +	return 1;
> > +end:
> > +	verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
> > +	return -EACCES;
> > +}
> > +
> >  /* check read/write into a map element with possible variable offset */
> >  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> >  			    int off, int size, bool zero_size_allowed)
> > @@ -3507,6 +3644,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> >  			return -EACCES;
> >  		}
> >  	}
> > +	if (map_value_has_kptr(map)) {
> > +		struct bpf_map_value_off *tab = map->kptr_off_tab;
> > +		int i;
> > +
> > +		for (i = 0; i < tab->nr_off; i++) {
> > +			u32 p = tab->off[i].offset;
> > +
> > +			if (reg->smin_value + off < p + sizeof(u64) &&
> > +			    p < reg->umax_value + off + size) {
> > +				if (!tnum_is_const(reg->var_off)) {
> > +					verbose(env, "kptr access cannot have variable offset\n");
> > +					return -EACCES;
> > +				}
> > +				if (p != off + reg->var_off.value) {
> > +					verbose(env, "kptr access misaligned expected=%u off=%llu\n",
> > +						p, off + reg->var_off.value);
> > +					return -EACCES;
> > +				}
> > +				if (size != bpf_size_to_bytes(BPF_DW)) {
> > +					verbose(env, "kptr access size must be BPF_DW\n");
> > +					return -EACCES;
> > +				}
> > +				break;
> > +			}
> > +		}
> > +	}
> >  	return err;
> >  }
> >
> > @@ -3980,44 +4143,6 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
> >  }
> >  #endif
> >
> > -static int __check_ptr_off_reg(struct bpf_verifier_env *env,
> > -			       const struct bpf_reg_state *reg, int regno,
> > -			       bool fixed_off_ok)
> > -{
> > -	/* Access to this pointer-typed register or passing it to a helper
> > -	 * is only allowed in its original, unmodified form.
> > -	 */
> > -
> > -	if (reg->off < 0) {
> > -		verbose(env, "negative offset %s ptr R%d off=%d disallowed\n",
> > -			reg_type_str(env, reg->type), regno, reg->off);
> > -		return -EACCES;
> > -	}
> > -
> > -	if (!fixed_off_ok && reg->off) {
> > -		verbose(env, "dereference of modified %s ptr R%d off=%d disallowed\n",
> > -			reg_type_str(env, reg->type), regno, reg->off);
> > -		return -EACCES;
> > -	}
> > -
> > -	if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
> > -		char tn_buf[48];
> > -
> > -		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> > -		verbose(env, "variable %s access var_off=%s disallowed\n",
> > -			reg_type_str(env, reg->type), tn_buf);
> > -		return -EACCES;
> > -	}
> > -
> > -	return 0;
> > -}
> > -
> > -int check_ptr_off_reg(struct bpf_verifier_env *env,
> > -		      const struct bpf_reg_state *reg, int regno)
> > -{
> > -	return __check_ptr_off_reg(env, reg, regno, false);
> > -}
> > -
>
> please split the hunk that moves code around into a separate patch.
> Don't mix it with actual changes.
>

Ok, will split into a separate patch.

> >  static int __check_buffer_access(struct bpf_verifier_env *env,
> >  				 const char *buf_info,
> >  				 const struct bpf_reg_state *reg,
> > @@ -4421,6 +4546,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >  		if (err)
> >  			return err;
> >  		err = check_map_access(env, regno, off, size, false);
> > +		err = err ?: check_map_kptr_access(env, regno, off, size, value_regno, t, insn_idx);
> > +		if (err < 0)
> > +			return err;
> > +		/* if err == 0, check_map_kptr_access ignored the access */
> >  		if (!err && t == BPF_READ && value_regno >= 0) {
> >  			struct bpf_map *map = reg->map_ptr;
> >
> > @@ -4442,6 +4571,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >  				mark_reg_unknown(env, regs, value_regno);
> >  			}
> >  		}
> > +		/* clear err == 1 */
> > +		err = err < 0 ? err : 0;
> >  	} else if (base_type(reg->type) == PTR_TO_MEM) {
> >  		bool rdonly_mem = type_is_rdonly_mem(reg->type);
> >
> > --
> > 2.35.1
> >
>
> --

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 04/15] bpf: Allow storing referenced kptr in map
  2022-03-19 18:24   ` Alexei Starovoitov
@ 2022-03-19 18:59     ` Kumar Kartikeya Dwivedi
  2022-03-19 21:23       ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 18:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sat, Mar 19, 2022 at 11:54:07PM IST, Alexei Starovoitov wrote:
> On Thu, Mar 17, 2022 at 05:29:46PM +0530, Kumar Kartikeya Dwivedi wrote:
> > Extending the code in the previous commit, introduce referenced kptr
> > support, which needs to be tagged using the 'kptr_ref' tag instead. Unlike
> > unreferenced kptrs, referenced kptrs have a lot more restrictions. In
> > addition to the type matching, only a newly introduced bpf_kptr_xchg
> > helper is allowed to modify the map value at that offset. This transfers
> > the referenced pointer being stored into the map, releasing the
> > reference state for the program, and returning the old value while
> > creating a new reference state for the returned pointer.
> >
> > Similar to the unreferenced pointer case, the return value for this case
> > will also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer
> > must either be eventually released by calling the corresponding release
> > function, or be transferred into another map.
> >
> > It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
> > the value, and obtain the old value if any.
> >
> > BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future
> > commit will permit using BPF_LDX for such pointers, but attempt at
> > making it safe, since the lifetime of object won't be guaranteed.
> >
> > There are valid reasons to enforce the restriction of permitting only
> > bpf_kptr_xchg to operate on referenced kptr. The pointer value must be
> > consistent in the face of concurrent modification, and any prior values
> > contained in the map must also be released before a new one is moved
> > into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
> > returns the old value, which the verifier would require the user to
> > either free or move into another map, and releases the reference held
> > for the pointer being moved in.
> >
> > In the future, direct BPF_XCHG instruction may also be permitted to work
> > like bpf_kptr_xchg helper.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h            |   7 ++
> >  include/uapi/linux/bpf.h       |  12 +++
> >  kernel/bpf/btf.c               |  20 +++-
> >  kernel/bpf/helpers.c           |  21 +++++
> >  kernel/bpf/verifier.c          | 167 +++++++++++++++++++++++++++++----
> >  tools/include/uapi/linux/bpf.h |  12 +++
> >  6 files changed, 219 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index f35920d279dd..702aa882e4a3 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -160,10 +160,15 @@ enum {
> >  	BPF_MAP_VALUE_OFF_MAX = 8,
> >  };
> >
> > +enum {
> > +	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> > +};
> > +
> >  struct bpf_map_value_off_desc {
> >  	u32 offset;
> >  	u32 btf_id;
> >  	struct btf *btf;
> > +	int flags;
> >  };
> >
> >  struct bpf_map_value_off {
> > @@ -413,6 +418,7 @@ enum bpf_arg_type {
> >  	ARG_PTR_TO_STACK,	/* pointer to stack */
> >  	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
> >  	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
> > +	ARG_PTR_TO_KPTR,	/* pointer to kptr */
> >  	__BPF_ARG_TYPE_MAX,
> >
> >  	/* Extended arg_types. */
> > @@ -422,6 +428,7 @@ enum bpf_arg_type {
> >  	ARG_PTR_TO_SOCKET_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
> >  	ARG_PTR_TO_ALLOC_MEM_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_ALLOC_MEM,
> >  	ARG_PTR_TO_STACK_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
> > +	ARG_PTR_TO_BTF_ID_OR_NULL	= PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
> >
> >  	/* This must be the last entry. Its purpose is to ensure the enum is
> >  	 * wide enough to hold the higher bits reserved for bpf_type_flag.
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 99fab54ae9c0..d45568746e79 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5129,6 +5129,17 @@ union bpf_attr {
> >   *		The **hash_algo** is returned on success,
> >   *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
> >   *		invalid arguments are passed.
> > + *
> > + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> > + *	Description
> > + *		Exchange kptr at pointer *map_value* with *ptr*, and return the
> > + *		old value. *ptr* can be NULL, otherwise it must be a referenced
> > + *		pointer which will be released when this helper is called.
> > + *	Return
> > + *		The old value of kptr (which can be NULL). The returned pointer
> > + *		The old value of kptr (which can be NULL). The returned pointer,
> > + *		corresponding release function, or moved into a BPF map before
> > + *		program exit.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)		\
> >  	FN(unspec),			\
> > @@ -5325,6 +5336,7 @@ union bpf_attr {
> >  	FN(copy_from_user_task),	\
> >  	FN(skb_set_tstamp),		\
> >  	FN(ima_file_hash),		\
> > +	FN(kptr_xchg),			\
> >  	/* */
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 9ac9364ef533..7b4179667bf1 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -3175,6 +3175,7 @@ enum {
> >  struct btf_field_info {
> >  	const struct btf_type *type;
> >  	u32 off;
> > +	int flags;
> >  };
> >
> >  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > @@ -3196,7 +3197,8 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> >  			       u32 off, int sz, struct btf_field_info *info,
> >  			       int info_cnt, int idx)
> >  {
> > -	bool kptr_tag = false;
> > +	bool kptr_tag = false, kptr_ref_tag = false;
> > +	int tags;
> >
> >  	/* For PTR, sz is always == 8 */
> >  	if (!btf_type_is_ptr(t))
> > @@ -3209,12 +3211,21 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> >  			if (kptr_tag)
> >  				return -EEXIST;
> >  			kptr_tag = true;
> > +		} else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off))) {
> > +			/* repeated tag */
> > +			if (kptr_ref_tag)
> > +				return -EEXIST;
> > +			kptr_ref_tag = true;
> >  		}
> >  		/* Look for next tag */
> >  		t = btf_type_by_id(btf, t->type);
> >  	}
> > -	if (!kptr_tag)
> > +
> > +	tags = kptr_tag + kptr_ref_tag;
> > +	if (!tags)
> >  		return BTF_FIELD_IGNORE;
> > +	else if (tags > 1)
> > +		return -EINVAL;
> >
> >  	/* Get the base type */
> >  	if (btf_type_is_modifier(t))
> > @@ -3225,6 +3236,10 @@ static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> >
> >  	if (idx >= info_cnt)
> >  		return -E2BIG;
> > +	if (kptr_ref_tag)
> > +		info[idx].flags = BPF_MAP_VALUE_OFF_F_REF;
> > +	else
> > +		info[idx].flags = 0;
> >  	info[idx].type = t;
> >  	info[idx].off = off;
> >  	return BTF_FIELD_FOUND;
> > @@ -3402,6 +3417,7 @@ struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> >  		tab->off[i].offset = info_arr[i].off;
> >  		tab->off[i].btf_id = id;
> >  		tab->off[i].btf = off_btf;
> > +		tab->off[i].flags = info_arr[i].flags;
> >  		tab->nr_off = i + 1;
> >  	}
> >  	return tab;
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 315053ef6a75..cb717bfbda19 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1374,6 +1374,25 @@ void bpf_timer_cancel_and_free(void *val)
> >  	kfree(t);
> >  }
> >
> > +BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
> > +{
> > +	unsigned long *kptr = map_value;
> > +
> > +	return xchg(kptr, (unsigned long)ptr);
> > +}
> > +
> > +static u32 bpf_kptr_xchg_btf_id;
> > +
> > +const struct bpf_func_proto bpf_kptr_xchg_proto = {
> > +	.func        = bpf_kptr_xchg,
> > +	.gpl_only    = false,
> > +	.ret_type    = RET_PTR_TO_BTF_ID_OR_NULL,
> > +	.ret_btf_id  = &bpf_kptr_xchg_btf_id,
> > +	.arg1_type   = ARG_PTR_TO_KPTR,
> > +	.arg2_type   = ARG_PTR_TO_BTF_ID_OR_NULL,
> > +	.arg2_btf_id = &bpf_kptr_xchg_btf_id,
> > +};
> > +
> >  const struct bpf_func_proto bpf_get_current_task_proto __weak;
> >  const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
> >  const struct bpf_func_proto bpf_probe_read_user_proto __weak;
> > @@ -1452,6 +1471,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> >  		return &bpf_timer_start_proto;
> >  	case BPF_FUNC_timer_cancel:
> >  		return &bpf_timer_cancel_proto;
> > +	case BPF_FUNC_kptr_xchg:
> > +		return &bpf_kptr_xchg_proto;
> >  	default:
> >  		break;
> >  	}
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 881d1381757b..f8738054aa52 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -257,6 +257,7 @@ struct bpf_call_arg_meta {
> >  	struct btf *ret_btf;
> >  	u32 ret_btf_id;
> >  	u32 subprogno;
> > +	struct bpf_map_value_off_desc *kptr_off_desc;
> >  };
> >
> >  struct btf *btf_vmlinux;
> > @@ -479,7 +480,8 @@ static bool is_release_function(enum bpf_func_id func_id)
> >  {
> >  	return func_id == BPF_FUNC_sk_release ||
> >  	       func_id == BPF_FUNC_ringbuf_submit ||
> > -	       func_id == BPF_FUNC_ringbuf_discard;
> > +	       func_id == BPF_FUNC_ringbuf_discard ||
> > +	       func_id == BPF_FUNC_kptr_xchg;
> >  }
> >
> >  static bool may_be_acquire_function(enum bpf_func_id func_id)
> > @@ -488,7 +490,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id)
> >  		func_id == BPF_FUNC_sk_lookup_udp ||
> >  		func_id == BPF_FUNC_skc_lookup_tcp ||
> >  		func_id == BPF_FUNC_map_lookup_elem ||
> > -	        func_id == BPF_FUNC_ringbuf_reserve;
> > +		func_id == BPF_FUNC_ringbuf_reserve ||
> > +		func_id == BPF_FUNC_kptr_xchg;
> >  }
> >
> >  static bool is_acquire_function(enum bpf_func_id func_id,
> > @@ -499,7 +502,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
> >  	if (func_id == BPF_FUNC_sk_lookup_tcp ||
> >  	    func_id == BPF_FUNC_sk_lookup_udp ||
> >  	    func_id == BPF_FUNC_skc_lookup_tcp ||
> > -	    func_id == BPF_FUNC_ringbuf_reserve)
> > +	    func_id == BPF_FUNC_ringbuf_reserve ||
> > +	    func_id == BPF_FUNC_kptr_xchg)
> >  		return true;
> >
> >  	if (func_id == BPF_FUNC_map_lookup_elem &&
> > @@ -3509,10 +3513,12 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> >
> >  static int map_kptr_match_type(struct bpf_verifier_env *env,
> >  			       struct bpf_map_value_off_desc *off_desc,
> > -			       struct bpf_reg_state *reg, u32 regno)
> > +			       struct bpf_reg_state *reg, u32 regno,
> > +			       bool ref_ptr)
> >  {
> >  	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> >  	const char *reg_name = "";
> > +	bool fixed_off_ok = true;
> >
> >  	if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> >  		goto bad_type;
> > @@ -3524,7 +3530,26 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
> >  	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
> >  	reg_name = kernel_type_name(reg->btf, reg->btf_id);
> >
> > -	if (__check_ptr_off_reg(env, reg, regno, true))
> > +	if (ref_ptr) {
> > +		if (!reg->ref_obj_id) {
> > +			verbose(env, "R%d must be referenced %s%s\n", regno,
> > +				reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> > +			return -EACCES;
> > +		}
> > +		/* reg->off can be used to store pointer to a certain type formed by
> > +		 * incrementing pointer of a parent structure the object is embedded in,
> > +		 * e.g. map may expect unreferenced struct path *, and user should be
> > +		 * allowed a store using &file->f_path. However, in the case of
> > +		 * referenced pointer, we cannot do this, because the reference is only
> > +		 * for the parent structure, not its embedded object(s), and because
> > +		 * the transfer of ownership happens for the original pointer to and
> > +		 * from the map (before its eventual release).
> > +		 */
> > +		if (reg->off)
> > +			fixed_off_ok = false;
> > +	}
> > +	/* var_off is rejected by __check_ptr_off_reg for PTR_TO_BTF_ID */
> > +	if (__check_ptr_off_reg(env, reg, regno, fixed_off_ok))
> >  		return -EACCES;
> >
> >  	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > @@ -3550,6 +3575,7 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> >  	struct bpf_map_value_off_desc *off_desc;
> >  	int insn_class = BPF_CLASS(insn->code);
> >  	struct bpf_map *map = reg->map_ptr;
> > +	bool ref_ptr = false;
> >
> >  	/* Things we already checked for in check_map_access:
> >  	 *  - Reject cases where variable offset may touch BTF ID pointer
> > @@ -3574,9 +3600,15 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> >  		return -EPERM;
> >  	}
> >
> > +	ref_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_REF;
> > +
> >  	if (insn_class == BPF_LDX) {
> >  		if (WARN_ON_ONCE(value_regno < 0))
> >  			return -EFAULT;
> > +		if (ref_ptr) {
> > +			verbose(env, "accessing referenced kptr disallowed\n");
> > +			return -EACCES;
> > +		}
>
> Please do this warn once instead of copy paste 3 times.
>

Ok.
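
Will hoist it once before the insn_class handling in v3, roughly like this
(untested sketch of the intent, not the final code):

	ref_ptr = off_desc->flags & BPF_MAP_VALUE_OFF_F_REF;
	/* the check is identical for BPF_LDX, BPF_STX and BPF_ST, so do it once */
	if (ref_ptr) {
		verbose(env, "accessing referenced kptr disallowed\n");
		return -EACCES;
	}

	if (insn_class == BPF_LDX) {
		...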

> >  		val_reg = reg_state(env, value_regno);
> >  		/* We can simply mark the value_regno receiving the pointer
> >  		 * value from map as PTR_TO_BTF_ID, with the correct type.
> > @@ -3587,11 +3619,19 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> >  	} else if (insn_class == BPF_STX) {
> >  		if (WARN_ON_ONCE(value_regno < 0))
> >  			return -EFAULT;
> > +		if (ref_ptr) {
> > +			verbose(env, "accessing referenced kptr disallowed\n");
> > +			return -EACCES;
> > +		}
> >  		val_reg = reg_state(env, value_regno);
> >  		if (!register_is_null(val_reg) &&
> > -		    map_kptr_match_type(env, off_desc, val_reg, value_regno))
> > +		    map_kptr_match_type(env, off_desc, val_reg, value_regno, false))
> >  			return -EACCES;
> >  	} else if (insn_class == BPF_ST) {
> > +		if (ref_ptr) {
> > +			verbose(env, "accessing referenced kptr disallowed\n");
> > +			return -EACCES;
> > +		}
> >  		if (insn->imm) {
> >  			verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> >  				off_desc->offset);
> > @@ -5265,6 +5305,63 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
> >  	return 0;
> >  }
> >
> > +static int process_kptr_func(struct bpf_verifier_env *env, int regno,
> > +			     struct bpf_call_arg_meta *meta)
> > +{
> > +	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> > +	struct bpf_map_value_off_desc *off_desc;
> > +	struct bpf_map *map_ptr = reg->map_ptr;
> > +	u32 kptr_off;
> > +	int ret;
> > +
> > +	if (!env->bpf_capable) {
> > +		verbose(env, "kptr access only allowed for CAP_BPF and CAP_SYS_ADMIN\n");
> > +		return -EPERM;
> > +	}
>
> another check? pls drop.
>

That check would never be hit for bpf_kptr_xchg, if I followed the code
correctly, but you already said to move it into map_check_btf.
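
For v3, roughly something like this inside map_check_btf once the kptr offsets
have been parsed (only a sketch; whether to return -EPERM directly or to free
the kptr_off_tab first is still open):

	if (map_value_has_kptr(map) && !bpf_capable())
		return -EPERM;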

> > +	if (!tnum_is_const(reg->var_off)) {
> > +		verbose(env,
> > +			"R%d doesn't have constant offset. kptr has to be at the constant offset\n",
> > +			regno);
> > +		return -EINVAL;
> > +	}
> > +	if (!map_ptr->btf) {
> > +		verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
> > +			map_ptr->name);
> > +		return -EINVAL;
> > +	}
> > +	if (!map_value_has_kptr(map_ptr)) {
> > +		ret = PTR_ERR(map_ptr->kptr_off_tab);
> > +		if (ret == -E2BIG)
> > +			verbose(env, "map '%s' has more than %d kptr\n", map_ptr->name,
> > +				BPF_MAP_VALUE_OFF_MAX);
> > +		else if (ret == -EEXIST)
> > +			verbose(env, "map '%s' has repeating kptr BTF tags\n", map_ptr->name);
> > +		else
> > +			verbose(env, "map '%s' has no valid kptr\n", map_ptr->name);
> > +		return -EINVAL;
> > +	}
> > +
> > +	meta->map_ptr = map_ptr;
> > +	/* Check access for BPF_WRITE */
> > +	meta->raw_mode = true;
> > +	ret = check_helper_mem_access(env, regno, sizeof(u64), false, meta);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	kptr_off = reg->off + reg->var_off.value;
> > +	off_desc = bpf_map_kptr_off_contains(map_ptr, kptr_off);
> > +	if (!off_desc) {
> > +		verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
> > +		return -EACCES;
> > +	}
> > +	if (!(off_desc->flags & BPF_MAP_VALUE_OFF_F_REF)) {
> > +		verbose(env, "off=%d kptr isn't referenced kptr\n", kptr_off);
> > +		return -EACCES;
> > +	}
> > +	meta->kptr_off_desc = off_desc;
> > +	return 0;
> > +}
> > +
> >  static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
> >  {
> >  	return base_type(type) == ARG_PTR_TO_MEM ||
> > @@ -5400,6 +5497,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
> >  static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
> >  static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
> >  static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
> > +static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } };
> >
> >  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
> >  	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
> > @@ -5427,11 +5525,13 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
> >  	[ARG_PTR_TO_STACK]		= &stack_ptr_types,
> >  	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
> >  	[ARG_PTR_TO_TIMER]		= &timer_types,
> > +	[ARG_PTR_TO_KPTR]		= &kptr_types,
> >  };
> >
> >  static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> >  			  enum bpf_arg_type arg_type,
> > -			  const u32 *arg_btf_id)
> > +			  const u32 *arg_btf_id,
> > +			  struct bpf_call_arg_meta *meta)
> >  {
> >  	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> >  	enum bpf_reg_type expected, type = reg->type;
> > @@ -5484,8 +5584,15 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> >  			arg_btf_id = compatible->btf_id;
> >  		}
> >
> > -		if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > -					  btf_vmlinux, *arg_btf_id)) {
> > +		if (meta->func_id == BPF_FUNC_kptr_xchg) {
> > +			if (!meta->kptr_off_desc) {
> > +				verbose(env, "verifier internal error: meta.kptr_off_desc unset\n");
> > +				return -EFAULT;
> > +			}
>
> please audit all patches and remove all instances of defensive programming.
>

I was just keeping it consistent with the meta->map_ptr checks in other places,
but no problem with removing them.

> > +			if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno, true))
> > +				return -EACCES;
> > +		} else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > +						 btf_vmlinux, *arg_btf_id)) {
> >  			verbose(env, "R%d is of type %s but %s is expected\n",
> >  				regno, kernel_type_name(reg->btf, reg->btf_id),
> >  				kernel_type_name(btf_vmlinux, *arg_btf_id));
> > @@ -5595,7 +5702,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> >  		 */
> >  		goto skip_type_check;
> >
> > -	err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg]);
> > +	err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg], meta);
> >  	if (err)
> >  		return err;
> >
> > @@ -5760,6 +5867,14 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> >  			verbose(env, "string is not zero-terminated\n");
> >  			return -EINVAL;
> >  		}
> > +	} else if (arg_type == ARG_PTR_TO_KPTR) {
> > +		if (meta->func_id == BPF_FUNC_kptr_xchg) {
> > +			if (process_kptr_func(env, regno, meta))
> > +				return -EACCES;
> > +		} else {
> > +			verbose(env, "verifier internal error\n");
> > +			return -EFAULT;
>
> remove.
>
> > +		}
> >  	}
> >
> >  	return err;
> > @@ -6102,10 +6217,10 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
> >  	int i;
> >
> >  	for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
> > -		if (fn->arg_type[i] == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
> > +		if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
> >  			return false;
> >
> > -		if (fn->arg_type[i] != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
> > +		if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
> >  			return false;
> >  	}
> >
> > @@ -6830,7 +6945,15 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >  	}
> >
> >  	if (is_release_function(func_id)) {
> > -		err = release_reference(env, meta.ref_obj_id);
> > +		err = -EINVAL;
> > +		if (meta.ref_obj_id)
> > +			err = release_reference(env, meta.ref_obj_id);
> > +		/* Only bpf_kptr_xchg is a release function that accepts a
> > +		 * possibly NULL reg, hence meta.ref_obj_id can only be unset
> > +		 * for it.
>
> Could you rephrase the comment? I'm not following what it's trying to convey.
>

None of the existing release helpers take a NULL register, so their
meta.ref_obj_id will never be unset, but bpf_kptr_xchg can take one, so it
needs some special handling. In check_func_arg, when we jump to the
skip_type_check label, reg->ref_obj_id won't be set for a NULL value.
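
To make it concrete, a hypothetical program snippet (map and field names are
made up, and the release kfunc is just a placeholder):

	struct task_struct *old;
	struct map_value *v;
	int key = 0;

	v = bpf_map_lookup_elem(&array_map, &key);
	if (!v)
		return 0;
	/* second argument is NULL, so no reference is acquired before the
	 * call and meta.ref_obj_id stays 0, yet bpf_kptr_xchg is still a
	 * release function for the non-NULL case
	 */
	old = bpf_kptr_xchg(&v->task, NULL);
	if (old)
		task_release(old); /* placeholder for the matching release kfunc */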

> > +		 */
> > +		else if (func_id == BPF_FUNC_kptr_xchg)
> > +			err = 0;
> >  		if (err) {
> >  			verbose(env, "func %s#%d reference has not been acquired before\n",
> >  				func_id_name(func_id), func_id);
> > @@ -6963,21 +7086,29 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >  			regs[BPF_REG_0].btf_id = meta.ret_btf_id;
> >  		}
> >  	} else if (base_type(ret_type) == RET_PTR_TO_BTF_ID) {
> > +		struct btf *ret_btf;
> >  		int ret_btf_id;
> >
> >  		mark_reg_known_zero(env, regs, BPF_REG_0);
> >  		regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
> > -		ret_btf_id = *fn->ret_btf_id;
> > +		if (func_id == BPF_FUNC_kptr_xchg) {
> > +			if (!meta.kptr_off_desc) {
> > +				verbose(env, "verifier internal error: meta.kptr_off_desc unset\n");
> > +				return -EFAULT;
>
> remove.
>
> > +			}
> > +			ret_btf = meta.kptr_off_desc->btf;
> > +			ret_btf_id = meta.kptr_off_desc->btf_id;
> > +		} else {
> > +			ret_btf = btf_vmlinux;
> > +			ret_btf_id = *fn->ret_btf_id;
> > +		}
> >  		if (ret_btf_id == 0) {
> >  			verbose(env, "invalid return type %u of func %s#%d\n",
> >  				base_type(ret_type), func_id_name(func_id),
> >  				func_id);
> >  			return -EINVAL;
> >  		}
> > -		/* current BPF helper definitions are only coming from
> > -		 * built-in code with type IDs from  vmlinux BTF
> > -		 */
> > -		regs[BPF_REG_0].btf = btf_vmlinux;
> > +		regs[BPF_REG_0].btf = ret_btf;
> >  		regs[BPF_REG_0].btf_id = ret_btf_id;
> >  	} else {
> >  		verbose(env, "unknown return type %u of func %s#%d\n",
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 99fab54ae9c0..d45568746e79 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -5129,6 +5129,17 @@ union bpf_attr {
> >   *		The **hash_algo** is returned on success,
> >   *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
> >   *		invalid arguments are passed.
> > + *
> > + * void *bpf_kptr_xchg(void *map_value, void *ptr)
> > + *	Description
> > + *		Exchange kptr at pointer *map_value* with *ptr*, and return the
> > + *		old value. *ptr* can be NULL, otherwise it must be a referenced
> > + *		pointer which will be released when this helper is called.
> > + *	Return
> > + *		The old value of kptr (which can be NULL). The returned pointer
> > + *		if not NULL, is a reference which must be released using its
> > + *		corresponding release function, or moved into a BPF map before
> > + *		program exit.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)		\
> >  	FN(unspec),			\
> > @@ -5325,6 +5336,7 @@ union bpf_attr {
> >  	FN(copy_from_user_task),	\
> >  	FN(skb_set_tstamp),		\
> >  	FN(ima_file_hash),		\
> > +	FN(kptr_xchg),			\
> >  	/* */
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > --
> > 2.35.1
> >
>
> --

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 06/15] bpf: Allow storing user kptr in map
  2022-03-19 18:28   ` Alexei Starovoitov
@ 2022-03-19 19:02     ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 19:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sat, Mar 19, 2022 at 11:58:13PM IST, Alexei Starovoitov wrote:
> On Thu, Mar 17, 2022 at 05:29:48PM +0530, Kumar Kartikeya Dwivedi wrote:
> > Recently, verifier gained __user annotation support [0] where it
> > prevents a BPF program from normally dereferencing a user memory pointer in
> > the kernel, and instead requires use of bpf_probe_read_user. We can allow
> > the user to also store these pointers in BPF maps, with the logic that
> > whenever the user loads it from the BPF map, it gets marked as MEM_USER. The
> > tag 'kptr_user' is used to tag such pointers.
> >
> >   [0]: https://lore.kernel.org/bpf/20220127154555.650886-1-yhs@fb.com
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h   |  1 +
> >  kernel/bpf/btf.c      | 13 ++++++++++---
> >  kernel/bpf/verifier.c | 15 ++++++++++++---
> >  3 files changed, 23 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 433f5cb161cf..989f47334215 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -163,6 +163,7 @@ enum {
> >  enum {
> >  	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
> >  	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),
> > +	BPF_MAP_VALUE_OFF_F_USER   = (1U << 2),
> ...
> > +		} else if (!strcmp("kptr_user", __btf_name_by_offset(btf, t->name_off))) {
>
> I don't see a use case where __user pointer would need to be stored into a map.
> That pointer is valid in the user task context.
> When bpf prog has such pointer it can read user mem through it,
> but storing it for later makes little sense. The user context will certainly change.
> Reading it later from the map is more or less reading random number.
> Lets drop this patch until real use case arises.

In some cases the address may be fixed (e.g. a user area registration similar
to rseq, agreed upon between the task and the BPF program), and stays valid as
long as the task is alive.

But the patch itself is trivial, so fine with dropping for now.
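
For illustration (the struct name is made up, the tag name is from this patch):

	#define __kptr_user __attribute__((btf_type_tag("kptr_user")))

	struct map_value {
		/* fixed per-task user area, registered once and valid while
		 * the task is alive; reads would go through bpf_probe_read_user()
		 */
		struct my_user_area __kptr_user *area;
	};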

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 05/15] bpf: Allow storing percpu kptr in map
  2022-03-19 18:30   ` Alexei Starovoitov
@ 2022-03-19 19:04     ` Kumar Kartikeya Dwivedi
  2022-03-19 21:26       ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 19:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Hao Luo, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 12:00:28AM IST, Alexei Starovoitov wrote:
> On Thu, Mar 17, 2022 at 05:29:47PM +0530, Kumar Kartikeya Dwivedi wrote:
> > Make adjustments to the code to allow storing percpu PTR_TO_BTF_ID in a
> > map. Similar to 'kptr_ref' tag, a new 'kptr_percpu' allows tagging types
> > of pointers accepting stores of such register types. On load, verifier
> > marks destination register as having type PTR_TO_BTF_ID | MEM_PERCPU |
> > PTR_MAYBE_NULL.
> >
> > Cc: Hao Luo <haoluo@google.com>
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h   |  3 ++-
> >  kernel/bpf/btf.c      | 13 ++++++++++---
> >  kernel/bpf/verifier.c | 26 +++++++++++++++++++++-----
> >  3 files changed, 33 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 702aa882e4a3..433f5cb161cf 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -161,7 +161,8 @@ enum {
> >  };
> >
> >  enum {
> > -	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> > +	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
> > +	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),
>
> What is the use case for storing __percpu pointer into a map?

No specific use case for me; I just thought it would be useful, especially now
that the __percpu tag is understood by the verifier for kernel BTF, so it may
also refer to dynamically allocated per-CPU memory, not just global percpu
variables. But I'm fine with dropping both this and the user kptr if you don't
feel like keeping them.
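
For illustration only (the pointee type here is arbitrary):

	#define __kptr_percpu __attribute__((btf_type_tag("kptr_percpu")))

	struct map_value {
		/* loads of this field get marked
		 * PTR_TO_BTF_ID | MEM_PERCPU | PTR_MAYBE_NULL by the verifier
		 */
		struct some_percpu_state __kptr_percpu *state;
	};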

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 08/15] bpf: Adapt copy_map_value for multiple offset case
  2022-03-19 18:34   ` Alexei Starovoitov
@ 2022-03-19 19:06     ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 19:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 12:04:40AM IST, Alexei Starovoitov wrote:
> On Thu, Mar 17, 2022 at 05:29:50PM +0530, Kumar Kartikeya Dwivedi wrote:
> > Since now there might be at most 10 offsets that need handling in
> > copy_map_value, the manual shuffling and special case is no longer going
> > to work. Hence, let's generalise the copy_map_value function by using
> > a sorted array of offsets to skip regions that must be avoided while
> > copying into and out of a map value.
> >
> > When the map is created, we populate the offset array in struct map,
> > with one extra element for map->value_size, which is used as the final
> > offset to subtract previous offset from. Since there can only be three
> > sizes, we can avoid recording the size in the struct map, and only store
> > sorted offsets. Later we can determine the size for each offset by
> > comparing it to timer_off and spin_lock_off, otherwise it must be
> > sizeof(u64) for kptr.
> >
> > Then, copy_map_value uses this sorted offset array to memcpy
> > while skipping timer, spin lock, and kptr.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h  | 59 +++++++++++++++++++++++++-------------------
> >  kernel/bpf/syscall.c | 47 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 80 insertions(+), 26 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 8ac3070aa5e6..f0f1e0d3bb2e 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -158,6 +158,10 @@ struct bpf_map_ops {
> >  enum {
> >  	/* Support at most 8 pointers in a BPF map value */
> >  	BPF_MAP_VALUE_OFF_MAX = 8,
> > +	BPF_MAP_OFF_ARR_MAX   = BPF_MAP_VALUE_OFF_MAX +
> > +				1 + /* for bpf_spin_lock */
> > +				1 + /* for bpf_timer */
> > +				1,  /* for map->value_size sentinel */
> >  };
> >
> >  enum {
> > @@ -208,7 +212,12 @@ struct bpf_map {
> >  	char name[BPF_OBJ_NAME_LEN];
> >  	bool bypass_spec_v1;
> >  	bool frozen; /* write-once; write-protected by freeze_mutex */
> > -	/* 6 bytes hole */
> > +	/* 2 bytes hole */
> > +	struct {
> > +		u32 off[BPF_MAP_OFF_ARR_MAX];
> > +		u32 cnt;
> > +	} off_arr;
> > +	/* 20 bytes hole */
> >
> >  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
> >  	 * particularly with refcounting.
> > @@ -252,36 +261,34 @@ static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> >  		memset(dst + map->spin_lock_off, 0, sizeof(struct bpf_spin_lock));
> >  	if (unlikely(map_value_has_timer(map)))
> >  		memset(dst + map->timer_off, 0, sizeof(struct bpf_timer));
> > +	if (unlikely(map_value_has_kptr(map))) {
> > +		struct bpf_map_value_off *tab = map->kptr_off_tab;
> > +		int i;
> > +
> > +		for (i = 0; i < tab->nr_off; i++)
> > +			*(u64 *)(dst + tab->off[i].offset) = 0;
> > +	}
> >  }
> >
> >  /* copy everything but bpf_spin_lock and bpf_timer. There could be one of each. */
> >  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
> >  {
> > -	u32 s_off = 0, s_sz = 0, t_off = 0, t_sz = 0;
> > -
> > -	if (unlikely(map_value_has_spin_lock(map))) {
> > -		s_off = map->spin_lock_off;
> > -		s_sz = sizeof(struct bpf_spin_lock);
> > -	}
> > -	if (unlikely(map_value_has_timer(map))) {
> > -		t_off = map->timer_off;
> > -		t_sz = sizeof(struct bpf_timer);
> > -	}
> > -
> > -	if (unlikely(s_sz || t_sz)) {
> > -		if (s_off < t_off || !s_sz) {
> > -			swap(s_off, t_off);
> > -			swap(s_sz, t_sz);
> > -		}
> > -		memcpy(dst, src, t_off);
> > -		memcpy(dst + t_off + t_sz,
> > -		       src + t_off + t_sz,
> > -		       s_off - t_off - t_sz);
> > -		memcpy(dst + s_off + s_sz,
> > -		       src + s_off + s_sz,
> > -		       map->value_size - s_off - s_sz);
> > -	} else {
> > -		memcpy(dst, src, map->value_size);
> > +	int i;
> > +
> > +	memcpy(dst, src, map->off_arr.off[0]);
> > +	for (i = 1; i < map->off_arr.cnt; i++) {
> > +		u32 curr_off = map->off_arr.off[i - 1];
> > +		u32 next_off = map->off_arr.off[i];
> > +		u32 curr_sz;
> > +
> > +		if (map_value_has_spin_lock(map) && map->spin_lock_off == curr_off)
> > +			curr_sz = sizeof(struct bpf_spin_lock);
> > +		else if (map_value_has_timer(map) && map->timer_off == curr_off)
> > +			curr_sz = sizeof(struct bpf_timer);
> > +		else
> > +			curr_sz = sizeof(u64);
>
> Lets store size in off_arr as well.
> Memory consumption of few u8-s are worth it.
> Single load is faster than two if-s and a bunch of load.
>

Ok, will include the size in v3.
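
Roughly, the layout would then become something like this (field names are
placeholders):

	struct {
		u32 cnt;
		u32 field_off[BPF_MAP_OFF_ARR_MAX];
		u8  field_sz[BPF_MAP_OFF_ARR_MAX];
	} off_arr;

so copy_map_value only needs a single load of field_sz[i] per skipped region
instead of the two map_value_has_*() checks per iteration.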

> > +		curr_off += curr_sz;
> > +		memcpy(dst + curr_off, src + curr_off, next_off - curr_off);
> >  	}
> >  }
> >  void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 87263b07f40b..69e8ea1be432 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -30,6 +30,7 @@
> >  #include <linux/pgtable.h>
> >  #include <linux/bpf_lsm.h>
> >  #include <linux/poll.h>
> > +#include <linux/sort.h>
> >  #include <linux/bpf-netns.h>
> >  #include <linux/rcupdate_trace.h>
> >  #include <linux/memcontrol.h>
> > @@ -850,6 +851,50 @@ int map_check_no_btf(const struct bpf_map *map,
> >  	return -ENOTSUPP;
> >  }
> >
> > +static int map_off_arr_cmp(const void *_a, const void *_b)
> > +{
> > +	const u32 a = *(const u32 *)_a;
> > +	const u32 b = *(const u32 *)_b;
> > +
> > +	if (a < b)
> > +		return -1;
> > +	else if (a > b)
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +static void map_populate_off_arr(struct bpf_map *map)
> > +{
> > +	u32 i;
> > +
> > +	map->off_arr.cnt = 0;
> > +	if (map_value_has_spin_lock(map)) {
> > +		i = map->off_arr.cnt;
> > +
> > +		map->off_arr.off[i] = map->spin_lock_off;
> > +		map->off_arr.cnt++;
> > +	}
> > +	if (map_value_has_timer(map)) {
> > +		i = map->off_arr.cnt;
> > +
> > +		map->off_arr.off[i] = map->timer_off;
> > +		map->off_arr.cnt++;
> > +	}
> > +	if (map_value_has_kptr(map)) {
> > +		struct bpf_map_value_off *tab = map->kptr_off_tab;
> > +		u32 j = map->off_arr.cnt;
> > +
> > +		for (i = 0; i < tab->nr_off; i++)
> > +			map->off_arr.off[j + i] = tab->off[i].offset;
> > +		map->off_arr.cnt += tab->nr_off;
> > +	}
> > +
> > +	map->off_arr.off[map->off_arr.cnt++] = map->value_size;
> > +	if (map->off_arr.cnt == 1)
> > +		return;
> > +	sort(map->off_arr.off, map->off_arr.cnt, sizeof(map->off_arr.off[0]), map_off_arr_cmp, NULL);
> > +}
> > +
> >  static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> >  			 u32 btf_key_id, u32 btf_value_id)
> >  {
> > @@ -1015,6 +1060,8 @@ static int map_create(union bpf_attr *attr)
> >  			attr->btf_vmlinux_value_type_id;
> >  	}
> >
> > +	map_populate_off_arr(map);
> > +
> >  	err = security_bpf_map_alloc(map);
> >  	if (err)
> >  		goto free_map;
> > --
> > 2.35.1
> >
>
> --

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic
  2022-03-19 17:55   ` Alexei Starovoitov
@ 2022-03-19 19:31     ` Kumar Kartikeya Dwivedi
  2022-03-19 20:06       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 19:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sat, Mar 19, 2022 at 11:25:34PM IST, Alexei Starovoitov wrote:
> On Thu, Mar 17, 2022 at 05:29:44PM +0530, Kumar Kartikeya Dwivedi wrote:
> > Next commit's field type will not be struct, but pointer, and it will
> > not be limited to one offset, but multiple ones. Make existing
> > btf_find_struct_field and btf_find_datasec_var functions amenable to use
> > for finding BTF ID pointers in map value, by taking a moving spin_lock
> > and timer specific checks into their own function.
> >
> > The alignment and name are checked before the function is called, so it
> > is the last point where we can skip the field or return an error before the
> > next loop iteration happens. This is important, because we'll be
> > potentially reallocating memory inside this function in the next commit, so
> > being able to do that when everything else is in order is going to be
> > more convenient.
> >
> > The name parameter is now optional, and only checked if it is not NULL.
> >
> > The size must be checked in the function, because in case of PTR it will
> > instead point to the underlying BTF ID it is pointing to (or modifiers),
> > so the check becomes wrong to do outside of function, and the base type
> > has to be obtained by removing modifiers.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  kernel/bpf/btf.c | 120 +++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 86 insertions(+), 34 deletions(-)
> >
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 17b9adcd88d3..5b2824332880 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -3161,71 +3161,109 @@ static void btf_struct_log(struct btf_verifier_env *env,
> >  	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
> >  }
> >
> > +enum {
> > +	BTF_FIELD_SPIN_LOCK,
> > +	BTF_FIELD_TIMER,
> > +};
> > +
> > +struct btf_field_info {
> > +	u32 off;
> > +};
> > +
> > +static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > +				 u32 off, int sz, struct btf_field_info *info)
> > +{
> > +	if (!__btf_type_is_struct(t))
> > +		return 0;
> > +	if (t->size != sz)
> > +		return 0;
> > +	if (info->off != -ENOENT)
> > +		/* only one such field is allowed */
> > +		return -E2BIG;
> > +	info->off = off;
> > +	return 0;
> > +}
> > +
> >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> > -				 const char *name, int sz, int align)
> > +				 const char *name, int sz, int align, int field_type,
> > +				 struct btf_field_info *info)
> >  {
> >  	const struct btf_member *member;
> > -	u32 i, off = -ENOENT;
> > +	u32 i, off;
> > +	int ret;
> >
> >  	for_each_member(i, t, member) {
> >  		const struct btf_type *member_type = btf_type_by_id(btf,
> >  								    member->type);
> > -		if (!__btf_type_is_struct(member_type))
> > -			continue;
> > -		if (member_type->size != sz)
> > -			continue;
> > -		if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> > -			continue;
> > -		if (off != -ENOENT)
> > -			/* only one such field is allowed */
> > -			return -E2BIG;
> > +
> >  		off = __btf_member_bit_offset(t, member);
> > +
> > +		if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> > +			continue;
> >  		if (off % 8)
> >  			/* valid C code cannot generate such BTF */
> >  			return -EINVAL;
> >  		off /= 8;
> >  		if (off % align)
> >  			return -EINVAL;
> > +
> > +		switch (field_type) {
> > +		case BTF_FIELD_SPIN_LOCK:
> > +		case BTF_FIELD_TIMER:
>
> Since spin_lock vs timer is passed into btf_find_struct_field() as field_type
> argument there is no need to pass name, sz, align from the caller.
> Pls make btf_find_spin_lock() to pass BTF_FIELD_SPIN_LOCK only
> and in the above code do something like:
>  switch (field_type) {
>  case BTF_FIELD_SPIN_LOCK:
>      name = "bpf_spin_lock";
>      sz = ...
>      break;
>  case BTF_FIELD_TIMER:
>      name = "bpf_timer";
>      sz = ...
>      break;
>  }

Would doing this in btf_find_field be better? Then we set these once instead of
doing it twice in btf_find_struct_field, and btf_find_datasec_var.
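
i.e. something like this (sketch only):

	static int btf_find_field(const struct btf *btf, const struct btf_type *t,
				  int field_type, struct btf_field_info *info)
	{
		const char *name;
		int sz, align;

		switch (field_type) {
		case BTF_FIELD_SPIN_LOCK:
			name = "bpf_spin_lock";
			sz = sizeof(struct bpf_spin_lock);
			align = __alignof__(struct bpf_spin_lock);
			break;
		case BTF_FIELD_TIMER:
			name = "bpf_timer";
			sz = sizeof(struct bpf_timer);
			align = __alignof__(struct bpf_timer);
			break;
		default:
			return -EFAULT;
		}
		/* ... then pass name/sz/align down to btf_find_struct_field()
		 * and btf_find_datasec_var() as before
		 */
	}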

>  switch (field_type) {
>  case BTF_FIELD_SPIN_LOCK:
>  case BTF_FIELD_TIMER:
> 	if (!__btf_type_is_struct(member_type))
> 		continue;
> 	if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
>         ...
>         btf_find_field_struct(btf, member_type, off, sz, info);
>  }
>
> It will cleanup the later patch which passes NULL, sizeof(u64), alignof(u64)
> only to pass something into the function.
> With above suggestion it wouldn't need to pass dummy args. BTF_FIELD_KPTR will be enough.
>
> > +			ret = btf_find_field_struct(btf, member_type, off, sz, info);
> > +			if (ret < 0)
> > +				return ret;
> > +			break;
> > +		default:
> > +			return -EFAULT;
> > +		}
> >  	}
> > -	return off;
> > +	return 0;
> >  }
> >
> >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> > -				const char *name, int sz, int align)
> > +				const char *name, int sz, int align, int field_type,
> > +				struct btf_field_info *info)
> >  {
> >  	const struct btf_var_secinfo *vsi;
> > -	u32 i, off = -ENOENT;
> > +	u32 i, off;
> > +	int ret;
> >
> >  	for_each_vsi(i, t, vsi) {
> >  		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> >  		const struct btf_type *var_type = btf_type_by_id(btf, var->type);
> >
> > -		if (!__btf_type_is_struct(var_type))
> > -			continue;
> > -		if (var_type->size != sz)
> > +		off = vsi->offset;
> > +
> > +		if (name && strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
> >  			continue;
> >  		if (vsi->size != sz)
> >  			continue;
> > -		if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
> > -			continue;
> > -		if (off != -ENOENT)
> > -			/* only one such field is allowed */
> > -			return -E2BIG;
> > -		off = vsi->offset;
> >  		if (off % align)
> >  			return -EINVAL;
> > +
> > +		switch (field_type) {
> > +		case BTF_FIELD_SPIN_LOCK:
> > +		case BTF_FIELD_TIMER:
> > +			ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > +			if (ret < 0)
> > +				return ret;
> > +			break;
> > +		default:
> > +			return -EFAULT;
> > +		}
> >  	}
> > -	return off;
> > +	return 0;
> >  }
> >
> >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > -			  const char *name, int sz, int align)
> > +			  const char *name, int sz, int align, int field_type,
> > +			  struct btf_field_info *info)
> >  {
> > -
> >  	if (__btf_type_is_struct(t))
> > -		return btf_find_struct_field(btf, t, name, sz, align);
> > +		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> >  	else if (btf_type_is_datasec(t))
> > -		return btf_find_datasec_var(btf, t, name, sz, align);
> > +		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> >  	return -EINVAL;
> >  }
> >
> > @@ -3235,16 +3273,30 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >   */
> >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> >  {
> > -	return btf_find_field(btf, t, "bpf_spin_lock",
> > -			      sizeof(struct bpf_spin_lock),
> > -			      __alignof__(struct bpf_spin_lock));
> > +	struct btf_field_info info = { .off = -ENOENT };
> > +	int ret;
> > +
> > +	ret = btf_find_field(btf, t, "bpf_spin_lock",
> > +			     sizeof(struct bpf_spin_lock),
> > +			     __alignof__(struct bpf_spin_lock),
> > +			     BTF_FIELD_SPIN_LOCK, &info);
> > +	if (ret < 0)
> > +		return ret;
> > +	return info.off;
> >  }
> >
> >  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> >  {
> > -	return btf_find_field(btf, t, "bpf_timer",
> > -			      sizeof(struct bpf_timer),
> > -			      __alignof__(struct bpf_timer));
> > +	struct btf_field_info info = { .off = -ENOENT };
> > +	int ret;
> > +
> > +	ret = btf_find_field(btf, t, "bpf_timer",
> > +			     sizeof(struct bpf_timer),
> > +			     __alignof__(struct bpf_timer),
> > +			     BTF_FIELD_TIMER, &info);
> > +	if (ret < 0)
> > +		return ret;
> > +	return info.off;
> >  }
> >
> >  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> > --
> > 2.35.1
> >
>
> --

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic
  2022-03-19 19:31     ` Kumar Kartikeya Dwivedi
@ 2022-03-19 20:06       ` Kumar Kartikeya Dwivedi
  2022-03-19 21:30         ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 20:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 01:01:16AM IST, Kumar Kartikeya Dwivedi wrote:
> On Sat, Mar 19, 2022 at 11:25:34PM IST, Alexei Starovoitov wrote:
> > On Thu, Mar 17, 2022 at 05:29:44PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > Next commit's field type will not be struct, but pointer, and it will
> > > not be limited to one offset, but multiple ones. Make existing
> > > btf_find_struct_field and btf_find_datasec_var functions amenable to use
> > > for finding BTF ID pointers in map value, by moving spin_lock
> > > and timer specific checks into their own function.
> > >
> > > The alignment, and name are checked before the function is called, so it
> > > is the last point where we can skip field or return an error before the
> > > next loop iteration happens. This is important, because we'll be
> > > potentially reallocating memory inside this function in next commit, so
> > > being able to do that when everything else is in order is going to be
> > > more convenient.
> > >
> > > The name parameter is now optional, and only checked if it is not NULL.
> > >
> > > The size must be checked in the function, because in case of PTR it will
> > > instead point to the underlying BTF ID it is pointing to (or modifiers),
> > > so the check becomes wrong to do outside of function, and the base type
> > > has to be obtained by removing modifiers.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  kernel/bpf/btf.c | 120 +++++++++++++++++++++++++++++++++--------------
> > >  1 file changed, 86 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index 17b9adcd88d3..5b2824332880 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -3161,71 +3161,109 @@ static void btf_struct_log(struct btf_verifier_env *env,
> > >  	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
> > >  }
> > >
> > > +enum {
> > > +	BTF_FIELD_SPIN_LOCK,
> > > +	BTF_FIELD_TIMER,
> > > +};
> > > +
> > > +struct btf_field_info {
> > > +	u32 off;
> > > +};
> > > +
> > > +static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > > +				 u32 off, int sz, struct btf_field_info *info)
> > > +{
> > > +	if (!__btf_type_is_struct(t))
> > > +		return 0;
> > > +	if (t->size != sz)
> > > +		return 0;
> > > +	if (info->off != -ENOENT)
> > > +		/* only one such field is allowed */
> > > +		return -E2BIG;
> > > +	info->off = off;
> > > +	return 0;
> > > +}
> > > +
> > >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> > > -				 const char *name, int sz, int align)
> > > +				 const char *name, int sz, int align, int field_type,
> > > +				 struct btf_field_info *info)
> > >  {
> > >  	const struct btf_member *member;
> > > -	u32 i, off = -ENOENT;
> > > +	u32 i, off;
> > > +	int ret;
> > >
> > >  	for_each_member(i, t, member) {
> > >  		const struct btf_type *member_type = btf_type_by_id(btf,
> > >  								    member->type);
> > > -		if (!__btf_type_is_struct(member_type))
> > > -			continue;
> > > -		if (member_type->size != sz)
> > > -			continue;
> > > -		if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> > > -			continue;
> > > -		if (off != -ENOENT)
> > > -			/* only one such field is allowed */
> > > -			return -E2BIG;
> > > +
> > >  		off = __btf_member_bit_offset(t, member);
> > > +
> > > +		if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> > > +			continue;
> > >  		if (off % 8)
> > >  			/* valid C code cannot generate such BTF */
> > >  			return -EINVAL;
> > >  		off /= 8;
> > >  		if (off % align)
> > >  			return -EINVAL;
> > > +
> > > +		switch (field_type) {
> > > +		case BTF_FIELD_SPIN_LOCK:
> > > +		case BTF_FIELD_TIMER:
> >
> > Since spin_lock vs timer is passed into btf_find_struct_field() as field_type
> > argument there is no need to pass name, sz, align from the caller.
> > Pls make btf_find_spin_lock() to pass BTF_FIELD_SPIN_LOCK only
> > and in the above code do something like:
> >  switch (field_type) {
> >  case BTF_FIELD_SPIN_LOCK:
> >      name = "bpf_spin_lock";
> >      sz = ...
> >      break;
> >  case BTF_FIELD_TIMER:
> >      name = "bpf_timer";
> >      sz = ...
> >      break;
> >  }
>
> Would doing this in btf_find_field be better? Then we set these once instead of
> doing it twice in btf_find_struct_field, and btf_find_datasec_var.
>
> >  switch (field_type) {
> >  case BTF_FIELD_SPIN_LOCK:
> >  case BTF_FIELD_TIMER:
> > 	if (!__btf_type_is_struct(member_type))
> > 		continue;
> > 	if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> >         ...
> >         btf_find_field_struct(btf, member_type, off, sz, info);
> >  }
> >
> > It will cleanup the later patch which passes NULL, sizeof(u64), alignof(u64)
> > only to pass something into the function.
> > With above suggestion it wouldn't need to pass dummy args. BTF_FIELD_KPTR will be enough.
> >

Just to be clear, for the kptr case we still use size and align, only the name
is optional. size is used for the datasec_var call, and align is used in both
struct_field and datasec_var. So I'm not sure whether moving it around has much
effect; instead of being set by the caller, it will now be set based on
field_type inside btf_find_field.

> > > +			ret = btf_find_field_struct(btf, member_type, off, sz, info);
> > > +			if (ret < 0)
> > > +				return ret;
> > > +			break;
> > > +		default:
> > > +			return -EFAULT;
> > > +		}
> > >  	}
> > > -	return off;
> > > +	return 0;
> > >  }
> > >
> > >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> > > -				const char *name, int sz, int align)
> > > +				const char *name, int sz, int align, int field_type,
> > > +				struct btf_field_info *info)
> > >  {
> > >  	const struct btf_var_secinfo *vsi;
> > > -	u32 i, off = -ENOENT;
> > > +	u32 i, off;
> > > +	int ret;
> > >
> > >  	for_each_vsi(i, t, vsi) {
> > >  		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> > >  		const struct btf_type *var_type = btf_type_by_id(btf, var->type);
> > >
> > > -		if (!__btf_type_is_struct(var_type))
> > > -			continue;
> > > -		if (var_type->size != sz)
> > > +		off = vsi->offset;
> > > +
> > > +		if (name && strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
> > >  			continue;
> > >  		if (vsi->size != sz)
> > >  			continue;
> > > -		if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
> > > -			continue;
> > > -		if (off != -ENOENT)
> > > -			/* only one such field is allowed */
> > > -			return -E2BIG;
> > > -		off = vsi->offset;
> > >  		if (off % align)
> > >  			return -EINVAL;
> > > +
> > > +		switch (field_type) {
> > > +		case BTF_FIELD_SPIN_LOCK:
> > > +		case BTF_FIELD_TIMER:
> > > +			ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > > +			if (ret < 0)
> > > +				return ret;
> > > +			break;
> > > +		default:
> > > +			return -EFAULT;
> > > +		}
> > >  	}
> > > -	return off;
> > > +	return 0;
> > >  }
> > >
> > >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > > -			  const char *name, int sz, int align)
> > > +			  const char *name, int sz, int align, int field_type,
> > > +			  struct btf_field_info *info)
> > >  {
> > > -
> > >  	if (__btf_type_is_struct(t))
> > > -		return btf_find_struct_field(btf, t, name, sz, align);
> > > +		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> > >  	else if (btf_type_is_datasec(t))
> > > -		return btf_find_datasec_var(btf, t, name, sz, align);
> > > +		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> > >  	return -EINVAL;
> > >  }
> > >
> > > @@ -3235,16 +3273,30 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > >   */
> > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> > >  {
> > > -	return btf_find_field(btf, t, "bpf_spin_lock",
> > > -			      sizeof(struct bpf_spin_lock),
> > > -			      __alignof__(struct bpf_spin_lock));
> > > +	struct btf_field_info info = { .off = -ENOENT };
> > > +	int ret;
> > > +
> > > +	ret = btf_find_field(btf, t, "bpf_spin_lock",
> > > +			     sizeof(struct bpf_spin_lock),
> > > +			     __alignof__(struct bpf_spin_lock),
> > > +			     BTF_FIELD_SPIN_LOCK, &info);
> > > +	if (ret < 0)
> > > +		return ret;
> > > +	return info.off;
> > >  }
> > >
> > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> > >  {
> > > -	return btf_find_field(btf, t, "bpf_timer",
> > > -			      sizeof(struct bpf_timer),
> > > -			      __alignof__(struct bpf_timer));
> > > +	struct btf_field_info info = { .off = -ENOENT };
> > > +	int ret;
> > > +
> > > +	ret = btf_find_field(btf, t, "bpf_timer",
> > > +			     sizeof(struct bpf_timer),
> > > +			     __alignof__(struct bpf_timer),
> > > +			     BTF_FIELD_TIMER, &info);
> > > +	if (ret < 0)
> > > +		return ret;
> > > +	return info.off;
> > >  }
> > >
> > >  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> > > --
> > > 2.35.1
> > >
> >
> > --
>
> --
> Kartikeya

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map
  2022-03-19 18:52     ` Kumar Kartikeya Dwivedi
@ 2022-03-19 21:17       ` Alexei Starovoitov
  2022-03-19 21:39         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 21:17 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 12:22:51AM +0530, Kumar Kartikeya Dwivedi wrote:
> On Sat, Mar 19, 2022 at 11:45:38PM IST, Alexei Starovoitov wrote:
> > On Thu, Mar 17, 2022 at 05:29:45PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > in a map value as holds a PTR_TO_BTF_ID stored by a BPF program during
> > > its invocation. Storing to such a kptr, BPF program's PTR_TO_BTF_ID
> > > register must have the same type as in the map value's BTF, and loading
> > > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > > kernel BTF and BTF ID.
> > >
> > > Such kptr are unreferenced, i.e. by the time another invocation of the
> > > BPF program loads this pointer, the object which the pointer points to
> > > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > patched to PROBE_MEM loads by the verifier, it would be safe to allow the
> > > user to still access such an invalid pointer, but passing such pointers into
> > > BPF helpers and kfuncs should not be permitted. A future patch in this
> > > series will close this gap.
> > >
> > > The flexibility offered by allowing programs to dereference such invalid
> > > pointers while being safe at runtime frees the verifier from doing
> > > complex lifetime tracking. As long as the user may ensure that the
> > > object remains valid, it can ensure data read by it from the kernel
> > > object is valid.
> > >
> > > The user indicates that a certain pointer must be treated as kptr
> > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > > information is recorded in the object BTF which will be passed into the
> > > kernel by way of map's BTF information. The name and kind from the map
> > > value BTF is used to look up the in-kernel type, and the actual BTF and
> > > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > > now, only storing pointers to structs is permitted.
> > >
> > > An example of this specification is shown below:
> > >
> > > 	#define __kptr __attribute__((btf_type_tag("kptr")))
> > >
> > > 	struct map_value {
> > > 		...
> > > 		struct task_struct __kptr *task;
> > > 		...
> > > 	};
> > >
> > > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > > task_struct into the map, and then load it later.
> > >
> > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > > the verifier cannot know whether the value is NULL or not statically, so it
> > > must treat all potential loads at that map value offset as loading a
> > > possibly NULL pointer.
> > >
> > > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > > same BTF type as specified in the map BTF. The access size must always
> > > be BPF_DW.
> > >
> > > For the map in map support, the kptr_off_tab for outer map is copied
> > > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > > instead of introducing a refcount to kptr_off_tab, because the copy only
> > > needs to be done when parameterizing using inner_map_fd in the map in map
> > > case, hence would be unnecessary for all other users.
> > >
> > > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > > having kptr, similar to the bpf_timer case.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  include/linux/bpf.h     |  29 +++++-
> > >  include/linux/btf.h     |   2 +
> > >  kernel/bpf/btf.c        | 151 +++++++++++++++++++++++++----
> > >  kernel/bpf/map_in_map.c |   5 +-
> > >  kernel/bpf/syscall.c    | 110 ++++++++++++++++++++-
> > >  kernel/bpf/verifier.c   | 207 ++++++++++++++++++++++++++++++++--------
> > >  6 files changed, 442 insertions(+), 62 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 88449fbbe063..f35920d279dd 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -155,6 +155,22 @@ struct bpf_map_ops {
> > >  	const struct bpf_iter_seq_info *iter_seq_info;
> > >  };
> > >
> > > +enum {
> > > +	/* Support at most 8 pointers in a BPF map value */
> > > +	BPF_MAP_VALUE_OFF_MAX = 8,
> > > +};
> > > +
> > > +struct bpf_map_value_off_desc {
> > > +	u32 offset;
> > > +	u32 btf_id;
> > > +	struct btf *btf;
> > > +};
> > > +
> > > +struct bpf_map_value_off {
> > > +	u32 nr_off;
> > > +	struct bpf_map_value_off_desc off[];
> > > +};
> > > +
> > >  struct bpf_map {
> > >  	/* The first two cachelines with read-mostly members of which some
> > >  	 * are also accessed in fast-path (e.g. ops, max_entries).
> > > @@ -171,6 +187,7 @@ struct bpf_map {
> > >  	u64 map_extra; /* any per-map-type extra fields */
> > >  	u32 map_flags;
> > >  	int spin_lock_off; /* >=0 valid offset, <0 error */
> > > +	struct bpf_map_value_off *kptr_off_tab;
> > >  	int timer_off; /* >=0 valid offset, <0 error */
> > >  	u32 id;
> > >  	int numa_node;
> > > @@ -184,7 +201,7 @@ struct bpf_map {
> > >  	char name[BPF_OBJ_NAME_LEN];
> > >  	bool bypass_spec_v1;
> > >  	bool frozen; /* write-once; write-protected by freeze_mutex */
> > > -	/* 14 bytes hole */
> > > +	/* 6 bytes hole */
> > >
> > >  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
> > >  	 * particularly with refcounting.
> > > @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
> > >  	return map->timer_off >= 0;
> > >  }
> > >
> > > +static inline bool map_value_has_kptr(const struct bpf_map *map)
> > > +{
> > > +	return !IS_ERR_OR_NULL(map->kptr_off_tab);
> > > +}
> > > +
> > >  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > >  {
> > >  	if (unlikely(map_value_has_spin_lock(map)))
> > > @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
> > >  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
> > >  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
> > >
> > > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> > > +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> > > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> > > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > > +
> > >  struct bpf_map *bpf_map_get(u32 ufd);
> > >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> > >  struct bpf_map *__bpf_map_get(struct fd f);
> > > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > > index 36bc09b8e890..5b578dc81c04 100644
> > > --- a/include/linux/btf.h
> > > +++ b/include/linux/btf.h
> > > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> > >  			   u32 expected_offset, u32 expected_size);
> > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > > +					const struct btf_type *t);
> > >  bool btf_type_is_void(const struct btf_type *t);
> > >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> > >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index 5b2824332880..9ac9364ef533 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -3164,33 +3164,79 @@ static void btf_struct_log(struct btf_verifier_env *env,
> > >  enum {
> > >  	BTF_FIELD_SPIN_LOCK,
> > >  	BTF_FIELD_TIMER,
> > > +	BTF_FIELD_KPTR,
> > > +};
> > > +
> > > +enum {
> > > +	BTF_FIELD_IGNORE = 0,
> > > +	BTF_FIELD_FOUND  = 1,
> > >  };
> > >
> > >  struct btf_field_info {
> > > +	const struct btf_type *type;
> > >  	u32 off;
> > >  };
> > >
> > >  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > > -				 u32 off, int sz, struct btf_field_info *info)
> > > +				 u32 off, int sz, struct btf_field_info *info,
> > > +				 int info_cnt, int idx)
> > >  {
> > >  	if (!__btf_type_is_struct(t))
> > > -		return 0;
> > > +		return BTF_FIELD_IGNORE;
> > >  	if (t->size != sz)
> > > -		return 0;
> > > -	if (info->off != -ENOENT)
> > > -		/* only one such field is allowed */
> > > +		return BTF_FIELD_IGNORE;
> > > +	if (idx >= info_cnt)
> >
> > No need to pass info_cnt, idx into this function.
> > Move idx >= info_cnt check into the caller and let
> > caller do 'info++' and pass that.
> 
> That was what I did initially, but this check actually needs to happen after we
> see that the field is of interest (i.e. not ignored by btf_find_field_*). Doing
> it in the caller limits the total number of fields to info_cnt. Moving those
> checks out into the caller may be the other option, but I didn't like that. I
> can add a comment if it makes things clearer.

don't increment info unconditionally?
only when field is found.
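
e.g. (just a sketch) write into a local and only commit it when something was
actually found:

	struct btf_field_info tmp;

	ret = btf_find_field_struct(btf, member_type, off, sz, &tmp);
	if (ret < 0)
		return ret;
	if (ret == BTF_FIELD_FOUND) {
		if (idx >= info_cnt)
			return -E2BIG;
		info[idx++] = tmp;
	}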

> 
> > This function will simply write into 'info'.
> >
> > >  		return -E2BIG;
> > > +	info[idx].off = off;
> > >  	info->off = off;
> >
> > This can't be right.
> >
> 
> Ouch, thanks for catching this.
> 
> > > -	return 0;
> > > +	return BTF_FIELD_FOUND;
> > > +}
> > > +
> > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > +			       u32 off, int sz, struct btf_field_info *info,
> > > +			       int info_cnt, int idx)
> > > +{
> > > +	bool kptr_tag = false;
> > > +
> > > +	/* For PTR, sz is always == 8 */
> > > +	if (!btf_type_is_ptr(t))
> > > +		return BTF_FIELD_IGNORE;
> > > +	t = btf_type_by_id(btf, t->type);
> > > +
> > > +	while (btf_type_is_type_tag(t)) {
> > > +		if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off))) {
> > > +			/* repeated tag */
> > > +			if (kptr_tag)
> > > +				return -EEXIST;
> > > +			kptr_tag = true;
> > > +		}
> > > +		/* Look for next tag */
> > > +		t = btf_type_by_id(btf, t->type);
> > > +	}
> >
> > There is no need for while() loop and 4 bool kptr_*_tag checks.
> > Just do:
> >   if (!btf_type_is_type_tag(t))
> >      return BTF_FIELD_IGNORE;
> >   /* check next tag */
> >   if (btf_type_is_type_tag(btf_type_by_id(btf, t->type))
> >      return -EINVAL;
> 
> But there may be other tags also in the future? Then on older kernels it would
> return an error, instead of skipping over them and ignoring them.

and that would be correct behavior.
If there is a tag it should be meaningful. The kernel shouldn't ignore it.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 04/15] bpf: Allow storing referenced kptr in map
  2022-03-19 18:59     ` Kumar Kartikeya Dwivedi
@ 2022-03-19 21:23       ` Alexei Starovoitov
  2022-03-19 21:43         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 21:23 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 12:29:04AM +0530, Kumar Kartikeya Dwivedi wrote:
> > >
> > >  	if (is_release_function(func_id)) {
> > > -		err = release_reference(env, meta.ref_obj_id);
> > > +		err = -EINVAL;
> > > +		if (meta.ref_obj_id)
> > > +			err = release_reference(env, meta.ref_obj_id);
> > > +		/* Only bpf_kptr_xchg is a release function that accepts a
> > > +		 * possibly NULL reg, hence meta.ref_obj_id can only be unset
> > > +		 * for it.
> >
> > Could you rephrase the comment? I'm not following what it's trying to convey.
> >
> 
> All existing release helpers never take a NULL register, so their
> meta.ref_obj_id will never be unset, but bpf_kptr_xchg can, so it needs some
> special handling. In check_func_arg, when it jumps to skip_type_check label,
> reg->ref_obj_id won't be set for NULL value.

I still don't follow.
What do you mean 'unset meta.ref_obj_id' ?
It's either set or not.
meta->ref_obj_id will stay zero when arg == NULL.
Above 'if (meta.ref_obj_id)' makes sense.
But the code below with extra func_id check looks like defensive programming again.

> > > +		 */
> > > +		else if (func_id == BPF_FUNC_kptr_xchg)
> > > +			err = 0;

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 05/15] bpf: Allow storing percpu kptr in map
  2022-03-19 19:04     ` Kumar Kartikeya Dwivedi
@ 2022-03-19 21:26       ` Alexei Starovoitov
  2022-03-19 21:45         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 21:26 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Hao Luo, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 12:34:09AM +0530, Kumar Kartikeya Dwivedi wrote:
> On Sun, Mar 20, 2022 at 12:00:28AM IST, Alexei Starovoitov wrote:
> > On Thu, Mar 17, 2022 at 05:29:47PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > Make adjustments to the code to allow storing percpu PTR_TO_BTF_ID in a
> > > map. Similar to 'kptr_ref' tag, a new 'kptr_percpu' allows tagging types
> > > of pointers accepting stores of such register types. On load, verifier
> > > marks destination register as having type PTR_TO_BTF_ID | MEM_PERCPU |
> > > PTR_MAYBE_NULL.
> > >
> > > Cc: Hao Luo <haoluo@google.com>
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  include/linux/bpf.h   |  3 ++-
> > >  kernel/bpf/btf.c      | 13 ++++++++++---
> > >  kernel/bpf/verifier.c | 26 +++++++++++++++++++++-----
> > >  3 files changed, 33 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 702aa882e4a3..433f5cb161cf 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -161,7 +161,8 @@ enum {
> > >  };
> > >
> > >  enum {
> > > -	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> > > +	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
> > > +	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),
> >
> > What is the use case for storing __percpu pointer into a map?
> 
> No specific use case for me, just thought it would be useful, especially now
> that __percpu tag is understood by verifier for kernel BTF, so it may also refer
> to dynamically allocated per-CPU memory, not just global percpu variables. But
> fine with dropping both this and user kptr if you don't feel like keeping them.

I prefer to drop it for now.
The patch is trivial but kptr_percpu tag would stay forever.
Maybe we can allow storing percpu pointers in a map with just kptr tag.
The verifier should be able to understand from btf whether that pointer
is percpu or not.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic
  2022-03-19 20:06       ` Kumar Kartikeya Dwivedi
@ 2022-03-19 21:30         ` Alexei Starovoitov
  0 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 21:30 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 01:36:41AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > >  			return -EINVAL;
> > > > +
> > > > +		switch (field_type) {
> > > > +		case BTF_FIELD_SPIN_LOCK:
> > > > +		case BTF_FIELD_TIMER:
> > >
> > > Since spin_lock vs timer is passed into btf_find_struct_field() as field_type
> > > argument there is no need to pass name, sz, align from the caller.
> > > Pls make btf_find_spin_lock() to pass BTF_FIELD_SPIN_LOCK only
> > > and in the above code do something like:
> > >  switch (field_type) {
> > >  case BTF_FIELD_SPIN_LOCK:
> > >      name = "bpf_spin_lock";
> > >      sz = ...
> > >      break;
> > >  case BTF_FIELD_TIMER:
> > >      name = "bpf_timer";
> > >      sz = ...
> > >      break;
> > >  }
> >
> > Would doing this in btf_find_field be better? Then we set these once instead of
> > doing it twice in btf_find_struct_field, and btf_find_datasec_var.

yeah. probably.

> >
> > >  switch (field_type) {
> > >  case BTF_FIELD_SPIN_LOCK:
> > >  case BTF_FIELD_TIMER:
> > > 	if (!__btf_type_is_struct(member_type))
> > > 		continue;
> > > 	if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
> > >         ...
> > >         btf_find_field_struct(btf, member_type, off, sz, info);
> > >  }
> > >
> > > It will cleanup the later patch which passes NULL, sizeof(u64), alignof(u64)
> > > only to pass something into the function.
> > > With above suggestion it wouldn't need to pass dummy args. BTF_FIELD_KPTR will be enough.
> > >
> 
> Just to be clear, for the kptr case we still use size and align, only name is
> optional. size is used for datasec_var call, align is used in both struct_field
> and datasec_var. So I'm not sure whether moving it around has much effect,
> instead of the caller it will now be set based on field_type inside
> btf_find_field.

There is no use case to do BTF_FIELD_KPTR, sizeof(u64) and BTF_FIELD_KPTR, sizeof(u32), right?
So best to avoid such mistakes.
In other words, consider every function to be a uapi.
Not in the sense that it can never change, but from the point of view that you wouldn't
want user space to specify all the details for the kernel when BTF_FIELD_KPTR is enough to figure out the rest.
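
i.e. have btf_find_field() do the field_type -> name/sz/align mapping once,
roughly like below (untested sketch; the exact signatures and where the
struct/datasec split happens are a guess):

	static int btf_find_field(const struct btf *btf, const struct btf_type *t,
				  int field_type, struct btf_field_info *info)
	{
		const char *name;
		int sz, align;

		switch (field_type) {
		case BTF_FIELD_SPIN_LOCK:
			name  = "bpf_spin_lock";
			sz    = sizeof(struct bpf_spin_lock);
			align = __alignof__(struct bpf_spin_lock);
			break;
		case BTF_FIELD_TIMER:
			name  = "bpf_timer";
			sz    = sizeof(struct bpf_timer);
			align = __alignof__(struct bpf_timer);
			break;
		case BTF_FIELD_KPTR:
			name  = NULL;	/* matched via the "kptr" type tag, not by name */
			sz    = sizeof(u64);
			align = __alignof__(u64);
			break;
		default:
			return -EFAULT;
		}

		if (__btf_type_is_struct(t))
			return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
		else if (btf_type_is_datasec(t))
			return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
		return -EINVAL;
	}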

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map
  2022-03-19 21:17       ` Alexei Starovoitov
@ 2022-03-19 21:39         ` Kumar Kartikeya Dwivedi
  2022-03-19 21:50           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 21:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 02:47:54AM IST, Alexei Starovoitov wrote:
> On Sun, Mar 20, 2022 at 12:22:51AM +0530, Kumar Kartikeya Dwivedi wrote:
> > On Sat, Mar 19, 2022 at 11:45:38PM IST, Alexei Starovoitov wrote:
> > > On Thu, Mar 17, 2022 at 05:29:45PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > > in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> > > > its invocation. When storing to such a kptr, the BPF program's PTR_TO_BTF_ID
> > > > register must have the same type as in the map value's BTF, and loading
> > > > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > > > kernel BTF and BTF ID.
> > > >
> > > > Such kptrs are unreferenced, i.e. by the time another invocation of the
> > > > BPF program loads this pointer, the object which the pointer points to
> > > > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > > patched to PROBE_MEM loads by the verifier, it would be safe to allow user
> > > > to still access such invalid pointer, but passing such pointers into
> > > > BPF helpers and kfuncs should not be permitted. A future patch in this
> > > > series will close this gap.
> > > >
> > > > The flexibility offered by allowing programs to dereference such invalid
> > > > pointers while being safe at runtime frees the verifier from doing
> > > > complex lifetime tracking. As long as the user may ensure that the
> > > > object remains valid, it can ensure data read by it from the kernel
> > > > object is valid.
> > > >
> > > > The user indicates that a certain pointer must be treated as kptr
> > > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > > > information is recorded in the object BTF which will be passed into the
> > > > kernel by way of map's BTF information. The name and kind from the map
> > > > value BTF is used to look up the in-kernel type, and the actual BTF and
> > > > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > > > now, only storing pointers to structs is permitted.
> > > >
> > > > An example of this specification is shown below:
> > > >
> > > > 	#define __kptr __attribute__((btf_type_tag("kptr")))
> > > >
> > > > 	struct map_value {
> > > > 		...
> > > > 		struct task_struct __kptr *task;
> > > > 		...
> > > > 	};
> > > >
> > > > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > > > task_struct into the map, and then load it later.
> > > >
> > > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > > > the verifier cannot know whether the value is NULL or not statically, it
> > > > must treat all potential loads at that map value offset as loading a
> > > > possibly NULL pointer.
> > > >
> > > > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > > > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > > > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > > > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > > > same BTF type as specified in the map BTF. The access size must always
> > > > be BPF_DW.
> > > >
> > > > For the map in map support, the kptr_off_tab for outer map is copied
> > > > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > > > instead of introducing a refcount to kptr_off_tab, because the copy only
> > > > needs to be done when parameterizing using inner_map_fd in the map in map
> > > > case, hence would be unnecessary for all other users.
> > > >
> > > > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > > > having kptr, similar to the bpf_timer case.
> > > >
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > >  include/linux/bpf.h     |  29 +++++-
> > > >  include/linux/btf.h     |   2 +
> > > >  kernel/bpf/btf.c        | 151 +++++++++++++++++++++++++----
> > > >  kernel/bpf/map_in_map.c |   5 +-
> > > >  kernel/bpf/syscall.c    | 110 ++++++++++++++++++++-
> > > >  kernel/bpf/verifier.c   | 207 ++++++++++++++++++++++++++++++++--------
> > > >  6 files changed, 442 insertions(+), 62 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 88449fbbe063..f35920d279dd 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -155,6 +155,22 @@ struct bpf_map_ops {
> > > >  	const struct bpf_iter_seq_info *iter_seq_info;
> > > >  };
> > > >
> > > > +enum {
> > > > +	/* Support at most 8 pointers in a BPF map value */
> > > > +	BPF_MAP_VALUE_OFF_MAX = 8,
> > > > +};
> > > > +
> > > > +struct bpf_map_value_off_desc {
> > > > +	u32 offset;
> > > > +	u32 btf_id;
> > > > +	struct btf *btf;
> > > > +};
> > > > +
> > > > +struct bpf_map_value_off {
> > > > +	u32 nr_off;
> > > > +	struct bpf_map_value_off_desc off[];
> > > > +};
> > > > +
> > > >  struct bpf_map {
> > > >  	/* The first two cachelines with read-mostly members of which some
> > > >  	 * are also accessed in fast-path (e.g. ops, max_entries).
> > > > @@ -171,6 +187,7 @@ struct bpf_map {
> > > >  	u64 map_extra; /* any per-map-type extra fields */
> > > >  	u32 map_flags;
> > > >  	int spin_lock_off; /* >=0 valid offset, <0 error */
> > > > +	struct bpf_map_value_off *kptr_off_tab;
> > > >  	int timer_off; /* >=0 valid offset, <0 error */
> > > >  	u32 id;
> > > >  	int numa_node;
> > > > @@ -184,7 +201,7 @@ struct bpf_map {
> > > >  	char name[BPF_OBJ_NAME_LEN];
> > > >  	bool bypass_spec_v1;
> > > >  	bool frozen; /* write-once; write-protected by freeze_mutex */
> > > > -	/* 14 bytes hole */
> > > > +	/* 6 bytes hole */
> > > >
> > > >  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
> > > >  	 * particularly with refcounting.
> > > > @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
> > > >  	return map->timer_off >= 0;
> > > >  }
> > > >
> > > > +static inline bool map_value_has_kptr(const struct bpf_map *map)
> > > > +{
> > > > +	return !IS_ERR_OR_NULL(map->kptr_off_tab);
> > > > +}
> > > > +
> > > >  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > > >  {
> > > >  	if (unlikely(map_value_has_spin_lock(map)))
> > > > @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
> > > >  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
> > > >  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
> > > >
> > > > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> > > > +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> > > > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> > > > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > > > +
> > > >  struct bpf_map *bpf_map_get(u32 ufd);
> > > >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> > > >  struct bpf_map *__bpf_map_get(struct fd f);
> > > > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > > > index 36bc09b8e890..5b578dc81c04 100644
> > > > --- a/include/linux/btf.h
> > > > +++ b/include/linux/btf.h
> > > > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> > > >  			   u32 expected_offset, u32 expected_size);
> > > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> > > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > > > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > > > +					const struct btf_type *t);
> > > >  bool btf_type_is_void(const struct btf_type *t);
> > > >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> > > >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > index 5b2824332880..9ac9364ef533 100644
> > > > --- a/kernel/bpf/btf.c
> > > > +++ b/kernel/bpf/btf.c
> > > > @@ -3164,33 +3164,79 @@ static void btf_struct_log(struct btf_verifier_env *env,
> > > >  enum {
> > > >  	BTF_FIELD_SPIN_LOCK,
> > > >  	BTF_FIELD_TIMER,
> > > > +	BTF_FIELD_KPTR,
> > > > +};
> > > > +
> > > > +enum {
> > > > +	BTF_FIELD_IGNORE = 0,
> > > > +	BTF_FIELD_FOUND  = 1,
> > > >  };
> > > >
> > > >  struct btf_field_info {
> > > > +	const struct btf_type *type;
> > > >  	u32 off;
> > > >  };
> > > >
> > > >  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > > > -				 u32 off, int sz, struct btf_field_info *info)
> > > > +				 u32 off, int sz, struct btf_field_info *info,
> > > > +				 int info_cnt, int idx)
> > > >  {
> > > >  	if (!__btf_type_is_struct(t))
> > > > -		return 0;
> > > > +		return BTF_FIELD_IGNORE;
> > > >  	if (t->size != sz)
> > > > -		return 0;
> > > > -	if (info->off != -ENOENT)
> > > > -		/* only one such field is allowed */
> > > > +		return BTF_FIELD_IGNORE;
> > > > +	if (idx >= info_cnt)
> > >
> > > No need to pass info_cnt, idx into this function.
> > > Move idx >= info_cnt check into the caller and let
> > > caller do 'info++' and pass that.
> >
> > That was what I did initially, but this check actually needs to happen after we
> > see that the field is of interest (i.e. not ignored by btf_find_field_*). Doing
> > it in caller limits total fields to info_cnt. Moving those checks out into the
> > caller may be the other option, but I didn't like that. I can add a comment if
> > it makes things clear.
>
> don't increment info unconditionally?
> only when field is found.
>

Right now the j++ happens only when we find a field. What I'm saying is that if
you move the idx >= info_cnt check (idx is j in the caller) out into the loop, a
later iteration will return an error even when the member is not a timer,
spin_lock, or kptr field. So the actual check is done inside the function, after
we know the member really is one of those and we have no more room left to
record its info.

e.g. there can be a case where we end up at j == info_cnt (all infos used) but
still find another kptr; we should only return an error on seeing j == info_cnt
once we know that the field is a kptr, because only then have we hit the total
limit of kptrs in a map value.
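
A made-up example of the case I mean (struct foo is just a placeholder type,
__kptr as defined in the patch), with info_cnt == 8:

	struct map_value {
		struct foo __kptr *ptr1;
		/* ... ptr2 through ptr7 ... */
		struct foo __kptr *ptr8;	/* j reaches info_cnt here */
		int data;			/* not a field of interest, must be
						 * ignored, not rejected with -E2BIG
						 */
	};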

> [...]

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 04/15] bpf: Allow storing referenced kptr in map
  2022-03-19 21:23       ` Alexei Starovoitov
@ 2022-03-19 21:43         ` Kumar Kartikeya Dwivedi
  2022-03-20  0:57           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 21:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 02:53:20AM IST, Alexei Starovoitov wrote:
> On Sun, Mar 20, 2022 at 12:29:04AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > >
> > > >  	if (is_release_function(func_id)) {
> > > > -		err = release_reference(env, meta.ref_obj_id);
> > > > +		err = -EINVAL;
> > > > +		if (meta.ref_obj_id)
> > > > +			err = release_reference(env, meta.ref_obj_id);
> > > > +		/* Only bpf_kptr_xchg is a release function that accepts a
> > > > +		 * possibly NULL reg, hence meta.ref_obj_id can only be unset
> > > > +		 * for it.
> > >
> > > Could you rephrase the comment? I'm not following what it's trying to convey.
> > >
> >
> > All existing release helpers never take a NULL register, so their
> > meta.ref_obj_id will never be unset, but bpf_kptr_xchg can, so it needs some
> > special handling. In check_func_arg, when it jumps to skip_type_check label,
> > reg->ref_obj_id won't be set for NULL value.
>
> I still don't follow.
> What do you mean 'unset meta.ref_obj_id' ?
> It's either set or not.

By unset I meant it is the default (0).

> meta->ref_obj_id will stay zero when arg == NULL.
> Above 'if (meta.ref_obj_id)' makes sense.
> But the code below with extra func_id check looks like defensive programming again.
>

Ok, so I'll just write it like:

if (is_release_function(...) && meta.ref_obj_id) {
	err = release_reference(...);
	if (err)
		...
}

> > > > +		 */
> > > > +		else if (func_id == BPF_FUNC_kptr_xchg)
> > > > +			err = 0;

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 05/15] bpf: Allow storing percpu kptr in map
  2022-03-19 21:26       ` Alexei Starovoitov
@ 2022-03-19 21:45         ` Kumar Kartikeya Dwivedi
  2022-03-19 23:01           ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 21:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Hao Luo, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 02:56:20AM IST, Alexei Starovoitov wrote:
> On Sun, Mar 20, 2022 at 12:34:09AM +0530, Kumar Kartikeya Dwivedi wrote:
> > On Sun, Mar 20, 2022 at 12:00:28AM IST, Alexei Starovoitov wrote:
> > > On Thu, Mar 17, 2022 at 05:29:47PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > Make adjustments to the code to allow storing percpu PTR_TO_BTF_ID in a
> > > > map. Similar to 'kptr_ref' tag, a new 'kptr_percpu' allows tagging types
> > > > of pointers accepting stores of such register types. On load, verifier
> > > > marks destination register as having type PTR_TO_BTF_ID | MEM_PERCPU |
> > > > PTR_MAYBE_NULL.
> > > >
> > > > Cc: Hao Luo <haoluo@google.com>
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > >  include/linux/bpf.h   |  3 ++-
> > > >  kernel/bpf/btf.c      | 13 ++++++++++---
> > > >  kernel/bpf/verifier.c | 26 +++++++++++++++++++++-----
> > > >  3 files changed, 33 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 702aa882e4a3..433f5cb161cf 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -161,7 +161,8 @@ enum {
> > > >  };
> > > >
> > > >  enum {
> > > > -	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> > > > +	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
> > > > +	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),
> > >
> > > What is the use case for storing __percpu pointer into a map?
> >
> > No specific use case for me, just thought it would be useful, especially now
> > that __percpu tag is understood by verifier for kernel BTF, so it may also refer
> > to dynamically allocated per-CPU memory, not just global percpu variables. But
> > fine with dropping both this and user kptr if you don't feel like keeping them.
>
> I prefer to drop it for now.
> The patch is trivial but kptr_percpu tag would stay forever.

Ok, I'll drop both this and user kptr for now.

> Maybe we can allow storing percpu pointers in a map with just kptr tag.
> The verifier should be able to understand from btf whether that pointer
> is percpu or not.

This won't work (unless I missed something): it is possible to see the type when
a store is being done, but we cannot know whether the pointer was percpu or not
when doing a load (which is needed to decide whether it will be marked with
MEM_PERCPU, so that the user has to call bpf_this_cpu_ptr or bpf_per_cpu_ptr to
obtain the actual pointer). So some extra tagging is needed.

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map
  2022-03-19 21:39         ` Kumar Kartikeya Dwivedi
@ 2022-03-19 21:50           ` Kumar Kartikeya Dwivedi
  2022-03-19 22:57             ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-19 21:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 03:09:20AM IST, Kumar Kartikeya Dwivedi wrote:
> On Sun, Mar 20, 2022 at 02:47:54AM IST, Alexei Starovoitov wrote:
> > On Sun, Mar 20, 2022 at 12:22:51AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > On Sat, Mar 19, 2022 at 11:45:38PM IST, Alexei Starovoitov wrote:
> > > > On Thu, Mar 17, 2022 at 05:29:45PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > > > in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> > > > > its invocation. When storing to such a kptr, the BPF program's PTR_TO_BTF_ID
> > > > > register must have the same type as in the map value's BTF, and loading
> > > > > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > > > > kernel BTF and BTF ID.
> > > > >
> > > > > Such kptrs are unreferenced, i.e. by the time another invocation of the
> > > > > BPF program loads this pointer, the object which the pointer points to
> > > > > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > > > patched to PROBE_MEM loads by the verifier, it would be safe to allow user
> > > > > to still access such invalid pointer, but passing such pointers into
> > > > > BPF helpers and kfuncs should not be permitted. A future patch in this
> > > > > series will close this gap.
> > > > >
> > > > > The flexibility offered by allowing programs to dereference such invalid
> > > > > pointers while being safe at runtime frees the verifier from doing
> > > > > complex lifetime tracking. As long as the user may ensure that the
> > > > > object remains valid, it can ensure data read by it from the kernel
> > > > > object is valid.
> > > > >
> > > > > The user indicates that a certain pointer must be treated as kptr
> > > > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > > > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > > > > information is recorded in the object BTF which will be passed into the
> > > > > kernel by way of map's BTF information. The name and kind from the map
> > > > > value BTF is used to look up the in-kernel type, and the actual BTF and
> > > > > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > > > > now, only storing pointers to structs is permitted.
> > > > >
> > > > > An example of this specification is shown below:
> > > > >
> > > > > 	#define __kptr __attribute__((btf_type_tag("kptr")))
> > > > >
> > > > > 	struct map_value {
> > > > > 		...
> > > > > 		struct task_struct __kptr *task;
> > > > > 		...
> > > > > 	};
> > > > >
> > > > > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > > > > task_struct into the map, and then load it later.
> > > > >
> > > > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > > > > the verifier cannot know whether the value is NULL or not statically, it
> > > > > must treat all potential loads at that map value offset as loading a
> > > > > possibly NULL pointer.
> > > > >
> > > > > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > > > > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > > > > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > > > > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > > > > same BTF type as specified in the map BTF. The access size must always
> > > > > be BPF_DW.
> > > > >
> > > > > For the map in map support, the kptr_off_tab for outer map is copied
> > > > > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > > > > instead of introducing a refcount to kptr_off_tab, because the copy only
> > > > > needs to be done when parameterizing using inner_map_fd in the map in map
> > > > > case, hence would be unnecessary for all other users.
> > > > >
> > > > > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > > > > having kptr, similar to the bpf_timer case.
> > > > >
> > > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > > ---
> > > > >  include/linux/bpf.h     |  29 +++++-
> > > > >  include/linux/btf.h     |   2 +
> > > > >  kernel/bpf/btf.c        | 151 +++++++++++++++++++++++++----
> > > > >  kernel/bpf/map_in_map.c |   5 +-
> > > > >  kernel/bpf/syscall.c    | 110 ++++++++++++++++++++-
> > > > >  kernel/bpf/verifier.c   | 207 ++++++++++++++++++++++++++++++++--------
> > > > >  6 files changed, 442 insertions(+), 62 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > index 88449fbbe063..f35920d279dd 100644
> > > > > --- a/include/linux/bpf.h
> > > > > +++ b/include/linux/bpf.h
> > > > > @@ -155,6 +155,22 @@ struct bpf_map_ops {
> > > > >  	const struct bpf_iter_seq_info *iter_seq_info;
> > > > >  };
> > > > >
> > > > > +enum {
> > > > > +	/* Support at most 8 pointers in a BPF map value */
> > > > > +	BPF_MAP_VALUE_OFF_MAX = 8,
> > > > > +};
> > > > > +
> > > > > +struct bpf_map_value_off_desc {
> > > > > +	u32 offset;
> > > > > +	u32 btf_id;
> > > > > +	struct btf *btf;
> > > > > +};
> > > > > +
> > > > > +struct bpf_map_value_off {
> > > > > +	u32 nr_off;
> > > > > +	struct bpf_map_value_off_desc off[];
> > > > > +};
> > > > > +
> > > > >  struct bpf_map {
> > > > >  	/* The first two cachelines with read-mostly members of which some
> > > > >  	 * are also accessed in fast-path (e.g. ops, max_entries).
> > > > > @@ -171,6 +187,7 @@ struct bpf_map {
> > > > >  	u64 map_extra; /* any per-map-type extra fields */
> > > > >  	u32 map_flags;
> > > > >  	int spin_lock_off; /* >=0 valid offset, <0 error */
> > > > > +	struct bpf_map_value_off *kptr_off_tab;
> > > > >  	int timer_off; /* >=0 valid offset, <0 error */
> > > > >  	u32 id;
> > > > >  	int numa_node;
> > > > > @@ -184,7 +201,7 @@ struct bpf_map {
> > > > >  	char name[BPF_OBJ_NAME_LEN];
> > > > >  	bool bypass_spec_v1;
> > > > >  	bool frozen; /* write-once; write-protected by freeze_mutex */
> > > > > -	/* 14 bytes hole */
> > > > > +	/* 6 bytes hole */
> > > > >
> > > > >  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
> > > > >  	 * particularly with refcounting.
> > > > > @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
> > > > >  	return map->timer_off >= 0;
> > > > >  }
> > > > >
> > > > > +static inline bool map_value_has_kptr(const struct bpf_map *map)
> > > > > +{
> > > > > +	return !IS_ERR_OR_NULL(map->kptr_off_tab);
> > > > > +}
> > > > > +
> > > > >  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > > > >  {
> > > > >  	if (unlikely(map_value_has_spin_lock(map)))
> > > > > @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
> > > > >  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
> > > > >  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
> > > > >
> > > > > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> > > > > +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> > > > > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> > > > > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > > > > +
> > > > >  struct bpf_map *bpf_map_get(u32 ufd);
> > > > >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> > > > >  struct bpf_map *__bpf_map_get(struct fd f);
> > > > > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > > > > index 36bc09b8e890..5b578dc81c04 100644
> > > > > --- a/include/linux/btf.h
> > > > > +++ b/include/linux/btf.h
> > > > > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> > > > >  			   u32 expected_offset, u32 expected_size);
> > > > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> > > > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > > > > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > > > > +					const struct btf_type *t);
> > > > >  bool btf_type_is_void(const struct btf_type *t);
> > > > >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> > > > >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > > index 5b2824332880..9ac9364ef533 100644
> > > > > --- a/kernel/bpf/btf.c
> > > > > +++ b/kernel/bpf/btf.c
> > > > > @@ -3164,33 +3164,79 @@ static void btf_struct_log(struct btf_verifier_env *env,
> > > > >  enum {
> > > > >  	BTF_FIELD_SPIN_LOCK,
> > > > >  	BTF_FIELD_TIMER,
> > > > > +	BTF_FIELD_KPTR,
> > > > > +};
> > > > > +
> > > > > +enum {
> > > > > +	BTF_FIELD_IGNORE = 0,
> > > > > +	BTF_FIELD_FOUND  = 1,
> > > > >  };
> > > > >
> > > > >  struct btf_field_info {
> > > > > +	const struct btf_type *type;
> > > > >  	u32 off;
> > > > >  };
> > > > >
> > > > >  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > > > > -				 u32 off, int sz, struct btf_field_info *info)
> > > > > +				 u32 off, int sz, struct btf_field_info *info,
> > > > > +				 int info_cnt, int idx)
> > > > >  {
> > > > >  	if (!__btf_type_is_struct(t))
> > > > > -		return 0;
> > > > > +		return BTF_FIELD_IGNORE;
> > > > >  	if (t->size != sz)
> > > > > -		return 0;
> > > > > -	if (info->off != -ENOENT)
> > > > > -		/* only one such field is allowed */
> > > > > +		return BTF_FIELD_IGNORE;
> > > > > +	if (idx >= info_cnt)
> > > >
> > > > No need to pass info_cnt, idx into this function.
> > > > Move idx >= info_cnt check into the caller and let
> > > > caller do 'info++' and pass that.
> > >
> > > That was what I did initially, but this check actually needs to happen after we
> > > see that the field is of interest (i.e. not ignored by btf_find_field_*). Doing
> > > it in caller limits total fields to info_cnt. Moving those checks out into the
> > > caller may be the other option, but I didn't like that. I can add a comment if
> > > it makes things clear.
> >
> > don't increment info unconditionally?
> > only when field is found.
> >
>
> Right now the j++ happens only when we find a field. What I'm saying is that if
> you now move the idx (which is j in caller) >= info_cnt out into the loop, later
> iteration will return error even if it is not a timer, spin_lock, or kptr field,
> so actual check is done inside the function after we know that for this specific
> case it can only be a timer, spin_lock, or kptr, and we already have no more room
> to record their info.
>
> e.g. there can be a case when we end up at j == info_cnt (all infos used), but
> we still find a kptr, so we should only return error on seeing j == info_cnt
> once we know that field is a kptr, because we reached the total limit of kptrs
> in a map value.
>

One other option might be doing the check in caller _after_ we see
BTF_FIELD_FOUND, but this would require bumping the size to max + 1, so that the
write to info inside the function doesn't write past the end of the array.

> > [...]
>
> --
> Kartikeya

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map
  2022-03-19 21:50           ` Kumar Kartikeya Dwivedi
@ 2022-03-19 22:57             ` Alexei Starovoitov
  0 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 22:57 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 03:20:46AM +0530, Kumar Kartikeya Dwivedi wrote:
> On Sun, Mar 20, 2022 at 03:09:20AM IST, Kumar Kartikeya Dwivedi wrote:
> > On Sun, Mar 20, 2022 at 02:47:54AM IST, Alexei Starovoitov wrote:
> > > On Sun, Mar 20, 2022 at 12:22:51AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > On Sat, Mar 19, 2022 at 11:45:38PM IST, Alexei Starovoitov wrote:
> > > > > On Thu, Mar 17, 2022 at 05:29:45PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > > > > in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> > > > > > its invocation. When storing to such a kptr, the BPF program's PTR_TO_BTF_ID
> > > > > > register must have the same type as in the map value's BTF, and loading
> > > > > > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > > > > > kernel BTF and BTF ID.
> > > > > >
> > > > > > Such kptrs are unreferenced, i.e. by the time another invocation of the
> > > > > > BPF program loads this pointer, the object which the pointer points to
> > > > > > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > > > > patched to PROBE_MEM loads by the verifier, it would be safe to allow user
> > > > > > to still access such invalid pointer, but passing such pointers into
> > > > > > BPF helpers and kfuncs should not be permitted. A future patch in this
> > > > > > series will close this gap.
> > > > > >
> > > > > > The flexibility offered by allowing programs to dereference such invalid
> > > > > > pointers while being safe at runtime frees the verifier from doing
> > > > > > complex lifetime tracking. As long as the user may ensure that the
> > > > > > object remains valid, it can ensure data read by it from the kernel
> > > > > > object is valid.
> > > > > >
> > > > > > The user indicates that a certain pointer must be treated as kptr
> > > > > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > > > > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > > > > > information is recorded in the object BTF which will be passed into the
> > > > > > kernel by way of map's BTF information. The name and kind from the map
> > > > > > value BTF is used to look up the in-kernel type, and the actual BTF and
> > > > > > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > > > > > now, only storing pointers to structs is permitted.
> > > > > >
> > > > > > An example of this specification is shown below:
> > > > > >
> > > > > > 	#define __kptr __attribute__((btf_type_tag("kptr")))
> > > > > >
> > > > > > 	struct map_value {
> > > > > > 		...
> > > > > > 		struct task_struct __kptr *task;
> > > > > > 		...
> > > > > > 	};
> > > > > >
> > > > > > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > > > > > task_struct into the map, and then load it later.
> > > > > >
> > > > > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > > > > > the verifier cannot know whether the value is NULL or not statically, it
> > > > > > must treat all potential loads at that map value offset as loading a
> > > > > > possibly NULL pointer.
> > > > > >
> > > > > > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > > > > > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > > > > > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > > > > > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > > > > > same BTF type as specified in the map BTF. The access size must always
> > > > > > be BPF_DW.
> > > > > >
> > > > > > For the map in map support, the kptr_off_tab for outer map is copied
> > > > > > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > > > > > instead of introducing a refcount to kptr_off_tab, because the copy only
> > > > > > needs to be done when parameterizing using inner_map_fd in the map in map
> > > > > > case, hence would be unnecessary for all other users.
> > > > > >
> > > > > > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > > > > > having kptr, similar to the bpf_timer case.
> > > > > >
> > > > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > > > ---
> > > > > >  include/linux/bpf.h     |  29 +++++-
> > > > > >  include/linux/btf.h     |   2 +
> > > > > >  kernel/bpf/btf.c        | 151 +++++++++++++++++++++++++----
> > > > > >  kernel/bpf/map_in_map.c |   5 +-
> > > > > >  kernel/bpf/syscall.c    | 110 ++++++++++++++++++++-
> > > > > >  kernel/bpf/verifier.c   | 207 ++++++++++++++++++++++++++++++++--------
> > > > > >  6 files changed, 442 insertions(+), 62 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > > index 88449fbbe063..f35920d279dd 100644
> > > > > > --- a/include/linux/bpf.h
> > > > > > +++ b/include/linux/bpf.h
> > > > > > @@ -155,6 +155,22 @@ struct bpf_map_ops {
> > > > > >  	const struct bpf_iter_seq_info *iter_seq_info;
> > > > > >  };
> > > > > >
> > > > > > +enum {
> > > > > > +	/* Support at most 8 pointers in a BPF map value */
> > > > > > +	BPF_MAP_VALUE_OFF_MAX = 8,
> > > > > > +};
> > > > > > +
> > > > > > +struct bpf_map_value_off_desc {
> > > > > > +	u32 offset;
> > > > > > +	u32 btf_id;
> > > > > > +	struct btf *btf;
> > > > > > +};
> > > > > > +
> > > > > > +struct bpf_map_value_off {
> > > > > > +	u32 nr_off;
> > > > > > +	struct bpf_map_value_off_desc off[];
> > > > > > +};
> > > > > > +
> > > > > >  struct bpf_map {
> > > > > >  	/* The first two cachelines with read-mostly members of which some
> > > > > >  	 * are also accessed in fast-path (e.g. ops, max_entries).
> > > > > > @@ -171,6 +187,7 @@ struct bpf_map {
> > > > > >  	u64 map_extra; /* any per-map-type extra fields */
> > > > > >  	u32 map_flags;
> > > > > >  	int spin_lock_off; /* >=0 valid offset, <0 error */
> > > > > > +	struct bpf_map_value_off *kptr_off_tab;
> > > > > >  	int timer_off; /* >=0 valid offset, <0 error */
> > > > > >  	u32 id;
> > > > > >  	int numa_node;
> > > > > > @@ -184,7 +201,7 @@ struct bpf_map {
> > > > > >  	char name[BPF_OBJ_NAME_LEN];
> > > > > >  	bool bypass_spec_v1;
> > > > > >  	bool frozen; /* write-once; write-protected by freeze_mutex */
> > > > > > -	/* 14 bytes hole */
> > > > > > +	/* 6 bytes hole */
> > > > > >
> > > > > >  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
> > > > > >  	 * particularly with refcounting.
> > > > > > @@ -217,6 +234,11 @@ static inline bool map_value_has_timer(const struct bpf_map *map)
> > > > > >  	return map->timer_off >= 0;
> > > > > >  }
> > > > > >
> > > > > > +static inline bool map_value_has_kptr(const struct bpf_map *map)
> > > > > > +{
> > > > > > +	return !IS_ERR_OR_NULL(map->kptr_off_tab);
> > > > > > +}
> > > > > > +
> > > > > >  static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > > > > >  {
> > > > > >  	if (unlikely(map_value_has_spin_lock(map)))
> > > > > > @@ -1497,6 +1519,11 @@ void bpf_prog_put(struct bpf_prog *prog);
> > > > > >  void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
> > > > > >  void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
> > > > > >
> > > > > > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
> > > > > > +void bpf_map_free_kptr_off_tab(struct bpf_map *map);
> > > > > > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> > > > > > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > > > > > +
> > > > > >  struct bpf_map *bpf_map_get(u32 ufd);
> > > > > >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> > > > > >  struct bpf_map *__bpf_map_get(struct fd f);
> > > > > > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > > > > > index 36bc09b8e890..5b578dc81c04 100644
> > > > > > --- a/include/linux/btf.h
> > > > > > +++ b/include/linux/btf.h
> > > > > > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> > > > > >  			   u32 expected_offset, u32 expected_size);
> > > > > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> > > > > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > > > > > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > > > > > +					const struct btf_type *t);
> > > > > >  bool btf_type_is_void(const struct btf_type *t);
> > > > > >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> > > > > >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > > > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > > > index 5b2824332880..9ac9364ef533 100644
> > > > > > --- a/kernel/bpf/btf.c
> > > > > > +++ b/kernel/bpf/btf.c
> > > > > > @@ -3164,33 +3164,79 @@ static void btf_struct_log(struct btf_verifier_env *env,
> > > > > >  enum {
> > > > > >  	BTF_FIELD_SPIN_LOCK,
> > > > > >  	BTF_FIELD_TIMER,
> > > > > > +	BTF_FIELD_KPTR,
> > > > > > +};
> > > > > > +
> > > > > > +enum {
> > > > > > +	BTF_FIELD_IGNORE = 0,
> > > > > > +	BTF_FIELD_FOUND  = 1,
> > > > > >  };
> > > > > >
> > > > > >  struct btf_field_info {
> > > > > > +	const struct btf_type *type;
> > > > > >  	u32 off;
> > > > > >  };
> > > > > >
> > > > > >  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t,
> > > > > > -				 u32 off, int sz, struct btf_field_info *info)
> > > > > > +				 u32 off, int sz, struct btf_field_info *info,
> > > > > > +				 int info_cnt, int idx)
> > > > > >  {
> > > > > >  	if (!__btf_type_is_struct(t))
> > > > > > -		return 0;
> > > > > > +		return BTF_FIELD_IGNORE;
> > > > > >  	if (t->size != sz)
> > > > > > -		return 0;
> > > > > > -	if (info->off != -ENOENT)
> > > > > > -		/* only one such field is allowed */
> > > > > > +		return BTF_FIELD_IGNORE;
> > > > > > +	if (idx >= info_cnt)
> > > > >
> > > > > No need to pass info_cnt, idx into this function.
> > > > > Move idx >= info_cnt check into the caller and let
> > > > > caller do 'info++' and pass that.
> > > >
> > > > That was what I did initially, but this check actually needs to happen after we
> > > > see that the field is of interest (i.e. not ignored by btf_find_field_*). Doing
> > > > it in caller limits total fields to info_cnt. Moving those checks out into the
> > > > caller may be the other option, but I didn't like that. I can add a comment if
> > > > it makes things clear.
> > >
> > > don't increment info unconditionally?
> > > only when field is found.
> > >
> >
> > Right now the j++ happens only when we find a field. What I'm saying is that if
> > you now move the idx (which is j in caller) >= info_cnt out into the loop, later
> > iteration will return error even if it is not a timer, spin_lock, or kptr field,
> > so actual check is done inside the function after we know that for this specific
> > case it can only be a timer, spin_lock, or kptr, and we already have no more room
> > to record their info.
> >
> > e.g. there can be a case when we end up at j == info_cnt (all infos used), but
> > we still find a kptr, so we should only return error on seeing j == info_cnt
> > once we know that field is a kptr, because we reached the total limit of kptrs
> > in a map value.
> >
> 
> One other option might be doing the check in caller _after_ we see
> BTF_FIELD_FOUND, but this would require bumping the size to max + 1, so that the
> write to info inside the function doesn't write past the end of the array.

yep. Extra element in the array will work.
Or split btf_find_field_struct() into find and populate.
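
i.e. with a spare slot it would roughly be (untested sketch, names taken from
the patch):

	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX + 1];
	int idx = 0;
	...
	ret = btf_find_field_struct(btf, member_type, off, sz, &info_arr[idx]);
	if (ret < 0)
		return ret;
	if (ret == BTF_FIELD_FOUND) {
		/* writing into the spare slot was safe; only error out once we
		 * know this member really was a field of interest
		 */
		if (idx >= BPF_MAP_VALUE_OFF_MAX)
			return -E2BIG;
		idx++;
	}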

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 05/15] bpf: Allow storing percpu kptr in map
  2022-03-19 21:45         ` Kumar Kartikeya Dwivedi
@ 2022-03-19 23:01           ` Alexei Starovoitov
  0 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2022-03-19 23:01 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Hao Luo, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 03:15:05AM +0530, Kumar Kartikeya Dwivedi wrote:
> On Sun, Mar 20, 2022 at 02:56:20AM IST, Alexei Starovoitov wrote:
> > On Sun, Mar 20, 2022 at 12:34:09AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > On Sun, Mar 20, 2022 at 12:00:28AM IST, Alexei Starovoitov wrote:
> > > > On Thu, Mar 17, 2022 at 05:29:47PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > > Make adjustments to the code to allow storing percpu PTR_TO_BTF_ID in a
> > > > > map. Similar to 'kptr_ref' tag, a new 'kptr_percpu' allows tagging types
> > > > > of pointers accepting stores of such register types. On load, verifier
> > > > > marks destination register as having type PTR_TO_BTF_ID | MEM_PERCPU |
> > > > > PTR_MAYBE_NULL.
> > > > >
> > > > > Cc: Hao Luo <haoluo@google.com>
> > > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > > ---
> > > > >  include/linux/bpf.h   |  3 ++-
> > > > >  kernel/bpf/btf.c      | 13 ++++++++++---
> > > > >  kernel/bpf/verifier.c | 26 +++++++++++++++++++++-----
> > > > >  3 files changed, 33 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > index 702aa882e4a3..433f5cb161cf 100644
> > > > > --- a/include/linux/bpf.h
> > > > > +++ b/include/linux/bpf.h
> > > > > @@ -161,7 +161,8 @@ enum {
> > > > >  };
> > > > >
> > > > >  enum {
> > > > > -	BPF_MAP_VALUE_OFF_F_REF = (1U << 0),
> > > > > +	BPF_MAP_VALUE_OFF_F_REF    = (1U << 0),
> > > > > +	BPF_MAP_VALUE_OFF_F_PERCPU = (1U << 1),
> > > >
> > > > What is the use case for storing __percpu pointer into a map?
> > >
> > > No specific use case for me, just thought it would be useful, especially now
> > > that __percpu tag is understood by verifier for kernel BTF, so it may also refer
> > > to dynamically allocated per-CPU memory, not just global percpu variables. But
> > > fine with dropping both this and user kptr if you don't feel like keeping them.
> >
> > I prefer to drop it for now.
> > The patch is trivial but kptr_percpu tag would stay forever.
> 
> Ok, I'll drop both this and user kptr for now.
> 
> > Maybe we can allow storing percpu pointers in a map with just kptr tag.
> > The verifier should be able to understand from btf whether that pointer
> > is percpu or not.
> 
> This won't work (unless I missed something): it is possible to see the type when
> a store is being done, but we cannot know whether the pointer was percpu or not
> when doing a load (which is needed to decide whether it will be marked with
> MEM_PERCPU, so that the user has to call bpf_this_cpu_ptr or bpf_per_cpu_ptr to
> obtain the actual pointer). So some extra tagging is needed.

The pointer in bpf program should probably be marked as normal __percpu then.
So that types match during both store and load.
It will be a combination of btf_tags __kptr and __percpu.
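
Hypothetically the map value side could then look like this (nothing of this
sort is implemented in this series, and struct some_object is just a
placeholder type):

	#define __kptr   __attribute__((btf_type_tag("kptr")))
	#define __percpu __attribute__((btf_type_tag("percpu")))

	struct map_value {
		/* a load here would be marked PTR_TO_BTF_ID | MEM_PERCPU |
		 * PTR_MAYBE_NULL, usable via bpf_this_cpu_ptr()/bpf_per_cpu_ptr()
		 */
		struct some_object __percpu __kptr *pcpu_ptr;
	};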
Anyway let's table this discussion until main feature lands.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v2 04/15] bpf: Allow storing referenced kptr in map
  2022-03-19 21:43         ` Kumar Kartikeya Dwivedi
@ 2022-03-20  0:57           ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 42+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-20  0:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer

On Sun, Mar 20, 2022 at 03:13:03AM IST, Kumar Kartikeya Dwivedi wrote:
> On Sun, Mar 20, 2022 at 02:53:20AM IST, Alexei Starovoitov wrote:
> > On Sun, Mar 20, 2022 at 12:29:04AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > >
> > > > >  	if (is_release_function(func_id)) {
> > > > > -		err = release_reference(env, meta.ref_obj_id);
> > > > > +		err = -EINVAL;
> > > > > +		if (meta.ref_obj_id)
> > > > > +			err = release_reference(env, meta.ref_obj_id);
> > > > > +		/* Only bpf_kptr_xchg is a release function that accepts a
> > > > > +		 * possibly NULL reg, hence meta.ref_obj_id can only be unset
> > > > > +		 * for it.
> > > >
> > > > Could you rephrase the comment? I'm not following what it's trying to convey.
> > > >
> > >
> > > All existing release helpers never take a NULL register, so their
> > > meta.ref_obj_id will never be unset, but bpf_kptr_xchg can, so it needs some
> > > special handling. In check_func_arg, when it jumps to skip_type_check label,
> > > reg->ref_obj_id won't be set for NULL value.
> >
> > I still don't follow.
> > What do you mean 'unset meta.ref_obj_id' ?
> > It's either set or not.
>
> By unset I meant it is the default (0).
>
> > meta->ref_obj_id will stay zero when arg == NULL.
> > Above 'if (meta.ref_obj_id)' makes sense.
> > But the code below with extra func_id check looks like defensive programming again.
> >
>
> Ok, so I'll just write it like:
>
> if (is_release_function(...) && meta.ref_obj_id) {
> 	err = release_reference(...);
> 	if (err)
> 		...
> }
>

Ah, after reworking, bpf_sk_release(listen_sk) in verifier/ref_tracking.c is
failing, and now I remember why I did it this way.

So meta.ref_obj_id may be 0 in many other cases where reg->ref_obj_id is 0, not
just for the NULL register, so we need to special case bpf_kptr_xchg. The user
cannot pass any other reg with ref_obj_id == 0 to it, because the verifier will
check the type to be PTR_TO_BTF_ID_OR_NULL if the register is not NULL.
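
So, keeping the shape from the patch, roughly (sketch only, error path adapted
from the existing release handling in check_helper_call):

	if (is_release_function(func_id)) {
		err = -EINVAL;
		if (meta.ref_obj_id)
			err = release_reference(env, meta.ref_obj_id);
		/* bpf_kptr_xchg is the only release function that may take a
		 * NULL arg, in which case there is no reference to release
		 */
		else if (func_id == BPF_FUNC_kptr_xchg)
			err = 0;
		if (err) {
			verbose(env, "func %s#%d reference has not been acquired before\n",
				func_id_name(func_id), func_id);
			return err;
		}
	}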


> > > > > +		 */
> > > > > +		else if (func_id == BPF_FUNC_kptr_xchg)
> > > > > +			err = 0;
>
> --
> Kartikeya

--
Kartikeya

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread

Thread overview: 42+ messages
2022-03-17 11:59 [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 01/15] bpf: Factor out fd returning from bpf_btf_find_by_name_kind Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 02/15] bpf: Make btf_find_field more generic Kumar Kartikeya Dwivedi
2022-03-19 17:55   ` Alexei Starovoitov
2022-03-19 19:31     ` Kumar Kartikeya Dwivedi
2022-03-19 20:06       ` Kumar Kartikeya Dwivedi
2022-03-19 21:30         ` Alexei Starovoitov
2022-03-17 11:59 ` [PATCH bpf-next v2 03/15] bpf: Allow storing unreferenced kptr in map Kumar Kartikeya Dwivedi
2022-03-19 18:15   ` Alexei Starovoitov
2022-03-19 18:52     ` Kumar Kartikeya Dwivedi
2022-03-19 21:17       ` Alexei Starovoitov
2022-03-19 21:39         ` Kumar Kartikeya Dwivedi
2022-03-19 21:50           ` Kumar Kartikeya Dwivedi
2022-03-19 22:57             ` Alexei Starovoitov
2022-03-17 11:59 ` [PATCH bpf-next v2 04/15] bpf: Allow storing referenced " Kumar Kartikeya Dwivedi
2022-03-19 18:24   ` Alexei Starovoitov
2022-03-19 18:59     ` Kumar Kartikeya Dwivedi
2022-03-19 21:23       ` Alexei Starovoitov
2022-03-19 21:43         ` Kumar Kartikeya Dwivedi
2022-03-20  0:57           ` Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 05/15] bpf: Allow storing percpu " Kumar Kartikeya Dwivedi
2022-03-19 18:30   ` Alexei Starovoitov
2022-03-19 19:04     ` Kumar Kartikeya Dwivedi
2022-03-19 21:26       ` Alexei Starovoitov
2022-03-19 21:45         ` Kumar Kartikeya Dwivedi
2022-03-19 23:01           ` Alexei Starovoitov
2022-03-17 11:59 ` [PATCH bpf-next v2 06/15] bpf: Allow storing user " Kumar Kartikeya Dwivedi
2022-03-19 18:28   ` Alexei Starovoitov
2022-03-19 19:02     ` Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 07/15] bpf: Prevent escaping of kptr loaded from maps Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 08/15] bpf: Adapt copy_map_value for multiple offset case Kumar Kartikeya Dwivedi
2022-03-19 18:34   ` Alexei Starovoitov
2022-03-19 19:06     ` Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 09/15] bpf: Always raise reference in btf_get_module_btf Kumar Kartikeya Dwivedi
2022-03-19 18:43   ` Alexei Starovoitov
2022-03-17 11:59 ` [PATCH bpf-next v2 10/15] bpf: Populate pairs of btf_id and destructor kfunc in btf Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 11/15] bpf: Wire up freeing of referenced kptr Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 12/15] bpf: Teach verifier about kptr_get kfunc helpers Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 13/15] libbpf: Add kptr type tag macros to bpf_helpers.h Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 14/15] selftests/bpf: Add C tests for kptr Kumar Kartikeya Dwivedi
2022-03-17 11:59 ` [PATCH bpf-next v2 15/15] selftests/bpf: Add verifier " Kumar Kartikeya Dwivedi
2022-03-19 18:50 ` [PATCH bpf-next v2 00/15] Introduce typed pointer support in BPF maps patchwork-bot+netdevbpf
