* [PATCH bpf 00/11] bpf: Fix the release of inner map
@ 2023-11-07 14:06 Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers Hou Tao
                   ` (10 more replies)
  0 siblings, 11 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

The patch set aims to fix the release of inner maps in a map-in-map
array or htab. The release of an inner map differs from that of a
normal map: a normal map is released after the bpf program which uses
it is destroyed, because the program tracks the maps it uses. However,
a bpf program cannot track the inner maps it uses, because inner maps
may be updated or deleted dynamically. For now the ref-count of an
inner map is decreased as soon as the inner map is overwritten in or
deleted from the map-in-map, so the inner map may be released before a
bpf program which is still accessing it exits, resulting in the
use-after-free problem demonstrated by patch #11.
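
The following hedged sketch (hypothetical names, not code from this
series) illustrates the racy pattern: the program dereferences the old
inner map while a concurrent syscall-side update drops its last
reference.

  /* BPF program side: looks up, then keeps using the inner map */
  int key = 0;
  void *inner_map = bpf_map_lookup_elem(&outer_map, &key);

  if (inner_map)
          bpf_map_lookup_elem(inner_map, &key); /* may run after free */

  /* user-space side: overwriting slot 0 puts the old inner map at once */
  bpf_map_update_elem(outer_map_fd, &key, &new_inner_map_fd, BPF_ANY);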

The patch set fixes the problem by deferring the ref-count decrease of
the inner map. Patch #1 fixes a warning seen when running the
newly-added selftest. Patches #2~#6 add the necessary helpers, patches
#7~#8 fix the problem for map array and map htab, patch #9 removes
now-unused helpers, and patches #10~#11 update an existing test and add
new test cases. Please check the individual patches for more details.
Comments are always welcome.

Regards,
Tao

Hou Tao (11):
  bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
  bpf: Reduce the scope of rcu_read_lock when updating fd map
  bpf: Use GFP_KERNEL in bpf_event_entry_gen()
  bpf: Add need_defer parameter to .map_fd_put_ptr()
  bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  bpf: Add bpf_map_of_map_fd_sys_lookup_elem() helper
  bpf: Defer bpf_map_put() for inner map in map array
  bpf: Defer bpf_map_put() for inner map in map htab
  bpf: Remove unused helpers for map-in-map
  selftests/bpf: Remove the liveness test for inner map
  selftests/bpf: Add test cases for inner map

 include/linux/bpf.h                           |   6 +-
 kernel/bpf/arraymap.c                         |  40 +++--
 kernel/bpf/hashtab.c                          |  33 +++--
 kernel/bpf/helpers.c                          |  13 +-
 kernel/bpf/map_in_map.c                       |  60 ++++++--
 kernel/bpf/map_in_map.h                       |  16 +-
 kernel/bpf/syscall.c                          |   4 -
 .../selftests/bpf/prog_tests/btf_map_in_map.c |  26 +---
 .../selftests/bpf/prog_tests/map_in_map.c     | 138 ++++++++++++++++++
 .../selftests/bpf/progs/access_map_in_map.c   |  99 +++++++++++++
 10 files changed, 359 insertions(+), 76 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_in_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/access_map_in_map.c

-- 
2.29.2



* [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
@ 2023-11-07 14:06 ` Hou Tao
  2023-11-08 23:11   ` Martin KaFai Lau
  2023-11-07 14:06 ` [PATCH bpf 02/11] bpf: Reduce the scope of rcu_read_lock when updating fd map Hou Tao
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

The bpf_map_{lookup,update,delete}_elem() helpers are also available
to sleepable bpf programs, so add the corresponding lock assertion for
sleepable bpf programs as well; otherwise the following warning is
reported when a sleepable bpf program manipulates a bpf map in
interpreter mode (aka bpf_jit_enable=0):

  WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ......
  CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:bpf_map_lookup_elem+0x54/0x60
  ......
  Call Trace:
   <TASK>
   ? __warn+0xa5/0x240
   ? bpf_map_lookup_elem+0x54/0x60
   ? report_bug+0x1ba/0x1f0
   ? handle_bug+0x40/0x80
   ? exc_invalid_op+0x18/0x50
   ? asm_exc_invalid_op+0x1b/0x20
   ? __pfx_bpf_map_lookup_elem+0x10/0x10
   ? rcu_lockdep_current_cpu_online+0x65/0xb0
   ? rcu_is_watching+0x23/0x50
   ? bpf_map_lookup_elem+0x54/0x60
   ? __pfx_bpf_map_lookup_elem+0x10/0x10
   ___bpf_prog_run+0x513/0x3b70
   __bpf_prog_run32+0x9d/0xd0
   ? __bpf_prog_enter_sleepable_recur+0xad/0x120
   ? __bpf_prog_enter_sleepable_recur+0x3e/0x120
   bpf_trampoline_6442580665+0x4d/0x1000
   __x64_sys_getpgid+0x5/0x30
   ? do_syscall_64+0x36/0xb0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>
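
For reference, a minimal sleepable program of the kind that trips the
new assertion under interpreter mode could look as follows (a hedged
sketch with hypothetical names, assuming the usual bpf_helpers.h
includes; the real reproducer is the selftest added in patch #11):

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, int);
  } m SEC(".maps");

  SEC("fentry.s/__x64_sys_getpgid")
  int probe(void *ctx)
  {
          int key = 0;

          /* sleepable progs run under rcu_read_lock_trace(), not
           * rcu_read_lock(), so only the new assertion recognizes them
           */
          bpf_map_lookup_elem(&m, &key);
          return 0;
  }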

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/helpers.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 56b0c1f678ee7..f43038931935e 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -32,12 +32,13 @@
  *
  * Different map implementations will rely on rcu in map methods
  * lookup/update/delete, therefore eBPF programs must run under rcu lock
- * if program is allowed to access maps, so check rcu_read_lock_held in
- * all three functions.
+ * if program is allowed to access maps, so check rcu_read_lock_held() or
+ * rcu_read_lock_trace_held() in all three functions.
  */
 BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
+		     !rcu_read_lock_bh_held());
 	return (unsigned long) map->ops->map_lookup_elem(map, key);
 }
 
@@ -53,7 +54,8 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
 BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
 	   void *, value, u64, flags)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
+		     !rcu_read_lock_bh_held());
 	return map->ops->map_update_elem(map, key, value, flags);
 }
 
@@ -70,7 +72,8 @@ const struct bpf_func_proto bpf_map_update_elem_proto = {
 
 BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
+		     !rcu_read_lock_bh_held());
 	return map->ops->map_delete_elem(map, key);
 }
 
-- 
2.29.2



* [PATCH bpf 02/11] bpf: Reduce the scope of rcu_read_lock when updating fd map
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers Hou Tao
@ 2023-11-07 14:06 ` Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 03/11] bpf: Use GFP_KERNEL in bpf_event_entry_gen() Hou Tao
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
ops->map_fd_put_ptr(), so stop taking the rcu-read-lock around these
two callbacks.

For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
the rcu-read-lock because array->ptrs will not be freed until the
map-in-map is released. For bpf_fd_htab_map_update_elem(),
htab_map_update_elem() requires the rcu-read-lock to be held, so only
hold rcu_read_lock() around the invocation of htab_map_update_elem().

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 2 ++
 kernel/bpf/syscall.c | 4 ----
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index fd8d4b0addfca..7d1457360c99b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -2523,7 +2523,9 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
 	if (IS_ERR(ptr))
 		return PTR_ERR(ptr);
 
+	rcu_read_lock();
 	ret = htab_map_update_elem(map, key, &ptr, map_flags);
+	rcu_read_unlock();
 	if (ret)
 		map->ops->map_fd_put_ptr(ptr);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a144eb286974b..696b1af8cecbd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -180,15 +180,11 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 		err = bpf_percpu_cgroup_storage_update(map, key, value,
 						       flags);
 	} else if (IS_FD_ARRAY(map)) {
-		rcu_read_lock();
 		err = bpf_fd_array_map_update_elem(map, map_file, key, value,
 						   flags);
-		rcu_read_unlock();
 	} else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
-		rcu_read_lock();
 		err = bpf_fd_htab_map_update_elem(map, map_file, key, value,
 						  flags);
-		rcu_read_unlock();
 	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
 		/* rcu_read_lock() is not needed */
 		err = bpf_fd_reuseport_array_update_elem(map, key, value,
-- 
2.29.2



* [PATCH bpf 03/11] bpf: Use GFP_KERNEL in bpf_event_entry_gen()
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 02/11] bpf: Reduce the scope of rcu_read_lock when updating fd map Hou Tao
@ 2023-11-07 14:06 ` Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 04/11] bpf: Add need_defer parameter to .map_fd_put_ptr() Hou Tao
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

rcu_read_lock() is no longer held when bpf_event_entry_gen() is
invoked by perf_event_fd_array_get_ptr(), so use GFP_KERNEL instead of
GFP_ATOMIC to reduce the possibility of allocation failure under
memory pressure.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/arraymap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 2058e89b5ddd0..2fcb7031fca87 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1189,7 +1189,7 @@ static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file,
 {
 	struct bpf_event_entry *ee;
 
-	ee = kzalloc(sizeof(*ee), GFP_ATOMIC);
+	ee = kzalloc(sizeof(*ee), GFP_KERNEL);
 	if (ee) {
 		ee->event = perf_file->private_data;
 		ee->perf_file = perf_file;
-- 
2.29.2



* [PATCH bpf 04/11] bpf: Add need_defer parameter to .map_fd_put_ptr()
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
                   ` (2 preceding siblings ...)
  2023-11-07 14:06 ` [PATCH bpf 03/11] bpf: Use GFP_KERNEL in bpf_event_entry_gen() Hou Tao
@ 2023-11-07 14:06 ` Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers Hou Tao
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

need_defer tells the implementation to defer the release of the passed
element's reference, ensuring that the element stays alive until every
bpf program which may be manipulating it has exited.

The following three cases invoke map_fd_put_ptr(), and different
need_defer values are passed by these callers:

1) releasing the reference of the old element in the map during map
   update or map deletion. The release must be deferred, otherwise the
   bpf program may incur a use-after-free problem, so need_defer needs
   to be true.
2) releasing the reference of the to-be-added element in the error path
   of map update. The to-be-added element is not visible to any bpf
   program, so it is OK to pass false for the need_defer parameter.
3) releasing the references of all elements in the map during map
   release. Any bpf program which had access to the map must already
   have exited and been released, so need_defer=false is OK.

The need_defer parameter will be used by the following patches to fix
the potential use-after-free problem for map-in-map.
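
As a hedged sketch (hypothetical names, not code from this series), a
callback owning an RCU-protected element would honor the two modes
roughly like this; patch #5 implements the real map-in-map variant,
which additionally waits for a tasks trace RCU grace period:

  static void example_fd_put_ptr(void *ptr, bool need_defer)
  {
          struct example_elem *elem = ptr;

          if (need_defer)
                  /* running bpf programs may still reach the element,
                   * so free it only after an RCU grace period
                   */
                  call_rcu(&elem->rcu, example_elem_free_rcu);
          else
                  /* no bpf program can reach the element any more */
                  example_elem_free(elem);
  }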

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h     |  6 +++++-
 kernel/bpf/arraymap.c   | 12 +++++++-----
 kernel/bpf/hashtab.c    |  6 +++---
 kernel/bpf/map_in_map.c |  2 +-
 kernel/bpf/map_in_map.h |  2 +-
 5 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b4825d3cdb292..88be6a4b0a315 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -106,7 +106,11 @@ struct bpf_map_ops {
 	/* funcs called by prog_array and perf_event_array map */
 	void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
 				int fd);
-	void (*map_fd_put_ptr)(void *ptr);
+	/* If need_defer is true, the implementation should guarantee that
+	 * the to-be-put element is still alive before the bpf program, which
+	 * may manipulate it, exits.
+	 */
+	void (*map_fd_put_ptr)(void *ptr, bool need_defer);
 	int (*map_gen_lookup)(struct bpf_map *map, struct bpf_insn *insn_buf);
 	u32 (*map_fd_sys_lookup_elem)(void *ptr);
 	void (*map_seq_show_elem)(struct bpf_map *map, void *key,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 2fcb7031fca87..c1124b71da158 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -867,7 +867,7 @@ int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
 	}
 
 	if (old_ptr)
-		map->ops->map_fd_put_ptr(old_ptr);
+		map->ops->map_fd_put_ptr(old_ptr, true);
 	return 0;
 }
 
@@ -890,7 +890,7 @@ static long fd_array_map_delete_elem(struct bpf_map *map, void *key)
 	}
 
 	if (old_ptr) {
-		map->ops->map_fd_put_ptr(old_ptr);
+		map->ops->map_fd_put_ptr(old_ptr, true);
 		return 0;
 	} else {
 		return -ENOENT;
@@ -913,8 +913,9 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
 	return prog;
 }
 
-static void prog_fd_array_put_ptr(void *ptr)
+static void prog_fd_array_put_ptr(void *ptr, bool deferred)
 {
+	/* bpf_prog is freed after one RCU or tasks trace grace period */
 	bpf_prog_put(ptr);
 }
 
@@ -1239,8 +1240,9 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map,
 	return ee;
 }
 
-static void perf_event_fd_array_put_ptr(void *ptr)
+static void perf_event_fd_array_put_ptr(void *ptr, bool deferred)
 {
+	/* bpf_perf_event is freed after one RCU grace period */
 	bpf_event_entry_free_rcu(ptr);
 }
 
@@ -1294,7 +1296,7 @@ static void *cgroup_fd_array_get_ptr(struct bpf_map *map,
 	return cgroup_get_from_fd(fd);
 }
 
-static void cgroup_fd_array_put_ptr(void *ptr)
+static void cgroup_fd_array_put_ptr(void *ptr, bool deferred)
 {
 	/* cgroup_put free cgrp after a rcu grace period */
 	cgroup_put(ptr);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 7d1457360c99b..81b9f237c942b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -897,7 +897,7 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 
 	if (map->ops->map_fd_put_ptr) {
 		ptr = fd_htab_map_get_ptr(map, l);
-		map->ops->map_fd_put_ptr(ptr);
+		map->ops->map_fd_put_ptr(ptr, true);
 	}
 }
 
@@ -2484,7 +2484,7 @@ static void fd_htab_map_free(struct bpf_map *map)
 		hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
 			void *ptr = fd_htab_map_get_ptr(map, l);
 
-			map->ops->map_fd_put_ptr(ptr);
+			map->ops->map_fd_put_ptr(ptr, false);
 		}
 	}
 
@@ -2527,7 +2527,7 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
 	ret = htab_map_update_elem(map, key, &ptr, map_flags);
 	rcu_read_unlock();
 	if (ret)
-		map->ops->map_fd_put_ptr(ptr);
+		map->ops->map_fd_put_ptr(ptr, false);
 
 	return ret;
 }
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index cd5eafaba97e2..8323ce201159d 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -127,7 +127,7 @@ void *bpf_map_fd_get_ptr(struct bpf_map *map,
 	return inner_map;
 }
 
-void bpf_map_fd_put_ptr(void *ptr)
+void bpf_map_fd_put_ptr(void *ptr, bool deferred)
 {
 	/* ptr->ops->map_free() has to go through one
 	 * rcu grace period by itself.
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
index bcb7534afb3c0..63872bffd9b3c 100644
--- a/kernel/bpf/map_in_map.h
+++ b/kernel/bpf/map_in_map.h
@@ -13,7 +13,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd);
 void bpf_map_meta_free(struct bpf_map *map_meta);
 void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
 			 int ufd);
-void bpf_map_fd_put_ptr(void *ptr);
+void bpf_map_fd_put_ptr(void *ptr, bool need_defer);
 u32 bpf_map_fd_sys_lookup_elem(void *ptr);
 
 #endif
-- 
2.29.2



* [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
                   ` (3 preceding siblings ...)
  2023-11-07 14:06 ` [PATCH bpf 04/11] bpf: Add need_defer parameter to .map_fd_put_ptr() Hou Tao
@ 2023-11-07 14:06 ` Hou Tao
  2023-11-09  6:36   ` Martin KaFai Lau
  2023-11-07 14:06 ` [PATCH bpf 06/11] bpf: Add bpf_map_of_map_fd_sys_lookup_elem() helper Hou Tao
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

bpf_map_of_map_fd_get_ptr() converts a map fd into the pointer saved
in the map-in-map. bpf_map_of_map_fd_put_ptr() releases that pointer.
These two helpers will be used by the following patches to fix the
use-after-free problem for map-in-map.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/map_in_map.c | 51 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/map_in_map.h | 11 +++++++--
 2 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 8323ce201159d..96e32f4167c4e 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -4,6 +4,7 @@
 #include <linux/slab.h>
 #include <linux/bpf.h>
 #include <linux/btf.h>
+#include <linux/rcupdate.h>
 
 #include "map_in_map.h"
 
@@ -139,3 +140,53 @@ u32 bpf_map_fd_sys_lookup_elem(void *ptr)
 {
 	return ((struct bpf_map *)ptr)->id;
 }
+
+void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
+			       int ufd)
+{
+	struct bpf_inner_map_element *element;
+	struct bpf_map *inner_map;
+
+	element = kmalloc(sizeof(*element), GFP_KERNEL);
+	if (!element)
+		return ERR_PTR(-ENOMEM);
+
+	inner_map = bpf_map_fd_get_ptr(map, map_file, ufd);
+	if (IS_ERR(inner_map)) {
+		kfree(element);
+		return inner_map;
+	}
+
+	element->map = inner_map;
+	return element;
+}
+
+static void bpf_inner_map_element_free_rcu(struct rcu_head *rcu)
+{
+	struct bpf_inner_map_element *elem = container_of(rcu, struct bpf_inner_map_element, rcu);
+
+	bpf_map_put(elem->map);
+	kfree(elem);
+}
+
+static void bpf_inner_map_element_free_tt_rcu(struct rcu_head *rcu)
+{
+	if (rcu_trace_implies_rcu_gp())
+		bpf_inner_map_element_free_rcu(rcu);
+	else
+		call_rcu(rcu, bpf_inner_map_element_free_rcu);
+}
+
+void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
+{
+	struct bpf_inner_map_element *element = ptr;
+
+	/* Do bpf_map_put() after a RCU grace period and a tasks trace
+	 * RCU grace period, so it is certain that the bpf program which is
+	 * manipulating the map now has exited when bpf_map_put() is called.
+	 */
+	if (need_defer)
+		call_rcu_tasks_trace(&element->rcu, bpf_inner_map_element_free_tt_rcu);
+	else
+		bpf_inner_map_element_free_rcu(&element->rcu);
+}
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
index 63872bffd9b3c..8d38496e5179b 100644
--- a/kernel/bpf/map_in_map.h
+++ b/kernel/bpf/map_in_map.h
@@ -9,11 +9,18 @@
 struct file;
 struct bpf_map;
 
+struct bpf_inner_map_element {
+	struct bpf_map *map;
+	struct rcu_head rcu;
+};
+
 struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd);
 void bpf_map_meta_free(struct bpf_map *map_meta);
-void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
-			 int ufd);
+void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
 void bpf_map_fd_put_ptr(void *ptr, bool need_defer);
 u32 bpf_map_fd_sys_lookup_elem(void *ptr);
 
+void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
+void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer);
+
 #endif
-- 
2.29.2



* [PATCH bpf 06/11] bpf: Add bpf_map_of_map_fd_sys_lookup_elem() helper
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
                   ` (4 preceding siblings ...)
  2023-11-07 14:06 ` [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers Hou Tao
@ 2023-11-07 14:06 ` Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 07/11] bpf: Defer bpf_map_put() for inner map in map array Hou Tao
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

bpf_map_of_map_fd_sys_lookup_elem() returns the inner map id as the
lookup value for a map-in-map. It will be used when userspace
applications perform lookup operations on a map-in-map.
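
From user space, the returned id is typically converted back into a
map fd. A hedged usage sketch with libbpf (assuming outer_map_fd
refers to a map-in-map):

  __u32 key = 0, inner_map_id = 0;
  int inner_map_fd;

  /* a lookup on a map-in-map yields the inner map id, not a pointer */
  if (!bpf_map_lookup_elem(outer_map_fd, &key, &inner_map_id))
          inner_map_fd = bpf_map_get_fd_by_id(inner_map_id);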

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/map_in_map.c | 6 ++++++
 kernel/bpf/map_in_map.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 96e32f4167c4e..3cd08e7fb86e6 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -190,3 +190,9 @@ void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
 	else
 		bpf_inner_map_element_free_rcu(&element->rcu);
 }
+
+u32 bpf_map_of_map_fd_sys_lookup_elem(void *ptr)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return ((struct bpf_inner_map_element *)ptr)->map->id;
+}
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
index 8d38496e5179b..4a0d66757a065 100644
--- a/kernel/bpf/map_in_map.h
+++ b/kernel/bpf/map_in_map.h
@@ -22,5 +22,6 @@ u32 bpf_map_fd_sys_lookup_elem(void *ptr);
 
 void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
 void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer);
+u32 bpf_map_of_map_fd_sys_lookup_elem(void *ptr);
 
 #endif
-- 
2.29.2



* [PATCH bpf 07/11] bpf: Defer bpf_map_put() for inner map in map array
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
                   ` (5 preceding siblings ...)
  2023-11-07 14:06 ` [PATCH bpf 06/11] bpf: Add bpf_map_of_map_fd_sys_lookup_elem() helper Hou Tao
@ 2023-11-07 14:06 ` Hou Tao
  2023-11-07 14:06 ` [PATCH bpf 08/11] bpf: Defer bpf_map_put() for inner map in map htab Hou Tao
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

When an inner map in a map-in-map array is updated or deleted, the old
inner map may still be accessed by non-sleepable or sleepable bpf
programs. However, bpf_fd_array_map_update_elem() decreases the
ref-count of the inner map directly through bpf_map_put(). If it drops
the last ref-count, which is true in most cases, the inner map will be
freed by ops->map_free() in a kworker. But for now, most .map_free()
callbacks don't use synchronize_rcu() or its variants to wait for an
RCU grace period to elapse, so a bpf program which is still accessing
the inner map may incur a use-after-free problem.

Fix it by deferring the invocation of bpf_map_put() until both one RCU
grace period and one tasks trace RCU grace period have elapsed.
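
After this change a lookup in the outer array goes through one more
level of indirection. Conceptually (a sketch of the resulting data
layout, with value_slot standing in for the array element; not the
emitted instructions):

  /* before: the value slot holds a struct bpf_map * directly */
  inner_map = READ_ONCE(*(struct bpf_map **)value_slot);

  /* after: the value slot holds a struct bpf_inner_map_element *,
   * whose first member holds the inner map
   */
  element = READ_ONCE(*(struct bpf_inner_map_element **)value_slot);
  inner_map = element ? element->map : NULL;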

Fixes: bba1dc0b55ac ("bpf: Remove redundant synchronize_rcu.")
Fixes: 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/arraymap.c   | 26 +++++++++++++++++---------
 kernel/bpf/map_in_map.h |  3 +++
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index c1124b71da158..2229253bcb6bd 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1355,12 +1355,18 @@ static void array_of_map_free(struct bpf_map *map)
 
 static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
 {
-	struct bpf_map **inner_map = array_map_lookup_elem(map, key);
+	struct bpf_inner_map_element *element;
+	void **ptr;
 
-	if (!inner_map)
+	ptr = array_map_lookup_elem(map, key);
+	if (!ptr)
 		return NULL;
 
-	return READ_ONCE(*inner_map);
+	element = READ_ONCE(*ptr);
+	/* Uninitialized element? */
+	if (!element)
+		return NULL;
+	return element->map;
 }
 
 static int array_of_map_gen_lookup(struct bpf_map *map,
@@ -1376,10 +1382,10 @@ static int array_of_map_gen_lookup(struct bpf_map *map,
 	*insn++ = BPF_ALU64_IMM(BPF_ADD, map_ptr, offsetof(struct bpf_array, value));
 	*insn++ = BPF_LDX_MEM(BPF_W, ret, index, 0);
 	if (!map->bypass_spec_v1) {
-		*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 6);
+		*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 8);
 		*insn++ = BPF_ALU32_IMM(BPF_AND, ret, array->index_mask);
 	} else {
-		*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 5);
+		*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 7);
 	}
 	if (is_power_of_2(elem_size))
 		*insn++ = BPF_ALU64_IMM(BPF_LSH, ret, ilog2(elem_size));
@@ -1387,7 +1393,9 @@ static int array_of_map_gen_lookup(struct bpf_map *map,
 		*insn++ = BPF_ALU64_IMM(BPF_MUL, ret, elem_size);
 	*insn++ = BPF_ALU64_REG(BPF_ADD, ret, map_ptr);
 	*insn++ = BPF_LDX_MEM(BPF_DW, ret, ret, 0);
-	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 4);
+	*insn++ = BPF_LDX_MEM(BPF_DW, ret, ret, 0);
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 2);
 	*insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
 	*insn++ = BPF_MOV64_IMM(ret, 0);
 
@@ -1401,9 +1409,9 @@ const struct bpf_map_ops array_of_maps_map_ops = {
 	.map_get_next_key = array_map_get_next_key,
 	.map_lookup_elem = array_of_map_lookup_elem,
 	.map_delete_elem = fd_array_map_delete_elem,
-	.map_fd_get_ptr = bpf_map_fd_get_ptr,
-	.map_fd_put_ptr = bpf_map_fd_put_ptr,
-	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
+	.map_fd_get_ptr = bpf_map_of_map_fd_get_ptr,
+	.map_fd_put_ptr = bpf_map_of_map_fd_put_ptr,
+	.map_fd_sys_lookup_elem = bpf_map_of_map_fd_sys_lookup_elem,
 	.map_gen_lookup = array_of_map_gen_lookup,
 	.map_lookup_batch = generic_map_lookup_batch,
 	.map_update_batch = generic_map_update_batch,
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
index 4a0d66757a065..1fa688b8882ae 100644
--- a/kernel/bpf/map_in_map.h
+++ b/kernel/bpf/map_in_map.h
@@ -10,6 +10,9 @@ struct file;
 struct bpf_map;
 
 struct bpf_inner_map_element {
+	/* map must be the first member, array_of_map_gen_lookup() depends on it
+	 * to dereference map correctly.
+	 */
 	struct bpf_map *map;
 	struct rcu_head rcu;
 };
-- 
2.29.2



* [PATCH bpf 08/11] bpf: Defer bpf_map_put() for inner map in map htab
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
                   ` (6 preceding siblings ...)
  2023-11-07 14:06 ` [PATCH bpf 07/11] bpf: Defer bpf_map_put() for inner map in map array Hou Tao
@ 2023-11-07 14:06 ` Hou Tao
  2023-11-07 14:07 ` [PATCH bpf 09/11] bpf: Remove unused helpers for map-in-map Hou Tao
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:06 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

When an inner map in a map-in-map htab is updated or deleted, the old
inner map may still be accessed by non-sleepable or sleepable bpf
programs. However, bpf_fd_htab_map_update_elem() decreases the
ref-count of the inner map directly through bpf_map_put(). If it drops
the last ref-count, which is true in most cases, the inner map will be
freed by ops->map_free() in a kworker. But for now, most .map_free()
callbacks don't use synchronize_rcu() or its variants to wait for an
RCU grace period to elapse, so a bpf program which is still accessing
the inner map may incur a use-after-free problem.

Fix it by deferring the invocation of bpf_map_put() until both one RCU
grace period and one tasks trace RCU grace period have elapsed.

Fixes: bba1dc0b55ac ("bpf: Remove redundant synchronize_rcu.")
Fixes: 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c    | 25 +++++++++++++++----------
 kernel/bpf/map_in_map.h |  4 ++--
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 81b9f237c942b..0013329af6d36 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1813,10 +1813,10 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 		} else {
 			value = l->key + roundup_key_size;
 			if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
-				struct bpf_map **inner_map = value;
+				void *inner = READ_ONCE(*(void **)value);
 
-				 /* Actual value is the id of the inner map */
-				map_id = map->ops->map_fd_sys_lookup_elem(*inner_map);
+				/* Actual value is the id of the inner map */
+				map_id = map->ops->map_fd_sys_lookup_elem(inner);
 				value = &map_id;
 			}
 
@@ -2553,12 +2553,16 @@ static struct bpf_map *htab_of_map_alloc(union bpf_attr *attr)
 
 static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
 {
-	struct bpf_map **inner_map  = htab_map_lookup_elem(map, key);
+	struct bpf_inner_map_element *element;
+	void **ptr;
 
-	if (!inner_map)
+	ptr = htab_map_lookup_elem(map, key);
+	if (!ptr)
 		return NULL;
 
-	return READ_ONCE(*inner_map);
+	/* element must be non-NULL */
+	element = READ_ONCE(*ptr);
+	return element->map;
 }
 
 static int htab_of_map_gen_lookup(struct bpf_map *map,
@@ -2570,11 +2574,12 @@ static int htab_of_map_gen_lookup(struct bpf_map *map,
 	BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
 		     (void *(*)(struct bpf_map *map, void *key))NULL));
 	*insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);
-	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 2);
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 3);
 	*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
 				offsetof(struct htab_elem, key) +
 				round_up(map->key_size, 8));
 	*insn++ = BPF_LDX_MEM(BPF_DW, ret, ret, 0);
+	*insn++ = BPF_LDX_MEM(BPF_DW, ret, ret, 0);
 
 	return insn - insn_buf;
 }
@@ -2592,9 +2597,9 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_of_map_lookup_elem,
 	.map_delete_elem = htab_map_delete_elem,
-	.map_fd_get_ptr = bpf_map_fd_get_ptr,
-	.map_fd_put_ptr = bpf_map_fd_put_ptr,
-	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
+	.map_fd_get_ptr = bpf_map_of_map_fd_get_ptr,
+	.map_fd_put_ptr = bpf_map_of_map_fd_put_ptr,
+	.map_fd_sys_lookup_elem = bpf_map_of_map_fd_sys_lookup_elem,
 	.map_gen_lookup = htab_of_map_gen_lookup,
 	.map_check_btf = map_check_no_btf,
 	.map_mem_usage = htab_map_mem_usage,
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
index 1fa688b8882ae..f8719bcd7c254 100644
--- a/kernel/bpf/map_in_map.h
+++ b/kernel/bpf/map_in_map.h
@@ -10,8 +10,8 @@ struct file;
 struct bpf_map;
 
 struct bpf_inner_map_element {
-	/* map must be the first member, array_of_map_gen_lookup() depends on it
-	 * to dereference map correctly.
+	/* map must be the first member, array_of_map_gen_lookup() and
+	 * htab_of_map_lookup_elem() depend on it to dereference map correctly.
 	 */
 	struct bpf_map *map;
 	struct rcu_head rcu;
-- 
2.29.2



* [PATCH bpf 09/11] bpf: Remove unused helpers for map-in-map
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
                   ` (7 preceding siblings ...)
  2023-11-07 14:06 ` [PATCH bpf 08/11] bpf: Defer bpf_map_put() for inner map in map htab Hou Tao
@ 2023-11-07 14:07 ` Hou Tao
  2023-11-07 14:07 ` [PATCH bpf 10/11] selftests/bpf: Remove the liveness test for inner map Hou Tao
  2023-11-07 14:07 ` [PATCH bpf 11/11] selftests/bpf: Add test cases " Hou Tao
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:07 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

bpf_map_fd_put_ptr() and bpf_map_fd_sys_lookup_elem() are no longer
used for map-in-map, so just remove these two helpers.
bpf_map_fd_get_ptr() is now only used internally, so make it static.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/map_in_map.c | 19 ++-----------------
 kernel/bpf/map_in_map.h |  3 ---
 2 files changed, 2 insertions(+), 20 deletions(-)

diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 3cd08e7fb86e6..372e89edc2d57 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -106,9 +106,7 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
 		btf_record_equal(meta0->record, meta1->record);
 }
 
-void *bpf_map_fd_get_ptr(struct bpf_map *map,
-			 struct file *map_file /* not used */,
-			 int ufd)
+static void *bpf_map_fd_get_ptr(struct bpf_map *map, int ufd)
 {
 	struct bpf_map *inner_map, *inner_map_meta;
 	struct fd f;
@@ -128,19 +126,6 @@ void *bpf_map_fd_get_ptr(struct bpf_map *map,
 	return inner_map;
 }
 
-void bpf_map_fd_put_ptr(void *ptr, bool deferred)
-{
-	/* ptr->ops->map_free() has to go through one
-	 * rcu grace period by itself.
-	 */
-	bpf_map_put(ptr);
-}
-
-u32 bpf_map_fd_sys_lookup_elem(void *ptr)
-{
-	return ((struct bpf_map *)ptr)->id;
-}
-
 void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
 			       int ufd)
 {
@@ -151,7 +136,7 @@ void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
 	if (!element)
 		return ERR_PTR(-ENOMEM);
 
-	inner_map = bpf_map_fd_get_ptr(map, map_file, ufd);
+	inner_map = bpf_map_fd_get_ptr(map, ufd);
 	if (IS_ERR(inner_map)) {
 		kfree(element);
 		return inner_map;
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
index f8719bcd7c254..903eb33896723 100644
--- a/kernel/bpf/map_in_map.h
+++ b/kernel/bpf/map_in_map.h
@@ -19,9 +19,6 @@ struct bpf_inner_map_element {
 
 struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd);
 void bpf_map_meta_free(struct bpf_map *map_meta);
-void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
-void bpf_map_fd_put_ptr(void *ptr, bool need_defer);
-u32 bpf_map_fd_sys_lookup_elem(void *ptr);
 
 void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
 void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer);
-- 
2.29.2



* [PATCH bpf 10/11] selftests/bpf: Remove the liveness test for inner map
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
                   ` (8 preceding siblings ...)
  2023-11-07 14:07 ` [PATCH bpf 09/11] bpf: Remove unused helpers for map-in-map Hou Tao
@ 2023-11-07 14:07 ` Hou Tao
  2023-11-07 14:07 ` [PATCH bpf 11/11] selftests/bpf: Add test cases " Hou Tao
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:07 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

Even before the introduction of the deferred map put for inner maps,
the liveness test for inner maps was not correct, because inner_map1
and inner_map2 are released directly in a kworker and invoking
kern_sync_rcu() is unnecessary.

After deferring the invocation of bpf_map_put() for inner maps, the
timing of the inner map release becomes more complicated: each
overwrite of the inner map delays its release until both one RCU grace
period and one tasks trace RCU grace period have elapsed. To avoid
such complexity in the selftest, just remove the liveness test for
inner maps.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/prog_tests/btf_map_in_map.c | 26 +------------------
 1 file changed, 1 insertion(+), 25 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/btf_map_in_map.c b/tools/testing/selftests/bpf/prog_tests/btf_map_in_map.c
index a8b53b8736f01..f66ceccd7029c 100644
--- a/tools/testing/selftests/bpf/prog_tests/btf_map_in_map.c
+++ b/tools/testing/selftests/bpf/prog_tests/btf_map_in_map.c
@@ -25,7 +25,7 @@ static void test_lookup_update(void)
 	int map1_fd, map2_fd, map3_fd, map4_fd, map5_fd, map1_id, map2_id;
 	int outer_arr_fd, outer_hash_fd, outer_arr_dyn_fd;
 	struct test_btf_map_in_map *skel;
-	int err, key = 0, val, i, fd;
+	int err, key = 0, val, i;
 
 	skel = test_btf_map_in_map__open_and_load();
 	if (CHECK(!skel, "skel_open", "failed to open&load skeleton\n"))
@@ -102,30 +102,6 @@ static void test_lookup_update(void)
 	CHECK(map1_id == 0, "map1_id", "failed to get ID 1\n");
 	CHECK(map2_id == 0, "map2_id", "failed to get ID 2\n");
 
-	test_btf_map_in_map__destroy(skel);
-	skel = NULL;
-
-	/* we need to either wait for or force synchronize_rcu(), before
-	 * checking for "still exists" condition, otherwise map could still be
-	 * resolvable by ID, causing false positives.
-	 *
-	 * Older kernels (5.8 and earlier) freed map only after two
-	 * synchronize_rcu()s, so trigger two, to be entirely sure.
-	 */
-	CHECK(kern_sync_rcu(), "sync_rcu", "failed\n");
-	CHECK(kern_sync_rcu(), "sync_rcu", "failed\n");
-
-	fd = bpf_map_get_fd_by_id(map1_id);
-	if (CHECK(fd >= 0, "map1_leak", "inner_map1 leaked!\n")) {
-		close(fd);
-		goto cleanup;
-	}
-	fd = bpf_map_get_fd_by_id(map2_id);
-	if (CHECK(fd >= 0, "map2_leak", "inner_map2 leaked!\n")) {
-		close(fd);
-		goto cleanup;
-	}
-
 cleanup:
 	test_btf_map_in_map__destroy(skel);
 }
-- 
2.29.2



* [PATCH bpf 11/11] selftests/bpf: Add test cases for inner map
  2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
                   ` (9 preceding siblings ...)
  2023-11-07 14:07 ` [PATCH bpf 10/11] selftests/bpf: Remove the liveness test for inner map Hou Tao
@ 2023-11-07 14:07 ` Hou Tao
  10 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-07 14:07 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

Add test cases for the race between the destruction of an inner map
due to a map-in-map update and the access of the inner map by a bpf
program. The following 4 combinations are added:
(1) array map in map array + bpf program
(2) array map in map array + sleepable bpf program
(3) array map in map htab + bpf program
(4) array map in map htab + sleepable bpf program

Before applying the fixes, when running "./test_progs -a map_in_map"
with net.core.bpf_jit_enable=0, the following error was reported:

  BUG: KASAN: slab-use-after-free in bpf_map_lookup_elem+0x25/0x60
  Read of size 8 at addr ffff888162fbe000 by task test_progs/3282

  CPU: 4 PID: 3282 Comm: test_progs Not tainted 6.6.0-rc5+ #21
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  Call Trace:
   <TASK>
   dump_stack_lvl+0x4b/0x80
   print_report+0xcf/0x610
   kasan_report+0x9d/0xd0
   __asan_load8+0x7e/0xb0
   bpf_map_lookup_elem+0x25/0x60
   ___bpf_prog_run+0x2569/0x3c50
   __bpf_prog_run32+0xa1/0xe0
   trace_call_bpf+0x1a9/0x5e0
   kprobe_perf_func+0xce/0x450
   kprobe_dispatcher+0xa1/0xb0
   kprobe_ftrace_handler+0x27b/0x370
   0xffffffffc02080f7
  RIP: 0010:__x64_sys_getpgid+0x1/0x30
  ......
   </TASK>

  Allocated by task 3281:
   kasan_save_stack+0x26/0x50
   kasan_set_track+0x25/0x30
   kasan_save_alloc_info+0x1b/0x30
   __kasan_kmalloc+0x84/0xa0
   __kmalloc_node+0x67/0x170
   __bpf_map_area_alloc+0x13f/0x160
   bpf_map_area_alloc+0x10/0x20
   array_map_alloc+0x11d/0x2c0
   map_create+0x285/0xc30
   __sys_bpf+0xcff/0x3350
   __x64_sys_bpf+0x45/0x60
   do_syscall_64+0x33/0x60
   entry_SYSCALL_64_after_hwframe+0x6e/0xd8

  Freed by task 1328:
   kasan_save_stack+0x26/0x50
   kasan_set_track+0x25/0x30
   kasan_save_free_info+0x2b/0x50
   __kasan_slab_free+0x10f/0x1a0
   __kmem_cache_free+0x1df/0x460
   kfree+0x90/0x140
   kvfree+0x2c/0x40
   bpf_map_area_free+0xe/0x20
   array_map_free+0x11f/0x270
   bpf_map_free_deferred+0xda/0x200
   process_scheduled_works+0x689/0xa20
   worker_thread+0x2fd/0x5a0
   kthread+0x1bf/0x200
   ret_from_fork+0x39/0x70
   ret_from_fork_asm+0x1b/0x30

  Last potentially related work creation:
   kasan_save_stack+0x26/0x50
   __kasan_record_aux_stack+0x92/0xa0
   kasan_record_aux_stack_noalloc+0xb/0x20
   insert_work+0x2a/0xc0
   __queue_work+0x2a6/0x8d0
   queue_work_on+0x7c/0x80
   __bpf_map_put+0x103/0x140
   bpf_map_put+0x10/0x20
   bpf_map_fd_put_ptr+0x1e/0x30
   bpf_fd_array_map_update_elem+0x18a/0x1d0
   bpf_map_update_value+0x2ca/0x4b0
   __sys_bpf+0x26ba/0x3350
   __x64_sys_bpf+0x45/0x60
   do_syscall_64+0x33/0x60
   entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/prog_tests/map_in_map.c     | 138 ++++++++++++++++++
 .../selftests/bpf/progs/access_map_in_map.c   |  99 +++++++++++++
 2 files changed, 237 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_in_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/access_map_in_map.c

diff --git a/tools/testing/selftests/bpf/prog_tests/map_in_map.c b/tools/testing/selftests/bpf/prog_tests/map_in_map.c
new file mode 100644
index 0000000000000..b60c5ac1f0222
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/map_in_map.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include "access_map_in_map.skel.h"
+
+struct thread_ctx {
+	pthread_barrier_t barrier;
+	int outer_map_fd;
+	int start, abort;
+	int loop, err;
+};
+
+static int wait_for_start_or_abort(struct thread_ctx *ctx)
+{
+	while (!ctx->start && !ctx->abort)
+		usleep(1);
+	return ctx->abort ? -1 : 0;
+}
+
+static void *update_map_fn(void *data)
+{
+	struct thread_ctx *ctx = data;
+	int loop = ctx->loop, err = 0;
+
+	if (wait_for_start_or_abort(ctx) < 0)
+		return NULL;
+
+	while (loop-- > 0) {
+		int fd, zero = 0;
+
+		fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, 4, 4, 1, NULL);
+		if (fd < 0) {
+			err |= 1;
+			pthread_barrier_wait(&ctx->barrier);
+			continue;
+		}
+
+		/* Remove the old inner map */
+		if (bpf_map_update_elem(ctx->outer_map_fd, &zero, &fd, 0) < 0)
+			err |= 2;
+		close(fd);
+		pthread_barrier_wait(&ctx->barrier);
+	}
+
+	ctx->err = err;
+
+	return NULL;
+}
+
+static void *access_map_fn(void *data)
+{
+	struct thread_ctx *ctx = data;
+	int loop = ctx->loop;
+
+	if (wait_for_start_or_abort(ctx) < 0)
+		return NULL;
+
+	while (loop-- > 0) {
+		/* Access the old inner map */
+		syscall(SYS_getpgid);
+		pthread_barrier_wait(&ctx->barrier);
+	}
+
+	return NULL;
+}
+
+static void test_map_in_map_access(const char *prog_name, const char *map_name)
+{
+	struct access_map_in_map *skel;
+	struct bpf_map *outer_map;
+	struct bpf_program *prog;
+	struct thread_ctx ctx;
+	pthread_t tid[2];
+	int err;
+
+	skel = access_map_in_map__open();
+	if (!ASSERT_OK_PTR(skel, "access_map_in_map open"))
+		return;
+
+	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+	if (!ASSERT_OK_PTR(prog, "find program"))
+		goto out;
+	bpf_program__set_autoload(prog, true);
+
+	outer_map = bpf_object__find_map_by_name(skel->obj, map_name);
+	if (!ASSERT_OK_PTR(outer_map, "find map"))
+		goto out;
+
+	err = access_map_in_map__load(skel);
+	if (!ASSERT_OK(err, "access_map_in_map load"))
+		goto out;
+
+	err = access_map_in_map__attach(skel);
+	if (!ASSERT_OK(err, "access_map_in_map attach"))
+		goto out;
+
+	skel->bss->tgid = getpid();
+
+	memset(&ctx, 0, sizeof(ctx));
+	pthread_barrier_init(&ctx.barrier, NULL, 2);
+	ctx.outer_map_fd = bpf_map__fd(outer_map);
+	ctx.loop = 8;
+
+	err = pthread_create(&tid[0], NULL, update_map_fn, &ctx);
+	if (!ASSERT_OK(err, "close_thread"))
+		goto out;
+
+	err = pthread_create(&tid[1], NULL, access_map_fn, &ctx);
+	if (!ASSERT_OK(err, "read_thread")) {
+		ctx.abort = 1;
+		pthread_join(tid[0], NULL);
+		goto out;
+	}
+
+	ctx.start = 1;
+	pthread_join(tid[0], NULL);
+	pthread_join(tid[1], NULL);
+
+	ASSERT_OK(ctx.err, "err");
+out:
+	access_map_in_map__destroy(skel);
+}
+
+void test_map_in_map(void)
+{
+	if (test__start_subtest("acc_map_in_array"))
+		test_map_in_map_access("access_map_in_array", "outer_array_map");
+	if (test__start_subtest("sleepable_acc_map_in_array"))
+		test_map_in_map_access("sleepable_access_map_in_array", "outer_array_map");
+	if (test__start_subtest("acc_map_in_htab"))
+		test_map_in_map_access("access_map_in_htab", "outer_htab_map");
+	if (test__start_subtest("sleepable_acc_map_in_htab"))
+		test_map_in_map_access("sleepable_access_map_in_htab", "outer_htab_map");
+}
diff --git a/tools/testing/selftests/bpf/progs/access_map_in_map.c b/tools/testing/selftests/bpf/progs/access_map_in_map.c
new file mode 100644
index 0000000000000..1102998ea509a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/access_map_in_map.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
+#include <linux/bpf.h>
+#include <time.h>
+#include <bpf/bpf_helpers.h>
+
+#include "bpf_misc.h"
+
+struct inner_map_type {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+	__uint(value_size, 4);
+	__uint(max_entries, 1);
+} inner_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__type(key, int);
+	__type(value, int);
+	__uint(max_entries, 1);
+	__array(values, struct inner_map_type);
+} outer_array_map SEC(".maps") = {
+	.values = {
+		[0] = &inner_map,
+	},
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
+	__type(key, int);
+	__type(value, int);
+	__uint(max_entries, 1);
+	__array(values, struct inner_map_type);
+} outer_htab_map SEC(".maps") = {
+	.values = {
+		[0] = &inner_map,
+	},
+};
+
+char _license[] SEC("license") = "GPL";
+
+int tgid = 0;
+
+static int acc_map_in_map(void *outer_map)
+{
+	void *inner_map;
+	int i, key;
+
+	if ((bpf_get_current_pid_tgid() >> 32) != tgid)
+		return 0;
+
+	/* Find nonexistent inner map */
+	key = 1;
+	inner_map = bpf_map_lookup_elem(outer_map, &key);
+	if (inner_map)
+		return 0;
+
+	/* Find the old inner map */
+	key = 0;
+	inner_map = bpf_map_lookup_elem(outer_map, &key);
+	if (!inner_map)
+		return 0;
+
+	/* Wait for the old inner map to be replaced */
+	for (i = 0; i < 256; i++) {
+		int *ptr;
+
+		ptr = bpf_map_lookup_elem(inner_map, &key);
+		if (!ptr)
+			continue;
+		*ptr = 0xdeadbeef;
+	}
+
+	return 0;
+}
+
+SEC("?kprobe/" SYS_PREFIX "sys_getpgid")
+int access_map_in_array(void *ctx)
+{
+	return acc_map_in_map(&outer_array_map);
+}
+
+SEC("?fentry.s/" SYS_PREFIX "sys_getpgid")
+int sleepable_access_map_in_array(void *ctx)
+{
+	return acc_map_in_map(&outer_array_map);
+}
+
+SEC("?kprobe/" SYS_PREFIX "sys_getpgid")
+int access_map_in_htab(void *ctx)
+{
+	return acc_map_in_map(&outer_htab_map);
+}
+
+SEC("?fentry.s/" SYS_PREFIX "sys_getpgid")
+int sleepable_access_map_in_htab(void *ctx)
+{
+	return acc_map_in_map(&outer_htab_map);
+}
-- 
2.29.2



* Re: [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
  2023-11-07 14:06 ` [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers Hou Tao
@ 2023-11-08 23:11   ` Martin KaFai Lau
  2023-11-09  3:46     ` Hou Tao
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2023-11-08 23:11 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, houtao1, bpf

On 11/7/23 6:06 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> These three bpf_map_{lookup,update,delete}_elem() helpers are also
> available for sleepable bpf program, so add the corresponding lock
> assertion for sleepable bpf program, otherwise the following warning
> will be reported when a sleepable bpf program manipulates bpf map under
> interpreter mode (aka bpf_jit_enable=0):
> 
>    WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ......
>    CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2
>    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
>    RIP: 0010:bpf_map_lookup_elem+0x54/0x60
>    ......
>    Call Trace:
>     <TASK>
>     ? __warn+0xa5/0x240
>     ? bpf_map_lookup_elem+0x54/0x60
>     ? report_bug+0x1ba/0x1f0
>     ? handle_bug+0x40/0x80
>     ? exc_invalid_op+0x18/0x50
>     ? asm_exc_invalid_op+0x1b/0x20
>     ? __pfx_bpf_map_lookup_elem+0x10/0x10
>     ? rcu_lockdep_current_cpu_online+0x65/0xb0
>     ? rcu_is_watching+0x23/0x50
>     ? bpf_map_lookup_elem+0x54/0x60
>     ? __pfx_bpf_map_lookup_elem+0x10/0x10
>     ___bpf_prog_run+0x513/0x3b70
>     __bpf_prog_run32+0x9d/0xd0
>     ? __bpf_prog_enter_sleepable_recur+0xad/0x120
>     ? __bpf_prog_enter_sleepable_recur+0x3e/0x120
>     bpf_trampoline_6442580665+0x4d/0x1000
>     __x64_sys_getpgid+0x5/0x30
>     ? do_syscall_64+0x36/0xb0
>     entry_SYSCALL_64_after_hwframe+0x6e/0x76
>     </TASK>
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>   kernel/bpf/helpers.c | 13 ++++++++-----
>   1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 56b0c1f678ee7..f43038931935e 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -32,12 +32,13 @@
>    *
>    * Different map implementations will rely on rcu in map methods
>    * lookup/update/delete, therefore eBPF programs must run under rcu lock
> - * if program is allowed to access maps, so check rcu_read_lock_held in
> - * all three functions.
> + * if program is allowed to access maps, so check rcu_read_lock_held() or
> + * rcu_read_lock_trace_held() in all three functions.
>    */
>   BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
>   {
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>   	return (unsigned long) map->ops->map_lookup_elem(map, key);
>   }
>   
> @@ -53,7 +54,8 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
>   BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
>   	   void *, value, u64, flags)
>   {
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>   	return map->ops->map_update_elem(map, key, value, flags);
>   }
>   
> @@ -70,7 +72,8 @@ const struct bpf_func_proto bpf_map_update_elem_proto = {
>   
>   BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
>   {
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());

Should these WARN_ON_ONCE be removed from the helpers instead?

For error-catching purposes, the ops->map_{lookup,update,delete}_elem
calls are inlined in the jitted case, which I believe is also the
bpf-CI setting. Meaning the above change won't help to catch errors in
the common, normal case.

>   	return map->ops->map_delete_elem(map, key);
>   }
>   



* Re: [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
  2023-11-08 23:11   ` Martin KaFai Lau
@ 2023-11-09  3:46     ` Hou Tao
  2023-11-09  7:02       ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Hou Tao @ 2023-11-09  3:46 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, houtao1, bpf

Hi,

On 11/9/2023 7:11 AM, Martin KaFai Lau wrote:
> On 11/7/23 6:06 AM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> These three bpf_map_{lookup,update,delete}_elem() helpers are also
>> available for sleepable bpf program, so add the corresponding lock
>> assertion for sleepable bpf program, otherwise the following warning
>> will be reported when a sleepable bpf program manipulates bpf map under
>> interpreter mode (aka bpf_jit_enable=0):
>>
SNIP
>>   BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
>>   {
>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>> +    WARN_ON_ONCE(!rcu_read_lock_held() &&
>> !rcu_read_lock_trace_held() &&
>> +             !rcu_read_lock_bh_held());
>>       return (unsigned long) map->ops->map_lookup_elem(map, key);
>>   }
>>   @@ -53,7 +54,8 @@ const struct bpf_func_proto
>> bpf_map_lookup_elem_proto = {
>>   BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
>>          void *, value, u64, flags)
>>   {
>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>> +    WARN_ON_ONCE(!rcu_read_lock_held() &&
>> !rcu_read_lock_trace_held() &&
>> +             !rcu_read_lock_bh_held());
>>       return map->ops->map_update_elem(map, key, value, flags);
>>   }
>>   @@ -70,7 +72,8 @@ const struct bpf_func_proto
>> bpf_map_update_elem_proto = {
>>     BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
>>   {
>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>> +    WARN_ON_ONCE(!rcu_read_lock_held() &&
>> !rcu_read_lock_trace_held() &&
>> +             !rcu_read_lock_bh_held());
>
> Should these WARN_ON_ONCE be removed from the helpers instead?
>
> For catching error purpose, the ops->map_{lookup,update,delete}_elem
> are inlined  for the jitted case which I believe is the bpf-CI setting
> also. Meaning the above change won't help to catch error in the common
> normal case.

Removing these WARN_ON_ONCE is also an option. But considering that
JIT is not available on all architectures and there is no KASAN
support under JIT, could we enable BPF interpreter mode in BPF CI to
find more potential problems?

>
>>       return map->ops->map_delete_elem(map, key);
>>   }
>>   
>
>
> .


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-07 14:06 ` [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers Hou Tao
@ 2023-11-09  6:36   ` Martin KaFai Lau
  2023-11-09  7:26     ` Hou Tao
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2023-11-09  6:36 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, houtao1, bpf

On 11/7/23 6:06 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> bpf_map_of_map_fd_get_ptr() will convert the map fd to the pointer
> saved in map-in-map. bpf_map_of_map_fd_put_ptr() will release the
> pointer saved in map-in-map. These two helpers will be used by the
> following patches to fix the use-after-free problems for map-in-map.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>   kernel/bpf/map_in_map.c | 51 +++++++++++++++++++++++++++++++++++++++++
>   kernel/bpf/map_in_map.h | 11 +++++++--
>   2 files changed, 60 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index 8323ce201159d..96e32f4167c4e 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -4,6 +4,7 @@
>   #include <linux/slab.h>
>   #include <linux/bpf.h>
>   #include <linux/btf.h>
> +#include <linux/rcupdate.h>
>   
>   #include "map_in_map.h"
>   
> @@ -139,3 +140,53 @@ u32 bpf_map_fd_sys_lookup_elem(void *ptr)
>   {
>   	return ((struct bpf_map *)ptr)->id;
>   }
> +
> +void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
> +			       int ufd)
> +{
> +	struct bpf_inner_map_element *element;
> +	struct bpf_map *inner_map;
> +
> +	element = kmalloc(sizeof(*element), GFP_KERNEL);
> +	if (!element)
> +		return ERR_PTR(-ENOMEM);
> +
> +	inner_map = bpf_map_fd_get_ptr(map, map_file, ufd);
> +	if (IS_ERR(inner_map)) {
> +		kfree(element);
> +		return inner_map;
> +	}
> +
> +	element->map = inner_map;
> +	return element;
> +}
> +
> +static void bpf_inner_map_element_free_rcu(struct rcu_head *rcu)
> +{
> +	struct bpf_inner_map_element *elem = container_of(rcu, struct bpf_inner_map_element, rcu);
> +
> +	bpf_map_put(elem->map);
> +	kfree(elem);
> +}
> +
> +static void bpf_inner_map_element_free_tt_rcu(struct rcu_head *rcu)
> +{
> +	if (rcu_trace_implies_rcu_gp())
> +		bpf_inner_map_element_free_rcu(rcu);
> +	else
> +		call_rcu(rcu, bpf_inner_map_element_free_rcu);
> +}
> +
> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
> +{
> +	struct bpf_inner_map_element *element = ptr;
> +
> +	/* Do bpf_map_put() after a RCU grace period and a tasks trace
> +	 * RCU grace period, so it is certain that the bpf program which is
> +	 * manipulating the map now has exited when bpf_map_put() is called.
> +	 */
> +	if (need_defer)

"need_defer" should only happen from the syscall cmd? Instead of adding rcu_head 
to each element, how about "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?

> +		call_rcu_tasks_trace(&element->rcu, bpf_inner_map_element_free_tt_rcu);
> +	else
> +		bpf_inner_map_element_free_rcu(&element->rcu);
> +}
> diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
> index 63872bffd9b3c..8d38496e5179b 100644
> --- a/kernel/bpf/map_in_map.h
> +++ b/kernel/bpf/map_in_map.h
> @@ -9,11 +9,18 @@
>   struct file;
>   struct bpf_map;
>   
> +struct bpf_inner_map_element {
> +	struct bpf_map *map;
> +	struct rcu_head rcu;
> +};
> +
>   struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd);
>   void bpf_map_meta_free(struct bpf_map *map_meta);
> -void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
> -			 int ufd);
> +void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
>   void bpf_map_fd_put_ptr(void *ptr, bool need_defer);
>   u32 bpf_map_fd_sys_lookup_elem(void *ptr);
>   
> +void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer);
> +
>   #endif


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
  2023-11-09  3:46     ` Hou Tao
@ 2023-11-09  7:02       ` Martin KaFai Lau
  2023-11-09  7:44         ` Hou Tao
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2023-11-09  7:02 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, houtao1, bpf

On 11/8/23 7:46 PM, Hou Tao wrote:
> Hi,
> 
> On 11/9/2023 7:11 AM, Martin KaFai Lau wrote:
>> On 11/7/23 6:06 AM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> These three bpf_map_{lookup,update,delete}_elem() helpers are also
>>> available for sleepable bpf program, so add the corresponding lock
>>> assertion for sleepable bpf program, otherwise the following warning
>>> will be reported when a sleepable bpf program manipulates bpf map under
>>> interpreter mode (aka bpf_jit_enable=0):
>>>
> SNIP
>>>    BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
>>>    {
>>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>>> +    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>>> +             !rcu_read_lock_bh_held());
>>>        return (unsigned long) map->ops->map_lookup_elem(map, key);
>>>    }
>>>    @@ -53,7 +54,8 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
>>>    BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
>>>           void *, value, u64, flags)
>>>    {
>>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>>> +    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>>> +             !rcu_read_lock_bh_held());
>>>        return map->ops->map_update_elem(map, key, value, flags);
>>>    }
>>>    @@ -70,7 +72,8 @@ const struct bpf_func_proto bpf_map_update_elem_proto = {
>>>      BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
>>>    {
>>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>>> +    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>>> +             !rcu_read_lock_bh_held());
>>
>> Should these WARN_ON_ONCE be removed from the helpers instead?
>>
>> For catching error purpose, the ops->map_{lookup,update,delete}_elem
>> are inlined  for the jitted case which I believe is the bpf-CI setting
>> also. Meaning the above change won't help to catch error in the common
>> normal case.
> 
> Removing these WARN_ON_ONCE is also an option. Considering JIT is not
> available for all architectures and there is no KASAN support in JIT,
> could we enable BPF interpreter mode in BPF CI to find more potential
> problems ?

Ah. The test in patch 11 needs JIT to be off because map_gen_lookup inlines
the code? Would it help to use bpf_map_update_elem(inner_map,...) to trigger the
issue instead?

> 
>>
>>>        return map->ops->map_delete_elem(map, key);
>>>    }
>>>    
>>
>>
>> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-09  6:36   ` Martin KaFai Lau
@ 2023-11-09  7:26     ` Hou Tao
  2023-11-09 15:55       ` Alexei Starovoitov
  0 siblings, 1 reply; 28+ messages in thread
From: Hou Tao @ 2023-11-09  7:26 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, houtao1, bpf

Hi,

On 11/9/2023 2:36 PM, Martin KaFai Lau wrote:
> On 11/7/23 6:06 AM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> bpf_map_of_map_fd_get_ptr() will convert the map fd to the pointer
>> saved in map-in-map. bpf_map_of_map_fd_put_ptr() will release the
>> pointer saved in map-in-map. These two helpers will be used by the
>> following patches to fix the use-after-free problems for map-in-map.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>   kernel/bpf/map_in_map.c | 51 +++++++++++++++++++++++++++++++++++++++++
>>   kernel/bpf/map_in_map.h | 11 +++++++--
>>   2 files changed, 60 insertions(+), 2 deletions(-)
>>
>>
SNIP
>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
>> +{
>> +    struct bpf_inner_map_element *element = ptr;
>> +
>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
>> +     * RCU grace period, so it is certain that the bpf program which is
>> +     * manipulating the map now has exited when bpf_map_put() is called.
>> +     */
>> +    if (need_defer)
>
> "need_defer" should only happen from the syscall cmd? Instead of
> adding rcu_head to each element, how about
> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?

No. I tried that method before, but it didn't work due to a dead-lock
(I will mention that in the commit message in v2). The reason is that a
bpf syscall program may also do a map update through the sys_bpf helper.
Because a bpf syscall program runs in a sleepable context with
rcu_read_lock_trace held, calling synchronize_rcu_mult(call_rcu,
call_rcu_tasks) there will lead to a dead-lock.
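
For illustration, here is a minimal sketch of the problematic pattern
(the function is hypothetical; as the follow-ups pin down, it is the
tasks-trace wait that self-deadlocks):

#include <linux/rcupdate_trace.h>
#include <linux/rcupdate_wait.h>

/* Sketch only: the kernel holds a tasks-trace RCU read-side critical
 * section across the whole run of a sleepable bpf syscall program, so
 * synchronously waiting for a tasks-trace grace period from inside the
 * fd-map update path can never complete.
 */
static void sleepable_syscall_prog_run_sketch(void)
{
	rcu_read_lock_trace();	/* held across the program run */

	/* a map update via the bpf_sys_bpf helper would end up waiting
	 * here; the grace period cannot end before rcu_read_unlock_trace()
	 * below, which in turn cannot run before this call returns:
	 * dead-lock
	 */
	synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);

	rcu_read_unlock_trace();
}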
>
>> +        call_rcu_tasks_trace(&element->rcu, bpf_inner_map_element_free_tt_rcu);
>> +    else
>> +        bpf_inner_map_element_free_rcu(&element->rcu);
>> +}
>> diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
>> index 63872bffd9b3c..8d38496e5179b 100644
>> --- a/kernel/bpf/map_in_map.h
>> +++ b/kernel/bpf/map_in_map.h
>> @@ -9,11 +9,18 @@
>>   struct file;
>>   struct bpf_map;
>>   +struct bpf_inner_map_element {
>> +    struct bpf_map *map;
>> +    struct rcu_head rcu;
>> +};
>> +
>>   struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd);
>>   void bpf_map_meta_free(struct bpf_map *map_meta);
>> -void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
>> -             int ufd);
>> +void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
>>   void bpf_map_fd_put_ptr(void *ptr, bool need_defer);
>>   u32 bpf_map_fd_sys_lookup_elem(void *ptr);
>> +void *bpf_map_of_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer);
>> +
>>   #endif


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
  2023-11-09  7:02       ` Martin KaFai Lau
@ 2023-11-09  7:44         ` Hou Tao
  0 siblings, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-09  7:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, houtao1, bpf

Hi,

On 11/9/2023 3:02 PM, Martin KaFai Lau wrote:
> On 11/8/23 7:46 PM, Hou Tao wrote:
>> Hi,
>>
>> On 11/9/2023 7:11 AM, Martin KaFai Lau wrote:
>>> On 11/7/23 6:06 AM, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> These three bpf_map_{lookup,update,delete}_elem() helpers are also
>>>> available for sleepable bpf program, so add the corresponding lock
>>>> assertion for sleepable bpf program, otherwise the following warning
>>>> will be reported when a sleepable bpf program manipulates bpf map
>>>> under
>>>> interpreter mode (aka bpf_jit_enable=0):
>>>>
>> SNIP
>>>>    BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
>>>>    {
>>>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>>>> +    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>>>> +             !rcu_read_lock_bh_held());
>>>>        return (unsigned long) map->ops->map_lookup_elem(map, key);
>>>>    }
>>>>    @@ -53,7 +54,8 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
>>>>    BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
>>>>           void *, value, u64, flags)
>>>>    {
>>>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>>>> +    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>>>> +             !rcu_read_lock_bh_held());
>>>>        return map->ops->map_update_elem(map, key, value, flags);
>>>>    }
>>>>    @@ -70,7 +72,8 @@ const struct bpf_func_proto bpf_map_update_elem_proto = {
>>>>      BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
>>>>    {
>>>> -    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>>>> +    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>>>> +             !rcu_read_lock_bh_held());
>>>
>>> Should these WARN_ON_ONCE be removed from the helpers instead?
>>>
>>> For catching error purpose, the ops->map_{lookup,update,delete}_elem
>>> are inlined  for the jitted case which I believe is the bpf-CI setting
>>> also. Meaning the above change won't help to catch error in the common
>>> normal case.
>>
>> Removing these WARN_ON_ONCE is also an option. Considering JIT is not
>> available for all architectures and there is no KASAN support in JIT,
>> could we enable BPF interpreter mode in BPF CI to find more potential
>> problems ?
>
> ah. The test in patch 11 needs jit to be off because the
> map_gen_lookup inlined the code? Would it help to use
> bpf_map_update_elem(inner_map,...) to trigger the issue instead?

Er, I didn't consider that before, but you are right.
bpf_map_lookup_elem(inner_map) is inlined by the verifier. I think using
bpf_map_update_elem() may be able to reproduce the issue under JIT mode.
I will try it later.
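
A sketch of that idea (the map layout and attach point below are
illustrative, not the actual selftest from patch 11):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct inner_map_type {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, int);
} inner_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1);
	__type(key, int);
	__array(values, struct inner_map_type);
} outer_map SEC(".maps") = {
	.values = { &inner_map },
};

SEC("fentry.s/bpf_fentry_test1")
int BPF_PROG(access_inner_map, int a)
{
	int key = 0, value = 1;
	void *inner;

	inner = bpf_map_lookup_elem(&outer_map, &key);
	if (!inner)
		return 0;
	/* the lookup above may be inlined by the verifier, but this
	 * update still goes through the generic helper under JIT
	 */
	bpf_map_update_elem(inner, &key, &value, BPF_ANY);
	return 0;
}

char _license[] SEC("license") = "GPL";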
>
>>
>>>
>>>>        return map->ops->map_delete_elem(map, key);
>>>>    }
>>>>    
>>>
>>>
>>> .
>>
>
> .


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-09  7:26     ` Hou Tao
@ 2023-11-09 15:55       ` Alexei Starovoitov
  2023-11-09 19:55         ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2023-11-09 15:55 UTC (permalink / raw)
  To: Hou Tao, Paul E. McKenney
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hou Tao, bpf

On Wed, Nov 8, 2023 at 11:26 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 11/9/2023 2:36 PM, Martin KaFai Lau wrote:
> > On 11/7/23 6:06 AM, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> bpf_map_of_map_fd_get_ptr() will convert the map fd to the pointer
> >> saved in map-in-map. bpf_map_of_map_fd_put_ptr() will release the
> >> pointer saved in map-in-map. These two helpers will be used by the
> >> following patches to fix the use-after-free problems for map-in-map.
> >>
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> >>   kernel/bpf/map_in_map.c | 51 +++++++++++++++++++++++++++++++++++++++++
> >>   kernel/bpf/map_in_map.h | 11 +++++++--
> >>   2 files changed, 60 insertions(+), 2 deletions(-)
> >>
> >>
> SNIP
> >> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
> >> +{
> >> +    struct bpf_inner_map_element *element = ptr;
> >> +
> >> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
> >> +     * RCU grace period, so it is certain that the bpf program which is
> >> +     * manipulating the map now has exited when bpf_map_put() is called.
> >> +     */
> >> +    if (need_defer)
> >
> > "need_defer" should only happen from the syscall cmd? Instead of
> > adding rcu_head to each element, how about
> > "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
>
> No. I have tried the method before, but it didn't work due to dead-lock
> (will mention that in commit message in v2). The reason is that bpf
> syscall program may also do map update through sys_bpf helper. Because
> bpf syscall program is running with sleep-able context and has
> rcu_read_lock_trace being held, so call synchronize_rcu_mult(call_rcu,
> call_rcu_tasks) will lead to dead-lock.

Dead-lock? Why?

I think it's legal to do call_rcu_tasks_trace() while inside RCU CS
or RCU tasks trace CS.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-09 15:55       ` Alexei Starovoitov
@ 2023-11-09 19:55         ` Paul E. McKenney
  2023-11-10  1:06           ` Hou Tao
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2023-11-09 19:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hou Tao, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hou Tao, bpf

On Thu, Nov 09, 2023 at 07:55:50AM -0800, Alexei Starovoitov wrote:
> On Wed, Nov 8, 2023 at 11:26 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 11/9/2023 2:36 PM, Martin KaFai Lau wrote:
> > > On 11/7/23 6:06 AM, Hou Tao wrote:
> > >> From: Hou Tao <houtao1@huawei.com>
> > >>
> > >> bpf_map_of_map_fd_get_ptr() will convert the map fd to the pointer
> > >> saved in map-in-map. bpf_map_of_map_fd_put_ptr() will release the
> > >> pointer saved in map-in-map. These two helpers will be used by the
> > >> following patches to fix the use-after-free problems for map-in-map.
> > >>
> > >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> > >> ---
> > >>   kernel/bpf/map_in_map.c | 51 +++++++++++++++++++++++++++++++++++++++++
> > >>   kernel/bpf/map_in_map.h | 11 +++++++--
> > >>   2 files changed, 60 insertions(+), 2 deletions(-)
> > >>
> > >>
> > SNIP
> > >> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
> > >> +{
> > >> +    struct bpf_inner_map_element *element = ptr;
> > >> +
> > >> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
> > >> +     * RCU grace period, so it is certain that the bpf program which is
> > >> +     * manipulating the map now has exited when bpf_map_put() is called.
> > >> +     */
> > >> +    if (need_defer)
> > >
> > > "need_defer" should only happen from the syscall cmd? Instead of
> > > adding rcu_head to each element, how about
> > > "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
> >
> > No. I have tried the method before, but it didn't work due to dead-lock
> > (will mention that in commit message in v2). The reason is that bpf
> > syscall program may also do map update through sys_bpf helper. Because
> > bpf syscall program is running with sleep-able context and has
> > rcu_read_lock_trace being held, so call synchronize_rcu_mult(call_rcu,
> > call_rcu_tasks) will lead to dead-lock.
> 
> Dead-lock? why?
> 
> I think it's legal to do call_rcu_tasks_trace() while inside RCU CS
> or RCU tasks trace CS.

Just confirming that this is the case.  If invoking call_rcu_tasks_trace()
under either rcu_read_lock() or rcu_read_lock_trace() deadlocks,
then there is a bug that needs fixing.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-09 19:55         ` Paul E. McKenney
@ 2023-11-10  1:06           ` Hou Tao
  2023-11-10  1:45             ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Hou Tao @ 2023-11-10  1:06 UTC (permalink / raw)
  To: paulmck, Alexei Starovoitov
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hou Tao, bpf

Hi,

On 11/10/2023 3:55 AM, Paul E. McKenney wrote:
> On Thu, Nov 09, 2023 at 07:55:50AM -0800, Alexei Starovoitov wrote:
>> On Wed, Nov 8, 2023 at 11:26 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 11/9/2023 2:36 PM, Martin KaFai Lau wrote:
>>>> On 11/7/23 6:06 AM, Hou Tao wrote:
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> bpf_map_of_map_fd_get_ptr() will convert the map fd to the pointer
>>>>> saved in map-in-map. bpf_map_of_map_fd_put_ptr() will release the
>>>>> pointer saved in map-in-map. These two helpers will be used by the
>>>>> following patches to fix the use-after-free problems for map-in-map.
>>>>>
>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>>> ---
>>>>>   kernel/bpf/map_in_map.c | 51 +++++++++++++++++++++++++++++++++++++++++
>>>>>   kernel/bpf/map_in_map.h | 11 +++++++--
>>>>>   2 files changed, 60 insertions(+), 2 deletions(-)
>>>>>
>>>>>
>>> SNIP
>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
>>>>> +{
>>>>> +    struct bpf_inner_map_element *element = ptr;
>>>>> +
>>>>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
>>>>> +     * RCU grace period, so it is certain that the bpf program which is
>>>>> +     * manipulating the map now has exited when bpf_map_put() is called.
>>>>> +     */
>>>>> +    if (need_defer)
>>>> "need_defer" should only happen from the syscall cmd? Instead of
>>>> adding rcu_head to each element, how about
>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
>>> No. I have tried the method before, but it didn't work due to dead-lock
>>> (will mention that in commit message in v2). The reason is that bpf
>>> syscall program may also do map update through sys_bpf helper. Because
>>> bpf syscall program is running with sleep-able context and has
>>> rcu_read_lock_trace being held, so call synchronize_rcu_mult(call_rcu,
>>> call_rcu_tasks) will lead to dead-lock.
>> Dead-lock? why?
>>
>> I think it's legal to do call_rcu_tasks_trace() while inside RCU CS
>> or RCU tasks trace CS.
> Just confirming that this is the case.  If invoking call_rcu_tasks_trace()
> within under either rcu_read_lock() or rcu_read_lock_trace() deadlocks,
> then there is a bug that needs fixing.  ;-)

The dead-lock case is calling synchronize_rcu_mult(call_rcu,
call_rcu_tasks_trace) under rcu_read_lock_trace(), and I think that is
expected. Calling call_rcu_tasks_trace() with rcu_read_lock_trace() held
is OK.
>
> 							Thanx, Paul
>
> .


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-10  1:06           ` Hou Tao
@ 2023-11-10  1:45             ` Paul E. McKenney
  2023-11-10  2:37               ` Hou Tao
  2023-11-10  2:48               ` Martin KaFai Lau
  0 siblings, 2 replies; 28+ messages in thread
From: Paul E. McKenney @ 2023-11-10  1:45 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Hou Tao, bpf

On Fri, Nov 10, 2023 at 09:06:56AM +0800, Hou Tao wrote:
> Hi,
> 
> On 11/10/2023 3:55 AM, Paul E. McKenney wrote:
> > On Thu, Nov 09, 2023 at 07:55:50AM -0800, Alexei Starovoitov wrote:
> >> On Wed, Nov 8, 2023 at 11:26 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>> Hi,
> >>>
> >>> On 11/9/2023 2:36 PM, Martin KaFai Lau wrote:
> >>>> On 11/7/23 6:06 AM, Hou Tao wrote:
> >>>>> From: Hou Tao <houtao1@huawei.com>
> >>>>>
> >>>>> bpf_map_of_map_fd_get_ptr() will convert the map fd to the pointer
> >>>>> saved in map-in-map. bpf_map_of_map_fd_put_ptr() will release the
> >>>>> pointer saved in map-in-map. These two helpers will be used by the
> >>>>> following patches to fix the use-after-free problems for map-in-map.
> >>>>>
> >>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >>>>> ---
> >>>>>   kernel/bpf/map_in_map.c | 51 +++++++++++++++++++++++++++++++++++++++++
> >>>>>   kernel/bpf/map_in_map.h | 11 +++++++--
> >>>>>   2 files changed, 60 insertions(+), 2 deletions(-)
> >>>>>
> >>>>>
> >>> SNIP
> >>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
> >>>>> +{
> >>>>> +    struct bpf_inner_map_element *element = ptr;
> >>>>> +
> >>>>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
> >>>>> +     * RCU grace period, so it is certain that the bpf program which is
> >>>>>> +     * manipulating the map now has exited when bpf_map_put() is called.
> >>>>> +     */
> >>>>> +    if (need_defer)
> >>>> "need_defer" should only happen from the syscall cmd? Instead of
> >>>> adding rcu_head to each element, how about
> >>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
> >>> No. I have tried the method before, but it didn't work due to dead-lock
> >>> (will mention that in commit message in v2). The reason is that bpf
> >>> syscall program may also do map update through sys_bpf helper. Because
> >>> bpf syscall program is running with sleep-able context and has
> >>> rcu_read_lock_trace being held, so call synchronize_rcu_mult(call_rcu,
> >>> call_rcu_tasks) will lead to dead-lock.
> >> Dead-lock? why?
> >>
> >> I think it's legal to do call_rcu_tasks_trace() while inside RCU CS
> >> or RCU tasks trace CS.
> > Just confirming that this is the case.  If invoking call_rcu_tasks_trace()
> > within under either rcu_read_lock() or rcu_read_lock_trace() deadlocks,
> > then there is a bug that needs fixing.  ;-)
> 
> The case for dead-lock is that calling synchronize_rcu_mult(call_rcu,
> call_rcu_tasks_trace) within under rcu_read_lock_trace() and I think it
> is expected. The case that calling call_rcu_tasks_trace() with
> rcu_read_lock_trace() being held is OK.

Very good, you are quite right.  In this particular case, deadlock is
expected behavior.

The problem here is that synchronize_rcu_mult() doesn't just invoke its
arguments; instead, it also waits for all of the corresponding grace
periods to complete.  But if you call this while under the protection of
rcu_read_lock_trace(), then synchronize_rcu_mult(call_rcu_tasks_trace)
cannot return until the corresponding rcu_read_unlock_trace() is
reached, but that rcu_read_unlock_trace() cannot be reached until after
synchronize_rcu_mult(call_rcu_tasks_trace) returns.

(I did leave out the call_rcu argument because it does not participate
in this particular deadlock.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-10  1:45             ` Paul E. McKenney
@ 2023-11-10  2:37               ` Hou Tao
  2023-11-10  2:48               ` Martin KaFai Lau
  1 sibling, 0 replies; 28+ messages in thread
From: Hou Tao @ 2023-11-10  2:37 UTC (permalink / raw)
  To: paulmck
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Hou Tao, bpf

Hi Paul,

On 11/10/2023 9:45 AM, Paul E. McKenney wrote:
> On Fri, Nov 10, 2023 at 09:06:56AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 11/10/2023 3:55 AM, Paul E. McKenney wrote:
>>> On Thu, Nov 09, 2023 at 07:55:50AM -0800, Alexei Starovoitov wrote:
>>>> On Wed, Nov 8, 2023 at 11:26 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On 11/9/2023 2:36 PM, Martin KaFai Lau wrote:
>>>>>> On 11/7/23 6:06 AM, Hou Tao wrote:
>>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>>
>>>>>>> bpf_map_of_map_fd_get_ptr() will convert the map fd to the pointer
>>>>>>> saved in map-in-map. bpf_map_of_map_fd_put_ptr() will release the
>>>>>>> pointer saved in map-in-map. These two helpers will be used by the
>>>>>>> following patches to fix the use-after-free problems for map-in-map.
>>>>>>>
>>>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>>>>> ---
>>>>>>>   kernel/bpf/map_in_map.c | 51 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>   kernel/bpf/map_in_map.h | 11 +++++++--
>>>>>>>   2 files changed, 60 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>>
>>>>> SNIP
>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
>>>>>>> +{
>>>>>>> +    struct bpf_inner_map_element *element = ptr;
>>>>>>> +
>>>>>>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
>>>>>>> +     * RCU grace period, so it is certain that the bpf program which is
>>>>>>> +     * manipulating the map now has exited when bpf_map_put() is called.
>>>>>>> +     */
>>>>>>> +    if (need_defer)
>>>>>> "need_defer" should only happen from the syscall cmd? Instead of
>>>>>> adding rcu_head to each element, how about
>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
>>>>> No. I have tried the method before, but it didn't work due to dead-lock
>>>>> (will mention that in commit message in v2). The reason is that bpf
>>>>> syscall program may also do map update through sys_bpf helper. Because
>>>>> bpf syscall program is running with sleep-able context and has
>>>>> rcu_read_lock_trace being held, so call synchronize_rcu_mult(call_rcu,
>>>>> call_rcu_tasks) will lead to dead-lock.
>>>> Dead-lock? why?
>>>>
>>>> I think it's legal to do call_rcu_tasks_trace() while inside RCU CS
>>>> or RCU tasks trace CS.
>>> Just confirming that this is the case.  If invoking call_rcu_tasks_trace()
>>> within under either rcu_read_lock() or rcu_read_lock_trace() deadlocks,
>>> then there is a bug that needs fixing.  ;-)
>> The case for dead-lock is that calling synchronize_rcu_mult(call_rcu,
>> call_rcu_tasks_trace) within under rcu_read_lock_trace() and I think it
>> is expected. The case that calling call_rcu_tasks_trace() with
>> rcu_read_lock_trace() being held is OK.
> Very good, you are quite right.  In this particular case, deadlock is
> expected behavior.
>
> The problem here is that synchronize_rcu_mult() doesn't just invoke its
> arguments, instead, it also waits for all of the corresponding grace
> periods to complete.  But if you call this while under the protection of
> rcu_read_lock_trace(), then synchronize_rcu_mult(call_rcu_tasks_trace)
> cannot return until the corresponding rcu_read_unlock_trace() is
> reached, but that rcu_read_unlock_trace() cannot be reached until after
> synchronize_rcu_mult(call_rcu_tasks_trace) returns.
>
> (I did leave out the call_rcu argument because it does not participate
> in this particular deadlock.)

Got it. Thanks for the detailed explanation.
>
> 							Thanx, Paul
>
> .


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-10  1:45             ` Paul E. McKenney
  2023-11-10  2:37               ` Hou Tao
@ 2023-11-10  2:48               ` Martin KaFai Lau
  2023-11-10  3:34                 ` Hou Tao
  1 sibling, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2023-11-10  2:48 UTC (permalink / raw)
  To: Hou Tao, Hou Tao
  Cc: Alexei Starovoitov, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, bpf, paulmck

On 11/9/23 5:45 PM, Paul E. McKenney wrote:
>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
>>>>>>> +{
>>>>>>> +    struct bpf_inner_map_element *element = ptr;
>>>>>>> +
>>>>>>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
>>>>>>> +     * RCU grace period, so it is certain that the bpf program which is
>>>>>>> +     * manipulating the map now has exited when bpf_map_put() is called.
>>>>>>> +     */
>>>>>>> +    if (need_defer)
>>>>>> "need_defer" should only happen from the syscall cmd? Instead of
>>>>>> adding rcu_head to each element, how about
>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
>>>>> No. I have tried the method before, but it didn't work due to dead-lock
>>>>> (will mention that in commit message in v2). The reason is that bpf
>>>>> syscall program may also do map update through sys_bpf helper. Because
>>>>> bpf syscall program is running with sleep-able context and has
>>>>> rcu_read_lock_trace being held, so call synchronize_rcu_mult(call_rcu,
>>>>> call_rcu_tasks) will lead to dead-lock.

We need to think of a less intrusive solution; adding an rcu_head to each
element means a lookup also needs an extra dereference.

Maybe the bpf_map_{update,delete}_elem(&outer_map, ....) should not be done by
the syscall program? Which selftest does it?

Can the inner_map learn that it has been deleted from an outer map that is used
in a sleepable prog->aux->used_maps? The bpf_map_free_deferred() will then wait
for a tasks-trace GP?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-10  2:48               ` Martin KaFai Lau
@ 2023-11-10  3:34                 ` Hou Tao
  2023-11-10  4:58                   ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Hou Tao @ 2023-11-10  3:34 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, bpf, paulmck, Hou Tao

Hi Martin,

On 11/10/2023 10:48 AM, Martin KaFai Lau wrote:
> On 11/9/23 5:45 PM, Paul E. McKenney wrote:
>>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
>>>>>>>> +{
>>>>>>>> +    struct bpf_inner_map_element *element = ptr;
>>>>>>>> +
>>>>>>>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
>>>>>>>> +     * RCU grace period, so it is certain that the bpf program which is
>>>>>>>> +     * manipulating the map now has exited when bpf_map_put() is called.
>>>>>>>> +     */
>>>>>>>> +    if (need_defer)
>>>>>>> "need_defer" should only happen from the syscall cmd? Instead of
>>>>>>> adding rcu_head to each element, how about
>>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
>>>>>> No. I have tried the method before, but it didn't work due to
>>>>>> dead-lock
>>>>>> (will mention that in commit message in v2). The reason is that bpf
>>>>>> syscall program may also do map update through sys_bpf helper.
>>>>>> Because
>>>>>> bpf syscall program is running with sleep-able context and has
>>>>>> rcu_read_lock_trace being held, so call
>>>>>> synchronize_rcu_mult(call_rcu,
>>>>>> call_rcu_tasks) will lead to dead-lock.
>
> Need to think of a less intrusive solution instead of adding rcu_head
> to each element and lookup also needs an extra de-referencing.

I see.
>
> May be the bpf_map_{update,delete}_elem(&outer_map, ....) should not
> be done by the syscall program? Which selftest does it?

Now bpf_map_update_elem is allowed for the bpf_sys_bpf helper. If I
remember correctly, it was map_ptr.
>
> Can the inner_map learn that it has been deleted from an outer map
> that is used in a sleepable prog->aux->used_maps? The
> bpf_map_free_deferred() will then wait for a task_trace gp?

Considering that an inner_map may be used by multiple outer maps, the
following solution would be simpler: if the inner map has ever been deleted
from an outer map, its free must be delayed by one RCU GP and one
tasks-trace RCU GP. But I will check whether it is possible to wait for
only one RCU GP instead of two.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-10  3:34                 ` Hou Tao
@ 2023-11-10  4:58                   ` Paul E. McKenney
  2023-11-13  0:53                     ` Hou Tao
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2023-11-10  4:58 UTC (permalink / raw)
  To: Hou Tao
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, bpf, Hou Tao

On Fri, Nov 10, 2023 at 11:34:03AM +0800, Hou Tao wrote:
> Hi Martin,
> 
> On 11/10/2023 10:48 AM, Martin KaFai Lau wrote:
> > On 11/9/23 5:45 PM, Paul E. McKenney wrote:
> >>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
> >>>>>>>> +{
> >>>>>>>> +    struct bpf_inner_map_element *element = ptr;
> >>>>>>>> +
> >>>>>>>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
> >>>>>>>> +     * RCU grace period, so it is certain that the bpf program which is
> >>>>>>>> +     * manipulating the map now has exited when bpf_map_put() is called.
> >>>>>>>> +     */
> >>>>>>>> +    if (need_defer)
> >>>>>>> "need_defer" should only happen from the syscall cmd? Instead of
> >>>>>>> adding rcu_head to each element, how about
> >>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
> >>>>>> No. I have tried the method before, but it didn't work due to
> >>>>>> dead-lock
> >>>>>> (will mention that in commit message in v2). The reason is that bpf
> >>>>>> syscall program may also do map update through sys_bpf helper.
> >>>>>> Because
> >>>>>> bpf syscall program is running with sleep-able context and has
> >>>>>> rcu_read_lock_trace being held, so call
> >>>>>> synchronize_rcu_mult(call_rcu,
> >>>>>> call_rcu_tasks) will lead to dead-lock.
> >
> > Need to think of a less intrusive solution instead of adding rcu_head
> > to each element and lookup also needs an extra de-referencing.
> 
> I see.
> >
> > May be the bpf_map_{update,delete}_elem(&outer_map, ....) should not
> > be done by the syscall program? Which selftest does it?
> 
> Now bpf_map_update_elem is allowed for bpf_sys_bpf helper. If I
> remembered correctly it was map_ptr.
> >
> > Can the inner_map learn that it has been deleted from an outer map
> > that is used in a sleepable prog->aux->used_maps? The
> > bpf_map_free_deferred() will then wait for a task_trace gp?
> 
> Considering an inner_map may be used by multiple outer_map, the
> following solution will be simpler: if the inner map has been deleted
> from an outer map once, its free must be delayed after one RCU GP and
> one tasks trace RCU GP. But I will check whether it is possible to only
> wait for one RCU GP instead of two.

If you are freeing a large quantity of elements at a time, one approach
is to use a single rcu_head structure for the group.  (Or, in this case,
maybe a pair of rcu_head structures, one for call_rcu() and the other
for call_rcu_tasks_trace().)

This requires that you be able to link the elements in the group
together somehow, which requires some per-element storage, but only
one word per element instead of two.

There are other variations on this theme, depending on what constraints
apply here.
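
For illustration, a sketch of that scheme with hypothetical names: one
shared rcu_head per batch chained through both grace periods, and one
link word per element:

#include <linux/bpf.h>
#include <linux/rcupdate.h>
#include <linux/rcupdate_trace.h>
#include <linux/slab.h>

/* Hypothetical batch descriptor: elements of a group are chained through
 * a single "next" word, and the whole group shares one rcu_head that is
 * passed first to call_rcu_tasks_trace() and then chained to call_rcu().
 */
struct inner_map_elem {
	struct bpf_map *map;
	struct inner_map_elem *next;
};

struct inner_map_batch {
	struct rcu_head rcu;
	struct inner_map_elem *head;
};

static void inner_map_batch_free(struct rcu_head *rcu)
{
	struct inner_map_batch *batch =
		container_of(rcu, struct inner_map_batch, rcu);
	struct inner_map_elem *elem = batch->head;

	while (elem) {
		struct inner_map_elem *next = elem->next;

		bpf_map_put(elem->map);
		kfree(elem);
		elem = next;
	}
	kfree(batch);
}

static void inner_map_batch_free_rcu(struct rcu_head *rcu)
{
	/* the tasks-trace GP has elapsed; now wait for a normal RCU GP */
	call_rcu(rcu, inner_map_batch_free);
}

/* free path for a whole group:
 *	call_rcu_tasks_trace(&batch->rcu, inner_map_batch_free_rcu);
 */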

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-10  4:58                   ` Paul E. McKenney
@ 2023-11-13  0:53                     ` Hou Tao
  2023-11-14 12:58                       ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Hou Tao @ 2023-11-13  0:53 UTC (permalink / raw)
  To: paulmck
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, bpf, Hou Tao

Hi,

On 11/10/2023 12:58 PM, Paul E. McKenney wrote:
> On Fri, Nov 10, 2023 at 11:34:03AM +0800, Hou Tao wrote:
>> Hi Martin,
>>
>> On 11/10/2023 10:48 AM, Martin KaFai Lau wrote:
>>> On 11/9/23 5:45 PM, Paul E. McKenney wrote:
>>>>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
>>>>>>>>>> +{
>>>>>>>>>> +    struct bpf_inner_map_element *element = ptr;
>>>>>>>>>> +
>>>>>>>>>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
>>>>>>>>>> +     * RCU grace period, so it is certain that the bpf program which is
>>>>>>>>>> +     * manipulating the map now has exited when bpf_map_put() is called.
>>>>>>>>>> +     */
>>>>>>>>>> +    if (need_defer)
>>>>>>>>> "need_defer" should only happen from the syscall cmd? Instead of
>>>>>>>>> adding rcu_head to each element, how about
>>>>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
>>>>>>>> No. I have tried the method before, but it didn't work due to
>>>>>>>> dead-lock
>>>>>>>> (will mention that in commit message in v2). The reason is that bpf
>>>>>>>> syscall program may also do map update through sys_bpf helper.
>>>>>>>> Because
>>>>>>>> bpf syscall program is running with sleep-able context and has
>>>>>>>> rcu_read_lock_trace being held, so call
>>>>>>>> synchronize_rcu_mult(call_rcu,
>>>>>>>> call_rcu_tasks) will lead to dead-lock.
>>> Need to think of a less intrusive solution instead of adding rcu_head
>>> to each element and lookup also needs an extra de-referencing.
>> I see.
>>> May be the bpf_map_{update,delete}_elem(&outer_map, ....) should not
>>> be done by the syscall program? Which selftest does it?
>> Now bpf_map_update_elem is allowed for bpf_sys_bpf helper. If I
>> remembered correctly it was map_ptr.
>>> Can the inner_map learn that it has been deleted from an outer map
>>> that is used in a sleepable prog->aux->used_maps? The
>>> bpf_map_free_deferred() will then wait for a task_trace gp?
>> Considering an inner_map may be used by multiple outer_map, the
>> following solution will be simpler: if the inner map has been deleted
>> from an outer map once, its free must be delayed after one RCU GP and
>> one tasks trace RCU GP. But I will check whether it is possible to only
>> wait for one RCU GP instead of two.
> If you are freeing a large quantity of elements at a time, one approach
> is to use a single rcu_head structure for the group.  (Or, in this case,
> maybe a pair of rcu_head structures, one for call_rcu() and the other
> for call_rcu_tasks_trace().)
>
> This requires that you be able to link the elements in the group
> together somehow, which requires some per-element storage, but only
> one word per element instead of two.
>
> There are other variations on this theme, depending on what constraints
> apply here.

Thanks for your suggestions. Although there is batch update support for
inner maps, I think the inner map is updated one-by-one in most cases. And
the main concern here is the extra dereference due to the memory allocation,
so I think adding an extra flag to tell bpf_map_free_deferred() to
free the map differently may be appropriate.
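
A sketch of that flag-based idea (the field name and the simplified
teardown are illustrative, not the final patch):

#include <linux/bpf.h>
#include <linux/rcupdate_wait.h>

/* Hypothetical: mark the map when it is removed from a map-in-map with
 * need_defer, and let the deferred free path (a workqueue, so not under
 * any RCU read-side lock) wait for both grace periods before tearing the
 * map down. No extra allocation or dereference on lookup is needed.
 */
static void bpf_map_free_deferred_sketch(struct work_struct *work)
{
	struct bpf_map *map = container_of(work, struct bpf_map, work);

	if (READ_ONCE(map->free_after_mult_rcu_gp))	/* assumed new flag */
		synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);

	/* the real teardown does more than this */
	map->ops->map_free(map);
}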

Regards,
Tao
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers
  2023-11-13  0:53                     ` Hou Tao
@ 2023-11-14 12:58                       ` Paul E. McKenney
  0 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2023-11-14 12:58 UTC (permalink / raw)
  To: Hou Tao
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, bpf, Hou Tao

On Mon, Nov 13, 2023 at 08:53:06AM +0800, Hou Tao wrote:
> Hi,
> 
> On 11/10/2023 12:58 PM, Paul E. McKenney wrote:
> > On Fri, Nov 10, 2023 at 11:34:03AM +0800, Hou Tao wrote:
> >> Hi Martin,
> >>
> >> On 11/10/2023 10:48 AM, Martin KaFai Lau wrote:
> >>> On 11/9/23 5:45 PM, Paul E. McKenney wrote:
> >>>>>>>>>> +void bpf_map_of_map_fd_put_ptr(void *ptr, bool need_defer)
> >>>>>>>>>> +{
> >>>>>>>>>> +    struct bpf_inner_map_element *element = ptr;
> >>>>>>>>>> +
> >>>>>>>>>> +    /* Do bpf_map_put() after a RCU grace period and a tasks trace
> >>>>>>>>>> +     * RCU grace period, so it is certain that the bpf program which is
> >>>>>>>>>> +     * manipulating the map now has exited when bpf_map_put() is called.
> >>>>>>>>>> +     */
> >>>>>>>>>> +    if (need_defer)
> >>>>>>>>> "need_defer" should only happen from the syscall cmd? Instead of
> >>>>>>>>> adding rcu_head to each element, how about
> >>>>>>>>> "synchronize_rcu_mult(call_rcu, call_rcu_tasks)" here?
> >>>>>>>> No. I have tried the method before, but it didn't work due to
> >>>>>>>> dead-lock
> >>>>>>>> (will mention that in commit message in v2). The reason is that bpf
> >>>>>>>> syscall program may also do map update through sys_bpf helper.
> >>>>>>>> Because
> >>>>>>>> bpf syscall program is running with sleep-able context and has
> >>>>>>>> rcu_read_lock_trace being held, so call
> >>>>>>>> synchronize_rcu_mult(call_rcu,
> >>>>>>>> call_rcu_tasks) will lead to dead-lock.
> >>> Need to think of a less intrusive solution instead of adding rcu_head
> >>> to each element and lookup also needs an extra de-referencing.
> >> I see.
> >>> May be the bpf_map_{update,delete}_elem(&outer_map, ....) should not
> >>> be done by the syscall program? Which selftest does it?
> >> Now bpf_map_update_elem is allowed for bpf_sys_bpf helper. If I
> >> remembered correctly it was map_ptr.
> >>> Can the inner_map learn that it has been deleted from an outer map
> >>> that is used in a sleepable prog->aux->used_maps? The
> >>> bpf_map_free_deferred() will then wait for a task_trace gp?
> >> Considering an inner_map may be used by multiple outer_map, the
> >> following solution will be simpler: if the inner map has been deleted
> >> from an outer map once, its free must be delayed after one RCU GP and
> >> one tasks trace RCU GP. But I will check whether it is possible to only
> >> wait for one RCU GP instead of two.
> > If you are freeing a large quantity of elements at a time, one approach
> > is to use a single rcu_head structure for the group.  (Or, in this case,
> > maybe a pair of rcu_head structures, one for call_rcu() and the other
> > for call_rcu_tasks_trace().)
> >
> > This requires that you be able to link the elements in the group
> > together somehow, which requires some per-element storage, but only
> > one word per element instead of two.
> >
> > There are other variations on this theme, depending on what constraints
> > apply here.
> 
> Thanks for your suggestions. Although there are batch update support for
> inner map, but I think inner map is updated one-by-one at most case. And
> the main concern here is the extra dereference due to memory allocation,
> so I think adding extra flags to indicate bpf_mem_free_deferred() to
> free the map differently may be appropriate.

Whatever gets the job done is of course goodness, and yes, if the freeing
cannot be batched, my suggestion won't help much.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread

Thread overview: 28+ messages
2023-11-07 14:06 [PATCH bpf 00/11] bpf: Fix the release of inner map Hou Tao
2023-11-07 14:06 ` [PATCH bpf 01/11] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers Hou Tao
2023-11-08 23:11   ` Martin KaFai Lau
2023-11-09  3:46     ` Hou Tao
2023-11-09  7:02       ` Martin KaFai Lau
2023-11-09  7:44         ` Hou Tao
2023-11-07 14:06 ` [PATCH bpf 02/11] bpf: Reduce the scope of rcu_read_lock when updating fd map Hou Tao
2023-11-07 14:06 ` [PATCH bpf 03/11] bpf: Use GFP_KERNEL in bpf_event_entry_gen() Hou Tao
2023-11-07 14:06 ` [PATCH bpf 04/11] bpf: Add need_defer parameter to .map_fd_put_ptr() Hou Tao
2023-11-07 14:06 ` [PATCH bpf 05/11] bpf: Add bpf_map_of_map_fd_{get,put}_ptr() helpers Hou Tao
2023-11-09  6:36   ` Martin KaFai Lau
2023-11-09  7:26     ` Hou Tao
2023-11-09 15:55       ` Alexei Starovoitov
2023-11-09 19:55         ` Paul E. McKenney
2023-11-10  1:06           ` Hou Tao
2023-11-10  1:45             ` Paul E. McKenney
2023-11-10  2:37               ` Hou Tao
2023-11-10  2:48               ` Martin KaFai Lau
2023-11-10  3:34                 ` Hou Tao
2023-11-10  4:58                   ` Paul E. McKenney
2023-11-13  0:53                     ` Hou Tao
2023-11-14 12:58                       ` Paul E. McKenney
2023-11-07 14:06 ` [PATCH bpf 06/11] bpf: Add bpf_map_of_map_fd_sys_lookup_elem() helper Hou Tao
2023-11-07 14:06 ` [PATCH bpf 07/11] bpf: Defer bpf_map_put() for inner map in map array Hou Tao
2023-11-07 14:06 ` [PATCH bpf 08/11] bpf: Defer bpf_map_put() for inner map in map htab Hou Tao
2023-11-07 14:07 ` [PATCH bpf 09/11] bpf: Remove unused helpers for map-in-map Hou Tao
2023-11-07 14:07 ` [PATCH bpf 10/11] selftests/bpf: Remove the liveness test for inner map Hou Tao
2023-11-07 14:07 ` [PATCH bpf 11/11] selftests/bpf: Add test cases " Hou Tao
