* [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
@ 2022-12-30  4:11 Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator Hou Tao
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Hou Tao @ 2022-12-30  4:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset tries to fix problems found while checking how the htab map
handles element reuse in the bpf memory allocator. Immediate reuse of
freed elements may lead to two problems in the htab map:

(1) reuse reinitializes special fields (e.g., bpf_spin_lock) in the
    htab map value and may corrupt a concurrent lookup with the
    BPF_F_LOCK flag, which acquires the bpf spin-lock during value
    copying. The corruption of the bpf spin-lock may result in a hard
    lock-up.
(2) the lookup procedure may return an incorrect map value if the found
    element is freed and then reused.

Because all elements of an htab map have the same type, problem #1 can
be fixed by supporting a ctor in the bpf memory allocator. The ctor
initializes the special fields in a map element only when the element
is newly allocated; a reused element is not reinitialized.

Problem #2 exists for both non-preallocated and preallocated htab maps.
Adding a seq field to the htab element, checking for reuse and retrying
the lookup procedure may be a feasible solution, but it would make the
lookup API hard to use, because the user would need to check whether the
found element had been reused and repeat the lookup procedure if so (a
hypothetical sketch of this follows below). A simpler solution is to
just disable reuse of freed elements and free these elements after the
lookup procedure ends.
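
For illustration only, here is a rough sketch of what the lookup side of
such a seq-based check might look like. It is purely hypothetical: the
struct, field and function names below are not part of this series.

/* Hypothetical element layout, only to illustrate the rejected idea. */
struct htab_elem_seq {
	u64 seq;		/* made odd while the element is freed */
	char value[];
};

/* Returns false when the element was freed or reused during the copy;
 * the caller would have to redo the whole lookup in that case.
 */
static bool lookup_copy_value(struct htab_elem_seq *e, void *dst, u32 size)
{
	u64 seq = READ_ONCE(e->seq);

	if (seq & 1)
		return false;
	smp_rmb();		/* order seq read before value read */
	memcpy(dst, e->value, size);
	smp_rmb();		/* order value read before seq re-read */
	return READ_ONCE(e->seq) == seq;
}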

To reduce the overhead of calling call_rcu_tasks_trace() for each freed
element, these elements are freed in batches: freed elements are first
moved into a global per-cpu free list, and once the number of freed
elements reaches a threshold, they are moved into a dynamically
allocated object and freed by a global per-cpu worker through
call_rcu_tasks_trace().

Because the solution frees memory by allocating new memory, if no memory
is available the global per-cpu worker will call
rcu_barrier_tasks_trace() to wait for the expiration of the RCU grace
period and free the elements that have been spliced into a temporary
list. Newly freed elements will be freed after another round of
rcu_barrier_tasks_trace() if there is still no memory. It may be
necessary to reserve some bpf_ma_free_batch objects to speed up the
free. The patchset also does not yet consider the scenario in which the
RCU grace period is slow: because the newly allocated memory (aka
bpf_ma_free_batch) is only freed after the expiration of the RCU grace
period, a slow grace period may lead to too many bpf_ma_free_batch
objects being allocated.

After applying BPF_MA_NO_REUSE to the htab map, the performance of
"./map_perf_test 4 18 8192" drops from 520K to 330K events per second on
one CPU. This is a big performance degradation, so I hope to get some
feedback on whether the change is necessary and on how to better fix the
reuse problem in the htab map (globally allocated objects may have the
same problems as the htab map). Comments are always welcome.

Regards,
Hou

Hou Tao (6):
  bpf: Support ctor in bpf memory allocator
  bpf: Factor out a common helper free_llist()
  bpf: Pass bitwise flags to bpf_mem_alloc_init()
  bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator
  bpf: Use BPF_MA_NO_REUSE in htab map
  selftests/bpf: Add test case for element reuse in htab map

 include/linux/bpf_mem_alloc.h                 |  12 +-
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/hashtab.c                          |  17 +-
 kernel/bpf/memalloc.c                         | 218 ++++++++++++++++--
 .../selftests/bpf/prog_tests/htab_reuse.c     | 111 +++++++++
 .../testing/selftests/bpf/progs/htab_reuse.c  |  19 ++
 6 files changed, 353 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_reuse.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_reuse.c

-- 
2.29.2


* [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator
  2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
@ 2022-12-30  4:11 ` Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 2/6] bpf: Factor out a common helper free_llist() Hou Tao
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Hou Tao @ 2022-12-30  4:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Currently a freed element in the bpf memory allocator may be immediately
reused. For the htab map, the reuse reinitializes special fields in the
map value (e.g., bpf_spin_lock) while the lookup procedure may still be
accessing these fields, which may lead to a hard lockup as shown below:

 NMI backtrace for cpu 16
 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G             L     6.1.0+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
 RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0
 ......
 Call Trace:
  <TASK>
  copy_map_value_locked+0xb7/0x170
  bpf_map_copy_value+0x113/0x3c0
  __sys_bpf+0x1c67/0x2780
  __x64_sys_bpf+0x1c/0x20
  do_syscall_64+0x30/0x60
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 ......
  </TASK>

For the htab map, just like the preallocated case, there is no need to
initialize the special fields in the map value again once they have been
initialized, but only the bpf memory allocator knows whether or not an
allocated object is being reused. So introduce ctor support in the bpf
memory allocator and call the ctor for an allocated object only when it
is newly allocated.

Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_mem_alloc.h |  4 +++-
 kernel/bpf/core.c             |  2 +-
 kernel/bpf/hashtab.c          | 16 ++++++++++++----
 kernel/bpf/memalloc.c         | 10 +++++++++-
 4 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 3e164b8efaa9..3c287db087e7 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -12,9 +12,11 @@ struct bpf_mem_alloc {
 	struct bpf_mem_caches __percpu *caches;
 	struct bpf_mem_cache __percpu *cache;
 	struct work_struct work;
+	void (*ctor)(struct bpf_mem_alloc *ma, void *obj);
 };
 
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu,
+		       void (*ctor)(struct bpf_mem_alloc *, void *));
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7f98dec6e90f..6da2f9a6b085 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2755,7 +2755,7 @@ static int __init bpf_global_ma_init(void)
 {
 	int ret;
 
-	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
+	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false, NULL);
 	bpf_global_ma_set = !ret;
 	return ret;
 }
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 5aa2b5525f79..3d6557ec4b92 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -453,6 +453,15 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	return 0;
 }
 
+static void htab_elem_ctor(struct bpf_mem_alloc *ma, void *obj)
+{
+	struct bpf_htab *htab = container_of(ma, struct bpf_htab, ma);
+	struct htab_elem *elem = obj;
+
+	check_and_init_map_value(&htab->map,
+				 elem->key + round_up(htab->map.key_size, 8));
+}
+
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
 	bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
@@ -565,12 +574,13 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 				goto free_prealloc;
 		}
 	} else {
-		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false,
+					 htab_elem_ctor);
 		if (err)
 			goto free_map_locked;
 		if (percpu) {
 			err = bpf_mem_alloc_init(&htab->pcpu_ma,
-						 round_up(htab->map.value_size, 8), true);
+						 round_up(htab->map.value_size, 8), true, NULL);
 			if (err)
 				goto free_map_locked;
 		}
@@ -1004,8 +1014,6 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
 		}
-		check_and_init_map_value(&htab->map,
-					 l_new->key + round_up(key_size, 8));
 	}
 
 	memcpy(l_new->key, key, key_size);
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index ebcc3dd0fa19..ac5b92fece14 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -98,6 +98,7 @@ struct bpf_mem_cache {
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
 	int percpu_size;
+	struct bpf_mem_alloc *ma;
 
 	struct rcu_head rcu;
 	struct llist_head free_by_rcu;
@@ -188,6 +189,9 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 			obj = __alloc(c, node);
 			if (!obj)
 				break;
+			/* Only initialize newly allocated objects */
+			if (c->ma->ctor)
+				c->ma->ctor(c->ma, obj);
 		}
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			/* In RT irq_work runs in per-cpu kthread, so disable
@@ -374,7 +378,8 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
  * kmalloc/kfree. Max allocation size is 4096 in this case.
  * This is bpf_dynptr and bpf_kptr use case.
  */
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu,
+		       void (*ctor)(struct bpf_mem_alloc *, void *))
 {
 	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	struct bpf_mem_caches *cc, __percpu *pcc;
@@ -382,6 +387,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 	struct obj_cgroup *objcg = NULL;
 	int cpu, i, unit_size, percpu_size = 0;
 
+	ma->ctor = ctor;
 	if (size) {
 		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
 		if (!pc)
@@ -402,6 +408,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 			c->unit_size = unit_size;
 			c->objcg = objcg;
 			c->percpu_size = percpu_size;
+			c->ma = ma;
 			prefill_mem_cache(c, cpu);
 		}
 		ma->cache = pc;
@@ -424,6 +431,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 			c = &cc->cache[i];
 			c->unit_size = sizes[i];
 			c->objcg = objcg;
+			c->ma = ma;
 			prefill_mem_cache(c, cpu);
 		}
 	}
-- 
2.29.2


* [RFC PATCH bpf-next 2/6] bpf: Factor out a common helper free_llist()
  2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator Hou Tao
@ 2022-12-30  4:11 ` Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 3/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Hou Tao @ 2022-12-30  4:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Factor out a common helper free_llist() to free normal elements or
per-cpu elements on a lock-less list.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index ac5b92fece14..3ad2e25946b5 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -217,9 +217,9 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	mem_cgroup_put(memcg);
 }
 
-static void free_one(struct bpf_mem_cache *c, void *obj)
+static void free_one(void *obj, bool percpu)
 {
-	if (c->percpu_size) {
+	if (percpu) {
 		free_percpu(((void **)obj)[1]);
 		kfree(obj);
 		return;
@@ -228,14 +228,19 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
 	kfree(obj);
 }
 
-static void __free_rcu(struct rcu_head *head)
+static void free_llist(struct llist_node *llnode, bool percpu)
 {
-	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
-	struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
 	struct llist_node *pos, *t;
 
 	llist_for_each_safe(pos, t, llnode)
-		free_one(c, pos);
+		free_one(pos, percpu);
+}
+
+static void __free_rcu(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+
+	free_llist(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
 	atomic_set(&c->call_rcu_in_progress, 0);
 }
 
@@ -441,7 +446,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu,
 
 static void drain_mem_cache(struct bpf_mem_cache *c)
 {
-	struct llist_node *llnode, *t;
+	bool percpu = !!c->percpu_size;
 
 	/* No progs are using this bpf_mem_cache, but htab_map_free() called
 	 * bpf_mem_cache_free() for all remaining elements and they can be in
@@ -450,14 +455,10 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	 * Except for waiting_for_gp list, there are no concurrent operations
 	 * on these lists, so it is safe to use __llist_del_all().
 	 */
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
-		free_one(c, llnode);
+	free_llist(__llist_del_all(&c->free_by_rcu), percpu);
+	free_llist(llist_del_all(&c->waiting_for_gp), percpu);
+	free_llist(__llist_del_all(&c->free_llist), percpu);
+	free_llist(__llist_del_all(&c->free_llist_extra), percpu);
 }
 
 static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
-- 
2.29.2


* [RFC PATCH bpf-next 3/6] bpf: Pass bitwise flags to bpf_mem_alloc_init()
  2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 2/6] bpf: Factor out a common helper free_llist() Hou Tao
@ 2022-12-30  4:11 ` Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 4/6] bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator Hou Tao
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Hou Tao @ 2022-12-30  4:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Change the boolean argument of bpf_mem_alloc_init() into a bitwise
flags argument, so that more flags can be added later.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_mem_alloc.h | 8 +++++++-
 kernel/bpf/core.c             | 2 +-
 kernel/bpf/hashtab.c          | 5 +++--
 kernel/bpf/memalloc.c         | 4 +++-
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 3c287db087e7..b9f6b9155fa5 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -13,9 +13,15 @@ struct bpf_mem_alloc {
 	struct bpf_mem_cache __percpu *cache;
 	struct work_struct work;
 	void (*ctor)(struct bpf_mem_alloc *ma, void *obj);
+	unsigned int flags;
 };
 
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu,
+/* flags for bpf_mem_alloc_init() */
+enum {
+	BPF_MA_PERCPU = 1,
+};
+
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags,
 		       void (*ctor)(struct bpf_mem_alloc *, void *));
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 6da2f9a6b085..ca9a698c3f08 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2755,7 +2755,7 @@ static int __init bpf_global_ma_init(void)
 {
 	int ret;
 
-	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false, NULL);
+	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, 0, NULL);
 	bpf_global_ma_set = !ret;
 	return ret;
 }
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 3d6557ec4b92..623111d4276d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -574,13 +574,14 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 				goto free_prealloc;
 		}
 	} else {
-		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false,
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, 0,
 					 htab_elem_ctor);
 		if (err)
 			goto free_map_locked;
 		if (percpu) {
 			err = bpf_mem_alloc_init(&htab->pcpu_ma,
-						 round_up(htab->map.value_size, 8), true, NULL);
+						 round_up(htab->map.value_size, 8),
+						 BPF_MA_PERCPU, NULL);
 			if (err)
 				goto free_map_locked;
 		}
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 3ad2e25946b5..454c86596111 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -383,7 +383,7 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
  * kmalloc/kfree. Max allocation size is 4096 in this case.
  * This is bpf_dynptr and bpf_kptr use case.
  */
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu,
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags,
 		       void (*ctor)(struct bpf_mem_alloc *, void *))
 {
 	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
@@ -391,7 +391,9 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu,
 	struct bpf_mem_cache *c, __percpu *pc;
 	struct obj_cgroup *objcg = NULL;
 	int cpu, i, unit_size, percpu_size = 0;
+	bool percpu = (flags & BPF_MA_PERCPU);
 
+	ma->flags = flags;
 	ma->ctor = ctor;
 	if (size) {
 		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
-- 
2.29.2


* [RFC PATCH bpf-next 4/6] bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator
  2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
                   ` (2 preceding siblings ...)
  2022-12-30  4:11 ` [RFC PATCH bpf-next 3/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
@ 2022-12-30  4:11 ` Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 5/6] bpf: Use BPF_MA_NO_REUSE in htab map Hou Tao
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Hou Tao @ 2022-12-30  4:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Currently a freed element in the bpf memory allocator may be reused by a
new allocation, and the reuse may lead to two problems. One problem is
that the lookup procedure may get an incorrect result if the found
element is freed and then reused. The other problem is that the lookup
procedure may still be using special fields in the map value or
allocated object while these special fields are reinitialized by a new
allocation. The latter problem can be mitigated by using a ctor in the
bpf memory allocator, but that only works for the case in which all
elements have the same type.

So introduce BPF_MA_NO_REUSE to disable immediate reuse of freed
elements. These freed elements are moved into a global per-cpu free list
instead. After the number of freed elements reaches a threshold, they
are moved into a dynamically allocated object and freed by a global
per-cpu worker through call_rcu_tasks_trace().

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_mem_alloc.h |   2 +
 kernel/bpf/memalloc.c         | 175 +++++++++++++++++++++++++++++++++-
 2 files changed, 173 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index b9f6b9155fa5..2a10b721832d 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -19,6 +19,8 @@ struct bpf_mem_alloc {
 /* flags for bpf_mem_alloc_init() */
 enum {
 	BPF_MA_PERCPU = 1,
+	/* Don't reuse freed elements during allocation */
+	BPF_MA_NO_REUSE = 2,
 };
 
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags,
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 454c86596111..e5eaf765624b 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -35,6 +35,23 @@
  */
 #define LLIST_NODE_SZ sizeof(struct llist_node)
 
+#define BPF_MA_FREE_TYPE_NR 2
+
+struct bpf_ma_free_context {
+	raw_spinlock_t lock;
+	local_t active;
+	/* For both no per-cpu and per-cpu */
+	struct llist_head to_free[BPF_MA_FREE_TYPE_NR];
+	unsigned int to_free_cnt[BPF_MA_FREE_TYPE_NR];
+	struct llist_head to_free_extra[BPF_MA_FREE_TYPE_NR];
+	struct delayed_work dwork;
+};
+
+struct bpf_ma_free_batch {
+	struct rcu_head rcu;
+	struct llist_node *to_free[BPF_MA_FREE_TYPE_NR];
+};
+
 /* similar to kmalloc, but sizeof == 8 bucket is gone */
 static u8 size_index[24] __ro_after_init = {
 	3,	/* 8 */
@@ -63,6 +80,9 @@ static u8 size_index[24] __ro_after_init = {
 	2	/* 192 */
 };
 
+static DEFINE_PER_CPU(struct bpf_ma_free_context, percpu_free_ctx);
+static struct workqueue_struct *bpf_ma_free_wq;
+
 static int bpf_mem_cache_idx(size_t size)
 {
 	if (!size || size > 4096)
@@ -609,14 +629,11 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
  * add it to the free_llist of the current cpu.
  * Let kfree() logic deal with it when it's later called from irq_work.
  */
-static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
+static void notrace reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
 {
-	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
 	unsigned long flags;
 	int cnt = 0;
 
-	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
-
 	local_irq_save(flags);
 	if (local_inc_return(&c->active) == 1) {
 		__llist_add(llnode, &c->free_llist);
@@ -638,6 +655,137 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 		irq_work_raise(c);
 }
 
+static void batch_free_rcu(struct rcu_head *rcu)
+{
+	struct bpf_ma_free_batch *batch = container_of(rcu, struct bpf_ma_free_batch, rcu);
+
+	free_llist(batch->to_free[0], false);
+	free_llist(batch->to_free[1], true);
+	kfree(batch);
+}
+
+static void batch_free_rcu_tasks_trace(struct rcu_head *rcu)
+{
+	if (rcu_trace_implies_rcu_gp())
+		batch_free_rcu(rcu);
+	else
+		call_rcu(rcu, batch_free_rcu);
+}
+
+static void bpf_ma_schedule_free_dwork(struct bpf_ma_free_context *ctx)
+{
+	long delay, left;
+	u64 to_free_cnt;
+
+	to_free_cnt = ctx->to_free_cnt[0] + ctx->to_free_cnt[1];
+	delay = to_free_cnt >= 256 ? 1 : HZ;
+	if (delayed_work_pending(&ctx->dwork)) {
+		left = ctx->dwork.timer.expires - jiffies;
+		if (delay < left)
+			mod_delayed_work(bpf_ma_free_wq, &ctx->dwork, delay);
+		return;
+	}
+	queue_delayed_work(bpf_ma_free_wq, &ctx->dwork, delay);
+}
+
+static void bpf_ma_splice_to_free_list(struct bpf_ma_free_context *ctx, struct llist_node **to_free)
+{
+	struct llist_node *tmp[BPF_MA_FREE_TYPE_NR];
+	unsigned long flags;
+	unsigned int i;
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+	for (i = 0; i < ARRAY_SIZE(tmp); i++) {
+		tmp[i] = __llist_del_all(&ctx->to_free[i]);
+		ctx->to_free_cnt[i] = 0;
+	}
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	for (i = 0; i < ARRAY_SIZE(tmp); i++) {
+		struct llist_node *first, *last;
+
+		first = llist_del_all(&ctx->to_free_extra[i]);
+		if (!first) {
+			to_free[i] = tmp[i];
+			continue;
+		}
+		last = first;
+		while (last->next)
+			last = last->next;
+		to_free[i] = first;
+		last->next = tmp[i];
+	}
+}
+
+static inline bool bpf_ma_has_to_free(const struct bpf_ma_free_context *ctx)
+{
+	return !llist_empty(&ctx->to_free[0]) || !llist_empty(&ctx->to_free[1]) ||
+	       !llist_empty(&ctx->to_free_extra[0]) || !llist_empty(&ctx->to_free_extra[1]);
+}
+
+static void bpf_ma_free_dwork(struct work_struct *work)
+{
+	struct bpf_ma_free_context *ctx = container_of(to_delayed_work(work),
+						       struct bpf_ma_free_context, dwork);
+	struct llist_node *to_free[BPF_MA_FREE_TYPE_NR];
+	struct bpf_ma_free_batch *batch;
+	unsigned long flags;
+
+	bpf_ma_splice_to_free_list(ctx, to_free);
+
+	batch = kmalloc(sizeof(*batch), GFP_NOWAIT | __GFP_NOWARN);
+	if (!batch) {
+		/* TODO: handle ENOMEM case better ? */
+		rcu_barrier_tasks_trace();
+		rcu_barrier();
+		free_llist(to_free[0], false);
+		free_llist(to_free[1], true);
+		goto check;
+	}
+
+	batch->to_free[0] = to_free[0];
+	batch->to_free[1] = to_free[1];
+	call_rcu_tasks_trace(&batch->rcu, batch_free_rcu_tasks_trace);
+check:
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+	if (bpf_ma_has_to_free(ctx))
+		bpf_ma_schedule_free_dwork(ctx);
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+static void notrace direct_free(struct bpf_mem_cache *c, struct llist_node *llnode)
+{
+	struct bpf_ma_free_context *ctx;
+	bool percpu = !!c->percpu_size;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	ctx = this_cpu_ptr(&percpu_free_ctx);
+	if (local_inc_return(&ctx->active) == 1) {
+		raw_spin_lock(&ctx->lock);
+		__llist_add(llnode, &ctx->to_free[percpu]);
+		ctx->to_free_cnt[percpu] += 1;
+		bpf_ma_schedule_free_dwork(ctx);
+		raw_spin_unlock(&ctx->lock);
+	} else {
+		llist_add(llnode, &ctx->to_free_extra[percpu]);
+	}
+	local_dec(&ctx->active);
+	local_irq_restore(flags);
+}
+
+static inline void unit_free(struct bpf_mem_cache *c, void *ptr)
+{
+	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
+
+	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
+
+	if (c->ma->flags & BPF_MA_NO_REUSE)
+		direct_free(c, llnode);
+	else
+		reuse_free(c, llnode);
+}
+
 /* Called from BPF program or from sys_bpf syscall.
  * In both cases migration is disabled.
  */
@@ -686,3 +834,22 @@ void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
 
 	unit_free(this_cpu_ptr(ma->cache), ptr);
 }
+
+static int __init bpf_ma_init(void)
+{
+	int cpu;
+
+	bpf_ma_free_wq = alloc_workqueue("bpf_ma_free", WQ_MEM_RECLAIM, 0);
+	BUG_ON(!bpf_ma_free_wq);
+
+	for_each_possible_cpu(cpu) {
+		struct bpf_ma_free_context *ctx;
+
+		ctx = per_cpu_ptr(&percpu_free_ctx, cpu);
+		raw_spin_lock_init(&ctx->lock);
+		INIT_DELAYED_WORK(&ctx->dwork, bpf_ma_free_dwork);
+	}
+
+	return 0;
+}
+fs_initcall(bpf_ma_init);
-- 
2.29.2


* [RFC PATCH bpf-next 5/6] bpf: Use BPF_MA_NO_REUSE in htab map
  2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
                   ` (3 preceding siblings ...)
  2022-12-30  4:11 ` [RFC PATCH bpf-next 4/6] bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator Hou Tao
@ 2022-12-30  4:11 ` Hou Tao
  2022-12-30  4:11 ` [RFC PATCH bpf-next 6/6] selftests/bpf: Add test case for element reuse " Hou Tao
  2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
  6 siblings, 0 replies; 35+ messages in thread
From: Hou Tao @ 2022-12-30  4:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Use BPF_MA_NO_REUSE in the htab map to disable the immediate reuse of
freed elements, so the lookup procedure will not return an incorrect
result.

After the change, the performance of "./map_perf_test 4 18 8192" drops
from 520K to 330K events per second on one CPU.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 623111d4276d..e1636c5d0051 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -574,14 +574,14 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 				goto free_prealloc;
 		}
 	} else {
-		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, 0,
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, BPF_MA_NO_REUSE,
 					 htab_elem_ctor);
 		if (err)
 			goto free_map_locked;
 		if (percpu) {
 			err = bpf_mem_alloc_init(&htab->pcpu_ma,
 						 round_up(htab->map.value_size, 8),
-						 BPF_MA_PERCPU, NULL);
+						 BPF_MA_PERCPU | BPF_MA_NO_REUSE, NULL);
 			if (err)
 				goto free_map_locked;
 		}
-- 
2.29.2


* [RFC PATCH bpf-next 6/6] selftests/bpf: Add test case for element reuse in htab map
  2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
                   ` (4 preceding siblings ...)
  2022-12-30  4:11 ` [RFC PATCH bpf-next 5/6] bpf: Use BPF_MA_NO_REUSE in htab map Hou Tao
@ 2022-12-30  4:11 ` Hou Tao
  2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
  6 siblings, 0 replies; 35+ messages in thread
From: Hou Tao @ 2022-12-30  4:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

The immediate reuse of freed htab elements can lead to two problems:
1) lookup may get an unexpected map value if the found element is
deleted and then reused.
2) the reinitialization of the spin-lock in the map value after reuse
may corrupt a lookup with the BPF_F_LOCK flag and result in a hard
lock-up.

So add a test case to demonstrate these two problems.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/prog_tests/htab_reuse.c     | 111 ++++++++++++++++++
 .../testing/selftests/bpf/progs/htab_reuse.c  |  19 +++
 2 files changed, 130 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_reuse.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_reuse.c

diff --git a/tools/testing/selftests/bpf/prog_tests/htab_reuse.c b/tools/testing/selftests/bpf/prog_tests/htab_reuse.c
new file mode 100644
index 000000000000..995972958d1d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/htab_reuse.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdbool.h>
+#include <test_progs.h>
+#include "htab_reuse.skel.h"
+
+struct htab_op_ctx {
+	int fd;
+	int loop;
+	bool stop;
+};
+
+struct htab_val {
+	unsigned int lock;
+	unsigned int data;
+};
+
+static void *htab_lookup_fn(void *arg)
+{
+	struct htab_op_ctx *ctx = arg;
+	int ret = 0, i = 0;
+
+	while (i++ < ctx->loop && !ctx->stop) {
+		struct htab_val value;
+		unsigned int key;
+		int err;
+
+		/*
+		 * Lookup with BPF_F_LOCK to take the spin-lock in the map
+		 * value, and also check whether an unexpected value is returned.
+		 */
+		key = 7;
+		err = bpf_map_lookup_elem_flags(ctx->fd, &key, &value, BPF_F_LOCK);
+		if (!err && key != value.data)
+			ret = EINVAL;
+	}
+
+	return (void *)(long)ret;
+}
+
+static void *htab_update_fn(void *arg)
+{
+	struct htab_op_ctx *ctx = arg;
+	int i = 0;
+
+	while (i++ < ctx->loop && !ctx->stop) {
+		struct htab_val value;
+		unsigned int key;
+
+		key = 7;
+		value.lock = 0;
+		value.data = key;
+		bpf_map_update_elem(ctx->fd, &key, &value, BPF_F_LOCK);
+		bpf_map_delete_elem(ctx->fd, &key);
+
+		key = 24;
+		value.lock = 0;
+		value.data = key;
+		bpf_map_update_elem(ctx->fd, &key, &value, BPF_F_LOCK);
+		bpf_map_delete_elem(ctx->fd, &key);
+	}
+
+	return NULL;
+}
+
+void test_htab_reuse(void)
+{
+	unsigned int i, wr_nr = 1, rd_nr = 4;
+	pthread_t tids[wr_nr + rd_nr];
+	struct htab_reuse *skel;
+	struct htab_op_ctx ctx;
+	int err;
+
+	skel = htab_reuse__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "htab_reuse__open_and_load"))
+		return;
+
+	ctx.fd = bpf_map__fd(skel->maps.htab);
+	ctx.loop = 500;
+	ctx.stop = false;
+
+	memset(tids, 0, sizeof(tids));
+	for (i = 0; i < wr_nr; i++) {
+		err = pthread_create(&tids[i], NULL, htab_update_fn, &ctx);
+		if (!ASSERT_OK(err, "pthread_create")) {
+			ctx.stop = true;
+			goto reap;
+		}
+	}
+	for (i = 0; i < rd_nr; i++) {
+		err = pthread_create(&tids[i + wr_nr], NULL, htab_lookup_fn, &ctx);
+		if (!ASSERT_OK(err, "pthread_create")) {
+			ctx.stop = true;
+			goto reap;
+		}
+	}
+
+reap:
+	for (i = 0; i < wr_nr + rd_nr; i++) {
+		void *thread_err;
+
+		if (!tids[i])
+			continue;
+		thread_err = NULL;
+		pthread_join(tids[i], &thread_err);
+		ASSERT_NULL(thread_err, "thread error");
+	}
+	htab_reuse__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/htab_reuse.c b/tools/testing/selftests/bpf/progs/htab_reuse.c
new file mode 100644
index 000000000000..e6dcc70517f9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/htab_reuse.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct htab_val {
+	struct bpf_spin_lock lock;
+	unsigned int data;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 64);
+	__type(key, unsigned int);
+	__type(value, struct htab_val);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
-- 
2.29.2


* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
                   ` (5 preceding siblings ...)
  2022-12-30  4:11 ` [RFC PATCH bpf-next 6/6] selftests/bpf: Add test case for element reuse " Hou Tao
@ 2023-01-01  1:26 ` Alexei Starovoitov
  2023-01-01 18:48   ` Yonghong Song
  2023-01-03 13:40   ` Hou Tao
  6 siblings, 2 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2023-01-01  1:26 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Hi,
> 
> The patchset tries to fix the problems found when checking how htab map
> handles element reuse in bpf memory allocator. The immediate reuse of
> freed elements may lead to two problems in htab map:
> 
> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>     htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>     flag which acquires bpf-spin-lock during value copying. The
>     corruption of bpf-spin-lock may result in hard lock-up.
> (2) lookup procedure may get incorrect map value if the found element is
>     freed and then reused.
> 
> Because the type of htab map elements are the same, so problem #1 can be
> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> these special fields in map element only when the map element is newly
> allocated. If it is just a reused element, there will be no
> reinitialization.

Instead of adding the overhead of a ctor callback let's just
add __GFP_ZERO to flags in __alloc().
That will address issue 1 and will make bpf_mem_alloc behave just
like percpu_freelist, so a hashmap with BPF_F_NO_PREALLOC and the
default one will behave the same way.
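
For illustration, a rough sketch of that change (not the actual
memalloc.c code: the exact gfp flags are an assumption and the per-cpu
branch of __alloc() is omitted):

static void *__alloc(struct bpf_mem_cache *c, int node)
{
	/* Zero the object only when it is freshly allocated from slab;
	 * an object reused from the per-cpu free_llist is intentionally
	 * left as-is, matching the preallocated (percpu_freelist) case
	 * where the special fields are initialized exactly once.
	 */
	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT | __GFP_ZERO;

	return kmalloc_node(c->unit_size, flags, node);
}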

> Problem #2 exists for both non-preallocated and preallocated htab map.
> By adding seq in htab element, doing reuse check and retrying the
> lookup procedure may be a feasible solution, but it will make the
> lookup API being hard to use, because the user needs to check whether
> the found element is reused or not and repeat the lookup procedure if it
> is reused. A simpler solution would be just disabling freed elements
> reuse and freeing these elements after lookup procedure ends.

You've proposed this 'solution' twice already in the qptrie thread and
both times the answer was 'no, we cannot do this' with reasons explained.
The 3rd time the answer is still the same.
This 'issue 2' has existed in the hashmap since the very beginning, for
many years. It's a known quirk. There is nothing to fix really.

The graph apis (aka new gen data structs) with linked list and rbtree
are in active development. Soon bpf progs will be able to implement
their own hash maps with explicit bpf_rcu_read_lock. At that time the
progs will be making the trade-off between performance and the
lookup/delete race.
So please respin with just __GFP_ZERO and update patch 6
to check for lockup only.

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
@ 2023-01-01 18:48   ` Yonghong Song
  2023-01-03 13:47     ` Hou Tao
  2023-01-03 13:40   ` Hou Tao
  1 sibling, 1 reply; 35+ messages in thread
From: Yonghong Song @ 2023-01-01 18:48 UTC (permalink / raw)
  To: Alexei Starovoitov, Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1



On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Hi,
>>
>> The patchset tries to fix the problems found when checking how htab map
>> handles element reuse in bpf memory allocator. The immediate reuse of
>> freed elements may lead to two problems in htab map:
>>
>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>      htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>      flag which acquires bpf-spin-lock during value copying. The
>>      corruption of bpf-spin-lock may result in hard lock-up.
>> (2) lookup procedure may get incorrect map value if the found element is
>>      freed and then reused.
>>
>> Because the type of htab map elements are the same, so problem #1 can be
>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>> these special fields in map element only when the map element is newly
>> allocated. If it is just a reused element, there will be no
>> reinitialization.
> 
> Instead of adding the overhead of ctor callback let's just
> add __GFP_ZERO to flags in __alloc().
> That will address the issue 1 and will make bpf_mem_alloc behave just
> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> will behave the same way.

The patch
https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
tried to address a similar issue for the LRU hash table.
Maybe we need to do something similar after bpf_mem_cache_alloc() for
the hash table?


> 
>> Problem #2 exists for both non-preallocated and preallocated htab map.
>> By adding seq in htab element, doing reuse check and retrying the
>> lookup procedure may be a feasible solution, but it will make the
>> lookup API being hard to use, because the user needs to check whether
>> the found element is reused or not and repeat the lookup procedure if it
>> is reused. A simpler solution would be just disabling freed elements
>> reuse and freeing these elements after lookup procedure ends.
> 
> You've proposed this 'solution' twice already in qptrie thread and both
> times the answer was 'no, we cannot do this' with reasons explained.
> The 3rd time the answer is still the same.
> This 'issue 2' existed in hashmap since very beginning for many years.
> It's a known quirk. There is nothing to fix really.
> 
> The graph apis (aka new gen data structs) with link list and rbtree are
> in active development. Soon bpf progs will be able to implement their own
> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
> be making the trade off between performance and lookup/delete race.
> So please respin with just __GFP_ZERO and update the patch 6
> to check for lockup only.

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
  2023-01-01 18:48   ` Yonghong Song
@ 2023-01-03 13:40   ` Hou Tao
  2023-01-03 19:38     ` Alexei Starovoitov
  1 sibling, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-01-03 13:40 UTC (permalink / raw)
  To: Alexei Starovoitov, Paul E . McKenney
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

Hi,

On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Hi,
>>
>> The patchset tries to fix the problems found when checking how htab map
>> handles element reuse in bpf memory allocator. The immediate reuse of
>> freed elements may lead to two problems in htab map:
>>
>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>     htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>     flag which acquires bpf-spin-lock during value copying. The
>>     corruption of bpf-spin-lock may result in hard lock-up.
>> (2) lookup procedure may get incorrect map value if the found element is
>>     freed and then reused.
>>
>> Because the type of htab map elements are the same, so problem #1 can be
>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>> these special fields in map element only when the map element is newly
>> allocated. If it is just a reused element, there will be no
>> reinitialization.
> Instead of adding the overhead of ctor callback let's just
> add __GFP_ZERO to flags in __alloc().
> That will address the issue 1 and will make bpf_mem_alloc behave just
> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> will behave the same way.
Using __GFP_ZERO would be simpler, but the overhead of memset() for every
allocated object may be bigger than a ctor callback when the size of the
allocated object is large. And it also introduces unnecessary memory
zeroing if there is no special field in the map value.

>> Problem #2 exists for both non-preallocated and preallocated htab map.
>> By adding seq in htab element, doing reuse check and retrying the
>> lookup procedure may be a feasible solution, but it will make the
>> lookup API being hard to use, because the user needs to check whether
>> the found element is reused or not and repeat the lookup procedure if it
>> is reused. A simpler solution would be just disabling freed elements
>> reuse and freeing these elements after lookup procedure ends.
> You've proposed this 'solution' twice already in qptrie thread and both
> times the answer was 'no, we cannot do this' with reasons explained.
> The 3rd time the answer is still the same.
This time a workable demo which calls call_rcu_tasks_trace() in batches is
provided :) Also because I cannot find a better solution for the reuse
problem. But you are right: although not reusing freed elements will make
the implementation of the map simpler, the potential OOM problem is hard to
solve, especially when the RCU tasks trace grace period is slow. Hope Paul
can provide some insight into the problem.
> This 'issue 2' existed in hashmap since very beginning for many years.
> It's a known quirk. There is nothing to fix really.
Do we need to document the unexpected behavior somewhere? Because I really
didn't know anything about the quirk.
>
> The graph apis (aka new gen data structs) with link list and rbtree are
> in active development. Soon bpf progs will be able to implement their own
> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
> be making the trade off between performance and lookup/delete race.
It seems these new gen data structs also need to solve the reuse problem
because a global bpf memory allocator is used.
> So please respin with just __GFP_ZERO and update the patch 6
> to check for lockup only.
> .


* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-01 18:48   ` Yonghong Song
@ 2023-01-03 13:47     ` Hou Tao
  2023-01-04  6:10       ` Yonghong Song
  0 siblings, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-01-03 13:47 UTC (permalink / raw)
  To: Yonghong Song, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

Hi,

On 1/2/2023 2:48 AM, Yonghong Song wrote:
>
>
> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> Hi,
>>>
>>> The patchset tries to fix the problems found when checking how htab map
>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>> freed elements may lead to two problems in htab map:
>>>
>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>      htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>      flag which acquires bpf-spin-lock during value copying. The
>>>      corruption of bpf-spin-lock may result in hard lock-up.
>>> (2) lookup procedure may get incorrect map value if the found element is
>>>      freed and then reused.
>>>
>>> Because the type of htab map elements are the same, so problem #1 can be
>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>> these special fields in map element only when the map element is newly
>>> allocated. If it is just a reused element, there will be no
>>> reinitialization.
>>
>> Instead of adding the overhead of ctor callback let's just
>> add __GFP_ZERO to flags in __alloc().
>> That will address the issue 1 and will make bpf_mem_alloc behave just
>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>> will behave the same way.
>
> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
> tried to address a similar issue for lru hash table.
> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> hash table?
IMO either the ctor or __GFP_ZERO will fix the issue. Did I miss something here?
>
>
>>
>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>> By adding seq in htab element, doing reuse check and retrying the
>>> lookup procedure may be a feasible solution, but it will make the
>>> lookup API being hard to use, because the user needs to check whether
>>> the found element is reused or not and repeat the lookup procedure if it
>>> is reused. A simpler solution would be just disabling freed elements
>>> reuse and freeing these elements after lookup procedure ends.
>>
>> You've proposed this 'solution' twice already in qptrie thread and both
>> times the answer was 'no, we cannot do this' with reasons explained.
>> The 3rd time the answer is still the same.
>> This 'issue 2' existed in hashmap since very beginning for many years.
>> It's a known quirk. There is nothing to fix really.
>>
>> The graph apis (aka new gen data structs) with link list and rbtree are
>> in active development. Soon bpf progs will be able to implement their own
>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>> be making the trade off between performance and lookup/delete race.
>> So please respin with just __GFP_ZERO and update the patch 6
>> to check for lockup only.


* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-03 13:40   ` Hou Tao
@ 2023-01-03 19:38     ` Alexei Starovoitov
  2023-01-10  6:26       ` Martin KaFai Lau
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2023-01-03 19:38 UTC (permalink / raw)
  To: Hou Tao
  Cc: Paul E . McKenney, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, rcu, Hou Tao

On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
> > On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Hi,
> >>
> >> The patchset tries to fix the problems found when checking how htab map
> >> handles element reuse in bpf memory allocator. The immediate reuse of
> >> freed elements may lead to two problems in htab map:
> >>
> >> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> >>     htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> >>     flag which acquires bpf-spin-lock during value copying. The
> >>     corruption of bpf-spin-lock may result in hard lock-up.
> >> (2) lookup procedure may get incorrect map value if the found element is
> >>     freed and then reused.
> >>
> >> Because the type of htab map elements are the same, so problem #1 can be
> >> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> >> these special fields in map element only when the map element is newly
> >> allocated. If it is just a reused element, there will be no
> >> reinitialization.
> > Instead of adding the overhead of ctor callback let's just
> > add __GFP_ZERO to flags in __alloc().
> > That will address the issue 1 and will make bpf_mem_alloc behave just
> > like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> > will behave the same way.
> Use __GPF_ZERO will be simpler, but the overhead of memset() for every allocated
> object may be bigger than ctor callback when the size of allocated object is
> large. And it also introduces unnecessary memory zeroing if there is no special
> field in the map value.

Small memset is often faster than an indirect call.
I doubt that adding GFP_ZERO will have a measurable perf difference
in map_perf_test and other benchmarks.

> >> Problem #2 exists for both non-preallocated and preallocated htab map.
> >> By adding seq in htab element, doing reuse check and retrying the
> >> lookup procedure may be a feasible solution, but it will make the
> >> lookup API being hard to use, because the user needs to check whether
> >> the found element is reused or not and repeat the lookup procedure if it
> >> is reused. A simpler solution would be just disabling freed elements
> >> reuse and freeing these elements after lookup procedure ends.
> > You've proposed this 'solution' twice already in qptrie thread and both
> > times the answer was 'no, we cannot do this' with reasons explained.
> > The 3rd time the answer is still the same.
> This time a workable demo which calls call_rcu_task_trace() in batch is provided
> :) Also because I can not find a better solution for the reuse problem. But you
> are right, although don't reuse the freed element will make the implementation
> of map simpler, the potential OOM problem is hard to solve specially when RCU
> tasks trace grace period is slow. Hope Paul can provide some insight about the
> problem.

OOM is exactly the reason why we cannot do this delaying logic
in the general case. We don't control what progs do and rcu tasks trace
may take a long time.

> > This 'issue 2' existed in hashmap since very beginning for many years.
> > It's a known quirk. There is nothing to fix really.
> Do we need to document the unexpected behavior somewhere, because I really don't
> know nothing about the quirk ?

Yeah. It's not documented in Documentation/bpf/map_hash.rst.
Please send a patch to add it.

> >
> > The graph apis (aka new gen data structs) with link list and rbtree are
> > in active development. Soon bpf progs will be able to implement their own
> > hash maps with explicit bpf_rcu_read_lock. At that time the progs will
> > be making the trade off between performance and lookup/delete race.
> It seems these new gen data struct also need to solve the reuse problem because
> a global bpf memory allocator is used.

Currently the graph api is single owner and kptr_xchg is used to allow
a single cpu at a time to observe the object.
In the future we will add bpf_refcount and multi owner.
Then multiple cpus will be able to access the same object concurrently.
They will race to read/write the fields and it will be the prog's decision
to arbitrate the access.
In other words the bpf prog needs to have mechanisms to deal with reuse.

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-03 13:47     ` Hou Tao
@ 2023-01-04  6:10       ` Yonghong Song
  2023-01-04  6:30         ` Hou Tao
  0 siblings, 1 reply; 35+ messages in thread
From: Yonghong Song @ 2023-01-04  6:10 UTC (permalink / raw)
  To: Hou Tao, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1



On 1/3/23 5:47 AM, Hou Tao wrote:
> Hi,
> 
> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>
>>
>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Hi,
>>>>
>>>> The patchset tries to fix the problems found when checking how htab map
>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>> freed elements may lead to two problems in htab map:
>>>>
>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>       htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>       flag which acquires bpf-spin-lock during value copying. The
>>>>       corruption of bpf-spin-lock may result in hard lock-up.
>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>       freed and then reused.
>>>>
>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>> these special fields in map element only when the map element is newly
>>>> allocated. If it is just a reused element, there will be no
>>>> reinitialization.
>>>
>>> Instead of adding the overhead of ctor callback let's just
>>> add __GFP_ZERO to flags in __alloc().
>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>> will behave the same way.
>>
>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>> tried to address a similar issue for lru hash table.
>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>> hash table?
> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?

The following is my understanding:
in function alloc_htab_elem() (hashtab.c), we have

                if (is_map_full(htab))
                        if (!old_elem)
                                /* when map is full and update() is replacing
                                 * old element, it's ok to allocate, since
                                 * old element will be freed immediately.
                                 * Otherwise return an error
                                 */
                                return ERR_PTR(-E2BIG);
                inc_elem_count(htab);
                l_new = bpf_mem_cache_alloc(&htab->ma);
                if (!l_new) {
                        l_new = ERR_PTR(-ENOMEM);
                        goto dec_count;
                }
                check_and_init_map_value(&htab->map,
                                         l_new->key + round_up(key_size, 8));

In the above, check_and_init_map_value() intends to initialize an element
returned from bpf_mem_cache_alloc() (which could be reused from the free list).

The check_and_init_map_value() looks like below (in include/linux/bpf.h)

static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
{
         int i;

         if (!foffs)
                 return;
         for (i = 0; i < foffs->cnt; i++)
                 memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
}

static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
{
         bpf_obj_init(map->field_offs, dst);
}

IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
list_head, list_node, etc.

This is what causes problem #1 above.
Maybe I missed something?

>>
>>
>>>
>>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>>> By adding seq in htab element, doing reuse check and retrying the
>>>> lookup procedure may be a feasible solution, but it will make the
>>>> lookup API being hard to use, because the user needs to check whether
>>>> the found element is reused or not and repeat the lookup procedure if it
>>>> is reused. A simpler solution would be just disabling freed elements
>>>> reuse and freeing these elements after lookup procedure ends.
>>>
>>> You've proposed this 'solution' twice already in qptrie thread and both
>>> times the answer was 'no, we cannot do this' with reasons explained.
>>> The 3rd time the answer is still the same.
>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>> It's a known quirk. There is nothing to fix really.
>>>
>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>> in active development. Soon bpf progs will be able to implement their own
>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>> be making the trade off between performance and lookup/delete race.
>>> So please respin with just __GFP_ZERO and update the patch 6
>>> to check for lockup only.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-04  6:10       ` Yonghong Song
@ 2023-01-04  6:30         ` Hou Tao
  2023-01-04  7:14           ` Yonghong Song
  0 siblings, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-01-04  6:30 UTC (permalink / raw)
  To: Yonghong Song, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

Hi,

On 1/4/2023 2:10 PM, Yonghong Song wrote:
>
>
> On 1/3/23 5:47 AM, Hou Tao wrote:
>> Hi,
>>
>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>>
>>>
>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> Hi,
>>>>>
>>>>> The patchset tries to fix the problems found when checking how htab map
>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>>> freed elements may lead to two problems in htab map:
>>>>>
>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>>       htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>>       flag which acquires bpf-spin-lock during value copying. The
>>>>>       corruption of bpf-spin-lock may result in hard lock-up.
>>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>>       freed and then reused.
>>>>>
>>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>>> these special fields in map element only when the map element is newly
>>>>> allocated. If it is just a reused element, there will be no
>>>>> reinitialization.
>>>>
>>>> Instead of adding the overhead of ctor callback let's just
>>>> add __GFP_ZERO to flags in __alloc().
>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>>> will behave the same way.
>>>
>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>>> tried to address a similar issue for lru hash table.
>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>>> hash table?
>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
>
> The following is my understanding:
> in function alloc_htab_elem() (hashtab.c), we have
>
>                 if (is_map_full(htab))
>                         if (!old_elem)
>                                 /* when map is full and update() is replacing
>                                  * old element, it's ok to allocate, since
>                                  * old element will be freed immediately.
>                                  * Otherwise return an error
>                                  */
>                                 return ERR_PTR(-E2BIG);
>                 inc_elem_count(htab);
>                 l_new = bpf_mem_cache_alloc(&htab->ma);
>                 if (!l_new) {
>                         l_new = ERR_PTR(-ENOMEM);
>                         goto dec_count;
>                 }
>                 check_and_init_map_value(&htab->map,
>                                          l_new->key + round_up(key_size, 8));
>
> In the above check_and_init_map_value() intends to do initializing
> for an element from bpf_mem_cache_alloc (could be reused from the free list).
>
> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
>
> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> {
>         int i;
>
>         if (!foffs)
>                 return;
>         for (i = 0; i < foffs->cnt; i++)
>                 memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> }
>
> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> {
>         bpf_obj_init(map->field_offs, dst);
> }
>
> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> list_head, list_node, etc.
>
> This is the problem for above problem #1.
> Maybe I missed something?
Yes. It is exactly the problem patch #1 tries to fix. Patch #1 fixes it by
calling check_and_init_map_value() only once, for the newly-allocated element,
so if a freed element is reused, its special fields will not be zeroed again.
Are there any other cases which are not covered by this solution, or any other
similar problems in hashtab?
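
For illustration, a rough sketch of the ctor approach described above, assuming
a hypothetical c->ctor hook in the allocator (a sketch only, not the actual
patch #1; flags abbreviated):

/* memalloc.c (sketch): run the ctor only when a fresh object comes from the
 * slab; objects taken from the per-cpu free list skip it, so special fields
 * such as bpf_spin_lock survive reuse untouched.
 */
static void *__alloc(struct bpf_mem_cache *c, int node)
{
	void *obj = kmalloc_node(c->unit_size, GFP_NOWAIT | __GFP_NOWARN, node);

	if (obj && c->ctor)
		c->ctor(c->ma, obj);
	return obj;
}

/* hashtab.c (sketch): the ctor htab would register when setting up its
 * bpf_mem_alloc, doing what check_and_init_map_value() does today but only
 * once per object lifetime.
 */
static void htab_elem_ctor(struct bpf_mem_alloc *ma, void *obj)
{
	struct bpf_htab *htab = container_of(ma, struct bpf_htab, ma);
	struct htab_elem *l = obj;

	check_and_init_map_value(&htab->map,
				 l->key + round_up(htab->map.key_size, 8));
}
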
>
>>>
>>>
>>>>
>>>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>>>> By adding seq in htab element, doing reuse check and retrying the
>>>>> lookup procedure may be a feasible solution, but it will make the
>>>>> lookup API being hard to use, because the user needs to check whether
>>>>> the found element is reused or not and repeat the lookup procedure if it
>>>>> is reused. A simpler solution would be just disabling freed elements
>>>>> reuse and freeing these elements after lookup procedure ends.
>>>>
>>>> You've proposed this 'solution' twice already in qptrie thread and both
>>>> times the answer was 'no, we cannot do this' with reasons explained.
>>>> The 3rd time the answer is still the same.
>>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>>> It's a known quirk. There is nothing to fix really.
>>>>
>>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>>> in active development. Soon bpf progs will be able to implement their own
>>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>>> be making the trade off between performance and lookup/delete race.
>>>> So please respin with just __GFP_ZERO and update the patch 6
>>>> to check for lockup only.
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-04  6:30         ` Hou Tao
@ 2023-01-04  7:14           ` Yonghong Song
  2023-01-04 18:26             ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Yonghong Song @ 2023-01-04  7:14 UTC (permalink / raw)
  To: Hou Tao, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1



On 1/3/23 10:30 PM, Hou Tao wrote:
> Hi,
> 
> On 1/4/2023 2:10 PM, Yonghong Song wrote:
>>
>>
>> On 1/3/23 5:47 AM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>>>
>>>>
>>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The patchset tries to fix the problems found when checking how htab map
>>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>>>> freed elements may lead to two problems in htab map:
>>>>>>
>>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>>>        flag which acquires bpf-spin-lock during value copying. The
>>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
>>>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>>>        freed and then reused.
>>>>>>
>>>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>>>> these special fields in map element only when the map element is newly
>>>>>> allocated. If it is just a reused element, there will be no
>>>>>> reinitialization.
>>>>>
>>>>> Instead of adding the overhead of ctor callback let's just
>>>>> add __GFP_ZERO to flags in __alloc().
>>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>>>> will behave the same way.
>>>>
>>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>>>> tried to address a similar issue for lru hash table.
>>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>>>> hash table?
>>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
>>
>> The following is my understanding:
>> in function alloc_htab_elem() (hashtab.c), we have
>>
>>                  if (is_map_full(htab))
>>                          if (!old_elem)
>>                                  /* when map is full and update() is replacing
>>                                   * old element, it's ok to allocate, since
>>                                   * old element will be freed immediately.
>>                                   * Otherwise return an error
>>                                   */
>>                                  return ERR_PTR(-E2BIG);
>>                  inc_elem_count(htab);
>>                  l_new = bpf_mem_cache_alloc(&htab->ma);
>>                  if (!l_new) {
>>                          l_new = ERR_PTR(-ENOMEM);
>>                          goto dec_count;
>>                  }
>>                  check_and_init_map_value(&htab->map,
>>                                           l_new->key + round_up(key_size, 8));
>>
>> In the above check_and_init_map_value() intends to do initializing
>> for an element from bpf_mem_cache_alloc (could be reused from the free list).
>>
>> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
>>
>> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
>> {
>>          int i;
>>
>>          if (!foffs)
>>                  return;
>>          for (i = 0; i < foffs->cnt; i++)
>>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
>> }
>>
>> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>> {
>>          bpf_obj_init(map->field_offs, dst);
>> }
>>
>> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
>> list_head, list_node, etc.
>>
>> This is the problem for above problem #1.
>> Maybe I missed something?
> Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> problem by only calling check_and_init_map_value() once for the newly-allocated
> element, so if a freed element is reused, its special fields will not be zeroed
> again. Is there any other cases which are not covered by the solution or any
> other similar problems in hash-tab ?

No, I checked all cases of check_and_init_map_value() and didn't find
any other instances.

>>
>>>>
>>>>
>>>>>
>>>>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>>>>> By adding seq in htab element, doing reuse check and retrying the
>>>>>> lookup procedure may be a feasible solution, but it will make the
>>>>>> lookup API being hard to use, because the user needs to check whether
>>>>>> the found element is reused or not and repeat the lookup procedure if it
>>>>>> is reused. A simpler solution would be just disabling freed elements
>>>>>> reuse and freeing these elements after lookup procedure ends.
>>>>>
>>>>> You've proposed this 'solution' twice already in qptrie thread and both
>>>>> times the answer was 'no, we cannot do this' with reasons explained.
>>>>> The 3rd time the answer is still the same.
>>>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>>>> It's a known quirk. There is nothing to fix really.
>>>>>
>>>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>>>> in active development. Soon bpf progs will be able to implement their own
>>>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>>>> be making the trade off between performance and lookup/delete race.
>>>>> So please respin with just __GFP_ZERO and update the patch 6
>>>>> to check for lockup only.
>>>
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-04  7:14           ` Yonghong Song
@ 2023-01-04 18:26             ` Alexei Starovoitov
  2023-02-10 16:32               ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2023-01-04 18:26 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Paul E . McKenney, rcu, Hou Tao

On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
>
>
>
> On 1/3/23 10:30 PM, Hou Tao wrote:
> > Hi,
> >
> > On 1/4/2023 2:10 PM, Yonghong Song wrote:
> >>
> >>
> >> On 1/3/23 5:47 AM, Hou Tao wrote:
> >>> Hi,
> >>>
> >>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
> >>>>
> >>>>
> >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> >>>>>> From: Hou Tao <houtao1@huawei.com>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> The patchset tries to fix the problems found when checking how htab map
> >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
> >>>>>> freed elements may lead to two problems in htab map:
> >>>>>>
> >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> >>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> >>>>>>        flag which acquires bpf-spin-lock during value copying. The
> >>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
> >>>>>> (2) lookup procedure may get incorrect map value if the found element is
> >>>>>>        freed and then reused.
> >>>>>>
> >>>>>> Because the type of htab map elements are the same, so problem #1 can be
> >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> >>>>>> these special fields in map element only when the map element is newly
> >>>>>> allocated. If it is just a reused element, there will be no
> >>>>>> reinitialization.
> >>>>>
> >>>>> Instead of adding the overhead of ctor callback let's just
> >>>>> add __GFP_ZERO to flags in __alloc().
> >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
> >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> >>>>> will behave the same way.
> >>>>
> >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
> >>>> tried to address a similar issue for lru hash table.
> >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> >>>> hash table?
> >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
> >>
> >> The following is my understanding:
> >> in function alloc_htab_elem() (hashtab.c), we have
> >>
> >>                  if (is_map_full(htab))
> >>                          if (!old_elem)
> >>                                  /* when map is full and update() is replacing
> >>                                   * old element, it's ok to allocate, since
> >>                                   * old element will be freed immediately.
> >>                                   * Otherwise return an error
> >>                                   */
> >>                                  return ERR_PTR(-E2BIG);
> >>                  inc_elem_count(htab);
> >>                  l_new = bpf_mem_cache_alloc(&htab->ma);
> >>                  if (!l_new) {
> >>                          l_new = ERR_PTR(-ENOMEM);
> >>                          goto dec_count;
> >>                  }
> >>                  check_and_init_map_value(&htab->map,
> >>                                           l_new->key + round_up(key_size, 8));
> >>
> >> In the above check_and_init_map_value() intends to do initializing
> >> for an element from bpf_mem_cache_alloc (could be reused from the free list).
> >>
> >> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
> >>
> >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> >> {
> >>          int i;
> >>
> >>          if (!foffs)
> >>                  return;
> >>          for (i = 0; i < foffs->cnt; i++)
> >>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> >> }
> >>
> >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> >> {
> >>          bpf_obj_init(map->field_offs, dst);
> >> }
> >>
> >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> >> list_head, list_node, etc.
> >>
> >> This is the problem for above problem #1.
> >> Maybe I missed something?
> > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> > problem by only calling check_and_init_map_value() once for the newly-allocated
> > element, so if a freed element is reused, its special fields will not be zeroed
> > again. Is there any other cases which are not covered by the solution or any
> > other similar problems in hash-tab ?
>
> No, I checked all cases of check_and_init_map_value() and didn't find
> any other instances.

check_and_init_map_value() is called in two other cases:
lookup_and_delete[_batch].
There the zeroing of the fields is necessary because the 'value'
is a temp buffer that is going to be copied to user space.
I think the way forward is to add GFP_ZERO to mem_alloc
(to make it equivalent to prealloc), remove one case
of check_and_init_map_value from hashmap, add short comments
to two other cases and add a big comment to check_and_init_map_value()
that should say that 'dst' must be a temp buffer and should not
point to memory that could be used in parallel by a bpf prog.
It feels like we've dealt with this issue a couple times already
and keep repeating this mistake, so the more comments the better.
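
For illustration, a rough sketch of that direction; the exact flag plumbing in
kernel/bpf/memalloc.c may differ, so treat it as a sketch of the idea rather
than the final patch:

/* kernel/bpf/memalloc.c (sketch): zero every unit coming from the slab so the
 * special fields (bpf_spin_lock, bpf_timer, list/rbtree nodes) start out
 * initialized, matching the preallocated percpu_freelist path.
 */
static void *__alloc(struct bpf_mem_cache *c, int node)
{
	return kmalloc_node(c->unit_size,
			    GFP_NOWAIT | __GFP_NOWARN | __GFP_ZERO, node);
}

/* include/linux/bpf.h (sketch of the comment): 'dst' must be a temp buffer
 * (e.g. the value that lookup_and_delete[_batch] copies to user space) and
 * must not point to memory that a bpf prog could be using in parallel,
 * otherwise an in-use bpf_spin_lock would be wiped.
 */
static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
{
	bpf_obj_init(map->field_offs, dst);
}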

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-03 19:38     ` Alexei Starovoitov
@ 2023-01-10  6:26       ` Martin KaFai Lau
  0 siblings, 0 replies; 35+ messages in thread
From: Martin KaFai Lau @ 2023-01-10  6:26 UTC (permalink / raw)
  To: Hou Tao
  Cc: Paul E . McKenney, bpf, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, Hou Tao,
	Alexei Starovoitov

On 1/3/23 11:38 AM, Alexei Starovoitov wrote:
> On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Hi,
>>>>
>>>> The patchset tries to fix the problems found when checking how htab map
>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>> freed elements may lead to two problems in htab map:
>>>>
>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>      htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>      flag which acquires bpf-spin-lock during value copying. The
>>>>      corruption of bpf-spin-lock may result in hard lock-up.
>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>      freed and then reused.
>>>>
>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>> these special fields in map element only when the map element is newly
>>>> allocated. If it is just a reused element, there will be no
>>>> reinitialization.
>>> Instead of adding the overhead of ctor callback let's just
>>> add __GFP_ZERO to flags in __alloc().
>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>> will behave the same way.
>> Use __GPF_ZERO will be simpler, but the overhead of memset() for every allocated
>> object may be bigger than ctor callback when the size of allocated object is
>> large. And it also introduces unnecessary memory zeroing if there is no special
>> field in the map value.
> 
> Small memset is often faster than an indirect call.
> I doubt that adding GFP_ZERO will have a measurable perf difference
> in map_perf_test and other benchmarks.
> 
>>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>>> By adding seq in htab element, doing reuse check and retrying the
>>>> lookup procedure may be a feasible solution, but it will make the
>>>> lookup API being hard to use, because the user needs to check whether
>>>> the found element is reused or not and repeat the lookup procedure if it
>>>> is reused. A simpler solution would be just disabling freed elements
>>>> reuse and freeing these elements after lookup procedure ends.
>>> You've proposed this 'solution' twice already in qptrie thread and both
>>> times the answer was 'no, we cannot do this' with reasons explained.
>>> The 3rd time the answer is still the same.
>> This time a workable demo which calls call_rcu_task_trace() in batch is provided
>> :) Also because I can not find a better solution for the reuse problem. But you
>> are right, although don't reuse the freed element will make the implementation
>> of map simpler, the potential OOM problem is hard to solve specially when RCU
>> tasks trace grace period is slow. Hope Paul can provide some insight about the
>> problem.
> 
> OOM is exactly the reason why we cannot do this delaying logic
> in the general case. We don't control what progs do and rcu tasks trace
> may take a long time.

I haven't looked at the details of this patch yet. Since Hou was asking about
the local storage use case in
https://lore.kernel.org/bpf/6e4ec7a4-9ac9-417c-c11a-de59e72a6e42@huawei.com/
(thanks!), I will continue the discussion in this thread.

Some of my current thoughts: the local storage map is a little different from
the more generic bpf maps (e.g. htab). The common use case in the local storage
map is less demanding on the alloc/free path. The storage is usually allocated
once and then stays with its owner (sk, task, cgrp, or inode) for its whole
lifetime. There is no update helper, so it encourages a get() followed by an
in-place modification. Besides, the current local storage is also freed only
after going through an rcu tasks trace gp.

That said, I am checking to see if the common local storage use case can be
modified to safely reuse elements without waiting for the gp. It would then be
an improvement over the existing implementation. The bottom line is not slowing
down the current get (i.e. lookup) performance. Not 100% sure yet if it is doable.

The delete() helper likely still needs to go through a gp. Being able to avoid
immediate reuse in bpf_mem_alloc will at least be needed for the local storage
delete() code path.

> 
>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>> It's a known quirk. There is nothing to fix really.
>> Do we need to document the unexpected behavior somewhere, because I really don't
>> know nothing about the quirk ?
> 
> Yeah. It's not documented in Documentation/bpf/map_hash.rst.
> Please send a patch to add it.
> 
>>>
>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>> in active development. Soon bpf progs will be able to implement their own
>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>> be making the trade off between performance and lookup/delete race.
>> It seems these new gen data struct also need to solve the reuse problem because
>> a global bpf memory allocator is used.
> 
> Currently the graph api is single owner and kptr_xchg is used to allow
> single cpu at a time to observe the object.
> In the future we will add bpf_refcount and multi owner.
> Then multiple cpus will be able to access the same object concurrently.
> They will race to read/write the fields and it will be prog decision
> to arbitrate the access.
> In other words the bpf prog needs to have mechanisms to deal with reuse.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-01-04 18:26             ` Alexei Starovoitov
@ 2023-02-10 16:32               ` Kumar Kartikeya Dwivedi
  2023-02-10 21:06                 ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-02-10 16:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, Hou Tao

On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote:
> On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
> >
> >
> >
> > On 1/3/23 10:30 PM, Hou Tao wrote:
> > > Hi,
> > >
> > > On 1/4/2023 2:10 PM, Yonghong Song wrote:
> > >>
> > >>
> > >> On 1/3/23 5:47 AM, Hou Tao wrote:
> > >>> Hi,
> > >>>
> > >>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
> > >>>>
> > >>>>
> > >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> > >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> > >>>>>> From: Hou Tao <houtao1@huawei.com>
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> The patchset tries to fix the problems found when checking how htab map
> > >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
> > >>>>>> freed elements may lead to two problems in htab map:
> > >>>>>>
> > >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> > >>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> > >>>>>>        flag which acquires bpf-spin-lock during value copying. The
> > >>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
> > >>>>>> (2) lookup procedure may get incorrect map value if the found element is
> > >>>>>>        freed and then reused.
> > >>>>>>
> > >>>>>> Because the type of htab map elements are the same, so problem #1 can be
> > >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> > >>>>>> these special fields in map element only when the map element is newly
> > >>>>>> allocated. If it is just a reused element, there will be no
> > >>>>>> reinitialization.
> > >>>>>
> > >>>>> Instead of adding the overhead of ctor callback let's just
> > >>>>> add __GFP_ZERO to flags in __alloc().
> > >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
> > >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> > >>>>> will behave the same way.
> > >>>>
> > >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
> > >>>> tried to address a similar issue for lru hash table.
> > >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> > >>>> hash table?
> > >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
> > >>
> > >> The following is my understanding:
> > >> in function alloc_htab_elem() (hashtab.c), we have
> > >>
> > >>                  if (is_map_full(htab))
> > >>                          if (!old_elem)
> > >>                                  /* when map is full and update() is replacing
> > >>                                   * old element, it's ok to allocate, since
> > >>                                   * old element will be freed immediately.
> > >>                                   * Otherwise return an error
> > >>                                   */
> > >>                                  return ERR_PTR(-E2BIG);
> > >>                  inc_elem_count(htab);
> > >>                  l_new = bpf_mem_cache_alloc(&htab->ma);
> > >>                  if (!l_new) {
> > >>                          l_new = ERR_PTR(-ENOMEM);
> > >>                          goto dec_count;
> > >>                  }
> > >>                  check_and_init_map_value(&htab->map,
> > >>                                           l_new->key + round_up(key_size, 8));
> > >>
> > >> In the above check_and_init_map_value() intends to do initializing
> > >> for an element from bpf_mem_cache_alloc (could be reused from the free list).
> > >>
> > >> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
> > >>
> > >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> > >> {
> > >>          int i;
> > >>
> > >>          if (!foffs)
> > >>                  return;
> > >>          for (i = 0; i < foffs->cnt; i++)
> > >>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> > >> }
> > >>
> > >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > >> {
> > >>          bpf_obj_init(map->field_offs, dst);
> > >> }
> > >>
> > >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> > >> list_head, list_node, etc.
> > >>
> > >> This is the problem for above problem #1.
> > >> Maybe I missed something?
> > > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> > > problem by only calling check_and_init_map_value() once for the newly-allocated
> > > element, so if a freed element is reused, its special fields will not be zeroed
> > > again. Is there any other cases which are not covered by the solution or any
> > > other similar problems in hash-tab ?
> >
> > No, I checked all cases of check_and_init_map_value() and didn't find
> > any other instances.
>
> check_and_init_map_value() is called in two other cases:
> lookup_and_delete[_batch].
> There the zeroing of the fields is necessary because the 'value'
> is a temp buffer that is going to be copied to user space.
> I think the way forward is to add GFP_ZERO to mem_alloc
> (to make it equivalent to prealloc), remove one case
> of check_and_init_map_value from hashmap, add short comments
> to two other cases and add a big comment to check_and_init_map_value()
> that should say that 'dst' must be a temp buffer and should not
> point to memory that could be used in parallel by a bpf prog.
> It feels like we've dealt with this issue a couple times already
> and keep repeating this mistake, so the more comments the better.

Hou, are you planning to resubmit this change? I also hit this while testing my
changes on bpf-next.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-10 16:32               ` Kumar Kartikeya Dwivedi
@ 2023-02-10 21:06                 ` Alexei Starovoitov
  2023-02-11  1:09                   ` Hou Tao
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2023-02-10 21:06 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, Hou Tao

On Fri, Feb 10, 2023 at 8:33 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote:
> > On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
> > >
> > >
> > >
> > > On 1/3/23 10:30 PM, Hou Tao wrote:
> > > > Hi,
> > > >
> > > > On 1/4/2023 2:10 PM, Yonghong Song wrote:
> > > >>
> > > >>
> > > >> On 1/3/23 5:47 AM, Hou Tao wrote:
> > > >>> Hi,
> > > >>>
> > > >>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
> > > >>>>
> > > >>>>
> > > >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> > > >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> > > >>>>>> From: Hou Tao <houtao1@huawei.com>
> > > >>>>>>
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> The patchset tries to fix the problems found when checking how htab map
> > > >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
> > > >>>>>> freed elements may lead to two problems in htab map:
> > > >>>>>>
> > > >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> > > >>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> > > >>>>>>        flag which acquires bpf-spin-lock during value copying. The
> > > >>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
> > > >>>>>> (2) lookup procedure may get incorrect map value if the found element is
> > > >>>>>>        freed and then reused.
> > > >>>>>>
> > > >>>>>> Because the type of htab map elements are the same, so problem #1 can be
> > > >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> > > >>>>>> these special fields in map element only when the map element is newly
> > > >>>>>> allocated. If it is just a reused element, there will be no
> > > >>>>>> reinitialization.
> > > >>>>>
> > > >>>>> Instead of adding the overhead of ctor callback let's just
> > > >>>>> add __GFP_ZERO to flags in __alloc().
> > > >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
> > > >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> > > >>>>> will behave the same way.
> > > >>>>
> > > >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
> > > >>>> tried to address a similar issue for lru hash table.
> > > >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> > > >>>> hash table?
> > > >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
> > > >>
> > > >> The following is my understanding:
> > > >> in function alloc_htab_elem() (hashtab.c), we have
> > > >>
> > > >>                  if (is_map_full(htab))
> > > >>                          if (!old_elem)
> > > >>                                  /* when map is full and update() is replacing
> > > >>                                   * old element, it's ok to allocate, since
> > > >>                                   * old element will be freed immediately.
> > > >>                                   * Otherwise return an error
> > > >>                                   */
> > > >>                                  return ERR_PTR(-E2BIG);
> > > >>                  inc_elem_count(htab);
> > > >>                  l_new = bpf_mem_cache_alloc(&htab->ma);
> > > >>                  if (!l_new) {
> > > >>                          l_new = ERR_PTR(-ENOMEM);
> > > >>                          goto dec_count;
> > > >>                  }
> > > >>                  check_and_init_map_value(&htab->map,
> > > >>                                           l_new->key + round_up(key_size, 8));
> > > >>
> > > >> In the above check_and_init_map_value() intends to do initializing
> > > >> for an element from bpf_mem_cache_alloc (could be reused from the free list).
> > > >>
> > > >> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
> > > >>
> > > >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> > > >> {
> > > >>          int i;
> > > >>
> > > >>          if (!foffs)
> > > >>                  return;
> > > >>          for (i = 0; i < foffs->cnt; i++)
> > > >>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> > > >> }
> > > >>
> > > >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > > >> {
> > > >>          bpf_obj_init(map->field_offs, dst);
> > > >> }
> > > >>
> > > >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> > > >> list_head, list_node, etc.
> > > >>
> > > >> This is the problem for above problem #1.
> > > >> Maybe I missed something?
> > > > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> > > > problem by only calling check_and_init_map_value() once for the newly-allocated
> > > > element, so if a freed element is reused, its special fields will not be zeroed
> > > > again. Is there any other cases which are not covered by the solution or any
> > > > other similar problems in hash-tab ?
> > >
> > > No, I checked all cases of check_and_init_map_value() and didn't find
> > > any other instances.
> >
> > check_and_init_map_value() is called in two other cases:
> > lookup_and_delete[_batch].
> > There the zeroing of the fields is necessary because the 'value'
> > is a temp buffer that is going to be copied to user space.
> > I think the way forward is to add GFP_ZERO to mem_alloc
> > (to make it equivalent to prealloc), remove one case
> > of check_and_init_map_value from hashmap, add short comments
> > to two other cases and add a big comment to check_and_init_map_value()
> > that should say that 'dst' must be a temp buffer and should not
> > point to memory that could be used in parallel by a bpf prog.
> > It feels like we've dealt with this issue a couple times already
> > and keep repeating this mistake, so the more comments the better.
>
> Hou, are you plannning to resubmit this change? I also hit this while testing my
> changes on bpf-next.

Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
The former will take a long time to settle.
The latter is trivial.
To unblock yourself just add GFP_ZERO in an extra patch?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-10 21:06                 ` Alexei Starovoitov
@ 2023-02-11  1:09                   ` Hou Tao
  2023-02-11 16:33                     ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-02-11  1:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Paul E . McKenney, rcu, Hou Tao



On 2/11/2023 5:06 AM, Alexei Starovoitov wrote:
> On Fri, Feb 10, 2023 at 8:33 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
>> On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote:
>>> On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
>>>>
>>>>
>>>> On 1/3/23 10:30 PM, Hou Tao wrote:
>>>>> Hi,
>>>>>
>>>>> On 1/4/2023 2:10 PM, Yonghong Song wrote:
>>>>>>
>>>>>> On 1/3/23 5:47 AM, Hou Tao wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>>>>>>>
>>>>>>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>>>>>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>>>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> The patchset tries to fix the problems found when checking how htab map
>>>>>>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>>>>>>>> freed elements may lead to two problems in htab map:
>>>>>>>>>>
>>>>>>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>>>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>>>>>>>        flag which acquires bpf-spin-lock during value copying. The
>>>>>>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
>>>>>>>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>>>>>>>        freed and then reused.
>>>>>>>>>>
>>>>>>>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>>>>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>>>>>>>> these special fields in map element only when the map element is newly
>>>>>>>>>> allocated. If it is just a reused element, there will be no
>>>>>>>>>> reinitialization.
>>>>>>>>> Instead of adding the overhead of ctor callback let's just
>>>>>>>>> add __GFP_ZERO to flags in __alloc().
>>>>>>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>>>>>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>>>>>>>> will behave the same way.
>>>>>>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>>>>>>>> tried to address a similar issue for lru hash table.
>>>>>>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>>>>>>>> hash table?
>>>>>>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
>>>>>> The following is my understanding:
>>>>>> in function alloc_htab_elem() (hashtab.c), we have
>>>>>>
>>>>>>                  if (is_map_full(htab))
>>>>>>                          if (!old_elem)
>>>>>>                                  /* when map is full and update() is replacing
>>>>>>                                   * old element, it's ok to allocate, since
>>>>>>                                   * old element will be freed immediately.
>>>>>>                                   * Otherwise return an error
>>>>>>                                   */
>>>>>>                                  return ERR_PTR(-E2BIG);
>>>>>>                  inc_elem_count(htab);
>>>>>>                  l_new = bpf_mem_cache_alloc(&htab->ma);
>>>>>>                  if (!l_new) {
>>>>>>                          l_new = ERR_PTR(-ENOMEM);
>>>>>>                          goto dec_count;
>>>>>>                  }
>>>>>>                  check_and_init_map_value(&htab->map,
>>>>>>                                           l_new->key + round_up(key_size, 8));
>>>>>>
>>>>>> In the above check_and_init_map_value() intends to do initializing
>>>>>> for an element from bpf_mem_cache_alloc (could be reused from the free list).
>>>>>>
>>>>>> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
>>>>>>
>>>>>> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
>>>>>> {
>>>>>>          int i;
>>>>>>
>>>>>>          if (!foffs)
>>>>>>                  return;
>>>>>>          for (i = 0; i < foffs->cnt; i++)
>>>>>>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
>>>>>> }
>>>>>>
>>>>>> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>>>>>> {
>>>>>>          bpf_obj_init(map->field_offs, dst);
>>>>>> }
>>>>>>
>>>>>> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
>>>>>> list_head, list_node, etc.
>>>>>>
>>>>>> This is the problem for above problem #1.
>>>>>> Maybe I missed something?
>>>>> Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
>>>>> problem by only calling check_and_init_map_value() once for the newly-allocated
>>>>> element, so if a freed element is reused, its special fields will not be zeroed
>>>>> again. Is there any other cases which are not covered by the solution or any
>>>>> other similar problems in hash-tab ?
>>>> No, I checked all cases of check_and_init_map_value() and didn't find
>>>> any other instances.
>>> check_and_init_map_value() is called in two other cases:
>>> lookup_and_delete[_batch].
>>> There the zeroing of the fields is necessary because the 'value'
>>> is a temp buffer that is going to be copied to user space.
>>> I think the way forward is to add GFP_ZERO to mem_alloc
>>> (to make it equivalent to prealloc), remove one case
>>> of check_and_init_map_value from hashmap, add short comments
>>> to two other cases and add a big comment to check_and_init_map_value()
>>> that should say that 'dst' must be a temp buffer and should not
>>> point to memory that could be used in parallel by a bpf prog.
>>> It feels like we've dealt with this issue a couple times already
>>> and keep repeating this mistake, so the more comments the better.
>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>> changes on bpf-next.
> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> The former will take a long time to settle.
> The latter is trivial.
> To unblock yourself just add GFP_ZERO in an extra patch?
Sorry for the long delay. I just found time to do some tests to compare the
performance of bzero and ctor. After that is done, I will resubmit next week.
> .


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-11  1:09                   ` Hou Tao
@ 2023-02-11 16:33                     ` Alexei Starovoitov
  2023-02-11 16:34                       ` Alexei Starovoitov
  2023-02-15  2:35                       ` Hou Tao
  0 siblings, 2 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2023-02-11 16:33 UTC (permalink / raw)
  To: Hou Tao
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hou, are you plannning to resubmit this change? I also hit this while testing my
> >> changes on bpf-next.
> > Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> > The former will take a long time to settle.
> > The latter is trivial.
> > To unblock yourself just add GFP_ZERO in an extra patch?
> Sorry for the long delay. Just find find out time to do some tests to compare
> the performance of bzero and ctor. After it is done, will resubmit on next week.

I still don't like ctor as a concept. In general the callbacks in the critical
path are guaranteed to be slow due to retpoline overhead.
Please send a patch to add GFP_ZERO.

Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
This approach will work for inner nodes of qptrie, since bpf progs
never see pointers to them. It will work for local storage
converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
It's also safe without uaf caveat in sleepable progs and sleepable progs
can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
So please respin the set with rcu gp only and that new flag.
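
For illustration, a rough sketch of the two-stage scheme; the flag name, struct
members and helpers below are hypothetical and only spell out "reuse after a
vanilla RCU GP, return memory to the kernel only after a tasks-trace GP":

/* Hypothetical per-cpu state for a cache created with a
 * BPF_MA_REUSE_AFTER_RCU_GP-style flag.
 */
struct bpf_mem_cache_reuse {
	struct llist_head waiting_for_gp;	/* freed, not yet reusable */
	struct llist_head reuse_ready;		/* may be handed out again */
	struct llist_head ready_to_free;	/* waiting for tasks-trace GP */
	struct rcu_head rcu;
	struct rcu_head trace_rcu;
};

static void mark_reuse_ready(struct rcu_head *rcu)
{
	struct bpf_mem_cache_reuse *c =
		container_of(rcu, struct bpf_mem_cache_reuse, rcu);
	struct llist_node *pos, *tmp;

	/* A vanilla RCU GP has elapsed: non-sleepable progs can no longer
	 * hold pointers to these objects, so unit_alloc() may hand them out
	 * again.  Sleepable progs that want to avoid the uaf caveat wrap
	 * their accesses in bpf_rcu_read_lock().
	 */
	llist_for_each_safe(pos, tmp, llist_del_all(&c->waiting_for_gp))
		llist_add(pos, &c->reuse_ready);
}

static void free_mem_to_kernel(struct rcu_head *rcu)
{
	struct bpf_mem_cache_reuse *c =
		container_of(rcu, struct bpf_mem_cache_reuse, trace_rcu);
	struct llist_node *pos, *tmp;

	/* A tasks-trace GP has elapsed as well, so even sleepable progs
	 * cannot reach these objects any more: give the memory back to the
	 * kernel (the llist_node sits at the start of each object, as in
	 * the existing bpf_mem_cache free lists).
	 */
	llist_for_each_safe(pos, tmp, llist_del_all(&c->ready_to_free))
		kfree(pos);
}

In this sketch unit_free() would push the object onto waiting_for_gp and kick
call_rcu(&c->rcu, mark_reuse_ready); shrinking the cache would move surplus
objects to ready_to_free and kick call_rcu_tasks_trace(&c->trace_rcu,
free_mem_to_kernel). A real implementation also has to make sure each rcu_head
is not reused before its previous callback has run, e.g. by batching the frees
as the cover letter does.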

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-11 16:33                     ` Alexei Starovoitov
@ 2023-02-11 16:34                       ` Alexei Starovoitov
  2023-02-15  1:54                         ` Martin KaFai Lau
  2023-02-16 13:55                         ` Hou Tao
  2023-02-15  2:35                       ` Hou Tao
  1 sibling, 2 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2023-02-11 16:34 UTC (permalink / raw)
  To: Hou Tao
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hou, are you plannning to resubmit this change? I also hit this while testing my
> > >> changes on bpf-next.
> > > Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> > > The former will take a long time to settle.
> > > The latter is trivial.
> > > To unblock yourself just add GFP_ZERO in an extra patch?
> > Sorry for the long delay. Just find find out time to do some tests to compare
> > the performance of bzero and ctor. After it is done, will resubmit on next week.
>
> I still don't like ctor as a concept. In general the callbacks in the critical
> path are guaranteed to be slow due to retpoline overhead.
> Please send a patch to add GFP_ZERO.
>
> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
> This approach will work for inner nodes of qptrie, since bpf progs
> never see pointers to them. It will work for local storage
> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
> It's also safe without uaf caveat in sleepable progs and sleepable progs

I meant 'safe with uaf caveat'.
Safe because we wait for rcu_tasks_trace later before returning the memory to the kernel.

> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
> So please respin the set with rcu gp only and that new flag.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-11 16:34                       ` Alexei Starovoitov
@ 2023-02-15  1:54                         ` Martin KaFai Lau
  2023-02-15  4:02                           ` Hou Tao
  2023-02-16 13:55                         ` Hou Tao
  1 sibling, 1 reply; 35+ messages in thread
From: Martin KaFai Lau @ 2023-02-15  1:54 UTC (permalink / raw)
  To: Alexei Starovoitov, Hou Tao
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, Hou Tao,
	Martin KaFai Lau

On 2/11/23 8:34 AM, Alexei Starovoitov wrote:
> On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>>>>> changes on bpf-next.
>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>> The former will take a long time to settle.
>>>> The latter is trivial.
>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>> Sorry for the long delay. Just find find out time to do some tests to compare
>>> the performance of bzero and ctor. After it is done, will resubmit on next week.
>>
>> I still don't like ctor as a concept. In general the callbacks in the critical
>> path are guaranteed to be slow due to retpoline overhead.
>> Please send a patch to add GFP_ZERO.
>>
>> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
>> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
>> This approach will work for inner nodes of qptrie, since bpf progs
>> never see pointers to them. It will work for local storage
>> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
>> It's also safe without uaf caveat in sleepable progs and sleepable progs
> 
> I meant 'safe with uaf caveat'.
> Safe because we wait for rcu_task_trace later before returning to kernel memory.

For local storage, when its owner (sk/task/inode/cgrp) is going away, the memory 
can be reused immediately. No rcu gp is needed.

The local storage delete case (e.g. bpf_sk_storage_delete) is the only one that
needs to be freed by the tasks_trace gp because another bpf prog (reader) may be
under rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) of
allowing reuse after a vanilla rcu gp and freeing (if needed) after a
tasks_trace gp can be extended to the local storage delete case. I think we can
extend the assumption that "sleepable progs (readers) can use explicit
bpf_rcu_read_lock() when they want to avoid uaf" to
bpf_{sk,task,inode,cgrp}_storage_get() also.

I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and the 
BPF_REUSE_AFTER_RCU_GP idea.  Probably will get the GFP_ZERO out first.
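
For illustration, roughly the pattern a sleepable prog would use under that
proposal (a sketch with invented map/prog names; whether the storage helpers
will honour the explicit RCU section this way is exactly what is being proposed
here, not current behaviour):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} sk_stg SEC(".maps");

SEC("lsm.s/socket_post_create")
int BPF_PROG(sk_stg_reader, struct socket *sock, int family, int type,
	     int protocol, int kern)
{
	long *val;

	/* Keep the lookup and the access inside one explicit RCU section,
	 * so a parallel bpf_sk_storage_delete() cannot lead to uaf even
	 * though this prog is sleepable.
	 */
	bpf_rcu_read_lock();
	val = bpf_sk_storage_get(&sk_stg, sock->sk, NULL, 0);
	if (val)
		*val += 1;
	bpf_rcu_read_unlock();
	return 0;
}

char _license[] SEC("license") = "GPL";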

> 
>> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
>> So please respin the set with rcu gp only and that new flag.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-11 16:33                     ` Alexei Starovoitov
  2023-02-11 16:34                       ` Alexei Starovoitov
@ 2023-02-15  2:35                       ` Hou Tao
  2023-02-15  2:42                         ` Alexei Starovoitov
  1 sibling, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-02-15  2:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

Hi,

On 2/12/2023 12:33 AM, Alexei Starovoitov wrote:
> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>>>> changes on bpf-next.
>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>> The former will take a long time to settle.
>>> The latter is trivial.
>>> To unblock yourself just add GFP_ZERO in an extra patch?
>> Sorry for the long delay. Just find find out time to do some tests to compare
>> the performance of bzero and ctor. After it is done, will resubmit on next week.
> I still don't like ctor as a concept. In general the callbacks in the critical
> path are guaranteed to be slow due to retpoline overhead.
> Please send a patch to add GFP_ZERO.
I see. Will do. But I think it is better to know the coarse overhead of these
two methods, so I hacked map_perf_test to support a customizable value size for
hash_map_alloc and ran some benchmarks to show the overheads of ctor and
GFP_ZERO. These benchmarks were conducted on a KVM VM with 8 cpus. It seems that
when the number of allocated elements is small, the overheads of ctor and bzero
are basically the same, but when the number of allocated elements increases
(e.g., the table is half full), the overhead of ctor becomes bigger. For big
value sizes, the overheads of ctor and bzero are basically the same, and that
seems to be because the main overhead comes from slab allocation. The following
are the detailed results:

* ./map_perf_test 4 8 8192 10000 $value_size

The key of the htab is the thread pid, so only 8 elements are allocated.

| value_size | 8      | 256    | 4K     | 16K    | 64K    | 256K   |
| --         | --     | --     | --     | --     | --     | --     |
| base       | 256604 | 261112 | 173646 | 74195  | 23138  | 6275   |
| bzero      | 253362 | 257563 | 171445 | 73303  | 22949  | 6249   |
| ctor       | 264570 | 258246 | 175048 | 72511  | 23004  | 6270   |

* ./map_perf_test 4 8 8192 100 $value_size

The key is still the thread pid, so only 8 elements are allocated. Decrease the
loop count to 100 to show the overhead of the first allocation.

| value_size | 8      | 256    | 4K     | 16K    | 64K    | 256K   |
| --         | --     | --     | --     | --     | --     | --     |
| base       | 135662 | 137742 | 87043  | 36265  | 12501  | 4450   |
| bzero      | 139993 | 134920 | 94570  | 37190  | 12543  | 4131   |
| ctor       | 147949 | 141825 | 94321  | 38240  | 13131  | 4248   |

* ./map_perf_test 4 8 8192 1000 $value_size

Each thread creates 512 different keys, so the hash table will be half full.
The loop count is also decreased to 1K.

| value_size | 8      | 256    | 4K     | 16K    | 64K    | 256K   |
| --         | --     | --     | --     | --     | --     | --     |
| base       | 4234   | 4289   | 1478   | 510    | 168    | 46     |
| bzero      | 3792   | 4002   | 1473   | 515    | 161    | 37     |
| ctor       | 3846   | 2198   | 1269   | 499    | 161    | 42     |

* ./map_perf_test 4 8 8192 100 $value_size

Each thread creates 512 different keys, so the hash table will be half full.
The loop count is also decreased to 100.

| value_size | 8      | 256    | 4K     | 16K    | 64K    | 256K   |
| --         | --     | --     | --     | --     | --     | --     |
| base       | 3669   | 3419   | 1272   | 476    | 168    | 44     |
| bzero      | 3468   | 3499   | 1274   | 476    | 150    | 36     |
| ctor       | 2235   | 2312   | 1128   | 452    | 145    | 35     |
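
For reference, the two approaches being compared boil down to roughly the
following sketch. It is simplified and not the actual bpf_mem_alloc code;
in particular the ctor hook and its signature are just placeholders for what
this RFC adds:

/* (a) GFP_ZERO: zero the element as part of the slab allocation itself. */
static void *alloc_zeroed(struct bpf_mem_cache *c, int node)
{
        gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;

        return kmalloc_node(c->unit_size, flags | __GFP_ZERO, node);
}

/* (b) ctor: allocate without zeroing, then make an indirect call to
 * initialize only the special fields of the newly allocated element.
 */
static void *alloc_with_ctor(struct bpf_mem_cache *c, int node)
{
        gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT;
        void *obj = kmalloc_node(c->unit_size, flags, node);

        /* the indirect call is what pays the retpoline cost on alloc */
        if (obj && c->ctor)
                c->ctor(c, obj);
        return obj;
}
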
>
> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
> This approach will work for inner nodes of qptrie, since bpf progs
> never see pointers to them. It will work for local storage
> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
> It's also safe without uaf caveat in sleepable progs and sleepable progs
> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
> So please respin the set with rcu gp only and that new flag.
> .


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-15  2:35                       ` Hou Tao
@ 2023-02-15  2:42                         ` Alexei Starovoitov
  2023-02-15  3:00                           ` Hou Tao
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2023-02-15  2:42 UTC (permalink / raw)
  To: Hou Tao
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

On Tue, Feb 14, 2023 at 6:36 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/12/2023 12:33 AM, Alexei Starovoitov wrote:
> > On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
> >>>> changes on bpf-next.
> >>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> >>> The former will take a long time to settle.
> >>> The latter is trivial.
> >>> To unblock yourself just add GFP_ZERO in an extra patch?
> >> Sorry for the long delay. Just find find out time to do some tests to compare
> >> the performance of bzero and ctor. After it is done, will resubmit on next week.
> > I still don't like ctor as a concept. In general the callbacks in the critical
> > path are guaranteed to be slow due to retpoline overhead.
> > Please send a patch to add GFP_ZERO.
> I see. Will do. But i think it is better to know the coarse overhead of these
> two methods, so I hack map_perf_test to support customizable value size for
> hash_map_alloc and do some benchmarks to show the overheads of ctor and
> GFP_ZERO. These benchmark are conducted on a KVM-VM with 8-cpus, it seems when
> the number of allocated elements is small, the overheads of ctor and bzero are
> basically the same, but when the number of allocated element increases (e.g.,
> half full), the overhead of ctor will be bigger. For big value size, the
> overhead of ctor and zero are basically the same, and it seems due to the main
> overhead comes from slab allocation. The following is the detailed results:

and with retpoline?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-15  2:42                         ` Alexei Starovoitov
@ 2023-02-15  3:00                           ` Hou Tao
  0 siblings, 0 replies; 35+ messages in thread
From: Hou Tao @ 2023-02-15  3:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

Hi,

On 2/15/2023 10:42 AM, Alexei Starovoitov wrote:
> On Tue, Feb 14, 2023 at 6:36 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 2/12/2023 12:33 AM, Alexei Starovoitov wrote:
>>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>>>>>> changes on bpf-next.
>>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>>> The former will take a long time to settle.
>>>>> The latter is trivial.
>>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>>> Sorry for the long delay. Just find find out time to do some tests to compare
>>>> the performance of bzero and ctor. After it is done, will resubmit on next week.
>>> I still don't like ctor as a concept. In general the callbacks in the critical
>>> path are guaranteed to be slow due to retpoline overhead.
>>> Please send a patch to add GFP_ZERO.
>> I see. Will do. But i think it is better to know the coarse overhead of these
>> two methods, so I hack map_perf_test to support customizable value size for
>> hash_map_alloc and do some benchmarks to show the overheads of ctor and
>> GFP_ZERO. These benchmark are conducted on a KVM-VM with 8-cpus, it seems when
>> the number of allocated elements is small, the overheads of ctor and bzero are
>> basically the same, but when the number of allocated element increases (e.g.,
>> half full), the overhead of ctor will be bigger. For big value size, the
>> overhead of ctor and zero are basically the same, and it seems due to the main
>> overhead comes from slab allocation. The following is the detailed results:
> and with retpoline?
Yes. Forgot to mention that. The following is the output of the
vulnerabilities directory:

# cd /sys/devices/system/cpu/vulnerabilities
# grep . *
itlb_multihit:Processor vulnerable
l1tf:Mitigation: PTE Inversion
mds:Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
meltdown:Mitigation: PTI
mmio_stale_data:Unknown: No mitigations
retbleed:Not affected
spec_store_bypass:Vulnerable
spectre_v1:Mitigation: usercopy/swapgs barriers and __user pointer sanitization
spectre_v2:Mitigation: Retpolines, STIBP: disabled, RSB filling, PBRSB-eIBRS:
Not affected
srbds:Not affected
tsx_async_abort:Not affected



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-15  1:54                         ` Martin KaFai Lau
@ 2023-02-15  4:02                           ` Hou Tao
  2023-02-15  7:22                             ` Martin KaFai Lau
  0 siblings, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-02-15  4:02 UTC (permalink / raw)
  To: Martin KaFai Lau, Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, Hou Tao,
	Martin KaFai Lau

Hi,

On 2/15/2023 9:54 AM, Martin KaFai Lau wrote:
> On 2/11/23 8:34 AM, Alexei Starovoitov wrote:
>> On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>> Hou, are you plannning to resubmit this change? I also hit this while
>>>>>> testing my
>>>>>> changes on bpf-next.
>>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>>> The former will take a long time to settle.
>>>>> The latter is trivial.
>>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>>> Sorry for the long delay. Just find find out time to do some tests to compare
>>>> the performance of bzero and ctor. After it is done, will resubmit on next
>>>> week.
>>>
>>> I still don't like ctor as a concept. In general the callbacks in the critical
>>> path are guaranteed to be slow due to retpoline overhead.
>>> Please send a patch to add GFP_ZERO.
>>>
>>> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
>>> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
>>> This approach will work for inner nodes of qptrie, since bpf progs
>>> never see pointers to them. It will work for local storage
>>> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
>>> It's also safe without uaf caveat in sleepable progs and sleepable progs
>>
>> I meant 'safe with uaf caveat'.
>> Safe because we wait for rcu_task_trace later before returning to kernel memory.
For qp-trie, I had added reuse checking for the inner nodes by keeping a
version in both the child pointer and the child node itself, and it seemed to
work, but using BPF_REUSE_AFTER_RCU_GP for inner nodes would be much simpler
to implement. And it seems BPF_REUSE_AFTER_RCU_GP is needed for qp-trie leaf
nodes as well, otherwise the value returned to the caller in a bpf program or
syscall may be reused during its usage, just like in the hash table. We could
change qp-trie to act as a set only (i.e., no value), but that would limit its
usage scenarios. Maybe requiring the caller to use bpf_rcu_read_lock() is a
solution as well. What do you think?
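
A very rough sketch of that version-check idea, in case it helps the
discussion (all struct and function names are hypothetical and the memory
ordering details are glossed over):

struct qptrie_child_ref {
        struct qptrie_node *node;
        unsigned int version;   /* version of the child when it was linked */
};

struct qptrie_node {
        unsigned int version;   /* bumped each time the node is reused */
        /* ... keys / twigs ... */
};

static struct qptrie_node *child_deref(struct qptrie_child_ref *ref)
{
        struct qptrie_node *node = READ_ONCE(ref->node);
        unsigned int ver = READ_ONCE(ref->version);

        /* the child was freed and reused underneath us: restart the lookup */
        if (!node || READ_ONCE(node->version) != ver)
                return NULL;
        return node;
}
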
>
> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
> memory can be reused immediately. No rcu gp is needed.
Now it seems it does wait for an RCU GP, and I think that is still necessary,
because when the process exits, other processes may still access its local
storage through a pidfd or the task_struct of the exited process.
>
> The local storage delete case (eg. bpf_sk_storage_delete) is the only one that
> needs to be freed by tasks_trace gp because another bpf prog (reader) may be
> under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on
> allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp
> can be extended to the local storage delete case. I think we can extend the
> assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock()
> when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also.
>
It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect
not only bpf_task_storage_get(), but also the dereferences of the returned
local storage pointer, right? I think qp-trie may also need this.
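
To make it concrete, something like the following is what I have in mind for
the sleepable case. It is only a sketch: the hook, section name and map
layout are made up for illustration.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

struct val {
        long cnt;
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct val);
} task_map SEC(".maps");

SEC("fentry.s/inet_listen")
int BPF_PROG(on_inet_listen)
{
        struct task_struct *task = bpf_get_current_task_btf();
        struct val *v;

        bpf_rcu_read_lock();
        v = bpf_task_storage_get(&task_map, task, 0,
                                 BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (v)
                v->cnt++;       /* the deref stays inside the RCU section */
        bpf_rcu_read_unlock();
        return 0;
}

char _license[] SEC("license") = "GPL";
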
> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and
> the BPF_REUSE_AFTER_RCU_GP idea.  Probably will get the GFP_ZERO out first.
I will continue working on this patchset for GFP_ZERO and the reuse flag. Do
you mean that you want to work together on implementing BPF_REUSE_AFTER_RCU_GP?
How should we cooperate to accomplish that?


>
>>
>>> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
>>> So please respin the set with rcu gp only and that new flag.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-15  4:02                           ` Hou Tao
@ 2023-02-15  7:22                             ` Martin KaFai Lau
  2023-02-16  2:11                               ` Hou Tao
  0 siblings, 1 reply; 35+ messages in thread
From: Martin KaFai Lau @ 2023-02-15  7:22 UTC (permalink / raw)
  To: Hou Tao
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, Hou Tao,
	Martin KaFai Lau, Alexei Starovoitov

On 2/14/23 8:02 PM, Hou Tao wrote:
>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
>> memory can be reused immediately. No rcu gp is needed.
> Now it seems it will wait for RCU GP and i think it is still necessary, because
> when the process exits, other processes may still access the local storage
> through pidfd or task_struct of the exited process.

When its owner (sk/task/cgrp...) is going away, the owner has reached refcnt 0
and will be kfreed immediately afterwards, e.g. bpf_sk_storage_free is called
just before the sk is about to be kfreed. No bpf prog should have a hold on
this sk at that point. The same should go for the task.

The current rcu gp waiting during bpf_{sk,task,cgrp...}_storage_free is because
of the racing with the map destruction in bpf_local_storage_map_free().
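
Roughly, the teardown ordering being described looks like this (a simplified
sketch, not the exact sock.c code path):

static void owner_teardown(struct sock *sk)
{
        /* refcount already hit zero: no bpf prog can still hold this sk */
        bpf_sk_storage_free(sk);  /* storage memory may be reused right away */
        kfree(sk);                /* the owner itself is freed next */
}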

>>
>> The local storage delete case (eg. bpf_sk_storage_delete) is the only one that
>> needs to be freed by tasks_trace gp because another bpf prog (reader) may be
>> under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on
>> allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp
>> can be extended to the local storage delete case. I think we can extend the
>> assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock()
>> when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also.
>>
> It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect not
> only bpf_task_storage_get(), but also the dereferences of the returned local
> storage ptr, right ? I think qp-trie may also need this.

I think bpf_rcu_read_lock() is primarily for bpf prog.

The bpf_{sk,task,...}_storage_get() internal is easier to handle and probably 
will need to do its own rcu_read_lock() instead of depending on the bpf prog 
doing the bpf_rcu_read_lock() because the bpf prog may decide uaf is fine.

>> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and
>> the BPF_REUSE_AFTER_RCU_GP idea.  Probably will get the GFP_ZERO out first.
> I will continue work on this patchset for GFP_ZERO and reuse flag. Do you mean
> that you want to work together to implement BPF_REUSE_AFTER_RCU_GP ? How do we
> cooperate together to accomplish that ?
Please submit the GFP_ZERO patch first. Kumar and I can use it immediately.

I have been hacking to make bpf's memalloc safe for
bpf_{sk,task,cgrp..}_storage_delete(), and this safe-on-reuse piece still needs
work. The whole thing is getting pretty long, so my current plan is to put the
safe-on-reuse piece aside for now, focus back on the immediate goal and make
the common case deadlock-free first. Meaning that
bpf_*_storage_get(BPF_*_STORAGE_GET_F_CREATE) and bpf_*_storage_free() will use
bpf_mem_cache_{alloc,free}. bpf_*_storage_delete() will stay as-is and go
through call_rcu_tasks_trace() for now, since delete is not the common use
case.
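
Roughly, the split I have in mind looks like this. It is only a sketch: the
selem_ma member and the helper names are placeholders, and the real alloc/free
paths in bpf_local_storage.c carry more details.

static struct bpf_local_storage_elem *
selem_alloc(struct bpf_local_storage_map *smap)
{
        /* common path: bpf_*_storage_get(F_CREATE) allocates from bpf ma */
        return bpf_mem_cache_alloc(&smap->selem_ma);
}

static void selem_free_owner_gone(struct bpf_local_storage_map *smap,
                                  struct bpf_local_storage_elem *selem)
{
        /* bpf_*_storage_free(): the owner is gone, nothing can reach the
         * element anymore, so let bpf ma reuse it right away
         */
        bpf_mem_cache_free(&smap->selem_ma, selem);
}

static void selem_free_on_delete(struct bpf_local_storage_elem *selem)
{
        /* bpf_*_storage_delete(): a tracing prog may still be reading the
         * element under rcu_read_lock_trace(), so keep the tasks-trace gp
         */
        call_rcu_tasks_trace(&selem->rcu, selem_free_rcu);
}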

In parallel, if you can post the BPF_REUSE_AFTER_RCU_GP work, we can discuss
based on it. That should speed up the progress. If I finish the immediate goal
for local storage and this piece is still pending, I will ping you first.
Thoughts?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-15  7:22                             ` Martin KaFai Lau
@ 2023-02-16  2:11                               ` Hou Tao
  2023-02-16  7:47                                 ` Martin KaFai Lau
  0 siblings, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-02-16  2:11 UTC (permalink / raw)
  To: Martin KaFai Lau, Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, Hou Tao,
	Martin KaFai Lau

Hi,

On 2/15/2023 3:22 PM, Martin KaFai Lau wrote:
> On 2/14/23 8:02 PM, Hou Tao wrote:
>>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
>>> memory can be reused immediately. No rcu gp is needed.
>> Now it seems it will wait for RCU GP and i think it is still necessary, because
>> when the process exits, other processes may still access the local storage
>> through pidfd or task_struct of the exited process.
>
> When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0
> and will be kfree immediately next. eg. bpf_sk_storage_free is called just
> before the sk is about to be kfree. No bpf prog should have a hold on this sk.
> The same should go for the task.
A bpf syscall may have already found the task local storage through a pidfd,
then the target task exits and the local storage is freed immediately, then the
bpf syscall starts to copy the local storage and there will be a UAF, right?
Am I missing something here?
>
> The current rcu gp waiting during bpf_{sk,task,cgrp...}_storage_free is
> because the racing with the map destruction bpf_local_storage_map_free().
>
>>>
>>> The local storage delete case (eg. bpf_sk_storage_delete) is the only one that
>>> needs to be freed by tasks_trace gp because another bpf prog (reader) may be
>>> under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on
>>> allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp
>>> can be extended to the local storage delete case. I think we can extend the
>>> assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock()
>>> when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also.
>>>
>> It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect not
>> only bpf_task_storage_get(), but also the dereferences of the returned local
>> storage ptr, right ? I think qp-trie may also need this.
>
> I think bpf_rcu_read_lock() is primarily for bpf prog.
Yes. I mean the bpf program which uses qp-trie will need bpf_rcu_read_lock().
>
> The bpf_{sk,task,...}_storage_get() internal is easier to handle and probably
> will need to do its own rcu_read_lock() instead of depending on the bpf prog
> doing the bpf_rcu_read_lock() because the bpf prog may decide uaf is fine.
>
>>> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and
>>> the BPF_REUSE_AFTER_RCU_GP idea.  Probably will get the GFP_ZERO out first.
>> I will continue work on this patchset for GFP_ZERO and reuse flag. Do you mean
>> that you want to work together to implement BPF_REUSE_AFTER_RCU_GP ? How do we
>> cooperate together to accomplish that ?
> Please submit the GFP_ZERO patch first. Kumar and I can use it immediately.
>
> I have been hacking to make bpf's memalloc safe for the
> bpf_{sk,task,cgrp..}_storage_delete() and this safe-on-reuse piece still need
> works. The whole thing is getting pretty long, so my current plan is to put
> the safe-on-reuse piece aside for now, focus back on the immediate goal and
> make the common case deadlock free first. Meaning the
> bpf_*_storage_get(BPF_*_STORAGE_GET_F_CREATE) and the bpf_*_storage_free()
> will use the bpf_mem_cache_{alloc,free}. The bpf_*_storage_delete() will stay
> as-is to go through the call_rcu_tasks_trace() for now since delete is not the
> common use case.
>
> In parallel, if you can post the BPF_REUSE_AFTER_RCU_GP, we can discuss based
> on your work. That should speed up the progress. If I finished the immediate
> goal for local storage and this piece is still pending, I will ping you
> first.  Thoughts?
I am fine with the proposal, thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-16  2:11                               ` Hou Tao
@ 2023-02-16  7:47                                 ` Martin KaFai Lau
  2023-02-16  8:18                                   ` Hou Tao
  0 siblings, 1 reply; 35+ messages in thread
From: Martin KaFai Lau @ 2023-02-16  7:47 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Kumar Kartikeya Dwivedi, Yonghong Song, bpf,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

On 2/15/23 6:11 PM, Hou Tao wrote:
>>>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
>>>> memory can be reused immediately. No rcu gp is needed.
>>> Now it seems it will wait for RCU GP and i think it is still necessary, because
>>> when the process exits, other processes may still access the local storage
>>> through pidfd or task_struct of the exited process.
>> When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0
>> and will be kfree immediately next. eg. bpf_sk_storage_free is called just
>> before the sk is about to be kfree. No bpf prog should have a hold on this sk.
>> The same should go for the task.
> A bpf syscall may have already found the task local storage through a pidfd,
> then the target task exits and the local storage is free immediately, then bpf
> syscall starts to copy the local storage and there will be a UAF, right ? Did I
> missing something here ?
You mean a bpf syscall like bpf_pid_task_storage_lookup_elem, where
__put_task_struct() would be called while the syscall's bpf_map_copy_value() is
still under rcu_read_lock()?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-16  7:47                                 ` Martin KaFai Lau
@ 2023-02-16  8:18                                   ` Hou Tao
  0 siblings, 0 replies; 35+ messages in thread
From: Hou Tao @ 2023-02-16  8:18 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Kumar Kartikeya Dwivedi, Yonghong Song, bpf,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

Hi,

On 2/16/2023 3:47 PM, Martin KaFai Lau wrote:
> On 2/15/23 6:11 PM, Hou Tao wrote:
>>>>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
>>>>> memory can be reused immediately. No rcu gp is needed.
>>>> Now it seems it will wait for RCU GP and i think it is still necessary,
>>>> because
>>>> when the process exits, other processes may still access the local storage
>>>> through pidfd or task_struct of the exited process.
>>> When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0
>>> and will be kfree immediately next. eg. bpf_sk_storage_free is called just
>>> before the sk is about to be kfree. No bpf prog should have a hold on this sk.
>>> The same should go for the task.
>> A bpf syscall may have already found the task local storage through a pidfd,
>> then the target task exits and the local storage is free immediately, then bpf
>> syscall starts to copy the local storage and there will be a UAF, right ? Did I
>> missing something here ?
> bpf syscall like bpf_pid_task_storage_lookup_elem and you meant
> __put_task_struct() will be called while the syscall's bpf_map_copy_value() is
> still under rcu_read_lock()?
Yes, but now I know it is impossible here. I missed that the task_struct is
released through call_rcu(), so the call to __put_task_struct() must happen
after the completion of bpf_map_copy_value() in the bpf syscall. Thanks for the
explanation.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-11 16:34                       ` Alexei Starovoitov
  2023-02-15  1:54                         ` Martin KaFai Lau
@ 2023-02-16 13:55                         ` Hou Tao
  2023-02-16 16:35                           ` Alexei Starovoitov
  1 sibling, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-02-16 13:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Martin KaFai Lau
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao

Hi,

On 2/12/2023 12:34 AM, Alexei Starovoitov wrote:
> On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>>>>> changes on bpf-next.
>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>> The former will take a long time to settle.
>>>> The latter is trivial.
>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>> Sorry for the long delay. Just find find out time to do some tests to compare
>>> the performance of bzero and ctor. After it is done, will resubmit on next week.
>> I still don't like ctor as a concept. In general the callbacks in the critical
>> path are guaranteed to be slow due to retpoline overhead.
>> Please send a patch to add GFP_ZERO.
>>
>> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
>> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
>> This approach will work for inner nodes of qptrie, since bpf progs
>> never see pointers to them. It will work for local storage
>> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
>> It's also safe without uaf caveat in sleepable progs and sleepable progs
> I meant 'safe with uaf caveat'.
> Safe because we wait for rcu_task_trace later before returning to kernel memory.
>
>> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
>> So please respin the set with rcu gp only and that new flag.
Besides BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution?
Its downside is that it would force sleepable programs to use
bpf_rcu_read_{lock,unlock}() to access the returned pointers.
> .


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-16 13:55                         ` Hou Tao
@ 2023-02-16 16:35                           ` Alexei Starovoitov
  2023-02-17  1:19                             ` Hou Tao
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2023-02-16 16:35 UTC (permalink / raw)
  To: Hou Tao
  Cc: Martin KaFai Lau, Kumar Kartikeya Dwivedi, Yonghong Song, bpf,
	Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao

On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ?

The idea is for bpf_mem_free to wait for a normal RCU GP before adding the
elements back to the free list, and to free the elem to global kernel memory
only after both rcu and rcu_tasks_trace GPs, as it's doing now.

> Its downside is that it will enforce sleep-able program to use
> bpf_rcu_read_{lock,unlock}() to access these returned pointers ?

Sleepable progs can access elems without kptrs/spin_locks even when not using
rcu_read_lock, since it's safe, but there is uaf. Some progs might be fine with
it. When a sleepable prog needs to avoid uaf it will use bpf_rcu_read_lock.
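
As a sketch of the lifecycle (the struct and helper names below are made up,
only the rcu APIs are real):

struct reuse_batch {
        struct rcu_head rcu;
        struct llist_head elems;  /* elements freed via bpf_mem_cache_free() */
        struct bpf_mem_cache *cache;
};

static void batch_reusable(struct rcu_head *head)
{
        struct reuse_batch *b = container_of(head, struct reuse_batch, rcu);

        /* a normal rcu gp has passed: non-sleepable readers are gone,
         * so this bpf ma can hand the elements out again
         */
        splice_into_free_list(b->cache, &b->elems);
        kfree(b);
}

static void queue_for_reuse(struct reuse_batch *b)
{
        call_rcu(&b->rcu, batch_reusable);
}

static void return_to_slab(struct reuse_batch *b)
{
        /* only when memory leaves the bpf ma: also wait for the tasks-trace
         * gp before giving it back to the kernel, like the current free path
         */
        call_rcu_tasks_trace(&b->rcu, batch_free_to_slab);
}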

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-16 16:35                           ` Alexei Starovoitov
@ 2023-02-17  1:19                             ` Hou Tao
  2023-02-22 19:30                               ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Hou Tao @ 2023-02-17  1:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

Hi,

On 2/17/2023 12:35 AM, Alexei Starovoitov wrote:
> On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ?
> The idea is for bpf_mem_free to wait normal RCU GP before adding
> the elements back to the free list and free the elem to global kernel memory
> only after both rcu and rcu_tasks_trace GPs as it's doing now.
>
>> Its downside is that it will enforce sleep-able program to use
>> bpf_rcu_read_{lock,unlock}() to access these returned pointers ?
> sleepable can access elems without kptrs/spin_locks
> even when not using rcu_read_lock, since it's safe, but there is uaf.
> Some progs might be fine with it.
> When sleepable needs to avoid uaf they will use bpf_rcu_read_lock.
Thanks for the explanation of BPF_REUSE_AFTER_RCU_GP. It seems that
BPF_REUSE_AFTER_RCU_GP may incur OOM easily: before the expiration of one RCU
GP, the freed elements are available to neither the bpf ma nor the slab
subsystem, and after the RCU GP expires they are only available to one bpf ma,
but their number may be too large for that single bpf ma, so part of them will
be freed through call_rcu_tasks_trace() and these freed-again elements will not
be available to the slab subsystem until the expiration of the tasks-trace RCU
GP. In brief, after one RCU GP part of the freed elements will be reused, but
the majority will still go through call_rcu_tasks_trace(). Due to this doubt, I
proposed BPF_FREE_AFTER_RCU_GP to directly free these elements after one RCU GP
and force sleepable programs to use bpf_rcu_read_lock() to access them, but
that enforcement would break existing sleepable programs, so
BPF_FREE_AFTER_RCU_GP is still not a good idea. I will check whether there is
still an OOM risk for BPF_REUSE_AFTER_RCU_GP and try to mitigate it if possible
(e.g., by sharing the freed elements between all bpf ma instances instead of
only the bpf ma which freed them).


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
  2023-02-17  1:19                             ` Hou Tao
@ 2023-02-22 19:30                               ` Alexei Starovoitov
  0 siblings, 0 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2023-02-22 19:30 UTC (permalink / raw)
  To: Hou Tao
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, bpf, Martin KaFai Lau,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, Hou Tao, Martin KaFai Lau

On Thu, Feb 16, 2023 at 5:19 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/17/2023 12:35 AM, Alexei Starovoitov wrote:
> > On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ?
> > The idea is for bpf_mem_free to wait normal RCU GP before adding
> > the elements back to the free list and free the elem to global kernel memory
> > only after both rcu and rcu_tasks_trace GPs as it's doing now.
> >
> >> Its downside is that it will enforce sleep-able program to use
> >> bpf_rcu_read_{lock,unlock}() to access these returned pointers ?
> > sleepable can access elems without kptrs/spin_locks
> > even when not using rcu_read_lock, since it's safe, but there is uaf.
> > Some progs might be fine with it.
> > When sleepable needs to avoid uaf they will use bpf_rcu_read_lock.
> Thanks for the explanation for BPF_REUSE_AFTER_RCU_GP. It seems that
> BPF_REUSE_AFTER_RCU_GP may incur OOM easily, because before the expiration of
> one RCU GP, these freed elements will not available to both bpf ma or slab
> subsystem and after the expiration of RCU GP, these freed elements are only
> available for one bpf ma but the number of these freed elements maybe too many
> for one bpf ma, so part of these freed elements will be freed through
> call_rcu_tasks_trace() and these freed-again elements will not be available for
> slab subsystem untill the expiration of tasks trace RCU. In brief, after one RCU
> GP, part of these freed elements will be reused, but the majority of these
> elements will still be freed through call_rcu_tasks_trace(). Due to the doubt
> above, I proposed BPF_FREE_AFTER_RCU to directly free these elements after one
> RCU GP and enforce sleepable program to use bpf_rcu_read_lock() to access these
> elements, but the enforcement will break the existing sleepable programs, so
> BPF_FREE_AFTER_GP is still not a good idea. I will check whether or not these is
> still OOM risk for BPF_REUSE_AFTER_RCU_GP and try to mitigate if it is possible
> (e.g., share these freed elements between all bpf ma instead of one bpf ma which
> free it).

Since BPF_REUSE_AFTER_RCU_GP is a new thing that will be used
by qptrie map and maybe? local storage, there is no sleepable breakage.
If we start using BPF_REUSE_AFTER_RCU_GP for hashmaps with kptrs
and enforce bpf_rcu_read_lock() this is also ok, since kptrs are unstable.
I prefer to avoid complicating bpf ma with sharing free lists across all ma-s.
Unless this is really trivial code that is easy to review.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-02-22 19:31 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 2/6] bpf: Factor out a common helper free_llist() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 3/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 4/6] bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 5/6] bpf: Use BPF_MA_NO_REUSE in htab map Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 6/6] selftests/bpf: Add test case for element reuse " Hou Tao
2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
2023-01-01 18:48   ` Yonghong Song
2023-01-03 13:47     ` Hou Tao
2023-01-04  6:10       ` Yonghong Song
2023-01-04  6:30         ` Hou Tao
2023-01-04  7:14           ` Yonghong Song
2023-01-04 18:26             ` Alexei Starovoitov
2023-02-10 16:32               ` Kumar Kartikeya Dwivedi
2023-02-10 21:06                 ` Alexei Starovoitov
2023-02-11  1:09                   ` Hou Tao
2023-02-11 16:33                     ` Alexei Starovoitov
2023-02-11 16:34                       ` Alexei Starovoitov
2023-02-15  1:54                         ` Martin KaFai Lau
2023-02-15  4:02                           ` Hou Tao
2023-02-15  7:22                             ` Martin KaFai Lau
2023-02-16  2:11                               ` Hou Tao
2023-02-16  7:47                                 ` Martin KaFai Lau
2023-02-16  8:18                                   ` Hou Tao
2023-02-16 13:55                         ` Hou Tao
2023-02-16 16:35                           ` Alexei Starovoitov
2023-02-17  1:19                             ` Hou Tao
2023-02-22 19:30                               ` Alexei Starovoitov
2023-02-15  2:35                       ` Hou Tao
2023-02-15  2:42                         ` Alexei Starovoitov
2023-02-15  3:00                           ` Hou Tao
2023-01-03 13:40   ` Hou Tao
2023-01-03 19:38     ` Alexei Starovoitov
2023-01-10  6:26       ` Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).