* [PATCH bpf v1 0/3] Don't reinit map value in prealloc_lru_pop
@ 2022-08-06  1:46 Kumar Kartikeya Dwivedi
  2022-08-06  1:46 ` [PATCH bpf v1 1/3] bpf: Allow calling bpf_prog_test kfuncs in tracing programs Kumar Kartikeya Dwivedi
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-06  1:46 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Daniel Borkmann

A fix for a bug in prealloc_lru_pop spotted while reading the code,
followed by a test and example that check whether it is fixed.

Kumar Kartikeya Dwivedi (3):
  bpf: Allow calling bpf_prog_test kfuncs in tracing programs
  bpf: Don't reinit map value in prealloc_lru_pop
  selftests/bpf: Add test for prealloc_lru_pop bug

 kernel/bpf/hashtab.c                          |  6 +-
 net/bpf/test_run.c                            |  1 +
 .../selftests/bpf/prog_tests/lru_bug.c        | 19 ++++++
 tools/testing/selftests/bpf/progs/lru_bug.c   | 67 +++++++++++++++++++
 4 files changed, 88 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/lru_bug.c
 create mode 100644 tools/testing/selftests/bpf/progs/lru_bug.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH bpf v1 1/3] bpf: Allow calling bpf_prog_test kfuncs in tracing programs
  2022-08-06  1:46 [PATCH bpf v1 0/3] Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
@ 2022-08-06  1:46 ` Kumar Kartikeya Dwivedi
  2022-08-06  1:46 ` [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
  2022-08-06  1:46 ` [PATCH bpf v1 3/3] selftests/bpf: Add test for prealloc_lru_pop bug Kumar Kartikeya Dwivedi
  2 siblings, 0 replies; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-06  1:46 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Daniel Borkmann

In addition to the TC hook, enable these kfuncs in tracing programs so
that they can be used in selftests.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 net/bpf/test_run.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index cbc9cd5058cb..d11209367dd0 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1628,6 +1628,7 @@ static int __init bpf_prog_test_run_init(void)
 	int ret;
 
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_prog_test_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_prog_test_kfunc_set);
 	return ret ?: register_btf_id_dtor_kfuncs(bpf_prog_test_dtor_kfunc,
 						  ARRAY_SIZE(bpf_prog_test_dtor_kfunc),
 						  THIS_MODULE);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-06  1:46 [PATCH bpf v1 0/3] Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
  2022-08-06  1:46 ` [PATCH bpf v1 1/3] bpf: Allow calling bpf_prog_test kfuncs in tracing programs Kumar Kartikeya Dwivedi
@ 2022-08-06  1:46 ` Kumar Kartikeya Dwivedi
  2022-08-08  6:09   ` Yonghong Song
  2022-08-06  1:46 ` [PATCH bpf v1 3/3] selftests/bpf: Add test for prealloc_lru_pop bug Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-06  1:46 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Daniel Borkmann

The LRU map that is preallocated may have its elements reused while
another program holds a pointer to it from bpf_map_lookup_elem. Hence,
only check_and_free_fields is appropriate when the element is being
deleted, as it ensures proper synchronization against concurrent access
of the map value. After that, we cannot call check_and_init_map_value
again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
they can be concurrently accessed from a BPF program.

Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/hashtab.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index da7578426a46..4d793a92301b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -311,12 +311,8 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
 	struct htab_elem *l;
 
 	if (node) {
-		u32 key_size = htab->map.key_size;
-
 		l = container_of(node, struct htab_elem, lru_node);
-		memcpy(l->key, key, key_size);
-		check_and_init_map_value(&htab->map,
-					 l->key + round_up(key_size, 8));
+		memcpy(l->key, key, htab->map.key_size);
 		return l;
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH bpf v1 3/3] selftests/bpf: Add test for prealloc_lru_pop bug
  2022-08-06  1:46 [PATCH bpf v1 0/3] Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
  2022-08-06  1:46 ` [PATCH bpf v1 1/3] bpf: Allow calling bpf_prog_test kfuncs in tracing programs Kumar Kartikeya Dwivedi
  2022-08-06  1:46 ` [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
@ 2022-08-06  1:46 ` Kumar Kartikeya Dwivedi
  2022-08-08 11:36   ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-06  1:46 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Daniel Borkmann

Add a regression test to check against invalid check_and_init_map_value
call inside prealloc_lru_pop.

To actually observe a kind of problem this can cause, set debug to 1
when running the test locally without the fix. Then, just observe the
refcount which keeps increasing on each run of the test. With timers or
spin locks, it would cause unpredictable results when racing.

...

bash-5.1# ./test_progs -t lru_bug
      test_progs-192     [000] d..21   354.838821: bpf_trace_printk: ref: 4
      test_progs-192     [000] d..21   354.842824: bpf_trace_printk: ref: 5
bash-5.1# ./test_progs -t lru_bug
      test_progs-193     [000] d..21   356.722813: bpf_trace_printk: ref: 5
      test_progs-193     [000] d..21   356.727071: bpf_trace_printk: ref: 6

... and so on.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/lru_bug.c        | 19 ++++++
 tools/testing/selftests/bpf/progs/lru_bug.c   | 67 +++++++++++++++++++
 2 files changed, 86 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/lru_bug.c
 create mode 100644 tools/testing/selftests/bpf/progs/lru_bug.c

diff --git a/tools/testing/selftests/bpf/prog_tests/lru_bug.c b/tools/testing/selftests/bpf/prog_tests/lru_bug.c
new file mode 100644
index 000000000000..e77b2d9469cb
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/lru_bug.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+
+#include "lru_bug.skel.h"
+
+void test_lru_bug(void)
+{
+	struct lru_bug *skel;
+	int ret;
+
+	skel = lru_bug__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "lru_bug__open_and_load"))
+		return;
+	ret = lru_bug__attach(skel);
+	if (!ASSERT_OK(ret, "lru_bug__attach"))
+		return;
+	usleep(1);
+	ASSERT_OK(skel->data->result, "prealloc_lru_pop doesn't call check_and_init_map_value");
+}
diff --git a/tools/testing/selftests/bpf/progs/lru_bug.c b/tools/testing/selftests/bpf/progs/lru_bug.c
new file mode 100644
index 000000000000..35cbbe7aba9e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/lru_bug.c
@@ -0,0 +1,67 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+struct map_value {
+	struct prog_test_ref_kfunc __kptr_ref *ptr;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_LRU_HASH);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct map_value);
+} lru_map SEC(".maps");
+
+extern struct prog_test_ref_kfunc *bpf_kfunc_call_test_acquire(unsigned long *sp) __ksym;
+extern void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *s) __ksym;
+
+int pid = 0;
+const volatile int debug = 0;
+int result = 1;
+
+SEC("fentry/bpf_ktime_get_ns")
+int printk(void *ctx)
+{
+	struct map_value v = {};
+
+	if (pid == bpf_get_current_task_btf()->pid)
+		bpf_map_update_elem(&lru_map, &(int){0}, &v, 0);
+	return 0;
+}
+
+SEC("fentry/do_nanosleep")
+int nanosleep(void *ctx)
+{
+	struct map_value val = {}, *v;
+	struct prog_test_ref_kfunc *s;
+	unsigned long l = 0;
+
+	bpf_map_update_elem(&lru_map, &(int){0}, &val, 0);
+	v = bpf_map_lookup_elem(&lru_map, &(int){0});
+	if (!v)
+		return 0;
+	bpf_map_delete_elem(&lru_map, &(int){0});
+	s = bpf_kfunc_call_test_acquire(&l);
+	if (!s)
+		return 0;
+	if (debug)
+		bpf_printk("ref: %d\n", s->cnt.refs.counter);
+	s = bpf_kptr_xchg(&v->ptr, s);
+	if (s)
+		bpf_kfunc_call_test_release(s);
+	pid = bpf_get_current_task_btf()->pid;
+	bpf_ktime_get_ns();
+	if (debug) {
+		s = bpf_kfunc_call_test_acquire(&l);
+		if (!s)
+			return 0;
+		bpf_printk("ref: %d\n", s->cnt.refs.counter);
+		bpf_kfunc_call_test_release(s);
+	}
+	result = !v->ptr;
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-06  1:46 ` [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
@ 2022-08-08  6:09   ` Yonghong Song
  2022-08-08 11:18     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2022-08-08  6:09 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Daniel Borkmann



On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
> The LRU map that is preallocated may have its elements reused while
> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
> only check_and_free_fields is appropriate when the element is being
> deleted, as it ensures proper synchronization against concurrent access
> of the map value. After that, we cannot call check_and_init_map_value
> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
> they can be concurrently accessed from a BPF program.

If I understand correctly, one lru_node gets freed and pushed to free 
list without doing check_and_free_fields().
If later the same node is used with bpf_map_update_elem() and 
prealloc_lru_pop() is called, then with this patch, 
check_and_init_map_value() is not called, so the new node may contain
leftover values for kptr/timer/spin_lock, could this cause a problem?

To address the above rewrite issue, maybe the solution should be
to push the deleted lru_nodes back to free list only after 
rcu_read_unlock() is done?

> 
> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   kernel/bpf/hashtab.c | 6 +-----
>   1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index da7578426a46..4d793a92301b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -311,12 +311,8 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
>   	struct htab_elem *l;
>   
>   	if (node) {
> -		u32 key_size = htab->map.key_size;
> -
>   		l = container_of(node, struct htab_elem, lru_node);
> -		memcpy(l->key, key, key_size);
> -		check_and_init_map_value(&htab->map,
> -					 l->key + round_up(key_size, 8));
> +		memcpy(l->key, key, htab->map.key_size);
>   		return l;
>   	}
>   

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-08  6:09   ` Yonghong Song
@ 2022-08-08 11:18     ` Kumar Kartikeya Dwivedi
  2022-08-08 16:19       ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-08 11:18 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
	Daniel Borkmann, toke

On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
> > The LRU map that is preallocated may have its elements reused while
> > another program holds a pointer to it from bpf_map_lookup_elem. Hence,
> > only check_and_free_fields is appropriate when the element is being
> > deleted, as it ensures proper synchronization against concurrent access
> > of the map value. After that, we cannot call check_and_init_map_value
> > again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
> > they can be concurrently accessed from a BPF program.
>
> If I understand correctly, one lru_node gets freed and pushed to free
> list without doing check_and_free_fields().

I don't think that's true, there is a check_and_free_fields call on
deletion right before bpf_lru_push_free in htab_lru_push_free.
Then once the preallocated items are freed on map destruction, we free
timers and kptrs again, so if someone has access to preallocated items
after freeing e.g. through an earlier lookup, we still release
resources they may have created at the end of map's lifetime.
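
For reference, the ordering in question is visible in the deletion path
itself; roughly (paraphrased from kernel/bpf/hashtab.c, not a verbatim
quote of this tree):

static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
{
        /* Timer and kptr fields are released while the deleting CPU still
         * owns the element, and only then is the node handed back to the
         * LRU freelist for reuse.
         */
        check_and_free_fields(htab, elem);
        bpf_lru_push_free(&htab->lru, &elem->lru_node);
}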

> If later the same node is used with bpf_map_update_elem() and
> prealloc_lru_pop() is called, then with this patch,
> check_and_init_map_value() is not called, so the new node may contain
> leftover values for kptr/timer/spin_lock, could this cause a problem?
>

This can only happen once you touch kptr/timer/spin_lock after
deletion's check_and_free_fields call, but the program obtaining the
new item will see and be able to handle that case. The timer helpers
handle if an existing timer exists, kptr_xchg returns the old pointer
as a reference you must release. For unreferenced kptr, it is marked
as PTR_UNTRUSTED so a corrupted pointer value is possible but not
fatal. If spin_lock is locked on lookup, then the other CPU having
access to deleted-but-now-reallocated item will eventually call
unlock.

It is very much expected, IIUC, that someone else may use-after-free
deleted items of hashtab.c maps in case of preallocation. It can be
considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
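
To make that concrete, a minimal sketch of a program written with element
reuse in mind, modeled loosely on the selftest in patch 3 (the map,
section and program names are illustrative, not part of this series):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct map_value {
        struct prog_test_ref_kfunc __kptr_ref *ptr;
};

struct {
        __uint(type, BPF_MAP_TYPE_LRU_HASH);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct map_value);
} reuse_map SEC(".maps");

extern void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *s) __ksym;

SEC("tc")
int handle(struct __sk_buff *ctx)
{
        struct prog_test_ref_kfunc *old;
        struct map_value *v;

        v = bpf_map_lookup_elem(&reuse_map, &(int){0});
        if (!v)
                return 0;
        /* The element may have been deleted and reallocated by an update on
         * another CPU; whatever kptr sits in it now is a reference we own
         * once exchanged out, so release it instead of trusting it.
         */
        old = bpf_kptr_xchg(&v->ptr, NULL);
        if (old)
                bpf_kfunc_call_test_release(old);
        return 0;
}

char _license[] SEC("license") = "GPL";

Whatever the exchange returns is a reference the program now owns, whether
it was stored by this program earlier or left over from a previous life of
the element.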

> To address the above rewrite issue, maybe the solution should be
> to push the deleted lru_nodes back to free list only after
> rcu_read_unlock() is done?

Please correct me if I'm wrong, but I don't think this is a good idea.
Delaying preallocated item reuse for a RCU grace period will greatly
increase the probability of running out of preallocated items under
load, even though technically those items are free for use.

Side note: I found the problem this patch fixes while reading the
code, because I am running into this exact problem with my WIP skip
list map implementation, where in the preallocated case, to make
things a bit easier for the lockless lookup, I delay reuse of items
until an RCU grace period passes (so that the deleted -> unlinked
transition does not happen during traversal), but I'm easily able to
come up with scenarios (producer/consumer situations) where that leads
to exhaustion of the preallocated memory (and if not that, performance
degradation on updates because pcpu_freelist now needs to search other
CPU's caches more often).

BTW, this would be fixed if we could simply use Alexei's per-CPU cache
based memory allocator instead of preallocating items, because then
the per-CPU cache gets replenished when it goes below the watermark
(and also frees stuff back to kernel allocator above the high
watermark, which is great for producer/consumer cases with alloc/free
imbalance), so we can do the RCU delays similar to kmalloc case
without running into the memory exhaustion problem.

>
> >
> > Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >   kernel/bpf/hashtab.c | 6 +-----
> >   1 file changed, 1 insertion(+), 5 deletions(-)
> >
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index da7578426a46..4d793a92301b 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -311,12 +311,8 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
> >       struct htab_elem *l;
> >
> >       if (node) {
> > -             u32 key_size = htab->map.key_size;
> > -
> >               l = container_of(node, struct htab_elem, lru_node);
> > -             memcpy(l->key, key, key_size);
> > -             check_and_init_map_value(&htab->map,
> > -                                      l->key + round_up(key_size, 8));
> > +             memcpy(l->key, key, htab->map.key_size);
> >               return l;
> >       }
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 3/3] selftests/bpf: Add test for prealloc_lru_pop bug
  2022-08-06  1:46 ` [PATCH bpf v1 3/3] selftests/bpf: Add test for prealloc_lru_pop bug Kumar Kartikeya Dwivedi
@ 2022-08-08 11:36   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-08 11:36 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Daniel Borkmann

On Sat, 6 Aug 2022 at 03:46, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> Add a regression test to check against invalid check_and_init_map_value
> call inside prealloc_lru_pop.
>
> To actually observe a kind of problem this can cause, set debug to 1
> when running the test locally without the fix. Then, just observe the
> refcount which keeps increasing on each run of the test. With timers or
> spin locks, it would cause unpredictable results when racing.
>
> ...
>
> bash-5.1# ./test_progs -t lru_bug
>       test_progs-192     [000] d..21   354.838821: bpf_trace_printk: ref: 4
>       test_progs-192     [000] d..21   354.842824: bpf_trace_printk: ref: 5
> bash-5.1# ./test_progs -t lru_bug
>       test_progs-193     [000] d..21   356.722813: bpf_trace_printk: ref: 5
>       test_progs-193     [000] d..21   356.727071: bpf_trace_printk: ref: 6
>
> ... and so on.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  .../selftests/bpf/prog_tests/lru_bug.c        | 19 ++++++
>  tools/testing/selftests/bpf/progs/lru_bug.c   | 67 +++++++++++++++++++
>  2 files changed, 86 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/lru_bug.c
>  create mode 100644 tools/testing/selftests/bpf/progs/lru_bug.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/lru_bug.c b/tools/testing/selftests/bpf/prog_tests/lru_bug.c
> new file mode 100644
> index 000000000000..e77b2d9469cb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/lru_bug.c
> @@ -0,0 +1,19 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +
> +#include "lru_bug.skel.h"
> +
> +void test_lru_bug(void)

CI is failing because map_kptr and this test both want to observe the
refcount while the other is not touching it, so marking this test
serial_ would fix it (map_kptr takes time, so it is better to keep it
running in parallel mode).

I will wait for the discussion to conclude before respinning.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-08 11:18     ` Kumar Kartikeya Dwivedi
@ 2022-08-08 16:19       ` Yonghong Song
  2022-08-08 18:55         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2022-08-08 16:19 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
	Daniel Borkmann, toke



On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
> On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
>>> The LRU map that is preallocated may have its elements reused while
>>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
>>> only check_and_free_fields is appropriate when the element is being
>>> deleted, as it ensures proper synchronization against concurrent access
>>> of the map value. After that, we cannot call check_and_init_map_value
>>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
>>> they can be concurrently accessed from a BPF program.
>>
>> If I understand correctly, one lru_node gets freed and pushed to free
>> list without doing check_and_free_fields().
> 
> I don't think that's true, there is a check_and_free_fields call on
> deletion right before bpf_lru_push_free in htab_lru_push_free.
> Then once the preallocated items are freed on map destruction, we free
> timers and kptrs again, so if someone has access to preallocated items
> after freeing e.g. through an earlier lookup, we still release
> resources they may have created at the end of map's lifetime.
> 
>> If later the same node is used with bpf_map_update_elem() and
>> prealloc_lru_pop() is called, then with this patch,
>> check_and_init_map_value() is not called, so the new node may contain
>> leftover values for kptr/timer/spin_lock, could this cause a problem?
>>
> 
> This can only happen once you touch kptr/timer/spin_lock after
> deletion's check_and_free_fields call, but the program obtaining the
> new item will see and be able to handle that case. The timer helpers
> handle if an existing timer exists, kptr_xchg returns the old pointer
> as a reference you must release. For unreferenced kptr, it is marked
> as PTR_UNTRUSTED so a corrupted pointer value is possible but not
> fatal. If spin_lock is locked on lookup, then the other CPU having
> access to deleted-but-now-reallocated item will eventually call
> unlock.

Thanks for explanation. Originally I think we should clear everything
including spin_lock before putting the deleted lru_node to free list.
check_and_free_fields() only did for kptr/timer but not spin_lock.

But looks like we should not delete spin_lock before pushing the
deleted nodes to free list since lookup side may hold a reference
to the map value and it may have done a bpf_spin_lock() call.
And we should not clear spin_lock fields in check_and_free_fields()
and neither in prealloc_lru_pop() in map_update. Otherwise, we
may have issues for bpf_spin_unlock() on lookup side.

It looks timer and kptr are already been handled for such
cases (concurrency between map_lookup() and clearing some map_value
fields for timer/kptr).
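
To make the hazard concrete, here is a minimal sketch of a lookup-side
critical section that would break if any of the delete or reuse paths
rewrote the lock word (names are illustrative, not from this series; a
plain preallocated BPF_MAP_TYPE_HASH is used, which reuses elements
through its freelist in the same way):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct map_value {
        struct bpf_spin_lock lock;
        int data;
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct map_value);
} lock_map SEC(".maps");

SEC("tc")
int locker(struct __sk_buff *ctx)
{
        struct map_value *v = bpf_map_lookup_elem(&lock_map, &(int){0});

        if (!v)
                return 0;
        bpf_spin_lock(&v->lock);
        /* If a concurrent delete + update reinitialized the lock word of
         * this (possibly reallocated) element here, the unlock below would
         * operate on a lock its new owner already considers free.
         */
        v->data++;
        bpf_spin_unlock(&v->lock);
        return 0;
}

char _license[] SEC("license") = "GPL";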

> 
> It is very much expected, IIUC, that someone else may use-after-free
> deleted items of hashtab.c maps in case of preallocation. It can be
> considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
> 
>> To address the above rewrite issue, maybe the solution should be
>> to push the deleted lru_nodes back to free list only after
>> rcu_read_unlock() is done?
> 
> Please correct me if I'm wrong, but I don't think this is a good idea.
> Delaying preallocated item reuse for a RCU grace period will greatly
> increase the probability of running out of preallocated items under
> load, even though technically those items are free for use.

Agree. This is not a good idea. It increased life cycle for preallocated
item reuse and will have some performance issue and resource consumption
issue.

> 
> Side note: I found the problem this patch fixes while reading the
> code, because I am running into this exact problem with my WIP skip
> list map implementation, where in the preallocated case, to make
> things a bit easier for the lockless lookup, I delay reuse of items
> until an RCU grace period passes (so that the deleted -> unlinked
> transition does not happen during traversal), but I'm easily able to
> come up with scenarios (producer/consumer situations) where that leads
> to exhaustion of the preallocated memory (and if not that, performance
> degradation on updates because pcpu_freelist now needs to search other
> CPU's caches more often).
> 
> BTW, this would be fixed if we could simply use Alexei's per-CPU cache
> based memory allocator instead of preallocating items, because then
> the per-CPU cache gets replenished when it goes below the watermark
> (and also frees stuff back to kernel allocator above the high
> watermark, which is great for producer/consumer cases with alloc/free
> imbalance), so we can do the RCU delays similar to kmalloc case
> without running into the memory exhaustion problem.

Thanks for explanation. So okay the patch looks good to me.

> 
>>
>>>
>>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Acked-by: Yonghong Song <yhs@fb.com>

>>> ---
>>>    kernel/bpf/hashtab.c | 6 +-----
>>>    1 file changed, 1 insertion(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index da7578426a46..4d793a92301b 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -311,12 +311,8 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
>>>        struct htab_elem *l;
>>>
>>>        if (node) {
>>> -             u32 key_size = htab->map.key_size;
>>> -
>>>                l = container_of(node, struct htab_elem, lru_node);
>>> -             memcpy(l->key, key, key_size);
>>> -             check_and_init_map_value(&htab->map,
>>> -                                      l->key + round_up(key_size, 8));
>>> +             memcpy(l->key, key, htab->map.key_size);
>>>                return l;
>>>        }
>>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-08 16:19       ` Yonghong Song
@ 2022-08-08 18:55         ` Kumar Kartikeya Dwivedi
  2022-08-08 23:23           ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-08 18:55 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
	Daniel Borkmann, toke

On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
> > On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
> >>> The LRU map that is preallocated may have its elements reused while
> >>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
> >>> only check_and_free_fields is appropriate when the element is being
> >>> deleted, as it ensures proper synchronization against concurrent access
> >>> of the map value. After that, we cannot call check_and_init_map_value
> >>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
> >>> they can be concurrently accessed from a BPF program.
> >>
> >> If I understand correctly, one lru_node gets freed and pushed to free
> >> list without doing check_and_free_fields().
> >
> > I don't think that's true, there is a check_and_free_fields call on
> > deletion right before bpf_lru_push_free in htab_lru_push_free.
> > Then once the preallocated items are freed on map destruction, we free
> > timers and kptrs again, so if someone has access to preallocated items
> > after freeing e.g. through an earlier lookup, we still release
> > resources they may have created at the end of map's lifetime.
> >
> >> If later the same node is used with bpf_map_update_elem() and
> >> prealloc_lru_pop() is called, then with this patch,
> >> check_and_init_map_value() is not called, so the new node may contain
> >> leftover values for kptr/timer/spin_lock, could this cause a problem?
> >>
> >
> > This can only happen once you touch kptr/timer/spin_lock after
> > deletion's check_and_free_fields call, but the program obtaining the
> > new item will see and be able to handle that case. The timer helpers
> > handle if an existing timer exists, kptr_xchg returns the old pointer
> > as a reference you must release. For unreferenced kptr, it is marked
> > as PTR_UNTRUSTED so a corrupted pointer value is possible but not
> > fatal. If spin_lock is locked on lookup, then the other CPU having
> > access to deleted-but-now-reallocated item will eventually call
> > unlock.
>
> Thanks for explanation. Originally I think we should clear everything
> including spin_lock before putting the deleted lru_node to free list.
> check_and_free_fields() only did for kptr/timer but not spin_lock.
>
> But looks like we should not delete spin_lock before pushing the
> deleted nodes to free list since lookup side may hold a reference
> to the map value and it may have done a bpf_spin_lock() call.
> And we should not clear spin_lock fields in check_and_free_fields()
> and neither in prealloc_lru_pop() in map_update. Otherwise, we
> may have issues for bpf_spin_unlock() on lookup side.
>
> It looks timer and kptr are already been handled for such
> cases (concurrency between map_lookup() and clearing some map_value
> fields for timer/kptr).
>

Yes, I also took a look again at other call sites of
check_and_init_map_value and everything looks sane.

> >
> > It is very much expected, IIUC, that someone else may use-after-free
> > deleted items of hashtab.c maps in case of preallocation. It can be
> > considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
> >
> >> To address the above rewrite issue, maybe the solution should be
> >> to push the deleted lru_nodes back to free list only after
> >> rcu_read_unlock() is done?
> >
> > Please correct me if I'm wrong, but I don't think this is a good idea.
> > Delaying preallocated item reuse for a RCU grace period will greatly
> > increase the probability of running out of preallocated items under
> > load, even though technically those items are free for use.
>
> Agree. This is not a good idea. It increased life cycle for preallocated
> item reuse and will have some performance issue and resource consumption
> issue.
>
> >
> > Side note: I found the problem this patch fixes while reading the
> > code, because I am running into this exact problem with my WIP skip
> > list map implementation, where in the preallocated case, to make
> > things a bit easier for the lockless lookup, I delay reuse of items
> > until an RCU grace period passes (so that the deleted -> unlinked
> > transition does not happen during traversal), but I'm easily able to
> > come up with scenarios (producer/consumer situations) where that leads
> > to exhaustion of the preallocated memory (and if not that, performance
> > degradation on updates because pcpu_freelist now needs to search other
> > CPU's caches more often).
> >
> > BTW, this would be fixed if we could simply use Alexei's per-CPU cache
> > based memory allocator instead of preallocating items, because then
> > the per-CPU cache gets replenished when it goes below the watermark
> > (and also frees stuff back to kernel allocator above the high
> > watermark, which is great for producer/consumer cases with alloc/free
> > imbalance), so we can do the RCU delays similar to kmalloc case
> > without running into the memory exhaustion problem.
>
> Thanks for explanation. So okay the patch looks good to me.
>
> >
> >>
> >>>
> >>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> >>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>
> Acked-by: Yonghong Song <yhs@fb.com>
>

Thanks! I'll summarize our discussion in short and add it to the commit log.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-08 18:55         ` Kumar Kartikeya Dwivedi
@ 2022-08-08 23:23           ` Yonghong Song
  2022-08-09  0:24             ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2022-08-08 23:23 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
	Daniel Borkmann, toke



On 8/8/22 11:55 AM, Kumar Kartikeya Dwivedi wrote:
> On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
>>> On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
>>>>> The LRU map that is preallocated may have its elements reused while
>>>>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
>>>>> only check_and_free_fields is appropriate when the element is being
>>>>> deleted, as it ensures proper synchronization against concurrent access
>>>>> of the map value. After that, we cannot call check_and_init_map_value
>>>>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
>>>>> they can be concurrently accessed from a BPF program.
>>>>
>>>> If I understand correctly, one lru_node gets freed and pushed to free
>>>> list without doing check_and_free_fields().
>>>
>>> I don't think that's true, there is a check_and_free_fields call on
>>> deletion right before bpf_lru_push_free in htab_lru_push_free.
>>> Then once the preallocated items are freed on map destruction, we free
>>> timers and kptrs again, so if someone has access to preallocated items
>>> after freeing e.g. through an earlier lookup, we still release
>>> resources they may have created at the end of map's lifetime.
>>>
>>>> If later the same node is used with bpf_map_update_elem() and
>>>> prealloc_lru_pop() is called, then with this patch,
>>>> check_and_init_map_value() is not called, so the new node may contain
>>>> leftover values for kptr/timer/spin_lock, could this cause a problem?
>>>>
>>>
>>> This can only happen once you touch kptr/timer/spin_lock after
>>> deletion's check_and_free_fields call, but the program obtaining the
>>> new item will see and be able to handle that case. The timer helpers
>>> handle if an existing timer exists, kptr_xchg returns the old pointer
>>> as a reference you must release. For unreferenced kptr, it is marked
>>> as PTR_UNTRUSTED so a corrupted pointer value is possible but not
>>> fatal. If spin_lock is locked on lookup, then the other CPU having
>>> access to deleted-but-now-reallocated item will eventually call
>>> unlock.
>>
>> Thanks for explanation. Originally I think we should clear everything
>> including spin_lock before putting the deleted lru_node to free list.
>> check_and_free_fields() only did for kptr/timer but not spin_lock.
>>
>> But looks like we should not delete spin_lock before pushing the
>> deleted nodes to free list since lookup side may hold a reference
>> to the map value and it may have done a bpf_spin_lock() call.
>> And we should not clear spin_lock fields in check_and_free_fields()
>> and neither in prealloc_lru_pop() in map_update. Otherwise, we
>> may have issues for bpf_spin_unlock() on lookup side.
>>
>> It looks timer and kptr are already been handled for such
>> cases (concurrency between map_lookup() and clearing some map_value
>> fields for timer/kptr).
>>
> 
> Yes, I also took a look again at other call sites of
> check_and_init_map_value and everything looks sane.

Sounds good.

> 
>>>
>>> It is very much expected, IIUC, that someone else may use-after-free
>>> deleted items of hashtab.c maps in case of preallocation. It can be
>>> considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
>>>
>>>> To address the above rewrite issue, maybe the solution should be
>>>> to push the deleted lru_nodes back to free list only after
>>>> rcu_read_unlock() is done?
>>>
>>> Please correct me if I'm wrong, but I don't think this is a good idea.
>>> Delaying preallocated item reuse for a RCU grace period will greatly
>>> increase the probability of running out of preallocated items under
>>> load, even though technically those items are free for use.
>>
>> Agree. This is not a good idea. It increased life cycle for preallocated
>> item reuse and will have some performance issue and resource consumption
>> issue.
>>
>>>
>>> Side note: I found the problem this patch fixes while reading the
>>> code, because I am running into this exact problem with my WIP skip
>>> list map implementation, where in the preallocated case, to make
>>> things a bit easier for the lockless lookup, I delay reuse of items
>>> until an RCU grace period passes (so that the deleted -> unlinked
>>> transition does not happen during traversal), but I'm easily able to
>>> come up with scenarios (producer/consumer situations) where that leads
>>> to exhaustion of the preallocated memory (and if not that, performance
>>> degradation on updates because pcpu_freelist now needs to search other
>>> CPU's caches more often).

Thinking again. I guess the following scenario is possible:

      rcu_read_lock()
         v = bpf_map_lookup_elem(&key);
         t1 = v->field;
         bpf_map_delete_elem(&key);
            /* another bpf program triggering bpf_map_update_elem() */
            /* the element with 'key' is reused */
            /* the value is updated */
         t2 = v->field;
         ...
      rcu_read_unlock()

it is possible t1 and t2 may not be the same.
This should be an extremely corner case, not sure how to resolve
this easily without performance degradation.

>>>
>>> BTW, this would be fixed if we could simply use Alexei's per-CPU cache
>>> based memory allocator instead of preallocating items, because then
>>> the per-CPU cache gets replenished when it goes below the watermark
>>> (and also frees stuff back to kernel allocator above the high
>>> watermark, which is great for producer/consumer cases with alloc/free
>>> imbalance), so we can do the RCU delays similar to kmalloc case
>>> without running into the memory exhaustion problem.
>>
>> Thanks for explanation. So okay the patch looks good to me.
>>
>>>
>>>>
>>>>>
>>>>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
>>>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>
>> Acked-by: Yonghong Song <yhs@fb.com>
>>
> 
> Thanks! I'll summarize our discussion in short and add it to the commit log.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-08 23:23           ` Yonghong Song
@ 2022-08-09  0:24             ` Kumar Kartikeya Dwivedi
  2022-08-09  0:53               ` Kumar Kartikeya Dwivedi
  2022-08-09  3:18               ` Alexei Starovoitov
  0 siblings, 2 replies; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-09  0:24 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
	Daniel Borkmann, toke

On Tue, 9 Aug 2022 at 01:23, Yonghong Song <yhs@fb.com> wrote:
> On 8/8/22 11:55 AM, Kumar Kartikeya Dwivedi wrote:
> > On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
> >>> On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@fb.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
> >>>>> The LRU map that is preallocated may have its elements reused while
> >>>>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
> >>>>> only check_and_free_fields is appropriate when the element is being
> >>>>> deleted, as it ensures proper synchronization against concurrent access
> >>>>> of the map value. After that, we cannot call check_and_init_map_value
> >>>>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
> >>>>> they can be concurrently accessed from a BPF program.
> >>>>
> >>>> If I understand correctly, one lru_node gets freed and pushed to free
> >>>> list without doing check_and_free_fields().
> >>>
> >>> I don't think that's true, there is a check_and_free_fields call on
> >>> deletion right before bpf_lru_push_free in htab_lru_push_free.
> >>> Then once the preallocated items are freed on map destruction, we free
> >>> timers and kptrs again, so if someone has access to preallocated items
> >>> after freeing e.g. through an earlier lookup, we still release
> >>> resources they may have created at the end of map's lifetime.
> >>>
> >>>> If later the same node is used with bpf_map_update_elem() and
> >>>> prealloc_lru_pop() is called, then with this patch,
> >>>> check_and_init_map_value() is not called, so the new node may contain
> >>>> leftover values for kptr/timer/spin_lock, could this cause a problem?
> >>>>
> >>>
> >>> This can only happen once you touch kptr/timer/spin_lock after
> >>> deletion's check_and_free_fields call, but the program obtaining the
> >>> new item will see and be able to handle that case. The timer helpers
> >>> handle if an existing timer exists, kptr_xchg returns the old pointer
> >>> as a reference you must release. For unreferenced kptr, it is marked
> >>> as PTR_UNTRUSTED so a corrupted pointer value is possible but not
> >>> fatal. If spin_lock is locked on lookup, then the other CPU having
> >>> access to deleted-but-now-reallocated item will eventually call
> >>> unlock.
> >>
> >> Thanks for explanation. Originally I think we should clear everything
> >> including spin_lock before putting the deleted lru_node to free list.
> >> check_and_free_fields() only did for kptr/timer but not spin_lock.
> >>
> >> But looks like we should not delete spin_lock before pushing the
> >> deleted nodes to free list since lookup side may hold a reference
> >> to the map value and it may have done a bpf_spin_lock() call.
> >> And we should not clear spin_lock fields in check_and_free_fields()
> >> and neither in prealloc_lru_pop() in map_update. Otherwise, we
> >> may have issues for bpf_spin_unlock() on lookup side.
> >>
> >> It looks timer and kptr are already been handled for such
> >> cases (concurrency between map_lookup() and clearing some map_value
> >> fields for timer/kptr).
> >>
> >
> > Yes, I also took a look again at other call sites of
> > check_and_init_map_value and everything looks sane.
>
> Sounds good.
>
> >
> >>>
> >>> It is very much expected, IIUC, that someone else may use-after-free
> >>> deleted items of hashtab.c maps in case of preallocation. It can be
> >>> considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
> >>>
> >>>> To address the above rewrite issue, maybe the solution should be
> >>>> to push the deleted lru_nodes back to free list only after
> >>>> rcu_read_unlock() is done?
> >>>
> >>> Please correct me if I'm wrong, but I don't think this is a good idea.
> >>> Delaying preallocated item reuse for a RCU grace period will greatly
> >>> increase the probability of running out of preallocated items under
> >>> load, even though technically those items are free for use.
> >>
> >> Agree. This is not a good idea. It increased life cycle for preallocated
> >> item reuse and will have some performance issue and resource consumption
> >> issue.
> >>
> >>>
> >>> Side note: I found the problem this patch fixes while reading the
> >>> code, because I am running into this exact problem with my WIP skip
> >>> list map implementation, where in the preallocated case, to make
> >>> things a bit easier for the lockless lookup, I delay reuse of items
> >>> until an RCU grace period passes (so that the deleted -> unlinked
> >>> transition does not happen during traversal), but I'm easily able to
> >>> come up with scenarios (producer/consumer situations) where that leads
> >>> to exhaustion of the preallocated memory (and if not that, performance
> >>> degradation on updates because pcpu_freelist now needs to search other
> >>> CPU's caches more often).
>
> Thinking again. I guess the following scenario is possible:
>
>       rcu_read_lock()
>          v = bpf_map_lookup_elem(&key);
>          t1 = v->field;
>          bpf_map_delete_elem(&key);
>             /* another bpf program triggering bpf_map_update_elem() */
>             /* the element with 'key' is reused */
>             /* the value is updated */
>          t2 = v->field;
>          ...
>       rcu_read_unlock()
>
> it is possible t1 and t2 may not be the same.
> This should be an extremely corner case, not sure how to resolve
> this easily without performance degradation.

Yes, this is totally possible. This extreme corner case may also
become a real problem in sleepable programs ;-).

I had an idea in mind on how it can be done completely in the BPF
program side without changing the map implementation.

We can add a new tag that marks a field as direct access only, so it
will be treated like kptr/timer/spin_lock during copy_map_value, and
not touched on deletion or update (i.e. no reset on deletion, no init
on update).

Then, use a sequence counter field with this tag, and update it
whenever performing a deletion or doing an update.
On first update, this counter will be 0 (thanks to allocation being
zeroed by default).
On deletion side, it will be:
v = lookup_elem(...);
if (cmpxchg(&v->seq, v->seq, v->seq + 1))
  delete_elem(...);

No ABA problems as seq is ever increasing. Odd seq means item is in
freelist, even means it is alive, on update it is incremented (but
using a cmpxchg) to indicate item is back alive. So the cost is one
lookup + cmpxchg on delete and update resp.

On the reader side, the reader starts by loading the seq before doing
read, and comparing the seq after its read (seqlock style).

seq1 = v->seq;
smp_rmb();
// read data
smp_rmb();
seq2 = v->seq;
if (seq1 != seq2 || seq1 & 1)
  handle_bad_read();

smp_rmb() is free on x86, but it can be generalized when we have a
proper BPF memory model. Since seq is 'direct access only' field, it
is preserved across alloc -> free -> alloc -> ....

Maybe I missed some details so it might be incorrect somewhere, but
the main idea is to do something similar to rechecking key/generation
counter when items are reused in SLAB_TYPESAFE_BY_RCU, but also
ensuring integrity of the data being read from map value at the same
time. So if you want absolutely bullet proof reads and are willing to
pay some cost for that, this should work.
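
Spelled out end to end as a rough, untested sketch (the seq field stands
in for the proposed "direct access only" tag; barriers are elided as
discussed above, and the cmpxchg assumes the BPF atomic instructions,
i.e. building with clang -mcpu=v3 or newer):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct map_value {
        /* Would carry the proposed "direct access only" tag so it survives
         * element reuse: even = live, odd = sitting in the freelist.
         */
        __u64 seq;
        __u64 data;
};

struct {
        __uint(type, BPF_MAP_TYPE_LRU_HASH);
        __uint(max_entries, 128);
        __type(key, int);
        __type(value, struct map_value);
} seq_map SEC(".maps");

/* Delete side: move seq to odd with a cmpxchg before deleting, so readers
 * and later updaters can tell the element has gone through the freelist.
 */
static __always_inline void seq_delete(int *key)
{
        struct map_value *v = bpf_map_lookup_elem(&seq_map, key);
        __u64 old;

        if (!v)
                return;
        old = v->seq;
        if (__sync_val_compare_and_swap(&v->seq, old, old + 1) == old)
                bpf_map_delete_elem(&seq_map, key);
}

/* Read side: seqlock-style recheck to detect a delete/reuse racing with
 * the read.
 */
static __always_inline int seq_read(struct map_value *v, __u64 *out)
{
        __u64 seq1 = v->seq;

        *out = v->data;
        if (v->seq != seq1 || (seq1 & 1))
                return -1;      /* element was freed or reused under us */
        return 0;
}

char _license[] SEC("license") = "GPL";

The update side would then cmpxchg seq from odd back to even once
bpf_map_update_elem() brings the element back to life, as described above.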

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-09  0:24             ` Kumar Kartikeya Dwivedi
@ 2022-08-09  0:53               ` Kumar Kartikeya Dwivedi
  2022-08-09  3:18               ` Alexei Starovoitov
  1 sibling, 0 replies; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-09  0:53 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
	Daniel Borkmann, toke

On Tue, 9 Aug 2022 at 02:24, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Tue, 9 Aug 2022 at 01:23, Yonghong Song <yhs@fb.com> wrote:
> > On 8/8/22 11:55 AM, Kumar Kartikeya Dwivedi wrote:
> > > On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@fb.com> wrote:
> > >>
> > >>
> > >>
> > >> On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
> > >>> On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@fb.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
> > >>>>> The LRU map that is preallocated may have its elements reused while
> > >>>>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
> > >>>>> only check_and_free_fields is appropriate when the element is being
> > >>>>> deleted, as it ensures proper synchronization against concurrent access
> > >>>>> of the map value. After that, we cannot call check_and_init_map_value
> > >>>>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
> > >>>>> they can be concurrently accessed from a BPF program.
> > >>>>
> > >>>> If I understand correctly, one lru_node gets freed and pushed to free
> > >>>> list without doing check_and_free_fields().
> > >>>
> > >>> I don't think that's true, there is a check_and_free_fields call on
> > >>> deletion right before bpf_lru_push_free in htab_lru_push_free.
> > >>> Then once the preallocated items are freed on map destruction, we free
> > >>> timers and kptrs again, so if someone has access to preallocated items
> > >>> after freeing e.g. through an earlier lookup, we still release
> > >>> resources they may have created at the end of map's lifetime.
> > >>>
> > >>>> If later the same node is used with bpf_map_update_elem() and
> > >>>> prealloc_lru_pop() is called, then with this patch,
> > >>>> check_and_init_map_value() is not called, so the new node may contain
> > >>>> leftover values for kptr/timer/spin_lock, could this cause a problem?
> > >>>>
> > >>>
> > >>> This can only happen once you touch kptr/timer/spin_lock after
> > >>> deletion's check_and_free_fields call, but the program obtaining the
> > >>> new item will see and be able to handle that case. The timer helpers
> > >>> handle if an existing timer exists, kptr_xchg returns the old pointer
> > >>> as a reference you must release. For unreferenced kptr, it is marked
> > >>> as PTR_UNTRUSTED so a corrupted pointer value is possible but not
> > >>> fatal. If spin_lock is locked on lookup, then the other CPU having
> > >>> access to deleted-but-now-reallocated item will eventually call
> > >>> unlock.
> > >>
> > >> Thanks for explanation. Originally I think we should clear everything
> > >> including spin_lock before putting the deleted lru_node to free list.
> > >> check_and_free_fields() only did for kptr/timer but not spin_lock.
> > >>
> > >> But looks like we should not delete spin_lock before pushing the
> > >> deleted nodes to free list since lookup side may hold a reference
> > >> to the map value and it may have done a bpf_spin_lock() call.
> > >> And we should not clear spin_lock fields in check_and_free_fields()
> > >> and neither in prealloc_lru_pop() in map_update. Otherwise, we
> > >> may have issues for bpf_spin_unlock() on lookup side.
> > >>
> > >> It looks timer and kptr are already been handled for such
> > >> cases (concurrency between map_lookup() and clearing some map_value
> > >> fields for timer/kptr).
> > >>
> > >
> > > Yes, I also took a look again at other call sites of
> > > check_and_init_map_value and everything looks sane.
> >
> > Sounds good.
> >
> > >
> > >>>
> > >>> It is very much expected, IIUC, that someone else may use-after-free
> > >>> deleted items of hashtab.c maps in case of preallocation. It can be
> > >>> considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
> > >>>
> > >>>> To address the above rewrite issue, maybe the solution should be
> > >>>> to push the deleted lru_nodes back to free list only after
> > >>>> rcu_read_unlock() is done?
> > >>>
> > >>> Please correct me if I'm wrong, but I don't think this is a good idea.
> > >>> Delaying preallocated item reuse for a RCU grace period will greatly
> > >>> increase the probability of running out of preallocated items under
> > >>> load, even though technically those items are free for use.
> > >>
> > >> Agree. This is not a good idea. It increased life cycle for preallocated
> > >> item reuse and will have some performance issue and resource consumption
> > >> issue.
> > >>
> > >>>
> > >>> Side note: I found the problem this patch fixes while reading the
> > >>> code, because I am running into this exact problem with my WIP skip
> > >>> list map implementation, where in the preallocated case, to make
> > >>> things a bit easier for the lockless lookup, I delay reuse of items
> > >>> until an RCU grace period passes (so that the deleted -> unlinked
> > >>> transition does not happen during traversal), but I'm easily able to
> > >>> come up with scenarios (producer/consumer situations) where that leads
> > >>> to exhaustion of the preallocated memory (and if not that, performance
> > >>> degradation on updates because pcpu_freelist now needs to search other
> > >>> CPU's caches more often).
> >
> > Thinking again. I guess the following scenario is possible:
> >
> >       rcu_read_lock()
> >          v = bpf_map_lookup_elem(&key);
> >          t1 = v->field;
> >          bpf_map_delete_elem(&key);
> >             /* another bpf program triggering bpf_map_update_elem() */
> >             /* the element with 'key' is reused */
> >             /* the value is updated */
> >          t2 = v->field;
> >          ...
> >       rcu_read_unlock()
> >
> > it is possible t1 and t2 may not be the same.
> > This should be an extremely corner case, not sure how to resolve
> > this easily without performance degradation.
>
> Yes, this is totally possible. This extreme corner case may also
> become a real problem in sleepable programs ;-).
>
> I had an idea in mind on how it can be done completely in the BPF
> program side without changing the map implementation.
>
> We can add a new tag that marks a field as direct access only, so it
> will be treated like kptr/timer/spin_lock during copy_map_value, and
> not touched on deletion or update (i.e. no reset on deletion, no init
> on update).
>
> Then, use a sequence counter field with this tag, and update it
> whenever performing a deletion or doing an update.
> On first update, this counter will be 0 (thanks to allocation being
> zeroed by default).
> On deletion side, it will be:
> v = lookup_elem(...);
> if (cmpxchg(&v->seq, v->seq, v->seq + 1))
>   delete_elem(...);
>
> No ABA problems as seq is ever increasing. Odd seq means item is in
> freelist, even means it is alive, on update it is incremented (but
> using a cmpxchg) to indicate item is back alive. So the cost is one
> lookup + cmpxchg on delete and update resp.
>
> On the reader side, the reader starts by loading the seq before doing
> read, and comparing the seq after its read (seqlock style).
>
> seq1 = v->seq;
> smp_rmb();
> // read data
> smp_rmb();
> seq2 = v->seq;
> if (seq1 != seq2 || seq1 & 1)
>   handle_bad_read();
>
> smp_rmb() is free on x86, but it can be generalized when we have a
> proper BPF memory model. Since seq is 'direct access only' field, it
> is preserved across alloc -> free -> alloc -> ....
>
> Maybe I missed some details so it might be incorrect somewhere, but
> the main idea is to do something similar to rechecking key/generation
> counter when items are reused in SLAB_TYPESAFE_BY_RCU, but also
> ensuring integrity of the data being read from map value at the same
> time. So if you want absolutely bullet proof reads and are willing to
> pay some cost for that, this should work.

This whole scheme is a bit complicated, but would be less costly for
readers. The other option that comes to mind is refcounting of map
elements, which is easier to implement and add support to the map
itself, but I don't know how appealing that is (and whether it really
fits in well with the current update/delete style helpers).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-09  0:24             ` Kumar Kartikeya Dwivedi
  2022-08-09  0:53               ` Kumar Kartikeya Dwivedi
@ 2022-08-09  3:18               ` Alexei Starovoitov
  2022-08-09  4:42                 ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2022-08-09  3:18 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, bpf, Alexei Starovoitov, Martin KaFai Lau,
	Andrii Nakryiko, Daniel Borkmann, toke

On Mon, Aug 8, 2022 at 5:25 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > Thinking again. I guess the following scenario is possible:
> >
> >       rcu_read_lock()
> >          v = bpf_map_lookup_elem(&key);
> >          t1 = v->field;
> >          bpf_map_delete_elem(&key);
> >             /* another bpf program triggering bpf_map_update_elem() */
> >             /* the element with 'key' is reused */
> >             /* the value is updated */
> >          t2 = v->field;
> >          ...
> >       rcu_read_unlock()
> >
> > it is possible t1 and t2 may not be the same.
> > This should be an extremely corner case, not sure how to resolve
> > this easily without performance degradation.
>
> Yes, this is totally possible. This extreme corner case may also
> become a real problem in sleepable programs ;-).
>
> I had an idea in mind on how it can be done completely in the BPF
> program side without changing the map implementation.

+1 it's best to address it inside the program instead
of complicating and likely slowing down maps.

As far as sleepable progs.. the issue is not present
because sleepable progs are only allowed to use preallocated maps.
In non-prealloc maps free_htab_elem() would just call_rcu()
and the element would be freed into the global slab which
will be bad if bpf prog is doing something like above.

Doing call_rcu_tasks_trace() + call_rcu() would allow
non-prealloc in sleepable, but it doesn't scale
and non-prealloc is already many times slower comparing
to prealloc due to atomic_inc/dec and call_rcu.

With new bpf_mem_alloc I'm experimenting with batching
of call_rcu_tasks_trace + call_rcu, so hash map can be
dynamically allocated, just as fast as full prealloc and
finally safe in both sleepable and traditional progs.
I was thinking of adding a new SLAB_TYPESAFE_BY_RCU_TASKS_TRACE
flag to the slab, but decided to go that route only if
batching of call_rcu*() won't be sufficient.
Sorry it takes so long to get the patches ready.
Summer in WA is the best time to take vacation :)
I've been taking plenty.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
  2022-08-09  3:18               ` Alexei Starovoitov
@ 2022-08-09  4:42                 ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-09  4:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, bpf, Alexei Starovoitov, Martin KaFai Lau,
	Andrii Nakryiko, Daniel Borkmann, toke

On Tue, 9 Aug 2022 at 05:18, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 8, 2022 at 5:25 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > Thinking again. I guess the following scenario is possible:
> > >
> > >       rcu_read_lock()
> > >          v = bpf_map_lookup_elem(&key);
> > >          t1 = v->field;
> > >          bpf_map_delete_elem(&key);
> > >             /* another bpf program triggering bpf_map_update_elem() */
> > >             /* the element with 'key' is reused */
> > >             /* the value is updated */
> > >          t2 = v->field;
> > >          ...
> > >       rcu_read_unlock()
> > >
> > > it is possible t1 and t2 may not be the same.
> > > This should be an extremely corner case, not sure how to resolve
> > > this easily without performance degradation.
> >
> > Yes, this is totally possible. This extreme corner case may also
> > become a real problem in sleepable programs ;-).
> >
> > I had an idea in mind on how it can be done completely in the BPF
> > program side without changing the map implementation.
>
> +1 it's best to address it inside the program instead
> of complicating and likely slowing down maps.
>
> As far as sleepable progs.. the issue is not present
> because sleepable progs are only allowed to use preallocated maps.

The problem being discussed in this thread is exactly with
preallocated maps, so I do think the issue is present with sleepable
progs. I.e. it can do a lookup, and go to sleep (implicitly or
explicitly), and then find the element has already been reused (i.e.
the window is wider so it is more likely). It's not a smart thing to
do ofcourse, but my point was that compared to the non-sleepable case
it won't be a corner case anymore that you might not hit very often.

> In non-prealloc maps free_htab_elem() would just call_rcu()
> and the element would be freed into the global slab which
> will be bad if bpf prog is doing something like above.
>
> Doing call_rcu_tasks_trace() + call_rcu() would allow
> non-prealloc in sleepable, but it doesn't scale
> and non-prealloc is already many times slower comparing
> to prealloc due to atomic_inc/dec and call_rcu.

Exactly, which means the element will be readily reused in the
sleepable case since it only uses preallocated maps, which takes us
back to the problem above.

>
> With new bpf_mem_alloc I'm experimenting with batching
> of call_rcu_tasks_trace + call_rcu, so hash map can be
> dynamically allocated, just as fast as full prealloc and
> finally safe in both sleepable and traditional progs.
> I was thinking of adding a new SLAB_TYPESAFE_BY_RCU_TASKS_TRACE
> flag to the slab, but decided to go that route only if
> batching of call_rcu*() won't be sufficient.

Interesting stuff, looking forward to it!

> Sorry it takes so long to get the patches ready.
> Summer in WA is the best time to take vacation :)
> I've been taking plenty.

Well, I think this thread was just me and Yonghong contemplating other
possible fixes to the original bug, so there's no rush, enjoy
yourself! :).

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-08-09  4:43 UTC | newest]

Thread overview: 14+ messages
2022-08-06  1:46 [PATCH bpf v1 0/3] Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
2022-08-06  1:46 ` [PATCH bpf v1 1/3] bpf: Allow calling bpf_prog_test kfuncs in tracing programs Kumar Kartikeya Dwivedi
2022-08-06  1:46 ` [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
2022-08-08  6:09   ` Yonghong Song
2022-08-08 11:18     ` Kumar Kartikeya Dwivedi
2022-08-08 16:19       ` Yonghong Song
2022-08-08 18:55         ` Kumar Kartikeya Dwivedi
2022-08-08 23:23           ` Yonghong Song
2022-08-09  0:24             ` Kumar Kartikeya Dwivedi
2022-08-09  0:53               ` Kumar Kartikeya Dwivedi
2022-08-09  3:18               ` Alexei Starovoitov
2022-08-09  4:42                 ` Kumar Kartikeya Dwivedi
2022-08-06  1:46 ` [PATCH bpf v1 3/3] selftests/bpf: Add test for prealloc_lru_pop bug Kumar Kartikeya Dwivedi
2022-08-08 11:36   ` Kumar Kartikeya Dwivedi
