All of lore.kernel.org
* [PATCH bpf v2 0/2] bpf: Fix bpf timer kmemleak
@ 2023-10-20  1:42 Hou Tao
  2023-10-20  1:42 ` [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned Hou Tao
  2023-10-20  1:42 ` [PATCH bpf v2 2/2] selftests/bpf: Test race between map uref release and bpf timer init Hou Tao
  0 siblings, 2 replies; 8+ messages in thread
From: Hou Tao @ 2023-10-20  1:42 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Hsin-Wei Hung,
	houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset aims to fix the kmemleak problem reported by Hsin-Wei Hung
[0]. Patch #1 fixes the kmemleak by re-checking map->usercnt after
timer->timer is assigned. Patch #2 adds a selftest for the kmemleak
problem. It is a bit hard to reproduce the kmemleak by only running the
test, so I reproduced the problem by running the test while also
injecting a delay before timer->timer is assigned in bpf_timer_init().

Please see individual patches for more details. And comments are always
welcome.

Change Log:
v2:
  * patch #1: use smp_mb() instead of smp_mb__before_atomic()
  * patch #2: use WRITE_ONCE(timer->timer, x) to match the lockless read
              of timer->timer

v1: https://lore.kernel.org/bpf/20231017125717.241101-1-houtao@huaweicloud.com

Hou Tao (2):
  bpf: Check map->usercnt again after timer->timer is assigned
  selftests/bpf: Test race between map uref release and bpf timer init

 kernel/bpf/helpers.c                          |  18 ++-
 .../bpf/prog_tests/timer_init_race.c          | 138 ++++++++++++++++++
 .../selftests/bpf/progs/timer_init_race.c     |  56 +++++++
 3 files changed, 209 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_init_race.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer_init_race.c

-- 
2.29.2


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned
  2023-10-20  1:42 [PATCH bpf v2 0/2] bpf: Fix bpf timer kmemleak Hou Tao
@ 2023-10-20  1:42 ` Hou Tao
  2023-10-20  2:14   ` Alexei Starovoitov
  2023-10-20  1:42 ` [PATCH bpf v2 2/2] selftests/bpf: Test race between map uref release and bpf timer init Hou Tao
  1 sibling, 1 reply; 8+ messages in thread
From: Hou Tao @ 2023-10-20  1:42 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Hsin-Wei Hung,
	houtao1

From: Hou Tao <houtao1@huawei.com>

When there are concurrent uref release and bpf timer init operations,
the following sequence is possible and it will lead to a memory
leak:

bpf program X

bpf_timer_init()
  lock timer->lock
    read timer->timer as NULL
    read map->usercnt != 0

                process Y

                close(map_fd)
                  // put last uref
                  bpf_map_put_uref()
                    atomic_dec_and_test(map->usercnt)
                      array_map_free_timers()
                        bpf_timer_cancel_and_free()
                          // just return and lead to memory leak
                          read timer->timer is NULL

    t = bpf_map_kmalloc_node()
    timer->timer = t
  unlock timer->lock

Fix the problem by checking map->usercnt again after timer->timer is
assigned, so that when there are concurrent uref release and bpf timer
init, either bpf_timer_cancel_and_free() from the uref release reads a
non-NULL timer or the newly-added check of map->usercnt reads a zero
usercnt.

Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
in bpf_timer_cancel_and_free() are not protected by a lock, add a
memory barrier to guarantee the order between map->usercnt and
timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the
lockless read of timer->timer.

Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com
Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/helpers.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 757b99c1e613f..a7d92c3ddc3dd 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1156,7 +1156,7 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
 	   u64, flags)
 {
 	clockid_t clockid = flags & (MAX_CLOCKS - 1);
-	struct bpf_hrtimer *t;
+	struct bpf_hrtimer *t, *to_free = NULL;
 	int ret = 0;
 
 	BUILD_BUG_ON(MAX_CLOCKS != 16);
@@ -1197,9 +1197,21 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
 	rcu_assign_pointer(t->callback_fn, NULL);
 	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
 	t->timer.function = bpf_timer_cb;
-	timer->timer = t;
+	WRITE_ONCE(timer->timer, t);
+	/* Guarantee order between timer->timer and map->usercnt. So when
+	 * there are concurrent uref release and bpf timer init, either
+	 * bpf_timer_cancel_and_free() called by uref release reads a non-NULL
+	 * timer or atomic64_read() below reads a zero usercnt.
+	 */
+	smp_mb();
+	if (!atomic64_read(&map->usercnt)) {
+		WRITE_ONCE(timer->timer, NULL);
+		to_free = t;
+		ret = -EPERM;
+	}
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
+	kfree(to_free);
 	return ret;
 }
 
@@ -1372,7 +1384,7 @@ void bpf_timer_cancel_and_free(void *val)
 	/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
 	 * this timer, since it won't be initialized.
 	 */
-	timer->timer = NULL;
+	WRITE_ONCE(timer->timer, NULL);
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
 	if (!t)
-- 
2.29.2



* [PATCH bpf v2 2/2] selftests/bpf: Test race between map uref release and bpf timer init
  2023-10-20  1:42 [PATCH bpf v2 0/2] bpf: Fix bpf timer kmemleak Hou Tao
  2023-10-20  1:42 ` [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned Hou Tao
@ 2023-10-20  1:42 ` Hou Tao
  2023-10-20  2:50   ` Hsin-Wei Hung
  1 sibling, 1 reply; 8+ messages in thread
From: Hou Tao @ 2023-10-20  1:42 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Hsin-Wei Hung,
	houtao1

From: Hou Tao <houtao1@huawei.com>

Test the race between the release of the map uref and bpf_timer_init():
1) create one thread to repeatedly add an array map containing a
   bpf_timer into an array-of-arrays map.
2) create another thread to repeatedly call getpgid(), which invokes
   bpf_timer_init() in the attached bpf program.
3) synchronize these two threads through a pthread barrier.

It is a bit hard to trigger the kmemleak by only running the test. I
managed to reproduce the kmemleak by injecting a delay between
t->timer.function = bpf_timer_cb and timer->timer = t in
bpf_timer_init().

The following is the output of kmemleak after reproducing:

unreferenced object 0xffff8881163d3780 (size 96):
  comm "test_progs", pid 539, jiffies 4295358164 (age 23.276s)
  hex dump (first 32 bytes):
    80 37 3d 16 81 88 ff ff 00 00 00 00 00 00 00 00  .7=.............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bbc3f059>] __kmem_cache_alloc_node+0x3b1/0x4a0
    [<00000000a24ddf4d>] __kmalloc_node+0x57/0x140
    [<000000004d577dbf>] bpf_map_kmalloc_node+0x5f/0x180
    [<00000000bd8428d3>] bpf_timer_init+0xf6/0x1b0
    [<0000000086d87323>] 0xffffffffc000c94e
    [<000000005a09e655>] trace_call_bpf+0xc5/0x1c0
    [<0000000051ab837b>] kprobe_perf_func+0x51/0x260
    [<000000000069bbd1>] kprobe_dispatcher+0x61/0x70
    [<000000007dceb75b>] kprobe_ftrace_handler+0x168/0x240
    [<00000000d8721bd7>] 0xffffffffc02010f7
    [<00000000e885b809>] __x64_sys_getpgid+0x1/0x20
    [<000000007be835d8>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../bpf/prog_tests/timer_init_race.c          | 138 ++++++++++++++++++
 .../selftests/bpf/progs/timer_init_race.c     |  56 +++++++
 2 files changed, 194 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_init_race.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer_init_race.c

diff --git a/tools/testing/selftests/bpf/prog_tests/timer_init_race.c b/tools/testing/selftests/bpf/prog_tests/timer_init_race.c
new file mode 100644
index 0000000000000..7bd57459e5048
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer_init_race.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include "timer_init_race.skel.h"
+
+struct thread_ctx {
+	struct bpf_map_create_opts opts;
+	pthread_barrier_t barrier;
+	int outer_map_fd;
+	int start, abort;
+	int loop, err;
+};
+
+static int wait_for_start_or_abort(struct thread_ctx *ctx)
+{
+	while (!ctx->start && !ctx->abort)
+		usleep(1);
+	return ctx->abort ? -1 : 0;
+}
+
+static void *close_map_fn(void *data)
+{
+	struct thread_ctx *ctx = data;
+	int loop = ctx->loop, err = 0;
+
+	if (wait_for_start_or_abort(ctx) < 0)
+		return NULL;
+
+	while (loop-- > 0) {
+		int fd, zero = 0, i;
+		volatile int s = 0;
+
+		fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, 4, sizeof(struct bpf_timer),
+				    1, &ctx->opts);
+		if (fd < 0) {
+			err |= 1;
+			pthread_barrier_wait(&ctx->barrier);
+			continue;
+		}
+
+		if (bpf_map_update_elem(ctx->outer_map_fd, &zero, &fd, 0) < 0)
+			err |= 2;
+
+		pthread_barrier_wait(&ctx->barrier);
+		/* let bpf_timer_init run first */
+		for (i = 0; i < 5000; i++)
+			s++;
+		close(fd);
+	}
+
+	ctx->err = err;
+
+	return NULL;
+}
+
+static void *init_timer_fn(void *data)
+{
+	struct thread_ctx *ctx = data;
+	int loop = ctx->loop;
+
+	if (wait_for_start_or_abort(ctx) < 0)
+		return NULL;
+
+	while (loop-- > 0) {
+		pthread_barrier_wait(&ctx->barrier);
+		syscall(SYS_getpgid);
+	}
+
+	return NULL;
+}
+
+void test_timer_init_race(void)
+{
+	struct timer_init_race *skel;
+	struct thread_ctx ctx;
+	pthread_t tid[2];
+	struct btf *btf;
+	int err;
+
+	skel = timer_init_race__open();
+	if (!ASSERT_OK_PTR(skel, "timer_init_race open"))
+		return;
+
+	err = timer_init_race__load(skel);
+	if (!ASSERT_EQ(err, 0, "timer_init_race load"))
+		goto out;
+
+	memset(&ctx, 0, sizeof(ctx));
+
+	btf = bpf_object__btf(skel->obj);
+	if (!ASSERT_OK_PTR(btf, "timer_init_race btf"))
+		goto out;
+
+	LIBBPF_OPTS_RESET(ctx.opts);
+	ctx.opts.btf_fd = bpf_object__btf_fd(skel->obj);
+	if (!ASSERT_GE((int)ctx.opts.btf_fd, 0, "btf_fd"))
+		goto out;
+	ctx.opts.btf_key_type_id = btf__find_by_name(btf, "int");
+	if (!ASSERT_GT(ctx.opts.btf_key_type_id, 0, "key_type_id"))
+		goto out;
+	ctx.opts.btf_value_type_id = btf__find_by_name_kind(btf, "inner_value", BTF_KIND_STRUCT);
+	if (!ASSERT_GT(ctx.opts.btf_value_type_id, 0, "value_type_id"))
+		goto out;
+
+	err = timer_init_race__attach(skel);
+	if (!ASSERT_EQ(err, 0, "timer_init_race attach"))
+		goto out;
+
+	skel->bss->tgid = getpid();
+
+	pthread_barrier_init(&ctx.barrier, NULL, 2);
+	ctx.outer_map_fd = bpf_map__fd(skel->maps.outer_map);
+	ctx.loop = 8;
+
+	err = pthread_create(&tid[0], NULL, close_map_fn, &ctx);
+	if (!ASSERT_OK(err, "close_thread"))
+		goto out;
+
+	err = pthread_create(&tid[1], NULL, init_timer_fn, &ctx);
+	if (!ASSERT_OK(err, "init_thread")) {
+		ctx.abort = 1;
+		pthread_join(tid[0], NULL);
+		goto out;
+	}
+
+	ctx.start = 1;
+	pthread_join(tid[0], NULL);
+	pthread_join(tid[1], NULL);
+
+	ASSERT_EQ(ctx.err, 0, "error");
+	ASSERT_EQ(skel->bss->cnt, 8, "cnt");
+out:
+	timer_init_race__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer_init_race.c b/tools/testing/selftests/bpf/progs/timer_init_race.c
new file mode 100644
index 0000000000000..ba67cb1786399
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer_init_race.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
+#include <linux/bpf.h>
+#include <time.h>
+#include <bpf/bpf_helpers.h>
+
+#include "bpf_misc.h"
+
+struct inner_value {
+	struct bpf_timer timer;
+};
+
+struct inner_map_type {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, int);
+	__type(value, struct inner_value);
+	__uint(max_entries, 1);
+} inner_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__type(key, int);
+	__type(value, int);
+	__uint(max_entries, 1);
+	__array(values, struct inner_map_type);
+} outer_map SEC(".maps") = {
+	.values = {
+		[0] = &inner_map,
+	},
+};
+
+char _license[] SEC("license") = "GPL";
+
+int tgid = 0, cnt = 0;
+
+SEC("kprobe/" SYS_PREFIX "sys_getpgid")
+int do_timer_init(void *ctx)
+{
+	struct inner_map_type *map;
+	struct inner_value *value;
+	int zero = 0;
+
+	if ((bpf_get_current_pid_tgid() >> 32) != tgid)
+		return 0;
+
+	map = bpf_map_lookup_elem(&outer_map, &zero);
+	if (!map)
+		return 0;
+	value = bpf_map_lookup_elem(map, &zero);
+	if (!value)
+		return 0;
+	bpf_timer_init(&value->timer, map, CLOCK_MONOTONIC);
+	cnt++;
+
+	return 0;
+}
-- 
2.29.2



* Re: [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned
  2023-10-20  1:42 ` [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned Hou Tao
@ 2023-10-20  2:14   ` Alexei Starovoitov
  2023-10-20  7:31     ` Hou Tao
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2023-10-20  2:14 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hsin-Wei Hung, Hou Tao

On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> When there are concurrent uref release and bpf timer init operations,
> the following sequence diagram is possible and it will lead to memory
> leak:
>
> bpf program X
>
> bpf_timer_init()
>   lock timer->lock
>     read timer->timer as NULL
>     read map->usercnt != 0
>
>                 process Y
>
>                 close(map_fd)
>                   // put last uref
>                   bpf_map_put_uref()
>                     atomic_dec_and_test(map->usercnt)
>                       array_map_free_timers()
>                         bpf_timer_cancel_and_free()
>                           // just return and lead to memory leak
>                           read timer->timer is NULL
>
>     t = bpf_map_kmalloc_node()
>     timer->timer = t
>   unlock timer->lock
>
> Fix the problem by checking map->usercnt again after timer->timer is
> assigned, so that when there are concurrent uref release and bpf timer
> init, either bpf_timer_cancel_and_free() from the uref release reads a
> non-NULL timer or the newly-added check of map->usercnt reads a zero
> usercnt.
>
> Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
> in bpf_timer_cancel_and_free() are not protected by a lock, add a
> memory barrier to guarantee the order between map->usercnt and
> timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the
> lockless read of timer->timer.
>
> Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
> Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com
> Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  kernel/bpf/helpers.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 757b99c1e613f..a7d92c3ddc3dd 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1156,7 +1156,7 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
>            u64, flags)
>  {
>         clockid_t clockid = flags & (MAX_CLOCKS - 1);
> -       struct bpf_hrtimer *t;
> +       struct bpf_hrtimer *t, *to_free = NULL;
>         int ret = 0;
>
>         BUILD_BUG_ON(MAX_CLOCKS != 16);
> @@ -1197,9 +1197,21 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
>         rcu_assign_pointer(t->callback_fn, NULL);
>         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
>         t->timer.function = bpf_timer_cb;
> -       timer->timer = t;
> +       WRITE_ONCE(timer->timer, t);
> +       /* Guarantee order between timer->timer and map->usercnt. So when
> +        * there are concurrent uref release and bpf timer init, either
> +        * bpf_timer_cancel_and_free() called by uref release reads a non-NULL
> +        * timer or atomic64_read() below reads a zero usercnt.
> +        */
> +       smp_mb();
> +       if (!atomic64_read(&map->usercnt)) {
> +               WRITE_ONCE(timer->timer, NULL);
> +               to_free = t;

just kfree(t); here.

> +               ret = -EPERM;
> +       }

This will add a second atomic64_read(&map->usercnt) in the same function.
Let's remove the first one?

>  out:
>         __bpf_spin_unlock_irqrestore(&timer->lock);
> +       kfree(to_free);
>         return ret;
>  }
>
> @@ -1372,7 +1384,7 @@ void bpf_timer_cancel_and_free(void *val)
>         /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>          * this timer, since it won't be initialized.
>          */
> -       timer->timer = NULL;
> +       WRITE_ONCE(timer->timer, NULL);
>  out:
>         __bpf_spin_unlock_irqrestore(&timer->lock);
>         if (!t)
> --
> 2.29.2
>


* Re: [PATCH bpf v2 2/2] selftests/bpf: Test race between map uref release and bpf timer init
  2023-10-20  1:42 ` [PATCH bpf v2 2/2] selftests/bpf: Test race between map uref release and bpf timer init Hou Tao
@ 2023-10-20  2:50   ` Hsin-Wei Hung
  2023-10-20 13:57     ` Hou Tao
  0 siblings, 1 reply; 8+ messages in thread
From: Hsin-Wei Hung @ 2023-10-20  2:50 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Test race between the release of map ref and bpf_timer_init():
> 1) create one thread to add array map with bpf_timer into array of
>    arrays map repeatedly.
> 2) create another thread to call getpgid() and call bpf_timer_init()
>    in the attached bpf program repeatedly.
> 3) synchronize these two threads through pthread barrier.
>
> It is a bit hard to trigger the kmemleak by only running the test. I
> managed to reproduce the kmemleak by injecting a delay between
> t->timer.function = bpf_timer_cb and timer->timer = t in
> bpf_timer_init().

I figured out that, to trigger this issue reliably, I can insert
different delays using large bpf_loop() calls in the allocation and
release paths. I have some extra code to filter out unwanted events.
The userspace program is similar: it just needs to try to call
close(fd) and syscall(SYS_getpgid) at the same time without delay. It
is not a stable test though, due to the reference to the internal
kernel function.

SEC("tp/syscalls/sys_enter_close")
{
        ...
        bpf_loop(1000000, &delay_loop, NULL, 0);
}

SEC("fexit/bpf_map_kmalloc_node")
{
        ...
        bpf_loop(2000000, &delay_loop, NULL, 0);
}

I can confirm that the v1 patch fixes the memleak in v5.15. However,
this issue doesn't seem to affect net-next. At least since db559117828d
("bpf: Consolidate spin_lock, timer management into btf_record"), the
leaked memory caused by the race would be freed in array_map_free().

>
> The following is the output of kmemleak after reproducing:
>
> unreferenced object 0xffff8881163d3780 (size 96):
>   comm "test_progs", pid 539, jiffies 4295358164 (age 23.276s)
>   hex dump (first 32 bytes):
>     80 37 3d 16 81 88 ff ff 00 00 00 00 00 00 00 00  .7=.............
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bbc3f059>] __kmem_cache_alloc_node+0x3b1/0x4a0
>     [<00000000a24ddf4d>] __kmalloc_node+0x57/0x140
>     [<000000004d577dbf>] bpf_map_kmalloc_node+0x5f/0x180
>     [<00000000bd8428d3>] bpf_timer_init+0xf6/0x1b0
>     [<0000000086d87323>] 0xffffffffc000c94e
>     [<000000005a09e655>] trace_call_bpf+0xc5/0x1c0
>     [<0000000051ab837b>] kprobe_perf_func+0x51/0x260
>     [<000000000069bbd1>] kprobe_dispatcher+0x61/0x70
>     [<000000007dceb75b>] kprobe_ftrace_handler+0x168/0x240
>     [<00000000d8721bd7>] 0xffffffffc02010f7
>     [<00000000e885b809>] __x64_sys_getpgid+0x1/0x20
>     [<000000007be835d8>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  .../bpf/prog_tests/timer_init_race.c          | 138 ++++++++++++++++++
>  .../selftests/bpf/progs/timer_init_race.c     |  56 +++++++
>  2 files changed, 194 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_init_race.c
>  create mode 100644 tools/testing/selftests/bpf/progs/timer_init_race.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/timer_init_race.c b/tools/testing/selftests/bpf/prog_tests/timer_init_race.c
> new file mode 100644
> index 0000000000000..7bd57459e5048
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/timer_init_race.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
> +#define _GNU_SOURCE
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <test_progs.h>
> +#include <bpf/btf.h>
> +#include "timer_init_race.skel.h"
> +
> +struct thread_ctx {
> +       struct bpf_map_create_opts opts;
> +       pthread_barrier_t barrier;
> +       int outer_map_fd;
> +       int start, abort;
> +       int loop, err;
> +};
> +
> +static int wait_for_start_or_abort(struct thread_ctx *ctx)
> +{
> +       while (!ctx->start && !ctx->abort)
> +               usleep(1);
> +       return ctx->abort ? -1 : 0;
> +}
> +
> +static void *close_map_fn(void *data)
> +{
> +       struct thread_ctx *ctx = data;
> +       int loop = ctx->loop, err = 0;
> +
> +       if (wait_for_start_or_abort(ctx) < 0)
> +               return NULL;
> +
> +       while (loop-- > 0) {
> +               int fd, zero = 0, i;
> +               volatile int s = 0;
> +
> +               fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, 4, sizeof(struct bpf_timer),
> +                                   1, &ctx->opts);
> +               if (fd < 0) {
> +                       err |= 1;
> +                       pthread_barrier_wait(&ctx->barrier);
> +                       continue;
> +               }
> +
> +               if (bpf_map_update_elem(ctx->outer_map_fd, &zero, &fd, 0) < 0)
> +                       err |= 2;
> +
> +               pthread_barrier_wait(&ctx->barrier);
> +               /* let bpf_timer_init run first */
> +               for (i = 0; i < 5000; i++)
> +                       s++;
> +               close(fd);
> +       }
> +
> +       ctx->err = err;
> +
> +       return NULL;
> +}
> +
> +static void *init_timer_fn(void *data)
> +{
> +       struct thread_ctx *ctx = data;
> +       int loop = ctx->loop;
> +
> +       if (wait_for_start_or_abort(ctx) < 0)
> +               return NULL;
> +
> +       while (loop-- > 0) {
> +               pthread_barrier_wait(&ctx->barrier);
> +               syscall(SYS_getpgid);
> +       }
> +
> +       return NULL;
> +}
> +
> +void test_timer_init_race(void)
> +{
> +       struct timer_init_race *skel;
> +       struct thread_ctx ctx;
> +       pthread_t tid[2];
> +       struct btf *btf;
> +       int err;
> +
> +       skel = timer_init_race__open();
> +       if (!ASSERT_OK_PTR(skel, "timer_init_race open"))
> +               return;
> +
> +       err = timer_init_race__load(skel);
> +       if (!ASSERT_EQ(err, 0, "timer_init_race load"))
> +               goto out;
> +
> +       memset(&ctx, 0, sizeof(ctx));
> +
> +       btf = bpf_object__btf(skel->obj);
> +       if (!ASSERT_OK_PTR(btf, "timer_init_race btf"))
> +               goto out;
> +
> +       LIBBPF_OPTS_RESET(ctx.opts);
> +       ctx.opts.btf_fd = bpf_object__btf_fd(skel->obj);
> +       if (!ASSERT_GE((int)ctx.opts.btf_fd, 0, "btf_fd"))
> +               goto out;
> +       ctx.opts.btf_key_type_id = btf__find_by_name(btf, "int");
> +       if (!ASSERT_GT(ctx.opts.btf_key_type_id, 0, "key_type_id"))
> +               goto out;
> +       ctx.opts.btf_value_type_id = btf__find_by_name_kind(btf, "inner_value", BTF_KIND_STRUCT);
> +       if (!ASSERT_GT(ctx.opts.btf_value_type_id, 0, "value_type_id"))
> +               goto out;
> +
> +       err = timer_init_race__attach(skel);
> +       if (!ASSERT_EQ(err, 0, "timer_init_race attach"))
> +               goto out;
> +
> +       skel->bss->tgid = getpid();
> +
> +       pthread_barrier_init(&ctx.barrier, NULL, 2);
> +       ctx.outer_map_fd = bpf_map__fd(skel->maps.outer_map);
> +       ctx.loop = 8;
> +
> +       err = pthread_create(&tid[0], NULL, close_map_fn, &ctx);
> +       if (!ASSERT_OK(err, "close_thread"))
> +               goto out;
> +
> +       err = pthread_create(&tid[1], NULL, init_timer_fn, &ctx);
> +       if (!ASSERT_OK(err, "init_thread")) {
> +               ctx.abort = 1;
> +               pthread_join(tid[0], NULL);
> +               goto out;
> +       }
> +
> +       ctx.start = 1;
> +       pthread_join(tid[0], NULL);
> +       pthread_join(tid[1], NULL);
> +
> +       ASSERT_EQ(ctx.err, 0, "error");
> +       ASSERT_EQ(skel->bss->cnt, 8, "cnt");
> +out:
> +       timer_init_race__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/timer_init_race.c b/tools/testing/selftests/bpf/progs/timer_init_race.c
> new file mode 100644
> index 0000000000000..ba67cb1786399
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/timer_init_race.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
> +#include <linux/bpf.h>
> +#include <time.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#include "bpf_misc.h"
> +
> +struct inner_value {
> +       struct bpf_timer timer;
> +};
> +
> +struct inner_map_type {
> +       __uint(type, BPF_MAP_TYPE_ARRAY);
> +       __type(key, int);
> +       __type(value, struct inner_value);
> +       __uint(max_entries, 1);
> +} inner_map SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> +       __type(key, int);
> +       __type(value, int);
> +       __uint(max_entries, 1);
> +       __array(values, struct inner_map_type);
> +} outer_map SEC(".maps") = {
> +       .values = {
> +               [0] = &inner_map,
> +       },
> +};
> +
> +char _license[] SEC("license") = "GPL";
> +
> +int tgid = 0, cnt = 0;
> +
> +SEC("kprobe/" SYS_PREFIX "sys_getpgid")
> +int do_timer_init(void *ctx)
> +{
> +       struct inner_map_type *map;
> +       struct inner_value *value;
> +       int zero = 0;
> +
> +       if ((bpf_get_current_pid_tgid() >> 32) != tgid)
> +               return 0;
> +
> +       map = bpf_map_lookup_elem(&outer_map, &zero);
> +       if (!map)
> +               return 0;
> +       value = bpf_map_lookup_elem(map, &zero);
> +       if (!value)
> +               return 0;
> +       bpf_timer_init(&value->timer, map, CLOCK_MONOTONIC);
> +       cnt++;
> +
> +       return 0;
> +}
> --
> 2.29.2
>


* Re: [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned
  2023-10-20  2:14   ` Alexei Starovoitov
@ 2023-10-20  7:31     ` Hou Tao
  2023-10-20 14:57       ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Hou Tao @ 2023-10-20  7:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hsin-Wei Hung, Hou Tao

Hi,

On 10/20/2023 10:14 AM, Alexei Starovoitov wrote:
> On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When there are concurrent uref release and bpf timer init operations,
>> the following sequence diagram is possible and it will lead to memory
>> leak:
>>
>> bpf program X
>>
>> bpf_timer_init()
>>   lock timer->lock
>>     read timer->timer as NULL
>>     read map->usercnt != 0
>>
>>                 process Y
>>
>>                 close(map_fd)
>>                   // put last uref
>>                   bpf_map_put_uref()
>>                     atomic_dec_and_test(map->usercnt)
>>                       array_map_free_timers()
>>                         bpf_timer_cancel_and_free()
>>                           // just return and lead to memory leak
>>                           read timer->timer is NULL
>>
>>     t = bpf_map_kmalloc_node()
>>     timer->timer = t
>>   unlock timer->lock
>>
>> Fix the problem by checking map->usercnt again after timer->timer is
>> assigned, so that when there are concurrent uref release and bpf timer
>> init, either bpf_timer_cancel_and_free() from the uref release reads a
>> non-NULL timer or the newly-added check of map->usercnt reads a zero
>> usercnt.
>>
>> Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
>> in bpf_timer_cancel_and_free() are not protected by a lock, add a
>> memory barrier to guarantee the order between map->usercnt and
>> timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the
>> lockless read of timer->timer.
>>
>> Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
>> Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com
>> Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  kernel/bpf/helpers.c | 18 +++++++++++++++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 757b99c1e613f..a7d92c3ddc3dd 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -1156,7 +1156,7 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
>>            u64, flags)
>>  {
>>         clockid_t clockid = flags & (MAX_CLOCKS - 1);
>> -       struct bpf_hrtimer *t;
>> +       struct bpf_hrtimer *t, *to_free = NULL;
>>         int ret = 0;
>>
>>         BUILD_BUG_ON(MAX_CLOCKS != 16);
>> @@ -1197,9 +1197,21 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
>>         rcu_assign_pointer(t->callback_fn, NULL);
>>         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
>>         t->timer.function = bpf_timer_cb;
>> -       timer->timer = t;
>> +       WRITE_ONCE(timer->timer, t);
>> +       /* Guarantee order between timer->timer and map->usercnt. So when
>> +        * there are concurrent uref release and bpf timer init, either
>> +        * bpf_timer_cancel_and_free() called by uref release reads a non-NULL
>> +        * timer or atomic64_read() below reads a zero usercnt.
>> +        */
>> +       smp_mb();
>> +       if (!atomic64_read(&map->usercnt)) {
>> +               WRITE_ONCE(timer->timer, NULL);
>> +               to_free = t;
> just kfree(t); here.

Will do. It is a slow path, so I think doing kfree() under spin-lock is
acceptable.
>
>> +               ret = -EPERM;
>> +       }
> This will add a second atomic64_read(&map->usercnt) in the same function.
> Let's remove the first one ?

I prefer to still keep it, because it detects the release of the map uref
early, and handling the uref release there is simpler than in the path
after the second atomic64_read(). Do you have a strong preference ?
>
>>  out:
>>         __bpf_spin_unlock_irqrestore(&timer->lock);
>> +       kfree(to_free);
>>         return ret;
>>  }
>>
>> @@ -1372,7 +1384,7 @@ void bpf_timer_cancel_and_free(void *val)
>>         /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>>          * this timer, since it won't be initialized.
>>          */
>> -       timer->timer = NULL;
>> +       WRITE_ONCE(timer->timer, NULL);
>>  out:
>>         __bpf_spin_unlock_irqrestore(&timer->lock);
>>         if (!t)
>> --
>> 2.29.2
>>
> .


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH bpf v2 2/2] selftests/bpf: Test race between map uref release and bpf timer init
  2023-10-20  2:50   ` Hsin-Wei Hung
@ 2023-10-20 13:57     ` Hou Tao
  0 siblings, 0 replies; 8+ messages in thread
From: Hou Tao @ 2023-10-20 13:57 UTC (permalink / raw)
  To: Hsin-Wei Hung, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

Hi,

On 10/20/2023 10:50 AM, Hsin-Wei Hung wrote:
> On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Test race between the release of map ref and bpf_timer_init():
>> 1) create one thread to add array map with bpf_timer into array of
>>    arrays map repeatedly.
>> 2) create another thread to call getpgid() and call bpf_timer_init()
>>    in the attached bpf program repeatedly.
>> 3) synchronize these two threads through a pthread barrier.
>>
>> It is a bit hard to trigger the kmemleak by only running the test. I
>> managed to reproduce the kmemleak by injecting a delay between
>> t->timer.function = bpf_timer_cb and timer->timer = t in
>> bpf_timer_init().
> I figured out that to trigger this issue reliably, I can insert
> different delays using large bpf_loop() in allocation and release
> paths. I have some extra code to filter out unwanted events. The
> userspace program is similar. It just needs to try to call close(fd)
> and syscall(SYS_getpgid) at the same time without delay. It is not a
> stable test, though, due to the reference to the function.
>
> SEC("tp/syscalls/sys_enter_close")
> {
>         ...
>         bpf_loop(1000000, &delay_loop, NULL, 0);
> }
>
> SEC("fexit/bpf_map_kmalloc_node")
> {
>         ...
>         bpf_loop(2000000, &delay_loop, NULL, 0);
> }

Thanks for sharing another way to reproduce the problem.
>
> I can confirm that the v1 patch fixes memleak in v5.15. However, this
> issue doesn't seem to affect net-next. At least since db559117828d
> (bpf: Consolidate spin_lock, timer management into btf_record), the
> leaked memory caused by the race would be freed in array_map_free().

I think you are partially right: array_map indeed doesn't have this
problem, but array-in-array maps still do, and I can reproduce it in the
bpf tree (see the kmemleak report below). After reading the related code
carefully, I think the proposed fix in the patch is not right: the root
cause is that the release of a map-in-map is not correct (e.g., it
doesn't wait for an RCU grace period), and the patch only addresses the
symptom. Will update the patchset to fix the problem properly.

Regards,
Hou
>
>> The following is the output of kmemleak after reproducing:
>>
>> unreferenced object 0xffff8881163d3780 (size 96):
>>   comm "test_progs", pid 539, jiffies 4295358164 (age 23.276s)
>>   hex dump (first 32 bytes):
>>     80 37 3d 16 81 88 ff ff 00 00 00 00 00 00 00 00  .7=.............
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>   backtrace:
>>     [<00000000bbc3f059>] __kmem_cache_alloc_node+0x3b1/0x4a0
>>     [<00000000a24ddf4d>] __kmalloc_node+0x57/0x140
>>     [<000000004d577dbf>] bpf_map_kmalloc_node+0x5f/0x180
>>     [<00000000bd8428d3>] bpf_timer_init+0xf6/0x1b0
>>     [<0000000086d87323>] 0xffffffffc000c94e
>>     [<000000005a09e655>] trace_call_bpf+0xc5/0x1c0
>>     [<0000000051ab837b>] kprobe_perf_func+0x51/0x260
>>     [<000000000069bbd1>] kprobe_dispatcher+0x61/0x70
>>     [<000000007dceb75b>] kprobe_ftrace_handler+0x168/0x240
>>     [<00000000d8721bd7>] 0xffffffffc02010f7
>>     [<00000000e885b809>] __x64_sys_getpgid+0x1/0x20
>>     [<000000007be835d8>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  .../bpf/prog_tests/timer_init_race.c          | 138 ++++++++++++++++++
>>  .../selftests/bpf/progs/timer_init_race.c     |  56 +++++++
>>  2 files changed, 194 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_init_race.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/timer_init_race.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/timer_init_race.c b/tools/testing/selftests/bpf/prog_tests/timer_init_race.c
>> new file mode 100644
>> index 0000000000000..7bd57459e5048
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/timer_init_race.c
>> @@ -0,0 +1,138 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
>> +#define _GNU_SOURCE
>> +#include <unistd.h>
>> +#include <sys/syscall.h>
>> +#include <test_progs.h>
>> +#include <bpf/btf.h>
>> +#include "timer_init_race.skel.h"
>> +
>> +struct thread_ctx {
>> +       struct bpf_map_create_opts opts;
>> +       pthread_barrier_t barrier;
>> +       int outer_map_fd;
>> +       int start, abort;
>> +       int loop, err;
>> +};
>> +
>> +static int wait_for_start_or_abort(struct thread_ctx *ctx)
>> +{
>> +       while (!ctx->start && !ctx->abort)
>> +               usleep(1);
>> +       return ctx->abort ? -1 : 0;
>> +}
>> +
>> +static void *close_map_fn(void *data)
>> +{
>> +       struct thread_ctx *ctx = data;
>> +       int loop = ctx->loop, err = 0;
>> +
>> +       if (wait_for_start_or_abort(ctx) < 0)
>> +               return NULL;
>> +
>> +       while (loop-- > 0) {
>> +               int fd, zero = 0, i;
>> +               volatile int s = 0;
>> +
>> +               fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, 4, sizeof(struct bpf_timer),
>> +                                   1, &ctx->opts);
>> +               if (fd < 0) {
>> +                       err |= 1;
>> +                       pthread_barrier_wait(&ctx->barrier);
>> +                       continue;
>> +               }
>> +
>> +               if (bpf_map_update_elem(ctx->outer_map_fd, &zero, &fd, 0) < 0)
>> +                       err |= 2;
>> +
>> +               pthread_barrier_wait(&ctx->barrier);
>> +               /* let bpf_timer_init run first */
>> +               for (i = 0; i < 5000; i++)
>> +                       s++;
>> +               close(fd);
>> +       }
>> +
>> +       ctx->err = err;
>> +
>> +       return NULL;
>> +}
>> +
>> +static void *init_timer_fn(void *data)
>> +{
>> +       struct thread_ctx *ctx = data;
>> +       int loop = ctx->loop;
>> +
>> +       if (wait_for_start_or_abort(ctx) < 0)
>> +               return NULL;
>> +
>> +       while (loop-- > 0) {
>> +               pthread_barrier_wait(&ctx->barrier);
>> +               syscall(SYS_getpgid);
>> +       }
>> +
>> +       return NULL;
>> +}
>> +
>> +void test_timer_init_race(void)
>> +{
>> +       struct timer_init_race *skel;
>> +       struct thread_ctx ctx;
>> +       pthread_t tid[2];
>> +       struct btf *btf;
>> +       int err;
>> +
>> +       skel = timer_init_race__open();
>> +       if (!ASSERT_OK_PTR(skel, "timer_init_race open"))
>> +               return;
>> +
>> +       err = timer_init_race__load(skel);
>> +       if (!ASSERT_EQ(err, 0, "timer_init_race load"))
>> +               goto out;
>> +
>> +       memset(&ctx, 0, sizeof(ctx));
>> +
>> +       btf = bpf_object__btf(skel->obj);
>> +       if (!ASSERT_OK_PTR(btf, "timer_init_race btf"))
>> +               goto out;
>> +
>> +       LIBBPF_OPTS_RESET(ctx.opts);
>> +       ctx.opts.btf_fd = bpf_object__btf_fd(skel->obj);
>> +       if (!ASSERT_GE((int)ctx.opts.btf_fd, 0, "btf_fd"))
>> +               goto out;
>> +       ctx.opts.btf_key_type_id = btf__find_by_name(btf, "int");
>> +       if (!ASSERT_GT(ctx.opts.btf_key_type_id, 0, "key_type_id"))
>> +               goto out;
>> +       ctx.opts.btf_value_type_id = btf__find_by_name_kind(btf, "inner_value", BTF_KIND_STRUCT);
>> +       if (!ASSERT_GT(ctx.opts.btf_value_type_id, 0, "value_type_id"))
>> +               goto out;
>> +
>> +       err = timer_init_race__attach(skel);
>> +       if (!ASSERT_EQ(err, 0, "timer_init_race attach"))
>> +               goto out;
>> +
>> +       skel->bss->tgid = getpid();
>> +
>> +       pthread_barrier_init(&ctx.barrier, NULL, 2);
>> +       ctx.outer_map_fd = bpf_map__fd(skel->maps.outer_map);
>> +       ctx.loop = 8;
>> +
>> +       err = pthread_create(&tid[0], NULL, close_map_fn, &ctx);
>> +       if (!ASSERT_OK(err, "close_thread"))
>> +               goto out;
>> +
>> +       err = pthread_create(&tid[1], NULL, init_timer_fn, &ctx);
>> +       if (!ASSERT_OK(err, "init_thread")) {
>> +               ctx.abort = 1;
>> +               pthread_join(tid[0], NULL);
>> +               goto out;
>> +       }
>> +
>> +       ctx.start = 1;
>> +       pthread_join(tid[0], NULL);
>> +       pthread_join(tid[1], NULL);
>> +
>> +       ASSERT_EQ(ctx.err, 0, "error");
>> +       ASSERT_EQ(skel->bss->cnt, 8, "cnt");
>> +out:
>> +       timer_init_race__destroy(skel);
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/timer_init_race.c b/tools/testing/selftests/bpf/progs/timer_init_race.c
>> new file mode 100644
>> index 0000000000000..ba67cb1786399
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/timer_init_race.c
>> @@ -0,0 +1,56 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
>> +#include <linux/bpf.h>
>> +#include <time.h>
>> +#include <bpf/bpf_helpers.h>
>> +
>> +#include "bpf_misc.h"
>> +
>> +struct inner_value {
>> +       struct bpf_timer timer;
>> +};
>> +
>> +struct inner_map_type {
>> +       __uint(type, BPF_MAP_TYPE_ARRAY);
>> +       __type(key, int);
>> +       __type(value, struct inner_value);
>> +       __uint(max_entries, 1);
>> +} inner_map SEC(".maps");
>> +
>> +struct {
>> +       __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
>> +       __type(key, int);
>> +       __type(value, int);
>> +       __uint(max_entries, 1);
>> +       __array(values, struct inner_map_type);
>> +} outer_map SEC(".maps") = {
>> +       .values = {
>> +               [0] = &inner_map,
>> +       },
>> +};
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +int tgid = 0, cnt = 0;
>> +
>> +SEC("kprobe/" SYS_PREFIX "sys_getpgid")
>> +int do_timer_init(void *ctx)
>> +{
>> +       struct inner_map_type *map;
>> +       struct inner_value *value;
>> +       int zero = 0;
>> +
>> +       if ((bpf_get_current_pid_tgid() >> 32) != tgid)
>> +               return 0;
>> +
>> +       map = bpf_map_lookup_elem(&outer_map, &zero);
>> +       if (!map)
>> +               return 0;
>> +       value = bpf_map_lookup_elem(map, &zero);
>> +       if (!value)
>> +               return 0;
>> +       bpf_timer_init(&value->timer, map, CLOCK_MONOTONIC);
>> +       cnt++;
>> +
>> +       return 0;
>> +}
>> --
>> 2.29.2
>>
> .



* Re: [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned
  2023-10-20  7:31     ` Hou Tao
@ 2023-10-20 14:57       ` Alexei Starovoitov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2023-10-20 14:57 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Hsin-Wei Hung, Hou Tao

On Fri, Oct 20, 2023 at 12:31 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 10/20/2023 10:14 AM, Alexei Starovoitov wrote:
> > On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> When there are concurrent uref release and bpf timer init operations,
> >> the following sequence is possible and will lead to a memory
> >> leak:
> >>
> >> bpf program X
> >>
> >> bpf_timer_init()
> >>   lock timer->lock
> >>     read timer->timer as NULL
> >>     read map->usercnt != 0
> >>
> >>                 process Y
> >>
> >>                 close(map_fd)
> >>                   // put last uref
> >>                   bpf_map_put_uref()
> >>                     atomic_dec_and_test(map->usercnt)
> >>                       array_map_free_timers()
> >>                         bpf_timer_cancel_and_free()
> >>                           // just return and lead to memory leak
> >>                           read timer->timer is NULL
> >>
> >>     t = bpf_map_kmalloc_node()
> >>     timer->timer = t
> >>   unlock timer->lock
> >>
> >> Fix the problem by checking map->usercnt again after timer->timer is
> >> assigned, so when there are concurrent uref release and bpf timer init,
> >> either bpf_timer_cancel_and_free() from uref release reads a non-NULL
> >> timer or the newly-added check of map->usercnt reads a zero usercnt.
> >>
> >> Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
> >> in bpf_timer_cancel_and_free() are not protected by a lock, add
> >> a memory barrier to guarantee the order between map->usercnt and
> >> timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the lockless
> >> read of timer->timer.
> >>
> >> Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
> >> Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com
> >> Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> >>  kernel/bpf/helpers.c | 18 +++++++++++++++---
> >>  1 file changed, 15 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index 757b99c1e613f..a7d92c3ddc3dd 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> >> @@ -1156,7 +1156,7 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
> >>            u64, flags)
> >>  {
> >>         clockid_t clockid = flags & (MAX_CLOCKS - 1);
> >> -       struct bpf_hrtimer *t;
> >> +       struct bpf_hrtimer *t, *to_free = NULL;
> >>         int ret = 0;
> >>
> >>         BUILD_BUG_ON(MAX_CLOCKS != 16);
> >> @@ -1197,9 +1197,21 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
> >>         rcu_assign_pointer(t->callback_fn, NULL);
> >>         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> >>         t->timer.function = bpf_timer_cb;
> >> -       timer->timer = t;
> >> +       WRITE_ONCE(timer->timer, t);
> >> +       /* Guarantee order between timer->timer and map->usercnt. So when
> >> +        * there are concurrent uref release and bpf timer init, either
> >> +        * bpf_timer_cancel_and_free() called by uref release reads a non-NULL
> >> +        * timer or atomic64_read() below reads a zero usercnt.
> >> +        */
> >> +       smp_mb();
> >> +       if (!atomic64_read(&map->usercnt)) {
> >> +               WRITE_ONCE(timer->timer, NULL);
> >> +               to_free = t;
> > just kfree(t); here.
>
> Will do. It is a slow path, so I think doing kfree() under spin-lock is
> acceptable.
> >
> >> +               ret = -EPERM;
> >> +       }
> > This will add a second atomic64_read(&map->usercnt) in the same function.
> > Let's remove the first one ?
>
> I prefer to still keep it, because it detects the release of the map uref
> early, and handling the uref release there is simpler than in the path
> after the second atomic64_read(). Do you have a strong preference ?

I bet somebody will send a patch to remove the first one as redundant,
so let's do it now.
The only reason we do a repeated early check elsewhere is to avoid taking
a lock. Here, doing an extra early check just to avoid a kmalloc() is
overkill. That check is highly unlikely to hit, while for locks it's a
likely one. Hence the extra check is justified for locks, but not here.

Reading your other email, it looks like this patchset is incomplete anyway?


end of thread, other threads:[~2023-10-20 14:58 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-20  1:42 [PATCH bpf v2 0/2] bpf: Fix bpf timer kmemleak Hou Tao
2023-10-20  1:42 ` [PATCH bpf v2 1/2] bpf: Check map->usercnt again after timer->timer is assigned Hou Tao
2023-10-20  2:14   ` Alexei Starovoitov
2023-10-20  7:31     ` Hou Tao
2023-10-20 14:57       ` Alexei Starovoitov
2023-10-20  1:42 ` [PATCH bpf v2 2/2] selftests/bpf: Test race between map uref release and bpf timer init Hou Tao
2023-10-20  2:50   ` Hsin-Wei Hung
2023-10-20 13:57     ` Hou Tao
