All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value
@ 2022-02-09  7:03 Kumar Kartikeya Dwivedi
  2022-02-09  7:03 ` [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value Kumar Kartikeya Dwivedi
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-02-09  7:03 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko

A fix for an oversight in copy_map_value that leads to kernel crash.

Also, a question for BPF developers:
It seems in arraymap.c, we always do check_and_free_timer_in_array after we do
copy_map_value in map_update_elem callback, but the same is not done for
hashtab.c. Is there a specific reason for this difference in behavior, or did I
miss that it happens for hashtab.c as well?

Changlog:
---------
v1 -> v2:
v1: https://lore.kernel.org/bpf/20220209051113.870717-1-memxor@gmail.com

 * Fix build error for selftests patch due to missing SYS_PREFIX in bpf tree

Kumar Kartikeya Dwivedi (2):
  bpf: Fix crash due to incorrect copy_map_value
  selftests/bpf: Add test for bpf_timer overwriting crash

 include/linux/bpf.h                           |  3 +-
 .../selftests/bpf/prog_tests/timer_crash.c    | 32 +++++++++++
 .../testing/selftests/bpf/progs/timer_crash.c | 54 +++++++++++++++++++
 3 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_crash.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer_crash.c

-- 
2.35.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-09  7:03 [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value Kumar Kartikeya Dwivedi
@ 2022-02-09  7:03 ` Kumar Kartikeya Dwivedi
  2022-02-09 19:06   ` Yonghong Song
  2022-02-09  7:03 ` [PATCH bpf v2 2/2] selftests/bpf: Add test for bpf_timer overwriting crash Kumar Kartikeya Dwivedi
  2022-02-11 21:30 ` [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value patchwork-bot+netdevbpf
  2 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-02-09  7:03 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko

When both bpf_spin_lock and bpf_timer are present in a BPF map value,
copy_map_value needs to skirt both objects when copying a value into and
out of the map. However, the current code does not set both s_off and
t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
is placed in map value with bpf_timer, as bpf_map_update_elem call will
be able to overwrite the other timer object.

When the issue is not fixed, an overwriting can produce the following
splat:

[root@(none) bpf]# ./test_progs -t timer_crash
[   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
[   16.037849] ==================================================================
[   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
[   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
[   16.039399]
[   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
[   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
[   16.040485] Call Trace:
[   16.040645]  <TASK>
[   16.040805]  dump_stack_lvl+0x59/0x73
[   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
[   16.041427]  kasan_report.cold+0x116/0x11b
[   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
[   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
[   16.042328]  ? memcpy+0x39/0x60
[   16.042552]  ? pv_hash+0xd0/0xd0
[   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
[   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
[   16.043366]  ? bpf_get_current_comm+0x50/0x50
[   16.043608]  ? jhash+0x11a/0x270
[   16.043848]  bpf_timer_cancel+0x34/0xe0
[   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
[   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
[   16.044836]  __x64_sys_nanosleep+0x5/0x140
[   16.045119]  do_syscall_64+0x59/0x80
[   16.045377]  ? lock_is_held_type+0xe4/0x140
[   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
[   16.046001]  ? mark_held_locks+0x24/0x90
[   16.046287]  ? asm_exc_page_fault+0x1e/0x30
[   16.046569]  ? asm_exc_page_fault+0x8/0x30
[   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
[   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   16.047405] RIP: 0033:0x7f9e4831718d
[   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
[   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
[   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
[   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
[   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
[   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
[   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   16.051608]  </TASK>
[   16.051762] ==================================================================

Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fa517ae604ad..31a83449808b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 	if (unlikely(map_value_has_spin_lock(map))) {
 		s_off = map->spin_lock_off;
 		s_sz = sizeof(struct bpf_spin_lock);
-	} else if (unlikely(map_value_has_timer(map))) {
+	}
+	if (unlikely(map_value_has_timer(map))) {
 		t_off = map->timer_off;
 		t_sz = sizeof(struct bpf_timer);
 	}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH bpf v2 2/2] selftests/bpf: Add test for bpf_timer overwriting crash
  2022-02-09  7:03 [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value Kumar Kartikeya Dwivedi
  2022-02-09  7:03 ` [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value Kumar Kartikeya Dwivedi
@ 2022-02-09  7:03 ` Kumar Kartikeya Dwivedi
  2022-02-09 20:03   ` Kumar Kartikeya Dwivedi
  2022-02-11 21:30 ` [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value patchwork-bot+netdevbpf
  2 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-02-09  7:03 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko

Add a test that validates that timer value is not overwritten when doing
a copy_map_value call in the kernel. Without the prior fix, this test
triggers a crash.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/timer_crash.c    | 32 +++++++++++
 .../testing/selftests/bpf/progs/timer_crash.c | 54 +++++++++++++++++++
 2 files changed, 86 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_crash.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer_crash.c

diff --git a/tools/testing/selftests/bpf/prog_tests/timer_crash.c b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
new file mode 100644
index 000000000000..f74b82305da8
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include "timer_crash.skel.h"
+
+enum {
+	MODE_ARRAY,
+	MODE_HASH,
+};
+
+static void test_timer_crash_mode(int mode)
+{
+	struct timer_crash *skel;
+
+	skel = timer_crash__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "timer_crash__open_and_load"))
+		return;
+	skel->bss->pid = getpid();
+	skel->bss->crash_map = mode;
+	if (!ASSERT_OK(timer_crash__attach(skel), "timer_crash__attach"))
+		goto end;
+	usleep(1);
+end:
+	timer_crash__destroy(skel);
+}
+
+void test_timer_crash(void)
+{
+	if (test__start_subtest("array"))
+		test_timer_crash_mode(MODE_ARRAY);
+	if (test__start_subtest("hash"))
+		test_timer_crash_mode(MODE_HASH);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer_crash.c b/tools/testing/selftests/bpf/progs/timer_crash.c
new file mode 100644
index 000000000000..f8f7944e70da
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer_crash.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+struct map_elem {
+	struct bpf_timer timer;
+	struct bpf_spin_lock lock;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct map_elem);
+} amap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct map_elem);
+} hmap SEC(".maps");
+
+int pid = 0;
+int crash_map = 0; /* 0 for amap, 1 for hmap */
+
+SEC("fentry/do_nanosleep")
+int sys_enter(void *ctx)
+{
+	struct map_elem *e, value = {};
+	void *map = crash_map ? (void *)&hmap : (void *)&amap;
+
+	if (bpf_get_current_task_btf()->tgid != pid)
+		return 0;
+
+	*(void **)&value = (void *)0xdeadcaf3;
+
+	bpf_map_update_elem(map, &(int){0}, &value, 0);
+	/* For array map, doing bpf_map_update_elem will do a
+	 * check_and_free_timer_in_array, which will trigger the crash if timer
+	 * pointer was overwritten, for hmap we need to use bpf_timer_cancel.
+	 */
+	if (crash_map == 1) {
+		e = bpf_map_lookup_elem(map, &(int){0});
+		if (!e)
+			return 0;
+		bpf_timer_cancel(&e->timer);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-09  7:03 ` [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value Kumar Kartikeya Dwivedi
@ 2022-02-09 19:06   ` Yonghong Song
  2022-02-09 19:52     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2022-02-09 19:06 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko



On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
> When both bpf_spin_lock and bpf_timer are present in a BPF map value,
> copy_map_value needs to skirt both objects when copying a value into and
> out of the map. However, the current code does not set both s_off and
> t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
> is placed in map value with bpf_timer, as bpf_map_update_elem call will
> be able to overwrite the other timer object.
> 
> When the issue is not fixed, an overwriting can produce the following
> splat:
> 
> [root@(none) bpf]# ./test_progs -t timer_crash
> [   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
> [   16.037849] ==================================================================
> [   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
> [   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
> [   16.039399]
> [   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
> [   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
> [   16.040485] Call Trace:
> [   16.040645]  <TASK>
> [   16.040805]  dump_stack_lvl+0x59/0x73
> [   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
> [   16.041427]  kasan_report.cold+0x116/0x11b
> [   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
> [   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
> [   16.042328]  ? memcpy+0x39/0x60
> [   16.042552]  ? pv_hash+0xd0/0xd0
> [   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
> [   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
> [   16.043366]  ? bpf_get_current_comm+0x50/0x50
> [   16.043608]  ? jhash+0x11a/0x270
> [   16.043848]  bpf_timer_cancel+0x34/0xe0
> [   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
> [   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
> [   16.044836]  __x64_sys_nanosleep+0x5/0x140
> [   16.045119]  do_syscall_64+0x59/0x80
> [   16.045377]  ? lock_is_held_type+0xe4/0x140
> [   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
> [   16.046001]  ? mark_held_locks+0x24/0x90
> [   16.046287]  ? asm_exc_page_fault+0x1e/0x30
> [   16.046569]  ? asm_exc_page_fault+0x8/0x30
> [   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
> [   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   16.047405] RIP: 0033:0x7f9e4831718d
> [   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
> [   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
> [   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
> [   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
> [   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
> [   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
> [   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [   16.051608]  </TASK>
> [   16.051762] ==================================================================
> 
> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   include/linux/bpf.h | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fa517ae604ad..31a83449808b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>   	if (unlikely(map_value_has_spin_lock(map))) {
>   		s_off = map->spin_lock_off;
>   		s_sz = sizeof(struct bpf_spin_lock);
> -	} else if (unlikely(map_value_has_timer(map))) {
> +	}
> +	if (unlikely(map_value_has_timer(map))) {
>   		t_off = map->timer_off;
>   		t_sz = sizeof(struct bpf_timer);
>   	}

Thanks for the patch. I think we have a bigger problem here with the 
patch. It actually exposed a few kernel bugs. If you run current 
selftests, esp. ./test_progs -j which is what I tried, you will observe
various testing failures. The reason is due to we preserved the timer or
spin lock information incorrectly for a map value.

For example, the selftest #179 (timer) will fail with this patch and
the following change can fix it.

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d29af9988f37..3336d76cc5a6 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct 
bpf_htab *htab, void *key,
                         l_new = ERR_PTR(-ENOMEM);
                         goto dec_count;
                 }
-               check_and_init_map_value(&htab->map,
-                                        l_new->key + round_up(key_size, 
8));
         }

+       check_and_init_map_value(&htab->map,
+                                l_new->key + round_up(key_size, 8));
+
         memcpy(l_new->key, key, key_size);
         if (percpu) {
                 size = round_up(size, 8);

Basically, the above clears new map value timer/spin_lock fields for 
*all* new items.

There are still some selftest failures due to this patch.
Could you run selftest, with necessary additional fixes, to ensure
all selftests passed?

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-09 19:06   ` Yonghong Song
@ 2022-02-09 19:52     ` Kumar Kartikeya Dwivedi
  2022-02-10  8:05       ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-02-09 19:52 UTC (permalink / raw)
  To: Yonghong Song; +Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko

On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:
>
>
> On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
> > When both bpf_spin_lock and bpf_timer are present in a BPF map value,
> > copy_map_value needs to skirt both objects when copying a value into and
> > out of the map. However, the current code does not set both s_off and
> > t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
> > is placed in map value with bpf_timer, as bpf_map_update_elem call will
> > be able to overwrite the other timer object.
> >
> > When the issue is not fixed, an overwriting can produce the following
> > splat:
> >
> > [root@(none) bpf]# ./test_progs -t timer_crash
> > [   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
> > [   16.037849] ==================================================================
> > [   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
> > [   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
> > [   16.039399]
> > [   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
> > [   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
> > [   16.040485] Call Trace:
> > [   16.040645]  <TASK>
> > [   16.040805]  dump_stack_lvl+0x59/0x73
> > [   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
> > [   16.041427]  kasan_report.cold+0x116/0x11b
> > [   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
> > [   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
> > [   16.042328]  ? memcpy+0x39/0x60
> > [   16.042552]  ? pv_hash+0xd0/0xd0
> > [   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
> > [   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
> > [   16.043366]  ? bpf_get_current_comm+0x50/0x50
> > [   16.043608]  ? jhash+0x11a/0x270
> > [   16.043848]  bpf_timer_cancel+0x34/0xe0
> > [   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
> > [   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
> > [   16.044836]  __x64_sys_nanosleep+0x5/0x140
> > [   16.045119]  do_syscall_64+0x59/0x80
> > [   16.045377]  ? lock_is_held_type+0xe4/0x140
> > [   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
> > [   16.046001]  ? mark_held_locks+0x24/0x90
> > [   16.046287]  ? asm_exc_page_fault+0x1e/0x30
> > [   16.046569]  ? asm_exc_page_fault+0x8/0x30
> > [   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
> > [   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [   16.047405] RIP: 0033:0x7f9e4831718d
> > [   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
> > [   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
> > [   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
> > [   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
> > [   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
> > [   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
> > [   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [   16.051608]  </TASK>
> > [   16.051762] ==================================================================
> >
> > Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >   include/linux/bpf.h | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index fa517ae604ad..31a83449808b 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
> >   	if (unlikely(map_value_has_spin_lock(map))) {
> >   		s_off = map->spin_lock_off;
> >   		s_sz = sizeof(struct bpf_spin_lock);
> > -	} else if (unlikely(map_value_has_timer(map))) {
> > +	}
> > +	if (unlikely(map_value_has_timer(map))) {
> >   		t_off = map->timer_off;
> >   		t_sz = sizeof(struct bpf_timer);
> >   	}
>
> Thanks for the patch. I think we have a bigger problem here with the patch.
> It actually exposed a few kernel bugs. If you run current selftests, esp.
> ./test_progs -j which is what I tried, you will observe
> various testing failures. The reason is due to we preserved the timer or
> spin lock information incorrectly for a map value.
>
> For example, the selftest #179 (timer) will fail with this patch and
> the following change can fix it.
>

I actually only saw the same failures (on bpf/master) as in CI, and it seems
they are there even when I do a run without my patch (related to uprobes). The
bpftool patch PR in GitHub also has the same error, so I'm guessing it is
unrelated to this. I also didn't see any difference when running on bpf-next.

As far as others are concerned, I didn't see the failure for timer test, or any
other ones, for me all timer tests pass properly after applying it. It could be
that my test VM is not triggering it, because it may depend on the runtime
system/memory values, etc.

Can you share what error you see? Does it crash or does it just fail?

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index d29af9988f37..3336d76cc5a6 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct
> bpf_htab *htab, void *key,
>                         l_new = ERR_PTR(-ENOMEM);
>                         goto dec_count;
>                 }
> -               check_and_init_map_value(&htab->map,
> -                                        l_new->key + round_up(key_size,
> 8));
>         }
>
> +       check_and_init_map_value(&htab->map,
> +                                l_new->key + round_up(key_size, 8));
> +

Makes sense, but trying to understand why it would fail:
So this is needed because the reused element from per-CPU region might have
garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer
case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.

Earlier copy_map_value further below in this code would also overwrite the timer
part (which usually may be zero), but that would also not happen anymore.

>         memcpy(l_new->key, key, key_size);
>         if (percpu) {
>                 size = round_up(size, 8);
>
> Basically, the above clears new map value timer/spin_lock fields for *all*
> new items.
>
> There are still some selftest failures due to this patch.
> Could you run selftest, with necessary additional fixes, to ensure
> all selftests passed?

Assuming that uprobe ones are unrelated, should I add this diff and respin? Or
are you seeing other errors as well apart from the timer (#179) test?

--
Kartikeya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 2/2] selftests/bpf: Add test for bpf_timer overwriting crash
  2022-02-09  7:03 ` [PATCH bpf v2 2/2] selftests/bpf: Add test for bpf_timer overwriting crash Kumar Kartikeya Dwivedi
@ 2022-02-09 20:03   ` Kumar Kartikeya Dwivedi
  2022-02-10  8:10     ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-02-09 20:03 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko; +Cc: Yonghong Song

On Wed, Feb 09, 2022 at 12:33:24PM IST, Kumar Kartikeya Dwivedi wrote:
> Add a test that validates that timer value is not overwritten when doing
> a copy_map_value call in the kernel. Without the prior fix, this test
> triggers a crash.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  .../selftests/bpf/prog_tests/timer_crash.c    | 32 +++++++++++
>  .../testing/selftests/bpf/progs/timer_crash.c | 54 +++++++++++++++++++
>  2 files changed, 86 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_crash.c
>  create mode 100644 tools/testing/selftests/bpf/progs/timer_crash.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/timer_crash.c b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
> new file mode 100644
> index 000000000000..f74b82305da8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
> @@ -0,0 +1,32 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include "timer_crash.skel.h"
> +
> +enum {
> +	MODE_ARRAY,
> +	MODE_HASH,
> +};
> +
> +static void test_timer_crash_mode(int mode)
> +{
> +	struct timer_crash *skel;
> +
> +	skel = timer_crash__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "timer_crash__open_and_load"))
> +		return;
> +	skel->bss->pid = getpid();
> +	skel->bss->crash_map = mode;
> +	if (!ASSERT_OK(timer_crash__attach(skel), "timer_crash__attach"))
> +		goto end;
> +	usleep(1);
> +end:
> +	timer_crash__destroy(skel);
> +}
> +
> +void test_timer_crash(void)
> +{
> +	if (test__start_subtest("array"))
> +		test_timer_crash_mode(MODE_ARRAY);
> +	if (test__start_subtest("hash"))
> +		test_timer_crash_mode(MODE_HASH);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/timer_crash.c b/tools/testing/selftests/bpf/progs/timer_crash.c
> new file mode 100644
> index 000000000000..f8f7944e70da
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/timer_crash.c
> @@ -0,0 +1,54 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +
> +struct map_elem {
> +	struct bpf_timer timer;
> +	struct bpf_spin_lock lock;
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, int);
> +	__type(value, struct map_elem);
> +} amap SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, 1);
> +	__type(key, int);
> +	__type(value, struct map_elem);
> +} hmap SEC(".maps");
> +
> +int pid = 0;
> +int crash_map = 0; /* 0 for amap, 1 for hmap */
> +
> +SEC("fentry/do_nanosleep")
> +int sys_enter(void *ctx)
> +{
> +	struct map_elem *e, value = {};
> +	void *map = crash_map ? (void *)&hmap : (void *)&amap;
> +
> +	if (bpf_get_current_task_btf()->tgid != pid)
> +		return 0;
> +
> +	*(void **)&value = (void *)0xdeadcaf3;
> +
> +	bpf_map_update_elem(map, &(int){0}, &value, 0);
> +	/* For array map, doing bpf_map_update_elem will do a
> +	 * check_and_free_timer_in_array, which will trigger the crash if timer
> +	 * pointer was overwritten, for hmap we need to use bpf_timer_cancel.
> +	 */

Also, in this case, there seems to be a difference of behavior. When we do
bpf_map_update_elem for array map, it seems to invoke
check_and_free_timer_in_array and free any timers that were part of the value.
In case of hash/lru_hash, that doesn't seem to happen, hence why the test has
two 'modes', to then trigger a dereference of the overwritten pointer using
bpf_timer_cancel.

So in case of array map it crashes right when doing bpf_map_update_elem, and in
case of hash it crashes inside bpf_timer_cancel.

This seems inconsistent to me, is there a specific reason behind doing it for
array map differently than hash map? If not, is it now too late to change this?

> +	if (crash_map == 1) {
> +		e = bpf_map_lookup_elem(map, &(int){0});
> +		if (!e)
> +			return 0;
> +		bpf_timer_cancel(&e->timer);
> +	}
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.35.1
>

--
Kartikeya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-09 19:52     ` Kumar Kartikeya Dwivedi
@ 2022-02-10  8:05       ` Yonghong Song
  2022-02-10 22:49         ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2022-02-10  8:05 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko


On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:
>>
>>
>> On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
>>> When both bpf_spin_lock and bpf_timer are present in a BPF map value,
>>> copy_map_value needs to skirt both objects when copying a value into and
>>> out of the map. However, the current code does not set both s_off and
>>> t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
>>> is placed in map value with bpf_timer, as bpf_map_update_elem call will
>>> be able to overwrite the other timer object.
>>>
>>> When the issue is not fixed, an overwriting can produce the following
>>> splat:
>>>
>>> [root@(none) bpf]# ./test_progs -t timer_crash
>>> [   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
>>> [   16.037849] ==================================================================
>>> [   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
>>> [   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
>>> [   16.039399]
>>> [   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
>>> [   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
>>> [   16.040485] Call Trace:
>>> [   16.040645]  <TASK>
>>> [   16.040805]  dump_stack_lvl+0x59/0x73
>>> [   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
>>> [   16.041427]  kasan_report.cold+0x116/0x11b
>>> [   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
>>> [   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
>>> [   16.042328]  ? memcpy+0x39/0x60
>>> [   16.042552]  ? pv_hash+0xd0/0xd0
>>> [   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
>>> [   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
>>> [   16.043366]  ? bpf_get_current_comm+0x50/0x50
>>> [   16.043608]  ? jhash+0x11a/0x270
>>> [   16.043848]  bpf_timer_cancel+0x34/0xe0
>>> [   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
>>> [   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
>>> [   16.044836]  __x64_sys_nanosleep+0x5/0x140
>>> [   16.045119]  do_syscall_64+0x59/0x80
>>> [   16.045377]  ? lock_is_held_type+0xe4/0x140
>>> [   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
>>> [   16.046001]  ? mark_held_locks+0x24/0x90
>>> [   16.046287]  ? asm_exc_page_fault+0x1e/0x30
>>> [   16.046569]  ? asm_exc_page_fault+0x8/0x30
>>> [   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
>>> [   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [   16.047405] RIP: 0033:0x7f9e4831718d
>>> [   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
>>> [   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
>>> [   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
>>> [   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
>>> [   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
>>> [   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
>>> [   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>> [   16.051608]  </TASK>
>>> [   16.051762] ==================================================================
>>>
>>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>> ---
>>>    include/linux/bpf.h | 3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index fa517ae604ad..31a83449808b 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>>>    	if (unlikely(map_value_has_spin_lock(map))) {
>>>    		s_off = map->spin_lock_off;
>>>    		s_sz = sizeof(struct bpf_spin_lock);
>>> -	} else if (unlikely(map_value_has_timer(map))) {
>>> +	}
>>> +	if (unlikely(map_value_has_timer(map))) {
>>>    		t_off = map->timer_off;
>>>    		t_sz = sizeof(struct bpf_timer);
>>>    	}
>>
>> Thanks for the patch. I think we have a bigger problem here with the patch.
>> It actually exposed a few kernel bugs. If you run current selftests, esp.
>> ./test_progs -j which is what I tried, you will observe
>> various testing failures. The reason is due to we preserved the timer or
>> spin lock information incorrectly for a map value.
>>
>> For example, the selftest #179 (timer) will fail with this patch and
>> the following change can fix it.
>>
> 
> I actually only saw the same failures (on bpf/master) as in CI, and it seems
> they are there even when I do a run without my patch (related to uprobes). The
> bpftool patch PR in GitHub also has the same error, so I'm guessing it is
> unrelated to this. I also didn't see any difference when running on bpf-next.
> 
> As far as others are concerned, I didn't see the failure for timer test, or any
> other ones, for me all timer tests pass properly after applying it. It could be
> that my test VM is not triggering it, because it may depend on the runtime
> system/memory values, etc.
> 
> Can you share what error you see? Does it crash or does it just fail?

For test #179 (timer), most time I saw a hung. But I also see
the oops in bpf_timer_set_callback().

> 
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index d29af9988f37..3336d76cc5a6 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct
>> bpf_htab *htab, void *key,
>>                          l_new = ERR_PTR(-ENOMEM);
>>                          goto dec_count;
>>                  }
>> -               check_and_init_map_value(&htab->map,
>> -                                        l_new->key + round_up(key_size,
>> 8));
>>          }
>>
>> +       check_and_init_map_value(&htab->map,
>> +                                l_new->key + round_up(key_size, 8));
>> +
> 
> Makes sense, but trying to understand why it would fail:
> So this is needed because the reused element from per-CPU region might have
> garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer
> case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.
> 
> Earlier copy_map_value further below in this code would also overwrite the timer
> part (which usually may be zero), but that would also not happen anymore.

That is correct. The preallocated hash tables have a free list. Look 
like when an element is put into a free list, its value is not reset.
this in general is fine as later on a new value will be copied to
overwrite the old one, but this is except spin_lock and timer value.
spin_lock probably fine as it just contains a value like unlocked state.
but timer is a problem as timer has been released and reusing the timer
could cause all kind of issues.

Without your patch, if spinlock exists timer is always reset to 0 and 
hence we won't have problems. If spinlock doesn't exist, looks like
we could still have issues.

My above change is really just a hack.

The following is (sort of) a proper patch and with your patch the
selftests do pass in my environment.

commit cf9d4a47d95dc992504a717e4c39bbf2bac8b41c (HEAD -> tmp10)
Author: Yonghong Song <yhs@fb.com>
Date:   Wed Feb 9 19:53:42 2022 -0800

     bpf: clear timer field for new elements in preallocated hash tables

     Currently, when a hash element is freed and released to freelist
     pool, its value is not reset. This may cause a problem if the value
     contains a timer. Since the timer has been cancelled and freed,
     reusing the old pointer may cause various kernel issues including
     hang or oops. This can happen with the below command
       ./test_progs -t timer

     There are two approaches to resolve this. One is to reset the
     timer value when the element is put to the freelist, and the second
     is to clear the timer value when the element from the freelist
     is ready to be used. This patch implemented the second approach.

     Signed-off-by: Yonghong Song <yhs@fb.com>

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d29af9988f37..558f3c40307a 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -916,6 +916,12 @@ static bool fd_htab_map_needs_adjust(const struct 
bpf_htab *htab)
                BITS_PER_LONG == 64;
  }

+static inline void check_and_init_timer_value(struct bpf_map *map, void 
*dst)
+{
+       if (unlikely(map_value_has_timer(map)))
+               memset(dst + map->timer_off, 0, sizeof(struct bpf_timer));
+}
+
  static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
                                          void *value, u32 key_size, u32 
hash,
                                          bool percpu, bool onallcpus,
@@ -925,6 +931,7 @@ static struct htab_elem *alloc_htab_elem(struct 
bpf_htab *htab, void *key,
         bool prealloc = htab_is_prealloc(htab);
         struct htab_elem *l_new, **pl_new;
         void __percpu *pptr;
+       void *new_value;

         if (prealloc) {
                 if (old_elem) {
@@ -989,9 +996,9 @@ static struct htab_elem *alloc_htab_elem(struct 
bpf_htab *htab, void *key,
                 size = round_up(size, 8);
                 memcpy(l_new->key + round_up(key_size, 8), value, size);
         } else {
-               copy_map_value(&htab->map,
-                              l_new->key + round_up(key_size, 8),
-                              value);
+               new_value = l_new->key + round_up(key_size, 8);
+               check_and_init_timer_value(&htab->map, new_value);
+               copy_map_value(&htab->map, new_value, value);
         }

         l_new->hash = hash;
@@ -1127,6 +1134,7 @@ static int htab_lru_map_update_elem(struct bpf_map 
*map, void *key, void *value,
         unsigned long flags;
         struct bucket *b;
         u32 key_size, hash;
+       void *new_value;
         int ret;

         if (unlikely(map_flags > BPF_EXIST))
@@ -1151,8 +1159,10 @@ static int htab_lru_map_update_elem(struct 
bpf_map *map, void *key, void *value,
         l_new = prealloc_lru_pop(htab, key, hash);
         if (!l_new)
                 return -ENOMEM;
-       copy_map_value(&htab->map,
-                      l_new->key + round_up(map->key_size, 8), value);
+
+       new_value = l_new->key + round_up(map->key_size, 8);
+       check_and_init_timer_value(&htab->map, new_value);
+       copy_map_value(&htab->map, new_value, value);

         ret = htab_lock_bucket(htab, b, hash, &flags);

> 
>>          memcpy(l_new->key, key, key_size);
>>          if (percpu) {
>>                  size = round_up(size, 8);
>>
>> Basically, the above clears new map value timer/spin_lock fields for *all*
>> new items.
>>
>> There are still some selftest failures due to this patch.
>> Could you run selftest, with necessary additional fixes, to ensure
>> all selftests passed?
> 
> Assuming that uprobe ones are unrelated, should I add this diff and respin? Or
> are you seeing other errors as well apart from the timer (#179) test?

Yes, please incorporate my above change into your patch and resubmit.
folks may have different opinions e.g. to clear timer field right
before entering to the free list.

> 
> --
> Kartikeya

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 2/2] selftests/bpf: Add test for bpf_timer overwriting crash
  2022-02-09 20:03   ` Kumar Kartikeya Dwivedi
@ 2022-02-10  8:10     ` Yonghong Song
  0 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2022-02-10  8:10 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko



On 2/9/22 12:03 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, Feb 09, 2022 at 12:33:24PM IST, Kumar Kartikeya Dwivedi wrote:
>> Add a test that validates that timer value is not overwritten when doing
>> a copy_map_value call in the kernel. Without the prior fix, this test
>> triggers a crash.
>>
>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> ---
>>   .../selftests/bpf/prog_tests/timer_crash.c    | 32 +++++++++++
>>   .../testing/selftests/bpf/progs/timer_crash.c | 54 +++++++++++++++++++
>>   2 files changed, 86 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_crash.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/timer_crash.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/timer_crash.c b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
>> new file mode 100644
>> index 000000000000..f74b82305da8
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
>> @@ -0,0 +1,32 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <test_progs.h>
>> +#include "timer_crash.skel.h"
>> +
>> +enum {
>> +	MODE_ARRAY,
>> +	MODE_HASH,
>> +};
>> +
>> +static void test_timer_crash_mode(int mode)
>> +{
>> +	struct timer_crash *skel;
>> +
>> +	skel = timer_crash__open_and_load();
>> +	if (!ASSERT_OK_PTR(skel, "timer_crash__open_and_load"))
>> +		return;
>> +	skel->bss->pid = getpid();
>> +	skel->bss->crash_map = mode;
>> +	if (!ASSERT_OK(timer_crash__attach(skel), "timer_crash__attach"))
>> +		goto end;
>> +	usleep(1);
>> +end:
>> +	timer_crash__destroy(skel);
>> +}
>> +
>> +void test_timer_crash(void)
>> +{
>> +	if (test__start_subtest("array"))
>> +		test_timer_crash_mode(MODE_ARRAY);
>> +	if (test__start_subtest("hash"))
>> +		test_timer_crash_mode(MODE_HASH);
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/timer_crash.c b/tools/testing/selftests/bpf/progs/timer_crash.c
>> new file mode 100644
>> index 000000000000..f8f7944e70da
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/timer_crash.c
>> @@ -0,0 +1,54 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <vmlinux.h>
>> +#include <bpf/bpf_tracing.h>
>> +#include <bpf/bpf_helpers.h>
>> +
>> +struct map_elem {
>> +	struct bpf_timer timer;
>> +	struct bpf_spin_lock lock;
>> +};
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_ARRAY);
>> +	__uint(max_entries, 1);
>> +	__type(key, int);
>> +	__type(value, struct map_elem);
>> +} amap SEC(".maps");
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_HASH);
>> +	__uint(max_entries, 1);
>> +	__type(key, int);
>> +	__type(value, struct map_elem);
>> +} hmap SEC(".maps");
>> +
>> +int pid = 0;
>> +int crash_map = 0; /* 0 for amap, 1 for hmap */
>> +
>> +SEC("fentry/do_nanosleep")
>> +int sys_enter(void *ctx)
>> +{
>> +	struct map_elem *e, value = {};
>> +	void *map = crash_map ? (void *)&hmap : (void *)&amap;
>> +
>> +	if (bpf_get_current_task_btf()->tgid != pid)
>> +		return 0;
>> +
>> +	*(void **)&value = (void *)0xdeadcaf3;
>> +
>> +	bpf_map_update_elem(map, &(int){0}, &value, 0);
>> +	/* For array map, doing bpf_map_update_elem will do a
>> +	 * check_and_free_timer_in_array, which will trigger the crash if timer
>> +	 * pointer was overwritten, for hmap we need to use bpf_timer_cancel.
>> +	 */
> 
> Also, in this case, there seems to be a difference of behavior. When we do
> bpf_map_update_elem for array map, it seems to invoke
> check_and_free_timer_in_array and free any timers that were part of the value.
> In case of hash/lru_hash, that doesn't seem to happen, hence why the test has
> two 'modes', to then trigger a dereference of the overwritten pointer using
> bpf_timer_cancel.
> 
> So in case of array map it crashes right when doing bpf_map_update_elem, and in
> case of hash it crashes inside bpf_timer_cancel.
> 
> This seems inconsistent to me, is there a specific reason behind doing it for
> array map differently than hash map? If not, is it now too late to change this?

There is no map deletion operation for array map, so the timer is 
canceled and freed when a new update comes in which signals the previous
value is gone.

for hashmap, depending on whether a new element will be allocated or 
not. If it is, the old element (if exists) will have its timer freed.

> 
>> +	if (crash_map == 1) {
>> +		e = bpf_map_lookup_elem(map, &(int){0});
>> +		if (!e)
>> +			return 0;
>> +		bpf_timer_cancel(&e->timer);
>> +	}
>> +	return 0;
>> +}
>> +
>> +char _license[] SEC("license") = "GPL";
>> --
>> 2.35.1
>>
> 
> --
> Kartikeya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-10  8:05       ` Yonghong Song
@ 2022-02-10 22:49         ` Alexei Starovoitov
  2022-02-10 23:54           ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2022-02-10 22:49 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko

On Thu, Feb 10, 2022 at 12:05 AM Yonghong Song <yhs@fb.com> wrote:
>
>
> On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote:
> > On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:
> >>
> >>
> >> On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
> >>> When both bpf_spin_lock and bpf_timer are present in a BPF map value,
> >>> copy_map_value needs to skirt both objects when copying a value into and
> >>> out of the map. However, the current code does not set both s_off and
> >>> t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
> >>> is placed in map value with bpf_timer, as bpf_map_update_elem call will
> >>> be able to overwrite the other timer object.
> >>>
> >>> When the issue is not fixed, an overwriting can produce the following
> >>> splat:
> >>>
> >>> [root@(none) bpf]# ./test_progs -t timer_crash
> >>> [   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
> >>> [   16.037849] ==================================================================
> >>> [   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
> >>> [   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
> >>> [   16.039399]
> >>> [   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
> >>> [   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
> >>> [   16.040485] Call Trace:
> >>> [   16.040645]  <TASK>
> >>> [   16.040805]  dump_stack_lvl+0x59/0x73
> >>> [   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
> >>> [   16.041427]  kasan_report.cold+0x116/0x11b
> >>> [   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
> >>> [   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
> >>> [   16.042328]  ? memcpy+0x39/0x60
> >>> [   16.042552]  ? pv_hash+0xd0/0xd0
> >>> [   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
> >>> [   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
> >>> [   16.043366]  ? bpf_get_current_comm+0x50/0x50
> >>> [   16.043608]  ? jhash+0x11a/0x270
> >>> [   16.043848]  bpf_timer_cancel+0x34/0xe0
> >>> [   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
> >>> [   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
> >>> [   16.044836]  __x64_sys_nanosleep+0x5/0x140
> >>> [   16.045119]  do_syscall_64+0x59/0x80
> >>> [   16.045377]  ? lock_is_held_type+0xe4/0x140
> >>> [   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
> >>> [   16.046001]  ? mark_held_locks+0x24/0x90
> >>> [   16.046287]  ? asm_exc_page_fault+0x1e/0x30
> >>> [   16.046569]  ? asm_exc_page_fault+0x8/0x30
> >>> [   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
> >>> [   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>> [   16.047405] RIP: 0033:0x7f9e4831718d
> >>> [   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
> >>> [   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
> >>> [   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
> >>> [   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
> >>> [   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
> >>> [   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
> >>> [   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >>> [   16.051608]  </TASK>
> >>> [   16.051762] ==================================================================
> >>>
> >>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> >>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> >>> ---
> >>>    include/linux/bpf.h | 3 ++-
> >>>    1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >>> index fa517ae604ad..31a83449808b 100644
> >>> --- a/include/linux/bpf.h
> >>> +++ b/include/linux/bpf.h
> >>> @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
> >>>     if (unlikely(map_value_has_spin_lock(map))) {
> >>>             s_off = map->spin_lock_off;
> >>>             s_sz = sizeof(struct bpf_spin_lock);
> >>> -   } else if (unlikely(map_value_has_timer(map))) {
> >>> +   }
> >>> +   if (unlikely(map_value_has_timer(map))) {
> >>>             t_off = map->timer_off;
> >>>             t_sz = sizeof(struct bpf_timer);
> >>>     }
> >>
> >> Thanks for the patch. I think we have a bigger problem here with the patch.
> >> It actually exposed a few kernel bugs. If you run current selftests, esp.
> >> ./test_progs -j which is what I tried, you will observe
> >> various testing failures. The reason is due to we preserved the timer or
> >> spin lock information incorrectly for a map value.
> >>
> >> For example, the selftest #179 (timer) will fail with this patch and
> >> the following change can fix it.
> >>
> >
> > I actually only saw the same failures (on bpf/master) as in CI, and it seems
> > they are there even when I do a run without my patch (related to uprobes). The
> > bpftool patch PR in GitHub also has the same error, so I'm guessing it is
> > unrelated to this. I also didn't see any difference when running on bpf-next.
> >
> > As far as others are concerned, I didn't see the failure for timer test, or any
> > other ones, for me all timer tests pass properly after applying it. It could be
> > that my test VM is not triggering it, because it may depend on the runtime
> > system/memory values, etc.
> >
> > Can you share what error you see? Does it crash or does it just fail?
>
> For test #179 (timer), most time I saw a hung. But I also see
> the oops in bpf_timer_set_callback().
>
> >
> >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >> index d29af9988f37..3336d76cc5a6 100644
> >> --- a/kernel/bpf/hashtab.c
> >> +++ b/kernel/bpf/hashtab.c
> >> @@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct
> >> bpf_htab *htab, void *key,
> >>                          l_new = ERR_PTR(-ENOMEM);
> >>                          goto dec_count;
> >>                  }
> >> -               check_and_init_map_value(&htab->map,
> >> -                                        l_new->key + round_up(key_size,
> >> 8));
> >>          }
> >>
> >> +       check_and_init_map_value(&htab->map,
> >> +                                l_new->key + round_up(key_size, 8));
> >> +
> >
> > Makes sense, but trying to understand why it would fail:
> > So this is needed because the reused element from per-CPU region might have
> > garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer
> > case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.
> >
> > Earlier copy_map_value further below in this code would also overwrite the timer
> > part (which usually may be zero), but that would also not happen anymore.
>
> That is correct. The preallocated hash tables have a free list. Look
> like when an element is put into a free list, its value is not reset.

I don't follow. How do you think it can happen?
htab_delete/update are calling free_htab_elem()
which calls check_and_free_timer().
For pre-alloc htab_update calls check_and_free_timer() directly.
There should be never a case when timer is active in the free list.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-10 22:49         ` Alexei Starovoitov
@ 2022-02-10 23:54           ` Yonghong Song
  2022-02-11  0:02             ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2022-02-10 23:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko



On 2/10/22 2:49 PM, Alexei Starovoitov wrote:
> On Thu, Feb 10, 2022 at 12:05 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>> On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote:
>>> On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:
>>>>
>>>>
>>>> On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
>>>>> When both bpf_spin_lock and bpf_timer are present in a BPF map value,
>>>>> copy_map_value needs to skirt both objects when copying a value into and
>>>>> out of the map. However, the current code does not set both s_off and
>>>>> t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
>>>>> is placed in map value with bpf_timer, as bpf_map_update_elem call will
>>>>> be able to overwrite the other timer object.
>>>>>
>>>>> When the issue is not fixed, an overwriting can produce the following
>>>>> splat:
>>>>>
>>>>> [root@(none) bpf]# ./test_progs -t timer_crash
>>>>> [   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
>>>>> [   16.037849] ==================================================================
>>>>> [   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>> [   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
>>>>> [   16.039399]
>>>>> [   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
>>>>> [   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
>>>>> [   16.040485] Call Trace:
>>>>> [   16.040645]  <TASK>
>>>>> [   16.040805]  dump_stack_lvl+0x59/0x73
>>>>> [   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>> [   16.041427]  kasan_report.cold+0x116/0x11b
>>>>> [   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>> [   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>> [   16.042328]  ? memcpy+0x39/0x60
>>>>> [   16.042552]  ? pv_hash+0xd0/0xd0
>>>>> [   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
>>>>> [   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
>>>>> [   16.043366]  ? bpf_get_current_comm+0x50/0x50
>>>>> [   16.043608]  ? jhash+0x11a/0x270
>>>>> [   16.043848]  bpf_timer_cancel+0x34/0xe0
>>>>> [   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
>>>>> [   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
>>>>> [   16.044836]  __x64_sys_nanosleep+0x5/0x140
>>>>> [   16.045119]  do_syscall_64+0x59/0x80
>>>>> [   16.045377]  ? lock_is_held_type+0xe4/0x140
>>>>> [   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
>>>>> [   16.046001]  ? mark_held_locks+0x24/0x90
>>>>> [   16.046287]  ? asm_exc_page_fault+0x1e/0x30
>>>>> [   16.046569]  ? asm_exc_page_fault+0x8/0x30
>>>>> [   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
>>>>> [   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [   16.047405] RIP: 0033:0x7f9e4831718d
>>>>> [   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
>>>>> [   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
>>>>> [   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
>>>>> [   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
>>>>> [   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
>>>>> [   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
>>>>> [   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>>> [   16.051608]  </TASK>
>>>>> [   16.051762] ==================================================================
>>>>>
>>>>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
>>>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>>>> ---
>>>>>     include/linux/bpf.h | 3 ++-
>>>>>     1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>> index fa517ae604ad..31a83449808b 100644
>>>>> --- a/include/linux/bpf.h
>>>>> +++ b/include/linux/bpf.h
>>>>> @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>>>>>      if (unlikely(map_value_has_spin_lock(map))) {
>>>>>              s_off = map->spin_lock_off;
>>>>>              s_sz = sizeof(struct bpf_spin_lock);
>>>>> -   } else if (unlikely(map_value_has_timer(map))) {
>>>>> +   }
>>>>> +   if (unlikely(map_value_has_timer(map))) {
>>>>>              t_off = map->timer_off;
>>>>>              t_sz = sizeof(struct bpf_timer);
>>>>>      }
>>>>
>>>> Thanks for the patch. I think we have a bigger problem here with the patch.
>>>> It actually exposed a few kernel bugs. If you run current selftests, esp.
>>>> ./test_progs -j which is what I tried, you will observe
>>>> various testing failures. The reason is due to we preserved the timer or
>>>> spin lock information incorrectly for a map value.
>>>>
>>>> For example, the selftest #179 (timer) will fail with this patch and
>>>> the following change can fix it.
>>>>
>>>
>>> I actually only saw the same failures (on bpf/master) as in CI, and it seems
>>> they are there even when I do a run without my patch (related to uprobes). The
>>> bpftool patch PR in GitHub also has the same error, so I'm guessing it is
>>> unrelated to this. I also didn't see any difference when running on bpf-next.
>>>
>>> As far as others are concerned, I didn't see the failure for timer test, or any
>>> other ones, for me all timer tests pass properly after applying it. It could be
>>> that my test VM is not triggering it, because it may depend on the runtime
>>> system/memory values, etc.
>>>
>>> Can you share what error you see? Does it crash or does it just fail?
>>
>> For test #179 (timer), most time I saw a hung. But I also see
>> the oops in bpf_timer_set_callback().
>>
>>>
>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>> index d29af9988f37..3336d76cc5a6 100644
>>>> --- a/kernel/bpf/hashtab.c
>>>> +++ b/kernel/bpf/hashtab.c
>>>> @@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct
>>>> bpf_htab *htab, void *key,
>>>>                           l_new = ERR_PTR(-ENOMEM);
>>>>                           goto dec_count;
>>>>                   }
>>>> -               check_and_init_map_value(&htab->map,
>>>> -                                        l_new->key + round_up(key_size,
>>>> 8));
>>>>           }
>>>>
>>>> +       check_and_init_map_value(&htab->map,
>>>> +                                l_new->key + round_up(key_size, 8));
>>>> +
>>>
>>> Makes sense, but trying to understand why it would fail:
>>> So this is needed because the reused element from per-CPU region might have
>>> garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer
>>> case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.
>>>
>>> Earlier copy_map_value further below in this code would also overwrite the timer
>>> part (which usually may be zero), but that would also not happen anymore.
>>
>> That is correct. The preallocated hash tables have a free list. Look
>> like when an element is put into a free list, its value is not reset.
> 
> I don't follow. How do you think it can happen?
> htab_delete/update are calling free_htab_elem()
> which calls check_and_free_timer().
> For pre-alloc htab_update calls check_and_free_timer() directly.
> There should be never a case when timer is active in the free list.

The issue is not a timer active in the free list. It is the timer value
is not reset to 0 in the free list.
For example,
  1. value->timer... is set properly (non zero)
  2. value is deleted through update or delete, value->timer
     is cancelled and freed, and the hash_elem is put into
     free list. But the hash_elem value->timer is not zero.
  3. one hash_elem is picked up from the free list,
     and proper value is copied to the value except value->timer
     and value->spinlock (if they exist). This happens with this patch.
  4. some later kernel functions may see value->timer is set and
     do something bad ...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-10 23:54           ` Yonghong Song
@ 2022-02-11  0:02             ` Kumar Kartikeya Dwivedi
  2022-02-11  0:13               ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-02-11  0:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko

On Fri, Feb 11, 2022 at 05:24:55AM IST, Yonghong Song wrote:
>
>
> On 2/10/22 2:49 PM, Alexei Starovoitov wrote:
> > On Thu, Feb 10, 2022 at 12:05 AM Yonghong Song <yhs@fb.com> wrote:
> > >
> > >
> > > On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote:
> > > > On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:
> > > > >
> > > > >
> > > > > On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
> > > > > > When both bpf_spin_lock and bpf_timer are present in a BPF map value,
> > > > > > copy_map_value needs to skirt both objects when copying a value into and
> > > > > > out of the map. However, the current code does not set both s_off and
> > > > > > t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
> > > > > > is placed in map value with bpf_timer, as bpf_map_update_elem call will
> > > > > > be able to overwrite the other timer object.
> > > > > >
> > > > > > When the issue is not fixed, an overwriting can produce the following
> > > > > > splat:
> > > > > >
> > > > > > [root@(none) bpf]# ./test_progs -t timer_crash
> > > > > > [   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
> > > > > > [   16.037849] ==================================================================
> > > > > > [   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
> > > > > > [   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
> > > > > > [   16.039399]
> > > > > > [   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
> > > > > > [   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
> > > > > > [   16.040485] Call Trace:
> > > > > > [   16.040645]  <TASK>
> > > > > > [   16.040805]  dump_stack_lvl+0x59/0x73
> > > > > > [   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
> > > > > > [   16.041427]  kasan_report.cold+0x116/0x11b
> > > > > > [   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
> > > > > > [   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
> > > > > > [   16.042328]  ? memcpy+0x39/0x60
> > > > > > [   16.042552]  ? pv_hash+0xd0/0xd0
> > > > > > [   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
> > > > > > [   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
> > > > > > [   16.043366]  ? bpf_get_current_comm+0x50/0x50
> > > > > > [   16.043608]  ? jhash+0x11a/0x270
> > > > > > [   16.043848]  bpf_timer_cancel+0x34/0xe0
> > > > > > [   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
> > > > > > [   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
> > > > > > [   16.044836]  __x64_sys_nanosleep+0x5/0x140
> > > > > > [   16.045119]  do_syscall_64+0x59/0x80
> > > > > > [   16.045377]  ? lock_is_held_type+0xe4/0x140
> > > > > > [   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
> > > > > > [   16.046001]  ? mark_held_locks+0x24/0x90
> > > > > > [   16.046287]  ? asm_exc_page_fault+0x1e/0x30
> > > > > > [   16.046569]  ? asm_exc_page_fault+0x8/0x30
> > > > > > [   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
> > > > > > [   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > > > > [   16.047405] RIP: 0033:0x7f9e4831718d
> > > > > > [   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
> > > > > > [   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
> > > > > > [   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
> > > > > > [   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
> > > > > > [   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
> > > > > > [   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
> > > > > > [   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > [   16.051608]  </TASK>
> > > > > > [   16.051762] ==================================================================
> > > > > >
> > > > > > Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> > > > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > > > ---
> > > > > >     include/linux/bpf.h | 3 ++-
> > > > > >     1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > > index fa517ae604ad..31a83449808b 100644
> > > > > > --- a/include/linux/bpf.h
> > > > > > +++ b/include/linux/bpf.h
> > > > > > @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
> > > > > >      if (unlikely(map_value_has_spin_lock(map))) {
> > > > > >              s_off = map->spin_lock_off;
> > > > > >              s_sz = sizeof(struct bpf_spin_lock);
> > > > > > -   } else if (unlikely(map_value_has_timer(map))) {
> > > > > > +   }
> > > > > > +   if (unlikely(map_value_has_timer(map))) {
> > > > > >              t_off = map->timer_off;
> > > > > >              t_sz = sizeof(struct bpf_timer);
> > > > > >      }
> > > > >
> > > > > Thanks for the patch. I think we have a bigger problem here with the patch.
> > > > > It actually exposed a few kernel bugs. If you run current selftests, esp.
> > > > > ./test_progs -j which is what I tried, you will observe
> > > > > various testing failures. The reason is due to we preserved the timer or
> > > > > spin lock information incorrectly for a map value.
> > > > >
> > > > > For example, the selftest #179 (timer) will fail with this patch and
> > > > > the following change can fix it.
> > > > >
> > > >
> > > > I actually only saw the same failures (on bpf/master) as in CI, and it seems
> > > > they are there even when I do a run without my patch (related to uprobes). The
> > > > bpftool patch PR in GitHub also has the same error, so I'm guessing it is
> > > > unrelated to this. I also didn't see any difference when running on bpf-next.
> > > >
> > > > As far as others are concerned, I didn't see the failure for timer test, or any
> > > > other ones, for me all timer tests pass properly after applying it. It could be
> > > > that my test VM is not triggering it, because it may depend on the runtime
> > > > system/memory values, etc.
> > > >
> > > > Can you share what error you see? Does it crash or does it just fail?
> > >
> > > For test #179 (timer), most time I saw a hung. But I also see
> > > the oops in bpf_timer_set_callback().
> > >
> > > >
> > > > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > > > index d29af9988f37..3336d76cc5a6 100644
> > > > > --- a/kernel/bpf/hashtab.c
> > > > > +++ b/kernel/bpf/hashtab.c
> > > > > @@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct
> > > > > bpf_htab *htab, void *key,
> > > > >                           l_new = ERR_PTR(-ENOMEM);
> > > > >                           goto dec_count;
> > > > >                   }
> > > > > -               check_and_init_map_value(&htab->map,
> > > > > -                                        l_new->key + round_up(key_size,
> > > > > 8));
> > > > >           }
> > > > >
> > > > > +       check_and_init_map_value(&htab->map,
> > > > > +                                l_new->key + round_up(key_size, 8));
> > > > > +
> > > >
> > > > Makes sense, but trying to understand why it would fail:
> > > > So this is needed because the reused element from per-CPU region might have
> > > > garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer
> > > > case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.
> > > >
> > > > Earlier copy_map_value further below in this code would also overwrite the timer
> > > > part (which usually may be zero), but that would also not happen anymore.
> > >
> > > That is correct. The preallocated hash tables have a free list. Look
> > > like when an element is put into a free list, its value is not reset.
> >
> > I don't follow. How do you think it can happen?
> > htab_delete/update are calling free_htab_elem()
> > which calls check_and_free_timer().
> > For pre-alloc htab_update calls check_and_free_timer() directly.
> > There should be never a case when timer is active in the free list.
>
> The issue is not a timer active in the free list. It is the timer value
> is not reset to 0 in the free list.
> For example,
>  1. value->timer... is set properly (non zero)
>  2. value is deleted through update or delete, value->timer
>     is cancelled and freed, and the hash_elem is put into
>     free list. But the hash_elem value->timer is not zero.

But in all cases, check_and_free_timer was called right? Which then calls
bpf_timer_cancel_and_free which does this:

  1336 void bpf_timer_cancel_and_free(void *val)
  1337 {
  1338         struct bpf_timer_kern *timer = val;
  1339         struct bpf_hrtimer *t;
  1340
  1341         /* Performance optimization: read timer->timer without lock first. */
  1342         if (!READ_ONCE(timer->timer))
  1343                 return;
  1344
  1345         __bpf_spin_lock_irqsave(&timer->lock);
  1346         /* re-read it under lock */
  1347         t = timer->timer;
  1348         if (!t)
  1349                 goto out;
  1350         drop_prog_refcnt(t);
  1351         /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
  1352          * this timer, since it won't be initialized.
  1353          */
  1354         timer->timer = NULL;
  ...

So the timer->timer was set to NULL in the map value.

>  3. one hash_elem is picked up from the free list,
>     and proper value is copied to the value except value->timer
>     and value->spinlock (if they exist). This happens with this patch.
>  4. some later kernel functions may see value->timer is set and
>     do something bad ...

--
Kartikeya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-11  0:02             ` Kumar Kartikeya Dwivedi
@ 2022-02-11  0:13               ` Yonghong Song
  2022-02-11  1:04                 ` Yonghong Song
  0 siblings, 1 reply; 14+ messages in thread
From: Yonghong Song @ 2022-02-11  0:13 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko



On 2/10/22 4:02 PM, Kumar Kartikeya Dwivedi wrote:
> On Fri, Feb 11, 2022 at 05:24:55AM IST, Yonghong Song wrote:
>>
>>
>> On 2/10/22 2:49 PM, Alexei Starovoitov wrote:
>>> On Thu, Feb 10, 2022 at 12:05 AM Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>>
>>>> On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote:
>>>>> On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
>>>>>>> When both bpf_spin_lock and bpf_timer are present in a BPF map value,
>>>>>>> copy_map_value needs to skirt both objects when copying a value into and
>>>>>>> out of the map. However, the current code does not set both s_off and
>>>>>>> t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
>>>>>>> is placed in map value with bpf_timer, as bpf_map_update_elem call will
>>>>>>> be able to overwrite the other timer object.
>>>>>>>
>>>>>>> When the issue is not fixed, an overwriting can produce the following
>>>>>>> splat:
>>>>>>>
>>>>>>> [root@(none) bpf]# ./test_progs -t timer_crash
>>>>>>> [   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
>>>>>>> [   16.037849] ==================================================================
>>>>>>> [   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>>>> [   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
>>>>>>> [   16.039399]
>>>>>>> [   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
>>>>>>> [   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
>>>>>>> [   16.040485] Call Trace:
>>>>>>> [   16.040645]  <TASK>
>>>>>>> [   16.040805]  dump_stack_lvl+0x59/0x73
>>>>>>> [   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>>>> [   16.041427]  kasan_report.cold+0x116/0x11b
>>>>>>> [   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>>>> [   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>>>> [   16.042328]  ? memcpy+0x39/0x60
>>>>>>> [   16.042552]  ? pv_hash+0xd0/0xd0
>>>>>>> [   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
>>>>>>> [   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
>>>>>>> [   16.043366]  ? bpf_get_current_comm+0x50/0x50
>>>>>>> [   16.043608]  ? jhash+0x11a/0x270
>>>>>>> [   16.043848]  bpf_timer_cancel+0x34/0xe0
>>>>>>> [   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
>>>>>>> [   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
>>>>>>> [   16.044836]  __x64_sys_nanosleep+0x5/0x140
>>>>>>> [   16.045119]  do_syscall_64+0x59/0x80
>>>>>>> [   16.045377]  ? lock_is_held_type+0xe4/0x140
>>>>>>> [   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
>>>>>>> [   16.046001]  ? mark_held_locks+0x24/0x90
>>>>>>> [   16.046287]  ? asm_exc_page_fault+0x1e/0x30
>>>>>>> [   16.046569]  ? asm_exc_page_fault+0x8/0x30
>>>>>>> [   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
>>>>>>> [   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>> [   16.047405] RIP: 0033:0x7f9e4831718d
>>>>>>> [   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
>>>>>>> [   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
>>>>>>> [   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
>>>>>>> [   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
>>>>>>> [   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
>>>>>>> [   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
>>>>>>> [   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>>>>> [   16.051608]  </TASK>
>>>>>>> [   16.051762] ==================================================================
>>>>>>>
>>>>>>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
>>>>>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>>>>>> ---
>>>>>>>      include/linux/bpf.h | 3 ++-
>>>>>>>      1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>>>> index fa517ae604ad..31a83449808b 100644
>>>>>>> --- a/include/linux/bpf.h
>>>>>>> +++ b/include/linux/bpf.h
>>>>>>> @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>>>>>>>       if (unlikely(map_value_has_spin_lock(map))) {
>>>>>>>               s_off = map->spin_lock_off;
>>>>>>>               s_sz = sizeof(struct bpf_spin_lock);
>>>>>>> -   } else if (unlikely(map_value_has_timer(map))) {
>>>>>>> +   }
>>>>>>> +   if (unlikely(map_value_has_timer(map))) {
>>>>>>>               t_off = map->timer_off;
>>>>>>>               t_sz = sizeof(struct bpf_timer);
>>>>>>>       }
>>>>>>
>>>>>> Thanks for the patch. I think we have a bigger problem here with the patch.
>>>>>> It actually exposed a few kernel bugs. If you run current selftests, esp.
>>>>>> ./test_progs -j which is what I tried, you will observe
>>>>>> various testing failures. The reason is due to we preserved the timer or
>>>>>> spin lock information incorrectly for a map value.
>>>>>>
>>>>>> For example, the selftest #179 (timer) will fail with this patch and
>>>>>> the following change can fix it.
>>>>>>
>>>>>
>>>>> I actually only saw the same failures (on bpf/master) as in CI, and it seems
>>>>> they are there even when I do a run without my patch (related to uprobes). The
>>>>> bpftool patch PR in GitHub also has the same error, so I'm guessing it is
>>>>> unrelated to this. I also didn't see any difference when running on bpf-next.
>>>>>
>>>>> As far as others are concerned, I didn't see the failure for timer test, or any
>>>>> other ones, for me all timer tests pass properly after applying it. It could be
>>>>> that my test VM is not triggering it, because it may depend on the runtime
>>>>> system/memory values, etc.
>>>>>
>>>>> Can you share what error you see? Does it crash or does it just fail?
>>>>
>>>> For test #179 (timer), most time I saw a hung. But I also see
>>>> the oops in bpf_timer_set_callback().
>>>>
>>>>>
>>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>>> index d29af9988f37..3336d76cc5a6 100644
>>>>>> --- a/kernel/bpf/hashtab.c
>>>>>> +++ b/kernel/bpf/hashtab.c
>>>>>> @@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct
>>>>>> bpf_htab *htab, void *key,
>>>>>>                            l_new = ERR_PTR(-ENOMEM);
>>>>>>                            goto dec_count;
>>>>>>                    }
>>>>>> -               check_and_init_map_value(&htab->map,
>>>>>> -                                        l_new->key + round_up(key_size,
>>>>>> 8));
>>>>>>            }
>>>>>>
>>>>>> +       check_and_init_map_value(&htab->map,
>>>>>> +                                l_new->key + round_up(key_size, 8));
>>>>>> +
>>>>>
>>>>> Makes sense, but trying to understand why it would fail:
>>>>> So this is needed because the reused element from per-CPU region might have
>>>>> garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer
>>>>> case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.
>>>>>
>>>>> Earlier copy_map_value further below in this code would also overwrite the timer
>>>>> part (which usually may be zero), but that would also not happen anymore.
>>>>
>>>> That is correct. The preallocated hash tables have a free list. Look
>>>> like when an element is put into a free list, its value is not reset.
>>>
>>> I don't follow. How do you think it can happen?
>>> htab_delete/update are calling free_htab_elem()
>>> which calls check_and_free_timer().
>>> For pre-alloc htab_update calls check_and_free_timer() directly.
>>> There should be never a case when timer is active in the free list.
>>
>> The issue is not a timer active in the free list. It is the timer value
>> is not reset to 0 in the free list.
>> For example,
>>   1. value->timer... is set properly (non zero)
>>   2. value is deleted through update or delete, value->timer
>>      is cancelled and freed, and the hash_elem is put into
>>      free list. But the hash_elem value->timer is not zero.
> 
> But in all cases, check_and_free_timer was called right? Which then calls
> bpf_timer_cancel_and_free which does this:
> 
>    1336 void bpf_timer_cancel_and_free(void *val)
>    1337 {
>    1338         struct bpf_timer_kern *timer = val;
>    1339         struct bpf_hrtimer *t;
>    1340
>    1341         /* Performance optimization: read timer->timer without lock first. */
>    1342         if (!READ_ONCE(timer->timer))
>    1343                 return;
>    1344
>    1345         __bpf_spin_lock_irqsave(&timer->lock);
>    1346         /* re-read it under lock */
>    1347         t = timer->timer;
>    1348         if (!t)
>    1349                 goto out;
>    1350         drop_prog_refcnt(t);
>    1351         /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>    1352          * this timer, since it won't be initialized.
>    1353          */
>    1354         timer->timer = NULL;
>    ...
> 
> So the timer->timer was set to NULL in the map value.

Looking at one call site:

                 bpf_timer_cancel_and_free(elem->key +
                                           round_up(htab->map.key_size, 8) +
                                           htab->map.timer_off);

So I am talking about to have
    *(struct bpf_timer *)value = (struct bpf_timer){};
The timer->timer is for bpf_hrtimer.

> 
>>   3. one hash_elem is picked up from the free list,
>>      and proper value is copied to the value except value->timer
>>      and value->spinlock (if they exist). This happens with this patch.
>>   4. some later kernel functions may see value->timer is set and
>>      do something bad ...
> 
> --
> Kartikeya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value
  2022-02-11  0:13               ` Yonghong Song
@ 2022-02-11  1:04                 ` Yonghong Song
  0 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2022-02-11  1:04 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko



On 2/10/22 4:13 PM, Yonghong Song wrote:
> 
> 
> On 2/10/22 4:02 PM, Kumar Kartikeya Dwivedi wrote:
>> On Fri, Feb 11, 2022 at 05:24:55AM IST, Yonghong Song wrote:
>>>
>>>
>>> On 2/10/22 2:49 PM, Alexei Starovoitov wrote:
>>>> On Thu, Feb 10, 2022 at 12:05 AM Yonghong Song <yhs@fb.com> wrote:
>>>>>
>>>>>
>>>>> On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote:
>>>>>> On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
>>>>>>>> When both bpf_spin_lock and bpf_timer are present in a BPF map 
>>>>>>>> value,
>>>>>>>> copy_map_value needs to skirt both objects when copying a value 
>>>>>>>> into and
>>>>>>>> out of the map. However, the current code does not set both 
>>>>>>>> s_off and
>>>>>>>> t_off in copy_map_value, which leads to a crash when e.g. 
>>>>>>>> bpf_spin_lock
>>>>>>>> is placed in map value with bpf_timer, as bpf_map_update_elem 
>>>>>>>> call will
>>>>>>>> be able to overwrite the other timer object.
>>>>>>>>
>>>>>>>> When the issue is not fixed, an overwriting can produce the 
>>>>>>>> following
>>>>>>>> splat:
>>>>>>>>
>>>>>>>> [root@(none) bpf]# ./test_progs -t timer_crash
>>>>>>>> [   15.930339] bpf_testmod: loading out-of-tree module taints 
>>>>>>>> kernel.
>>>>>>>> [   16.037849] 
>>>>>>>> ==================================================================
>>>>>>>> [   16.038458] BUG: KASAN: user-memory-access in 
>>>>>>>> __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>>>>> [   16.038944] Write of size 8 at addr 0000000000043ec0 by task 
>>>>>>>> test_progs/325
>>>>>>>> [   16.039399]
>>>>>>>> [   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: 
>>>>>>>> G           OE     5.16.0+ #278
>>>>>>>> [   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>>>>>>> 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
>>>>>>>> [   16.040485] Call Trace:
>>>>>>>> [   16.040645]  <TASK>
>>>>>>>> [   16.040805]  dump_stack_lvl+0x59/0x73
>>>>>>>> [   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>>>>> [   16.041427]  kasan_report.cold+0x116/0x11b
>>>>>>>> [   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>>>>> [   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
>>>>>>>> [   16.042328]  ? memcpy+0x39/0x60
>>>>>>>> [   16.042552]  ? pv_hash+0xd0/0xd0
>>>>>>>> [   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
>>>>>>>> [   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
>>>>>>>> [   16.043366]  ? bpf_get_current_comm+0x50/0x50
>>>>>>>> [   16.043608]  ? jhash+0x11a/0x270
>>>>>>>> [   16.043848]  bpf_timer_cancel+0x34/0xe0
>>>>>>>> [   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
>>>>>>>> [   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
>>>>>>>> [   16.044836]  __x64_sys_nanosleep+0x5/0x140
>>>>>>>> [   16.045119]  do_syscall_64+0x59/0x80
>>>>>>>> [   16.045377]  ? lock_is_held_type+0xe4/0x140
>>>>>>>> [   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
>>>>>>>> [   16.046001]  ? mark_held_locks+0x24/0x90
>>>>>>>> [   16.046287]  ? asm_exc_page_fault+0x1e/0x30
>>>>>>>> [   16.046569]  ? asm_exc_page_fault+0x8/0x30
>>>>>>>> [   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
>>>>>>>> [   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>>> [   16.047405] RIP: 0033:0x7f9e4831718d
>>>>>>>> [   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 
>>>>>>>> 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 
>>>>>>>> 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 
>>>>>>>> 0c 00 f7 d8 64 89 01 48
>>>>>>>> [   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 
>>>>>>>> ORIG_RAX: 0000000000000023
>>>>>>>> [   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 
>>>>>>>> 00007f9e4831718d
>>>>>>>> [   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
>>>>>>>> 00007fff488086d0
>>>>>>>> [   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 
>>>>>>>> 00007f9e4cb594a0
>>>>>>>> [   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 
>>>>>>>> 00007f9e484cde30
>>>>>>>> [   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 
>>>>>>>> 0000000000000000
>>>>>>>> [   16.051608]  </TASK>
>>>>>>>> [   16.051762] 
>>>>>>>> ==================================================================
>>>>>>>>
>>>>>>>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
>>>>>>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>>>>>>> ---
>>>>>>>>      include/linux/bpf.h | 3 ++-
>>>>>>>>      1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>>>>> index fa517ae604ad..31a83449808b 100644
>>>>>>>> --- a/include/linux/bpf.h
>>>>>>>> +++ b/include/linux/bpf.h
>>>>>>>> @@ -224,7 +224,8 @@ static inline void copy_map_value(struct 
>>>>>>>> bpf_map *map, void *dst, void *src)
>>>>>>>>       if (unlikely(map_value_has_spin_lock(map))) {
>>>>>>>>               s_off = map->spin_lock_off;
>>>>>>>>               s_sz = sizeof(struct bpf_spin_lock);
>>>>>>>> -   } else if (unlikely(map_value_has_timer(map))) {
>>>>>>>> +   }
>>>>>>>> +   if (unlikely(map_value_has_timer(map))) {
>>>>>>>>               t_off = map->timer_off;
>>>>>>>>               t_sz = sizeof(struct bpf_timer);
>>>>>>>>       }
>>>>>>>
>>>>>>> Thanks for the patch. I think we have a bigger problem here with 
>>>>>>> the patch.
>>>>>>> It actually exposed a few kernel bugs. If you run current 
>>>>>>> selftests, esp.
>>>>>>> ./test_progs -j which is what I tried, you will observe
>>>>>>> various testing failures. The reason is due to we preserved the 
>>>>>>> timer or
>>>>>>> spin lock information incorrectly for a map value.
>>>>>>>
>>>>>>> For example, the selftest #179 (timer) will fail with this patch and
>>>>>>> the following change can fix it.
>>>>>>>
>>>>>>
>>>>>> I actually only saw the same failures (on bpf/master) as in CI, 
>>>>>> and it seems
>>>>>> they are there even when I do a run without my patch (related to 
>>>>>> uprobes). The
>>>>>> bpftool patch PR in GitHub also has the same error, so I'm 
>>>>>> guessing it is
>>>>>> unrelated to this. I also didn't see any difference when running 
>>>>>> on bpf-next.
>>>>>>
>>>>>> As far as others are concerned, I didn't see the failure for timer 
>>>>>> test, or any
>>>>>> other ones, for me all timer tests pass properly after applying 
>>>>>> it. It could be
>>>>>> that my test VM is not triggering it, because it may depend on the 
>>>>>> runtime
>>>>>> system/memory values, etc.
>>>>>>
>>>>>> Can you share what error you see? Does it crash or does it just fail?
>>>>>
>>>>> For test #179 (timer), most time I saw a hung. But I also see
>>>>> the oops in bpf_timer_set_callback().
>>>>>
>>>>>>
>>>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>>>> index d29af9988f37..3336d76cc5a6 100644
>>>>>>> --- a/kernel/bpf/hashtab.c
>>>>>>> +++ b/kernel/bpf/hashtab.c
>>>>>>> @@ -961,10 +961,11 @@ static struct htab_elem 
>>>>>>> *alloc_htab_elem(struct
>>>>>>> bpf_htab *htab, void *key,
>>>>>>>                            l_new = ERR_PTR(-ENOMEM);
>>>>>>>                            goto dec_count;
>>>>>>>                    }
>>>>>>> -               check_and_init_map_value(&htab->map,
>>>>>>> -                                        l_new->key + 
>>>>>>> round_up(key_size,
>>>>>>> 8));
>>>>>>>            }
>>>>>>>
>>>>>>> +       check_and_init_map_value(&htab->map,
>>>>>>> +                                l_new->key + round_up(key_size, 
>>>>>>> 8));
>>>>>>> +
>>>>>>
>>>>>> Makes sense, but trying to understand why it would fail:
>>>>>> So this is needed because the reused element from per-CPU region 
>>>>>> might have
>>>>>> garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast 
>>>>>> for timer
>>>>>> case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.
>>>>>>
>>>>>> Earlier copy_map_value further below in this code would also 
>>>>>> overwrite the timer
>>>>>> part (which usually may be zero), but that would also not happen 
>>>>>> anymore.
>>>>>
>>>>> That is correct. The preallocated hash tables have a free list. Look
>>>>> like when an element is put into a free list, its value is not reset.
>>>>
>>>> I don't follow. How do you think it can happen?
>>>> htab_delete/update are calling free_htab_elem()
>>>> which calls check_and_free_timer().
>>>> For pre-alloc htab_update calls check_and_free_timer() directly.
>>>> There should be never a case when timer is active in the free list.
>>>
>>> The issue is not a timer active in the free list. It is the timer value
>>> is not reset to 0 in the free list.
>>> For example,
>>>   1. value->timer... is set properly (non zero)
>>>   2. value is deleted through update or delete, value->timer
>>>      is cancelled and freed, and the hash_elem is put into
>>>      free list. But the hash_elem value->timer is not zero.
>>
>> But in all cases, check_and_free_timer was called right? Which then calls
>> bpf_timer_cancel_and_free which does this:
>>
>>    1336 void bpf_timer_cancel_and_free(void *val)
>>    1337 {
>>    1338         struct bpf_timer_kern *timer = val;
>>    1339         struct bpf_hrtimer *t;
>>    1340
>>    1341         /* Performance optimization: read timer->timer without 
>> lock first. */
>>    1342         if (!READ_ONCE(timer->timer))
>>    1343                 return;
>>    1344
>>    1345         __bpf_spin_lock_irqsave(&timer->lock);
>>    1346         /* re-read it under lock */
>>    1347         t = timer->timer;
>>    1348         if (!t)
>>    1349                 goto out;
>>    1350         drop_prog_refcnt(t);
>>    1351         /* The subsequent bpf_timer_start/cancel() helpers 
>> won't be able to use
>>    1352          * this timer, since it won't be initialized.
>>    1353          */
>>    1354         timer->timer = NULL;
>>    ...
>>
>> So the timer->timer was set to NULL in the map value.
> 
> Looking at one call site:
> 
>                  bpf_timer_cancel_and_free(elem->key +
>                                            round_up(htab->map.key_size, 
> 8) +
>                                            htab->map.timer_off);
> 
> So I am talking about to have
>     *(struct bpf_timer *)value = (struct bpf_timer){};
> The timer->timer is for bpf_hrtimer.

Sorry. You are correct. timer->timer indeed sets the 'timer' part of
the value to NULL (0). It is probably somewhere else is leaking.
Let me check again.

> 
>>
>>>   3. one hash_elem is picked up from the free list,
>>>      and proper value is copied to the value except value->timer
>>>      and value->spinlock (if they exist). This happens with this patch.
>>>   4. some later kernel functions may see value->timer is set and
>>>      do something bad ...
>>
>> -- 
>> Kartikeya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value
  2022-02-09  7:03 [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value Kumar Kartikeya Dwivedi
  2022-02-09  7:03 ` [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value Kumar Kartikeya Dwivedi
  2022-02-09  7:03 ` [PATCH bpf v2 2/2] selftests/bpf: Add test for bpf_timer overwriting crash Kumar Kartikeya Dwivedi
@ 2022-02-11 21:30 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 14+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-02-11 21:30 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf, ast, daniel, andrii

Hello:

This series was applied to bpf/bpf.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Wed,  9 Feb 2022 12:33:22 +0530 you wrote:
> A fix for an oversight in copy_map_value that leads to kernel crash.
> 
> Also, a question for BPF developers:
> It seems in arraymap.c, we always do check_and_free_timer_in_array after we do
> copy_map_value in map_update_elem callback, but the same is not done for
> hashtab.c. Is there a specific reason for this difference in behavior, or did I
> miss that it happens for hashtab.c as well?
> 
> [...]

Here is the summary with links:
  - [bpf,v2,1/2] bpf: Fix crash due to incorrect copy_map_value
    https://git.kernel.org/bpf/bpf/c/a8abb0c3dc1e
  - [bpf,v2,2/2] selftests/bpf: Add test for bpf_timer overwriting crash
    https://git.kernel.org/bpf/bpf/c/a7e75016a075

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-02-11 21:30 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-09  7:03 [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value Kumar Kartikeya Dwivedi
2022-02-09  7:03 ` [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value Kumar Kartikeya Dwivedi
2022-02-09 19:06   ` Yonghong Song
2022-02-09 19:52     ` Kumar Kartikeya Dwivedi
2022-02-10  8:05       ` Yonghong Song
2022-02-10 22:49         ` Alexei Starovoitov
2022-02-10 23:54           ` Yonghong Song
2022-02-11  0:02             ` Kumar Kartikeya Dwivedi
2022-02-11  0:13               ` Yonghong Song
2022-02-11  1:04                 ` Yonghong Song
2022-02-09  7:03 ` [PATCH bpf v2 2/2] selftests/bpf: Add test for bpf_timer overwriting crash Kumar Kartikeya Dwivedi
2022-02-09 20:03   ` Kumar Kartikeya Dwivedi
2022-02-10  8:10     ` Yonghong Song
2022-02-11 21:30 ` [PATCH bpf v2 0/2] Fix for crash due to overwrite in copy_map_value patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.