bpf.vger.kernel.org archive mirror
* [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask
@ 2023-01-05  9:26 tong
  2023-01-05  9:26 ` [bpf-next v4 2/2] selftests/bpf: add test case for htab map tong
  2023-01-10  1:52 ` [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask Martin KaFai Lau
  0 siblings, 2 replies; 11+ messages in thread
From: tong @ 2023-01-05  9:26 UTC (permalink / raw)
  To: bpf
  Cc: Tonghao Zhang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Hou Tao

From: Tonghao Zhang <tong@infragraf.org>

A deadlock may still occur when the map is accessed in both NMI and
non-NMI context, because in NMI context we may access the same bucket
while using a different map_locked index.

For example, on the same CPU with .max_entries = 2, a task updates the
hash map with key = 4 while a bpf prog running in NMI context
(nmi_handle()) updates it with key = 20. The two keys hash to the same
bucket index but to different map_locked indexes.

To fix this issue, mask the hash with the minimum of
HASHTAB_MAP_LOCK_MASK and n_buckets - 1 before indexing map_locked.

Signed-off-by: Tonghao Zhang <tong@infragraf.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 5aa2b5525f79..974f104f47a0 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -152,7 +152,7 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
 {
 	unsigned long flags;
 
-	hash = hash & HASHTAB_MAP_LOCK_MASK;
+	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
 
 	preempt_disable();
 	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
@@ -171,7 +171,7 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
 				      struct bucket *b, u32 hash,
 				      unsigned long flags)
 {
-	hash = hash & HASHTAB_MAP_LOCK_MASK;
+	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
 	raw_spin_unlock_irqrestore(&b->raw_lock, flags);
 	__this_cpu_dec(*(htab->map_locked[hash]));
 	preempt_enable();
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [bpf-next v4 2/2] selftests/bpf: add test case for htab map
  2023-01-05  9:26 [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask tong
@ 2023-01-05  9:26 ` tong
  2023-01-05 10:56   ` Hou Tao
  2023-01-10  1:33   ` Martin KaFai Lau
  2023-01-10  1:52 ` [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask Martin KaFai Lau
  1 sibling, 2 replies; 11+ messages in thread
From: tong @ 2023-01-05  9:26 UTC (permalink / raw)
  To: bpf
  Cc: Tonghao Zhang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Hou Tao

From: Tonghao Zhang <tong@infragraf.org>

This test shows how to reproduce the deadlock in a special case: we
update the htab map in both task and NMI context. The task can be
interrupted by an NMI; if the same map bucket is locked, a deadlock
occurs.

* The map's max_entries is 2.
* NMI context uses key 4 and task context uses key 20.
* The bucket index is therefore the same, but the map_locked index differs.

The selftest uses perf to generate NMIs and attaches an fentry prog to
perf_event_overflow. Note that bpf_overflow_handler checks
bpf_prog_active, and the bpf map update syscall increases this counter
via bpf_disable_instrumentation, so a plain perf_event prog would be
skipped. Attaching fentry to the overflow path and updating the hash map
from there reproduces the issue.

Signed-off-by: Tonghao Zhang <tong@infragraf.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/DENYLIST.aarch64  |  1 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |  1 +
 .../selftests/bpf/prog_tests/htab_deadlock.c  | 75 +++++++++++++++++++
 .../selftests/bpf/progs/htab_deadlock.c       | 30 ++++++++
 4 files changed, 107 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_deadlock.c

diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64 b/tools/testing/selftests/bpf/DENYLIST.aarch64
index 99cc33c51eaa..42d98703f209 100644
--- a/tools/testing/selftests/bpf/DENYLIST.aarch64
+++ b/tools/testing/selftests/bpf/DENYLIST.aarch64
@@ -24,6 +24,7 @@ fexit_test                                       # fexit_attach unexpected error
 get_func_args_test                               # get_func_args_test__attach unexpected error: -524 (errno 524) (trampoline)
 get_func_ip_test                                 # get_func_ip_test__attach unexpected error: -524 (errno 524) (trampoline)
 htab_update/reenter_update
+htab_deadlock                                    # fentry failed: -524 (trampoline)
 kfree_skb                                        # attach fentry unexpected error: -524 (trampoline)
 kfunc_call/subprog                               # extern (var ksym) 'bpf_prog_active': not found in kernel BTF
 kfunc_call/subprog_lskel                         # skel unexpected error: -2
diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index 3efe091255bf..ab11f71842a5 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -26,6 +26,7 @@ get_func_args_test	                 # trampoline
 get_func_ip_test                         # get_func_ip_test__attach unexpected error: -524                             (trampoline)
 get_stack_raw_tp                         # user_stack corrupted user stack                                             (no backchain userspace)
 htab_update                              # failed to attach: ERROR: strerror_r(-524)=22                                (trampoline)
+htab_deadlock                            # fentry failed: -524                                                         (trampoline)
 jit_probe_mem                            # jit_probe_mem__open_and_load unexpected error: -524                         (kfunc)
 kfree_skb                                # attach fentry unexpected error: -524                                        (trampoline)
 kfunc_call                               # 'bpf_prog_active': not found in kernel BTF                                  (?)
diff --git a/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
new file mode 100644
index 000000000000..137dce8f1346
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 DiDi Global Inc. */
+#define _GNU_SOURCE
+#include <pthread.h>
+#include <sched.h>
+#include <test_progs.h>
+
+#include "htab_deadlock.skel.h"
+
+static int perf_event_open(void)
+{
+	struct perf_event_attr attr = {0};
+	int pfd;
+
+	/* create perf event on CPU 0 */
+	attr.size = sizeof(attr);
+	attr.type = PERF_TYPE_HARDWARE;
+	attr.config = PERF_COUNT_HW_CPU_CYCLES;
+	attr.freq = 1;
+	attr.sample_freq = 1000;
+	pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
+
+	return pfd >= 0 ? pfd : -errno;
+}
+
+void test_htab_deadlock(void)
+{
+	unsigned int val = 0, key = 20;
+	struct bpf_link *link = NULL;
+	struct htab_deadlock *skel;
+	int err, i, pfd;
+	cpu_set_t cpus;
+
+	skel = htab_deadlock__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	err = htab_deadlock__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto clean_skel;
+
+	/* NMI events. */
+	pfd = perf_event_open();
+	if (pfd < 0) {
+		if (pfd == -ENOENT || pfd == -EOPNOTSUPP) {
+			printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n", __func__);
+			test__skip();
+			goto clean_skel;
+		}
+		if (!ASSERT_GE(pfd, 0, "perf_event_open"))
+			goto clean_skel;
+	}
+
+	link = bpf_program__attach_perf_event(skel->progs.bpf_empty, pfd);
+	if (!ASSERT_OK_PTR(link, "attach_perf_event"))
+		goto clean_pfd;
+
+	/* Pinned on CPU 0 */
+	CPU_ZERO(&cpus);
+	CPU_SET(0, &cpus);
+	pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
+
+	/* update bpf map concurrently on CPU0 in NMI and Task context.
+	 * there should be no kernel deadlock.
+	 */
+	for (i = 0; i < 100000; i++)
+		bpf_map_update_elem(bpf_map__fd(skel->maps.htab),
+				    &key, &val, BPF_ANY);
+
+	bpf_link__destroy(link);
+clean_pfd:
+	close(pfd);
+clean_skel:
+	htab_deadlock__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/htab_deadlock.c b/tools/testing/selftests/bpf/progs/htab_deadlock.c
new file mode 100644
index 000000000000..dacd003b1ccb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/htab_deadlock.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 DiDi Global Inc. */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 2);
+	__uint(map_flags, BPF_F_ZERO_SEED);
+	__type(key, unsigned int);
+	__type(value, unsigned int);
+} htab SEC(".maps");
+
+SEC("fentry/perf_event_overflow")
+int bpf_nmi_handle(struct pt_regs *regs)
+{
+	unsigned int val = 0, key = 4;
+
+	bpf_map_update_elem(&htab, &key, &val, BPF_ANY);
+	return 0;
+}
+
+SEC("perf_event")
+int bpf_empty(struct pt_regs *regs)
+{
+	return 0;
+}
-- 
2.27.0



* Re: [bpf-next v4 2/2] selftests/bpf: add test case for htab map
  2023-01-05  9:26 ` [bpf-next v4 2/2] selftests/bpf: add test case for htab map tong
@ 2023-01-05 10:56   ` Hou Tao
  2023-01-10  1:33   ` Martin KaFai Lau
  1 sibling, 0 replies; 11+ messages in thread
From: Hou Tao @ 2023-01-05 10:56 UTC (permalink / raw)
  To: tong, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa



On 1/5/2023 5:26 PM, tong@infragraf.org wrote:
> From: Tonghao Zhang <tong@infragraf.org>
>
> This testing show how to reproduce deadlock in special case.
> We update htab map in Task and NMI context. Task can be interrupted by
> NMI, if the same map bucket was locked, there will be a deadlock.
>
> * map max_entries is 2.
> * NMI using key 4 and Task context using key 20.
> * so same bucket index but map_locked index is different.
>
> The selftest use perf to produce the NMI and fentry nmi_handle.
> Note that bpf_overflow_handler checks bpf_prog_active, but in bpf update
> map syscall increase this counter in bpf_disable_instrumentation.
> Then fentry nmi_handle and update hash map will reproduce the issue.
>
> Signed-off-by: Tonghao Zhang <tong@infragraf.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Song Liu <song@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Hou Tao <houtao1@huawei.com>
> Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hou Tao <houtao1@huawei.com>
> [...]



* Re: [bpf-next v4 2/2] selftests/bpf: add test case for htab map
  2023-01-05  9:26 ` [bpf-next v4 2/2] selftests/bpf: add test case for htab map tong
  2023-01-05 10:56   ` Hou Tao
@ 2023-01-10  1:33   ` Martin KaFai Lau
  2023-01-10  2:21     ` Tonghao Zhang
  1 sibling, 1 reply; 11+ messages in thread
From: Martin KaFai Lau @ 2023-01-10  1:33 UTC (permalink / raw)
  To: tong
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Hou Tao, bpf, Manu Bretelle

On 1/5/23 1:26 AM, tong@infragraf.org wrote:
> diff --git a/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
> new file mode 100644
> index 000000000000..137dce8f1346
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
> @@ -0,0 +1,75 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2022 DiDi Global Inc. */
> +#define _GNU_SOURCE
> +#include <pthread.h>
> +#include <sched.h>
> +#include <test_progs.h>
> +
> +#include "htab_deadlock.skel.h"
> +
> +static int perf_event_open(void)
> +{
> +	struct perf_event_attr attr = {0};
> +	int pfd;
> +
> +	/* create perf event on CPU 0 */
> +	attr.size = sizeof(attr);
> +	attr.type = PERF_TYPE_HARDWARE;
> +	attr.config = PERF_COUNT_HW_CPU_CYCLES;
> +	attr.freq = 1;
> +	attr.sample_freq = 1000;
> +	pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
> +
> +	return pfd >= 0 ? pfd : -errno;
> +}
> +
> +void test_htab_deadlock(void)
> +{
> +	unsigned int val = 0, key = 20;
> +	struct bpf_link *link = NULL;
> +	struct htab_deadlock *skel;
> +	int err, i, pfd;
> +	cpu_set_t cpus;
> +
> +	skel = htab_deadlock__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
> +		return;
> +
> +	err = htab_deadlock__attach(skel);
> +	if (!ASSERT_OK(err, "skel_attach"))
> +		goto clean_skel;
> +
> +	/* NMI events. */
> +	pfd = perf_event_open();
> +	if (pfd < 0) {
> +		if (pfd == -ENOENT || pfd == -EOPNOTSUPP) {
> +			printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n", __func__);
> +			test__skip();

This test is a SKIP in bpf CI, so it won't be useful.
https://github.com/kernel-patches/bpf/actions/runs/3858084722/jobs/6579470256#step:6:5198

Is there another way to test it, or do you know what may be missing in
vmtest.sh? Not sure if the cloud setup in CI blocks HW_CPU_CYCLES. If it
does, I also don't know a good way around it (Cc: Manu).

> +			goto clean_skel;
> +		}
> +		if (!ASSERT_GE(pfd, 0, "perf_event_open"))
> +			goto clean_skel;
> +	}
> +
> +	link = bpf_program__attach_perf_event(skel->progs.bpf_empty, pfd);
> +	if (!ASSERT_OK_PTR(link, "attach_perf_event"))
> +		goto clean_pfd;
> +
> +	/* Pinned on CPU 0 */
> +	CPU_ZERO(&cpus);
> +	CPU_SET(0, &cpus);
> +	pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
> +
> +	/* update bpf map concurrently on CPU0 in NMI and Task context.
> +	 * there should be no kernel deadlock.
> +	 */
> +	for (i = 0; i < 100000; i++)
> +		bpf_map_update_elem(bpf_map__fd(skel->maps.htab),
> +				    &key, &val, BPF_ANY);
> +
> +	bpf_link__destroy(link);
> +clean_pfd:
> +	close(pfd);
> +clean_skel:
> +	htab_deadlock__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/htab_deadlock.c b/tools/testing/selftests/bpf/progs/htab_deadlock.c
> new file mode 100644
> index 000000000000..dacd003b1ccb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/htab_deadlock.c
> @@ -0,0 +1,30 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2022 DiDi Global Inc. */
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, 2);
> +	__uint(map_flags, BPF_F_ZERO_SEED);
> +	__type(key, unsigned int);
> +	__type(value, unsigned int);
> +} htab SEC(".maps");
> +
> +SEC("fentry/perf_event_overflow")
> +int bpf_nmi_handle(struct pt_regs *regs)
> +{
> +	unsigned int val = 0, key = 4;
> +
> +	bpf_map_update_elem(&htab, &key, &val, BPF_ANY);

I ran it in my qemu setup which does not skip the test.  I got this splat though:

[   42.990306] ================================
[   42.990307] WARNING: inconsistent lock state
[   42.990310] 6.2.0-rc2-00304-gaf88a1bb9967 #409 Tainted: G           O
[   42.990313] --------------------------------
[   42.990315] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[   42.990317] test_progs/1546 [HC1[1]:SC0[0]:HE0:SE1] takes:
[   42.990322] ffff888101245768 (&htab->lockdep_key){....}-{2:2}, at: 
htab_map_update_elem+0x1e7/0x810
[   42.990340] {INITIAL USE} state was registered at:
[   42.990341]   lock_acquire+0x1e6/0x530
[   42.990351]   _raw_spin_lock_irqsave+0xb8/0x100
[   42.990362]   htab_map_update_elem+0x1e7/0x810
[   42.990365]   bpf_map_update_value+0x40d/0x4f0
[   42.990371]   map_update_elem+0x423/0x580
[   42.990375]   __sys_bpf+0x54e/0x670
[   42.990377]   __x64_sys_bpf+0x7c/0x90
[   42.990382]   do_syscall_64+0x43/0x90
[   42.990387]   entry_SYSCALL_64_after_hwframe+0x72/0xdc

Please check.

> +	return 0;
> +}
> +
> +SEC("perf_event")
> +int bpf_empty(struct pt_regs *regs)
> +{

btw, from a quick look at __perf_event_overflow, I suspect doing the
bpf_map_update_elem() here instead of in the fentry/perf_event_overflow
prog above could also reproduce the issue from patch 1?

> +	return 0;
> +}



* Re: [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask
  2023-01-05  9:26 [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask tong
  2023-01-05  9:26 ` [bpf-next v4 2/2] selftests/bpf: add test case for htab map tong
@ 2023-01-10  1:52 ` Martin KaFai Lau
  2023-01-10  2:25   ` Tonghao Zhang
  1 sibling, 1 reply; 11+ messages in thread
From: Martin KaFai Lau @ 2023-01-10  1:52 UTC (permalink / raw)
  To: tong, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Hou Tao

On 1/5/23 1:26 AM, tong@infragraf.org wrote:
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 5aa2b5525f79..974f104f47a0 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -152,7 +152,7 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>   {
>   	unsigned long flags;
>   
> -	hash = hash & HASHTAB_MAP_LOCK_MASK;
> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>   
>   	preempt_disable();
>   	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> @@ -171,7 +171,7 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>   				      struct bucket *b, u32 hash,
>   				      unsigned long flags)
>   {
> -	hash = hash & HASHTAB_MAP_LOCK_MASK;
> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);

Please run checkpatch.pl.  patchwork also reports the same thing:
https://patchwork.kernel.org/project/netdevbpf/patch/20230105092637.35069-1-tong@infragraf.org/

CHECK: spaces preferred around that '-' (ctx:WxV)
#46: FILE: kernel/bpf/hashtab.c:155:
+	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
  	                                                                ^

CHECK: spaces preferred around that '-' (ctx:WxV)
#55: FILE: kernel/bpf/hashtab.c:174:
+	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);

btw, instead of doing this min_t and -1 repeatedly, ensuring n_buckets is at 
least HASHTAB_MAP_LOCK_COUNT during map_alloc should be as good?  htab having 2 
or 4 max_entries should be pretty uncommon.


* Re: [bpf-next v4 2/2] selftests/bpf: add test case for htab map
  2023-01-10  1:33   ` Martin KaFai Lau
@ 2023-01-10  2:21     ` Tonghao Zhang
  2023-01-10  3:25       ` Martin KaFai Lau
  0 siblings, 1 reply; 11+ messages in thread
From: Tonghao Zhang @ 2023-01-10  2:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Hou Tao, bpf, Manu Bretelle



> On Jan 10, 2023, at 9:33 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 1/5/23 1:26 AM, tong@infragraf.org wrote:
>> diff --git a/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
>> new file mode 100644
>> index 000000000000..137dce8f1346
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
>> @@ -0,0 +1,75 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 DiDi Global Inc. */
>> +#define _GNU_SOURCE
>> +#include <pthread.h>
>> +#include <sched.h>
>> +#include <test_progs.h>
>> +
>> +#include "htab_deadlock.skel.h"
>> +
>> +static int perf_event_open(void)
>> +{
>> +	struct perf_event_attr attr = {0};
>> +	int pfd;
>> +
>> +	/* create perf event on CPU 0 */
>> +	attr.size = sizeof(attr);
>> +	attr.type = PERF_TYPE_HARDWARE;
>> +	attr.config = PERF_COUNT_HW_CPU_CYCLES;
>> +	attr.freq = 1;
>> +	attr.sample_freq = 1000;
>> +	pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
>> +
>> +	return pfd >= 0 ? pfd : -errno;
>> +}
>> +
>> +void test_htab_deadlock(void)
>> +{
>> +	unsigned int val = 0, key = 20;
>> +	struct bpf_link *link = NULL;
>> +	struct htab_deadlock *skel;
>> +	int err, i, pfd;
>> +	cpu_set_t cpus;
>> +
>> +	skel = htab_deadlock__open_and_load();
>> +	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
>> +		return;
>> +
>> +	err = htab_deadlock__attach(skel);
>> +	if (!ASSERT_OK(err, "skel_attach"))
>> +		goto clean_skel;
>> +
>> +	/* NMI events. */
>> +	pfd = perf_event_open();
>> +	if (pfd < 0) {
>> +		if (pfd == -ENOENT || pfd == -EOPNOTSUPP) {
>> +			printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n", __func__);
>> +			test__skip();
> 
> This test is a SKIP in bpf CI, so it won't be useful.
> https://github.com/kernel-patches/bpf/actions/runs/3858084722/jobs/6579470256#step:6:5198
> 
> Is there other way to test it or do you know what may be missing in vmtest.sh? Not sure if the cloud setup in CI blocks HW_CPU_CYCLES.  If it is, I also don't know a good way (Cc: Manu).
Hi

Other test cases using PERF_COUNT_HW_CPU_CYCLES were skipped too. For example,
send_signal
find_vma
get_stackid_cannot_attach

> 
>> +			goto clean_skel;
>> +		}
>> +		if (!ASSERT_GE(pfd, 0, "perf_event_open"))
>> +			goto clean_skel;
>> +	}
>> +
>> +	link = bpf_program__attach_perf_event(skel->progs.bpf_empty, pfd);
>> +	if (!ASSERT_OK_PTR(link, "attach_perf_event"))
>> +		goto clean_pfd;
>> +
>> +	/* Pinned on CPU 0 */
>> +	CPU_ZERO(&cpus);
>> +	CPU_SET(0, &cpus);
>> +	pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
>> +
>> +	/* update bpf map concurrently on CPU0 in NMI and Task context.
>> +	 * there should be no kernel deadlock.
>> +	 */
>> +	for (i = 0; i < 100000; i++)
>> +		bpf_map_update_elem(bpf_map__fd(skel->maps.htab),
>> +				    &key, &val, BPF_ANY);
>> +
>> +	bpf_link__destroy(link);
>> +clean_pfd:
>> +	close(pfd);
>> +clean_skel:
>> +	htab_deadlock__destroy(skel);
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/htab_deadlock.c b/tools/testing/selftests/bpf/progs/htab_deadlock.c
>> new file mode 100644
>> index 000000000000..dacd003b1ccb
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/htab_deadlock.c
>> @@ -0,0 +1,30 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 DiDi Global Inc. */
>> +#include <linux/bpf.h>
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_tracing.h>
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_HASH);
>> +	__uint(max_entries, 2);
>> +	__uint(map_flags, BPF_F_ZERO_SEED);
>> +	__type(key, unsigned int);
>> +	__type(value, unsigned int);
>> +} htab SEC(".maps");
>> +
>> +SEC("fentry/perf_event_overflow")
>> +int bpf_nmi_handle(struct pt_regs *regs)
>> +{
>> +	unsigned int val = 0, key = 4;
>> +
>> +	bpf_map_update_elem(&htab, &key, &val, BPF_ANY);
> 
> I ran it in my qemu setup which does not skip the test.  I got this splat though:
This is a false alarm, not a deadlock (this patch fixes only the
deadlock). I fixed the warning in another patch; please review:
https://patchwork.kernel.org/project/netdevbpf/patch/20230105112749.38421-1-tong@infragraf.org/
> 
> [   42.990306] ================================
> [   42.990307] WARNING: inconsistent lock state
> [   42.990310] 6.2.0-rc2-00304-gaf88a1bb9967 #409 Tainted: G           O
> [   42.990313] --------------------------------
> [   42.990315] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> [   42.990317] test_progs/1546 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [   42.990322] ffff888101245768 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x1e7/0x810
> [   42.990340] {INITIAL USE} state was registered at:
> [   42.990341]   lock_acquire+0x1e6/0x530
> [   42.990351]   _raw_spin_lock_irqsave+0xb8/0x100
> [   42.990362]   htab_map_update_elem+0x1e7/0x810
> [   42.990365]   bpf_map_update_value+0x40d/0x4f0
> [   42.990371]   map_update_elem+0x423/0x580
> [   42.990375]   __sys_bpf+0x54e/0x670
> [   42.990377]   __x64_sys_bpf+0x7c/0x90
> [   42.990382]   do_syscall_64+0x43/0x90
> [   42.990387]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
> 
> Please check.
> 
>> +	return 0;
>> +}
>> +
>> +SEC("perf_event")
>> +int bpf_empty(struct pt_regs *regs)
>> +{
> 
> btw, from a quick look at __perf_event_overflow, I suspect doing the bpf_map_update_elem() here instead of the fentry/perf_event_overflow above can also reproduce the patch 1 issue?
No. bpf_overflow_handler checks bpf_prog_active; if the syscall has
increased it, bpf_overflow_handler skips the bpf prog. Fentry does not
check bpf_prog_active and can interrupt the task context. We have
discussed that before.

> 
>> +	return 0;
>> +}
> 
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask
  2023-01-10  1:52 ` [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask Martin KaFai Lau
@ 2023-01-10  2:25   ` Tonghao Zhang
  2023-01-10  3:03     ` Martin KaFai Lau
  0 siblings, 1 reply; 11+ messages in thread
From: Tonghao Zhang @ 2023-01-10  2:25 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Hou Tao



> On Jan 10, 2023, at 9:52 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 1/5/23 1:26 AM, tong@infragraf.org wrote:
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 5aa2b5525f79..974f104f47a0 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -152,7 +152,7 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>  {
>>  	unsigned long flags;
>>  -	hash = hash & HASHTAB_MAP_LOCK_MASK;
>> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>>    	preempt_disable();
>>  	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>> @@ -171,7 +171,7 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>  				      struct bucket *b, u32 hash,
>>  				      unsigned long flags)
>>  {
>> -	hash = hash & HASHTAB_MAP_LOCK_MASK;
>> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
> 
> Please run checkpatch.pl.  patchwork also reports the same thing:
> https://patchwork.kernel.org/project/netdevbpf/patch/20230105092637.35069-1-tong@infragraf.org/
> 
> CHECK: spaces preferred around that '-' (ctx:WxV)
> #46: FILE: kernel/bpf/hashtab.c:155:
> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
> 	                                                                ^
> 
> CHECK: spaces preferred around that '-' (ctx:WxV)
> #55: FILE: kernel/bpf/hashtab.c:174:
> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
> 
> btw, instead of doing this min_t and -1 repeatedly, ensuring n_buckets is at least HASHTAB_MAP_LOCK_COUNT during map_alloc should be as good?  htab having 2 or 4 max_entries should be pretty uncommon.
> 
I think we should not limit max_entries, even though such small maps are not a common use case. But for performance, we could introduce htab->n_buckets_mask = HASHTAB_MAP_LOCK_MASK & (htab->n_buckets - 1)?


* Re: [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask
  2023-01-10  2:25   ` Tonghao Zhang
@ 2023-01-10  3:03     ` Martin KaFai Lau
  0 siblings, 0 replies; 11+ messages in thread
From: Martin KaFai Lau @ 2023-01-10  3:03 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Hou Tao

On 1/9/23 6:25 PM, Tonghao Zhang wrote:
> 
> 
>> On Jan 10, 2023, at 9:52 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 1/5/23 1:26 AM, tong@infragraf.org wrote:
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index 5aa2b5525f79..974f104f47a0 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -152,7 +152,7 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>   {
>>>   	unsigned long flags;
>>>   -	hash = hash & HASHTAB_MAP_LOCK_MASK;
>>> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>>>     	preempt_disable();
>>>   	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>> @@ -171,7 +171,7 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>>   				      struct bucket *b, u32 hash,
>>>   				      unsigned long flags)
>>>   {
>>> -	hash = hash & HASHTAB_MAP_LOCK_MASK;
>>> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>>
>> Please run checkpatch.pl.  patchwork also reports the same thing:
>> https://patchwork.kernel.org/project/netdevbpf/patch/20230105092637.35069-1-tong@infragraf.org/
>>
>> CHECK: spaces preferred around that '-' (ctx:WxV)
>> #46: FILE: kernel/bpf/hashtab.c:155:
>> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>> 	                                                                ^
>>
>> CHECK: spaces preferred around that '-' (ctx:WxV)
>> #55: FILE: kernel/bpf/hashtab.c:174:
>> +	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>>
>> btw, instead of doing this min_t and -1 repeatedly, ensuring n_buckets is at least HASHTAB_MAP_LOCK_COUNT during map_alloc should be as good?  htab having 2 or 4 max_entries should be pretty uncommon.
>>
> I think we should not limit max_entries, even though such small maps are not a common use case. But for performance, we could introduce htab->n_buckets_mask = HASHTAB_MAP_LOCK_MASK & (htab->n_buckets - 1)?

To be clear, I didn't mean limiting max_entries... I meant lower-bounding 
n_buckets to HASHTAB_MAP_LOCK_COUNT.

imo, adding another n_buckets_mask to htab is even worse for this uncommon case, 
e.g. future code would need to be careful about which one to use.

It was a suggestion; if you insist, min_t during htab_(un)lock_bucket is fine.


* Re: [bpf-next v4 2/2] selftests/bpf: add test case for htab map
  2023-01-10  2:21     ` Tonghao Zhang
@ 2023-01-10  3:25       ` Martin KaFai Lau
  2023-01-10  3:44         ` Martin KaFai Lau
  2023-01-10  8:10         ` Tonghao Zhang
  0 siblings, 2 replies; 11+ messages in thread
From: Martin KaFai Lau @ 2023-01-10  3:25 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Hou Tao, bpf, Manu Bretelle

On 1/9/23 6:21 PM, Tonghao Zhang wrote:
> 
> 
>> On Jan 10, 2023, at 9:33 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 1/5/23 1:26 AM, tong@infragraf.org wrote:
>>> diff --git a/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
>>> new file mode 100644
>>> index 000000000000..137dce8f1346
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
>>> @@ -0,0 +1,75 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/* Copyright (c) 2022 DiDi Global Inc. */
>>> +#define _GNU_SOURCE
>>> +#include <pthread.h>
>>> +#include <sched.h>
>>> +#include <test_progs.h>
>>> +
>>> +#include "htab_deadlock.skel.h"
>>> +
>>> +static int perf_event_open(void)
>>> +{
>>> +	struct perf_event_attr attr = {0};
>>> +	int pfd;
>>> +
>>> +	/* create perf event on CPU 0 */
>>> +	attr.size = sizeof(attr);
>>> +	attr.type = PERF_TYPE_HARDWARE;
>>> +	attr.config = PERF_COUNT_HW_CPU_CYCLES;
>>> +	attr.freq = 1;
>>> +	attr.sample_freq = 1000;
>>> +	pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
>>> +
>>> +	return pfd >= 0 ? pfd : -errno;
>>> +}
>>> +
>>> +void test_htab_deadlock(void)
>>> +{
>>> +	unsigned int val = 0, key = 20;
>>> +	struct bpf_link *link = NULL;
>>> +	struct htab_deadlock *skel;
>>> +	int err, i, pfd;
>>> +	cpu_set_t cpus;
>>> +
>>> +	skel = htab_deadlock__open_and_load();
>>> +	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
>>> +		return;
>>> +
>>> +	err = htab_deadlock__attach(skel);
>>> +	if (!ASSERT_OK(err, "skel_attach"))
>>> +		goto clean_skel;
>>> +
>>> +	/* NMI events. */
>>> +	pfd = perf_event_open();
>>> +	if (pfd < 0) {
>>> +		if (pfd == -ENOENT || pfd == -EOPNOTSUPP) {
>>> +			printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n", __func__);
>>> +			test__skip();
>>
>> This test is a SKIP in bpf CI, so it won't be useful.
>> https://github.com/kernel-patches/bpf/actions/runs/3858084722/jobs/6579470256#step:6:5198
>>
>> Is there other way to test it or do you know what may be missing in vmtest.sh? Not sure if the cloud setup in CI blocks HW_CPU_CYCLES.  If it is, I also don't know a good way (Cc: Manu).
> Hi
> 
> Other test cases using PERF_COUNT_HW_CPU_CYCLES were skipped too. For example,
> send_signal
> find_vma
> get_stackid_cannot_attach

Got it. Thanks for checking.

>>
>>> +			goto clean_skel;
>>> +		}
>>> +		if (!ASSERT_GE(pfd, 0, "perf_event_open"))
>>> +			goto clean_skel;
>>> +	}
>>> +
>>> +	link = bpf_program__attach_perf_event(skel->progs.bpf_empty, pfd);
>>> +	if (!ASSERT_OK_PTR(link, "attach_perf_event"))
>>> +		goto clean_pfd;
>>> +
>>> +	/* Pinned on CPU 0 */
>>> +	CPU_ZERO(&cpus);
>>> +	CPU_SET(0, &cpus);
>>> +	pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
>>> +
>>> +	/* update bpf map concurrently on CPU0 in NMI and Task context.
>>> +	 * there should be no kernel deadlock.
>>> +	 */
>>> +	for (i = 0; i < 100000; i++)
>>> +		bpf_map_update_elem(bpf_map__fd(skel->maps.htab),
>>> +				    &key, &val, BPF_ANY);
>>> +
>>> +	bpf_link__destroy(link);
>>> +clean_pfd:
>>> +	close(pfd);
>>> +clean_skel:
>>> +	htab_deadlock__destroy(skel);
>>> +}
>>> diff --git a/tools/testing/selftests/bpf/progs/htab_deadlock.c b/tools/testing/selftests/bpf/progs/htab_deadlock.c
>>> new file mode 100644
>>> index 000000000000..dacd003b1ccb
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/bpf/progs/htab_deadlock.c
>>> @@ -0,0 +1,30 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/* Copyright (c) 2022 DiDi Global Inc. */
>>> +#include <linux/bpf.h>
>>> +#include <bpf/bpf_helpers.h>
>>> +#include <bpf/bpf_tracing.h>
>>> +
>>> +char _license[] SEC("license") = "GPL";
>>> +
>>> +struct {
>>> +	__uint(type, BPF_MAP_TYPE_HASH);
>>> +	__uint(max_entries, 2);
>>> +	__uint(map_flags, BPF_F_ZERO_SEED);
>>> +	__type(key, unsigned int);
>>> +	__type(value, unsigned int);
>>> +} htab SEC(".maps");
>>> +
>>> +SEC("fentry/perf_event_overflow")
>>> +int bpf_nmi_handle(struct pt_regs *regs)
>>> +{
>>> +	unsigned int val = 0, key = 4;
>>> +
>>> +	bpf_map_update_elem(&htab, &key, &val, BPF_ANY);
>>
>> I ran it in my qemu setup which does not skip the test.  I got this splat though:
> This is a false alarm, not a deadlock (this patch only fixes the deadlock). I fixed the warning in another patch; please review:
> https://patchwork.kernel.org/project/netdevbpf/patch/20230105112749.38421-1-tong@infragraf.org/

Yeah, I just saw this thread also. Please submit the warning fix together with 
this patch set since this test can trigger it.  They should be reviewed together.

>>
>> [   42.990306] ================================
>> [   42.990307] WARNING: inconsistent lock state
>> [   42.990310] 6.2.0-rc2-00304-gaf88a1bb9967 #409 Tainted: G           O
>> [   42.990313] --------------------------------
>> [   42.990315] inconsistent {INITIAL USE} -> {IN-NMI} usage.
>> [   42.990317] test_progs/1546 [HC1[1]:SC0[0]:HE0:SE1] takes:
>> [   42.990322] ffff888101245768 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x1e7/0x810
>> [   42.990340] {INITIAL USE} state was registered at:
>> [   42.990341]   lock_acquire+0x1e6/0x530
>> [   42.990351]   _raw_spin_lock_irqsave+0xb8/0x100
>> [   42.990362]   htab_map_update_elem+0x1e7/0x810
>> [   42.990365]   bpf_map_update_value+0x40d/0x4f0
>> [   42.990371]   map_update_elem+0x423/0x580
>> [   42.990375]   __sys_bpf+0x54e/0x670
>> [   42.990377]   __x64_sys_bpf+0x7c/0x90
>> [   42.990382]   do_syscall_64+0x43/0x90
>> [   42.990387]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>
>> Please check.
>>
>>> +	return 0;
>>> +}
>>> +
>>> +SEC("perf_event")
>>> +int bpf_empty(struct pt_regs *regs)
>>> +{
>>
>> btw, from a quick look at __perf_event_overflow, I suspect doing the bpf_map_update_elem() here instead of the fentry/perf_event_overflow above can also reproduce the patch 1 issue?
> No.
> bpf_overflow_handler checks bpf_prog_active; if the syscall has already incremented it, bpf_overflow_handler skips the bpf prog.

tbh, I am quite surprised the bpf_prog_active would be noisy enough to avoid 
this deadlock being reproduced easily. fwiw, I just tried doing map_update here 
and can reproduce it in the very first run.

> A fentry prog does not check bpf_prog_active and can interrupt the task context. We have discussed that.

Sure. fentry is fine. The reason I was asking is to see if the test can be 
simplified and barring any future fentry blacklist.


* Re: [bpf-next v4 2/2] selftests/bpf: add test case for htab map
  2023-01-10  3:25       ` Martin KaFai Lau
@ 2023-01-10  3:44         ` Martin KaFai Lau
  2023-01-10  8:10         ` Tonghao Zhang
  1 sibling, 0 replies; 11+ messages in thread
From: Martin KaFai Lau @ 2023-01-10  3:44 UTC (permalink / raw)
  To: Tonghao Zhang
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Hou Tao, bpf, Manu Bretelle

On 1/9/23 7:25 PM, Martin KaFai Lau wrote:
>>>
>>> btw, from a quick look at __perf_event_overflow, I suspect doing the 
>>> bpf_map_update_elem() here instead of the fentry/perf_event_overflow above 
>>> can also reproduce the patch 1 issue?
>> No
>> bpf_overflow_handler will check the bpf_prog_active, if syscall increase it, 
>> bpf_overflow_handler will skip the bpf prog.
> 
> tbh, I am quite surprised the bpf_prog_active would be noisy enough to avoid 
> this deadlock being reproduced easily. fwiw, I just tried doing map_update here 
> and can reproduce it in the very first run.
Correcting myself. I only reproduced the warning splat, not the deadlock. 
This test does the map_update from the syscall, which bumps prog_active.

Agree that SEC("perf_event") alone won't work unless the bpf_map_update_elem() 
in prog_tests/htab_deadlock.c is done from somewhere other than the syscall, 
e.g. from another bpf prog.


* Re: [bpf-next v4 2/2] selftests/bpf: add test case for htab map
  2023-01-10  3:25       ` Martin KaFai Lau
  2023-01-10  3:44         ` Martin KaFai Lau
@ 2023-01-10  8:10         ` Tonghao Zhang
  1 sibling, 0 replies; 11+ messages in thread
From: Tonghao Zhang @ 2023-01-10  8:10 UTC (permalink / raw)
  To: Martin KaFai Lau, Hou Tao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Hou Tao, bpf, Manu Bretelle



> On Jan 10, 2023, at 11:25 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 1/9/23 6:21 PM, Tonghao Zhang wrote:
>>> On Jan 10, 2023, at 9:33 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>> 
>>> On 1/5/23 1:26 AM, tong@infragraf.org wrote:
>>>> diff --git a/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
>>>> new file mode 100644
>>>> index 000000000000..137dce8f1346
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/bpf/prog_tests/htab_deadlock.c
>>>> @@ -0,0 +1,75 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright (c) 2022 DiDi Global Inc. */
>>>> +#define _GNU_SOURCE
>>>> +#include <pthread.h>
>>>> +#include <sched.h>
>>>> +#include <test_progs.h>
>>>> +
>>>> +#include "htab_deadlock.skel.h"
>>>> +
>>>> +static int perf_event_open(void)
>>>> +{
>>>> +	struct perf_event_attr attr = {0};
>>>> +	int pfd;
>>>> +
>>>> +	/* create perf event on CPU 0 */
>>>> +	attr.size = sizeof(attr);
>>>> +	attr.type = PERF_TYPE_HARDWARE;
>>>> +	attr.config = PERF_COUNT_HW_CPU_CYCLES;
>>>> +	attr.freq = 1;
>>>> +	attr.sample_freq = 1000;
>>>> +	pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
>>>> +
>>>> +	return pfd >= 0 ? pfd : -errno;
>>>> +}
>>>> +
>>>> +void test_htab_deadlock(void)
>>>> +{
>>>> +	unsigned int val = 0, key = 20;
>>>> +	struct bpf_link *link = NULL;
>>>> +	struct htab_deadlock *skel;
>>>> +	int err, i, pfd;
>>>> +	cpu_set_t cpus;
>>>> +
>>>> +	skel = htab_deadlock__open_and_load();
>>>> +	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
>>>> +		return;
>>>> +
>>>> +	err = htab_deadlock__attach(skel);
>>>> +	if (!ASSERT_OK(err, "skel_attach"))
>>>> +		goto clean_skel;
>>>> +
>>>> +	/* NMI events. */
>>>> +	pfd = perf_event_open();
>>>> +	if (pfd < 0) {
>>>> +		if (pfd == -ENOENT || pfd == -EOPNOTSUPP) {
>>>> +			printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n", __func__);
>>>> +			test__skip();
>>> 
>>> This test is a SKIP in bpf CI, so it won't be useful.
>>> https://github.com/kernel-patches/bpf/actions/runs/3858084722/jobs/6579470256#step:6:5198
>>> 
>>> Is there other way to test it or do you know what may be missing in vmtest.sh? Not sure if the cloud setup in CI blocks HW_CPU_CYCLES.  If it is, I also don't know a good way (Cc: Manu).
>> Hi
>> Other test cases using PERF_COUNT_HW_CPU_CYCLES were skipped too. For example,
>> send_signal
>> find_vma
>> get_stackid_cannot_attach
> 
> Got it. Thanks for checking.
> 
>>> 
>>>> +			goto clean_skel;
>>>> +		}
>>>> +		if (!ASSERT_GE(pfd, 0, "perf_event_open"))
>>>> +			goto clean_skel;
>>>> +	}
>>>> +
>>>> +	link = bpf_program__attach_perf_event(skel->progs.bpf_empty, pfd);
>>>> +	if (!ASSERT_OK_PTR(link, "attach_perf_event"))
>>>> +		goto clean_pfd;
>>>> +
>>>> +	/* Pinned on CPU 0 */
>>>> +	CPU_ZERO(&cpus);
>>>> +	CPU_SET(0, &cpus);
>>>> +	pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
>>>> +
>>>> +	/* update bpf map concurrently on CPU0 in NMI and Task context.
>>>> +	 * there should be no kernel deadlock.
>>>> +	 */
>>>> +	for (i = 0; i < 100000; i++)
>>>> +		bpf_map_update_elem(bpf_map__fd(skel->maps.htab),
>>>> +				    &key, &val, BPF_ANY);
>>>> +
>>>> +	bpf_link__destroy(link);
>>>> +clean_pfd:
>>>> +	close(pfd);
>>>> +clean_skel:
>>>> +	htab_deadlock__destroy(skel);
>>>> +}
>>>> diff --git a/tools/testing/selftests/bpf/progs/htab_deadlock.c b/tools/testing/selftests/bpf/progs/htab_deadlock.c
>>>> new file mode 100644
>>>> index 000000000000..dacd003b1ccb
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/bpf/progs/htab_deadlock.c
>>>> @@ -0,0 +1,30 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright (c) 2022 DiDi Global Inc. */
>>>> +#include <linux/bpf.h>
>>>> +#include <bpf/bpf_helpers.h>
>>>> +#include <bpf/bpf_tracing.h>
>>>> +
>>>> +char _license[] SEC("license") = "GPL";
>>>> +
>>>> +struct {
>>>> +	__uint(type, BPF_MAP_TYPE_HASH);
>>>> +	__uint(max_entries, 2);
>>>> +	__uint(map_flags, BPF_F_ZERO_SEED);
>>>> +	__type(key, unsigned int);
>>>> +	__type(value, unsigned int);
>>>> +} htab SEC(".maps");
>>>> +
>>>> +SEC("fentry/perf_event_overflow")
>>>> +int bpf_nmi_handle(struct pt_regs *regs)
>>>> +{
>>>> +	unsigned int val = 0, key = 4;
>>>> +
>>>> +	bpf_map_update_elem(&htab, &key, &val, BPF_ANY);
>>> 
>>> I ran it in my qemu setup which does not skip the test.  I got this splat though:
>> This is a false alarm, not a deadlock (this patch only fixes the deadlock). I fixed the warning in another patch; please review:
>> https://patchwork.kernel.org/project/netdevbpf/patch/20230105112749.38421-1-tong@infragraf.org/
> 
> Yeah, I just saw this thread also. Please submit the warning fix together with this patch set since this test can trigger it.  They should be reviewed together.

Hou reviewed this patch: https://patchwork.kernel.org/project/netdevbpf/patch/20230105112749.38421-1-tong@infragraf.org/

I will send v2 together with the other patches.

>>> 
>>> [   42.990306] ================================
>>> [   42.990307] WARNING: inconsistent lock state
>>> [   42.990310] 6.2.0-rc2-00304-gaf88a1bb9967 #409 Tainted: G           O
>>> [   42.990313] --------------------------------
>>> [   42.990315] inconsistent {INITIAL USE} -> {IN-NMI} usage.
>>> [   42.990317] test_progs/1546 [HC1[1]:SC0[0]:HE0:SE1] takes:
>>> [   42.990322] ffff888101245768 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x1e7/0x810
>>> [   42.990340] {INITIAL USE} state was registered at:
>>> [   42.990341]   lock_acquire+0x1e6/0x530
>>> [   42.990351]   _raw_spin_lock_irqsave+0xb8/0x100
>>> [   42.990362]   htab_map_update_elem+0x1e7/0x810
>>> [   42.990365]   bpf_map_update_value+0x40d/0x4f0
>>> [   42.990371]   map_update_elem+0x423/0x580
>>> [   42.990375]   __sys_bpf+0x54e/0x670
>>> [   42.990377]   __x64_sys_bpf+0x7c/0x90
>>> [   42.990382]   do_syscall_64+0x43/0x90
>>> [   42.990387]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>> 
>>> Please check.
>>> 
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +SEC("perf_event")
>>>> +int bpf_empty(struct pt_regs *regs)
>>>> +{
>>> 
>>> btw, from a quick look at __perf_event_overflow, I suspect doing the bpf_map_update_elem() here instead of the fentry/perf_event_overflow above can also reproduce the patch 1 issue?
>> No.
>> bpf_overflow_handler checks bpf_prog_active; if the syscall has already incremented it, bpf_overflow_handler skips the bpf prog.
> 
> tbh, I am quite surprised the bpf_prog_active would be noisy enough to avoid this deadlock being reproduced easily. fwiw, I just tried doing map_update here and can reproduce it in the very first run.
> 
>> A fentry prog does not check bpf_prog_active and can interrupt the task context. We have discussed that.
> 
> Sure. fentry is fine. The reason I was asking is to see if the test can be simplified and barring any future fentry blacklist.



end of thread, other threads:[~2023-01-10  8:11 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-05  9:26 [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask tong
2023-01-05  9:26 ` [bpf-next v4 2/2] selftests/bpf: add test case for htab map tong
2023-01-05 10:56   ` Hou Tao
2023-01-10  1:33   ` Martin KaFai Lau
2023-01-10  2:21     ` Tonghao Zhang
2023-01-10  3:25       ` Martin KaFai Lau
2023-01-10  3:44         ` Martin KaFai Lau
2023-01-10  8:10         ` Tonghao Zhang
2023-01-10  1:52 ` [bpf-next v4 1/2] bpf: hash map, avoid deadlock with suitable hash mask Martin KaFai Lau
2023-01-10  2:25   ` Tonghao Zhang
2023-01-10  3:03     ` Martin KaFai Lau
