* [PATCH bpf-next 0/3] Add bpf_for_each helper
@ 2021-11-18  1:04 Joanne Koong
  2021-11-18  1:04 ` [PATCH bpf-next 1/3] bpf: " Joanne Koong
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Joanne Koong @ 2021-11-18  1:04 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team, Joanne Koong

This patchset adds a new helper, bpf_for_each.

One of the complexities of using for loops in bpf programs is that the verifier
needs to ensure that the loop terminates on every possible path through the
loop logic. As such, there is a limit on how many iterations the loop can run.

The bpf_for_each helper moves the loop logic into the kernel and can thereby
guarantee that the loop always terminates. This removes much of the complexity
the verifier otherwise needs to check, as well as the constraint on the number
of iterations that can be run.
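
As a quick illustration (a sketch only; the function and program names below
are made up for this cover letter, not taken from the patches), a program
using the helper looks roughly like:

  /* sketch: sum the indices 0..255 without a verifier-bounded loop */
  static int sum_callback(__u32 index, void *data)
  {
          int *sum = data;

          *sum += index;
          return 0;       /* 0 means continue to the next iteration */
  }

  SEC("tc")
  int sum_prog(struct __sk_buff *skb)
  {
          int sum = 0;

          /* the kernel runs the iterations, so the verifier does not
           * need to unroll or bound this loop
           */
          bpf_for_each(256, sum_callback, &sum, 0);
          return 0;
  }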

From the test results, we see that using bpf_for_each in place
of the traditional for loop led to a decrease in verification time
and number of bpf instructions by 100%. The benchmark results show
that as the number of iterations increases, the overhead per iteration
decreases.

The high-level overview of the patches -
Patch 1 - kernel-side + API changes for adding bpf_for_each
Patch 2 - tests
Patch 3 - benchmark for overhead of a bpf_for_each call


Joanne Koong (3):
  bpf: Add bpf_for_each helper
  selftests/bpf: Add tests for bpf_for_each
  selftest/bpf/benchs: add bpf_for_each benchmark

 include/linux/bpf.h                           |   1 +
 include/uapi/linux/bpf.h                      |  23 ++++
 kernel/bpf/bpf_iter.c                         |  32 ++++++
 kernel/bpf/helpers.c                          |   2 +
 kernel/bpf/verifier.c                         |  28 +++++
 tools/include/uapi/linux/bpf.h                |  23 ++++
 tools/testing/selftests/bpf/Makefile          |   3 +-
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_for_each.c     | 105 ++++++++++++++++++
 .../bpf/benchs/run_bench_for_each.sh          |  16 +++
 .../bpf/prog_tests/bpf_verif_scale.c          |  12 ++
 .../selftests/bpf/prog_tests/for_each.c       |  61 ++++++++++
 .../selftests/bpf/progs/for_each_helper.c     |  82 ++++++++++++++
 tools/testing/selftests/bpf/progs/pyperf.h    |  70 +++++++++++-
 .../selftests/bpf/progs/pyperf600_foreach.c   |   5 +
 .../testing/selftests/bpf/progs/strobemeta.h  |  73 +++++++++++-
 .../selftests/bpf/progs/strobemeta_foreach.c  |   9 ++
 17 files changed, 544 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_for_each.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_for_each.sh
 create mode 100644 tools/testing/selftests/bpf/progs/for_each_helper.c
 create mode 100644 tools/testing/selftests/bpf/progs/pyperf600_foreach.c
 create mode 100644 tools/testing/selftests/bpf/progs/strobemeta_foreach.c

-- 
2.30.2



* [PATCH bpf-next 1/3] bpf: Add bpf_for_each helper
  2021-11-18  1:04 [PATCH bpf-next 0/3] Add bpf_for_each helper Joanne Koong
@ 2021-11-18  1:04 ` Joanne Koong
  2021-11-18 11:11   ` Toke Høiland-Jørgensen
                     ` (2 more replies)
  2021-11-18  1:04 ` [PATCH bpf-next 2/3] selftests/bpf: Add tests for bpf_for_each Joanne Koong
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 20+ messages in thread
From: Joanne Koong @ 2021-11-18  1:04 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team, Joanne Koong

This patch adds the kernel-side and API changes for a new helper
function, bpf_for_each:

long bpf_for_each(u32 nr_iterations, void *callback_fn,
void *callback_ctx, u64 flags);

bpf_for_each invokes the "callback_fn" nr_iterations number of times
or until the callback_fn returns 1.

A few things to note:
~ The "u64 flags" parameter is currently unused but is included in
case a future use case for it arises.
~ In the kernel-side implementation of bpf_for_each (kernel/bpf/bpf_iter.c),
bpf_callback_t is used as the callback function cast.
~ A program can have nested bpf_for_each calls but the program must
still adhere to the verifier constraint of its stack depth (the stack depth
cannot exceed MAX_BPF_STACK)
~ The next patch will include the tests and benchmark
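
To illustrate the nesting note above (a sketch only, not code from this
patch), nested calls look roughly like:

  static int inner_cb(__u32 j, void *ctx)
  {
          return 0;
  }

  static int outer_cb(__u32 i, void *ctx)
  {
          /* a callback may itself call bpf_for_each; the combined stack
           * depth of the callbacks must stay within MAX_BPF_STACK
           */
          bpf_for_each(8, inner_cb, ctx, 0);
          return 0;
  }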

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       | 23 +++++++++++++++++++++++
 kernel/bpf/bpf_iter.c          | 32 ++++++++++++++++++++++++++++++++
 kernel/bpf/helpers.c           |  2 ++
 kernel/bpf/verifier.c          | 28 ++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++++
 6 files changed, 109 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6deebf8bf78f..d9b69a896c91 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2107,6 +2107,7 @@ extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
 extern const struct bpf_func_proto bpf_task_storage_get_proto;
 extern const struct bpf_func_proto bpf_task_storage_delete_proto;
 extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
+extern const struct bpf_func_proto bpf_for_each_proto;
 extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
 extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
 extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bd0c9f0487f6..ea5098920ed2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4750,6 +4750,28 @@ union bpf_attr {
  *		The number of traversed map elements for success, **-EINVAL** for
  *		invalid **flags**.
  *
+ * long bpf_for_each(u32 nr_iterations, void *callback_fn, void *callback_ctx, u64 flags)
+ *	Description
+ *		For **nr_iterations**, call **callback_fn** function with
+ *		**callback_ctx** as the context parameter.
+ *		The **callback_fn** should be a static function and
+ *		the **callback_ctx** should be a pointer to the stack.
+ *		The **flags** is used to control certain aspects of the helper.
+ *		Currently, the **flags** must be 0.
+ *
+ *		long (\*callback_fn)(u32 index, void \*ctx);
+ *
+ *		where **index** is the current index in the iteration. The index
+ *		is zero-indexed.
+ *
+ *		If **callback_fn** returns 0, the helper will continue to the next
+ *		iteration. If return value is 1, the helper will skip the rest of
+ *		the iterations and return. Other return values are not used now.
+ *
+ *	Return
+ *		The number of iterations performed, **-EINVAL** for invalid **flags**
+ *		or a null **callback_fn**.
+ *
  * long bpf_snprintf(char *str, u32 str_size, const char *fmt, u64 *data, u32 data_len)
  *	Description
  *		Outputs a string into the **str** buffer of size **str_size**
@@ -5105,6 +5127,7 @@ union bpf_attr {
 	FN(sock_from_file),		\
 	FN(check_mtu),			\
 	FN(for_each_map_elem),		\
+	FN(for_each),			\
 	FN(snprintf),			\
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index b2ee45064e06..cb742c50898a 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -714,3 +714,35 @@ const struct bpf_func_proto bpf_for_each_map_elem_proto = {
 	.arg3_type	= ARG_PTR_TO_STACK_OR_NULL,
 	.arg4_type	= ARG_ANYTHING,
 };
+
+BPF_CALL_4(bpf_for_each, u32, nr_iterations, void *, callback_fn, void *, callback_ctx,
+	   u64, flags)
+{
+	bpf_callback_t callback = (bpf_callback_t)callback_fn;
+	u64 err;
+	u32 i;
+
+	if (flags)
+		return -EINVAL;
+
+	for (i = 0; i < nr_iterations; i++) {
+		err = callback((u64)i, (u64)(long)callback_ctx, 0, 0, 0);
+		/* return value: 0 - continue, 1 - stop and return */
+		if (err) {
+			i++;
+			break;
+		}
+	}
+
+	return i;
+}
+
+const struct bpf_func_proto bpf_for_each_proto = {
+	.func		= bpf_for_each,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+	.arg3_type	= ARG_PTR_TO_STACK_OR_NULL,
+	.arg4_type	= ARG_ANYTHING,
+};
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1ffd469c217f..ac10e1b3859f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1378,6 +1378,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_ringbuf_query_proto;
 	case BPF_FUNC_for_each_map_elem:
 		return &bpf_for_each_map_elem_proto;
+	case BPF_FUNC_for_each:
+		return &bpf_for_each_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3c8aa7df1773..7c9cc4f01104 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6103,6 +6103,27 @@ static int set_map_elem_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_for_each_callback_state(struct bpf_verifier_env *env,
+				       struct bpf_func_state *caller,
+				       struct bpf_func_state *callee,
+				       int insn_idx)
+{
+	/* bpf_for_each(u32 nr_iterations, void *callback_fn, void *callback_ctx,
+	 *		u64 flags);
+	 * callback_fn(u32 index, void *callback_ctx);
+	 */
+	callee->regs[BPF_REG_1].type = SCALAR_VALUE;
+	callee->regs[BPF_REG_2] = caller->regs[BPF_REG_3];
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_3]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static int set_timer_callback_state(struct bpf_verifier_env *env,
 				    struct bpf_func_state *caller,
 				    struct bpf_func_state *callee,
@@ -6482,6 +6503,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return -EINVAL;
 	}
 
+	if (func_id == BPF_FUNC_for_each) {
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_for_each_callback_state);
+		if (err < 0)
+			return -EINVAL;
+	}
+
 	if (func_id == BPF_FUNC_timer_set_callback) {
 		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
 					set_timer_callback_state);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bd0c9f0487f6..ea5098920ed2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4750,6 +4750,28 @@ union bpf_attr {
  *		The number of traversed map elements for success, **-EINVAL** for
  *		invalid **flags**.
  *
+ * long bpf_for_each(u32 nr_iterations, void *callback_fn, void *callback_ctx, u64 flags)
+ *	Description
+ *		For **nr_iterations**, call **callback_fn** function with
+ *		**callback_ctx** as the context parameter.
+ *		The **callback_fn** should be a static function and
+ *		the **callback_ctx** should be a pointer to the stack.
+ *		The **flags** is used to control certain aspects of the helper.
+ *		Currently, the **flags** must be 0.
+ *
+ *		long (\*callback_fn)(u32 index, void \*ctx);
+ *
+ *		where **index** is the current index in the iteration. The index
+ *		is zero-indexed.
+ *
+ *		If **callback_fn** returns 0, the helper will continue to the next
+ *		iteration. If return value is 1, the helper will skip the rest of
+ *		the iterations and return. Other return values are not used now.
+ *
+ *	Return
+ *		The number of iterations performed, **-EINVAL** for invalid **flags**
+ *		or a null **callback_fn**.
+ *
  * long bpf_snprintf(char *str, u32 str_size, const char *fmt, u64 *data, u32 data_len)
  *	Description
  *		Outputs a string into the **str** buffer of size **str_size**
@@ -5105,6 +5127,7 @@ union bpf_attr {
 	FN(sock_from_file),		\
 	FN(check_mtu),			\
 	FN(for_each_map_elem),		\
+	FN(for_each),			\
 	FN(snprintf),			\
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
-- 
2.30.2



* [PATCH bpf-next 2/3] selftests/bpf: Add tests for bpf_for_each
  2021-11-18  1:04 [PATCH bpf-next 0/3] Add bpf_for_each helper Joanne Koong
  2021-11-18  1:04 ` [PATCH bpf-next 1/3] bpf: " Joanne Koong
@ 2021-11-18  1:04 ` Joanne Koong
  2021-11-18 20:23   ` Andrii Nakryiko
  2021-11-18  1:04 ` [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark Joanne Koong
  2021-11-18 11:14 ` [PATCH bpf-next 0/3] Add bpf_for_each helper Toke Høiland-Jørgensen
  3 siblings, 1 reply; 20+ messages in thread
From: Joanne Koong @ 2021-11-18  1:04 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team, Joanne Koong

In this patch -
1) Add a new prog "for_each_helper" which tests the basic functionality of
the bpf_for_each helper.

2) Add pyperf600_foreach and strobemeta_foreach to test the performance
of using bpf_for_each instead of a for loop.

The results of pyperf600 and strobemeta are as follows:

~strobemeta~

Baseline
    verification time 6808200 usec
    stack depth 496
    processed 592132 insns (limit 1000000) max_states_per_insn 14
    total_states 16018 peak_states 13684 mark_read 3132
    #188 verif_scale_strobemeta:OK (unrolled loop)

Using bpf_for_each
    verification time 31589 usec
    stack depth 96+408
    processed 1630 insns (limit 1000000) max_states_per_insn 4
    total_states 107 peak_states 107 mark_read 60
    #189 verif_scale_strobemeta_foreach:OK

~pyperf600~

Baseline
    verification time 29702486 usec
    stack depth 368
    processed 626838 insns (limit 1000000) max_states_per_insn 7
    total_states 30368 peak_states 30279 mark_read 748
    #182 verif_scale_pyperf600:OK (unrolled loop)

Using bpf_for_each
    verification time 148488 usec
    stack depth 320+40
    processed 10518 insns (limit 1000000) max_states_per_insn 10
    total_states 705 peak_states 517 mark_read 38
    #183 verif_scale_pyperf600_foreach:OK

Using the bpf_for_each helper led to approximately a 100% decrease
in the verification time and in the number of instructions.

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 .../bpf/prog_tests/bpf_verif_scale.c          | 12 +++
 .../selftests/bpf/prog_tests/for_each.c       | 61 ++++++++++++++++
 .../selftests/bpf/progs/for_each_helper.c     | 69 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/pyperf.h    | 70 +++++++++++++++++-
 .../selftests/bpf/progs/pyperf600_foreach.c   |  5 ++
 .../testing/selftests/bpf/progs/strobemeta.h  | 73 ++++++++++++++++++-
 .../selftests/bpf/progs/strobemeta_foreach.c  |  9 +++
 7 files changed, 295 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/for_each_helper.c
 create mode 100644 tools/testing/selftests/bpf/progs/pyperf600_foreach.c
 create mode 100644 tools/testing/selftests/bpf/progs/strobemeta_foreach.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
index 867349e4ed9e..77396484fde7 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
@@ -115,6 +115,12 @@ void test_verif_scale_pyperf600()
 	scale_test("pyperf600.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
 }
 
+void test_verif_scale_pyperf600_foreach(void)
+{
+	/* use the bpf_for_each helper */
+	scale_test("pyperf600_foreach.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
+}
+
 void test_verif_scale_pyperf600_nounroll()
 {
 	/* no unroll at all.
@@ -165,6 +171,12 @@ void test_verif_scale_strobemeta()
 	scale_test("strobemeta.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
 }
 
+void test_verif_scale_strobemeta_foreach(void)
+{
+	/* use the bpf_for_each helper */
+	scale_test("strobemeta_foreach.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
+}
+
 void test_verif_scale_strobemeta_nounroll1()
 {
 	/* no unroll, tiny loops */
diff --git a/tools/testing/selftests/bpf/prog_tests/for_each.c b/tools/testing/selftests/bpf/prog_tests/for_each.c
index 68eb12a287d4..529573a82334 100644
--- a/tools/testing/selftests/bpf/prog_tests/for_each.c
+++ b/tools/testing/selftests/bpf/prog_tests/for_each.c
@@ -4,6 +4,7 @@
 #include <network_helpers.h>
 #include "for_each_hash_map_elem.skel.h"
 #include "for_each_array_map_elem.skel.h"
+#include "for_each_helper.skel.h"
 
 static unsigned int duration;
 
@@ -121,10 +122,70 @@ static void test_array_map(void)
 	for_each_array_map_elem__destroy(skel);
 }
 
+static void test_for_each_helper(void)
+{
+	struct for_each_helper *skel;
+	__u32 retval;
+	int err;
+
+	skel = for_each_helper__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "for_each_helper__open_and_load"))
+		return;
+
+	skel->bss->nr_iterations = 100;
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+	if (CHECK(err || retval, "bpf_for_each helper test_prog",
+		  "err %d errno %d retval %d\n", err, errno, retval))
+		goto out;
+	ASSERT_EQ(skel->bss->nr_iterations_completed, skel->bss->nr_iterations,
+		  "nr_iterations mismatch");
+	ASSERT_EQ(skel->bss->g_output, (100 * 99) / 2, "wrong output");
+
+	/* test callback_fn returning 1 to stop iteration */
+	skel->bss->nr_iterations = 400;
+	skel->data->stop_index = 50;
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+	if (CHECK(err || retval, "bpf_for_each helper test_prog",
+		  "err %d errno %d retval %d\n", err, errno, retval))
+		goto out;
+	ASSERT_EQ(skel->bss->nr_iterations_completed, skel->data->stop_index + 1,
+		  "stop_index not followed");
+	ASSERT_EQ(skel->bss->g_output, (50 * 49) / 2, "wrong output");
+
+	/* test passing in a null ctx */
+	skel->bss->nr_iterations = 10;
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.prog_null_ctx),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+	if (CHECK(err || retval, "bpf_for_each helper prog_null_ctx",
+		  "err %d errno %d retval %d\n", err, errno, retval))
+		goto out;
+	ASSERT_EQ(skel->bss->nr_iterations_completed, skel->bss->nr_iterations,
+		  "nr_iterations mismatch");
+
+	/* test invalid flags */
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.prog_invalid_flags),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+	if (CHECK(err || retval, "bpf_for_each helper prog_invalid_flags",
+		  "err %d errno %d retval %d\n", err, errno, retval))
+		goto out;
+	ASSERT_EQ(skel->bss->err, -EINVAL, "invalid_flags");
+
+out:
+	for_each_helper__destroy(skel);
+}
+
 void test_for_each(void)
 {
 	if (test__start_subtest("hash_map"))
 		test_hash_map();
 	if (test__start_subtest("array_map"))
 		test_array_map();
+	if (test__start_subtest("for_each_helper"))
+		test_for_each_helper();
 }
diff --git a/tools/testing/selftests/bpf/progs/for_each_helper.c b/tools/testing/selftests/bpf/progs/for_each_helper.c
new file mode 100644
index 000000000000..4404d0cb32a6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/for_each_helper.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct callback_ctx {
+	int output;
+};
+
+/* This should be set by the user program */
+u32 nr_iterations;
+u32 stop_index = -1;
+
+/* Making these global variables so that the userspace program
+ * can verify the output through the skeleton
+ */
+int nr_iterations_completed;
+int g_output;
+int err;
+
+static int callback_fn(__u32 index, void *data)
+{
+	struct callback_ctx *ctx = data;
+
+	if (index >= stop_index)
+		return 1;
+
+	ctx->output += index;
+
+	return 0;
+}
+
+static int empty_callback_fn(__u32 index, void *data)
+{
+	return 0;
+}
+
+SEC("tc")
+int test_prog(struct __sk_buff *skb)
+{
+	struct callback_ctx data = {};
+
+	nr_iterations_completed = bpf_for_each(nr_iterations, callback_fn, &data, 0);
+
+	g_output = data.output;
+
+	return 0;
+}
+
+SEC("tc")
+int prog_null_ctx(struct __sk_buff *skb)
+{
+	nr_iterations_completed = bpf_for_each(nr_iterations, empty_callback_fn, NULL, 0);
+
+	return 0;
+}
+
+SEC("tc")
+int prog_invalid_flags(struct __sk_buff *skb)
+{
+	struct callback_ctx data = {};
+
+	err = bpf_for_each(nr_iterations, callback_fn, &data, 1);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/pyperf.h b/tools/testing/selftests/bpf/progs/pyperf.h
index 2fb7adafb6b6..186735e2f385 100644
--- a/tools/testing/selftests/bpf/progs/pyperf.h
+++ b/tools/testing/selftests/bpf/progs/pyperf.h
@@ -159,6 +159,57 @@ struct {
 	__uint(value_size, sizeof(long long) * 127);
 } stackmap SEC(".maps");
 
+struct process_frame_ctx {
+	int cur_cpu;
+	int32_t *symbol_counter;
+	void *frame_ptr;
+	FrameData *frame;
+	PidData *pidData;
+	Symbol *sym;
+	Event *event;
+	bool done;
+};
+
+#define barrier_var(var) asm volatile("" : "=r"(var) : "0"(var))
+
+static int process_frame_callback(__u32 i, struct process_frame_ctx *ctx)
+{
+	int zero = 0;
+	void *frame_ptr = ctx->frame_ptr;
+	PidData *pidData = ctx->pidData;
+	FrameData *frame = ctx->frame;
+	int32_t *symbol_counter = ctx->symbol_counter;
+	int cur_cpu = ctx->cur_cpu;
+	Event *event = ctx->event;
+	Symbol *sym = ctx->sym;
+
+	if (frame_ptr && get_frame_data(frame_ptr, pidData, frame, sym)) {
+		int32_t new_symbol_id = *symbol_counter * 64 + cur_cpu;
+		int32_t *symbol_id = bpf_map_lookup_elem(&symbolmap, sym);
+
+		if (!symbol_id) {
+			bpf_map_update_elem(&symbolmap, sym, &zero, 0);
+			symbol_id = bpf_map_lookup_elem(&symbolmap, sym);
+			if (!symbol_id) {
+				ctx->done = true;
+				return 1;
+			}
+		}
+		if (*symbol_id == new_symbol_id)
+			(*symbol_counter)++;
+
+		barrier_var(i);
+		if (i >= STACK_MAX_LEN)
+			return 1;
+
+		event->stack[i] = *symbol_id;
+
+		event->stack_len = i + 1;
+		frame_ptr = frame->f_back;
+	}
+	return 0;
+}
+
 #ifdef GLOBAL_FUNC
 __noinline
 #elif defined(SUBPROGS)
@@ -228,11 +279,27 @@ int __on_event(struct bpf_raw_tracepoint_args *ctx)
 		int32_t* symbol_counter = bpf_map_lookup_elem(&symbolmap, &sym);
 		if (symbol_counter == NULL)
 			return 0;
+#ifdef USE_FOREACH
+	struct process_frame_ctx ctx;
+
+	ctx.cur_cpu = cur_cpu;
+	ctx.symbol_counter = symbol_counter;
+	ctx.frame_ptr = frame_ptr;
+	ctx.frame = &frame;
+	ctx.pidData = pidData;
+	ctx.sym = &sym;
+	ctx.event = event;
+	ctx.done = false;
+
+	bpf_for_each(STACK_MAX_LEN, process_frame_callback, &ctx, 0);
+	if (ctx.done)
+		return 0;
+#else
 #ifdef NO_UNROLL
 #pragma clang loop unroll(disable)
 #else
 #pragma clang loop unroll(full)
-#endif
+#endif /* NO_UNROLL */
 		/* Unwind python stack */
 		for (int i = 0; i < STACK_MAX_LEN; ++i) {
 			if (frame_ptr && get_frame_data(frame_ptr, pidData, &frame, &sym)) {
@@ -251,6 +318,7 @@ int __on_event(struct bpf_raw_tracepoint_args *ctx)
 				frame_ptr = frame.f_back;
 			}
 		}
+#endif /* USE_FOREACH */
 		event->stack_complete = frame_ptr == NULL;
 	} else {
 		event->stack_complete = 1;
diff --git a/tools/testing/selftests/bpf/progs/pyperf600_foreach.c b/tools/testing/selftests/bpf/progs/pyperf600_foreach.c
new file mode 100644
index 000000000000..1b43f12a7d54
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/pyperf600_foreach.c
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2021 Facebook
+#define STACK_MAX_LEN 600
+#define USE_FOREACH
+#include "pyperf.h"
diff --git a/tools/testing/selftests/bpf/progs/strobemeta.h b/tools/testing/selftests/bpf/progs/strobemeta.h
index 7de534f38c3f..40069e797930 100644
--- a/tools/testing/selftests/bpf/progs/strobemeta.h
+++ b/tools/testing/selftests/bpf/progs/strobemeta.h
@@ -445,6 +445,46 @@ static __always_inline void *read_map_var(struct strobemeta_cfg *cfg,
 	return payload;
 }
 
+enum read_type {
+	READ_INT_VAR,
+	READ_MAP_VAR,
+	READ_STR_VAR,
+};
+
+struct read_var_ctx {
+	struct strobemeta_payload *data;
+	void *tls_base;
+	struct strobemeta_cfg *cfg;
+	void *payload;
+	/* value gets mutated */
+	struct strobe_value_generic *value;
+	enum read_type type;
+};
+
+static int read_var_callback(__u32 index, struct read_var_ctx *ctx)
+{
+	switch (ctx->type) {
+	case READ_INT_VAR:
+		if (index >= STROBE_MAX_INTS)
+			return 1;
+		read_int_var(ctx->cfg, index, ctx->tls_base, ctx->value, ctx->data);
+		break;
+	case READ_MAP_VAR:
+		if (index >= STROBE_MAX_MAPS)
+			return 1;
+		ctx->payload = read_map_var(ctx->cfg, index, ctx->tls_base,
+					    ctx->value, ctx->data, ctx->payload);
+		break;
+	case READ_STR_VAR:
+		if (index >= STROBE_MAX_STRS)
+			return 1;
+		ctx->payload += read_str_var(ctx->cfg, index, ctx->tls_base,
+					     ctx->value, ctx->data, ctx->payload);
+		break;
+	}
+	return 0;
+}
+
 /*
  * read_strobe_meta returns NULL, if no metadata was read; otherwise returns
  * pointer to *right after* payload ends
@@ -475,11 +515,36 @@ static void *read_strobe_meta(struct task_struct *task,
 	 */
 	tls_base = (void *)task;
 
+#ifdef USE_FOREACH
+	struct read_var_ctx ctx;
+	int err;
+
+	ctx.cfg = cfg;
+	ctx.tls_base = tls_base;
+	ctx.value = &value;
+	ctx.data = data;
+	ctx.payload = payload;
+	ctx.type = READ_INT_VAR;
+
+	err = bpf_for_each(STROBE_MAX_INTS, read_var_callback, &ctx, 0);
+	if (err != STROBE_MAX_INTS)
+		return NULL;
+
+	ctx.type = READ_STR_VAR;
+	err = bpf_for_each(STROBE_MAX_STRS, read_var_callback, &ctx, 0);
+	if (err != STROBE_MAX_STRS)
+		return NULL;
+
+	ctx.type = READ_MAP_VAR;
+	err = bpf_for_each(STROBE_MAX_MAPS, read_var_callback, &ctx, 0);
+	if (err != STROBE_MAX_MAPS)
+		return NULL;
+#else
 #ifdef NO_UNROLL
 #pragma clang loop unroll(disable)
 #else
 #pragma unroll
-#endif
+#endif /* NO_UNROLL */
 	for (int i = 0; i < STROBE_MAX_INTS; ++i) {
 		read_int_var(cfg, i, tls_base, &value, data);
 	}
@@ -487,7 +552,7 @@ static void *read_strobe_meta(struct task_struct *task,
 #pragma clang loop unroll(disable)
 #else
 #pragma unroll
-#endif
+#endif /* NO_UNROLL */
 	for (int i = 0; i < STROBE_MAX_STRS; ++i) {
 		payload += read_str_var(cfg, i, tls_base, &value, data, payload);
 	}
@@ -495,10 +560,12 @@ static void *read_strobe_meta(struct task_struct *task,
 #pragma clang loop unroll(disable)
 #else
 #pragma unroll
-#endif
+#endif /* NO_UNROLL */
 	for (int i = 0; i < STROBE_MAX_MAPS; ++i) {
 		payload = read_map_var(cfg, i, tls_base, &value, data, payload);
 	}
+#endif /* USE_FOREACH */
+
 	/*
 	 * return pointer right after end of payload, so it's possible to
 	 * calculate exact amount of useful data that needs to be sent
diff --git a/tools/testing/selftests/bpf/progs/strobemeta_foreach.c b/tools/testing/selftests/bpf/progs/strobemeta_foreach.c
new file mode 100644
index 000000000000..a2d57d2d68ad
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/strobemeta_foreach.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+// Copyright (c) 2021 Facebook
+
+#define STROBE_MAX_INTS 2
+#define STROBE_MAX_STRS 25
+#define STROBE_MAX_MAPS 100
+#define STROBE_MAX_MAP_ENTRIES 20
+#define USE_FOREACH
+#include "strobemeta.h"
-- 
2.30.2



* [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark
  2021-11-18  1:04 [PATCH bpf-next 0/3] Add bpf_for_each helper Joanne Koong
  2021-11-18  1:04 ` [PATCH bpf-next 1/3] bpf: " Joanne Koong
  2021-11-18  1:04 ` [PATCH bpf-next 2/3] selftests/bpf: Add tests for bpf_for_each Joanne Koong
@ 2021-11-18  1:04 ` Joanne Koong
  2021-11-18 11:18   ` Toke Høiland-Jørgensen
  2021-11-18 20:00   ` Andrii Nakryiko
  2021-11-18 11:14 ` [PATCH bpf-next 0/3] Add bpf_for_each helper Toke Høiland-Jørgensen
  3 siblings, 2 replies; 20+ messages in thread
From: Joanne Koong @ 2021-11-18  1:04 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team, Joanne Koong

Add benchmark to measure the overhead of the bpf_for_each call
for a specified number of iterations.

Testing this on qemu on my dev machine on 1 thread, the data is
as follows:

        nr_iterations: 1
bpf_for_each helper - total callbacks called:  42.949 ± 1.404M/s

        nr_iterations: 10
bpf_for_each helper - total callbacks called:  73.645 ± 2.077M/s

        nr_iterations: 100
bpf_for_each helper - total callbacks called:  73.058 ± 1.256M/s

        nr_iterations: 500
bpf_for_each helper - total callbacks called:  78.255 ± 2.845M/s

        nr_iterations: 1000
bpf_for_each helper - total callbacks called:  79.439 ± 1.805M/s

        nr_iterations: 5000
bpf_for_each helper - total callbacks called:  81.639 ± 2.053M/s

        nr_iterations: 10000
bpf_for_each helper - total callbacks called:  80.577 ± 1.824M/s

        nr_iterations: 50000
bpf_for_each helper - total callbacks called:  76.773 ± 1.578M/s

        nr_iterations: 100000
bpf_for_each helper - total callbacks called:  77.073 ± 2.200M/s

        nr_iterations: 500000
bpf_for_each helper - total callbacks called:  75.136 ± 0.552M/s

        nr_iterations: 1000000
bpf_for_each helper - total callbacks called:  76.364 ± 1.690M/s

From this data, we can see that we are able to run the loop at
least 40 million times per second on an empty callback function.

From this data, we can also see that as the number of iterations
increases, the overhead per iteration decreases and steadies towards
a constant value.
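
Any individual data point can be reproduced with the bench binary built by
the selftests Makefile, e.g. (using the flag spellings from the patch below):

  ./bench -p 1 --nr_iters 1000 for-each-helper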

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   3 +-
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_for_each.c     | 105 ++++++++++++++++++
 .../bpf/benchs/run_bench_for_each.sh          |  16 +++
 .../selftests/bpf/progs/for_each_helper.c     |  13 +++
 5 files changed, 140 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_for_each.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_for_each.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index f49cb5fc85af..b55fc72b8ef0 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -537,7 +537,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o $(OUTPUT)/testing_helpers.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
 		 $(OUTPUT)/bench_ringbufs.o \
-		 $(OUTPUT)/bench_bloom_filter_map.o
+		 $(OUTPUT)/bench_bloom_filter_map.o \
+		 $(OUTPUT)/bench_for_each.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index cc4722f693e9..d8b3d537a700 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -171,10 +171,12 @@ static const struct argp_option opts[] = {
 
 extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
+extern struct argp bench_for_each_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
+	{ &bench_for_each_argp, 0, "bpf_for_each helper benchmark", 0 },
 	{},
 };
 
@@ -368,6 +370,7 @@ extern const struct bench bench_bloom_update;
 extern const struct bench bench_bloom_false_positive;
 extern const struct bench bench_hashmap_without_bloom;
 extern const struct bench bench_hashmap_with_bloom;
+extern const struct bench bench_for_each_helper;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -394,6 +397,7 @@ static const struct bench *benchs[] = {
 	&bench_bloom_false_positive,
 	&bench_hashmap_without_bloom,
 	&bench_hashmap_with_bloom,
+	&bench_for_each_helper,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_for_each.c b/tools/testing/selftests/bpf/benchs/bench_for_each.c
new file mode 100644
index 000000000000..3372d5b7d67b
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_for_each.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <argp.h>
+#include "bench.h"
+#include "for_each_helper.skel.h"
+
+/* BPF triggering benchmarks */
+static struct ctx {
+	struct for_each_helper *skel;
+} ctx;
+
+static struct {
+	__u32 nr_iters;
+} args = {
+	.nr_iters = 10,
+};
+
+enum {
+	ARG_NR_ITERS = 4000,
+};
+
+static const struct argp_option opts[] = {
+	{ "nr_iters", ARG_NR_ITERS, "nr_iters", 0,
+		"Set number of iterations for the bpf_for_each helper"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_NR_ITERS:
+		args.nr_iters = strtol(arg, NULL, 10);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+/* exported into benchmark runner */
+const struct argp bench_for_each_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+static void validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static void *producer(void *input)
+{
+	while (true)
+		/* trigger the bpf program */
+		syscall(__NR_getpgid);
+
+	return NULL;
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+static void measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+}
+
+static void setup(void)
+{
+	struct bpf_link *link;
+
+	setup_libbpf();
+
+	ctx.skel = for_each_helper__open_and_load();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	link = bpf_program__attach(ctx.skel->progs.benchmark);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+
+	ctx.skel->bss->nr_iterations = args.nr_iters;
+}
+
+const struct bench bench_for_each_helper = {
+	.name = "for-each-helper",
+	.validate = validate,
+	.setup = setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_for_each.sh b/tools/testing/selftests/bpf/benchs/run_bench_for_each.sh
new file mode 100755
index 000000000000..5f11a1ad66d3
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_for_each.sh
@@ -0,0 +1,16 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+for t in 1 4 8 12 16; do
+	printf "\n"
+	for i in 1 10 100 500 1000 5000 10000 50000 100000 500000 1000000; do
+		subtitle "nr_iterations: $i, nr_threads: $t"
+		summarize "bpf_for_each helper - total callbacks called: " \
+			"$($RUN_BENCH -p $t --nr_iters $i for-each-helper)"
+		printf "\n"
+	done
+done
diff --git a/tools/testing/selftests/bpf/progs/for_each_helper.c b/tools/testing/selftests/bpf/progs/for_each_helper.c
index 4404d0cb32a6..b95551d99f75 100644
--- a/tools/testing/selftests/bpf/progs/for_each_helper.c
+++ b/tools/testing/selftests/bpf/progs/for_each_helper.c
@@ -14,6 +14,8 @@ struct callback_ctx {
 u32 nr_iterations;
 u32 stop_index = -1;
 
+long hits;
+
 /* Making these global variables so that the userspace program
  * can verify the output through the skeleton
  */
@@ -67,3 +69,14 @@ int prog_invalid_flags(struct __sk_buff *skb)
 
 	return 0;
 }
+
+SEC("fentry/__x64_sys_getpgid")
+int benchmark(void *ctx)
+{
+	for (int i = 0; i < 1000; i++) {
+		bpf_for_each(nr_iterations, empty_callback_fn, NULL, 0);
+
+		__sync_add_and_fetch(&hits, nr_iterations);
+	}
+	return 0;
+}
-- 
2.30.2



* Re: [PATCH bpf-next 1/3] bpf: Add bpf_for_each helper
  2021-11-18  1:04 ` [PATCH bpf-next 1/3] bpf: " Joanne Koong
@ 2021-11-18 11:11   ` Toke Høiland-Jørgensen
  2021-11-18 18:03     ` Yonghong Song
  2021-11-18 17:59   ` Yonghong Song
  2021-11-18 20:14   ` Andrii Nakryiko
  2 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-18 11:11 UTC (permalink / raw)
  To: Joanne Koong, bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team, Joanne Koong

Joanne Koong <joannekoong@fb.com> writes:

> This patch adds the kernel-side and API changes for a new helper
> function, bpf_for_each:
>
> long bpf_for_each(u32 nr_iterations, void *callback_fn,
> void *callback_ctx, u64 flags);
>
> bpf_for_each invokes the "callback_fn" nr_iterations number of times
> or until the callback_fn returns 1.
>
> A few things to note:
> ~ The "u64 flags" parameter is currently unused but is included in
> case a future use case for it arises.
> ~ In the kernel-side implementation of bpf_for_each (kernel/bpf/bpf_iter.c),
> bpf_callback_t is used as the callback function cast.
> ~ A program can have nested bpf_for_each calls but the program must
> still adhere to the verifier constraint of its stack depth (the stack depth
> cannot exceed MAX_BPF_STACK)
> ~ The next patch will include the tests and benchmark
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>

Great to see this! One small nit, below, but otherwise:

Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>


> ---
>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       | 23 +++++++++++++++++++++++
>  kernel/bpf/bpf_iter.c          | 32 ++++++++++++++++++++++++++++++++
>  kernel/bpf/helpers.c           |  2 ++
>  kernel/bpf/verifier.c          | 28 ++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++++
>  6 files changed, 109 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 6deebf8bf78f..d9b69a896c91 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2107,6 +2107,7 @@ extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
>  extern const struct bpf_func_proto bpf_task_storage_get_proto;
>  extern const struct bpf_func_proto bpf_task_storage_delete_proto;
>  extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
> +extern const struct bpf_func_proto bpf_for_each_proto;
>  extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
>  extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
>  extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index bd0c9f0487f6..ea5098920ed2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4750,6 +4750,28 @@ union bpf_attr {
>   *		The number of traversed map elements for success, **-EINVAL** for
>   *		invalid **flags**.
>   *
> + * long bpf_for_each(u32 nr_iterations, void *callback_fn, void *callback_ctx, u64 flags)
> + *	Description
> + *		For **nr_iterations**, call **callback_fn** function with
> + *		**callback_ctx** as the context parameter.
> + *		The **callback_fn** should be a static function and
> + *		the **callback_ctx** should be a pointer to the stack.
> + *		The **flags** is used to control certain aspects of the helper.
> + *		Currently, the **flags** must be 0.
> + *
> + *		long (\*callback_fn)(u32 index, void \*ctx);
> + *
> + *		where **index** is the current index in the iteration. The index
> + *		is zero-indexed.
> + *
> + *		If **callback_fn** returns 0, the helper will continue to the next
> + *		iteration. If return value is 1, the helper will skip the rest of
> + *		the iterations and return. Other return values are not used now.

The code will actually return for any non-zero value, though? So
shouldn't the documentation reflect this? Or, alternatively, should the
verifier enforce that the function can only return 0 or 1?



* Re: [PATCH bpf-next 0/3] Add bpf_for_each helper
  2021-11-18  1:04 [PATCH bpf-next 0/3] Add bpf_for_each helper Joanne Koong
                   ` (2 preceding siblings ...)
  2021-11-18  1:04 ` [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark Joanne Koong
@ 2021-11-18 11:14 ` Toke Høiland-Jørgensen
  2021-11-19  2:35   ` Joanne Koong
  3 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-18 11:14 UTC (permalink / raw)
  To: Joanne Koong, bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team, Joanne Koong

Joanne Koong <joannekoong@fb.com> writes:

> This patchset adds a new helper, bpf_for_each.
>
> One of the complexities of using for loops in bpf programs is that the verifier
> needs to ensure that the loop terminates on every possible path through the
> loop logic. As such, there is a limit on how many iterations the loop can run.
>
> The bpf_for_each helper moves the loop logic into the kernel and can thereby
> guarantee that the loop always terminates. This removes much of the complexity
> the verifier otherwise needs to check, as well as the constraint on the number
> of iterations that can be run.
>
> From the test results, we see that using bpf_for_each in place
> of the traditional for loop led to a decrease in verification time
> and number of bpf instructions by 100%. The benchmark results show
> that as the number of iterations increases, the overhead per iteration
> decreases.

Small nit with the "by 100%" formulation: when giving such relative
quantities, 100% has a particular meaning, namely "eliminates entirely".
Which this doesn't, obviously; it *almost* eliminates the verification
overhead. So I'd change this to 99.5% instead (which is the actual value
from your numbers in patch 2).



* Re: [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark
  2021-11-18  1:04 ` [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark Joanne Koong
@ 2021-11-18 11:18   ` Toke Høiland-Jørgensen
  2021-11-18 19:55     ` Andrii Nakryiko
  2021-11-18 20:00   ` Andrii Nakryiko
  1 sibling, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-18 11:18 UTC (permalink / raw)
  To: Joanne Koong, bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team, Joanne Koong

Joanne Koong <joannekoong@fb.com> writes:

> Add benchmark to measure the overhead of the bpf_for_each call
> for a specified number of iterations.
>
> Testing this on qemu on my dev machine on 1 thread, the data is
> as follows:

Absolute numbers from some random dev machine are not terribly useful;
others have no way of replicating your tests. A more meaningful
benchmark would need a baseline to compare to; in this case I guess that
would be a regular loop? Do you have any numbers comparing the callback
to just looping?

-Toke



* Re: [PATCH bpf-next 1/3] bpf: Add bpf_for_each helper
  2021-11-18  1:04 ` [PATCH bpf-next 1/3] bpf: " Joanne Koong
  2021-11-18 11:11   ` Toke Høiland-Jørgensen
@ 2021-11-18 17:59   ` Yonghong Song
  2021-11-19  2:40     ` Joanne Koong
  2021-11-18 20:14   ` Andrii Nakryiko
  2 siblings, 1 reply; 20+ messages in thread
From: Yonghong Song @ 2021-11-18 17:59 UTC (permalink / raw)
  To: Joanne Koong, bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team



On 11/17/21 5:04 PM, Joanne Koong wrote:
> This patch adds the kernel-side and API changes for a new helper
> function, bpf_for_each:
> 
> long bpf_for_each(u32 nr_iterations, void *callback_fn,
> void *callback_ctx, u64 flags);
> 
> bpf_for_each invokes the "callback_fn" nr_iterations number of times
> or until the callback_fn returns 1.
> 
> A few things to note:
> ~ The "u64 flags" parameter is currently unused but is included in
> case a future use case for it arises.
> ~ In the kernel-side implementation of bpf_for_each (kernel/bpf/bpf_iter.c),
> bpf_callback_t is used as the callback function cast.
> ~ A program can have nested bpf_for_each calls but the program must
> still adhere to the verifier constraint of its stack depth (the stack depth
> cannot exceed MAX_BPF_STACK)
> ~ The next patch will include the tests and benchmark
> 
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>   include/linux/bpf.h            |  1 +
>   include/uapi/linux/bpf.h       | 23 +++++++++++++++++++++++
>   kernel/bpf/bpf_iter.c          | 32 ++++++++++++++++++++++++++++++++
>   kernel/bpf/helpers.c           |  2 ++
>   kernel/bpf/verifier.c          | 28 ++++++++++++++++++++++++++++
>   tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++++
>   6 files changed, 109 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 6deebf8bf78f..d9b69a896c91 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2107,6 +2107,7 @@ extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
>   extern const struct bpf_func_proto bpf_task_storage_get_proto;
>   extern const struct bpf_func_proto bpf_task_storage_delete_proto;
>   extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
> +extern const struct bpf_func_proto bpf_for_each_proto;
>   extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
>   extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
>   extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index bd0c9f0487f6..ea5098920ed2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4750,6 +4750,28 @@ union bpf_attr {
>    *		The number of traversed map elements for success, **-EINVAL** for
>    *		invalid **flags**.
>    *
> + * long bpf_for_each(u32 nr_iterations, void *callback_fn, void *callback_ctx, u64 flags)
> + *	Description
> + *		For **nr_iterations**, call **callback_fn** function with
> + *		**callback_ctx** as the context parameter.
> + *		The **callback_fn** should be a static function and
> + *		the **callback_ctx** should be a pointer to the stack.
> + *		The **flags** is used to control certain aspects of the helper.
> + *		Currently, the **flags** must be 0.
> + *
> + *		long (\*callback_fn)(u32 index, void \*ctx);
> + *
> + *		where **index** is the current index in the iteration. The index
> + *		is zero-indexed.
> + *
> + *		If **callback_fn** returns 0, the helper will continue to the next
> + *		iteration. If return value is 1, the helper will skip the rest of
> + *		the iterations and return. Other return values are not used now.
> + *
> + *	Return
> + *		The number of iterations performed, **-EINVAL** for invalid **flags**
> + *		or a null **callback_fn**.

I think the verifier enforces a non-null callback_fn, right?

> + *
>    * long bpf_snprintf(char *str, u32 str_size, const char *fmt, u64 *data, u32 data_len)
>    *	Description
>    *		Outputs a string into the **str** buffer of size **str_size**
> @@ -5105,6 +5127,7 @@ union bpf_attr {
>   	FN(sock_from_file),		\
>   	FN(check_mtu),			\
>   	FN(for_each_map_elem),		\
> +	FN(for_each),			\

Please put the for_each helper at the end of the list. Otherwise, it will
break uapi.

>   	FN(snprintf),			\
>   	FN(sys_bpf),			\
>   	FN(btf_find_by_name_kind),	\
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index b2ee45064e06..cb742c50898a 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -714,3 +714,35 @@ const struct bpf_func_proto bpf_for_each_map_elem_proto = {
>   	.arg3_type	= ARG_PTR_TO_STACK_OR_NULL,
>   	.arg4_type	= ARG_ANYTHING,
>   };
> +
[...]


* Re: [PATCH bpf-next 1/3] bpf: Add bpf_for_each helper
  2021-11-18 11:11   ` Toke Høiland-Jørgensen
@ 2021-11-18 18:03     ` Yonghong Song
  2021-11-19 12:17       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 20+ messages in thread
From: Yonghong Song @ 2021-11-18 18:03 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Joanne Koong, bpf
  Cc: andrii, ast, daniel, kafai, Kernel-team



On 11/18/21 3:11 AM, Toke Høiland-Jørgensen wrote:
> Joanne Koong <joannekoong@fb.com> writes:
> 
>> This patch adds the kernel-side and API changes for a new helper
>> function, bpf_for_each:
>>
>> long bpf_for_each(u32 nr_iterations, void *callback_fn,
>> void *callback_ctx, u64 flags);
>>
>> bpf_for_each invokes the "callback_fn" nr_iterations number of times
>> or until the callback_fn returns 1.
>>
>> A few things to note:
>> ~ The "u64 flags" parameter is currently unused but is included in
>> case a future use case for it arises.
>> ~ In the kernel-side implementation of bpf_for_each (kernel/bpf/bpf_iter.c),
>> bpf_callback_t is used as the callback function cast.
>> ~ A program can have nested bpf_for_each calls but the program must
>> still adhere to the verifier constraint of its stack depth (the stack depth
>> cannot exceed MAX_BPF_STACK)
>> ~ The next patch will include the tests and benchmark
>>
>> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> 
> Great to see this! One small nit, below, but otherwise:
> 
> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> 
>> ---
>>   include/linux/bpf.h            |  1 +
>>   include/uapi/linux/bpf.h       | 23 +++++++++++++++++++++++
>>   kernel/bpf/bpf_iter.c          | 32 ++++++++++++++++++++++++++++++++
>>   kernel/bpf/helpers.c           |  2 ++
>>   kernel/bpf/verifier.c          | 28 ++++++++++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++++
>>   6 files changed, 109 insertions(+)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 6deebf8bf78f..d9b69a896c91 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -2107,6 +2107,7 @@ extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
>>   extern const struct bpf_func_proto bpf_task_storage_get_proto;
>>   extern const struct bpf_func_proto bpf_task_storage_delete_proto;
>>   extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
>> +extern const struct bpf_func_proto bpf_for_each_proto;
>>   extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
>>   extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
>>   extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index bd0c9f0487f6..ea5098920ed2 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -4750,6 +4750,28 @@ union bpf_attr {
>>    *		The number of traversed map elements for success, **-EINVAL** for
>>    *		invalid **flags**.
>>    *
>> + * long bpf_for_each(u32 nr_iterations, void *callback_fn, void *callback_ctx, u64 flags)
>> + *	Description
>> + *		For **nr_iterations**, call **callback_fn** function with
>> + *		**callback_ctx** as the context parameter.
>> + *		The **callback_fn** should be a static function and
>> + *		the **callback_ctx** should be a pointer to the stack.
>> + *		The **flags** is used to control certain aspects of the helper.
>> + *		Currently, the **flags** must be 0.
>> + *
>> + *		long (\*callback_fn)(u32 index, void \*ctx);
>> + *
>> + *		where **index** is the current index in the iteration. The index
>> + *		is zero-indexed.
>> + *
>> + *		If **callback_fn** returns 0, the helper will continue to the next
>> + *		iteration. If return value is 1, the helper will skip the rest of
>> + *		the iterations and return. Other return values are not used now.
> 
> The code will actually return for any non-zero value, though? So
> shouldn't the documentation reflect this? Or, alternatively, should the
> verifier enforce that the function can only return 0 or 1?

This is enforced in verifier.c prepare_func_exit().

         if (callee->in_callback_fn) {
                 /* enforce R0 return value range [0, 1]. */
                 struct tnum range = tnum_range(0, 1);

                 if (r0->type != SCALAR_VALUE) {
                         verbose(env, "R0 not a scalar value\n");
                         return -EACCES;
                 }
                 if (!tnum_in(range, r0->var_off)) {
                         verbose_invalid_scalar(env, r0, &range,
                                                "callback return", "R0");
                         return -EINVAL;
                 }
         }

> 


* Re: [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark
  2021-11-18 11:18   ` Toke Høiland-Jørgensen
@ 2021-11-18 19:55     ` Andrii Nakryiko
  2021-11-19 13:04       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 20+ messages in thread
From: Andrii Nakryiko @ 2021-11-18 19:55 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin Lau, Kernel Team

On Thu, Nov 18, 2021 at 3:18 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Joanne Koong <joannekoong@fb.com> writes:
>
> > Add benchmark to measure the overhead of the bpf_for_each call
> > for a specified number of iterations.
> >
> > Testing this on qemu on my dev machine on 1 thread, the data is
> > as follows:
>
> Absolute numbers from some random dev machine are not terribly useful;
> others have no way of replicating your tests. A more meaningful
> benchmark would need a baseline to compare to; in this case I guess that
> would be a regular loop? Do you have any numbers comparing the callback
> to just looping?

Measuring an empty for (int i = 0; i < N; i++) is meaningless; you should
expect a number in the billions of "operations" per second on modern
server CPUs, so that will tell you nothing. Those numbers are useful
as a ballpark figure for the overhead of the bpf_for_each() helper and
its callbacks. And ~12ns per "iteration" (the ~80M callbacks/s results
work out to roughly 12.5ns per callback) is enough to get a good idea
of how slow that can be. Depending on your hardware it can be
different by 2x, maybe 3x, but not 100x.

But measuring inc + cmp + jne as a baseline is both unrealistic and
doesn't give much extra information. But you can assume 2B/s, give or
take (i.e., roughly 0.5ns per bare loop iteration).

And you also can run this benchmark on your own on your hardware to
get "real" numbers, as much as you can expect real numbers from
artificial microbenchmark, of course.


I read those numbers as "plenty fast" :)

>
> -Toke
>


* Re: [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark
  2021-11-18  1:04 ` [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark Joanne Koong
  2021-11-18 11:18   ` Toke Høiland-Jørgensen
@ 2021-11-18 20:00   ` Andrii Nakryiko
  1 sibling, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2021-11-18 20:00 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Martin Lau, Kernel Team

On Wed, Nov 17, 2021 at 5:07 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> Add benchmark to measure the overhead of the bpf_for_each call
> for a specified number of iterations.
>
> Testing this on qemu on my dev machine on 1 thread, the data is
> as follows:
>
>         nr_iterations: 1
> bpf_for_each helper - total callbacks called:  42.949 ± 1.404M/s
>
>         nr_iterations: 10
> bpf_for_each helper - total callbacks called:  73.645 ± 2.077M/s
>
>         nr_iterations: 100
> bpf_for_each helper - total callbacks called:  73.058 ± 1.256M/s
>
>         nr_iterations: 500
> bpf_for_each helper - total callbacks called:  78.255 ± 2.845M/s
>
>         nr_iterations: 1000
> bpf_for_each helper - total callbacks called:  79.439 ± 1.805M/s
>
>         nr_iterations: 5000
> bpf_for_each helper - total callbacks called:  81.639 ± 2.053M/s
>
>         nr_iterations: 10000
> bpf_for_each helper - total callbacks called:  80.577 ± 1.824M/s
>
>         nr_iterations: 50000
> bpf_for_each helper - total callbacks called:  76.773 ± 1.578M/s
>
>         nr_iterations: 100000
> bpf_for_each helper - total callbacks called:  77.073 ± 2.200M/s
>
>         nr_iterations: 500000
> bpf_for_each helper - total callbacks called:  75.136 ± 0.552M/s
>
>         nr_iterations: 1000000
> bpf_for_each helper - total callbacks called:  76.364 ± 1.690M/s

A bit unclear why the numbers go down with increased nr_iterations; I'd
expect them to stabilize. Try running bench with the -a argument to set
CPU affinity; that usually improves the stability of test results.
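
E.g. something like (assuming the -a/--affinity spelling from bench.c):

  ./bench -a -p 1 --nr_iters 1000000 for-each-helper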

>
> From this data, we can see that we are able to run the loop at
> least 40 million times per second on an empty callback function.
>
> From this data, we can also see that as the number of iterations
> increases, the overhead per iteration decreases and steadies towards
> a constant value.
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>  tools/testing/selftests/bpf/Makefile          |   3 +-
>  tools/testing/selftests/bpf/bench.c           |   4 +
>  .../selftests/bpf/benchs/bench_for_each.c     | 105 ++++++++++++++++++
>  .../bpf/benchs/run_bench_for_each.sh          |  16 +++
>  .../selftests/bpf/progs/for_each_helper.c     |  13 +++

$ ls progs/*bench*
progs/bloom_filter_bench.c  progs/perfbuf_bench.c
progs/ringbuf_bench.c  progs/trigger_bench.c

let's keep the naming pattern


>  5 files changed, 140 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_for_each.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_for_each.sh
>

[...]


* Re: [PATCH bpf-next 1/3] bpf: Add bpf_for_each helper
  2021-11-18  1:04 ` [PATCH bpf-next 1/3] bpf: " Joanne Koong
  2021-11-18 11:11   ` Toke Høiland-Jørgensen
  2021-11-18 17:59   ` Yonghong Song
@ 2021-11-18 20:14   ` Andrii Nakryiko
       [not found]     ` <9cf25708-c878-65db-0dfd-a76e83fe9e39@fb.com>
  2 siblings, 1 reply; 20+ messages in thread
From: Andrii Nakryiko @ 2021-11-18 20:14 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Martin Lau, Kernel Team

On Wed, Nov 17, 2021 at 5:06 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> This patch adds the kernel-side and API changes for a new helper
> function, bpf_for_each:
>
> long bpf_for_each(u32 nr_iterations, void *callback_fn,
> void *callback_ctx, u64 flags);

foreach in other languages is usually used when you are iterating over
elements of some data structure or stream of data, so the naming feels
slightly off. bpf_loop() or bpf_for_range() seems to point more directly
to what's going on. My 2 cents, it's subjective, of course.


>
> bpf_for_each invokes the "callback_fn" nr_iterations number of times
> or until the callback_fn returns 1.

As Toke mentioned, we don't really check for 1. Enforcing it on the
verifier side is just going to cause more trouble for users and doesn't
seem important. I can see two ways to define the semantics, with error
propagation and without.

For error propagation, we can define:
  - >0, break and return number of iterations performed in total;
  - 0, continue to next iteration, if no more iterations, return
number of iterations performed;
  - <0, break and return that error value (but no way to know at which
iteration this happened, except through custom context);

Or we can make it simpler and just:
  - 0, continue;
  - != 0, break;
  - always return number of iterations performed.


No strong preferences on my side; I see benefits to both ways.
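
Roughly, the error-propagating variant could look like this inside the
helper (a sketch only, not what this patch implements; ret is the
callback's u64 return value):

	for (i = 0; i < nr_iterations; i++) {
		ret = callback((u64)i, (u64)(long)callback_ctx, 0, 0, 0);
		if ((long)ret < 0)
			return ret;	/* break, propagate the error */
		if (ret > 0) {
			i++;		/* count the breaking iteration */
			break;
		}
	}
	return i;			/* iterations performed */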

>
> A few things to please note:
> ~ The "u64 flags" parameter is currently unused but is included in
> case a future use case for it arises.
> ~ In the kernel-side implementation of bpf_for_each (kernel/bpf/bpf_iter.c),
> bpf_callback_t is used as the callback function cast.
> ~ A program can have nested bpf_for_each calls but the program must
> still adhere to the verifier constraint of its stack depth (the stack depth
> cannot exceed MAX_BPF_STACK)
> ~ The next patch will include the tests and benchmark
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       | 23 +++++++++++++++++++++++
>  kernel/bpf/bpf_iter.c          | 32 ++++++++++++++++++++++++++++++++
>  kernel/bpf/helpers.c           |  2 ++
>  kernel/bpf/verifier.c          | 28 ++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++++
>  6 files changed, 109 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 6deebf8bf78f..d9b69a896c91 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2107,6 +2107,7 @@ extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
>  extern const struct bpf_func_proto bpf_task_storage_get_proto;
>  extern const struct bpf_func_proto bpf_task_storage_delete_proto;
>  extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
> +extern const struct bpf_func_proto bpf_for_each_proto;
>  extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
>  extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
>  extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index bd0c9f0487f6..ea5098920ed2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4750,6 +4750,28 @@ union bpf_attr {
>   *             The number of traversed map elements for success, **-EINVAL** for
>   *             invalid **flags**.
>   *
> + * long bpf_for_each(u32 nr_iterations, void *callback_fn, void *callback_ctx, u64 flags)
> + *     Description
> + *             For **nr_iterations**, call **callback_fn** function with
> + *             **callback_ctx** as the context parameter.
> + *             The **callback_fn** should be a static function and
> + *             the **callback_ctx** should be a pointer to the stack.
> + *             The **flags** is used to control certain aspects of the helper.
> + *             Currently, the **flags** must be 0.
> + *
> + *             long (\*callback_fn)(u32 index, void \*ctx);
> + *
> + *             where **index** is the current index in the iteration. The index
> + *             is zero-indexed.
> + *
> + *             If **callback_fn** returns 0, the helper will continue to the next
> + *             iteration. If return value is 1, the helper will skip the rest of
> + *             the iterations and return. Other return values are not used now.
> + *
> + *     Return
> + *             The number of iterations performed, **-EINVAL** for invalid **flags**
> + *             or a null **callback_fn**.
> + *
>   * long bpf_snprintf(char *str, u32 str_size, const char *fmt, u64 *data, u32 data_len)
>   *     Description
>   *             Outputs a string into the **str** buffer of size **str_size**
> @@ -5105,6 +5127,7 @@ union bpf_attr {
>         FN(sock_from_file),             \
>         FN(check_mtu),                  \
>         FN(for_each_map_elem),          \
> +       FN(for_each),                   \

you can't change the order of the function definitions; this breaks the
ABI. Please add it at the end.

>         FN(snprintf),                   \
>         FN(sys_bpf),                    \
>         FN(btf_find_by_name_kind),      \
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index b2ee45064e06..cb742c50898a 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -714,3 +714,35 @@ const struct bpf_func_proto bpf_for_each_map_elem_proto = {
>         .arg3_type      = ARG_PTR_TO_STACK_OR_NULL,
>         .arg4_type      = ARG_ANYTHING,
>  };
> +
> +BPF_CALL_4(bpf_for_each, u32, nr_iterations, void *, callback_fn, void *, callback_ctx,
> +          u64, flags)
> +{
> +       bpf_callback_t callback = (bpf_callback_t)callback_fn;
> +       u64 err;
> +       u32 i;
> +

I wonder if we should have some high but reasonable number of
iteration limits. It would be too easy for users to cause some
overflow and not notice it, and then pass 4bln iterations and freeze
the kernel. I think limiting to something like 1mln or 8mln might be
ok. Thoughts?
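
Something like this at the top of the helper would do it (the macro
name and the exact cap are purely illustrative):

	/* hypothetical cap, e.g. the 8mln suggested above */
	#define BPF_FOR_EACH_MAX_ITERS	(8 * 1024 * 1024)

	if (flags || nr_iterations > BPF_FOR_EACH_MAX_ITERS)
		return -E2BIG;	/* or -EINVAL */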


> +       if (flags)
> +               return -EINVAL;
> +
> +       for (i = 0; i < nr_iterations; i++) {
> +               err = callback((u64)i, (u64)(long)callback_ctx, 0, 0, 0);
> +               /* return value: 0 - continue, 1 - stop and return */
> +               if (err) {

nit: not really an error (at least in the current semantics); "ret"
would be more appropriate

> +                       i++;
> +                       break;
> +               }
> +       }
> +
> +       return i;
> +}
> +

[...]

>  static int set_timer_callback_state(struct bpf_verifier_env *env,
>                                     struct bpf_func_state *caller,
>                                     struct bpf_func_state *callee,
> @@ -6482,6 +6503,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>                         return -EINVAL;
>         }
>
> +       if (func_id == BPF_FUNC_for_each) {
> +               err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> +                                       set_for_each_callback_state);
> +               if (err < 0)
> +                       return -EINVAL;
> +       }
> +

we should convert these ifs (they are not even if-elses!) into a
switch, and move the if (err < 0) return err; outside. This list will
only keep growing.
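
E.g. a sketch, using just the two callback-setting cases visible in
this diff:

	switch (func_id) {
	case BPF_FUNC_for_each:
		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
					set_for_each_callback_state);
		break;
	case BPF_FUNC_timer_set_callback:
		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
					set_timer_callback_state);
		break;
	}
	if (err < 0)
		return err;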

>         if (func_id == BPF_FUNC_timer_set_callback) {
>                 err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
>                                         set_timer_callback_state);

[...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 2/3] selftests/bpf: Add tests for bpf_for_each
  2021-11-18  1:04 ` [PATCH bpf-next 2/3] selftests/bpf: Add tests for bpf_for_each Joanne Koong
@ 2021-11-18 20:23   ` Andrii Nakryiko
  0 siblings, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2021-11-18 20:23 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Martin Lau, Kernel Team

On Wed, Nov 17, 2021 at 5:07 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> In this patch -
> 1) Add a new prog "for_each_helper" which tests the basic functionality of
> the bpf_for_each helper.
>
> 2) Add pyperf600_foreach and strobemeta_foreach to test the performance
> of using bpf_for_each instead of a for loop
>
> The results of pyperf600 and strobemeta are as follows:
>
> ~strobemeta~
>
> Baseline
>     verification time 6808200 usec
>     stack depth 496
>     processed 592132 insns (limit 1000000) max_states_per_insn 14
>     total_states 16018 peak_states 13684 mark_read 3132
>     #188 verif_scale_strobemeta:OK (unrolled loop)
>
> Using bpf_for_each
>     verification time 31589 usec
>     stack depth 96+408
>     processed 1630 insns (limit 1000000) max_states_per_insn 4
>     total_states 107 peak_states 107 mark_read 60
>     #189 verif_scale_strobemeta_foreach:OK
>
> ~pyperf600~
>
> Baseline
>     verification time 29702486 usec
>     stack depth 368
>     processed 626838 insns (limit 1000000) max_states_per_insn 7
>     total_states 30368 peak_states 30279 mark_read 748
>     #182 verif_scale_pyperf600:OK (unrolled loop)
>
> Using bpf_for_each
>     verification time 148488 usec

200x, this is awesome

>     stack depth 320+40
>     processed 10518 insns (limit 1000000) max_states_per_insn 10
>     total_states 705 peak_states 517 mark_read 38
>     #183 verif_scale_pyperf600_foreach:OK
>
> Using the bpf_for_each helper led to approximately a 100% decrease
> in the verification time and in the number of instructions.
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>  .../bpf/prog_tests/bpf_verif_scale.c          | 12 +++
>  .../selftests/bpf/prog_tests/for_each.c       | 61 ++++++++++++++++
>  .../selftests/bpf/progs/for_each_helper.c     | 69 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/pyperf.h    | 70 +++++++++++++++++-
>  .../selftests/bpf/progs/pyperf600_foreach.c   |  5 ++
>  .../testing/selftests/bpf/progs/strobemeta.h  | 73 ++++++++++++++++++-
>  .../selftests/bpf/progs/strobemeta_foreach.c  |  9 +++

let's split out the strobemeta and pyperf refactorings into a separate
patch from the dedicated tests for the bpf_for_each helper; they are
logically quite different and independent

>  7 files changed, 295 insertions(+), 4 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/for_each_helper.c
>  create mode 100644 tools/testing/selftests/bpf/progs/pyperf600_foreach.c
>  create mode 100644 tools/testing/selftests/bpf/progs/strobemeta_foreach.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> index 867349e4ed9e..77396484fde7 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> @@ -115,6 +115,12 @@ void test_verif_scale_pyperf600()
>         scale_test("pyperf600.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
>  }
>
> +void test_verif_scale_pyperf600_foreach(void)
> +{
> +       /* use the bpf_for_each helper */
> +       scale_test("pyperf600_foreach.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
> +}
> +
>  void test_verif_scale_pyperf600_nounroll()
>  {
>         /* no unroll at all.
> @@ -165,6 +171,12 @@ void test_verif_scale_strobemeta()
>         scale_test("strobemeta.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
>  }
>
> +void test_verif_scale_strobemeta_foreach(void)
> +{
> +       /* use the bpf_for_each helper */
> +       scale_test("strobemeta_foreach.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
> +}
> +
>  void test_verif_scale_strobemeta_nounroll1()
>  {
>         /* no unroll, tiny loops */
> diff --git a/tools/testing/selftests/bpf/prog_tests/for_each.c b/tools/testing/selftests/bpf/prog_tests/for_each.c
> index 68eb12a287d4..529573a82334 100644
> --- a/tools/testing/selftests/bpf/prog_tests/for_each.c
> +++ b/tools/testing/selftests/bpf/prog_tests/for_each.c
> @@ -4,6 +4,7 @@
>  #include <network_helpers.h>
>  #include "for_each_hash_map_elem.skel.h"
>  #include "for_each_array_map_elem.skel.h"
> +#include "for_each_helper.skel.h"
>
>  static unsigned int duration;
>
> @@ -121,10 +122,70 @@ static void test_array_map(void)
>         for_each_array_map_elem__destroy(skel);
>  }
>
> +static void test_for_each_helper(void)
> +{
> +       struct for_each_helper *skel;
> +       __u32 retval;
> +       int err;
> +
> +       skel = for_each_helper__open_and_load();
> +       if (!ASSERT_OK_PTR(skel, "for_each_helper__open_and_load"))
> +               return;
> +
> +       skel->bss->nr_iterations = 100;
> +       err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
> +                               1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
> +                               &retval, &duration);
> +       if (CHECK(err || retval, "bpf_for_each helper test_prog",

please don't use CHECK() in new tests, stick to ASSERT_XXX()
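
e.g. a sketch of the same check with ASSERT_XXX() instead:

	if (!ASSERT_OK(err, "test_run") ||
	    !ASSERT_EQ(retval, 0, "test_run retval"))
		goto out;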


> +                 "err %d errno %d retval %d\n", err, errno, retval))
> +               goto out;
> +       ASSERT_EQ(skel->bss->nr_iterations_completed, skel->bss->nr_iterations,
> +                 "nr_iterations mismatch");
> +       ASSERT_EQ(skel->bss->g_output, (100 * 99) / 2, "wrong output");
> +
> +       /* test callback_fn returning 1 to stop iteration */
> +       skel->bss->nr_iterations = 400;
> +       skel->data->stop_index = 50;
> +       err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
> +                               1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
> +                               &retval, &duration);
> +       if (CHECK(err || retval, "bpf_for_each helper test_prog",
> +                 "err %d errno %d retval %d\n", err, errno, retval))
> +               goto out;
> +       ASSERT_EQ(skel->bss->nr_iterations_completed, skel->data->stop_index + 1,
> +                 "stop_index not followed");
> +       ASSERT_EQ(skel->bss->g_output, (50 * 49) / 2, "wrong output");
> +
> +       /* test passing in a null ctx */
> +       skel->bss->nr_iterations = 10;
> +       err = bpf_prog_test_run(bpf_program__fd(skel->progs.prog_null_ctx),
> +                               1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
> +                               &retval, &duration);
> +       if (CHECK(err || retval, "bpf_for_each helper prog_null_ctx",
> +                 "err %d errno %d retval %d\n", err, errno, retval))
> +               goto out;
> +       ASSERT_EQ(skel->bss->nr_iterations_completed, skel->bss->nr_iterations,
> +                 "nr_iterations mismatch");
> +
> +       /* test invalid flags */
> +       err = bpf_prog_test_run(bpf_program__fd(skel->progs.prog_invalid_flags),
> +                               1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
> +                               &retval, &duration);
> +       if (CHECK(err || retval, "bpf_for_each helper prog_invalid_flags",
> +                 "err %d errno %d retval %d\n", err, errno, retval))
> +               goto out;
> +       ASSERT_EQ(skel->bss->err, -EINVAL, "invalid_flags");
> +
> +out:
> +       for_each_helper__destroy(skel);
> +}
> +
>  void test_for_each(void)
>  {
>         if (test__start_subtest("hash_map"))
>                 test_hash_map();
>         if (test__start_subtest("array_map"))
>                 test_array_map();
> +       if (test__start_subtest("for_each_helper"))
> +               test_for_each_helper();

those hash_map and array_map are conceptually very different tests;
it's probably best to have a separate file under prog_tests dedicated
to this new helper.

>  }
> diff --git a/tools/testing/selftests/bpf/progs/for_each_helper.c b/tools/testing/selftests/bpf/progs/for_each_helper.c
> new file mode 100644
> index 000000000000..4404d0cb32a6
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/for_each_helper.c
> @@ -0,0 +1,69 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021 Facebook */
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct callback_ctx {
> +       int output;
> +};
> +
> +/* This should be set by the user program */
> +u32 nr_iterations;
> +u32 stop_index = -1;
> +
> +/* Making these global variables so that the userspace program
> + * can verify the output through the skeleton
> + */
> +int nr_iterations_completed;
> +int g_output;
> +int err;
> +
> +static int callback_fn(__u32 index, void *data)
> +{
> +       struct callback_ctx *ctx = data;
> +
> +       if (index >= stop_index)
> +               return 1;
> +
> +       ctx->output += index;
> +
> +       return 0;
> +}
> +
> +static int empty_callback_fn(__u32 index, void *data)
> +{
> +       return 0;
> +}
> +
> +SEC("tc")
> +int test_prog(struct __sk_buff *skb)
> +{
> +       struct callback_ctx data = {};
> +
> +       nr_iterations_completed = bpf_for_each(nr_iterations, callback_fn, &data, 0);
> +
> +       g_output = data.output;
> +
> +       return 0;
> +}
> +
> +SEC("tc")
> +int prog_null_ctx(struct __sk_buff *skb)
> +{
> +       nr_iterations_completed = bpf_for_each(nr_iterations, empty_callback_fn, NULL, 0);
> +
> +       return 0;
> +}
> +
> +SEC("tc")
> +int prog_invalid_flags(struct __sk_buff *skb)
> +{
> +       struct callback_ctx data = {};
> +
> +       err = bpf_for_each(nr_iterations, callback_fn, &data, 1);
> +
> +       return 0;
> +}

You mentioned that nested loops work; let's have a test for that.
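
A minimal sketch of what that could look like, reusing the globals from
this file (the program and callback names are made up):

	static int inner_cb(__u32 index, void *data)
	{
		struct callback_ctx *ctx = data;

		ctx->output++;
		return 0;
	}

	static int outer_cb(__u32 index, void *data)
	{
		bpf_for_each(nr_iterations, inner_cb, data, 0);
		return 0;
	}

	SEC("tc")
	int prog_nested_calls(struct __sk_buff *skb)
	{
		struct callback_ctx data = {};

		bpf_for_each(nr_iterations, outer_cb, &data, 0);
		g_output = data.output;
		return 0;
	}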

[...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 0/3] Add bpf_for_each helper
  2021-11-18 11:14 ` [PATCH bpf-next 0/3] Add bpf_for_each helper Toke Høiland-Jørgensen
@ 2021-11-19  2:35   ` Joanne Koong
  0 siblings, 0 replies; 20+ messages in thread
From: Joanne Koong @ 2021-11-19  2:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, bpf
  Cc: andrii, ast, daniel, kafai, Kernel-team

On 11/18/21 3:14 AM, Toke Høiland-Jørgensen wrote:

> Joanne Koong <joannekoong@fb.com> writes:
>
>> This patchset add a new helper, bpf_for_each.
>>
>> One of the complexities of using for loops in bpf programs is that the verifier
>> needs to ensure that in every possibility of the loop logic, the loop will always
>> terminate. As such, there is a limit on how many iterations the loop can do.
>>
>> The bpf_for_each helper moves the loop logic into the kernel and can thereby
>> guarantee that the loop will always terminate. The bpf_for_each helper simplifies
>> a lot of the complexity the verifier needs to check, as well as removes the
>> constraint on the number of loops able to be run.
>>
>>  From the test results, we see that using bpf_for_each in place
>> of the traditional for loop led to a decrease in verification time
>> and number of bpf instructions by 100%. The benchmark results show
>> that as the number of iterations increases, the overhead per iteration
>> decreases.
> Small nit with the "by 100%" formulation: when giving such relative
> quantities, 100% has a particular meaning, namely "eliminates entirely".
> Which this doesn't, obviously, it *almost* eliminates the verification
> overhead. So I'd change this to 99.5% instead (which is the actual value
> from your numbers in patch 2).
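>
> (From the patch 2 numbers: 1 - 31589/6808200 ≈ 99.54% for strobemeta,
> and 1 - 148488/29702486 ≈ 99.5% for pyperf600.)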
>
Great point!! I will change this to "~99%" (and make this change in
patch #2's commit message as well). Thanks for reviewing this patchset!


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Add bpf_for_each helper
  2021-11-18 17:59   ` Yonghong Song
@ 2021-11-19  2:40     ` Joanne Koong
  0 siblings, 0 replies; 20+ messages in thread
From: Joanne Koong @ 2021-11-19  2:40 UTC (permalink / raw)
  To: Yonghong Song, bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team

On 11/18/21 9:59 AM, Yonghong Song wrote:

>
>
> On 11/17/21 5:04 PM, Joanne Koong wrote:
>> This patch adds the kernel-side and API changes for a new helper
>> function, bpf_for_each:
>>
>> long bpf_for_each(u32 nr_iterations, void *callback_fn,
>> void *callback_ctx, u64 flags);
>>
>> bpf_for_each invokes the "callback_fn" nr_iterations number of times
>> or until the callback_fn returns 1.
>>
>> A few things to please note:
>> ~ The "u64 flags" parameter is currently unused but is included in
>> case a future use case for it arises.
>> ~ In the kernel-side implementation of bpf_for_each 
>> (kernel/bpf/bpf_iter.c),
>> bpf_callback_t is used as the callback function cast.
>> ~ A program can have nested bpf_for_each calls but the program must
>> still adhere to the verifier constraint of its stack depth (the stack 
>> depth
>> cannot exceed MAX_BPF_STACK))
>> ~ The next patch will include the tests and benchmark
>>
>> Signed-off-by: Joanne Koong <joannekoong@fb.com>
>> ---
>>   include/linux/bpf.h            |  1 +
>>   include/uapi/linux/bpf.h       | 23 +++++++++++++++++++++++
>>   kernel/bpf/bpf_iter.c          | 32 ++++++++++++++++++++++++++++++++
>>   kernel/bpf/helpers.c           |  2 ++
>>   kernel/bpf/verifier.c          | 28 ++++++++++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++++
>>   6 files changed, 109 insertions(+)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 6deebf8bf78f..d9b69a896c91 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -2107,6 +2107,7 @@ extern const struct bpf_func_proto 
>> bpf_get_socket_ptr_cookie_proto;
>>   extern const struct bpf_func_proto bpf_task_storage_get_proto;
>>   extern const struct bpf_func_proto bpf_task_storage_delete_proto;
>>   extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
>> +extern const struct bpf_func_proto bpf_for_each_proto;
>>   extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
>>   extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
>>   extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index bd0c9f0487f6..ea5098920ed2 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -4750,6 +4750,28 @@ union bpf_attr {
>>    *        The number of traversed map elements for success, 
>> **-EINVAL** for
>>    *        invalid **flags**.
>>    *
>> + * long bpf_for_each(u32 nr_iterations, void *callback_fn, void 
>> *callback_ctx, u64 flags)
>> + *    Description
>> + *        For **nr_iterations**, call **callback_fn** function with
>> + *        **callback_ctx** as the context parameter.
>> + *        The **callback_fn** should be a static function and
>> + *        the **callback_ctx** should be a pointer to the stack.
>> + *        The **flags** is used to control certain aspects of the 
>> helper.
>> + *        Currently, the **flags** must be 0.
>> + *
>> + *        long (\*callback_fn)(u32 index, void \*ctx);
>> + *
>> + *        where **index** is the current index in the iteration. The 
>> index
>> + *        is zero-indexed.
>> + *
>> + *        If **callback_fn** returns 0, the helper will continue to 
>> the next
>> + *        iteration. If return value is 1, the helper will skip the 
>> rest of
>> + *        the iterations and return. Other return values are not 
>> used now.
>> + *
>> + *    Return
>> + *        The number of iterations performed, **-EINVAL** for 
>> invalid **flags**
>> + *        or a null **callback_fn**.
>
> I think the verifier enforces a non-null callback_fn, right?
>
Yes, great point - I will remove the "or a null **callback_fn**" part.
>> + *
>>    * long bpf_snprintf(char *str, u32 str_size, const char *fmt, u64 
>> *data, u32 data_len)
>>    *    Description
>>    *        Outputs a string into the **str** buffer of size 
>> **str_size**
>> @@ -5105,6 +5127,7 @@ union bpf_attr {
>>       FN(sock_from_file),        \
>>       FN(check_mtu),            \
>>       FN(for_each_map_elem),        \
>> +    FN(for_each),            \
>
> Please put the for_each helper at the end of the list. Otherwise, it
> will break the uapi.
>
Gotcha. I will make this change for v2 and remember to do this in the 
future as well!
>>       FN(snprintf),            \
>>       FN(sys_bpf),            \
>>       FN(btf_find_by_name_kind),    \
>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>> index b2ee45064e06..cb742c50898a 100644
>> --- a/kernel/bpf/bpf_iter.c
>> +++ b/kernel/bpf/bpf_iter.c
>> @@ -714,3 +714,35 @@ const struct bpf_func_proto 
>> bpf_for_each_map_elem_proto = {
>>       .arg3_type    = ARG_PTR_TO_STACK_OR_NULL,
>>       .arg4_type    = ARG_ANYTHING,
>>   };
>> +
> [...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Add bpf_for_each helper
  2021-11-18 18:03     ` Yonghong Song
@ 2021-11-19 12:17       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-19 12:17 UTC (permalink / raw)
  To: Yonghong Song, Joanne Koong, bpf; +Cc: andrii, ast, daniel, kafai, Kernel-team

Yonghong Song <yhs@fb.com> writes:

> On 11/18/21 3:11 AM, Toke Høiland-Jørgensen wrote:
>> Joanne Koong <joannekoong@fb.com> writes:
>> 
>>> This patch adds the kernel-side and API changes for a new helper
>>> function, bpf_for_each:
>>>
>>> long bpf_for_each(u32 nr_iterations, void *callback_fn,
>>> void *callback_ctx, u64 flags);
>>>
>>> bpf_for_each invokes the "callback_fn" nr_iterations number of times
>>> or until the callback_fn returns 1.
>>>
>>> A few things to please note:
>>> ~ The "u64 flags" parameter is currently unused but is included in
>>> case a future use case for it arises.
>>> ~ In the kernel-side implementation of bpf_for_each (kernel/bpf/bpf_iter.c),
>>> bpf_callback_t is used as the callback function cast.
>>> ~ A program can have nested bpf_for_each calls but the program must
>>> still adhere to the verifier constraint of its stack depth (the stack depth
>>> cannot exceed MAX_BPF_STACK)
>>> ~ The next patch will include the tests and benchmark
>>>
>>> Signed-off-by: Joanne Koong <joannekoong@fb.com>
>> 
>> Great to see this! One small nit, below, but otherwise:
>> 
>> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> 
>> 
>>> ---
>>>   include/linux/bpf.h            |  1 +
>>>   include/uapi/linux/bpf.h       | 23 +++++++++++++++++++++++
>>>   kernel/bpf/bpf_iter.c          | 32 ++++++++++++++++++++++++++++++++
>>>   kernel/bpf/helpers.c           |  2 ++
>>>   kernel/bpf/verifier.c          | 28 ++++++++++++++++++++++++++++
>>>   tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++++
>>>   6 files changed, 109 insertions(+)
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 6deebf8bf78f..d9b69a896c91 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -2107,6 +2107,7 @@ extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
>>>   extern const struct bpf_func_proto bpf_task_storage_get_proto;
>>>   extern const struct bpf_func_proto bpf_task_storage_delete_proto;
>>>   extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
>>> +extern const struct bpf_func_proto bpf_for_each_proto;
>>>   extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
>>>   extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
>>>   extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index bd0c9f0487f6..ea5098920ed2 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -4750,6 +4750,28 @@ union bpf_attr {
>>>    *		The number of traversed map elements for success, **-EINVAL** for
>>>    *		invalid **flags**.
>>>    *
>>> + * long bpf_for_each(u32 nr_iterations, void *callback_fn, void *callback_ctx, u64 flags)
>>> + *	Description
>>> + *		For **nr_iterations**, call **callback_fn** function with
>>> + *		**callback_ctx** as the context parameter.
>>> + *		The **callback_fn** should be a static function and
>>> + *		the **callback_ctx** should be a pointer to the stack.
>>> + *		The **flags** is used to control certain aspects of the helper.
>>> + *		Currently, the **flags** must be 0.
>>> + *
>>> + *		long (\*callback_fn)(u32 index, void \*ctx);
>>> + *
>>> + *		where **index** is the current index in the iteration. The index
>>> + *		is zero-indexed.
>>> + *
>>> + *		If **callback_fn** returns 0, the helper will continue to the next
>>> + *		iteration. If return value is 1, the helper will skip the rest of
>>> + *		the iterations and return. Other return values are not used now.
>> 
>> The code will actually return for any non-zero value, though? So
>> shouldn't the documentation reflect this? Or, alternatively, should the
>> verifier enforce that the function can only return 0 or 1?
>
> This is enforced in verifier.c prepare_func_exit().
>
>          if (callee->in_callback_fn) {
>                  /* enforce R0 return value range [0, 1]. */
>                  struct tnum range = tnum_range(0, 1);
>
>                  if (r0->type != SCALAR_VALUE) {
>                          verbose(env, "R0 not a scalar value\n");
>                          return -EACCES;
>                  }
>                  if (!tnum_in(range, r0->var_off)) {
>                          verbose_invalid_scalar(env, r0, &range, 
> "callback return", "R0");
>                          return -EINVAL;
>                  }
>          }

Ah, right! I went looking for this but couldn't find it - thanks for the
pointer!

-Toke


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark
  2021-11-18 19:55     ` Andrii Nakryiko
@ 2021-11-19 13:04       ` Toke Høiland-Jørgensen
  2021-11-19 22:50         ` Andrii Nakryiko
  0 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-19 13:04 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin Lau, Kernel Team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Thu, Nov 18, 2021 at 3:18 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Joanne Koong <joannekoong@fb.com> writes:
>>
>> > Add benchmark to measure the overhead of the bpf_for_each call
>> > for a specified number of iterations.
>> >
>> > Testing this on qemu on my dev machine on 1 thread, the data is
>> > as follows:
>>
>> Absolute numbers from some random dev machine are not terribly useful;
>> others have no way of replicating your tests. A more meaningful
>> benchmark would need a baseline to compare to; in this case I guess that
>> would be a regular loop? Do you have any numbers comparing the callback
>> to just looping?
>
> Measuring an empty for (int i = 0; i < N; i++) loop is meaningless;
> you should expect a number in billions of "operations" per second on
> modern server CPUs. So that will give you no idea. Those numbers are
> useful as a ballpark for the overhead of the bpf_for_each() helper and
> its callbacks. And 12ns per "iteration" is a meaningful figure for
> getting a good idea of how slow that can be. Depending on your
> hardware it can be different by 2x, maybe 3x, but not 100x.
>
> But measuring inc + cmp + jne as a baseline is both unrealistic and
> doesn't give much extra information. But you can assume 2B/s, give
> or take.
>
> And you can also run this benchmark yourself on your own hardware to
> get "real" numbers, as far as you can expect real numbers from an
> artificial microbenchmark, of course.
>
>
> I read those numbers as "plenty fast" :)

Hmm, okay, fair enough, but I think it would be good to have the "~12 ns
per iteration" figure featured prominently in the commit message, then :)

-Toke


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Add bpf_for_each helper
       [not found]     ` <9cf25708-c878-65db-0dfd-a76e83fe9e39@fb.com>
@ 2021-11-19 21:50       ` Joanne Koong
  0 siblings, 0 replies; 20+ messages in thread
From: Joanne Koong @ 2021-11-19 21:50 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf

On 11/18/21 7:23 PM, Joanne Koong wrote:

> On 11/18/21 12:14 PM, Andrii Nakryiko wrote:
>
>> On Wed, Nov 17, 2021 at 5:06 PM Joanne Koong <joannekoong@fb.com> wrote:
>>> This patch adds the kernel-side and API changes for a new helper
>>> function, bpf_for_each:
>>>
>>> long bpf_for_each(u32 nr_iterations, void *callback_fn,
>>> void *callback_ctx, u64 flags);
>> foreach in other languages is usually used when you are iterating over
>> elements of some data structure or stream of data, so the naming feels
>> slightly off. bpf_loop() or bpf_for_range() seems to point more directly
>> to what's going on. My 2 cents, it's subjective, of course.
>>
> Ooh, I really like "bpf_loop()"! I will change it to this for v2.
>>> bpf_for_each invokes the "callback_fn" nr_iterations number of times
>>> or until the callback_fn returns 1.
>> As Toke mentioned, we don't really check for 1. Enforcing it on the
>> verifier side is just going to cause more trouble for users and doesn't
>> seem important. I can see two ways to define the semantics, with error
>> propagation and without.
>>
>> For error propagation, we can define:
>>    - >0, break and return number of iterations performed in total;
>>    - 0, continue to next iteration, if no more iterations, return
>> number of iterations performed;
>>    - <0, break and return that error value (but no way to know at which
>> iteration this happened, except through custom context);
>>
>> Or we can make it simpler and just:
>>    - 0, continue;
>>    - != 0, break;
>>    - always return number of iterations performed.
>>
>>
>> No strong preferences on my side; I see benefits to both ways.
> This is already enforced in the verifier (as Yonghong mentioned as well)
> in prepare_func_exit() -
>           if (callee->in_callback_fn) {
>                   /* enforce R0 return value range [0, 1]. */
>                   struct tnum range = tnum_range(0, 1);
>
>                   if (r0->type != SCALAR_VALUE) {
>                           verbose(env, "R0 not a scalar value\n");
>                           return -EACCES;
>                   }
>                   if (!tnum_in(range, r0->var_off)) {
>                           verbose_invalid_scalar(env, r0, &range,
> "callback return", "R0");
>                           return -EINVAL;
>                   }
>           }
> The verifier enforces that at callback return, the R0 register is 
> always 0 or 1
>
>>> A few things to please note:
>>> ~ The "u64 flags" parameter is currently unused but is included in
>>> case a future use case for it arises.
>>> ~ In the kernel-side implementation of bpf_for_each (kernel/bpf/bpf_iter.c),
>>> bpf_callback_t is used as the callback function cast.
>>> ~ A program can have nested bpf_for_each calls but the program must
>>> still adhere to the verifier constraint of its stack depth (the stack depth
>>> cannot exceed MAX_BPF_STACK)
>>> ~ The next patch will include the tests and benchmark
>>>
>>> Signed-off-by: Joanne Koong <joannekoong@fb.com>
>>> ---
>>>   include/linux/bpf.h            |  1 +
>>>   include/uapi/linux/bpf.h       | 23 +++++++++++++++++++++++
>>>   kernel/bpf/bpf_iter.c          | 32 ++++++++++++++++++++++++++++++++
>>>   kernel/bpf/helpers.c           |  2 ++
>>>   kernel/bpf/verifier.c          | 28 ++++++++++++++++++++++++++++
>>>   tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++++
>>>   6 files changed, 109 insertions(+)
>>>
> [...]
>>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>>> index b2ee45064e06..cb742c50898a 100644
>>> --- a/kernel/bpf/bpf_iter.c
>>> +++ b/kernel/bpf/bpf_iter.c
>>> @@ -714,3 +714,35 @@ const struct bpf_func_proto bpf_for_each_map_elem_proto = {
>>>          .arg3_type      = ARG_PTR_TO_STACK_OR_NULL,
>>>          .arg4_type      = ARG_ANYTHING,
>>>   };
>>> +
>>> +BPF_CALL_4(bpf_for_each, u32, nr_iterations, void *, callback_fn, void *, callback_ctx,
>>> +          u64, flags)
>>> +{
>>> +       bpf_callback_t callback = (bpf_callback_t)callback_fn;
>>> +       u64 err;
>>> +       u32 i;
>>> +
>> I wonder if we should have some high but reasonable number of
>> iteration limits. It would be too easy for users to cause some
>> overflow and not notice it, and then pass 4bln iterations and freeze
>> the kernel. I think limiting to something like 1mln or 8mln might be
>> ok. Thoughts?
>>
> I see the pros and cons of both. It doesn't seem that likely to me that
> users would accidentally pass in a negative value that wraps to a huge
> u32. At the same time, I don't think limiting it to something like 1 or
> 8 million is unreasonable (though it might require the user to read the
> documentation more closely :)) - if the user wants to do more than
> 8 million loops then they can call the loop helper multiple times.
>
> As a user, I think I'd prefer u32, where I'd automatically know the
> limit is 2^32 - 1, but either approach sounds good to me!
>
> [...]
>>>   static int set_timer_callback_state(struct bpf_verifier_env *env,
>>>                                      struct bpf_func_state *caller,
>>>                                      struct bpf_func_state *callee,
>>> @@ -6482,6 +6503,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>>>                          return -EINVAL;
>>>          }
>>>
>>> +       if (func_id == BPF_FUNC_for_each) {
>>> +               err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
>>> +                                       set_for_each_callback_state);
>>> +               if (err < 0)
>>> +                       return -EINVAL;
>>> +       }
>>> +
>> we should convert these ifs (they are not even if-elses!) into a
>> switch, and move the if (err < 0) return err; outside. This list will
>> only keep growing.
> Sounds great, I will do this in v2!
>> [...]
(resending this email to the vger mailing list because it didn't go 
through the first time)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark
  2021-11-19 13:04       ` Toke Høiland-Jørgensen
@ 2021-11-19 22:50         ` Andrii Nakryiko
  2021-11-22 14:27           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 20+ messages in thread
From: Andrii Nakryiko @ 2021-11-19 22:50 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin Lau, Kernel Team

On Fri, Nov 19, 2021 at 5:04 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Thu, Nov 18, 2021 at 3:18 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Joanne Koong <joannekoong@fb.com> writes:
> >>
> >> > Add benchmark to measure the overhead of the bpf_for_each call
> >> > for a specified number of iterations.
> >> >
> >> > Testing this on qemu on my dev machine on 1 thread, the data is
> >> > as follows:
> >>
> >> Absolute numbers from some random dev machine are not terribly useful;
> >> others have no way of replicating your tests. A more meaningful
> >> benchmark would need a baseline to compare to; in this case I guess that
> >> would be a regular loop? Do you have any numbers comparing the callback
> >> to just looping?
> >
> > Measuring an empty for (int i = 0; i < N; i++) loop is meaningless;
> > you should expect a number in billions of "operations" per second on
> > modern server CPUs. So that will give you no idea. Those numbers are
> > useful as a ballpark for the overhead of the bpf_for_each() helper and
> > its callbacks. And 12ns per "iteration" is a meaningful figure for
> > getting a good idea of how slow that can be. Depending on your
> > hardware it can be different by 2x, maybe 3x, but not 100x.
> >
> > But measuring inc + cmp + jne as a baseline is both unrealistic and
> > doesn't give much extra information. But you can assume 2B/s, give
> > or take.
> >
> > And you can also run this benchmark yourself on your own hardware to
> > get "real" numbers, as far as you can expect real numbers from an
> > artificial microbenchmark, of course.
> >
> >
> > I read those numbers as "plenty fast" :)
>
> Hmm, okay, fair enough, but I think it would be good to have the "~12 ns
> per iteration" figure featured prominently in the commit message, then :)
>

We discussed with Joanne offline adding an ops_report_final() helper
that will output both throughput (X ops/s) and latency/overhead
((1000000000/X) ns/op), so that no one has to do any math.
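
Something along these lines (a sketch; struct bench_res and the
res[]/res_cnt convention come from the existing bench.h, the rest is
illustrative):

	static void ops_report_final(struct bench_res res[], int res_cnt)
	{
		double hits_mean = 0.0;
		int i;

		for (i = 0; i < res_cnt; i++)
			hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);

		printf("Summary: throughput %8.3lf M ops/s, "
		       "latency %8.3lf ns/op\n",
		       hits_mean, hits_mean ? 1000.0 / hits_mean : 0.0);
	}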

> -Toke
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark
  2021-11-19 22:50         ` Andrii Nakryiko
@ 2021-11-22 14:27           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-22 14:27 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin Lau, Kernel Team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Nov 19, 2021 at 5:04 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Thu, Nov 18, 2021 at 3:18 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Joanne Koong <joannekoong@fb.com> writes:
>> >>
>> >> > Add benchmark to measure the overhead of the bpf_for_each call
>> >> > for a specified number of iterations.
>> >> >
>> >> > Testing this on qemu on my dev machine on 1 thread, the data is
>> >> > as follows:
>> >>
>> >> Absolute numbers from some random dev machine are not terribly useful;
>> >> others have no way of replicating your tests. A more meaningful
>> >> benchmark would need a baseline to compare to; in this case I guess that
>> >> would be a regular loop? Do you have any numbers comparing the callback
>> >> to just looping?
>> >
>> > Measuring an empty for (int i = 0; i < N; i++) loop is meaningless;
>> > you should expect a number in billions of "operations" per second on
>> > modern server CPUs. So that will give you no idea. Those numbers are
>> > useful as a ballpark for the overhead of the bpf_for_each() helper and
>> > its callbacks. And 12ns per "iteration" is a meaningful figure for
>> > getting a good idea of how slow that can be. Depending on your
>> > hardware it can be different by 2x, maybe 3x, but not 100x.
>> >
>> > But measuring inc + cmp + jne as a baseline is both unrealistic and
>> > doesn't give much extra information. But you can assume 2B/s, give
>> > or take.
>> >
>> > And you can also run this benchmark yourself on your own hardware to
>> > get "real" numbers, as far as you can expect real numbers from an
>> > artificial microbenchmark, of course.
>> >
>> >
>> > I read those numbers as "plenty fast" :)
>>
>> Hmm, okay, fair enough, but I think it would be good to have the "~12 ns
>> per iteration" figure featured prominently in the commit message, then :)
>>
>
> We discussed with Joanne offline adding an ops_report_final() helper
> that will output both throughput (X ops/s) and latency/overhead
> ((1000000000/X) ns/op), so that no one has to do any math.

Alright, sounds good, thanks!

-Toke


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-11-22 14:27 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-18  1:04 [PATCH bpf-next 0/3] Add bpf_for_each helper Joanne Koong
2021-11-18  1:04 ` [PATCH bpf-next 1/3] bpf: " Joanne Koong
2021-11-18 11:11   ` Toke Høiland-Jørgensen
2021-11-18 18:03     ` Yonghong Song
2021-11-19 12:17       ` Toke Høiland-Jørgensen
2021-11-18 17:59   ` Yonghong Song
2021-11-19  2:40     ` Joanne Koong
2021-11-18 20:14   ` Andrii Nakryiko
     [not found]     ` <9cf25708-c878-65db-0dfd-a76e83fe9e39@fb.com>
2021-11-19 21:50       ` Joanne Koong
2021-11-18  1:04 ` [PATCH bpf-next 2/3] selftests/bpf: Add tests for bpf_for_each Joanne Koong
2021-11-18 20:23   ` Andrii Nakryiko
2021-11-18  1:04 ` [PATCH bpf-next 3/3] selftest/bpf/benchs: add bpf_for_each benchmark Joanne Koong
2021-11-18 11:18   ` Toke Høiland-Jørgensen
2021-11-18 19:55     ` Andrii Nakryiko
2021-11-19 13:04       ` Toke Høiland-Jørgensen
2021-11-19 22:50         ` Andrii Nakryiko
2021-11-22 14:27           ` Toke Høiland-Jørgensen
2021-11-18 20:00   ` Andrii Nakryiko
2021-11-18 11:14 ` [PATCH bpf-next 0/3] Add bpf_for_each helper Toke Høiland-Jørgensen
2021-11-19  2:35   ` Joanne Koong
