bpf.vger.kernel.org archive mirror
* [PATCH v2 bpf-next 0/4] Add bpf_loop_helper
@ 2021-11-23 18:34 Joanne Koong
  2021-11-23 18:34 ` [PATCH v2 bpf-next 1/4] bpf: Add bpf_loop helper Joanne Koong
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Joanne Koong @ 2021-11-23 18:34 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, Kernel-team, Joanne Koong

This patchset adds a new helper, bpf_loop.

One of the complexities of using for loops in bpf programs is that the verifier
needs to ensure that the loop terminates on every possible path through the
loop logic. As such, there is a limit on how many iterations the loop can run.

The bpf_loop helper moves the loop logic into the kernel and can thereby
guarantee that the loop always terminates. This removes much of the
complexity the verifier would otherwise need to check, and lifts the
constraint on the number of iterations that can be run.
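
To illustrate (a minimal sketch in the spirit of the selftests in patch 2;
the program and function names here are illustrative, not from the
patches), replacing a bounded for loop with bpf_loop looks like this:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

static int add_index(u32 index, void *ctx)
{
	int *sum = ctx;

	*sum += index;
	return 0;	/* 0 - continue to the next iteration */
}

SEC("tc")
int sum_prog(struct __sk_buff *skb)
{
	int sum = 0;

	/* run add_index() 100 times, accumulating into the stack
	 * variable that callback_ctx points to; flags must be 0
	 */
	bpf_loop(100, add_index, &sum, 0);
	return 0;
}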

From the test results, we see that using bpf_loop in place
of the traditional for loop led to a decrease in verification time
and number of bpf instructions by 100%. The benchmark results show
that as the number of iterations increases, the overhead per iteration
decreases.

The high-level overview of the patches -
Patch 1 - kernel-side + API changes for adding bpf_loop
Patch 2 - tests
Patch 3 - use bpf_loop in strobemeta + pyperf600 and measure verifier performance
Patch 4 - benchmark for throughput + latency of bpf_loop call

v1 -> v2:
~ Change helper name to bpf_loop (instead of bpf_for_each)
~ Set max nr_loops (~8 million loops) for bpf_loop call
~ Split tests + strobemeta/pyperf600 changes into two patches
~ Add new ops_report_final helper for outputting throughput and latency


Joanne Koong (4):
  bpf: Add bpf_loop helper
  selftests/bpf: Add bpf_loop test
  selftests/bpf: measure bpf_loop verifier performance
  selftest/bpf/benchs: add bpf_loop benchmark

 include/linux/bpf.h                           |   1 +
 include/uapi/linux/bpf.h                      |  25 ++++
 kernel/bpf/bpf_iter.c                         |  35 +++++
 kernel/bpf/helpers.c                          |   2 +
 kernel/bpf/verifier.c                         |  97 +++++++-----
 tools/include/uapi/linux/bpf.h                |  25 ++++
 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  26 ++++
 tools/testing/selftests/bpf/bench.h           |   1 +
 .../selftests/bpf/benchs/bench_bpf_loop.c     | 105 +++++++++++++
 .../bpf/benchs/run_bench_bpf_loop.sh          |  15 ++
 .../selftests/bpf/benchs/run_common.sh        |  15 ++
 .../selftests/bpf/prog_tests/bpf_loop.c       | 138 ++++++++++++++++++
 .../bpf/prog_tests/bpf_verif_scale.c          |  12 ++
 tools/testing/selftests/bpf/progs/bpf_loop.c  |  99 +++++++++++++
 .../selftests/bpf/progs/bpf_loop_bench.c      |  26 ++++
 tools/testing/selftests/bpf/progs/pyperf.h    |  71 ++++++++-
 .../selftests/bpf/progs/pyperf600_bpf_loop.c  |   6 +
 .../testing/selftests/bpf/progs/strobemeta.h  |  75 +++++++++-
 .../selftests/bpf/progs/strobemeta_bpf_loop.c |   9 ++
 20 files changed, 745 insertions(+), 42 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/pyperf600_bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/progs/strobemeta_bpf_loop.c

-- 
2.30.2



* [PATCH v2 bpf-next 1/4] bpf: Add bpf_loop helper
  2021-11-23 18:34 [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
@ 2021-11-23 18:34 ` Joanne Koong
  2021-11-23 22:46   ` Andrii Nakryiko
  2021-11-23 18:34 ` [PATCH v2 bpf-next 2/4] selftests/bpf: Add bpf_loop test Joanne Koong
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Joanne Koong @ 2021-11-23 18:34 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, Kernel-team, Joanne Koong

This patch adds the kernel-side and API changes for a new helper
function, bpf_loop:

long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
              u64 flags);

where long (*callback_fn)(u32 index, void *ctx);

bpf_loop invokes the "callback_fn" **nr_loops** times or until the
callback_fn returns 1. The callback_fn can only return 0 or 1, and
this is enforced by the verifier. The callback_fn index is zero-indexed.

A few things to note:
~ The "u64 flags" parameter is currently unused but is included in
case a future use case for it arises.
~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
bpf_callback_t is used as the callback function cast.
~ A program can have nested bpf_loop calls, but the program must
still adhere to the verifier's stack depth constraint (the stack
depth cannot exceed MAX_BPF_STACK)
~ Recursive callback_fns do not pass the verifier, because their
call stack is too deep.
~ The next patch will include the tests and benchmark
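
To illustrate the early-termination semantics (a minimal sketch
mirroring the selftests in the next patch; the threshold of 50 is
arbitrary):

static int callback(u32 index, void *ctx)
{
	if (index >= 50)
		return 1;	/* stop the loop */

	/* per-iteration work on ctx goes here */
	return 0;		/* continue to the next iteration */
}

/* bpf_loop(400, callback, &data, 0) returns 51 here: the callback
 * runs for indices 0 through 50, and the return of 1 at index 50
 * stops the loop.
 */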

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       | 25 +++++++++
 kernel/bpf/bpf_iter.c          | 35 ++++++++++++
 kernel/bpf/helpers.c           |  2 +
 kernel/bpf/verifier.c          | 97 +++++++++++++++++++++-------------
 tools/include/uapi/linux/bpf.h | 25 +++++++++
 6 files changed, 148 insertions(+), 37 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cc7a0c36e7df..cad0829710be 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2164,6 +2164,7 @@ extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
 extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
 extern const struct bpf_func_proto bpf_kallsyms_lookup_name_proto;
 extern const struct bpf_func_proto bpf_find_vma_proto;
+extern const struct bpf_func_proto bpf_loop_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a69e4b04ffeb..ea37f0ab4231 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4957,6 +4957,30 @@ union bpf_attr {
  *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
  *		**-EBUSY** if failed to try lock mmap_lock.
  *		**-EINVAL** for invalid **flags**.
+ *
+ * long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx, u64 flags)
+ *	Description
+ *		For **nr_loops**, call the **callback_fn** function
+ *		with **callback_ctx** as the context parameter.
+ *		The **callback_fn** should be a static function and
+ *		the **callback_ctx** should be a pointer to the stack.
+ *		**flags** is used to control certain aspects of the helper
+ *		and currently must be 0. **nr_loops** is currently limited
+ *		to 1 << 23 (~8 million) loops.
+ *
+ *		long (\*callback_fn)(u32 index, void \*ctx);
+ *
+ *		where **index** is the current index in the loop. The index
+ *		is zero-indexed.
+ *
+ *		If **callback_fn** returns 0, the helper will continue to the next
+ *		loop. If the return value is 1, the helper will skip the rest of
+ *		the loops and return. Other return values are not used now,
+ *		and will be rejected by the verifier.
+ *
+ *	Return
+ *		The number of loops performed, **-EINVAL** for invalid **flags** or
+ *		if **nr_loops** exceeds the maximum number of loops.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5140,6 +5164,7 @@ union bpf_attr {
 	FN(skc_to_unix_sock),		\
 	FN(kallsyms_lookup_name),	\
 	FN(find_vma),			\
+	FN(loop),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index b2ee45064e06..1a29f650e3e0 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -714,3 +714,38 @@ const struct bpf_func_proto bpf_for_each_map_elem_proto = {
 	.arg3_type	= ARG_PTR_TO_STACK_OR_NULL,
 	.arg4_type	= ARG_ANYTHING,
 };
+
+/* maximum number of loops */
+#define MAX_LOOPS	BIT(23)
+
+BPF_CALL_4(bpf_loop, u32, nr_loops, void *, callback_fn, void *, callback_ctx,
+	   u64, flags)
+{
+	bpf_callback_t callback = (bpf_callback_t)callback_fn;
+	u64 ret;
+	u32 i;
+
+	if (flags || nr_loops > MAX_LOOPS)
+		return -EINVAL;
+
+	for (i = 0; i < nr_loops; i++) {
+		ret = callback((u64)i, (u64)(long)callback_ctx, 0, 0, 0);
+		/* return value: 0 - continue, 1 - stop and return */
+		if (ret) {
+			i++;
+			break;
+		}
+	}
+
+	return i;
+}
+
+const struct bpf_func_proto bpf_loop_proto = {
+	.func		= bpf_loop,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+	.arg3_type	= ARG_PTR_TO_STACK_OR_NULL,
+	.arg4_type	= ARG_ANYTHING,
+};
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1ffd469c217f..52188004a9c3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1378,6 +1378,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_ringbuf_query_proto;
 	case BPF_FUNC_for_each_map_elem:
 		return &bpf_for_each_map_elem_proto;
+	case BPF_FUNC_loop:
+		return &bpf_loop_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0763cca139a7..d674769d411b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6085,6 +6085,27 @@ static int set_map_elem_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_loop_callback_state(struct bpf_verifier_env *env,
+				   struct bpf_func_state *caller,
+				   struct bpf_func_state *callee,
+				   int insn_idx)
+{
+	/* bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
+	 *	    u64 flags);
+	 * callback_fn(u32 index, void *callback_ctx);
+	 */
+	callee->regs[BPF_REG_1].type = SCALAR_VALUE;
+	callee->regs[BPF_REG_2] = caller->regs[BPF_REG_3];
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_3]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static int set_timer_callback_state(struct bpf_verifier_env *env,
 				    struct bpf_func_state *caller,
 				    struct bpf_func_state *callee,
@@ -6458,13 +6479,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return err;
 	}
 
-	if (func_id == BPF_FUNC_tail_call) {
-		err = check_reference_leak(env);
-		if (err) {
-			verbose(env, "tail_call would lead to reference leak\n");
-			return err;
-		}
-	} else if (is_release_function(func_id)) {
+	if (is_release_function(func_id)) {
 		err = release_reference(env, meta.ref_obj_id);
 		if (err) {
 			verbose(env, "func %s#%d reference has not been acquired before\n",
@@ -6475,42 +6490,50 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 
 	regs = cur_regs(env);
 
-	/* check that flags argument in get_local_storage(map, flags) is 0,
-	 * this is required because get_local_storage() can't return an error.
-	 */
-	if (func_id == BPF_FUNC_get_local_storage &&
-	    !register_is_null(&regs[BPF_REG_2])) {
-		verbose(env, "get_local_storage() doesn't support non-zero flags\n");
-		return -EINVAL;
-	}
-
-	if (func_id == BPF_FUNC_for_each_map_elem) {
-		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
-					set_map_elem_callback_state);
-		if (err < 0)
+	switch (func_id) {
+	case BPF_FUNC_tail_call:
+		err = check_reference_leak(env);
+		if (err) {
+			verbose(env, "tail_call would lead to reference leak\n");
+			return err;
+		}
+		break;
+	case BPF_FUNC_get_local_storage:
+		/* check that flags argument in get_local_storage(map, flags) is 0,
+		 * this is required because get_local_storage() can't return an error.
+		 */
+		if (!register_is_null(&regs[BPF_REG_2])) {
+			verbose(env, "get_local_storage() doesn't support non-zero flags\n");
 			return -EINVAL;
-	}
-
-	if (func_id == BPF_FUNC_timer_set_callback) {
+		}
+		err = 0;
+		break;
+	case BPF_FUNC_for_each_map_elem:
 		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
-					set_timer_callback_state);
-		if (err < 0)
-			return -EINVAL;
-	}
-
-	if (func_id == BPF_FUNC_find_vma) {
+					set_map_elem_callback_state) ? -EINVAL : 0;
+		break;
+	case BPF_FUNC_timer_set_callback:
 		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
-					set_find_vma_callback_state);
-		if (err < 0)
-			return -EINVAL;
-	}
-
-	if (func_id == BPF_FUNC_snprintf) {
+					set_timer_callback_state) ? -EINVAL : 0;
+		break;
+	case BPF_FUNC_find_vma:
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_find_vma_callback_state) ? -EINVAL : 0;
+		break;
+	case BPF_FUNC_snprintf:
 		err = check_bpf_snprintf_call(env, regs);
-		if (err < 0)
-			return err;
+		break;
+	case BPF_FUNC_loop:
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_loop_callback_state) ? -EINVAL : 0;
+		break;
+	default:
+		err = 0;
 	}
 
+	if (err)
+		return err;
+
 	/* reset caller saved regs */
 	for (i = 0; i < CALLER_SAVED_REGS; i++) {
 		mark_reg_not_init(env, regs, caller_saved[i]);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a69e4b04ffeb..ea37f0ab4231 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4957,6 +4957,30 @@ union bpf_attr {
  *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
  *		**-EBUSY** if failed to try lock mmap_lock.
  *		**-EINVAL** for invalid **flags**.
+ *
+ * long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx, u64 flags)
+ *	Description
+ *		For **nr_loops**, call the **callback_fn** function
+ *		with **callback_ctx** as the context parameter.
+ *		The **callback_fn** should be a static function and
+ *		the **callback_ctx** should be a pointer to the stack.
+ *		**flags** is used to control certain aspects of the helper
+ *		and currently must be 0. **nr_loops** is currently limited
+ *		to 1 << 23 (~8 million) loops.
+ *
+ *		long (\*callback_fn)(u32 index, void \*ctx);
+ *
+ *		where **index** is the current index in the loop. The index
+ *		is zero-indexed.
+ *
+ *		If **callback_fn** returns 0, the helper will continue to the next
+ *		loop. If the return value is 1, the helper will skip the rest of
+ *		the loops and return. Other return values are not used now,
+ *		and will be rejected by the verifier.
+ *
+ *	Return
+ *		The number of loops performed, **-EINVAL** for invalid **flags** or
+ *		if **nr_loops** exceeds the maximum number of loops.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5140,6 +5164,7 @@ union bpf_attr {
 	FN(skc_to_unix_sock),		\
 	FN(kallsyms_lookup_name),	\
 	FN(find_vma),			\
+	FN(loop),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2



* [PATCH v2 bpf-next 2/4] selftests/bpf: Add bpf_loop test
  2021-11-23 18:34 [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
  2021-11-23 18:34 ` [PATCH v2 bpf-next 1/4] bpf: Add bpf_loop helper Joanne Koong
@ 2021-11-23 18:34 ` Joanne Koong
  2021-11-23 18:34 ` [PATCH v2 bpf-next 3/4] selftests/bpf: measure bpf_loop verifier performance Joanne Koong
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Joanne Koong @ 2021-11-23 18:34 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, Kernel-team, Joanne Koong

Add tests for bpf_loop covering a variety of cases: various nr_loops
values, a null callback_ctx, invalid flags, and nested callbacks.

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 .../selftests/bpf/prog_tests/bpf_loop.c       | 138 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_loop.c  |  99 +++++++++++++
 2 files changed, 237 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_loop.c b/tools/testing/selftests/bpf/prog_tests/bpf_loop.c
new file mode 100644
index 000000000000..a279a2c5c44f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_loop.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "bpf_loop.skel.h"
+
+static void check_nr_loops(struct bpf_loop *skel)
+{
+	__u32 retval, duration;
+	int err;
+
+	/* test 0 loops */
+	skel->bss->nr_loops = 0;
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+	if (!ASSERT_OK(err, "err") || !ASSERT_OK(retval, "retval"))
+		return;
+	ASSERT_EQ(skel->bss->nr_loops_returned, skel->bss->nr_loops,
+		  "0 loops");
+
+	/* test 500 loops */
+	skel->bss->nr_loops = 500;
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+	if (!ASSERT_OK(err, "err") ||
+	    !ASSERT_OK(retval, "retval"))
+		return;
+	ASSERT_EQ(skel->bss->nr_loops_returned, skel->bss->nr_loops,
+		  "500 loops");
+	ASSERT_EQ(skel->bss->g_output, (500 * 499) / 2, "g_output");
+
+	/* test exceeding the max limit */
+	skel->bss->nr_loops = -1;
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+	if (!ASSERT_OK(err, "err") || !ASSERT_OK(retval, "retval"))
+		return;
+	ASSERT_EQ(skel->bss->err, -EINVAL, "over max limit");
+}
+
+static void check_callback_fn_stop(struct bpf_loop *skel)
+{
+	__u32 retval, duration;
+	int err;
+
+	skel->bss->nr_loops = 400;
+	skel->data->stop_index = 50;
+
+	/* testing that loop is stopped when callback_fn returns 1 */
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+
+	if (!ASSERT_OK(err, "err") || !ASSERT_OK(retval, "retval"))
+		return;
+
+	ASSERT_EQ(skel->bss->nr_loops_returned, skel->data->stop_index + 1,
+		  "nr_loops_returned");
+	ASSERT_EQ(skel->bss->g_output, (50 * 49) / 2,
+		  "g_output");
+}
+
+static void check_null_callback_ctx(struct bpf_loop *skel)
+{
+	__u32 retval, duration;
+	int err;
+
+	skel->bss->nr_loops = 10;
+
+	/* check that user is able to pass in a null callback_ctx */
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.prog_null_ctx),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+
+	if (!ASSERT_OK(err, "err") || !ASSERT_OK(retval, "retval"))
+		return;
+
+	ASSERT_EQ(skel->bss->nr_loops_returned, skel->bss->nr_loops,
+		  "nr_loops_returned");
+}
+
+static void check_invalid_flags(struct bpf_loop *skel)
+{
+	__u32 retval, duration;
+	int err;
+
+	/* check that passing in non-zero flags returns -EINVAL */
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.prog_invalid_flags),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+
+	if (!ASSERT_OK(err, "err") || !ASSERT_OK(retval, "retval"))
+		return;
+
+	ASSERT_EQ(skel->bss->err, -EINVAL, "err");
+}
+
+static void check_nested_calls(struct bpf_loop *skel)
+{
+	__u32 nr_loops = 100, nested_callback_nr_loops = 4;
+	__u32 retval, duration;
+	int err;
+
+	skel->bss->nr_loops = nr_loops;
+	skel->bss->nested_callback_nr_loops = nested_callback_nr_loops;
+
+	/* check that nested calls are supported */
+	err = bpf_prog_test_run(bpf_program__fd(skel->progs.prog_nested_calls),
+				1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				&retval, &duration);
+	if (!ASSERT_OK(err, "err") || !ASSERT_OK(retval, "retval"))
+		return;
+	ASSERT_EQ(skel->bss->nr_loops_returned, nr_loops * nested_callback_nr_loops
+		  * nested_callback_nr_loops, "nr_loops_returned");
+	ASSERT_EQ(skel->bss->g_output, (4 * 3) / 2 * nested_callback_nr_loops
+		* nr_loops, "g_output");
+}
+
+void test_bpf_loop(void)
+{
+	struct bpf_loop *skel;
+
+	skel = bpf_loop__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_loop__open_and_load"))
+		return;
+
+	check_nr_loops(skel);
+	check_callback_fn_stop(skel);
+	check_null_callback_ctx(skel);
+	check_invalid_flags(skel);
+	check_nested_calls(skel);
+
+	bpf_loop__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_loop.c b/tools/testing/selftests/bpf/progs/bpf_loop.c
new file mode 100644
index 000000000000..f5437792fe0f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_loop.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct callback_ctx {
+	int output;
+};
+
+/* These should be set by the user program */
+u32 nested_callback_nr_loops;
+u32 stop_index = -1;
+u32 nr_loops;
+
+/* Making these global variables so that the userspace program
+ * can verify the output through the skeleton
+ */
+int nr_loops_returned;
+int g_output;
+int err;
+
+static int callback(__u32 index, void *data)
+{
+	struct callback_ctx *ctx = data;
+
+	if (index >= stop_index)
+		return 1;
+
+	ctx->output += index;
+
+	return 0;
+}
+
+static int empty_callback(__u32 index, void *data)
+{
+	return 0;
+}
+
+static int nested_callback2(__u32 index, void *data)
+{
+	nr_loops_returned += bpf_loop(nested_callback_nr_loops, callback, data, 0);
+
+	return 0;
+}
+
+static int nested_callback1(__u32 index, void *data)
+{
+	bpf_loop(nested_callback_nr_loops, nested_callback2, data, 0);
+	return 0;
+}
+
+SEC("tc")
+int test_prog(struct __sk_buff *skb)
+{
+	struct callback_ctx data = {};
+
+	nr_loops_returned = bpf_loop(nr_loops, callback, &data, 0);
+
+	if (nr_loops_returned < 0)
+		err = nr_loops_returned;
+	else
+		g_output = data.output;
+
+	return 0;
+}
+
+SEC("tc")
+int prog_null_ctx(struct __sk_buff *skb)
+{
+	nr_loops_returned = bpf_loop(nr_loops, empty_callback, NULL, 0);
+
+	return 0;
+}
+
+SEC("tc")
+int prog_invalid_flags(struct __sk_buff *skb)
+{
+	struct callback_ctx data = {};
+
+	err = bpf_loop(nr_loops, callback, &data, 1);
+
+	return 0;
+}
+
+SEC("tc")
+int prog_nested_calls(struct __sk_buff *skb)
+{
+	struct callback_ctx data = {};
+
+	nr_loops_returned = 0;
+	bpf_loop(nr_loops, nested_callback1, &data, 0);
+
+	g_output = data.output;
+
+	return 0;
+}
-- 
2.30.2



* [PATCH v2 bpf-next 3/4] selftests/bpf: measure bpf_loop verifier performance
  2021-11-23 18:34 [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
  2021-11-23 18:34 ` [PATCH v2 bpf-next 1/4] bpf: Add bpf_loop helper Joanne Koong
  2021-11-23 18:34 ` [PATCH v2 bpf-next 2/4] selftests/bpf: Add bpf_loop test Joanne Koong
@ 2021-11-23 18:34 ` Joanne Koong
  2021-11-23 18:34 ` [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark Joanne Koong
  2021-11-23 18:47 ` [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
  4 siblings, 0 replies; 15+ messages in thread
From: Joanne Koong @ 2021-11-23 18:34 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, Kernel-team, Joanne Koong

This patch tests bpf_loop in pyperf and strobemeta, and measures the
verifier performance impact of replacing the traditional for loop
with bpf_loop.

The results are as follows:

~strobemeta~

Baseline
    verification time 6808200 usec
    stack depth 496
    processed 554252 insns (limit 1000000) max_states_per_insn 16
    total_states 15878 peak_states 13489  mark_read 3110
    #192 verif_scale_strobemeta:OK (unrolled loop)

Using bpf_loop
    verification time 31589 usec
    stack depth 96+400
    processed 1513 insns (limit 1000000) max_states_per_insn 2
    total_states 106 peak_states 106 mark_read 60
    #193 verif_scale_strobemeta_bpf_loop:OK

~pyperf600~

Baseline
    verification time 29702486 usec
    stack depth 368
    processed 626838 insns (limit 1000000) max_states_per_insn 7
    total_states 30368 peak_states 30279 mark_read 748
    #182 verif_scale_pyperf600:OK (unrolled loop)

Using bpf_loop
    verification time 148488 usec
    stack depth 320+40
    processed 10518 insns (limit 1000000) max_states_per_insn 10
    total_states 705 peak_states 517 mark_read 38
    #183 verif_scale_pyperf600_bpf_loop:OK

Using the bpf_loop helper led to approximately a 99% decrease
in the verification time and in the number of instructions.
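
The conversion in both programs follows the same pattern: hoist the
state the loop body needs into a context struct, move the body into a
static callback, and replace the for loop with a bpf_loop call. A
simplified sketch of that pattern (not the actual pyperf/strobemeta
code):

struct loop_ctx {
	/* state the former loop body read and wrote */
	void *frame_ptr;
	void *payload;
};

static int body_callback(u32 i, void *data)
{
	struct loop_ctx *ctx = data;

	/* former loop body, indexed by i, operating on ctx;
	 * return 1 to break out early, 0 to continue
	 */
	return 0;
}

/* was: for (int i = 0; i < STACK_MAX_LEN; ++i) { ... } */
bpf_loop(STACK_MAX_LEN, body_callback, &ctx, 0);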

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 .../bpf/prog_tests/bpf_verif_scale.c          | 12 +++
 tools/testing/selftests/bpf/progs/pyperf.h    | 71 +++++++++++++++++-
 .../selftests/bpf/progs/pyperf600_bpf_loop.c  |  6 ++
 .../testing/selftests/bpf/progs/strobemeta.h  | 75 ++++++++++++++++++-
 .../selftests/bpf/progs/strobemeta_bpf_loop.c |  9 +++
 5 files changed, 169 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/pyperf600_bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/progs/strobemeta_bpf_loop.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
index 27f5d8ea7964..1fb16f8dad56 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
@@ -115,6 +115,12 @@ void test_verif_scale_pyperf600()
 	scale_test("pyperf600.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
 }
 
+void test_verif_scale_pyperf600_bpf_loop(void)
+{
+	/* use the bpf_loop helper */
+	scale_test("pyperf600_bpf_loop.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
+}
+
 void test_verif_scale_pyperf600_nounroll()
 {
 	/* no unroll at all.
@@ -165,6 +171,12 @@ void test_verif_scale_strobemeta()
 	scale_test("strobemeta.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
 }
 
+void test_verif_scale_strobemeta_bpf_loop(void)
+{
+	/* use the bpf_loop helper */
+	scale_test("strobemeta_bpf_loop.o", BPF_PROG_TYPE_RAW_TRACEPOINT, false);
+}
+
 void test_verif_scale_strobemeta_nounroll1()
 {
 	/* no unroll, tiny loops */
diff --git a/tools/testing/selftests/bpf/progs/pyperf.h b/tools/testing/selftests/bpf/progs/pyperf.h
index 2fb7adafb6b6..1ed28882daf3 100644
--- a/tools/testing/selftests/bpf/progs/pyperf.h
+++ b/tools/testing/selftests/bpf/progs/pyperf.h
@@ -159,6 +159,59 @@ struct {
 	__uint(value_size, sizeof(long long) * 127);
 } stackmap SEC(".maps");
 
+#ifdef USE_BPF_LOOP
+struct process_frame_ctx {
+	int cur_cpu;
+	int32_t *symbol_counter;
+	void *frame_ptr;
+	FrameData *frame;
+	PidData *pidData;
+	Symbol *sym;
+	Event *event;
+	bool done;
+};
+
+#define barrier_var(var) asm volatile("" : "=r"(var) : "0"(var))
+
+static int process_frame_callback(__u32 i, struct process_frame_ctx *ctx)
+{
+	int zero = 0;
+	void *frame_ptr = ctx->frame_ptr;
+	PidData *pidData = ctx->pidData;
+	FrameData *frame = ctx->frame;
+	int32_t *symbol_counter = ctx->symbol_counter;
+	int cur_cpu = ctx->cur_cpu;
+	Event *event = ctx->event;
+	Symbol *sym = ctx->sym;
+
+	if (frame_ptr && get_frame_data(frame_ptr, pidData, frame, sym)) {
+		int32_t new_symbol_id = *symbol_counter * 64 + cur_cpu;
+		int32_t *symbol_id = bpf_map_lookup_elem(&symbolmap, sym);
+
+		if (!symbol_id) {
+			bpf_map_update_elem(&symbolmap, sym, &zero, 0);
+			symbol_id = bpf_map_lookup_elem(&symbolmap, sym);
+			if (!symbol_id) {
+				ctx->done = true;
+				return 1;
+			}
+		}
+		if (*symbol_id == new_symbol_id)
+			(*symbol_counter)++;
+
+		barrier_var(i);
+		if (i >= STACK_MAX_LEN)
+			return 1;
+
+		event->stack[i] = *symbol_id;
+
+		event->stack_len = i + 1;
+		frame_ptr = frame->f_back;
+	}
+	return 0;
+}
+#endif /* USE_BPF_LOOP */
+
 #ifdef GLOBAL_FUNC
 __noinline
 #elif defined(SUBPROGS)
@@ -228,11 +281,26 @@ int __on_event(struct bpf_raw_tracepoint_args *ctx)
 		int32_t* symbol_counter = bpf_map_lookup_elem(&symbolmap, &sym);
 		if (symbol_counter == NULL)
 			return 0;
+#ifdef USE_BPF_LOOP
+	struct process_frame_ctx ctx = {
+		.cur_cpu = cur_cpu,
+		.symbol_counter = symbol_counter,
+		.frame_ptr = frame_ptr,
+		.frame = &frame,
+		.pidData = pidData,
+		.sym = &sym,
+		.event = event,
+	};
+
+	bpf_loop(STACK_MAX_LEN, process_frame_callback, &ctx, 0);
+	if (ctx.done)
+		return 0;
+#else
 #ifdef NO_UNROLL
 #pragma clang loop unroll(disable)
 #else
 #pragma clang loop unroll(full)
-#endif
+#endif /* NO_UNROLL */
 		/* Unwind python stack */
 		for (int i = 0; i < STACK_MAX_LEN; ++i) {
 			if (frame_ptr && get_frame_data(frame_ptr, pidData, &frame, &sym)) {
@@ -251,6 +319,7 @@ int __on_event(struct bpf_raw_tracepoint_args *ctx)
 				frame_ptr = frame.f_back;
 			}
 		}
+#endif /* USE_BPF_LOOP */
 		event->stack_complete = frame_ptr == NULL;
 	} else {
 		event->stack_complete = 1;
diff --git a/tools/testing/selftests/bpf/progs/pyperf600_bpf_loop.c b/tools/testing/selftests/bpf/progs/pyperf600_bpf_loop.c
new file mode 100644
index 000000000000..bde8baed4ca6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/pyperf600_bpf_loop.c
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2021 Facebook
+
+#define STACK_MAX_LEN 600
+#define USE_BPF_LOOP
+#include "pyperf.h"
diff --git a/tools/testing/selftests/bpf/progs/strobemeta.h b/tools/testing/selftests/bpf/progs/strobemeta.h
index 60c93aee2f4a..753718595c26 100644
--- a/tools/testing/selftests/bpf/progs/strobemeta.h
+++ b/tools/testing/selftests/bpf/progs/strobemeta.h
@@ -445,6 +445,48 @@ static __always_inline void *read_map_var(struct strobemeta_cfg *cfg,
 	return payload;
 }
 
+#ifdef USE_BPF_LOOP
+enum read_type {
+	READ_INT_VAR,
+	READ_MAP_VAR,
+	READ_STR_VAR,
+};
+
+struct read_var_ctx {
+	struct strobemeta_payload *data;
+	void *tls_base;
+	struct strobemeta_cfg *cfg;
+	void *payload;
+	/* value gets mutated */
+	struct strobe_value_generic *value;
+	enum read_type type;
+};
+
+static int read_var_callback(__u32 index, struct read_var_ctx *ctx)
+{
+	switch (ctx->type) {
+	case READ_INT_VAR:
+		if (index >= STROBE_MAX_INTS)
+			return 1;
+		read_int_var(ctx->cfg, index, ctx->tls_base, ctx->value, ctx->data);
+		break;
+	case READ_MAP_VAR:
+		if (index >= STROBE_MAX_MAPS)
+			return 1;
+		ctx->payload = read_map_var(ctx->cfg, index, ctx->tls_base,
+					    ctx->value, ctx->data, ctx->payload);
+		break;
+	case READ_STR_VAR:
+		if (index >= STROBE_MAX_STRS)
+			return 1;
+		ctx->payload += read_str_var(ctx->cfg, index, ctx->tls_base,
+					     ctx->value, ctx->data, ctx->payload);
+		break;
+	}
+	return 0;
+}
+#endif /* USE_BPF_LOOP */
+
 /*
  * read_strobe_meta returns NULL, if no metadata was read; otherwise returns
  * pointer to *right after* payload ends
@@ -475,11 +517,36 @@ static void *read_strobe_meta(struct task_struct *task,
 	 */
 	tls_base = (void *)task;
 
+#ifdef USE_BPF_LOOP
+	struct read_var_ctx ctx = {
+		.cfg = cfg,
+		.tls_base = tls_base,
+		.value = &value,
+		.data = data,
+		.payload = payload,
+	};
+	int err;
+
+	ctx.type = READ_INT_VAR;
+	err = bpf_loop(STROBE_MAX_INTS, read_var_callback, &ctx, 0);
+	if (err != STROBE_MAX_INTS)
+		return NULL;
+
+	ctx.type = READ_STR_VAR;
+	err = bpf_loop(STROBE_MAX_STRS, read_var_callback, &ctx, 0);
+	if (err != STROBE_MAX_STRS)
+		return NULL;
+
+	ctx.type = READ_MAP_VAR;
+	err = bpf_loop(STROBE_MAX_MAPS, read_var_callback, &ctx, 0);
+	if (err != STROBE_MAX_MAPS)
+		return NULL;
+#else
 #ifdef NO_UNROLL
 #pragma clang loop unroll(disable)
 #else
 #pragma unroll
-#endif
+#endif /* NO_UNROLL */
 	for (int i = 0; i < STROBE_MAX_INTS; ++i) {
 		read_int_var(cfg, i, tls_base, &value, data);
 	}
@@ -487,7 +554,7 @@ static void *read_strobe_meta(struct task_struct *task,
 #pragma clang loop unroll(disable)
 #else
 #pragma unroll
-#endif
+#endif /* NO_UNROLL */
 	for (int i = 0; i < STROBE_MAX_STRS; ++i) {
 		payload += read_str_var(cfg, i, tls_base, &value, data, payload);
 	}
@@ -495,10 +562,12 @@ static void *read_strobe_meta(struct task_struct *task,
 #pragma clang loop unroll(disable)
 #else
 #pragma unroll
-#endif
+#endif /* NO_UNROLL */
 	for (int i = 0; i < STROBE_MAX_MAPS; ++i) {
 		payload = read_map_var(cfg, i, tls_base, &value, data, payload);
 	}
+#endif /* USE_BPF_LOOP */
+
 	/*
 	 * return pointer right after end of payload, so it's possible to
 	 * calculate exact amount of useful data that needs to be sent
diff --git a/tools/testing/selftests/bpf/progs/strobemeta_bpf_loop.c b/tools/testing/selftests/bpf/progs/strobemeta_bpf_loop.c
new file mode 100644
index 000000000000..e6f9f920e68a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/strobemeta_bpf_loop.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+// Copyright (c) 2021 Facebook
+
+#define STROBE_MAX_INTS 2
+#define STROBE_MAX_STRS 25
+#define STROBE_MAX_MAPS 100
+#define STROBE_MAX_MAP_ENTRIES 20
+#define USE_BPF_LOOP
+#include "strobemeta.h"
-- 
2.30.2



* [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-23 18:34 [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
                   ` (2 preceding siblings ...)
  2021-11-23 18:34 ` [PATCH v2 bpf-next 3/4] selftests/bpf: measure bpf_loop verifier performance Joanne Koong
@ 2021-11-23 18:34 ` Joanne Koong
  2021-11-23 19:19   ` Toke Høiland-Jørgensen
  2021-11-23 18:47 ` [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
  4 siblings, 1 reply; 15+ messages in thread
From: Joanne Koong @ 2021-11-23 18:34 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, Kernel-team, Joanne Koong

Add a benchmark to measure the throughput and latency of the bpf_loop
call.

Testing this on QEMU on my dev machine with 1 thread, the data is
as follows:

        nr_loops: 1
bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op

        nr_loops: 10
bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op

        nr_loops: 100
bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op

        nr_loops: 500
bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op

        nr_loops: 1000
bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op

        nr_loops: 5000
bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op

        nr_loops: 10000
bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op

        nr_loops: 50000
bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op

        nr_loops: 100000
bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op

        nr_loops: 500000
bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op

        nr_loops: 1000000
bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op

From this data, we can see that the latency per loop decreases as the
number of loops increases. On this particular machine, each loop had an
overhead of roughly 13 ns, and we were able to run ~70 million loops
per second.
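
A single configuration can also be run directly through the bench
binary (an illustrative invocation; the run_bench_bpf_loop.sh script
added in this patch sweeps thread counts and nr_loops values through
the same binary):

  # from tools/testing/selftests/bpf, after building the bench binary
  ./bench -p 1 --nr_loops 1000 bpf-loop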

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  26 +++++
 tools/testing/selftests/bpf/bench.h           |   1 +
 .../selftests/bpf/benchs/bench_bpf_loop.c     | 105 ++++++++++++++++++
 .../bpf/benchs/run_bench_bpf_loop.sh          |  15 +++
 .../selftests/bpf/benchs/run_common.sh        |  15 +++
 .../selftests/bpf/progs/bpf_loop_bench.c      |  26 +++++
 7 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 35684d61aaeb..a6c0e92c86a1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -531,6 +531,7 @@ $(OUTPUT)/bench_trigger.o: $(OUTPUT)/trigger_bench.skel.h
 $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 			    $(OUTPUT)/perfbuf_bench.skel.h
 $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
+$(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -540,7 +541,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
 		 $(OUTPUT)/bench_ringbufs.o \
-		 $(OUTPUT)/bench_bloom_filter_map.o
+		 $(OUTPUT)/bench_bloom_filter_map.o \
+		 $(OUTPUT)/bench_bpf_loop.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index c75e7ee28746..bfd969e0424f 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -134,6 +134,28 @@ void hits_drops_report_final(struct bench_res res[], int res_cnt)
 	       total_ops_mean, total_ops_stddev);
 }
 
+void ops_report_final(struct bench_res res[], int res_cnt)
+{
+	double hits_mean = 0.0, hits_stddev = 0.0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++)
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++)
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+
+		hits_stddev = sqrt(hits_stddev);
+	}
+	printf("Summary: throughput %8.3lf \u00B1 %5.3lf M ops/s (%7.3lfM ops/prod), ",
+	       hits_mean, hits_stddev, hits_mean / env.producer_cnt);
+	printf("latency %8.3lf ns/op (%7.3lf ns/op /prod)\n",
+	       1000.0 / hits_mean, 1000.0 / hits_mean / env.producer_cnt);
+}
+
 const char *argp_program_version = "benchmark";
 const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
 const char argp_program_doc[] =
@@ -171,10 +193,12 @@ static const struct argp_option opts[] = {
 
 extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
+extern struct argp bench_bpf_loop_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
+	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
 	{},
 };
 
@@ -373,6 +397,7 @@ extern const struct bench bench_bloom_update;
 extern const struct bench bench_bloom_false_positive;
 extern const struct bench bench_hashmap_without_bloom;
 extern const struct bench bench_hashmap_with_bloom;
+extern const struct bench bench_bpf_loop;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -404,6 +429,7 @@ static const struct bench *benchs[] = {
 	&bench_bloom_false_positive,
 	&bench_hashmap_without_bloom,
 	&bench_hashmap_with_bloom,
+	&bench_bpf_loop,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index 624c6b11501f..f0895de9aad0 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -59,6 +59,7 @@ void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns);
 void hits_drops_report_final(struct bench_res res[], int res_cnt);
 void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns);
 void false_hits_report_final(struct bench_res res[], int res_cnt);
+void ops_report_final(struct bench_res res[], int res_cnt);
 
 static inline __u64 get_time_ns() {
 	struct timespec t;
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c b/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
new file mode 100644
index 000000000000..809a51d7be66
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <argp.h>
+#include "bench.h"
+#include "bpf_loop_bench.skel.h"
+
+/* BPF triggering benchmarks */
+static struct ctx {
+	struct bpf_loop_bench *skel;
+} ctx;
+
+static struct {
+	__u32 nr_loops;
+} args = {
+	.nr_loops = 10,
+};
+
+enum {
+	ARG_NR_LOOPS = 4000,
+};
+
+static const struct argp_option opts[] = {
+	{ "nr_loops", ARG_NR_LOOPS, "nr_loops", 0,
+		"Set number of loops for the bpf_loop helper"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_NR_LOOPS:
+		args.nr_loops = strtol(arg, NULL, 10);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+/* exported into benchmark runner */
+const struct argp bench_bpf_loop_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+static void validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static void *producer(void *input)
+{
+	while (true)
+		/* trigger the bpf program */
+		syscall(__NR_getpgid);
+
+	return NULL;
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+static void measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+}
+
+static void setup(void)
+{
+	struct bpf_link *link;
+
+	setup_libbpf();
+
+	ctx.skel = bpf_loop_bench__open_and_load();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	link = bpf_program__attach(ctx.skel->progs.benchmark);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+
+	ctx.skel->bss->nr_loops = args.nr_loops;
+}
+
+const struct bench bench_bpf_loop = {
+	.name = "bpf-loop",
+	.validate = validate,
+	.setup = setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = ops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh b/tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
new file mode 100755
index 000000000000..ff740e80ba84
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+for t in 1 4 8 12 16; do
+for i in 1 10 100 500 1000 5000 10000 50000 100000 500000 1000000; do
+subtitle "nr_loops: $i, nr_threads: $t"
+	summarize_ops "bpf_loop: " \
+	    "$($RUN_BENCH -p $t --nr_loops $i bpf-loop)"
+	printf "\n"
+done
+done
diff --git a/tools/testing/selftests/bpf/benchs/run_common.sh b/tools/testing/selftests/bpf/benchs/run_common.sh
index 9a16be78b180..6c5e6023a69f 100644
--- a/tools/testing/selftests/bpf/benchs/run_common.sh
+++ b/tools/testing/selftests/bpf/benchs/run_common.sh
@@ -33,6 +33,14 @@ function percentage()
 	echo "$*" | sed -E "s/.*Percentage\s=\s+([0-9]+\.[0-9]+).*/\1/"
 }
 
+function ops()
+{
+	echo -n "throughput: "
+	echo -n "$*" | sed -E "s/.*throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+	echo -n -e ", latency: "
+	echo "$*" | sed -E "s/.*latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
+}
+
 function total()
 {
 	echo "$*" | sed -E "s/.*total operations\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
@@ -52,6 +60,13 @@ function summarize_percentage()
 	printf "%-20s %s%%\n" "$bench" "$(percentage $summary)"
 }
 
+function summarize_ops()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s\n" "$bench" "$(ops $summary)"
+}
+
 function summarize_total()
 {
 	bench="$1"
diff --git a/tools/testing/selftests/bpf/progs/bpf_loop_bench.c b/tools/testing/selftests/bpf/progs/bpf_loop_bench.c
new file mode 100644
index 000000000000..ff00621858c0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_loop_bench.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2021 Facebook
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+u32 nr_loops;
+long hits;
+
+static int empty_callback(__u32 index, void *data)
+{
+	return 0;
+}
+
+SEC("fentry/__x64_sys_getpgid")
+int benchmark(void *ctx)
+{
+	for (int i = 0; i < 1000; i++) {
+		bpf_loop(nr_loops, empty_callback, NULL, 0);
+
+		__sync_add_and_fetch(&hits, nr_loops);
+	}
+	return 0;
+}
-- 
2.30.2



* Re: [PATCH v2 bpf-next 0/4] Add bpf_loop_helper
  2021-11-23 18:34 [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
                   ` (3 preceding siblings ...)
  2021-11-23 18:34 ` [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark Joanne Koong
@ 2021-11-23 18:47 ` Joanne Koong
  4 siblings, 0 replies; 15+ messages in thread
From: Joanne Koong @ 2021-11-23 18:47 UTC (permalink / raw)
  To: bpf; +Cc: andrii, ast, daniel, Kernel-team

On 11/23/21 10:34 AM, Joanne Koong wrote:

> This patchset add a new helper, bpf_loop.
>
> One of the complexities of using for loops in bpf programs is that the verifier
> needs to ensure that in every possibility of the loop logic, the loop will always
> terminate. As such, there is a limit on how many iterations the loop can do.
>
> The bpf_loop helper moves the loop logic into the kernel and can thereby
> guarantee that the loop will always terminate. The bpf_loop helper simplifies
> a lot of the complexity the verifier needs to check, as well as removes the
> constraint on the number of loops able to be run.
>
>  From the test results, we see that using bpf_loop in place
> of the traditional for loop led to a decrease in verification time
> and number of bpf instructions by 100%. The benchmark results show
> that as the number of iterations increases, the overhead per iteration
> decreases.
I will change the wording here to "led to a decrease in verification
time and number of bpf instructions by approximately ~99%". I changed
this in patch 3 but forgot to update this here as well.
> The high-level overview of the patches -
> Patch 1 - kernel-side + API changes for adding bpf_loop
> Patch 2 - tests
> Patch 3 - use bpf_loop in strobemeta + pyperf600 and measure verifier performance
> Patch 4 - benchmark for throughput + latency of bpf_loop call
>
> v1 -> v2:
> ~ Change helper name to bpf_loop (instead of bpf_for_each)
> ~ Set max nr_loops (~8 million loops) for bpf_loop call
> ~ Split tests + strobemeta/pyperf600 changes into two patches
> ~ Add new ops_report_final helper for outputting throughput and latency
>
>
> Joanne Koong (4):
>    bpf: Add bpf_loop helper
>    selftests/bpf: Add bpf_loop test
>    selftests/bpf: measure bpf_loop verifier performance
>    selftest/bpf/benchs: add bpf_loop benchmark
>
>   include/linux/bpf.h                           |   1 +
>   include/uapi/linux/bpf.h                      |  25 ++++
>   kernel/bpf/bpf_iter.c                         |  35 +++++
>   kernel/bpf/helpers.c                          |   2 +
>   kernel/bpf/verifier.c                         |  97 +++++++-----
>   tools/include/uapi/linux/bpf.h                |  25 ++++
>   tools/testing/selftests/bpf/Makefile          |   4 +-
>   tools/testing/selftests/bpf/bench.c           |  26 ++++
>   tools/testing/selftests/bpf/bench.h           |   1 +
>   .../selftests/bpf/benchs/bench_bpf_loop.c     | 105 +++++++++++++
>   .../bpf/benchs/run_bench_bpf_loop.sh          |  15 ++
>   .../selftests/bpf/benchs/run_common.sh        |  15 ++
>   .../selftests/bpf/prog_tests/bpf_loop.c       | 138 ++++++++++++++++++
>   .../bpf/prog_tests/bpf_verif_scale.c          |  12 ++
>   tools/testing/selftests/bpf/progs/bpf_loop.c  |  99 +++++++++++++
>   .../selftests/bpf/progs/bpf_loop_bench.c      |  26 ++++
>   tools/testing/selftests/bpf/progs/pyperf.h    |  71 ++++++++-
>   .../selftests/bpf/progs/pyperf600_bpf_loop.c  |   6 +
>   .../testing/selftests/bpf/progs/strobemeta.h  |  75 +++++++++-
>   .../selftests/bpf/progs/strobemeta_bpf_loop.c |   9 ++
>   20 files changed, 745 insertions(+), 42 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
>   create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_loop.c
>   create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop.c
>   create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop_bench.c
>   create mode 100644 tools/testing/selftests/bpf/progs/pyperf600_bpf_loop.c
>   create mode 100644 tools/testing/selftests/bpf/progs/strobemeta_bpf_loop.c
>


* Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-23 18:34 ` [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark Joanne Koong
@ 2021-11-23 19:19   ` Toke Høiland-Jørgensen
  2021-11-24  0:20     ` Joanne Koong
  0 siblings, 1 reply; 15+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-23 19:19 UTC (permalink / raw)
  To: Joanne Koong, bpf; +Cc: andrii, ast, daniel, Kernel-team, Joanne Koong

Joanne Koong <joannekoong@fb.com> writes:

> Add benchmark to measure the throughput and latency of the bpf_loop
> call.
>
> Testing this on qemu on my dev machine on 1 thread, the data is
> as follows:
>
>         nr_loops: 1
> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>
>         nr_loops: 10
> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>
>         nr_loops: 100
> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>
>         nr_loops: 500
> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>
>         nr_loops: 1000
> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>
>         nr_loops: 5000
> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>
>         nr_loops: 10000
> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>
>         nr_loops: 50000
> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>
>         nr_loops: 100000
> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>
>         nr_loops: 500000
> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>
>         nr_loops: 1000000
> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>
> From this data, we can see that the latency per loop decreases as the
> number of loops increases. On this particular machine, each loop had an
> overhead of about ~13 ns, and we were able to run ~70 million loops
> per second.

The latency figures are great, thanks! I assume these numbers are with
retpolines enabled? Otherwise 12ns seems a bit much... Or is this
because of qemu?

-Toke



* Re: [PATCH v2 bpf-next 1/4] bpf: Add bpf_loop helper
  2021-11-23 18:34 ` [PATCH v2 bpf-next 1/4] bpf: Add bpf_loop helper Joanne Koong
@ 2021-11-23 22:46   ` Andrii Nakryiko
  0 siblings, 0 replies; 15+ messages in thread
From: Andrii Nakryiko @ 2021-11-23 22:46 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Tue, Nov 23, 2021 at 10:34 AM Joanne Koong <joannekoong@fb.com> wrote:
>
> This patch adds the kernel-side and API changes for a new helper
> function, bpf_loop:
>
> long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
> u64 flags);
>
> where long (*callback_fn)(u32 index, void *ctx);
>
> bpf_loop invokes the "callback_fn" **nr_loops** times or until the
> callback_fn returns 1. The callback_fn can only return 0 or 1, and
> this is enforced by the verifier. The callback_fn index is zero-indexed.
>
> A few things to please note:
> ~ The "u64 flags" parameter is currently unused but is included in
> case a future use case for it arises.
> ~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
> bpf_callback_t is used as the callback function cast.
> ~ A program can have nested bpf_loop calls but the program must
> still adhere to the verifier constraint of its stack depth (the stack depth
> cannot exceed MAX_BPF_STACK))
> ~ Recursive callback_fns do not pass the verifier, due to the call stack
> for these being too deep.
> ~ The next patch will include the tests and benchmark
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       | 25 +++++++++
>  kernel/bpf/bpf_iter.c          | 35 ++++++++++++
>  kernel/bpf/helpers.c           |  2 +
>  kernel/bpf/verifier.c          | 97 +++++++++++++++++++++-------------
>  tools/include/uapi/linux/bpf.h | 25 +++++++++
>  6 files changed, 148 insertions(+), 37 deletions(-)
>

[...]

> +/* maximum number of loops */
> +#define MAX_LOOPS      BIT(23)
> +
> +BPF_CALL_4(bpf_loop, u32, nr_loops, void *, callback_fn, void *, callback_ctx,
> +          u64, flags)
> +{
> +       bpf_callback_t callback = (bpf_callback_t)callback_fn;
> +       u64 ret;
> +       u32 i;
> +
> +       if (flags || nr_loops > MAX_LOOPS)
> +               return -EINVAL;

nit: it's probably a good idea to return -E2BIG for nr_loops >
MAX_LOOPS? It will be more obvious for unsuspecting users that forgot
to read the documentation carefully :)

> +
> +       for (i = 0; i < nr_loops; i++) {
> +               ret = callback((u64)i, (u64)(long)callback_ctx, 0, 0, 0);
> +               /* return value: 0 - continue, 1 - stop and return */
> +               if (ret) {
> +                       i++;
> +                       break;

nit: could be just return i + 1;

> +               }
> +       }
> +
> +       return i;
> +}
> +

[...]

> +       case BPF_FUNC_get_local_storage:
> +               /* check that flags argument in get_local_storage(map, flags) is 0,
> +                * this is required because get_local_storage() can't return an error.
> +                */
> +               if (!register_is_null(&regs[BPF_REG_2])) {
> +                       verbose(env, "get_local_storage() doesn't support non-zero flags\n");
>                         return -EINVAL;
> -       }
> -
> -       if (func_id == BPF_FUNC_timer_set_callback) {
> +               }
> +               err = 0;

err is guaranteed to be zero, no need to re-assign it

> +               break;
> +       case BPF_FUNC_for_each_map_elem:
>                 err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> -                                       set_timer_callback_state);
> -               if (err < 0)
> -                       return -EINVAL;
> -       }
> -
> -       if (func_id == BPF_FUNC_find_vma) {
> +                                       set_map_elem_callback_state) ? -EINVAL : 0;

I think it's actually good to propagate the original error code, so
let's not do `? -EINVAL : 0` logic

> +               break;
> +       case BPF_FUNC_timer_set_callback:
>                 err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> -                                       set_find_vma_callback_state);
> -               if (err < 0)
> -                       return -EINVAL;
> -       }
> -
> -       if (func_id == BPF_FUNC_snprintf) {
> +                                       set_timer_callback_state) ? -EINVAL : 0;
> +               break;
> +       case BPF_FUNC_find_vma:
> +               err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> +                                       set_find_vma_callback_state) ? -EINVAL : 0;
> +               break;
> +       case BPF_FUNC_snprintf:
>                 err = check_bpf_snprintf_call(env, regs);
> -               if (err < 0)
> -                       return err;
> +               break;
> +       case BPF_FUNC_loop:
> +               err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> +                                       set_loop_callback_state) ? -EINVAL : 0;
> +               break;
> +       default:
> +               err = 0;

same, err is already zero if no error so far was found

>         }
>
> +       if (err)
> +               return err;
> +

[...]


* Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-23 19:19   ` Toke Høiland-Jørgensen
@ 2021-11-24  0:20     ` Joanne Koong
  2021-11-24 12:56       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 15+ messages in thread
From: Joanne Koong @ 2021-11-24  0:20 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, bpf; +Cc: andrii, ast, daniel, Kernel-team

On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:

> Joanne Koong <joannekoong@fb.com> writes:
>
>> Add benchmark to measure the throughput and latency of the bpf_loop
>> call.
>>
>> Testing this on qemu on my dev machine on 1 thread, the data is
>> as follows:
>>
>>          nr_loops: 1
>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>>
>>          nr_loops: 10
>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>>
>>          nr_loops: 100
>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>>
>>          nr_loops: 500
>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>>
>>          nr_loops: 1000
>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>>
>>          nr_loops: 5000
>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>>
>>          nr_loops: 10000
>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>>
>>          nr_loops: 50000
>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>>
>>          nr_loops: 100000
>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>>
>>          nr_loops: 500000
>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>>
>>          nr_loops: 1000000
>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>>
>>  From this data, we can see that the latency per loop decreases as the
>> number of loops increases. On this particular machine, each loop had an
>> overhead of about ~13 ns, and we were able to run ~70 million loops
>> per second.
> The latency figures are great, thanks! I assume these numbers are with
> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
> because of qemu?
I just tested it on a machine (without retpolines enabled) that runs on
actual hardware, and here is what I found:

             nr_loops: 1
     bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op

             nr_loops: 10
     bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op

             nr_loops: 100
     bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op

             nr_loops: 500
     bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op

             nr_loops: 1000
     bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op

             nr_loops: 5000
     bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op

             nr_loops: 10000
     bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op

             nr_loops: 50000
     bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op

             nr_loops: 100000
     bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op

             nr_loops: 500000
     bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op

             nr_loops: 1000000
     bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op

The latency is about 4 ns per loop.

I will update the commit message in v3 with these new numbers as well.
>
> -Toke
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-24  0:20     ` Joanne Koong
@ 2021-11-24 12:56       ` Toke Høiland-Jørgensen
  2021-11-24 19:26         ` Andrii Nakryiko
  0 siblings, 1 reply; 15+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-24 12:56 UTC (permalink / raw)
  To: Joanne Koong, bpf; +Cc: andrii, ast, daniel, Kernel-team

Joanne Koong <joannekoong@fb.com> writes:

> On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
>
>> Joanne Koong <joannekoong@fb.com> writes:
>>
>>> Add benchmark to measure the throughput and latency of the bpf_loop
>>> call.
>>>
>>> Testing this on qemu on my dev machine on 1 thread, the data is
>>> as follows:
>>>
>>> [...]
>>>
>>>  From this data, we can see that the latency per loop decreases as the
>>> number of loops increases. On this particular machine, each loop had an
>>> overhead of about ~13 ns, and we were able to run ~70 million loops
>>> per second.
>> The latency figures are great, thanks! I assume these numbers are with
>> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
>> because of qemu?
> I just tested it on a machine (without retpolines enabled) that runs on
> actual hardware, and here is what I found:
>
> [...]
>
> The latency is about 4 ns per loop.
>
> I will update the commit message in v3 with these new numbers as well.

Right, awesome, thank you for the additional test. This is closer to
what I would expect: on the hardware I'm usually testing on, a function
call takes ~1.5ns, but the difference might just be the hardware, or
because these are indirect calls.

Another comparison just occurred to me (but it's totally OK if you don't
want to add any more benchmarks):

The difference between a program that does:

bpf_loop(nr_loops, empty_callback, NULL, 0);

and

for (i = 0; i < nr_loops; i++)
  empty_callback();


should show the difference between the indirect call in the helper and a
direct call from BPF (and show what the potential performance gain from
having the verifier inline the helper would be). This was more
interesting when there was a ~10x delta than a ~2x between your numbers
and mine, so also totally OK to leave this as-is, and we can cycle back
to such optimisations if it turns out to be necessary...
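
Concretely, the two variants might look something like this (just a
sketch; the section name, NR_LOOPS and the __noinline annotation are
illustrative choices, not taken from the posted patches. If the verifier
objects to one function serving both as a bpf_loop callback and as a
directly-called subprog, the callback can simply be duplicated):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	/* compile-time bound so the verifier accepts the for loop */
	#define NR_LOOPS 1000

	/* __noinline keeps this a real BPF-to-BPF call in the direct version */
	static __noinline int empty_callback(__u32 index, void *ctx)
	{
		return 0; /* 0 - continue, nonzero - stop early */
	}

	SEC("tp/syscalls/sys_enter_getpgid")
	int loop_with_helper(void *ctx)
	{
		/* one indirect call into empty_callback per iteration,
		 * made from inside the bpf_loop() helper */
		bpf_loop(NR_LOOPS, empty_callback, NULL, 0);
		return 0;
	}

	SEC("tp/syscalls/sys_enter_getpgid")
	int loop_direct(void *ctx)
	{
		int i;

		/* direct call per iteration, no helper involved */
		for (i = 0; i < NR_LOOPS; i++)
			empty_callback(i, NULL);
		return 0;
	}

	char _license[] SEC("license") = "GPL";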

-Toke


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-24 12:56       ` Toke Høiland-Jørgensen
@ 2021-11-24 19:26         ` Andrii Nakryiko
  2021-11-24 21:59           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 15+ messages in thread
From: Andrii Nakryiko @ 2021-11-24 19:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Joanne Koong <joannekoong@fb.com> writes:
>
> > [...]
>
> Right, awesome, thank you for the additional test. This is closer to
> what I would expect: on the hardware I'm usually testing on, a function
> call takes ~1.5ns, but the difference might just be the hardware, or
> because these are indirect calls.
>
> Another comparison just occurred to me (but it's totally OK if you don't
> want to add any more benchmarks):
>
> The difference between a program that does:
>
> bpf_loop(nr_loops, empty_callback, NULL, 0);
>
> and
>
> for (i = 0; i < nr_loops; i++)
>   empty_callback();

You are basically trying to measure the overhead of bpf_loop() helper
call itself, because other than that it should be identical. We can
estimate that already from the numbers Joanne posted above:

             nr_loops: 1
      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
             nr_loops: 1000
      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op

nr_loops:1 is bpf_loop() overhead and one static callback call.
bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
17.6ns. I don't think we need yet another benchmark just for this.
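
To spell the arithmetic out (T_helper = fixed per-call overhead of the
helper, T_cb = per-iteration callback cost; both are illustrative
symbols, not measured quantities):

    latency(nr_loops) ~= T_helper / nr_loops + T_cb
    latency(1)        ~= T_helper + T_cb ~= 21.4 ns
    latency(1000)     ~= T_cb           ~=  3.8 ns
    so  T_helper      ~= 21.4 - 3.8     ~= 17.6 ns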


>
> should show the difference between the indirect call in the helper and a
> direct call from BPF (and show what the potential performance gain from
> having the verifier inline the helper would be). This was more
> interesting when there was a ~10x delta than a ~2x between your numbers
> and mine, so also totally OK to leave this as-is, and we can cycle back
> to such optimisations if it turns out to be necessary...
>
> -Toke
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-24 19:26         ` Andrii Nakryiko
@ 2021-11-24 21:59           ` Toke Høiland-Jørgensen
  2021-11-25  0:04             ` Joanne Koong
  0 siblings, 1 reply; 15+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-24 21:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> [...]
>> Another comparison just occurred to me (but it's totally OK if you don't
>> want to add any more benchmarks):
>>
>> The difference between a program that does:
>>
>> bpf_loop(nr_loops, empty_callback, NULL, 0);
>>
>> and
>>
>> for (i = 0; i < nr_loops; i++)
>>   empty_callback();
>
> You are basically trying to measure the overhead of bpf_loop() helper
> call itself, because other than that it should be identical.

No, I'm trying to measure the difference between the indirect call in
the helper, and the direct call from the BPF program. Should be minor
without retpolines, and somewhat higher where they are enabled...

> We can estimate that already from the numbers Joanne posted above:
>
>              nr_loops: 1
>       bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>              nr_loops: 1000
>       bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>
> nr_loops:1 is bpf_loop() overhead and one static callback call.
> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
> 17.6ns. I don't think we need yet another benchmark just for this.

That seems really high, though? The helper is a pretty simple function,
and the call to it should just be JIT'ed into a single regular function
call, right? So why the order-of-magnitude difference?

-Toke


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-24 21:59           ` Toke Høiland-Jørgensen
@ 2021-11-25  0:04             ` Joanne Koong
  2021-11-25 11:35               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 15+ messages in thread
From: Joanne Koong @ 2021-11-25  0:04 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Andrii Nakryiko
  Cc: bpf, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On 11/24/21 1:59 PM, Toke Høiland-Jørgensen wrote:

> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
>> [...]
>
>> We can estimate that already from the numbers Joanne posted above:
>>
>>               nr_loops: 1
>>        bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>               nr_loops: 1000
>>        bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>
>> nr_loops:1 is bpf_loop() overhead and one static callback call.
>> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
>> 17.6ns. I don't think we need yet another benchmark just for this.
> That seems really high, though? The helper is a pretty simple function,
> and the call to it should just be JIT'ed into a single regular function
> call, right? So why the order-of-magnitude difference?
I think the overhead of triggering the bpf program from the userspace
benchmarking program is also contributing to this. When nr_loops = 1, we
have to do a context switch between userspace and the kernel for every
1000 loops; this overhead also contributes to the latency numbers above.
> -Toke
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-25  0:04             ` Joanne Koong
@ 2021-11-25 11:35               ` Toke Høiland-Jørgensen
  2021-11-29 19:41                 ` Joanne Koong
  0 siblings, 1 reply; 15+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-25 11:35 UTC (permalink / raw)
  To: Joanne Koong, Andrii Nakryiko
  Cc: bpf, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Kernel Team

Joanne Koong <joannekoong@fb.com> writes:

> On 11/24/21 1:59 PM, Toke Høiland-Jørgensen wrote:
>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>>> [...]
>>> We can estimate that already from the numbers Joanne posted above:
>>>
>>>               nr_loops: 1
>>>        bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>>               nr_loops: 1000
>>>        bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>>
>>> nr_loops:1 is bpf_loop() overhead and one static callback call.
>>> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
>>> 17.6ns. I don't think we need yet another benchmark just for this.
>> That seems really high, though? The helper is a pretty simple function,
>> and the call to it should just be JIT'ed into a single regular function
>> call, right? So why the order-of-magnitude difference?
> I think the overhead of triggering the bpf program from the userspace
> benchmarking program is also contributing to this. When nr_loops = 1, we
> have to do a context switch between userspace and the kernel for every
> 1000 loops; this overhead also contributes to the latency numbers above.

Right okay. But then that data point is not really measuring what it's
purporting to measure? That's a bit misleading, so maybe better to leave
it out entirely?

-Toke


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
  2021-11-25 11:35               ` Toke Høiland-Jørgensen
@ 2021-11-29 19:41                 ` Joanne Koong
  0 siblings, 0 replies; 15+ messages in thread
From: Joanne Koong @ 2021-11-29 19:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Andrii Nakryiko
  Cc: bpf, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Kernel Team

On 11/25/21 3:35 AM, Toke Høiland-Jørgensen wrote:

> Joanne Koong <joannekoong@fb.com> writes:
>
>> On 11/24/21 1:59 PM, Toke Høiland-Jørgensen wrote:
>>
>>> [...]
>>> That seems really high, though? The helper is a pretty simple function,
>>> and the call to it should just be JIT'ed into a single regular function
>>> call, right? So why the order-of-magnitude difference?
>> I think the overhead of triggering the bpf program from the userspace
>> benchmarking program is also contributing to this. When nr_loops = 1, we
>> have to do a context switch between userspace and the kernel for every
>> 1000 loops; this overhead also contributes to the latency numbers above.
> Right okay. But then that data point is not really measuring what it's
> purporting to measure? That's a bit misleading, so maybe better to leave
> it out entirely?
Sure, I will leave this nr_loops = 1 datapoint out in v3 of this patchset :)
The overhead of triggering the bpf program from the userspace benchmarks
is present in every datapoint, but for nr_loops = 1 it is especially
pronounced, since that overhead is incurred once per 1000 loops, whereas
for the other datapoints it is incurred once per 1000 * nr_loops loops.
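
As a rough model of what the benchmark reports (illustrative only; the
factor of 1000 is the per-trigger batching described above, and T_loop,
T_helper, T_trigger are illustrative symbols for the per-iteration cost,
the per-bpf_loop()-call overhead, and the userspace trigger cost):

    latency_per_loop ~= T_loop + T_helper / nr_loops
                               + T_trigger / (1000 * nr_loops)

so T_trigger is amortized away as nr_loops grows, but still shows up
clearly at nr_loops = 1.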
> -Toke
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-11-29 19:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-23 18:34 [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
2021-11-23 18:34 ` [PATCH v2 bpf-next 1/4] bpf: Add bpf_loop helper Joanne Koong
2021-11-23 22:46   ` Andrii Nakryiko
2021-11-23 18:34 ` [PATCH v2 bpf-next 2/4] selftests/bpf: Add bpf_loop test Joanne Koong
2021-11-23 18:34 ` [PATCH v2 bpf-next 3/4] selftests/bpf: measure bpf_loop verifier performance Joanne Koong
2021-11-23 18:34 ` [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark Joanne Koong
2021-11-23 19:19   ` Toke Høiland-Jørgensen
2021-11-24  0:20     ` Joanne Koong
2021-11-24 12:56       ` Toke Høiland-Jørgensen
2021-11-24 19:26         ` Andrii Nakryiko
2021-11-24 21:59           ` Toke Høiland-Jørgensen
2021-11-25  0:04             ` Joanne Koong
2021-11-25 11:35               ` Toke Høiland-Jørgensen
2021-11-29 19:41                 ` Joanne Koong
2021-11-23 18:47 ` [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
