* [PATCH v2 bpf-next 0/3] Add benchmark runner and few benchmarks
@ 2020-05-08 23:20 Andrii Nakryiko
  2020-05-08 23:20 ` [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure Andrii Nakryiko
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2020-05-08 23:20 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add a generic benchmark runner framework which simplifies writing various
performance benchmarks in a consistent fashion. This framework will be used in
follow-up patches to test the performance of perf buffer and ring buffer as
well.

Patch #1 adds the generic runner implementation and atomic counter benchmarks
to validate the benchmark runner's behavior.

Patch #2 implements the test_overhead benchmark as part of the bench runner.
It also adds the fmod_ret BPF program type to the set of benchmarks.

Patch #3 tests faster alternatives to the set_task_comm() approach used in
test_overhead, in search of a minimal-overhead way to trigger BPF program
execution from user-space on demand.
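
For reference, the user-space triggering loop from patch #3 boils down to
(see benchs/bench_trigger.c):

	while (true)
		(void)syscall(__NR_getpgid);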

v1->v2:
  - moved benchmarks into benchs/ subdir (John);
  - added benchmark "suite" scripts (John);
  - a few small clean-ups, changed defaults, etc.

Andrii Nakryiko (3):
  selftests/bpf: add benchmark runner infrastructure
  selftest/bpf: fmod_ret prog and implement test_overhead as part of
    bench
  selftest/bpf: add BPF triggering benchmark

 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |  17 +-
 tools/testing/selftests/bpf/bench.c           | 398 ++++++++++++++++++
 tools/testing/selftests/bpf/bench.h           |  74 ++++
 .../selftests/bpf/benchs/bench_count.c        |  91 ++++
 .../selftests/bpf/benchs/bench_rename.c       | 195 +++++++++
 .../selftests/bpf/benchs/bench_trigger.c      | 167 ++++++++
 .../selftests/bpf/benchs/run_bench_rename.sh  |   9 +
 .../selftests/bpf/benchs/run_bench_trigger.sh |   9 +
 .../selftests/bpf/prog_tests/test_overhead.c  |  14 +-
 .../selftests/bpf/progs/test_overhead.c       |   6 +
 .../selftests/bpf/progs/trigger_bench.c       |  47 +++
 12 files changed, 1026 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bench.c
 create mode 100644 tools/testing/selftests/bpf/bench.h
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_count.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_rename.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_trigger.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_rename.sh
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_trigger.sh
 create mode 100644 tools/testing/selftests/bpf/progs/trigger_bench.c

-- 
2.24.1



* [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure
  2020-05-08 23:20 [PATCH v2 bpf-next 0/3] Add benchmark runner and few benchmarks Andrii Nakryiko
@ 2020-05-08 23:20 ` Andrii Nakryiko
  2020-05-09 17:10   ` Yonghong Song
  2020-05-08 23:20 ` [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench Andrii Nakryiko
  2020-05-08 23:20 ` [PATCH v2 bpf-next 3/3] selftest/bpf: add BPF triggering benchmark Andrii Nakryiko
  2 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2020-05-08 23:20 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

While working on BPF ringbuf implementation, testing, and benchmarking, I've
developed a pretty generic and modular benchmark runner, which seems to be
more broadly useful, as I've already used it for one more purpose (testing the
fastest way to trigger a BPF program, to minimize the overhead of in-kernel
code).

This patch adds the generic part of the benchmark runner and sets up the
Makefile for extending it with more sets of benchmarks.

The benchmark runner itself operates by spinning up a specified number of
producer and consumer threads and setting up an interval timer that sends a
SIGALRM signal to the application once a second. Every second, a snapshot of
the hits/drops counters is collected and stored in an array. Drops are useful
for producer/consumer benchmarks in which the producer might overwhelm the
consumers.

Once the test finishes after the given number of warm-up and testing seconds,
the mean and stddev are calculated (ignoring the warm-up results) and printed
out to stdout. This setup seems to give consistent and accurate results.
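
Concretely, for the n per-second samples x_i (in M/s) remaining after warm-up,
the summary reports the sample mean and standard deviation, as computed in
hits_drops_report_final():

  mean   = sum(x_i) / n
  stddev = sqrt(sum((mean - x_i)^2) / (n - 1))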

To validate the behavior, I added two atomic counting tests: global and local.
For the global one, all producer threads atomically increment the same counter
as fast as possible. This, of course, leads to a huge drop in performance once
there is more than one producer thread, due to CPUs fighting over the same
memory location.

Local counting, on the other hand, maintains one counter per producer thread,
incremented independently. Once per second, all counters are read and added
together to form the final "counting throughput" measurement. As expected,
such a setup demonstrates linear scalability with the number of producers (as
long as there are enough physical CPU cores, of course). See the example
output below. This setup can also nicely demonstrate the disastrous effects of
false sharing, if care is not taken to place the per-producer counters on
independent cache lines.
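
Here, each per-producer counter is padded out to its own cache line (from
bench.h below):

	struct counter {
		long value;
	} __attribute__((aligned(128)));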

The demo output shows the global counter first with 1 producer, then with 4.
Both total and per-producer performance drop significantly. The last run is
the local counter with 4 producers, demonstrating near-perfect scalability.

$ ./bench -a -w1 -d2 -p1 count-global
Setting up benchmark 'count-global'...
Benchmark 'count-global' started.
Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s

$ ./bench -a -w1 -d2 -p4 count-global
Setting up benchmark 'count-global'...
Benchmark 'count-global' started.
Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s

$ ./bench -a -w1 -d2 -p4 count-local
Setting up benchmark 'count-local'...
Benchmark 'count-local' started.
Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |  13 +-
 tools/testing/selftests/bpf/bench.c           | 372 ++++++++++++++++++
 tools/testing/selftests/bpf/bench.h           |  74 ++++
 .../selftests/bpf/benchs/bench_count.c        |  91 +++++
 5 files changed, 550 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bench.c
 create mode 100644 tools/testing/selftests/bpf/bench.h
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_count.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 3ff031972975..1bb204cee853 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -38,3 +38,4 @@ test_cpp
 /bpf_gcc
 /tools
 /runqslower
+/bench
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3d942be23d09..289fffbf975e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -77,7 +77,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
-	test_lirc_mode2_user xdping test_cpp runqslower
+	test_lirc_mode2_user xdping test_cpp runqslower bench
 
 TEST_CUSTOM_PROGS = urandom_read
 
@@ -405,6 +405,17 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
 	$(call msg,CXX,,$@)
 	$(CXX) $(CFLAGS) $^ $(LDLIBS) -o $@
 
+# Benchmark runner
+$(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
+	$(call msg,CC,,$@)
+	$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
+$(OUTPUT)/bench.o: bench.h
+$(OUTPUT)/bench: LDLIBS += -lm
+$(OUTPUT)/bench: $(OUTPUT)/bench.o \
+		 $(OUTPUT)/bench_count.o
+	$(call msg,BINARY,,$@)
+	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
+
 EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR)			\
 	prog_tests/tests.h map_tests/tests.h verifier/tests.h		\
 	feature								\
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
new file mode 100644
index 000000000000..dddc97cd4db6
--- /dev/null
+++ b/tools/testing/selftests/bpf/bench.c
@@ -0,0 +1,372 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#define _GNU_SOURCE
+#include <argp.h>
+#include <linux/compiler.h>
+#include <sys/time.h>
+#include <sched.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sys/sysinfo.h>
+#include <sys/resource.h>
+#include <signal.h>
+#include "bench.h"
+
+struct env env = {
+	.warmup_sec = 1,
+	.duration_sec = 5,
+	.affinity = false,
+	.consumer_cnt = 1,
+	.producer_cnt = 1,
+};
+
+static int libbpf_print_fn(enum libbpf_print_level level,
+		    const char *format, va_list args)
+{
+	if (level == LIBBPF_DEBUG && !env.verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static int bump_memlock_rlimit(void)
+{
+	struct rlimit rlim_new = {
+		.rlim_cur	= RLIM_INFINITY,
+		.rlim_max	= RLIM_INFINITY,
+	};
+
+	return setrlimit(RLIMIT_MEMLOCK, &rlim_new);
+}
+
+void setup_libbpf()
+{
+	int err;
+
+	libbpf_set_print(libbpf_print_fn);
+
+	err = bump_memlock_rlimit();
+	if (err)
+		fprintf(stderr, "failed to increase RLIMIT_MEMLOCK: %d", err);
+}
+
+void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns)
+{
+	double hits_per_sec, drops_per_sec;
+	double hits_per_prod;
+
+	hits_per_sec = res->hits / 1000000.0 / (delta_ns / 1000000000.0);
+	hits_per_prod = hits_per_sec / env.producer_cnt;
+	drops_per_sec = res->drops / 1000000.0 / (delta_ns / 1000000000.0);
+
+	printf("Iter %3d (%7.3lfus): ",
+	       iter, (delta_ns - 1000000000) / 1000.0);
+
+	printf("hits %8.3lfM/s (%7.3lfM/prod), drops %8.3lfM/s\n",
+	       hits_per_sec, hits_per_prod, drops_per_sec);
+}
+
+void hits_drops_report_final(struct bench_res res[], int res_cnt)
+{
+	int i;
+	double hits_mean = 0.0, drops_mean = 0.0;
+	double hits_stddev = 0.0, drops_stddev = 0.0;
+
+	for (i = 0; i < res_cnt; i++) {
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+		drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
+	}
+
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++) {
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+			drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
+					(drops_mean - res[i].drops / 1000000.0) /
+					(res_cnt - 1.0);
+		}
+		hits_stddev = sqrt(hits_stddev);
+		drops_stddev = sqrt(drops_stddev);
+	}
+	printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
+	       hits_mean, hits_stddev, hits_mean / env.producer_cnt);
+	printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
+	       drops_mean, drops_stddev);
+}
+
+const char *argp_program_version = "benchmark";
+const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
+const char argp_program_doc[] =
+"benchmark    Generic benchmarking framework.\n"
+"\n"
+"This tool runs benchmarks.\n"
+"\n"
+"USAGE: benchmark <bench-name>\n"
+"\n"
+"EXAMPLES:\n"
+"    # run 'count-local' benchmark with 1 producer and 1 consumer\n"
+"    benchmark count-local\n"
+"    # run 'count-local' with 16 producer and 8 consumer thread, pinned to CPUs\n"
+"    benchmark -p16 -c8 -a count-local\n";
+
+static const struct argp_option opts[] = {
+	{ "list", 'l', NULL, 0, "List available benchmarks"},
+	{ "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
+	{ "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
+	{ "producers", 'p', "NUM", 0, "Number of producer threads"},
+	{ "consumers", 'c', "NUM", 0, "Number of consumer threads"},
+	{ "verbose", 'v', NULL, 0, "Verbose debug output"},
+	{ "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
+	{ "b2b", 'b', NULL, 0, "Back-to-back mode"},
+	{ "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	static int pos_args;
+
+	switch (key) {
+	case 'v':
+		env.verbose = true;
+		break;
+	case 'l':
+		env.list = true;
+		break;
+	case 'd':
+		env.duration_sec = strtol(arg, NULL, 10);
+		if (env.duration_sec <= 0) {
+			fprintf(stderr, "Invalid duration: %s\n", arg);
+			argp_usage(state);
+		}
+		break;
+	case 'w':
+		env.warmup_sec = strtol(arg, NULL, 10);
+		if (env.warmup_sec <= 0) {
+			fprintf(stderr, "Invalid warm-up duration: %s\n", arg);
+			argp_usage(state);
+		}
+		break;
+	case 'p':
+		env.producer_cnt = strtol(arg, NULL, 10);
+		if (env.producer_cnt <= 0) {
+			fprintf(stderr, "Invalid producer count: %s\n", arg);
+			argp_usage(state);
+		}
+		break;
+	case 'c':
+		env.consumer_cnt = strtol(arg, NULL, 10);
+		if (env.consumer_cnt <= 0) {
+			fprintf(stderr, "Invalid consumer count: %s\n", arg);
+			argp_usage(state);
+		}
+		break;
+	case 'a':
+		env.affinity = true;
+		break;
+	case ARGP_KEY_ARG:
+		if (pos_args++) {
+			fprintf(stderr,
+				"Unrecognized positional argument: %s\n", arg);
+			argp_usage(state);
+		}
+		env.bench_name = strdup(arg);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static void parse_cmdline_args(int argc, char **argv)
+{
+	static const struct argp argp = {
+		.options = opts,
+		.parser = parse_arg,
+		.doc = argp_program_doc,
+	};
+	if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
+		exit(1);
+	if (!env.list && !env.bench_name) {
+		argp_help(&argp, stderr, ARGP_HELP_DOC, "bench");
+		exit(1);
+	}
+}
+
+static void collect_measurements(long delta_ns);
+
+static __u64 last_time_ns;
+static void sigalarm_handler(int signo)
+{
+	long new_time_ns = get_time_ns();
+	long delta_ns = new_time_ns - last_time_ns;
+
+	collect_measurements(delta_ns);
+
+	last_time_ns = new_time_ns;
+}
+
+/* set up periodic 1-second timer */
+static void setup_timer()
+{
+	static struct sigaction sigalarm_action = {
+		.sa_handler = sigalarm_handler,
+	};
+	struct itimerval timer_settings = {};
+	int err;
+
+	last_time_ns = get_time_ns();
+	err = sigaction(SIGALRM, &sigalarm_action, NULL);
+	if (err < 0) {
+		fprintf(stderr, "failed to install SIGALARM handler: %d\n", -errno);
+		exit(1);
+	}
+	timer_settings.it_interval.tv_sec = 1;
+	timer_settings.it_value.tv_sec = 1;
+	err = setitimer(ITIMER_REAL, &timer_settings, NULL);
+	if (err < 0) {
+		fprintf(stderr, "failed to arm interval timer: %d\n", -errno);
+		exit(1);
+	}
+}
+
+static void set_thread_affinity(pthread_t thread, int cpu)
+{
+	cpu_set_t cpuset;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	if (pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset)) {
+		fprintf(stderr, "setting affinity to CPU #%d failed: %d\n",
+			cpu, errno);
+		exit(1);
+	}
+}
+
+static struct bench_state {
+	int res_cnt;
+	struct bench_res *results;
+	pthread_t *consumers;
+	pthread_t *producers;
+} state;
+
+const struct bench *bench = NULL;
+
+extern const struct bench bench_count_global;
+extern const struct bench bench_count_local;
+
+static const struct bench *benchs[] = {
+	&bench_count_global,
+	&bench_count_local,
+};
+
+static void setup_benchmark()
+{
+	int i, err;
+
+	if (!env.bench_name) {
+		fprintf(stderr, "benchmark name is not specified\n");
+		exit(1);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(benchs); i++) {
+		if (strcmp(benchs[i]->name, env.bench_name) == 0) {
+			bench = benchs[i];
+			break;
+		}
+	}
+	if (!bench) {
+		fprintf(stderr, "benchmark '%s' not found\n", env.bench_name);
+		exit(1);
+	}
+
+	printf("Setting up benchmark '%s'...\n", bench->name);
+
+	state.producers = calloc(env.producer_cnt, sizeof(*state.producers));
+	state.consumers = calloc(env.consumer_cnt, sizeof(*state.consumers));
+	state.results = calloc(env.duration_sec + env.warmup_sec + 2,
+			       sizeof(*state.results));
+	if (!state.producers || !state.consumers || !state.results)
+		exit(1);
+
+	if (bench->validate)
+		bench->validate();
+	if (bench->setup)
+		bench->setup();
+
+	for (i = 0; i < env.consumer_cnt; i++) {
+		err = pthread_create(&state.consumers[i], NULL,
+				     bench->consumer_thread, (void *)(long)i);
+		if (err) {
+			fprintf(stderr, "failed to create consumer thread #%d: %d\n",
+				i, -errno);
+			exit(1);
+		}
+		if (env.affinity)
+			set_thread_affinity(state.consumers[i], i);
+	}
+	for (i = 0; i < env.producer_cnt; i++) {
+		err = pthread_create(&state.producers[i], NULL,
+				     bench->producer_thread, (void *)(long)i);
+		if (err) {
+			fprintf(stderr, "failed to create producer thread #%d: %d\n",
+				i, -errno);
+			exit(1);
+		}
+		if (env.affinity)
+			set_thread_affinity(state.producers[i],
+					    env.consumer_cnt + i);
+	}
+
+	printf("Benchmark '%s' started.\n", bench->name);
+}
+
+static pthread_mutex_t bench_done_mtx = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t bench_done = PTHREAD_COND_INITIALIZER;
+
+static void collect_measurements(long delta_ns) {
+	int iter = state.res_cnt++;
+	struct bench_res *res = &state.results[iter];
+
+	bench->measure(res);
+
+	if (bench->report_progress)
+		bench->report_progress(iter, res, delta_ns);
+
+	if (iter == env.duration_sec + env.warmup_sec) {
+		pthread_mutex_lock(&bench_done_mtx);
+		pthread_cond_signal(&bench_done);
+		pthread_mutex_unlock(&bench_done_mtx);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	parse_cmdline_args(argc, argv);
+
+	if (env.list) {
+		int i;
+
+		printf("Available benchmarks:\n");
+		for (i = 0; i < ARRAY_SIZE(benchs); i++) {
+			printf("- %s\n", benchs[i]->name);
+		}
+		return 0;
+	}
+
+	setup_benchmark();
+
+	setup_timer();
+
+	pthread_mutex_lock(&bench_done_mtx);
+	pthread_cond_wait(&bench_done, &bench_done_mtx);
+	pthread_mutex_unlock(&bench_done_mtx);
+
+	if (bench->report_final)
+		/* skip first sample */
+		bench->report_final(state.results + env.warmup_sec,
+				    state.res_cnt - env.warmup_sec);
+
+	return 0;
+}
+
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
new file mode 100644
index 000000000000..08aa0c5b1177
--- /dev/null
+++ b/tools/testing/selftests/bpf/bench.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#pragma once
+#include <stdlib.h>
+#include <stdbool.h>
+#include <linux/err.h>
+#include <errno.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include <math.h>
+#include <time.h>
+#include <sys/syscall.h>
+
+struct env {
+	char *bench_name;
+	int duration_sec;
+	int warmup_sec;
+	bool verbose;
+	bool list;
+	bool back2back;
+	bool affinity;
+	int consumer_cnt;
+	int producer_cnt;
+};
+
+struct bench_res {
+	long hits;
+	long drops;
+};
+
+struct bench {
+	const char *name;
+	void (*validate)();
+	void (*setup)();
+	void *(*producer_thread)(void *ctx);
+	void *(*consumer_thread)(void *ctx);
+	void (*measure)(struct bench_res* res);
+	void (*report_progress)(int iter, struct bench_res* res, long delta_ns);
+	void (*report_final)(struct bench_res res[], int res_cnt);
+};
+
+struct counter {
+	long value;
+} __attribute__((aligned(128)));
+
+extern struct env env;
+extern const struct bench *bench;
+
+void setup_libbpf();
+void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns);
+void hits_drops_report_final(struct bench_res res[], int res_cnt);
+
+static inline __u64 get_time_ns() {
+	struct timespec t;
+
+	clock_gettime(CLOCK_MONOTONIC, &t);
+
+	return (__u64)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+static inline void atomic_inc(long *value)
+{
+	(void)__atomic_add_fetch(value, 1, __ATOMIC_RELAXED);
+}
+
+static inline void atomic_add(long *value, long n)
+{
+	(void)__atomic_add_fetch(value, n, __ATOMIC_RELAXED);
+}
+
+static inline long atomic_swap(long *value, long n)
+{
+	return __atomic_exchange_n(value, n, __ATOMIC_RELAXED);
+}
diff --git a/tools/testing/selftests/bpf/benchs/bench_count.c b/tools/testing/selftests/bpf/benchs/bench_count.c
new file mode 100644
index 000000000000..befba7a82643
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_count.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "bench.h"
+
+/* COUNT-GLOBAL benchmark */
+
+static struct count_global_ctx {
+	struct counter hits;
+} count_global_ctx;
+
+static void *count_global_producer(void *input)
+{
+	struct count_global_ctx *ctx = &count_global_ctx;
+
+	while (true) {
+		atomic_inc(&ctx->hits.value);
+	}
+	return NULL;
+}
+
+static void *count_global_consumer(void *input)
+{
+	return NULL;
+}
+
+static void count_global_measure(struct bench_res *res)
+{
+	struct count_global_ctx *ctx = &count_global_ctx;
+
+	res->hits = atomic_swap(&ctx->hits.value, 0);
+}
+
+/* COUNT-local benchmark */
+
+static struct count_local_ctx {
+	struct counter *hits;
+} count_local_ctx;
+
+static void count_local_setup()
+{
+	struct count_local_ctx *ctx = &count_local_ctx;
+
+	ctx->hits = calloc(env.producer_cnt, sizeof(*ctx->hits));
+	if (!ctx->hits)
+		exit(1);
+}
+
+static void *count_local_producer(void *input)
+{
+	struct count_local_ctx *ctx = &count_local_ctx;
+	int idx = (long)input;
+
+	while (true) {
+		atomic_inc(&ctx->hits[idx].value);
+	}
+	return NULL;
+}
+
+static void *count_local_consumer(void *input)
+{
+	return NULL;
+}
+
+static void count_local_measure(struct bench_res *res)
+{
+	struct count_local_ctx *ctx = &count_local_ctx;
+	int i;
+
+	for (i = 0; i < env.producer_cnt; i++) {
+		res->hits += atomic_swap(&ctx->hits[i].value, 0);
+	}
+}
+
+const struct bench bench_count_global = {
+	.name = "count-global",
+	.producer_thread = count_global_producer,
+	.consumer_thread = count_global_consumer,
+	.measure = count_global_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_count_local = {
+	.name = "count-local",
+	.setup = count_local_setup,
+	.producer_thread = count_local_producer,
+	.consumer_thread = count_local_consumer,
+	.measure = count_local_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
-- 
2.24.1



* [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench
  2020-05-08 23:20 [PATCH v2 bpf-next 0/3] Add benchmark runner and few benchmarks Andrii Nakryiko
  2020-05-08 23:20 ` [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure Andrii Nakryiko
@ 2020-05-08 23:20 ` Andrii Nakryiko
  2020-05-09 17:23   ` Yonghong Song
  2020-05-08 23:20 ` [PATCH v2 bpf-next 3/3] selftest/bpf: add BPF triggering benchmark Andrii Nakryiko
  2 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2020-05-08 23:20 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, John Fastabend

Add an fmod_ret BPF program to the existing test_overhead selftest. Also
re-implement the user-space benchmarking part on top of the benchmark runner
to compare results. Results with ./bench are consistently somewhat lower than
test_overhead's, but the relative performance of the various BPF program types
stays consistent (e.g., kretprobe is noticeably slower).
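
Each benchmark variant is registered as a struct bench with a set of
callbacks; e.g., the fmod_ret variant from bench_rename.c below:

	const struct bench bench_rename_fmodret = {
		.name = "rename-fmodret",
		.validate = validate,
		.setup = setup_fmodret,
		.producer_thread = producer,
		.consumer_thread = consumer,
		.measure = measure,
		.report_progress = hits_drops_report_progress,
		.report_final = hits_drops_report_final,
	};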

The run_bench_rename.sh script (in the benchs/ directory) was used to produce
the following numbers:

  base      :    3.975 ± 0.065M/s
  kprobe    :    3.268 ± 0.095M/s
  kretprobe :    2.496 ± 0.040M/s
  rawtp     :    3.899 ± 0.078M/s
  fentry    :    3.836 ± 0.049M/s
  fexit     :    3.660 ± 0.082M/s
  fmodret   :    3.776 ± 0.033M/s

While running test_overhead gives:

  task_rename base        4457K events per sec
  task_rename kprobe      3849K events per sec
  task_rename kretprobe   2729K events per sec
  task_rename raw_tp      4506K events per sec
  task_rename fentry      4381K events per sec
  task_rename fexit       4349K events per sec
  task_rename fmod_ret    4130K events per sec

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  14 ++
 .../selftests/bpf/benchs/bench_rename.c       | 195 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_rename.sh  |   9 +
 .../selftests/bpf/prog_tests/test_overhead.c  |  14 +-
 .../selftests/bpf/progs/test_overhead.c       |   6 +
 6 files changed, 240 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_rename.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_rename.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 289fffbf975e..29a02abf81a3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -409,10 +409,12 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
 $(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
 	$(call msg,CC,,$@)
 	$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
+$(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
 $(OUTPUT)/bench.o: bench.h
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
-		 $(OUTPUT)/bench_count.o
+		 $(OUTPUT)/bench_count.o \
+		 $(OUTPUT)/bench_rename.o
 	$(call msg,BINARY,,$@)
 	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index dddc97cd4db6..650697df47af 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -254,10 +254,24 @@ const struct bench *bench = NULL;
 
 extern const struct bench bench_count_global;
 extern const struct bench bench_count_local;
+extern const struct bench bench_rename_base;
+extern const struct bench bench_rename_kprobe;
+extern const struct bench bench_rename_kretprobe;
+extern const struct bench bench_rename_rawtp;
+extern const struct bench bench_rename_fentry;
+extern const struct bench bench_rename_fexit;
+extern const struct bench bench_rename_fmodret;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
 	&bench_count_local,
+	&bench_rename_base,
+	&bench_rename_kprobe,
+	&bench_rename_kretprobe,
+	&bench_rename_rawtp,
+	&bench_rename_fentry,
+	&bench_rename_fexit,
+	&bench_rename_fmodret,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_rename.c b/tools/testing/selftests/bpf/benchs/bench_rename.c
new file mode 100644
index 000000000000..e74cff40f4fe
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_rename.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include <fcntl.h>
+#include "bench.h"
+#include "test_overhead.skel.h"
+
+/* BPF triggering benchmarks */
+static struct ctx {
+	struct test_overhead *skel;
+	struct counter hits;
+	int fd;
+} ctx;
+
+static void validate()
+{
+	if (env.producer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-producer!\n");
+		exit(1);
+	}
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static void *producer(void *input)
+{
+	char buf[] = "test_overhead";
+	int err;
+
+	while (true) {
+		err = write(ctx.fd, buf, sizeof(buf));
+		if (err < 0) {
+			fprintf(stderr, "write failed\n");
+			exit(1);
+		}
+		atomic_inc(&ctx.hits.value);
+	}
+}
+
+static void measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.hits.value, 0);
+}
+
+static void setup_ctx()
+{
+	setup_libbpf();
+
+	ctx.skel = test_overhead__open_and_load();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	ctx.fd = open("/proc/self/comm", O_WRONLY|O_TRUNC);
+	if (ctx.fd < 0) {
+		fprintf(stderr, "failed to open /proc/self/comm: %d\n", -errno);
+		exit(1);
+	}
+}
+
+static void attach_bpf(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+
+	link = bpf_program__attach(prog);
+	if (IS_ERR(link)) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void setup_base()
+{
+	setup_ctx();
+}
+
+static void setup_kprobe()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog1);
+}
+
+static void setup_kretprobe()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog2);
+}
+
+static void setup_rawtp()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog3);
+}
+
+static void setup_fentry()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog4);
+}
+
+static void setup_fexit()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog5);
+}
+
+static void setup_fmodret()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog6);
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+const struct bench bench_rename_base = {
+	.name = "rename-base",
+	.validate = validate,
+	.setup = setup_base,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_kprobe = {
+	.name = "rename-kprobe",
+	.validate = validate,
+	.setup = setup_kprobe,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_kretprobe = {
+	.name = "rename-kretprobe",
+	.validate = validate,
+	.setup = setup_kretprobe,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_rawtp = {
+	.name = "rename-rawtp",
+	.validate = validate,
+	.setup = setup_rawtp,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_fentry = {
+	.name = "rename-fentry",
+	.validate = validate,
+	.setup = setup_fentry,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_fexit = {
+	.name = "rename-fexit",
+	.validate = validate,
+	.setup = setup_fexit,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_fmodret = {
+	.name = "rename-fmodret",
+	.validate = validate,
+	.setup = setup_fmodret,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_rename.sh b/tools/testing/selftests/bpf/benchs/run_bench_rename.sh
new file mode 100755
index 000000000000..16f774b1cdbe
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_rename.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+set -eufo pipefail
+
+for i in base kprobe kretprobe rawtp fentry fexit fmodret
+do
+	summary=$(sudo ./bench -w2 -d5 -a rename-$i | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)
+	printf "%-10s: %s\n" $i "$summary"
+done
diff --git a/tools/testing/selftests/bpf/prog_tests/test_overhead.c b/tools/testing/selftests/bpf/prog_tests/test_overhead.c
index 465b371a561d..2702df2b2343 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_overhead.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_overhead.c
@@ -61,9 +61,10 @@ void test_test_overhead(void)
 	const char *raw_tp_name = "raw_tp/task_rename";
 	const char *fentry_name = "fentry/__set_task_comm";
 	const char *fexit_name = "fexit/__set_task_comm";
+	const char *fmodret_name = "fmod_ret/__set_task_comm";
 	const char *kprobe_func = "__set_task_comm";
 	struct bpf_program *kprobe_prog, *kretprobe_prog, *raw_tp_prog;
-	struct bpf_program *fentry_prog, *fexit_prog;
+	struct bpf_program *fentry_prog, *fexit_prog, *fmodret_prog;
 	struct bpf_object *obj;
 	struct bpf_link *link;
 	int err, duration = 0;
@@ -96,6 +97,10 @@ void test_test_overhead(void)
 	if (CHECK(!fexit_prog, "find_probe",
 		  "prog '%s' not found\n", fexit_name))
 		goto cleanup;
+	fmodret_prog = bpf_object__find_program_by_title(obj, fmodret_name);
+	if (CHECK(!fmodret_prog, "find_probe",
+		  "prog '%s' not found\n", fmodret_name))
+		goto cleanup;
 
 	err = bpf_object__load(obj);
 	if (CHECK(err, "obj_load", "err %d\n", err))
@@ -142,6 +147,13 @@ void test_test_overhead(void)
 		goto cleanup;
 	test_run("fexit");
 	bpf_link__destroy(link);
+
+	/* attach fmod_ret */
+	link = bpf_program__attach_trace(fmodret_prog);
+	if (CHECK(IS_ERR(link), "attach fmod_ret", "err %ld\n", PTR_ERR(link)))
+		goto cleanup;
+	test_run("fmod_ret");
+	bpf_link__destroy(link);
 cleanup:
 	prctl(PR_SET_NAME, comm, 0L, 0L, 0L);
 	bpf_object__close(obj);
diff --git a/tools/testing/selftests/bpf/progs/test_overhead.c b/tools/testing/selftests/bpf/progs/test_overhead.c
index 56a50b25cd33..450bf819beac 100644
--- a/tools/testing/selftests/bpf/progs/test_overhead.c
+++ b/tools/testing/selftests/bpf/progs/test_overhead.c
@@ -39,4 +39,10 @@ int BPF_PROG(prog5, struct task_struct *tsk, const char *buf, bool exec)
 	return !tsk;
 }
 
+SEC("fmod_ret/__set_task_comm")
+int BPF_PROG(prog6, struct task_struct *tsk, const char *buf, bool exec)
+{
+	return !tsk;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.24.1



* [PATCH v2 bpf-next 3/3] selftest/bpf: add BPF triggering benchmark
  2020-05-08 23:20 [PATCH v2 bpf-next 0/3] Add benchmark runner and few benchmarks Andrii Nakryiko
  2020-05-08 23:20 ` [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure Andrii Nakryiko
  2020-05-08 23:20 ` [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench Andrii Nakryiko
@ 2020-05-08 23:20 ` Andrii Nakryiko
  2020-05-09 17:43   ` Yonghong Song
  2 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2020-05-08 23:20 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, John Fastabend

It is sometimes desirable to be able to trigger a BPF program from user-space
with minimal overhead. sys_enter would seem to be a good candidate, yet in
many cases there will be a lot of noise from syscalls triggered by other
processes on the system. So while searching for a low-overhead alternative,
I've stumbled upon the getpgid() syscall, which seems to be specific enough to
not suffer from accidental invocation by other apps.

This set of benchmarks compares tp, raw_tp with filtering by syscall ID,
kprobe, fentry, and fmod_ret returning an error (so that the syscall is not
executed), to determine the lowest-overhead way. Here are the results on my
machine (using the benchs/run_bench_trigger.sh script):

  base      :    9.200 ± 0.319M/s
  tp        :    6.690 ± 0.125M/s
  rawtp     :    8.571 ± 0.214M/s
  kprobe    :    6.431 ± 0.048M/s
  fentry    :    8.955 ± 0.241M/s
  fmodret   :    8.903 ± 0.135M/s

So it seems like fmodret doesn't give much benefit for such a lightweight
syscall. Raw tracepoint is pretty decent despite the additional filtering
logic, but it will be called for every syscall in the system, which rules it
out. Fentry, though, seems to add the least amount of overhead and achieves
97.3% of the performance of the baseline no-BPF-attached syscall.
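
For reference, the raw_tp filtering mentioned above looks like this (from
progs/trigger_bench.c below):

	SEC("raw_tp/sys_enter")
	int BPF_PROG(bench_trigger_raw_tp, struct pt_regs *regs, long id)
	{
		if (id == __NR_getpgid)
			__sync_add_and_fetch(&hits, 1);
		return 0;
	}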

Using getpgid() seems preferable to the set_task_comm() approach from
test_overhead, as it's about 2.35x faster in baseline performance.
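
A single variant can also be run directly with the same settings the script
uses, e.g.:

  $ sudo ./bench -w2 -d5 -a trig-fentry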

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  12 ++
 .../selftests/bpf/benchs/bench_trigger.c      | 167 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_trigger.sh |   9 +
 .../selftests/bpf/progs/trigger_bench.c       |  47 +++++
 5 files changed, 238 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_trigger.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_trigger.sh
 create mode 100644 tools/testing/selftests/bpf/progs/trigger_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 29a02abf81a3..45c68111fb48 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -410,11 +410,13 @@ $(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
 	$(call msg,CC,,$@)
 	$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
 $(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
+$(OUTPUT)/bench_trigger.o: $(OUTPUT)/trigger_bench.skel.h
 $(OUTPUT)/bench.o: bench.h
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_count.o \
-		 $(OUTPUT)/bench_rename.o
+		 $(OUTPUT)/bench_rename.o \
+		 $(OUTPUT)/bench_trigger.o
 	$(call msg,BINARY,,$@)
 	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 650697df47af..03ce436fee3e 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -261,6 +261,12 @@ extern const struct bench bench_rename_rawtp;
 extern const struct bench bench_rename_fentry;
 extern const struct bench bench_rename_fexit;
 extern const struct bench bench_rename_fmodret;
+extern const struct bench bench_trig_base;
+extern const struct bench bench_trig_tp;
+extern const struct bench bench_trig_rawtp;
+extern const struct bench bench_trig_kprobe;
+extern const struct bench bench_trig_fentry;
+extern const struct bench bench_trig_fmodret;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -272,6 +278,12 @@ static const struct bench *benchs[] = {
 	&bench_rename_fentry,
 	&bench_rename_fexit,
 	&bench_rename_fmodret,
+	&bench_trig_base,
+	&bench_trig_tp,
+	&bench_trig_rawtp,
+	&bench_trig_kprobe,
+	&bench_trig_fentry,
+	&bench_trig_fmodret,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
new file mode 100644
index 000000000000..49c22832f216
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "bench.h"
+#include "trigger_bench.skel.h"
+
+/* BPF triggering benchmarks */
+static struct trigger_ctx {
+	struct trigger_bench *skel;
+} ctx;
+
+static struct counter base_hits;
+
+static void trigger_validate()
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static void *trigger_base_producer(void *input)
+{
+	while (true) {
+		(void)syscall(__NR_getpgid);
+		atomic_inc(&base_hits.value);
+	}
+	return NULL;
+}
+
+static void trigger_base_measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&base_hits.value, 0);
+}
+
+static void *trigger_producer(void *input)
+{
+	while (true)
+		(void)syscall(__NR_getpgid);
+	return NULL;
+}
+
+static void trigger_measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+}
+
+static void setup_ctx()
+{
+	setup_libbpf();
+
+	ctx.skel = trigger_bench__open_and_load();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+}
+
+static void attach_bpf(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+
+	link = bpf_program__attach(prog);
+	if (IS_ERR(link)) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void trigger_tp_setup()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.bench_trigger_tp);
+}
+
+static void trigger_rawtp_setup()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.bench_trigger_raw_tp);
+}
+
+static void trigger_kprobe_setup()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.bench_trigger_kprobe);
+}
+
+static void trigger_fentry_setup()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.bench_trigger_fentry);
+}
+
+static void trigger_fmodret_setup()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.bench_trigger_fmodret);
+}
+
+static void *trigger_consumer(void *input)
+{
+	return NULL;
+}
+
+const struct bench bench_trig_base = {
+	.name = "trig-base",
+	.validate = trigger_validate,
+	.producer_thread = trigger_base_producer,
+	.consumer_thread = trigger_consumer,
+	.measure = trigger_base_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_trig_tp = {
+	.name = "trig-tp",
+	.validate = trigger_validate,
+	.setup = trigger_tp_setup,
+	.producer_thread = trigger_producer,
+	.consumer_thread = trigger_consumer,
+	.measure = trigger_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_trig_rawtp = {
+	.name = "trig-rawtp",
+	.validate = trigger_validate,
+	.setup = trigger_rawtp_setup,
+	.producer_thread = trigger_producer,
+	.consumer_thread = trigger_consumer,
+	.measure = trigger_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_trig_kprobe = {
+	.name = "trig-kprobe",
+	.validate = trigger_validate,
+	.setup = trigger_kprobe_setup,
+	.producer_thread = trigger_producer,
+	.consumer_thread = trigger_consumer,
+	.measure = trigger_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_trig_fentry = {
+	.name = "trig-fentry",
+	.validate = trigger_validate,
+	.setup = trigger_fentry_setup,
+	.producer_thread = trigger_producer,
+	.consumer_thread = trigger_consumer,
+	.measure = trigger_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_trig_fmodret = {
+	.name = "trig-fmodret",
+	.validate = trigger_validate,
+	.setup = trigger_fmodret_setup,
+	.producer_thread = trigger_producer,
+	.consumer_thread = trigger_consumer,
+	.measure = trigger_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_trigger.sh b/tools/testing/selftests/bpf/benchs/run_bench_trigger.sh
new file mode 100755
index 000000000000..78e83f243294
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_trigger.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+set -eufo pipefail
+
+for i in base tp rawtp kprobe fentry fmodret
+do
+	summary=$(sudo ./bench -w2 -d5 -a trig-$i | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)
+	printf "%-10s: %s\n" $i "$summary"
+done
diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
new file mode 100644
index 000000000000..8b36b6640e7e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Facebook
+
+#include <linux/bpf.h>
+#include <asm/unistd.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+long hits = 0;
+
+SEC("tp/syscalls/sys_enter_getpgid")
+int bench_trigger_tp(void *ctx)
+{
+	__sync_add_and_fetch(&hits, 1);
+	return 0;
+}
+
+SEC("raw_tp/sys_enter")
+int BPF_PROG(bench_trigger_raw_tp, struct pt_regs *regs, long id)
+{
+	if (id == __NR_getpgid)
+		__sync_add_and_fetch(&hits, 1);
+	return 0;
+}
+
+SEC("kprobe/__x64_sys_getpgid")
+int bench_trigger_kprobe(void *ctx)
+{
+	__sync_add_and_fetch(&hits, 1);
+	return 0;
+}
+
+SEC("fentry/__x64_sys_getpgid")
+int bench_trigger_fentry(void *ctx)
+{
+	__sync_add_and_fetch(&hits, 1);
+	return 0;
+}
+
+SEC("fmod_ret/__x64_sys_getpgid")
+int bench_trigger_fmodret(void *ctx)
+{
+	__sync_add_and_fetch(&hits, 1);
+	return -22;
+}
-- 
2.24.1



* Re: [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure
  2020-05-08 23:20 ` [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure Andrii Nakryiko
@ 2020-05-09 17:10   ` Yonghong Song
  2020-05-12  3:29     ` Andrii Nakryiko
  0 siblings, 1 reply; 13+ messages in thread
From: Yonghong Song @ 2020-05-09 17:10 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team



On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> While working on BPF ringbuf implementation, testing, and benchmarking, I've
> developed a pretty generic and modular benchmark runner, which seems to be
> more broadly useful, as I've already used it for one more purpose (testing the
> fastest way to trigger a BPF program, to minimize the overhead of in-kernel
> code).
> 
> This patch adds the generic part of the benchmark runner and sets up the
> Makefile for extending it with more sets of benchmarks.
> 
> The benchmark runner itself operates by spinning up a specified number of
> producer and consumer threads and setting up an interval timer that sends a
> SIGALRM signal to the application once a second. Every second, a snapshot of
> the hits/drops counters is collected and stored in an array. Drops are useful
> for producer/consumer benchmarks in which the producer might overwhelm the
> consumers.
> 
> Once the test finishes after the given number of warm-up and testing seconds,
> the mean and stddev are calculated (ignoring the warm-up results) and printed
> out to stdout. This setup seems to give consistent and accurate results.
> 
> To validate the behavior, I added two atomic counting tests: global and local.
> For the global one, all producer threads atomically increment the same counter
> as fast as possible. This, of course, leads to a huge drop in performance once
> there is more than one producer thread, due to CPUs fighting over the same
> memory location.
> 
> Local counting, on the other hand, maintains one counter per producer thread,
> incremented independently. Once per second, all counters are read and added
> together to form the final "counting throughput" measurement. As expected,
> such a setup demonstrates linear scalability with the number of producers (as
> long as there are enough physical CPU cores, of course). See the example
> output below. This setup can also nicely demonstrate the disastrous effects of
> false sharing, if care is not taken to place the per-producer counters on
> independent cache lines.
> 
> The demo output shows the global counter first with 1 producer, then with 4.
> Both total and per-producer performance drop significantly. The last run is
> the local counter with 4 producers, demonstrating near-perfect scalability.
> 
> $ ./bench -a -w1 -d2 -p1 count-global
> Setting up benchmark 'count-global'...
> Benchmark 'count-global' started.
> Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
> Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
> Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
> Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
> Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s
> 
> $ ./bench -a -w1 -d2 -p4 count-global
> Setting up benchmark 'count-global'...
> Benchmark 'count-global' started.
> Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
> Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
> Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
> Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
> Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s
> 
> $ ./bench -a -w1 -d2 -p4 count-local
> Setting up benchmark 'count-local'...
> Benchmark 'count-local' started.
> Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
> Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
> Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
> Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
> Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>   tools/testing/selftests/bpf/.gitignore        |   1 +
>   tools/testing/selftests/bpf/Makefile          |  13 +-
>   tools/testing/selftests/bpf/bench.c           | 372 ++++++++++++++++++
>   tools/testing/selftests/bpf/bench.h           |  74 ++++
>   .../selftests/bpf/benchs/bench_count.c        |  91 +++++
>   5 files changed, 550 insertions(+), 1 deletion(-)
>   create mode 100644 tools/testing/selftests/bpf/bench.c
>   create mode 100644 tools/testing/selftests/bpf/bench.h
>   create mode 100644 tools/testing/selftests/bpf/benchs/bench_count.c
> 
> diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
> index 3ff031972975..1bb204cee853 100644
> --- a/tools/testing/selftests/bpf/.gitignore
> +++ b/tools/testing/selftests/bpf/.gitignore
> @@ -38,3 +38,4 @@ test_cpp
>   /bpf_gcc
>   /tools
>   /runqslower
> +/bench
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 3d942be23d09..289fffbf975e 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -77,7 +77,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
>   # Compile but not part of 'make run_tests'
>   TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
>   	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
> -	test_lirc_mode2_user xdping test_cpp runqslower
> +	test_lirc_mode2_user xdping test_cpp runqslower bench
>   
>   TEST_CUSTOM_PROGS = urandom_read
>   
> @@ -405,6 +405,17 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
>   	$(call msg,CXX,,$@)
>   	$(CXX) $(CFLAGS) $^ $(LDLIBS) -o $@
>   
> +# Benchmark runner
> +$(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
> +	$(call msg,CC,,$@)
> +	$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
> +$(OUTPUT)/bench.o: bench.h
> +$(OUTPUT)/bench: LDLIBS += -lm
> +$(OUTPUT)/bench: $(OUTPUT)/bench.o \
> +		 $(OUTPUT)/bench_count.o
> +	$(call msg,BINARY,,$@)
> +	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
> +
>   EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR)			\
>   	prog_tests/tests.h map_tests/tests.h verifier/tests.h		\
>   	feature								\
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> new file mode 100644
> index 000000000000..dddc97cd4db6
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -0,0 +1,372 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Facebook */
> +#define _GNU_SOURCE
> +#include <argp.h>
> +#include <linux/compiler.h>
> +#include <sys/time.h>
> +#include <sched.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <sys/sysinfo.h>
> +#include <sys/resource.h>
> +#include <signal.h>
> +#include "bench.h"
> +
> +struct env env = {
> +	.warmup_sec = 1,
> +	.duration_sec = 5,
> +	.affinity = false,
> +	.consumer_cnt = 1,
> +	.producer_cnt = 1,
> +};
> +
> +static int libbpf_print_fn(enum libbpf_print_level level,
> +		    const char *format, va_list args)
> +{
> +	if (level == LIBBPF_DEBUG && !env.verbose)
> +		return 0;
> +	return vfprintf(stderr, format, args);
> +}
> +
> +static int bump_memlock_rlimit(void)
> +{
> +	struct rlimit rlim_new = {
> +		.rlim_cur	= RLIM_INFINITY,
> +		.rlim_max	= RLIM_INFINITY,
> +	};
> +
> +	return setrlimit(RLIMIT_MEMLOCK, &rlim_new);
> +}
> +
> +void setup_libbpf()
> +{
> +	int err;
> +
> +	libbpf_set_print(libbpf_print_fn);
> +
> +	err = bump_memlock_rlimit();
> +	if (err)
> +		fprintf(stderr, "failed to increase RLIMIT_MEMLOCK: %d", err);
> +}
> +
> +void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns)
> +{
> +	double hits_per_sec, drops_per_sec;
> +	double hits_per_prod;
> +
> +	hits_per_sec = res->hits / 1000000.0 / (delta_ns / 1000000000.0);
> +	hits_per_prod = hits_per_sec / env.producer_cnt;
> +	drops_per_sec = res->drops / 1000000.0 / (delta_ns / 1000000000.0);
> +
> +	printf("Iter %3d (%7.3lfus): ",
> +	       iter, (delta_ns - 1000000000) / 1000.0);
> +
> +	printf("hits %8.3lfM/s (%7.3lfM/prod), drops %8.3lfM/s\n",
> +	       hits_per_sec, hits_per_prod, drops_per_sec);
> +}
> +
> +void hits_drops_report_final(struct bench_res res[], int res_cnt)
> +{
> +	int i;
> +	double hits_mean = 0.0, drops_mean = 0.0;
> +	double hits_stddev = 0.0, drops_stddev = 0.0;
> +
> +	for (i = 0; i < res_cnt; i++) {
> +		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
> +		drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
> +	}
> +
> +	if (res_cnt > 1)  {
> +		for (i = 0; i < res_cnt; i++) {
> +			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
> +				       (hits_mean - res[i].hits / 1000000.0) /
> +				       (res_cnt - 1.0);
> +			drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
> +					(drops_mean - res[i].drops / 1000000.0) /
> +					(res_cnt - 1.0);
> +		}
> +		hits_stddev = sqrt(hits_stddev);
> +		drops_stddev = sqrt(drops_stddev);
> +	}
> +	printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
> +	       hits_mean, hits_stddev, hits_mean / env.producer_cnt);
> +	printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
> +	       drops_mean, drops_stddev);

The unicode char \u00B1 (for +|-) may cause warnings with some old compilers
(e.g., gcc 4.8.5), as it needs C99 mode.

/data/users/yhs/work/net-next/tools/testing/selftests/bpf/bench.c:91:9: 
warning: universal character names are only valid in C++ and C99 
[enabled by default]
   printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",

"+|-" is alternative, but not as good as \u00B1 indeed. Newer
compiler may already have the default C99. Maybe we can just add
C99 for build `bench`?
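
An untested sketch, reusing the per-target override style already used for
LDLIBS in the Makefile (gnu99 rather than plain c99, since bench.c relies on
_GNU_SOURCE extensions):

  # untested: forces C99+ only for bench; gnu99 keeps GNU extensions working
  $(OUTPUT)/bench: CFLAGS += -std=gnu99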

> +}
> +
> +const char *argp_program_version = "benchmark";
> +const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
> +const char argp_program_doc[] =
> +"benchmark    Generic benchmarking framework.\n"
> +"\n"
> +"This tool runs benchmarks.\n"
> +"\n"
> +"USAGE: benchmark <bench-name>\n"
> +"\n"
> +"EXAMPLES:\n"
> +"    # run 'count-local' benchmark with 1 producer and 1 consumer\n"
> +"    benchmark count-local\n"
> +"    # run 'count-local' with 16 producer and 8 consumer thread, pinned to CPUs\n"
> +"    benchmark -p16 -c8 -a count-local\n";

Some of the above global variables could probably be static.
But I do not have a strong preference on this.

> +
> +static const struct argp_option opts[] = {
> +	{ "list", 'l', NULL, 0, "List available benchmarks"},
> +	{ "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
> +	{ "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
> +	{ "producers", 'p', "NUM", 0, "Number of producer threads"},
> +	{ "consumers", 'c', "NUM", 0, "Number of consumer threads"},
> +	{ "verbose", 'v', NULL, 0, "Verbose debug output"},
> +	{ "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> +	{ "b2b", 'b', NULL, 0, "Back-to-back mode"},
> +	{ "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},

I did not see the b2b and rb-output options being processed in this file.

> +	{},
> +};
> +
> +static error_t parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +	static int pos_args;
> +
> +	switch (key) {
> +	case 'v':
> +		env.verbose = true;
> +		break;
> +	case 'l':
> +		env.list = true;
> +		break;
> +	case 'd':
> +		env.duration_sec = strtol(arg, NULL, 10);
> +		if (env.duration_sec <= 0) {
> +			fprintf(stderr, "Invalid duration: %s\n", arg);
> +			argp_usage(state);
> +		}
> +		break;
> +	case 'w':
> +		env.warmup_sec = strtol(arg, NULL, 10);
> +		if (env.warmup_sec <= 0) {
> +			fprintf(stderr, "Invalid warm-up duration: %s\n", arg);
> +			argp_usage(state);
> +		}
> +		break;
> +	case 'p':
> +		env.producer_cnt = strtol(arg, NULL, 10);
> +		if (env.producer_cnt <= 0) {
> +			fprintf(stderr, "Invalid producer count: %s\n", arg);
> +			argp_usage(state);
> +		}
> +		break;
> +	case 'c':
> +		env.consumer_cnt = strtol(arg, NULL, 10);
> +		if (env.consumer_cnt <= 0) {
> +			fprintf(stderr, "Invalid consumer count: %s\n", arg);
> +			argp_usage(state);
> +		}
> +		break;
> +	case 'a':
> +		env.affinity = true;
> +		break;
> +	case ARGP_KEY_ARG:
> +		if (pos_args++) {
> +			fprintf(stderr,
> +				"Unrecognized positional argument: %s\n", arg);
> +			argp_usage(state);
> +		}
> +		env.bench_name = strdup(arg);
> +		break;
> +	default:
> +		return ARGP_ERR_UNKNOWN;
> +	}
> +	return 0;
> +}
> +
> +static void parse_cmdline_args(int argc, char **argv)
> +{
> +	static const struct argp argp = {
> +		.options = opts,
> +		.parser = parse_arg,
> +		.doc = argp_program_doc,
> +	};
> +	if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
> +		exit(1);
> +	if (!env.list && !env.bench_name) {
> +		argp_help(&argp, stderr, ARGP_HELP_DOC, "bench");
> +		exit(1);
> +	}
> +}
> +
> +static void collect_measurements(long delta_ns);
> +
> +static __u64 last_time_ns;
> +static void sigalarm_handler(int signo)
> +{
> +	long new_time_ns = get_time_ns();
> +	long delta_ns = new_time_ns - last_time_ns;
> +
> +	collect_measurements(delta_ns);
> +
> +	last_time_ns = new_time_ns;
> +}
> +
> +/* set up periodic 1-second timer */
> +static void setup_timer()
> +{
> +	static struct sigaction sigalarm_action = {
> +		.sa_handler = sigalarm_handler,
> +	};
> +	struct itimerval timer_settings = {};
> +	int err;
> +
> +	last_time_ns = get_time_ns();
> +	err = sigaction(SIGALRM, &sigalarm_action, NULL);
> +	if (err < 0) {
> +		fprintf(stderr, "failed to install SIGALARM handler: %d\n", -errno);
> +		exit(1);
> +	}
> +	timer_settings.it_interval.tv_sec = 1;
> +	timer_settings.it_value.tv_sec = 1;
> +	err = setitimer(ITIMER_REAL, &timer_settings, NULL);
> +	if (err < 0) {
> +		fprintf(stderr, "failed to arm interval timer: %d\n", -errno);
> +		exit(1);
> +	}
> +}
> +
> +static void set_thread_affinity(pthread_t thread, int cpu)
> +{
> +	cpu_set_t cpuset;
> +
> +	CPU_ZERO(&cpuset);
> +	CPU_SET(cpu, &cpuset);
> +	if (pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset)) {
> +		fprintf(stderr, "setting affinity to CPU #%d failed: %d\n",
> +			cpu, errno);
> +		exit(1);
> +	}
> +}
> +
> +static struct bench_state {
> +	int res_cnt;
> +	struct bench_res *results;
> +	pthread_t *consumers;
> +	pthread_t *producers;
> +} state;
> +
> +const struct bench *bench = NULL;
> +
> +extern const struct bench bench_count_global;
> +extern const struct bench bench_count_local;
> +
> +static const struct bench *benchs[] = {
> +	&bench_count_global,
> +	&bench_count_local,
> +};
> +
> +static void setup_benchmark()
> +{
> +	int i, err;
> +
> +	if (!env.bench_name) {
> +		fprintf(stderr, "benchmark name is not specified\n");
> +		exit(1);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(benchs); i++) {
> +		if (strcmp(benchs[i]->name, env.bench_name) == 0) {
> +			bench = benchs[i];
> +			break;
> +		}
> +	}
> +	if (!bench) {
> +		fprintf(stderr, "benchmark '%s' not found\n", env.bench_name);
> +		exit(1);
> +	}
> +
> +	printf("Setting up benchmark '%s'...\n", bench->name);
> +
> +	state.producers = calloc(env.producer_cnt, sizeof(*state.producers));
> +	state.consumers = calloc(env.consumer_cnt, sizeof(*state.consumers));
> +	state.results = calloc(env.duration_sec + env.warmup_sec + 2,
> +			       sizeof(*state.results));
> +	if (!state.producers || !state.consumers || !state.results)
> +		exit(1);
> +
> +	if (bench->validate)
> +		bench->validate();
> +	if (bench->setup)
> +		bench->setup();
> +
> +	for (i = 0; i < env.consumer_cnt; i++) {
> +		err = pthread_create(&state.consumers[i], NULL,
> +				     bench->consumer_thread, (void *)(long)i);
> +		if (err) {
> +			fprintf(stderr, "failed to create consumer thread #%d: %d\n",
> +				i, -errno);
> +			exit(1);
> +		}
> +		if (env.affinity)
> +			set_thread_affinity(state.consumers[i], i);
> +	}
> +	for (i = 0; i < env.producer_cnt; i++) {
> +		err = pthread_create(&state.producers[i], NULL,
> +				     bench->producer_thread, (void *)(long)i);
> +		if (err) {
> +			fprintf(stderr, "failed to create producer thread #%d: %d\n",
> +				i, -errno);
> +			exit(1);
> +		}
> +		if (env.affinity)
> +			set_thread_affinity(state.producers[i],
> +					    env.consumer_cnt + i);

Here, we bind consumer threads first and then producer threads; I think
this is probably just an arbitrary choice?

In certain cases, I think people may want more advanced binding
scenarios, e.g., for hyperthreading, binding consumer and producer on
the same core or on different cores, etc. One possibility is to
introduce a -c option similar to taskset's. If -c is not supplied, you
keep the current default; otherwise, the -c list is used.
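
For example, on a box where the SMT sibling of CPU i is CPU i + N
(N being the number of physical cores; an assumption, topology varies),
the "producer and consumer on the same core" case would look roughly like:

  /* hypothetical pinning scheme, not what the patch does:
   * producer i and consumer i land on sibling hyperthreads
   */
  set_thread_affinity(state.producers[i], i);
  set_thread_affinity(state.consumers[i], i + N);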

> +	}
> +
> +	printf("Benchmark '%s' started.\n", bench->name);
> +}
> +
> +static pthread_mutex_t bench_done_mtx = PTHREAD_MUTEX_INITIALIZER;
> +static pthread_cond_t bench_done = PTHREAD_COND_INITIALIZER;
> +
> +static void collect_measurements(long delta_ns) {
> +	int iter = state.res_cnt++;
> +	struct bench_res *res = &state.results[iter];
> +
> +	bench->measure(res);
> +
> +	if (bench->report_progress)
> +		bench->report_progress(iter, res, delta_ns);
> +
> +	if (iter == env.duration_sec + env.warmup_sec) {
> +		pthread_mutex_lock(&bench_done_mtx);
> +		pthread_cond_signal(&bench_done);
> +		pthread_mutex_unlock(&bench_done_mtx);
> +	}
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	parse_cmdline_args(argc, argv);
> +
> +	if (env.list) {
> +		int i;
> +
> +		printf("Available benchmarks:\n");
> +		for (i = 0; i < ARRAY_SIZE(benchs); i++) {
> +			printf("- %s\n", benchs[i]->name);
> +		}
> +		return 0;
> +	}
> +
> +	setup_benchmark();
> +
> +	setup_timer();
> +
> +	pthread_mutex_lock(&bench_done_mtx);
> +	pthread_cond_wait(&bench_done, &bench_done_mtx);
> +	pthread_mutex_unlock(&bench_done_mtx);
> +
> +	if (bench->report_final)
> +		/* skip first sample */
> +		bench->report_final(state.results + env.warmup_sec,
> +				    state.res_cnt - env.warmup_sec);
> +
> +	return 0;
> +}
> +
> diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
> new file mode 100644
> index 000000000000..08aa0c5b1177
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/bench.h
> @@ -0,0 +1,74 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#pragma once
> +#include <stdlib.h>
> +#include <stdbool.h>
> +#include <linux/err.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +#include <math.h>
> +#include <time.h>
> +#include <sys/syscall.h>
> +
> +struct env {
> +	char *bench_name;
> +	int duration_sec;
> +	int warmup_sec;
> +	bool verbose;
> +	bool list;
> +	bool back2back;

This field seems unused.

> +	bool affinity;
> +	int consumer_cnt;
> +	int producer_cnt;
> +};
> +
> +struct bench_res {
> +	long hits;
> +	long drops;
> +};
> +
> +struct bench {
> +	const char *name;
> +	void (*validate)();
> +	void (*setup)();
> +	void *(*producer_thread)(void *ctx);
> +	void *(*consumer_thread)(void *ctx);
> +	void (*measure)(struct bench_res* res);
> +	void (*report_progress)(int iter, struct bench_res* res, long delta_ns);
> +	void (*report_final)(struct bench_res res[], int res_cnt);
> +};
> +
> +struct counter {
> +	long value;
> +} __attribute__((aligned(128)));
> +
> +extern struct env env;
> +extern const struct bench *bench;
> +
> +void setup_libbpf();
> +void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns);
> +void hits_drops_report_final(struct bench_res res[], int res_cnt);
> +
> +static inline __u64 get_time_ns() {
> +	struct timespec t;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &t);
> +
> +	return (u64)t.tv_sec * 1000000000 + t.tv_nsec;
> +}
> +
> +static inline void atomic_inc(long *value)
> +{
> +	(void)__atomic_add_fetch(value, 1, __ATOMIC_RELAXED);
> +}
> +
> +static inline void atomic_add(long *value, long n)
> +{
> +	(void)__atomic_add_fetch(value, n, __ATOMIC_RELAXED);
> +}
> +
> +static inline long atomic_swap(long *value, long n)
> +{
> +	return __atomic_exchange_n(value, n, __ATOMIC_RELAXED);
> +}
> diff --git a/tools/testing/selftests/bpf/benchs/bench_count.c b/tools/testing/selftests/bpf/benchs/bench_count.c
> new file mode 100644
> index 000000000000..befba7a82643
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/bench_count.c
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2020 Facebook */
> +#include "bench.h"
> +
> +/* COUNT-GLOBAL benchmark */
> +
> +static struct count_global_ctx {
> +	struct counter hits;
> +} count_global_ctx;
> +
> +static void *count_global_producer(void *input)
> +{
> +	struct count_global_ctx *ctx = &count_global_ctx;
> +
> +	while (true) {
> +		atomic_inc(&ctx->hits.value);
> +	}
> +	return NULL;
> +}
> +
> +static void *count_global_consumer(void *input)
> +{
> +	return NULL;
> +}
> +
> +static void count_global_measure(struct bench_res *res)
> +{
> +	struct count_global_ctx *ctx = &count_global_ctx;
> +
> +	res->hits = atomic_swap(&ctx->hits.value, 0);
> +}
> +
> +/* COUNT-local benchmark */
> +
> +static struct count_local_ctx {
> +	struct counter *hits;
> +} count_local_ctx;
> +
> +static void count_local_setup()
> +{
> +	struct count_local_ctx *ctx = &count_local_ctx;
> +
> +	ctx->hits = calloc(env.consumer_cnt, sizeof(*ctx->hits));
> +	if (!ctx->hits)
> +		exit(1);
> +}
> +
> +static void *count_local_producer(void *input)
> +{
> +	struct count_local_ctx *ctx = &count_local_ctx;
> +	int idx = (long)input;
> +
> +	while (true) {
> +		atomic_inc(&ctx->hits[idx].value);
> +	}
> +	return NULL;
> +}
> +
> +static void *count_local_consumer(void *input)
> +{
> +	return NULL;
> +}
> +
> +static void count_local_measure(struct bench_res *res)
> +{
> +	struct count_local_ctx *ctx = &count_local_ctx;
> +	int i;
> +
> +	for (i = 0; i < env.producer_cnt; i++) {
> +		res->hits += atomic_swap(&ctx->hits[i].value, 0);
> +	}
> +}
> +
> +const struct bench bench_count_global = {
> +	.name = "count-global",
> +	.producer_thread = count_global_producer,
> +	.consumer_thread = count_global_consumer,
> +	.measure = count_global_measure,
> +	.report_progress = hits_drops_report_progress,
> +	.report_final = hits_drops_report_final,
> +};
> +
> +const struct bench bench_count_local = {
> +	.name = "count-local",
> +	.setup = count_local_setup,
> +	.producer_thread = count_local_producer,
> +	.consumer_thread = count_local_consumer,
> +	.measure = count_local_measure,
> +	.report_progress = hits_drops_report_progress,
> +	.report_final = hits_drops_report_final,
> +};
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench
  2020-05-08 23:20 ` [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench Andrii Nakryiko
@ 2020-05-09 17:23   ` Yonghong Song
  2020-05-12  4:22     ` Andrii Nakryiko
  0 siblings, 1 reply; 13+ messages in thread
From: Yonghong Song @ 2020-05-09 17:23 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, John Fastabend



On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> Add a fmod_ret BPF program to the existing test_overhead selftest. Also
> re-implement the user-space benchmarking part within the benchmark runner
> to compare results. Results with ./bench are consistently somewhat lower
> than test_overhead's, but the relative performance of various types of BPF
> programs stays consistent (e.g., kretprobe is noticeably slower).
> 
> run_bench_rename.sh script (in benchs/ directory) was used to produce the
> following numbers:
> 
>    base      :    3.975 ± 0.065M/s
>    kprobe    :    3.268 ± 0.095M/s
>    kretprobe :    2.496 ± 0.040M/s
>    rawtp     :    3.899 ± 0.078M/s
>    fentry    :    3.836 ± 0.049M/s
>    fexit     :    3.660 ± 0.082M/s
>    fmodret   :    3.776 ± 0.033M/s
> 
> While running test_overhead gives:
> 
>    task_rename base        4457K events per sec
>    task_rename kprobe      3849K events per sec
>    task_rename kretprobe   2729K events per sec
>    task_rename raw_tp      4506K events per sec
>    task_rename fentry      4381K events per sec
>    task_rename fexit       4349K events per sec
>    task_rename fmod_ret    4130K events per sec

Do you know where the overhead is and how we could provide options in
bench to reduce the overhead so we can achieve similar numbers?
For benchmarking, sometimes you really want to see the "true"
potential of a particular implementation.

> 
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>   tools/testing/selftests/bpf/Makefile          |   4 +-
>   tools/testing/selftests/bpf/bench.c           |  14 ++
>   .../selftests/bpf/benchs/bench_rename.c       | 195 ++++++++++++++++++
>   .../selftests/bpf/benchs/run_bench_rename.sh  |   9 +
>   .../selftests/bpf/prog_tests/test_overhead.c  |  14 +-
>   .../selftests/bpf/progs/test_overhead.c       |   6 +
>   6 files changed, 240 insertions(+), 2 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/benchs/bench_rename.c
>   create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_rename.sh
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 289fffbf975e..29a02abf81a3 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -409,10 +409,12 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
>   $(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
>   	$(call msg,CC,,$@)
>   	$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
> +$(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
>   $(OUTPUT)/bench.o: bench.h
>   $(OUTPUT)/bench: LDLIBS += -lm
>   $(OUTPUT)/bench: $(OUTPUT)/bench.o \
> -		 $(OUTPUT)/bench_count.o
> +		 $(OUTPUT)/bench_count.o \
> +		 $(OUTPUT)/bench_rename.o
>   	$(call msg,BINARY,,$@)
>   	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
>   
[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 bpf-next 3/3] selftest/bpf: add BPF triggering benchmark
  2020-05-08 23:20 ` [PATCH v2 bpf-next 3/3] selftest/bpf: add BPF triggering benchmark Andrii Nakryiko
@ 2020-05-09 17:43   ` Yonghong Song
  0 siblings, 0 replies; 13+ messages in thread
From: Yonghong Song @ 2020-05-09 17:43 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, John Fastabend



On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> It is sometimes desirable to be able to trigger a BPF program from user-space
> with minimal overhead. sys_enter would seem to be a good candidate, yet in

Probably "with minimal external noise"? Typically, overhead means the 
overhead from test infrastructure itself?

> a lot of cases there will be a lot of noise from syscalls triggered by other
> processes on the system. So while searching for a low-overhead alternative, I've
> stumbled upon the getpgid() syscall, which seems to be specific enough not to
> suffer from accidental invocations by other apps.
> 
> This set of benchmarks compares tp, raw_tp w/ filtering by syscall ID, kprobe,
> fentry, and fmod_ret returning an error (so that the syscall would not be
> executed), to determine the lowest-overhead way. Here are the results on my
> machine (using benchs/run_bench_trigger.sh script):
> 
>    base      :    9.200 ± 0.319M/s
>    tp        :    6.690 ± 0.125M/s
>    rawtp     :    8.571 ± 0.214M/s
>    kprobe    :    6.431 ± 0.048M/s
>    fentry    :    8.955 ± 0.241M/s
>    fmodret   :    8.903 ± 0.135M/s

The relative ranking of different approaches is still similar to patch 
#2. But this patch reinforces that benchmarking really needs to reduce 
the noise to get the highest numbers.

> 
> So it seems like fmodret doesn't give much benefit for such a lightweight
> syscall. Raw tracepoint is pretty decent despite the additional filtering logic,
> but it will be called for any other syscall in the system, which rules it out.
> Fentry, though, seems to add the least amount of overhead and achieves
> 97.3% of the performance of a baseline no-BPF-attached syscall.
> 
> Using getpgid() seems preferable to the set_task_comm() approach from
> test_overhead, as it's about 2.35x faster in baseline performance.
> 
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   tools/testing/selftests/bpf/Makefile          |   4 +-
>   tools/testing/selftests/bpf/bench.c           |  12 ++
>   .../selftests/bpf/benchs/bench_trigger.c      | 167 ++++++++++++++++++
>   .../selftests/bpf/benchs/run_bench_trigger.sh |   9 +
>   .../selftests/bpf/progs/trigger_bench.c       |  47 +++++
>   5 files changed, 238 insertions(+), 1 deletion(-)
>   create mode 100644 tools/testing/selftests/bpf/benchs/bench_trigger.c
>   create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_trigger.sh
>   create mode 100644 tools/testing/selftests/bpf/progs/trigger_bench.c
> 
[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure
  2020-05-09 17:10   ` Yonghong Song
@ 2020-05-12  3:29     ` Andrii Nakryiko
  2020-05-12 14:39       ` Yonghong Song
  0 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2020-05-12  3:29 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Sat, May 9, 2020 at 10:10 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> > While working on BPF ringbuf implementation, testing, and benchmarking, I've
> > developed a pretty generic and modular benchmark runner, which seems to be
> > generically useful, as I've already used it for one more purpose (testing the
> > fastest way to trigger a BPF program, to minimize the overhead of in-kernel code).
> >
> > This patch adds the generic part of the benchmark runner and sets up the
> > Makefile for extending it with more sets of benchmarks.
> >
> > The benchmarker itself operates by spinning up a specified number of producer
> > and consumer threads and setting up an interval timer that sends a SIGALRM
> > signal to the application once a second. Every second, a snapshot of the
> > hits/drops counters is collected and stored in an array. Drops are useful for
> > producer/consumer benchmarks in which producers might overwhelm consumers.
> >
> > Once the test finishes after the given number of warm-up and testing seconds,
> > mean and stddev are calculated (ignoring warm-up results) and printed to stdout.
> > This setup seems to give consistent and accurate results.
> >
> > To validate behavior, I added two atomic counting tests: global and local.
> > For the global one, all the producer threads are atomically incrementing the
> > same counter as fast as possible. This, of course, leads to a huge drop in
> > performance once there is more than one producer thread, due to CPUs fighting
> > for the same memory location.
> >
> > Local counting, on the other hand, maintains one counter per producer thread,
> > each incremented independently. Once per second, all counters are read and
> > added together to form the final "counting throughput" measurement. As
> > expected, such a setup demonstrates linear scalability with the number of
> > producers (as long as there are enough physical CPU cores, of course). See
> > example output below. Also, this setup can nicely demonstrate the disastrous
> > effects of false sharing, if care is not taken to keep those per-producer
> > counters apart in independent cache lines.
> >
> > The demo output shows the global counter first with 1 producer, then with 4.
> > Both total and per-producer performance drop significantly. The last run is
> > the local counter with 4 producers, demonstrating near-perfect scalability.
> >
> > $ ./bench -a -w1 -d2 -p1 count-global
> > Setting up benchmark 'count-global'...
> > Benchmark 'count-global' started.
> > Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
> > Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
> > Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
> > Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
> > Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s
> >
> > $ ./bench -a -w1 -d2 -p4 count-global
> > Setting up benchmark 'count-global'...
> > Benchmark 'count-global' started.
> > Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
> > Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
> > Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
> > Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
> > Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s
> >
> > $ ./bench -a -w1 -d2 -p4 count-local
> > Setting up benchmark 'count-local'...
> > Benchmark 'count-local' started.
> > Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
> > Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
> > Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
> > Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
> > Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >   tools/testing/selftests/bpf/.gitignore        |   1 +
> >   tools/testing/selftests/bpf/Makefile          |  13 +-
> >   tools/testing/selftests/bpf/bench.c           | 372 ++++++++++++++++++
> >   tools/testing/selftests/bpf/bench.h           |  74 ++++
> >   .../selftests/bpf/benchs/bench_count.c        |  91 +++++
> >   5 files changed, 550 insertions(+), 1 deletion(-)
> >   create mode 100644 tools/testing/selftests/bpf/bench.c
> >   create mode 100644 tools/testing/selftests/bpf/bench.h
> >   create mode 100644 tools/testing/selftests/bpf/benchs/bench_count.c
> >

[...]

trimming is good :)

> > +
> > +void hits_drops_report_final(struct bench_res res[], int res_cnt)
> > +{
> > +     int i;
> > +     double hits_mean = 0.0, drops_mean = 0.0;
> > +     double hits_stddev = 0.0, drops_stddev = 0.0;
> > +
> > +     for (i = 0; i < res_cnt; i++) {
> > +             hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
> > +             drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
> > +     }
> > +
> > +     if (res_cnt > 1)  {
> > +             for (i = 0; i < res_cnt; i++) {
> > +                     hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
> > +                                    (hits_mean - res[i].hits / 1000000.0) /
> > +                                    (res_cnt - 1.0);
> > +                     drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
> > +                                     (drops_mean - res[i].drops / 1000000.0) /
> > +                                     (res_cnt - 1.0);
> > +             }
> > +             hits_stddev = sqrt(hits_stddev);
> > +             drops_stddev = sqrt(drops_stddev);
> > +     }
> > +     printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
> > +            hits_mean, hits_stddev, hits_mean / env.producer_cnt);
> > +     printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
> > +            drops_mean, drops_stddev);
>
> The unicode char \u00B1 (for +|-) may cause warnings with some old
> compilers (e.g., 4.8.5), as it requires C99 mode.
>
> /data/users/yhs/work/net-next/tools/testing/selftests/bpf/bench.c:91:9:
> warning: universal character names are only valid in C++ and C99
> [enabled by default]
>    printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>
> "+|-" is alternative, but not as good as \u00B1 indeed. Newer
> compiler may already have the default C99. Maybe we can just add
> C99 for build `bench`?

I think I'm fine with an ancient compiler emitting a harmless warning
for code under selftests/bpf, honestly...

>
> > +}
> > +
> > +const char *argp_program_version = "benchmark";
> > +const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
> > +const char argp_program_doc[] =
> > +"benchmark    Generic benchmarking framework.\n"
> > +"\n"
> > +"This tool runs benchmarks.\n"
> > +"\n"
> > +"USAGE: benchmark <bench-name>\n"
> > +"\n"
> > +"EXAMPLES:\n"
> > +"    # run 'count-local' benchmark with 1 producer and 1 consumer\n"
> > +"    benchmark count-local\n"
> > +"    # run 'count-local' with 16 producer and 8 consumer thread, pinned to CPUs\n"
> > +"    benchmark -p16 -c8 -a count-local\n";
>
> Some of the above global variables should probably be static.
> But I do not have a strong preference on this.

Oh, these are actually global variables the argp library expects; they have to be global.

>
> > +
> > +static const struct argp_option opts[] = {
> > +     { "list", 'l', NULL, 0, "List available benchmarks"},
> > +     { "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
> > +     { "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
> > +     { "producers", 'p', "NUM", 0, "Number of producer threads"},
> > +     { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
> > +     { "verbose", 'v', NULL, 0, "Verbose debug output"},
> > +     { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> > +     { "b2b", 'b', NULL, 0, "Back-to-back mode"},
> > +     { "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
>
> I did not see the b2b and rb-output options being processed in this file.

Slipped through the rebasing cracks from the future ringbuf
benchmarks; will remove it. I also figured out a way to do this in a
more modular fashion anyway (child parsers in argp).
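
Roughly like this, for the curious (a sketch of argp's child-parser
mechanism; the ringbuf option and names here are made up for illustration):

/* each benchmark family can bring its own options */
static const struct argp_option rb_opts[] = {
	{ "rb-output", 10001, NULL, 0, "Print ringbuf-specific stats"},
	{},
};

static error_t parse_rb_arg(int key, char *arg, struct argp_state *state)
{
	/* handle ringbuf-specific keys, defer everything else */
	return key == 10001 ? 0 : ARGP_ERR_UNKNOWN;
}

static const struct argp rb_argp = {
	.options = rb_opts,
	.parser = parse_rb_arg,
};

/* hooked into the main parser through .children */
static const struct argp_child argp_children[] = {
	{ .argp = &rb_argp, .header = "Ringbuf benchmark options:" },
	{},
};

static const struct argp argp = {
	.options = opts,
	.parser = parse_arg,
	.doc = argp_program_doc,
	.children = argp_children,
};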

>
> > +     {},
> > +};
> > +

[...]

> > +     for (i = 0; i < env.consumer_cnt; i++) {
> > +             err = pthread_create(&state.consumers[i], NULL,
> > +                                  bench->consumer_thread, (void *)(long)i);
> > +             if (err) {
> > +                     fprintf(stderr, "failed to create consumer thread #%d: %d\n",
> > +                             i, -errno);
> > +                     exit(1);
> > +             }
> > +             if (env.affinity)
> > +                     set_thread_affinity(state.consumers[i], i);
> > +     }
> > +     for (i = 0; i < env.producer_cnt; i++) {
> > +             err = pthread_create(&state.producers[i], NULL,
> > +                                  bench->producer_thread, (void *)(long)i);
> > +             if (err) {
> > +                     fprintf(stderr, "failed to create producer thread #%d: %d\n",
> > +                             i, -errno);
> > +                     exit(1);
> > +             }
> > +             if (env.affinity)
> > +                     set_thread_affinity(state.producers[i],
> > +                                         env.consumer_cnt + i);
>
> Here, we bind consumer threads first and then producer threads; I think
> this is probably just an arbitrary choice?

yep, almost arbitrary. Most of my cases have 1 consumer and >=1
producers, so it was convenient to have the consumer pinned to the same CPU,
regardless of how many producers I have.

>
> In certain cases, I think people may want more advanced binding
> scenarios, e.g., for hyperthreading, binding consumer and producer on
> the same core or on different cores, etc. One possibility is to
> introduce a -c option similar to taskset's. If -c is not supplied, you
> keep the current default; otherwise, the -c list is used.
>

well, taskset's job is simpler: it takes a list of CPUs for a single
PID, if I understand correctly. Here we have many threads, each of which
might have a different CPU or even CPUs. But I agree that for some
benchmarks it's going to be critical to control this precisely. Here's how
I'm going to allow the most flexibility without too much complexity.

--prod-affinity 1,2,10-16,100 -- will specify a set of CPUs for
producers. The first producer will use the CPU with the lowest index
from that list, the second will take the second, and so on. If fewer CPUs
are provided than necessary, it's an error; if more, it's fine.

Then for consumers I will add an independent --cons-affinity parameter,
which will do the same for consumer threads.

Having two independent lists will allow testing scenarios where we
want producers and consumers to fight for the same CPU.
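
A rough sketch of the list parsing this would need (parse_cpu_list is a
made-up helper, not in the patches; needs <stdio.h>, <string.h>, <errno.h>):

/* parse "1,2,10-16,100" into an array of CPU ids, return count */
static int parse_cpu_list(const char *s, int *cpus, int max_cpus)
{
	int cnt = 0, from, to;

	while (*s) {
		if (sscanf(s, "%d-%d", &from, &to) != 2) {
			if (sscanf(s, "%d", &from) != 1)
				return -EINVAL;
			to = from;
		}
		for (; from <= to; from++) {
			if (cnt >= max_cpus)
				return -E2BIG;
			cpus[cnt++] = from;
		}
		s = strchr(s, ',');
		if (!s)
			break;
		s++;
	}
	return cnt;
}

Then the i-th producer would just get set_thread_affinity() with the
i-th element of that array, erroring out if the list is too short.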

Does this sound ok?

> > +     }
> > +
> > +     printf("Benchmark '%s' started.\n", bench->name);
> > +}
> > +

[...]

> > diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
> > new file mode 100644
> > index 000000000000..08aa0c5b1177
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/bench.h
> > @@ -0,0 +1,74 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#pragma once
> > +#include <stdlib.h>
> > +#include <stdbool.h>
> > +#include <linux/err.h>
> > +#include <errno.h>
> > +#include <unistd.h>
> > +#include <bpf/bpf.h>
> > +#include <bpf/libbpf.h>
> > +#include <math.h>
> > +#include <time.h>
> > +#include <sys/syscall.h>
> > +
> > +struct env {
> > +     char *bench_name;
> > +     int duration_sec;
> > +     int warmup_sec;
> > +     bool verbose;
> > +     bool list;
> > +     bool back2back;
>
> This field seems unused.

yep, cleaning up

>
> > +     bool affinity;
> > +     int consumer_cnt;
> > +     int producer_cnt;
> > +};
> > +

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench
  2020-05-09 17:23   ` Yonghong Song
@ 2020-05-12  4:22     ` Andrii Nakryiko
  2020-05-12 15:11       ` Yonghong Song
  0 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2020-05-12  4:22 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, John Fastabend

On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> > Add a fmod_ret BPF program to the existing test_overhead selftest. Also
> > re-implement the user-space benchmarking part within the benchmark runner
> > to compare results. Results with ./bench are consistently somewhat lower
> > than test_overhead's, but the relative performance of various types of BPF
> > programs stays consistent (e.g., kretprobe is noticeably slower).
> >
> > run_bench_rename.sh script (in benchs/ directory) was used to produce the
> > following numbers:
> >
> >    base      :    3.975 ± 0.065M/s
> >    kprobe    :    3.268 ± 0.095M/s
> >    kretprobe :    2.496 ± 0.040M/s
> >    rawtp     :    3.899 ± 0.078M/s
> >    fentry    :    3.836 ± 0.049M/s
> >    fexit     :    3.660 ± 0.082M/s
> >    fmodret   :    3.776 ± 0.033M/s
> >
> > While running test_overhead gives:
> >
> >    task_rename base        4457K events per sec
> >    task_rename kprobe      3849K events per sec
> >    task_rename kretprobe   2729K events per sec
> >    task_rename raw_tp      4506K events per sec
> >    task_rename fentry      4381K events per sec
> >    task_rename fexit       4349K events per sec
> >    task_rename fmod_ret    4130K events per sec
>
> Do you know where the overhead is and how we could provide options in
> bench to reduce the overhead so we can achieve similar numbers?
> For benchmarking, sometimes you really want to see the "true"
> potential of a particular implementation.

Alright, let's make it an official bench-off... :) And the reason for
this discrepancy turns out to be... not atomics at all! But rather a
single-threaded vs multi-threaded process (well, at least task_rename
happening from a non-main thread; I didn't narrow it down further).
Atomics actually make very little difference, which gives me good
peace of mind :)

So, I've built and ran test_overhead (selftest) and bench both as
multi-threaded and single-threaded apps. Corresponding results match
almost perfectly. And that's even though test_overhead doesn't use atomics
at all, while bench still does. Then I also ran test_overhead with
added atomics to match the bench implementation. There are barely any
differences; see the last two sets of results.
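
(For the record, my "single-threaded" bench variant was just a quick
hack along these lines, not part of the posted patches:

	/* run the only producer inline on the main thread instead of
	 * spawning it; the SIGALRM-driven measurements still fire
	 */
	bench->producer_thread((void *)0L); /* instead of pthread_create() */

i.e., the only thing that changes is which thread ends up doing the
task renaming.)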

BTW, selftest results seem a bit lower than the ones in the original
commit, probably because I made it run more iterations (like 40 times
more) to have more stable results.

So here are the results:

Single-threaded implementations
===============================

/* bench: single-threaded, atomics */
base      :    4.622 ± 0.049M/s
kprobe    :    3.673 ± 0.052M/s
kretprobe :    2.625 ± 0.052M/s
rawtp     :    4.369 ± 0.089M/s
fentry    :    4.201 ± 0.558M/s
fexit     :    4.309 ± 0.148M/s
fmodret   :    4.314 ± 0.203M/s

/* selftest: single-threaded, no atomics */
task_rename base        4555K events per sec
task_rename kprobe      3643K events per sec
task_rename kretprobe   2506K events per sec
task_rename raw_tp      4303K events per sec
task_rename fentry      4307K events per sec
task_rename fexit       4010K events per sec
task_rename fmod_ret    3984K events per sec


Multi-threaded implementations
==============================

/* bench: multi-threaded w/ atomics */
base      :    3.910 ± 0.023M/s
kprobe    :    3.048 ± 0.037M/s
kretprobe :    2.300 ± 0.015M/s
rawtp     :    3.687 ± 0.034M/s
fentry    :    3.740 ± 0.087M/s
fexit     :    3.510 ± 0.009M/s
fmodret   :    3.485 ± 0.050M/s

/* selftest: multi-threaded w/ atomics */
task_rename base        3872K events per sec
task_rename kprobe      3068K events per sec
task_rename kretprobe   2350K events per sec
task_rename raw_tp      3731K events per sec
task_rename fentry      3639K events per sec
task_rename fexit       3558K events per sec
task_rename fmod_ret    3511K events per sec

/* selftest: multi-threaded, no atomics */
task_rename base        3945K events per sec
task_rename kprobe      3298K events per sec
task_rename kretprobe   2451K events per sec
task_rename raw_tp      3718K events per sec
task_rename fentry      3782K events per sec
task_rename fexit       3543K events per sec
task_rename fmod_ret    3526K events per sec


>
> >
> > Acked-by: John Fastabend <john.fastabend@gmail.com>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >   tools/testing/selftests/bpf/Makefile          |   4 +-
> >   tools/testing/selftests/bpf/bench.c           |  14 ++
> >   .../selftests/bpf/benchs/bench_rename.c       | 195 ++++++++++++++++++
> >   .../selftests/bpf/benchs/run_bench_rename.sh  |   9 +
> >   .../selftests/bpf/prog_tests/test_overhead.c  |  14 +-
> >   .../selftests/bpf/progs/test_overhead.c       |   6 +
> >   6 files changed, 240 insertions(+), 2 deletions(-)
> >   create mode 100644 tools/testing/selftests/bpf/benchs/bench_rename.c
> >   create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_rename.sh
> >
> > diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> > index 289fffbf975e..29a02abf81a3 100644
> > --- a/tools/testing/selftests/bpf/Makefile
> > +++ b/tools/testing/selftests/bpf/Makefile
> > @@ -409,10 +409,12 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
> >   $(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
> >       $(call msg,CC,,$@)
> >       $(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
> > +$(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
> >   $(OUTPUT)/bench.o: bench.h
> >   $(OUTPUT)/bench: LDLIBS += -lm
> >   $(OUTPUT)/bench: $(OUTPUT)/bench.o \
> > -              $(OUTPUT)/bench_count.o
> > +              $(OUTPUT)/bench_count.o \
> > +              $(OUTPUT)/bench_rename.o
> >       $(call msg,BINARY,,$@)
> >       $(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
> >
> [...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure
  2020-05-12  3:29     ` Andrii Nakryiko
@ 2020-05-12 14:39       ` Yonghong Song
  0 siblings, 0 replies; 13+ messages in thread
From: Yonghong Song @ 2020-05-12 14:39 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team



On 5/11/20 8:29 PM, Andrii Nakryiko wrote:
> On Sat, May 9, 2020 at 10:10 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
>>> While working on BPF ringbuf implementation, testing, and benchmarking, I've
>>> developed a pretty generic and modular benchmark runner, which seems to be
>>> generically useful, as I've already used it for one more purpose (testing the
>>> fastest way to trigger a BPF program, to minimize the overhead of in-kernel code).
>>>
>>> This patch adds the generic part of the benchmark runner and sets up the
>>> Makefile for extending it with more sets of benchmarks.
>>>
>>> The benchmarker itself operates by spinning up a specified number of producer
>>> and consumer threads and setting up an interval timer that sends a SIGALRM
>>> signal to the application once a second. Every second, a snapshot of the
>>> hits/drops counters is collected and stored in an array. Drops are useful for
>>> producer/consumer benchmarks in which producers might overwhelm consumers.
>>>
>>> Once the test finishes after the given number of warm-up and testing seconds,
>>> mean and stddev are calculated (ignoring warm-up results) and printed to stdout.
>>> This setup seems to give consistent and accurate results.
>>>
>>> To validate behavior, I added two atomic counting tests: global and local.
>>> For the global one, all the producer threads are atomically incrementing the
>>> same counter as fast as possible. This, of course, leads to a huge drop in
>>> performance once there is more than one producer thread, due to CPUs fighting
>>> for the same memory location.
>>>
>>> Local counting, on the other hand, maintains one counter per producer thread,
>>> each incremented independently. Once per second, all counters are read and
>>> added together to form the final "counting throughput" measurement. As
>>> expected, such a setup demonstrates linear scalability with the number of
>>> producers (as long as there are enough physical CPU cores, of course). See
>>> example output below. Also, this setup can nicely demonstrate the disastrous
>>> effects of false sharing, if care is not taken to keep those per-producer
>>> counters apart in independent cache lines.
>>>
>>> The demo output shows the global counter first with 1 producer, then with 4.
>>> Both total and per-producer performance drop significantly. The last run is
>>> the local counter with 4 producers, demonstrating near-perfect scalability.
>>>
>>> $ ./bench -a -w1 -d2 -p1 count-global
>>> Setting up benchmark 'count-global'...
>>> Benchmark 'count-global' started.
>>> Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
>>> Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
>>> Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
>>> Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
>>> Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s
>>>
>>> $ ./bench -a -w1 -d2 -p4 count-global
>>> Setting up benchmark 'count-global'...
>>> Benchmark 'count-global' started.
>>> Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
>>> Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
>>> Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
>>> Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
>>> Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s
>>>
>>> $ ./bench -a -w1 -d2 -p4 count-local
>>> Setting up benchmark 'count-local'...
>>> Benchmark 'count-local' started.
>>> Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
>>> Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
>>> Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
>>> Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
>>> Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s
>>>
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>> ---
>>>    tools/testing/selftests/bpf/.gitignore        |   1 +
>>>    tools/testing/selftests/bpf/Makefile          |  13 +-
>>>    tools/testing/selftests/bpf/bench.c           | 372 ++++++++++++++++++
>>>    tools/testing/selftests/bpf/bench.h           |  74 ++++
>>>    .../selftests/bpf/benchs/bench_count.c        |  91 +++++
>>>    5 files changed, 550 insertions(+), 1 deletion(-)
>>>    create mode 100644 tools/testing/selftests/bpf/bench.c
>>>    create mode 100644 tools/testing/selftests/bpf/bench.h
>>>    create mode 100644 tools/testing/selftests/bpf/benchs/bench_count.c
>>>
> 
> [...]
> 
> trimming is good :)
> 
>>> +
>>> +void hits_drops_report_final(struct bench_res res[], int res_cnt)
>>> +{
>>> +     int i;
>>> +     double hits_mean = 0.0, drops_mean = 0.0;
>>> +     double hits_stddev = 0.0, drops_stddev = 0.0;
>>> +
>>> +     for (i = 0; i < res_cnt; i++) {
>>> +             hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
>>> +             drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
>>> +     }
>>> +
>>> +     if (res_cnt > 1)  {
>>> +             for (i = 0; i < res_cnt; i++) {
>>> +                     hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
>>> +                                    (hits_mean - res[i].hits / 1000000.0) /
>>> +                                    (res_cnt - 1.0);
>>> +                     drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
>>> +                                     (drops_mean - res[i].drops / 1000000.0) /
>>> +                                     (res_cnt - 1.0);
>>> +             }
>>> +             hits_stddev = sqrt(hits_stddev);
>>> +             drops_stddev = sqrt(drops_stddev);
>>> +     }
>>> +     printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>>> +            hits_mean, hits_stddev, hits_mean / env.producer_cnt);
>>> +     printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
>>> +            drops_mean, drops_stddev);
>>
>> The unicode char \u00B1 (for +|-) may cause warnings with some old
>> compilers (e.g., 4.8.5), as it requires C99 mode.
>>
>> /data/users/yhs/work/net-next/tools/testing/selftests/bpf/bench.c:91:9:
>> warning: universal character names are only valid in C++ and C99
>> [enabled by default]
>>     printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>>
>> "+|-" is alternative, but not as good as \u00B1 indeed. Newer
>> compiler may already have the default C99. Maybe we can just add
>> C99 for build `bench`?
> 
> I think I'm fine with an ancient compiler emitting a harmless warning
> for code under selftests/bpf, honestly...
> 
>>
>>> +}
>>> +
>>> +const char *argp_program_version = "benchmark";
>>> +const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
>>> +const char argp_program_doc[] =
>>> +"benchmark    Generic benchmarking framework.\n"
>>> +"\n"
>>> +"This tool runs benchmarks.\n"
>>> +"\n"
>>> +"USAGE: benchmark <bench-name>\n"
>>> +"\n"
>>> +"EXAMPLES:\n"
>>> +"    # run 'count-local' benchmark with 1 producer and 1 consumer\n"
>>> +"    benchmark count-local\n"
>>> +"    # run 'count-local' with 16 producer and 8 consumer thread, pinned to CPUs\n"
>>> +"    benchmark -p16 -c8 -a count-local\n";
>>
>> Some of the above global variables should probably be static.
>> But I do not have a strong preference on this.
> 
> Oh, these are actually global variables the argp library expects; they have to be global.
> 
>>
>>> +
>>> +static const struct argp_option opts[] = {
>>> +     { "list", 'l', NULL, 0, "List available benchmarks"},
>>> +     { "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
>>> +     { "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
>>> +     { "producers", 'p', "NUM", 0, "Number of producer threads"},
>>> +     { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
>>> +     { "verbose", 'v', NULL, 0, "Verbose debug output"},
>>> +     { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
>>> +     { "b2b", 'b', NULL, 0, "Back-to-back mode"},
>>> +     { "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
>>
>> I did not see the b2b and rb-output options being processed in this file.
> 
> Slipped through the rebasing cracks from the future ringbuf
> benchmarks; will remove it. I also figured out a way to do this in a
> more modular fashion anyway (child parsers in argp).
> 
>>
>>> +     {},
>>> +};
>>> +
> 
> [...]
> 
>>> +     for (i = 0; i < env.consumer_cnt; i++) {
>>> +             err = pthread_create(&state.consumers[i], NULL,
>>> +                                  bench->consumer_thread, (void *)(long)i);
>>> +             if (err) {
>>> +                     fprintf(stderr, "failed to create consumer thread #%d: %d\n",
>>> +                             i, -errno);
>>> +                     exit(1);
>>> +             }
>>> +             if (env.affinity)
>>> +                     set_thread_affinity(state.consumers[i], i);
>>> +     }
>>> +     for (i = 0; i < env.producer_cnt; i++) {
>>> +             err = pthread_create(&state.producers[i], NULL,
>>> +                                  bench->producer_thread, (void *)(long)i);
>>> +             if (err) {
>>> +                     fprintf(stderr, "failed to create producer thread #%d: %d\n",
>>> +                             i, -errno);
>>> +                     exit(1);
>>> +             }
>>> +             if (env.affinity)
>>> +                     set_thread_affinity(state.producers[i],
>>> +                                         env.consumer_cnt + i);
>>
>> Here, we bind consumer threads first and then producer threads; I think
>> this is probably just an arbitrary choice?
> 
> yep, almost arbitrary. Most of my cases have 1 consumer and >=1
> producers, so it was convenient to have the consumer pinned to the same CPU,
> regardless of how many producers I have.
> 
>>
>> In certain cases, I think people may want more advanced binding
>> scenarios, e.g., for hyperthreading, binding consumer and producer on
>> the same core or on different cores, etc. One possibility is to
>> introduce a -c option similar to taskset's. If -c is not supplied, you
>> keep the current default; otherwise, the -c list is used.
>>
> 
> well, taskset's job is simpler: it takes a list of CPUs for a single
> PID, if I understand correctly. Here we have many threads, each of which
> might have a different CPU or even CPUs. But I agree that for some
> benchmarks it's going to be critical to control this precisely. Here's how
> I'm going to allow the most flexibility without too much complexity.
> 
> --prod-affinity 1,2,10-16,100 -- will specify a set of CPUs for
> producers. The first producer will use the CPU with the lowest index
> from that list, the second will take the second, and so on. If fewer CPUs
> are provided than necessary, it's an error; if more, it's fine.
> 
> Then for consumers I will add an independent --cons-affinity parameter,
> which will do the same for consumer threads.
> 
> Having two independent lists will allow testing scenarios where we
> want producers and consumers to fight for the same CPU.
> 
> Does this sound ok?

This sounds good, more than what I asked for.

> 
>>> +     }
>>> +
>>> +     printf("Benchmark '%s' started.\n", bench->name);
>>> +}
>>> +
> 
> [...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench
  2020-05-12  4:22     ` Andrii Nakryiko
@ 2020-05-12 15:11       ` Yonghong Song
  2020-05-12 17:23         ` Andrii Nakryiko
  0 siblings, 1 reply; 13+ messages in thread
From: Yonghong Song @ 2020-05-12 15:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, John Fastabend



On 5/11/20 9:22 PM, Andrii Nakryiko wrote:
> On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
>>> Add a fmod_ret BPF program to the existing test_overhead selftest. Also
>>> re-implement the user-space benchmarking part within the benchmark runner
>>> to compare results. Results with ./bench are consistently somewhat lower
>>> than test_overhead's, but the relative performance of various types of BPF
>>> programs stays consistent (e.g., kretprobe is noticeably slower).
>>>
>>> run_bench_rename.sh script (in benchs/ directory) was used to produce the
>>> following numbers:
>>>
>>>     base      :    3.975 ± 0.065M/s
>>>     kprobe    :    3.268 ± 0.095M/s
>>>     kretprobe :    2.496 ± 0.040M/s
>>>     rawtp     :    3.899 ± 0.078M/s
>>>     fentry    :    3.836 ± 0.049M/s
>>>     fexit     :    3.660 ± 0.082M/s
>>>     fmodret   :    3.776 ± 0.033M/s
>>>
>>> While running test_overhead gives:
>>>
>>>     task_rename base        4457K events per sec
>>>     task_rename kprobe      3849K events per sec
>>>     task_rename kretprobe   2729K events per sec
>>>     task_rename raw_tp      4506K events per sec
>>>     task_rename fentry      4381K events per sec
>>>     task_rename fexit       4349K events per sec
>>>     task_rename fmod_ret    4130K events per sec
>>
>> Do you know where the overhead is and how we could provide options in
>> bench to reduce the overhead so we can achieve similar numbers?
>> For benchmarking, sometimes you really want to see the "true"
>> potential of a particular implementation.
> 
> Alright, let's make it an official bench-off... :) And the reason for
> this discrepancy turns out to be... not atomics at all! But rather a
> single-threaded vs multi-threaded process (well, at least task_rename
> happening from a non-main thread; I didn't narrow it down further).

It would be good to find out why and have a scheme (e.g. some kind
of affinity binding) to close the gap.

> Atomics actually make very little difference, which gives me good
> peace of mind :)
> 
> So, I've built and ran test_overhead (selftest) and bench both as
> multi-threaded and single-threaded apps. Corresponding results match
> almost perfectly. And that's even though test_overhead doesn't use atomics
> at all, while bench still does. Then I also ran test_overhead with
> added atomics to match the bench implementation. There are barely any
> differences; see the last two sets of results.
> 
> BTW, selftest results seem a bit lower than the ones in the original
> commit, probably because I made it run more iterations (like 40 times
> more) to have more stable results.
> 
> So here are the results:
> 
> Single-threaded implementations
> ===============================
> 
> /* bench: single-threaded, atomics */
> base      :    4.622 ± 0.049M/s
> kprobe    :    3.673 ± 0.052M/s
> kretprobe :    2.625 ± 0.052M/s
> rawtp     :    4.369 ± 0.089M/s
> fentry    :    4.201 ± 0.558M/s
> fexit     :    4.309 ± 0.148M/s
> fmodret   :    4.314 ± 0.203M/s
> 
> /* selftest: single-threaded, no atomics */
> task_rename base        4555K events per sec
> task_rename kprobe      3643K events per sec
> task_rename kretprobe   2506K events per sec
> task_rename raw_tp      4303K events per sec
> task_rename fentry      4307K events per sec
> task_rename fexit       4010K events per sec
> task_rename fmod_ret    3984K events per sec
> 
> 
> Multi-threaded implementations
> ==============================
> 
> /* bench: multi-threaded w/ atomics */
> base      :    3.910 ± 0.023M/s
> kprobe    :    3.048 ± 0.037M/s
> kretprobe :    2.300 ± 0.015M/s
> rawtp     :    3.687 ± 0.034M/s
> fentry    :    3.740 ± 0.087M/s
> fexit     :    3.510 ± 0.009M/s
> fmodret   :    3.485 ± 0.050M/s
> 
> /* selftest: multi-threaded w/ atomics */
> task_rename base        3872K events per sec
> task_rename kprobe      3068K events per sec
> task_rename kretprobe   2350K events per sec
> task_rename raw_tp      3731K events per sec
> task_rename fentry      3639K events per sec
> task_rename fexit       3558K events per sec
> task_rename fmod_ret    3511K events per sec
> 
> /* selftest: multi-threaded, no atomics */
> task_rename base        3945K events per sec
> task_rename kprobe      3298K events per sec
> task_rename kretprobe   2451K events per sec
> task_rename raw_tp      3718K events per sec
> task_rename fentry      3782K events per sec
> task_rename fexit       3543K events per sec
> task_rename fmod_ret    3526K events per sec
> 
> 
[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench
  2020-05-12 15:11       ` Yonghong Song
@ 2020-05-12 17:23         ` Andrii Nakryiko
  2020-05-12 17:47           ` Yonghong Song
  0 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2020-05-12 17:23 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, John Fastabend

On Tue, May 12, 2020 at 8:11 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 5/11/20 9:22 PM, Andrii Nakryiko wrote:
> > On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> >>> Add a fmod_ret BPF program to the existing test_overhead selftest. Also
> >>> re-implement the user-space benchmarking part within the benchmark runner
> >>> to compare results. Results with ./bench are consistently somewhat lower
> >>> than test_overhead's, but the relative performance of various types of BPF
> >>> programs stays consistent (e.g., kretprobe is noticeably slower).
> >>>
> >>> run_bench_rename.sh script (in benchs/ directory) was used to produce the
> >>> following numbers:
> >>>
> >>>     base      :    3.975 ± 0.065M/s
> >>>     kprobe    :    3.268 ± 0.095M/s
> >>>     kretprobe :    2.496 ± 0.040M/s
> >>>     rawtp     :    3.899 ± 0.078M/s
> >>>     fentry    :    3.836 ± 0.049M/s
> >>>     fexit     :    3.660 ± 0.082M/s
> >>>     fmodret   :    3.776 ± 0.033M/s
> >>>
> >>> While running test_overhead gives:
> >>>
> >>>     task_rename base        4457K events per sec
> >>>     task_rename kprobe      3849K events per sec
> >>>     task_rename kretprobe   2729K events per sec
> >>>     task_rename raw_tp      4506K events per sec
> >>>     task_rename fentry      4381K events per sec
> >>>     task_rename fexit       4349K events per sec
> >>>     task_rename fmod_ret    4130K events per sec
> >>
> >> Do you know where the overhead is and how we could provide options in
> >> bench to reduce the overhead so we can achieve similar numbers?
> >> For benchmarking, sometimes you really want to see the "true"
> >> potential of a particular implementation.
> >
> > Alright, let's make it an official bench-off... :) And the reason for
> > this discrepancy turns out to be... not atomics at all! But rather a
> > single-threaded vs multi-threaded process (well, at least task_rename
> > happening from a non-main thread; I didn't narrow it down further).
>
> It would be good to find out why and have a scheme (e.g. some kind
> of affinity binding) to close the gap.

I don't think affinity has anything to do with this. test_overhead
sets affinity for the entire process, and that doesn't change results at
all. Same for bench: both with and without setting affinity, the results
are pretty much the same. Affinity helps to get slightly more
stable and consistent results, but doesn't hurt or help performance
for this benchmark.

I don't think we need to spend that much time trying to understand the
behavior of task renaming for such a particular setup. Benchmarking
has to be multi-threaded in most cases anyway; there is no way around
that.

>
> > Atomics actually make very little difference, which gives me good
> > peace of mind :)
> >
> > So, I've built and ran test_overhead (selftest) and bench both as
> > multi-threaded and single-threaded apps. Corresponding results match
> > almost perfectly. And that's even though test_overhead doesn't use atomics
> > at all, while bench still does. Then I also ran test_overhead with
> > added atomics to match the bench implementation. There are barely any
> > differences; see the last two sets of results.
> >
> > BTW, selftest results seem a bit lower than the ones in the original
> > commit, probably because I made it run more iterations (like 40 times
> > more) to have more stable results.
> >
> > So here are the results:
> >
> > Single-threaded implementations
> > ===============================
> >
> > /* bench: single-threaded, atomics */
> > base      :    4.622 ± 0.049M/s
> > kprobe    :    3.673 ± 0.052M/s
> > kretprobe :    2.625 ± 0.052M/s
> > rawtp     :    4.369 ± 0.089M/s
> > fentry    :    4.201 ± 0.558M/s
> > fexit     :    4.309 ± 0.148M/s
> > fmodret   :    4.314 ± 0.203M/s
> >
> > /* selftest: single-threaded, no atomics */
> > task_rename base        4555K events per sec
> > task_rename kprobe      3643K events per sec
> > task_rename kretprobe   2506K events per sec
> > task_rename raw_tp      4303K events per sec
> > task_rename fentry      4307K events per sec
> > task_rename fexit       4010K events per sec
> > task_rename fmod_ret    3984K events per sec
> >
> >
> > Multi-threaded implementations
> > ==============================
> >
> > /* bench: multi-threaded w/ atomics */
> > base      :    3.910 ± 0.023M/s
> > kprobe    :    3.048 ± 0.037M/s
> > kretprobe :    2.300 ± 0.015M/s
> > rawtp     :    3.687 ± 0.034M/s
> > fentry    :    3.740 ± 0.087M/s
> > fexit     :    3.510 ± 0.009M/s
> > fmodret   :    3.485 ± 0.050M/s
> >
> > /* selftest: multi-threaded w/ atomics */
> > task_rename base        3872K events per sec
> > task_rename kprobe      3068K events per sec
> > task_rename kretprobe   2350K events per sec
> > task_rename raw_tp      3731K events per sec
> > task_rename fentry      3639K events per sec
> > task_rename fexit       3558K events per sec
> > task_rename fmod_ret    3511K events per sec
> >
> > /* selftest: multi-threaded, no atomics */
> > task_rename base        3945K events per sec
> > task_rename kprobe      3298K events per sec
> > task_rename kretprobe   2451K events per sec
> > task_rename raw_tp      3718K events per sec
> > task_rename fentry      3782K events per sec
> > task_rename fexit       3543K events per sec
> > task_rename fmod_ret    3526K events per sec
> >
> >
> [...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench
  2020-05-12 17:23         ` Andrii Nakryiko
@ 2020-05-12 17:47           ` Yonghong Song
  0 siblings, 0 replies; 13+ messages in thread
From: Yonghong Song @ 2020-05-12 17:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, John Fastabend



On 5/12/20 10:23 AM, Andrii Nakryiko wrote:
> On Tue, May 12, 2020 at 8:11 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 5/11/20 9:22 PM, Andrii Nakryiko wrote:
>>> On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>>
>>>>
>>>> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
>>>>> Add an fmod_ret BPF program to the existing test_overhead selftest. Also
>>>>> re-implement the user-space benchmarking part as part of the benchmark
>>>>> runner to compare results. Results with ./bench are consistently somewhat
>>>>> lower than test_overhead's, but relative performance of various types of
>>>>> BPF programs stays consistent (e.g., kretprobe is noticeably slower).
>>>>>
>>>>> run_bench_rename.sh script (in benchs/ directory) was used to produce the
>>>>> following numbers:
>>>>>
>>>>>      base      :    3.975 ± 0.065M/s
>>>>>      kprobe    :    3.268 ± 0.095M/s
>>>>>      kretprobe :    2.496 ± 0.040M/s
>>>>>      rawtp     :    3.899 ± 0.078M/s
>>>>>      fentry    :    3.836 ± 0.049M/s
>>>>>      fexit     :    3.660 ± 0.082M/s
>>>>>      fmodret   :    3.776 ± 0.033M/s
>>>>>
>>>>> While running test_overhead gives:
>>>>>
>>>>>      task_rename base        4457K events per sec
>>>>>      task_rename kprobe      3849K events per sec
>>>>>      task_rename kretprobe   2729K events per sec
>>>>>      task_rename raw_tp      4506K events per sec
>>>>>      task_rename fentry      4381K events per sec
>>>>>      task_rename fexit       4349K events per sec
>>>>>      task_rename fmod_ret    4130K events per sec
>>>>
>>>> Do you know where the overhead is, and how we could provide options in
>>>> bench to reduce the overhead so we can achieve similar numbers?
>>>> For benchmarking, sometimes you really want to see the "true"
>>>> potential of a particular implementation.
>>>
>>> Alright, let's make it an official bench-off... :) And the reason for
>>> this discrepancy turns out to be... not atomics at all! But rather a
>>> single-threaded vs multi-threaded process (well, at least task_rename
>>> happening from a non-main thread; I didn't narrow it down further).
>>
>> It would be good to find out why and have a scheme (e.g. some kind
>> of affinity binding) to close the gap.
> 
> I don't think affinity has anything to do with this. test_overhead
> sets affinity for the entire process, and that doesn't change results at
> all. Same for bench: both with and without setting affinity, results
> are pretty much the same. Affinity helps to get somewhat more
> stable and consistent results, but doesn't hurt or help performance
> for this benchmark.
> 
> I don't think we need to spend that much time trying to understand
> the behavior of task renaming in such a particular setup. Benchmarking
> has to be multi-threaded in most cases anyway; there is no way around
> that.

Okay. This might be related to kernel scheduling of the main thread vs.
secondary threads? That is indeed beyond this patch, then.

I am fine with the current mechanism as is. Maybe put the above
experimental data in the commit message? If other people later
want to investigate further, they will have some data to
start with.
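
For anyone digging into this later, the workload itself is just renaming
the current task in a tight loop; a rough sketch of its shape, assuming
the /proc/self/comm write approach (see prog_tests/test_overhead.c and
benchs/bench_rename.c for the real code):

	#include <fcntl.h>
	#include <unistd.h>

	static void rename_loop(long iters)
	{
		char buf[] = "test_overhead";
		int fd = open("/proc/self/comm", O_WRONLY | O_TRUNC);

		if (fd < 0)
			return;
		/* each write triggers the attached BPF program(s) */
		while (iters--)
			write(fd, buf, sizeof(buf));
		close(fd);
	}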

> 
>>
>>> Atomics actually make very little difference, which gives me good
>>> peace of mind :)
>>>
>>> So, I've built and run test_overhead (selftest) and bench, both as
>>> multi-threaded and single-threaded apps. Corresponding results match
>>> almost perfectly. And that's while test_overhead doesn't use atomics
>>> at all, while bench still does. Then I also ran test_overhead with
>>> atomics added to match the bench implementation. There are barely any
>>> differences; see the last two sets of results.
>>>
>>> BTW, selftest results seem a bit lower than the ones in the original
>>> commit, probably because I made it run more iterations (about 40 times
>>> more) to get more stable results.
>>>
>>> So here are the results:
>>>
>>> Single-threaded implementations
>>> ===============================
>>>
>>> /* bench: single-threaded, atomics */
>>> base      :    4.622 ± 0.049M/s
>>> kprobe    :    3.673 ± 0.052M/s
>>> kretprobe :    2.625 ± 0.052M/s
>>> rawtp     :    4.369 ± 0.089M/s
>>> fentry    :    4.201 ± 0.558M/s
>>> fexit     :    4.309 ± 0.148M/s
>>> fmodret   :    4.314 ± 0.203M/s
>>>
>>> /* selftest: single-threaded, no atomics */
>>> task_rename base        4555K events per sec
>>> task_rename kprobe      3643K events per sec
>>> task_rename kretprobe   2506K events per sec
>>> task_rename raw_tp      4303K events per sec
>>> task_rename fentry      4307K events per sec
>>> task_rename fexit       4010K events per sec
>>> task_rename fmod_ret    3984K events per sec
>>>
>>>
>>> Multi-threaded implementations
>>> ==============================
>>>
>>> /* bench: multi-threaded w/ atomics */
>>> base      :    3.910 ± 0.023M/s
>>> kprobe    :    3.048 ± 0.037M/s
>>> kretprobe :    2.300 ± 0.015M/s
>>> rawtp     :    3.687 ± 0.034M/s
>>> fentry    :    3.740 ± 0.087M/s
>>> fexit     :    3.510 ± 0.009M/s
>>> fmodret   :    3.485 ± 0.050M/s
>>>
>>> /* selftest: multi-threaded w/ atomics */
>>> task_rename base        3872K events per sec
>>> task_rename kprobe      3068K events per sec
>>> task_rename kretprobe   2350K events per sec
>>> task_rename raw_tp      3731K events per sec
>>> task_rename fentry      3639K events per sec
>>> task_rename fexit       3558K events per sec
>>> task_rename fmod_ret    3511K events per sec
>>>
>>> /* selftest: multi-threaded, no atomics */
>>> task_rename base        3945K events per sec
>>> task_rename kprobe      3298K events per sec
>>> task_rename kretprobe   2451K events per sec
>>> task_rename raw_tp      3718K events per sec
>>> task_rename fentry      3782K events per sec
>>> task_rename fexit       3543K events per sec
>>> task_rename fmod_ret    3526K events per sec
>>>
>>>
>> [...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-05-12 17:48 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-08 23:20 [PATCH v2 bpf-next 0/3] Add benchmark runner and few benchmarks Andrii Nakryiko
2020-05-08 23:20 ` [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure Andrii Nakryiko
2020-05-09 17:10   ` Yonghong Song
2020-05-12  3:29     ` Andrii Nakryiko
2020-05-12 14:39       ` Yonghong Song
2020-05-08 23:20 ` [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench Andrii Nakryiko
2020-05-09 17:23   ` Yonghong Song
2020-05-12  4:22     ` Andrii Nakryiko
2020-05-12 15:11       ` Yonghong Song
2020-05-12 17:23         ` Andrii Nakryiko
2020-05-12 17:47           ` Yonghong Song
2020-05-08 23:20 ` [PATCH v2 bpf-next 3/3] selftest/bpf: add BPF triggering benchmark Andrii Nakryiko
2020-05-09 17:43   ` Yonghong Song
