* [PATCH bpf-next 0/6] New benchmark for hashmap lookups
@ 2023-01-27 18:14 Anton Protopopov
  2023-01-27 18:14 ` [PATCH bpf-next 1/6] selftest/bpf/benchs: fix a typo in bpf_hashmap_full_update Anton Protopopov
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-27 18:14 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend
  Cc: Anton Protopopov

Add a new benchmark for hashmap lookups and fix several typos. See individual
commits for descriptions.

One thing to mention here is that in commit 3 I've patched the bench utility so
that command line options can now be reused by different benchmarks.

The benchmark itself is added in the last commit, commit 6. I am using this
benchmark to measure map lookup performance when using a different hash function
(see https://fosdem.org/2023/schedule/event/bpf_hashing/). The results provided by
the benchmark look reasonable and match the results of my other benchmarks (which
require patching the kernel to gather actual statistics on map lookups).

Anton Protopopov (6):
  selftest/bpf/benchs: fix a typo in bpf_hashmap_full_update
  selftest/bpf/benchs: make a function static in bpf_hashmap_full_update
  selftest/bpf/benchs: enhance argp parsing
  selftest/bpf/benchs: make quiet option common
  selftest/bpf/benchs: print less if the quiet option is set
  selftest/bpf/benchs: Add benchmark for hashmap lookups

 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           | 126 +++++++-
 tools/testing/selftests/bpf/bench.h           |  24 ++
 .../bpf/benchs/bench_bloom_filter_map.c       |   6 -
 .../benchs/bench_bpf_hashmap_full_update.c    |   4 +-
 .../bpf/benchs/bench_bpf_hashmap_lookup.c     | 277 ++++++++++++++++++
 .../selftests/bpf/benchs/bench_bpf_loop.c     |   4 -
 .../bpf/benchs/bench_local_storage.c          |   5 -
 .../bench_local_storage_rcu_tasks_trace.c     |  20 +-
 .../selftests/bpf/benchs/bench_ringbufs.c     |   8 -
 .../selftests/bpf/benchs/bench_strncmp.c      |   4 -
 .../run_bench_bpf_hashmap_full_update.sh      |   2 +-
 .../selftests/bpf/progs/bpf_hashmap_lookup.c  |  61 ++++
 13 files changed, 486 insertions(+), 60 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_hashmap_lookup.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH bpf-next 1/6] selftest/bpf/benchs: fix a typo in bpf_hashmap_full_update
  2023-01-27 18:14 [PATCH bpf-next 0/6] New benchmark for hashmap lookups Anton Protopopov
@ 2023-01-27 18:14 ` Anton Protopopov
  2023-01-27 18:14 ` [PATCH bpf-next 2/6] selftest/bpf/benchs: make a function static " Anton Protopopov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-27 18:14 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend
  Cc: Anton Protopopov

To call the bpf_hashmap_full_update benchmark, one should say:

    bench bpf-hashmap-ful-update

The patch adds a missing 'l' to the benchmark name.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 .../selftests/bpf/benchs/bench_bpf_hashmap_full_update.c        | 2 +-
 .../selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
index cec51e0ff4b8..44706acf632a 100644
--- a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
@@ -85,7 +85,7 @@ void hashmap_report_final(struct bench_res res[], int res_cnt)
 }
 
 const struct bench bench_bpf_hashmap_full_update = {
-	.name = "bpf-hashmap-ful-update",
+	.name = "bpf-hashmap-full-update",
 	.validate = validate,
 	.setup = setup,
 	.producer_thread = producer,
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh b/tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh
index 1e2de838f9fa..cd2efd3fdef3 100755
--- a/tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh
+++ b/tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh
@@ -6,6 +6,6 @@ source ./benchs/run_common.sh
 set -eufo pipefail
 
 nr_threads=`expr $(cat /proc/cpuinfo | grep "processor"| wc -l) - 1`
-summary=$($RUN_BENCH -p $nr_threads bpf-hashmap-ful-update)
+summary=$($RUN_BENCH -p $nr_threads bpf-hashmap-full-update)
 printf "$summary"
 printf "\n"
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next 2/6] selftest/bpf/benchs: make a function static in bpf_hashmap_full_update
  2023-01-27 18:14 [PATCH bpf-next 0/6] New benchmark for hashmap lookups Anton Protopopov
  2023-01-27 18:14 ` [PATCH bpf-next 1/6] selftest/bpf/benchs: fix a typo in bpf_hashmap_full_update Anton Protopopov
@ 2023-01-27 18:14 ` Anton Protopopov
  2023-01-27 18:14 ` [PATCH bpf-next 3/6] selftest/bpf/benchs: enhance argp parsing Anton Protopopov
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-27 18:14 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend
  Cc: Anton Protopopov

The hashmap_report_final callback function defined in the
benchs/bench_bpf_hashmap_full_update.c file should be static.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 .../selftests/bpf/benchs/bench_bpf_hashmap_full_update.c        | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
index 44706acf632a..67f76415a362 100644
--- a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
@@ -68,7 +68,7 @@ static void setup(void)
 		bpf_map_update_elem(map_fd, &i, &i, BPF_ANY);
 }
 
-void hashmap_report_final(struct bench_res res[], int res_cnt)
+static void hashmap_report_final(struct bench_res res[], int res_cnt)
 {
 	unsigned int nr_cpus = bpf_num_possible_cpus();
 	int i;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next 3/6] selftest/bpf/benchs: enhance argp parsing
  2023-01-27 18:14 [PATCH bpf-next 0/6] New benchmark for hashmap lookups Anton Protopopov
  2023-01-27 18:14 ` [PATCH bpf-next 1/6] selftest/bpf/benchs: fix a typo in bpf_hashmap_full_update Anton Protopopov
  2023-01-27 18:14 ` [PATCH bpf-next 2/6] selftest/bpf/benchs: make a function static " Anton Protopopov
@ 2023-01-27 18:14 ` Anton Protopopov
  2023-01-31  0:07   ` Andrii Nakryiko
  2023-01-27 18:14 ` [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common Anton Protopopov
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Anton Protopopov @ 2023-01-27 18:14 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend
  Cc: Anton Protopopov

To parse the command line, the bench utility uses the argp_parse() function.
This function takes as an argument a parent 'struct argp' structure which
defines common command line options and an array of children 'struct argp'
structures which define additional command line options for particular
benchmarks. This implementation doesn't allow benchmarks to share option
names: if two benchmarks want to use, say, a --option option, then only one
of them will succeed (the first one encountered in the array). It would be
convenient if we could use the same option names in different benchmarks
(with the same semantics, e.g., --nr_loops=N).

Fix this by calling the argp_parse() function twice. The first call is needed
to find the benchmark name. Given the name, we can patch the list of argp
children to include only the correct list of options, so that the right parser
is executed. (If there is no benchmark-specific list of arguments, then a
single call is enough.) As mentioned above, arguments with the same name must
have the same semantics (in this case, whether or not they take a parameter),
but they can have different descriptions and different parsers.

Since we can now share option names, it also makes sense to share the option
ids. Previously, every benchmark defined a separate enum, like

    enum {
           ARG_SMTH = 9000,
           ARG_SMTH_ELSE = 9001,
    };

These ids were later used to distinguish between command line options:

    static const struct argp_option opts[] = {
            { "smth", ARG_SMTH, "SMTH", 0,
                    "Set number of smth to configure smth"},
            { "smth_else", ARG_SMTH_ELSE, "SMTH_ELSE", 0,
                    "Set number of smth_else to configure smth else"},
            ...

Move the argument id definitions to bench.h so that they are visible to every
benchmark (and so that there's no need to grep whether a given id is already
taken).
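
After this change a benchmark keeps only its option table and reuses the
shared ids from bench.h; for example, the bpf_loop benchmark ends up with
(excerpt from the diff below):

    static const struct argp_option opts[] = {
            { "nr_loops", ARG_NR_LOOPS, "nr_loops", 0,
                    "Set number of loops for the bpf_loop helper"},
            ...
    };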

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 tools/testing/selftests/bpf/bench.c           | 98 +++++++++++++++++--
 tools/testing/selftests/bpf/bench.h           | 20 ++++
 .../bpf/benchs/bench_bloom_filter_map.c       |  6 --
 .../selftests/bpf/benchs/bench_bpf_loop.c     |  4 -
 .../bpf/benchs/bench_local_storage.c          |  5 -
 .../bench_local_storage_rcu_tasks_trace.c     |  6 --
 .../selftests/bpf/benchs/bench_ringbufs.c     |  8 --
 .../selftests/bpf/benchs/bench_strncmp.c      |  4 -
 8 files changed, 110 insertions(+), 41 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index c1f20a147462..ba93f1b268e1 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -249,11 +249,6 @@ const char argp_program_doc[] =
 "    # run 'count-local' with 16 producer and 8 consumer thread, pinned to CPUs\n"
 "    benchmark -p16 -c8 -a count-local\n";
 
-enum {
-	ARG_PROD_AFFINITY_SET = 1000,
-	ARG_CONS_AFFINITY_SET = 1001,
-};
-
 static const struct argp_option opts[] = {
 	{ "list", 'l', NULL, 0, "List available benchmarks"},
 	{ "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
@@ -276,7 +271,7 @@ extern struct argp bench_local_storage_argp;
 extern struct argp bench_local_storage_rcu_tasks_trace_argp;
 extern struct argp bench_strncmp_argp;
 
-static const struct argp_child bench_parsers[] = {
+static struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
 	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
@@ -287,9 +282,10 @@ static const struct argp_child bench_parsers[] = {
 	{},
 };
 
+static int pos_args;
+
 static error_t parse_arg(int key, char *arg, struct argp_state *state)
 {
-	static int pos_args;
 
 	switch (key) {
 	case 'v':
@@ -359,6 +355,69 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
 	return 0;
 }
 
+struct argp *bench_name_to_argp(const char *bench_name)
+{
+
+#define _SCMP(NAME) (!strcmp(bench_name, NAME))
+
+	if (_SCMP("bloom-lookup") ||
+	    _SCMP("bloom-update") ||
+	    _SCMP("bloom-false-positive") ||
+	    _SCMP("hashmap-without-bloom") ||
+	    _SCMP("hashmap-with-bloom"))
+		return &bench_bloom_map_argp;
+
+	if (_SCMP("rb-libbpf") ||
+	    _SCMP("rb-custom") ||
+	    _SCMP("pb-libbpf") ||
+	    _SCMP("pb-custom"))
+		return &bench_ringbufs_argp;
+
+	if (_SCMP("local-storage-cache-seq-get") ||
+	    _SCMP("local-storage-cache-int-get") ||
+	    _SCMP("local-storage-cache-hashmap-control"))
+		return &bench_local_storage_argp;
+
+	if (_SCMP("local-storage-tasks-trace"))
+		return &bench_local_storage_rcu_tasks_trace_argp;
+
+	if (_SCMP("strncmp-no-helper") ||
+	    _SCMP("strncmp-helper"))
+		return &bench_strncmp_argp;
+
+	if (_SCMP("bpf-loop"))
+		return &bench_bpf_loop_argp;
+
+	/* no extra arguments */
+	if (_SCMP("count-global") ||
+	    _SCMP("count-local") ||
+	    _SCMP("rename-base") ||
+	    _SCMP("rename-kprobe") ||
+	    _SCMP("rename-kretprobe") ||
+	    _SCMP("rename-rawtp") ||
+	    _SCMP("rename-fentry") ||
+	    _SCMP("rename-fexit") ||
+	    _SCMP("trig-base") ||
+	    _SCMP("trig-tp") ||
+	    _SCMP("trig-rawtp") ||
+	    _SCMP("trig-kprobe") ||
+	    _SCMP("trig-fentry") ||
+	    _SCMP("trig-fentry-sleep") ||
+	    _SCMP("trig-fmodret") ||
+	    _SCMP("trig-uprobe-base") ||
+	    _SCMP("trig-uprobe-with-nop") ||
+	    _SCMP("trig-uretprobe-with-nop") ||
+	    _SCMP("trig-uprobe-without-nop") ||
+	    _SCMP("trig-uretprobe-without-nop") ||
+	    _SCMP("bpf-hashmap-full-update"))
+		return NULL;
+
+#undef _SCMP
+
+	fprintf(stderr, "%s: bench %s is unknown\n", __func__, bench_name);
+	exit(1);
+}
+
 static void parse_cmdline_args(int argc, char **argv)
 {
 	static const struct argp argp = {
@@ -367,12 +426,35 @@ static void parse_cmdline_args(int argc, char **argv)
 		.doc = argp_program_doc,
 		.children = bench_parsers,
 	};
+	static struct argp *bench_argp;
+
+	/* Parse args for the first time to get bench name */
 	if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
 		exit(1);
-	if (!env.list && !env.bench_name) {
+
+	if (env.list)
+		return;
+
+	if (!env.bench_name) {
 		argp_help(&argp, stderr, ARGP_HELP_DOC, "bench");
 		exit(1);
 	}
+
+	/* Now check if there are custom options available. If not, then
+	 * everything is done, if yes, then we need to patch bench_parsers
+	 * so that bench_parsers[0] points to the right 'struct argp', and
+	 * bench_parsers[1] terminates the list.
+	 */
+	bench_argp = bench_name_to_argp(env.bench_name);
+	if (bench_argp) {
+		bench_parsers[0].argp = bench_argp;
+		bench_parsers[0].header = env.bench_name;
+		memset(&bench_parsers[1], 0, sizeof(bench_parsers[1]));
+
+		pos_args = 0;
+		if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
+			exit(1);
+	}
 }
 
 static void collect_measurements(long delta_ns);
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index d748255877e2..316ba0589bf7 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -101,3 +101,23 @@ static inline long atomic_swap(long *value, long n)
 {
 	return __atomic_exchange_n(value, n, __ATOMIC_RELAXED);
 }
+
+enum {
+	ARG_PROD_AFFINITY_SET = 1000,
+	ARG_CONS_AFFINITY_SET,
+	ARG_RB_BACK2BACK,
+	ARG_RB_USE_OUTPUT,
+	ARG_RB_BATCH_CNT,
+	ARG_RB_SAMPLED,
+	ARG_RB_SAMPLE_RATE,
+	ARG_NR_ENTRIES,
+	ARG_NR_HASH_FUNCS,
+	ARG_VALUE_SIZE,
+	ARG_NR_LOOPS,
+	ARG_CMP_STR_LEN,
+	ARG_NR_MAPS,
+	ARG_HASHMAP_NR_KEYS_USED,
+	ARG_NR_PROCS,
+	ARG_KTHREAD_PID,
+	ARG_QUIET,
+};
diff --git a/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c b/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
index 5bcb8a8cdeb2..bef8999d4a9e 100644
--- a/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
+++ b/tools/testing/selftests/bpf/benchs/bench_bloom_filter_map.c
@@ -45,12 +45,6 @@ static struct {
 	.value_size = 8,
 };
 
-enum {
-	ARG_NR_ENTRIES = 3000,
-	ARG_NR_HASH_FUNCS = 3001,
-	ARG_VALUE_SIZE = 3002,
-};
-
 static const struct argp_option opts[] = {
 	{ "nr_entries", ARG_NR_ENTRIES, "NR_ENTRIES", 0,
 		"Set number of expected unique entries in the bloom filter"},
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c b/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
index d0a6572bfab6..47630a9a0a4b 100644
--- a/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
@@ -16,10 +16,6 @@ static struct {
 	.nr_loops = 10,
 };
 
-enum {
-	ARG_NR_LOOPS = 4000,
-};
-
 static const struct argp_option opts[] = {
 	{ "nr_loops", ARG_NR_LOOPS, "nr_loops", 0,
 		"Set number of loops for the bpf_loop helper"},
diff --git a/tools/testing/selftests/bpf/benchs/bench_local_storage.c b/tools/testing/selftests/bpf/benchs/bench_local_storage.c
index 5a378c84e81f..dd19ddf24fde 100644
--- a/tools/testing/selftests/bpf/benchs/bench_local_storage.c
+++ b/tools/testing/selftests/bpf/benchs/bench_local_storage.c
@@ -17,11 +17,6 @@ static struct {
 	.hashmap_nr_keys_used = 1000,
 };
 
-enum {
-	ARG_NR_MAPS = 6000,
-	ARG_HASHMAP_NR_KEYS_USED = 6001,
-};
-
 static const struct argp_option opts[] = {
 	{ "nr_maps", ARG_NR_MAPS, "NR_MAPS", 0,
 		"Set number of local_storage maps"},
diff --git a/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c b/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c
index 43f109d93130..bcbcb8b90c61 100644
--- a/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c
+++ b/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c
@@ -19,12 +19,6 @@ static struct {
 	.quiet = false,
 };
 
-enum {
-	ARG_NR_PROCS = 7000,
-	ARG_KTHREAD_PID = 7001,
-	ARG_QUIET = 7002,
-};
-
 static const struct argp_option opts[] = {
 	{ "nr_procs", ARG_NR_PROCS, "NR_PROCS", 0,
 		"Set number of user processes to spin up"},
diff --git a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
index c2554f9695ff..b31ae31d1fea 100644
--- a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
+++ b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
@@ -29,14 +29,6 @@ static struct {
 	.perfbuf_sz = 128,
 };
 
-enum {
-	ARG_RB_BACK2BACK = 2000,
-	ARG_RB_USE_OUTPUT = 2001,
-	ARG_RB_BATCH_CNT = 2002,
-	ARG_RB_SAMPLED = 2003,
-	ARG_RB_SAMPLE_RATE = 2004,
-};
-
 static const struct argp_option opts[] = {
 	{ "rb-b2b", ARG_RB_BACK2BACK, NULL, 0, "Back-to-back mode"},
 	{ "rb-use-output", ARG_RB_USE_OUTPUT, NULL, 0, "Use bpf_ringbuf_output() instead of bpf_ringbuf_reserve()"},
diff --git a/tools/testing/selftests/bpf/benchs/bench_strncmp.c b/tools/testing/selftests/bpf/benchs/bench_strncmp.c
index 494b591c0289..e7d48a65511b 100644
--- a/tools/testing/selftests/bpf/benchs/bench_strncmp.c
+++ b/tools/testing/selftests/bpf/benchs/bench_strncmp.c
@@ -14,10 +14,6 @@ static struct strncmp_args {
 	.cmp_str_len = 32,
 };
 
-enum {
-	ARG_CMP_STR_LEN = 5000,
-};
-
 static const struct argp_option opts[] = {
 	{ "cmp-str-len", ARG_CMP_STR_LEN, "CMP_STR_LEN", 0,
 	  "Set the length of compared string" },
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common
  2023-01-27 18:14 [PATCH bpf-next 0/6] New benchmark for hashmap lookups Anton Protopopov
                   ` (2 preceding siblings ...)
  2023-01-27 18:14 ` [PATCH bpf-next 3/6] selftest/bpf/benchs: enhance argp parsing Anton Protopopov
@ 2023-01-27 18:14 ` Anton Protopopov
  2023-01-31  0:10   ` Andrii Nakryiko
  2023-01-27 18:14 ` [PATCH bpf-next 5/6] selftest/bpf/benchs: print less if the quiet option is set Anton Protopopov
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Anton Protopopov @ 2023-01-27 18:14 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend
  Cc: Anton Protopopov

The "local-storage-tasks-trace" benchmark has a `--quiet` option. Move it to
the list of common options so that the main code and other benchmarks can use
the (now global) env.quiet flag as well.
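
Being a common option, it can now be passed to any benchmark in the same way,
for example (illustrative invocations; both forms are accepted since the
argument of 'q' is optional):

    ./bench -q bpf-hashmap-full-update
    ./bench --quiet=1 local-storage-tasks-trace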

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 tools/testing/selftests/bpf/bench.c               | 15 +++++++++++++++
 tools/testing/selftests/bpf/bench.h               |  1 +
 .../benchs/bench_local_storage_rcu_tasks_trace.c  | 14 +-------------
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index ba93f1b268e1..42bf41a9385e 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -16,6 +16,7 @@ struct env env = {
 	.warmup_sec = 1,
 	.duration_sec = 5,
 	.affinity = false,
+	.quiet = false,
 	.consumer_cnt = 1,
 	.producer_cnt = 1,
 };
@@ -257,6 +258,7 @@ static const struct argp_option opts[] = {
 	{ "consumers", 'c', "NUM", 0, "Number of consumer threads"},
 	{ "verbose", 'v', NULL, 0, "Verbose debug output"},
 	{ "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
+	{ "quiet", 'q', "{0,1}", OPTION_ARG_OPTIONAL, "If true, be quiet"},
 	{ "prod-affinity", ARG_PROD_AFFINITY_SET, "CPUSET", 0,
 	  "Set of CPUs for producer threads; implies --affinity"},
 	{ "cons-affinity", ARG_CONS_AFFINITY_SET, "CPUSET", 0,
@@ -286,6 +288,7 @@ static int pos_args;
 
 static error_t parse_arg(int key, char *arg, struct argp_state *state)
 {
+	long ret;
 
 	switch (key) {
 	case 'v':
@@ -325,6 +328,18 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
 	case 'a':
 		env.affinity = true;
 		break;
+	case 'q':
+		if (!arg) {
+			env.quiet = true;
+		} else {
+			ret = strtol(arg, NULL, 10);
+			if (ret < 0 || ret > 1) {
+				fprintf(stderr, "invalid quiet %ld\n", ret);
+				argp_usage(state);
+			}
+			env.quiet = ret;
+		}
+		break;
 	case ARG_PROD_AFFINITY_SET:
 		env.affinity = true;
 		if (parse_num_list(arg, &env.prod_cpus.cpus,
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index 316ba0589bf7..e322654b5e8a 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -24,6 +24,7 @@ struct env {
 	bool verbose;
 	bool list;
 	bool affinity;
+	bool quiet;
 	int consumer_cnt;
 	int producer_cnt;
 	struct cpu_set prod_cpus;
diff --git a/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c b/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c
index bcbcb8b90c61..51d4b1a690f9 100644
--- a/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c
+++ b/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c
@@ -12,11 +12,9 @@
 static struct {
 	__u32 nr_procs;
 	__u32 kthread_pid;
-	bool quiet;
 } args = {
 	.nr_procs = 1000,
 	.kthread_pid = 0,
-	.quiet = false,
 };
 
 static const struct argp_option opts[] = {
@@ -24,8 +22,6 @@ static const struct argp_option opts[] = {
 		"Set number of user processes to spin up"},
 	{ "kthread_pid", ARG_KTHREAD_PID, "PID", 0,
 		"Pid of rcu_tasks_trace kthread for ticks tracking"},
-	{ "quiet", ARG_QUIET, "{0,1}", 0,
-		"If true, don't report progress"},
 	{},
 };
 
@@ -50,14 +46,6 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
 		}
 		args.kthread_pid = ret;
 		break;
-	case ARG_QUIET:
-		ret = strtol(arg, NULL, 10);
-		if (ret < 0 || ret > 1) {
-			fprintf(stderr, "invalid quiet %ld\n", ret);
-			argp_usage(state);
-		}
-		args.quiet = ret;
-		break;
 break;
 	default:
 		return ARGP_ERR_UNKNOWN;
@@ -224,7 +212,7 @@ static void report_progress(int iter, struct bench_res *res, long delta_ns)
 		exit(1);
 	}
 
-	if (args.quiet)
+	if (env.quiet)
 		return;
 
 	printf("Iter %d\t avg tasks_trace grace period latency\t%lf ns\n",
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next 5/6] selftest/bpf/benchs: print less if the quiet option is set
  2023-01-27 18:14 [PATCH bpf-next 0/6] New benchmark for hashmap lookups Anton Protopopov
                   ` (3 preceding siblings ...)
  2023-01-27 18:14 ` [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common Anton Protopopov
@ 2023-01-27 18:14 ` Anton Protopopov
  2023-01-27 18:14 ` [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups Anton Protopopov
  2023-01-31  0:17 ` [PATCH bpf-next 0/6] New " Andrii Nakryiko
  6 siblings, 0 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-27 18:14 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend
  Cc: Anton Protopopov

The bench utility will print

    Setting up benchmark '<bench-name>'...
    Benchmark '<bench-name>' started.

to stdout on startup. Suppress this output when the --quiet option is given. This
makes it simpler to parse the benchmark output with a script.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 tools/testing/selftests/bpf/bench.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 42bf41a9385e..2f34db60f819 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -648,7 +648,8 @@ static void setup_benchmark()
 		exit(1);
 	}
 
-	printf("Setting up benchmark '%s'...\n", bench->name);
+	if (!env.quiet)
+		printf("Setting up benchmark '%s'...\n", bench->name);
 
 	state.producers = calloc(env.producer_cnt, sizeof(*state.producers));
 	state.consumers = calloc(env.consumer_cnt, sizeof(*state.consumers));
@@ -694,7 +695,8 @@ static void setup_benchmark()
 					    next_cpu(&env.prod_cpus));
 	}
 
-	printf("Benchmark '%s' started.\n", bench->name);
+	if (!env.quiet)
+		printf("Benchmark '%s' started.\n", bench->name);
 }
 
 static pthread_mutex_t bench_done_mtx = PTHREAD_MUTEX_INITIALIZER;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups
  2023-01-27 18:14 [PATCH bpf-next 0/6] New benchmark for hashmap lookups Anton Protopopov
                   ` (4 preceding siblings ...)
  2023-01-27 18:14 ` [PATCH bpf-next 5/6] selftest/bpf/benchs: print less if the quiet option is set Anton Protopopov
@ 2023-01-27 18:14 ` Anton Protopopov
  2023-01-31  0:16   ` Andrii Nakryiko
  2023-01-31  0:22   ` Martin KaFai Lau
  2023-01-31  0:17 ` [PATCH bpf-next 0/6] New " Andrii Nakryiko
  6 siblings, 2 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-27 18:14 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend
  Cc: Anton Protopopov

Add a new benchmark which measures the speed of hashmap lookup operations. A user
can control the following parameters of the benchmark:

    * key_size (max 1024): the key size to use
    * max_entries: the hashmap max entries
    * nr_entries: the number of entries to insert/lookup
    * nr_loops: the number of loops for the benchmark
    * map_flags: the hashmap flags passed to BPF_MAP_CREATE

The BPF program performing the benchmark calls two nested bpf_loop helpers:

    bpf_loop(nr_loops/nr_entries)
            bpf_loop(nr_entries)
                     bpf_map_lookup()

So nr_loops determines the total number of map lookups performed: for example,
with --nr_loops=1000000 and --nr_entries=500 the outer loop runs 2000 times and
each iteration performs 500 lookups. All lookups are successful.

Example (the output is generated on an AMD Ryzen 9 3950X machine):

    for nr_entries in `seq 4096 4096 65536`; do echo -n "$((nr_entries*100/65536))% full: "; sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=4 --nr_entries=$nr_entries --max_entries=65536 --nr_loops=1000000 --map_flags=0x40 | grep cpu; done
    6% full: cpu01: lookup 50.739M ± 0.018M events/sec (approximated from 32 samples of ~19ms)
    12% full: cpu01: lookup 47.751M ± 0.015M events/sec (approximated from 32 samples of ~20ms)
    18% full: cpu01: lookup 45.153M ± 0.013M events/sec (approximated from 32 samples of ~22ms)
    25% full: cpu01: lookup 43.826M ± 0.014M events/sec (approximated from 32 samples of ~22ms)
    31% full: cpu01: lookup 41.971M ± 0.012M events/sec (approximated from 32 samples of ~23ms)
    37% full: cpu01: lookup 41.034M ± 0.015M events/sec (approximated from 32 samples of ~24ms)
    43% full: cpu01: lookup 39.946M ± 0.012M events/sec (approximated from 32 samples of ~25ms)
    50% full: cpu01: lookup 38.256M ± 0.014M events/sec (approximated from 32 samples of ~26ms)
    56% full: cpu01: lookup 36.580M ± 0.018M events/sec (approximated from 32 samples of ~27ms)
    62% full: cpu01: lookup 36.252M ± 0.012M events/sec (approximated from 32 samples of ~27ms)
    68% full: cpu01: lookup 35.200M ± 0.012M events/sec (approximated from 32 samples of ~28ms)
    75% full: cpu01: lookup 34.061M ± 0.009M events/sec (approximated from 32 samples of ~29ms)
    81% full: cpu01: lookup 34.374M ± 0.010M events/sec (approximated from 32 samples of ~29ms)
    87% full: cpu01: lookup 33.244M ± 0.011M events/sec (approximated from 32 samples of ~30ms)
    93% full: cpu01: lookup 32.182M ± 0.013M events/sec (approximated from 32 samples of ~31ms)
    100% full: cpu01: lookup 31.497M ± 0.016M events/sec (approximated from 32 samples of ~31ms)

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |   7 +
 tools/testing/selftests/bpf/bench.h           |   3 +
 .../bpf/benchs/bench_bpf_hashmap_lookup.c     | 277 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_hashmap_lookup.c  |  61 ++++
 5 files changed, 352 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_hashmap_lookup.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index a554b41fe872..cf08587b8639 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -604,6 +604,7 @@ $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
 $(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
 $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
+$(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -618,7 +619,9 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_strncmp.o \
 		 $(OUTPUT)/bench_bpf_hashmap_full_update.o \
 		 $(OUTPUT)/bench_local_storage.o \
-		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o
+		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
+		 $(OUTPUT)/bench_bpf_hashmap_lookup.o \
+		 #
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 2f34db60f819..568a0d12197d 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -272,6 +272,7 @@ extern struct argp bench_bpf_loop_argp;
 extern struct argp bench_local_storage_argp;
 extern struct argp bench_local_storage_rcu_tasks_trace_argp;
 extern struct argp bench_strncmp_argp;
+extern struct argp bench_hashmap_lookup_argp;
 
 static struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -281,6 +282,7 @@ static struct argp_child bench_parsers[] = {
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{ &bench_local_storage_rcu_tasks_trace_argp, 0,
 		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
+	{ &bench_hashmap_lookup_argp, 0, "Hashmap lookup benchmark", 0 },
 	{},
 };
 
@@ -403,6 +405,9 @@ struct argp *bench_name_to_argp(const char *bench_name)
 	if (_SCMP("bpf-loop"))
 		return &bench_bpf_loop_argp;
 
+	if (_SCMP("bpf-hashmap-lookup"))
+		return &bench_hashmap_lookup_argp;
+
 	/* no extra arguments */
 	if (_SCMP("count-global") ||
 	    _SCMP("count-local") ||
@@ -587,6 +592,7 @@ extern const struct bench bench_local_storage_cache_seq_get;
 extern const struct bench bench_local_storage_cache_interleaved_get;
 extern const struct bench bench_local_storage_cache_hashmap_control;
 extern const struct bench bench_local_storage_tasks_trace;
+extern const struct bench bench_bpf_hashmap_lookup;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -626,6 +632,7 @@ static const struct bench *benchs[] = {
 	&bench_local_storage_cache_interleaved_get,
 	&bench_local_storage_cache_hashmap_control,
 	&bench_local_storage_tasks_trace,
+	&bench_bpf_hashmap_lookup,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index e322654b5e8a..f0e477518ab9 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -121,4 +121,7 @@ enum {
 	ARG_NR_PROCS,
 	ARG_KTHREAD_PID,
 	ARG_QUIET,
+	ARG_KEY_SIZE,
+	ARG_MAP_FLAGS,
+	ARG_MAX_ENTRIES,
 };
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
new file mode 100644
index 000000000000..79805d008125
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
@@ -0,0 +1,277 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+
+#include <sys/random.h>
+#include <argp.h>
+#include "bench.h"
+#include "bpf_hashmap_lookup.skel.h"
+#include "bpf_util.h"
+
+/* BPF triggering benchmarks */
+static struct ctx {
+	struct bpf_hashmap_lookup *skel;
+} ctx;
+
+/* only available to kernel, so define it here */
+#define BPF_MAX_LOOPS (1<<23)
+
+#define MAX_KEY_SIZE 1024 /* the size of the key map */
+
+static struct {
+	__u32 key_size;
+	__u32 map_flags;
+	__u32 max_entries;
+	__u32 nr_entries;
+	__u32 nr_loops;
+} args = {
+	.key_size = 4,
+	.map_flags = 0,
+	.max_entries = 1000,
+	.nr_entries = 500,
+	.nr_loops = 1000000,
+};
+
+static const struct argp_option opts[] = {
+	{ "key_size", ARG_KEY_SIZE, "KEY_SIZE", 0,
+	  "The hashmap key size (max 1024)"},
+	{ "map_flags", ARG_MAP_FLAGS, "MAP_FLAGS", 0,
+	  "The hashmap flags passed to BPF_MAP_CREATE"},
+	{ "max_entries", ARG_MAX_ENTRIES, "MAX_ENTRIES", 0,
+	  "The hashmap max entries"},
+	{ "nr_entries", ARG_NR_ENTRIES, "NR_ENTRIES", 0,
+	  "The number of entries to insert/lookup"},
+	{ "nr_loops", ARG_NR_LOOPS, "NR_LOOPS", 0,
+	  "The number of loops for the benchmark"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	long ret;
+
+	switch (key) {
+	case ARG_KEY_SIZE:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > MAX_KEY_SIZE) {
+			fprintf(stderr, "invalid key_size");
+			argp_usage(state);
+		}
+		args.key_size = ret;
+		break;
+	case ARG_MAP_FLAGS:
+		if (!strncasecmp(arg, "0x", 2))
+			ret = strtol(arg, NULL, 0x10);
+		else
+			ret = strtol(arg, NULL, 10);
+		if (ret < 0 || ret > UINT_MAX) {
+			fprintf(stderr, "invalid map_flags");
+			argp_usage(state);
+		}
+		args.map_flags = ret;
+		break;
+	case ARG_MAX_ENTRIES:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "invalid max_entries");
+			argp_usage(state);
+		}
+		args.max_entries = ret;
+		break;
+	case ARG_NR_ENTRIES:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "invalid nr_entries");
+			argp_usage(state);
+		}
+		args.nr_entries = ret;
+		break;
+	case ARG_NR_LOOPS:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > BPF_MAX_LOOPS) {
+			fprintf(stderr, "invalid nr_loops: %ld (min=1 max=%u)\n",
+				ret, BPF_MAX_LOOPS);
+			argp_usage(state);
+		}
+		args.nr_loops = ret;
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_hashmap_lookup_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+static void validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+
+	if (args.nr_entries > args.max_entries) {
+		fprintf(stderr, "args.nr_entries is too big! (max %u, got %u)\n",
+			args.max_entries, args.nr_entries);
+		exit(1);
+	}
+}
+
+static void *producer(void *input)
+{
+	while (true) {
+		/* trigger the bpf program */
+		syscall(__NR_getpgid);
+	}
+	return NULL;
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+static void measure(struct bench_res *res)
+{
+}
+
+static inline void patch_key(u32 i, u32 *key)
+{
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	*key = i + 1;
+#else
+	*key = __builtin_bswap32(i + 1);
+#endif
+	/* the rest of key is random */
+}
+
+static void setup(void)
+{
+	struct bpf_link *link;
+	int map_fd;
+	int ret;
+	int i;
+
+	setup_libbpf();
+
+	ctx.skel = bpf_hashmap_lookup__open();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	bpf_map__set_max_entries(ctx.skel->maps.hash_map_bench, args.max_entries);
+	bpf_map__set_key_size(ctx.skel->maps.hash_map_bench, args.key_size);
+	bpf_map__set_value_size(ctx.skel->maps.hash_map_bench, 8);
+	bpf_map__set_map_flags(ctx.skel->maps.hash_map_bench, args.map_flags);
+
+	ctx.skel->bss->nr_entries = args.nr_entries;
+	ctx.skel->bss->nr_loops = args.nr_loops / args.nr_entries;
+
+	if (args.key_size > 4) {
+		for (i = 1; i < args.key_size/4; i++)
+			ctx.skel->bss->key[i] = 2654435761 * i;
+	}
+
+	ret = bpf_hashmap_lookup__load(ctx.skel);
+	if (ret) {
+		bpf_hashmap_lookup__destroy(ctx.skel);
+		fprintf(stderr, "failed to load map: %s", strerror(-ret));
+		exit(1);
+	}
+
+	/* fill in the hash_map */
+	map_fd = bpf_map__fd(ctx.skel->maps.hash_map_bench);
+	for (u64 i = 0; i < args.nr_entries; i++) {
+		patch_key(i, ctx.skel->bss->key);
+		bpf_map_update_elem(map_fd, ctx.skel->bss->key, &i, BPF_ANY);
+	}
+
+	link = bpf_program__attach(ctx.skel->progs.benchmark);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static inline double events_from_time(u64 time)
+{
+	if (time)
+		return args.nr_loops * 1000000000llu / time / 1000000.0L;
+
+	return 0;
+}
+
+static int compute_events(u64 *times, double *events_mean, double *events_stddev, u64 *mean_time)
+{
+	int i, n = 0;
+
+	*events_mean = 0;
+	*events_stddev = 0;
+	*mean_time = 0;
+
+	for (i = 0; i < 32; i++) {
+		if (!times[i])
+			break;
+		*mean_time += times[i];
+		*events_mean += events_from_time(times[i]);
+		n += 1;
+	}
+	if (!n)
+		return 0;
+
+	*mean_time /= n;
+	*events_mean /= n;
+
+	if (n > 1) {
+		for (i = 0; i < n; i++) {
+			double events_i = *events_mean - events_from_time(times[i]);
+			*events_stddev += events_i * events_i / (n - 1);
+		}
+		*events_stddev = sqrt(*events_stddev);
+	}
+
+	return n;
+}
+
+static void hashmap_report_final(struct bench_res res[], int res_cnt)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	double events_mean, events_stddev;
+	u64 mean_time;
+	int i, n;
+
+	for (i = 0; i < nr_cpus; i++) {
+		n = compute_events(ctx.skel->bss->percpu_times[i], &events_mean,
+				   &events_stddev, &mean_time);
+		if (n == 0)
+			continue;
+
+		if (env.quiet) {
+			/* we expect only one cpu to be present */
+			if (env.affinity)
+				printf("%.3lf\n", events_mean);
+			else
+				printf("cpu%02d %.3lf\n", i, events_mean);
+		} else {
+			printf("cpu%02d: lookup %.3lfM ± %.3lfM events/sec"
+			       " (approximated from %d samples of ~%lums)\n",
+			       i, events_mean, 2*events_stddev,
+			       n, mean_time / 1000000);
+		}
+	}
+}
+
+const struct bench bench_bpf_hashmap_lookup = {
+	.name = "bpf-hashmap-lookup",
+	.validate = validate,
+	.setup = setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = NULL,
+	.report_final = hashmap_report_final,
+};
diff --git a/tools/testing/selftests/bpf/progs/bpf_hashmap_lookup.c b/tools/testing/selftests/bpf/progs/bpf_hashmap_lookup.c
new file mode 100644
index 000000000000..79e41e3fb1bf
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_hashmap_lookup.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+} hash_map_bench SEC(".maps");
+
+/* The number of slots to store times */
+#define NR_SLOTS 32
+
+/* Configured by userspace */
+u64 nr_entries;
+u64 nr_loops;
+u32 __attribute__((__aligned__(8))) key[256];
+
+/* Filled by us */
+u64 __attribute__((__aligned__(256))) percpu_times_index[NR_SLOTS];
+u64 __attribute__((__aligned__(256))) percpu_times[256][NR_SLOTS];
+
+static inline void patch_key(u32 i)
+{
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	key[0] = i + 1;
+#else
+	key[0] = __builtin_bswap32(i + 1);
+#endif
+	/* the rest of key is random and is configured by userspace */
+}
+
+static int lookup_callback(__u32 index, u32 *unused)
+{
+	patch_key(index);
+	return bpf_map_lookup_elem(&hash_map_bench, key) ? 0 : 1;
+}
+
+static int loop_lookup_callback(__u32 index, u32 *unused)
+{
+	return bpf_loop(nr_entries, lookup_callback, NULL, 0) ? 0 : 1;
+}
+
+SEC("fentry/" SYS_PREFIX "sys_getpgid")
+int benchmark(void *ctx)
+{
+	u32 cpu = bpf_get_smp_processor_id();
+	u32 times_index;
+	u64 start_time;
+
+	times_index = percpu_times_index[cpu & 255] % NR_SLOTS;
+	start_time = bpf_ktime_get_ns();
+	bpf_loop(nr_loops, loop_lookup_callback, NULL, 0);
+	percpu_times[cpu & 255][times_index] = bpf_ktime_get_ns() - start_time;
+	percpu_times_index[cpu & 255] += 1;
+	return 0;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 3/6] selftest/bpf/benchs: enhance argp parsing
  2023-01-27 18:14 ` [PATCH bpf-next 3/6] selftest/bpf/benchs: enhance argp parsing Anton Protopopov
@ 2023-01-31  0:07   ` Andrii Nakryiko
  2023-01-31 13:35     ` Anton Protopopov
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2023-01-31  0:07 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> To parse command line the bench utility uses the argp_parse() function. This
> function takes as an argument a parent 'struct argp' structure which defines
> common command line options and an array of children 'struct argp' structures
> which defines additional command line options for particular benchmarks. This
> implementation doesn't allow benchmarks to share option names, e.g., if two
> benchmarks want to use, say, the --option option, then only one of them will
> succeed (the first one encountered in the array).  This will be convenient if
> we could use the same option names in different benchmarks (with the same
> semantics, e.g., --nr_loops=N).
>
> Fix this by calling the argp_parse() function twice. The first call is needed
> to find the benchmark name. Given the name, we can patch the list of argp
> children to only include the correct list of option. This way the right parser
> will be executed. (If there's no a specific list of arguments, then only one
> call is enough.) As was mentioned above, arguments with same names should have
> the same semantics (which means in this case "taking a parameter or not"), but
> can have different description and will have different parsers.
>
> As we now can share option names, this also makes sense to share the option ids.
> Previously every benchmark defined a separate enum, like
>
>     enum {
>            ARG_SMTH = 9000,
>            ARG_SMTH_ELSE = 9001,
>     };
>
> These ids were later used to distinguish between command line options:
>
>     static const struct argp_option opts[] = {
>             { "smth", ARG_SMTH, "SMTH", 0,
>                     "Set number of smth to configure smth"},
>             { "smth_else", ARG_SMTH_ELSE, "SMTH_ELSE", 0,
>                     "Set number of smth_else to configure smth else"},
>             ...
>
> Move arguments id definitions to bench.h such that they are visible to every
> benchmark (and such that there's no need to grep if this number is defined
> already or not).

On the other hand, if each benchmark defines their own set of IDs and
parser, that keeps benchmarks more self-contained. Is there a real
need to centralize everything? I don't see much benefit, tbh.

If we want to centralize, then we can just bypass all the child parser
machinery and have one centralized list of arguments. But I think it's
good to have each benchmark self-contained and independent from other
benchmarks.


>
> Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
> ---
>  tools/testing/selftests/bpf/bench.c           | 98 +++++++++++++++++--
>  tools/testing/selftests/bpf/bench.h           | 20 ++++
>  .../bpf/benchs/bench_bloom_filter_map.c       |  6 --
>  .../selftests/bpf/benchs/bench_bpf_loop.c     |  4 -
>  .../bpf/benchs/bench_local_storage.c          |  5 -
>  .../bench_local_storage_rcu_tasks_trace.c     |  6 --
>  .../selftests/bpf/benchs/bench_ringbufs.c     |  8 --
>  .../selftests/bpf/benchs/bench_strncmp.c      |  4 -
>  8 files changed, 110 insertions(+), 41 deletions(-)
>

[...]

> +struct argp *bench_name_to_argp(const char *bench_name)
> +{
> +
> +#define _SCMP(NAME) (!strcmp(bench_name, NAME))
> +
> +       if (_SCMP("bloom-lookup") ||
> +           _SCMP("bloom-update") ||
> +           _SCMP("bloom-false-positive") ||
> +           _SCMP("hashmap-without-bloom") ||
> +           _SCMP("hashmap-with-bloom"))
> +               return &bench_bloom_map_argp;
> +
> +       if (_SCMP("rb-libbpf") ||
> +           _SCMP("rb-custom") ||
> +           _SCMP("pb-libbpf") ||
> +           _SCMP("pb-custom"))
> +               return &bench_ringbufs_argp;
> +
> +       if (_SCMP("local-storage-cache-seq-get") ||
> +           _SCMP("local-storage-cache-int-get") ||
> +           _SCMP("local-storage-cache-hashmap-control"))
> +               return &bench_local_storage_argp;
> +
> +       if (_SCMP("local-storage-tasks-trace"))
> +               return &bench_local_storage_rcu_tasks_trace_argp;
> +
> +       if (_SCMP("strncmp-no-helper") ||
> +           _SCMP("strncmp-helper"))
> +               return &bench_strncmp_argp;
> +
> +       if (_SCMP("bpf-loop"))
> +               return &bench_bpf_loop_argp;
> +
> +       /* no extra arguments */
> +       if (_SCMP("count-global") ||
> +           _SCMP("count-local") ||
> +           _SCMP("rename-base") ||
> +           _SCMP("rename-kprobe") ||
> +           _SCMP("rename-kretprobe") ||
> +           _SCMP("rename-rawtp") ||
> +           _SCMP("rename-fentry") ||
> +           _SCMP("rename-fexit") ||
> +           _SCMP("trig-base") ||
> +           _SCMP("trig-tp") ||
> +           _SCMP("trig-rawtp") ||
> +           _SCMP("trig-kprobe") ||
> +           _SCMP("trig-fentry") ||
> +           _SCMP("trig-fentry-sleep") ||
> +           _SCMP("trig-fmodret") ||
> +           _SCMP("trig-uprobe-base") ||
> +           _SCMP("trig-uprobe-with-nop") ||
> +           _SCMP("trig-uretprobe-with-nop") ||
> +           _SCMP("trig-uprobe-without-nop") ||
> +           _SCMP("trig-uretprobe-without-nop") ||
> +           _SCMP("bpf-hashmap-full-update"))
> +               return NULL;
> +
> +#undef _SCMP
> +

it's not good to have to maintain a separate list of benchmark names
here. Let's maybe extend struct bench to specify extra parser and use
that to figure out if we need to run nested child parser?
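
Something like this, perhaps (just a sketch, the field name is made up):

    struct bench {
        const char *name;
        /* optional benchmark-specific argument parser, NULL if none */
        const struct argp *argp;
        ...
    };

Then parse_cmdline_args() could look the parser up through the benchs[] array
instead of maintaining a separate list of names here.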


> +       fprintf(stderr, "%s: bench %s is unknown\n", __func__, bench_name);
> +       exit(1);
> +}
> +
>  static void parse_cmdline_args(int argc, char **argv)
>  {
>         static const struct argp argp = {
> @@ -367,12 +426,35 @@ static void parse_cmdline_args(int argc, char **argv)
>                 .doc = argp_program_doc,
>                 .children = bench_parsers,
>         };
> +       static struct argp *bench_argp;

nit: do you need static?

> +
> +       /* Parse args for the first time to get bench name */
>         if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
>                 exit(1);
> -       if (!env.list && !env.bench_name) {
> +
> +       if (env.list)
> +               return;
> +
> +       if (!env.bench_name) {
>                 argp_help(&argp, stderr, ARGP_HELP_DOC, "bench");
>                 exit(1);
>         }
> +
> +       /* Now check if there are custom options available. If not, then
> +        * everything is done, if yes, then we need to patch bench_parsers
> +        * so that bench_parsers[0] points to the right 'struct argp', and
> +        * bench_parsers[1] terminates the list.
> +        */
> +       bench_argp = bench_name_to_argp(env.bench_name);
> +       if (bench_argp) {
> +               bench_parsers[0].argp = bench_argp;
> +               bench_parsers[0].header = env.bench_name;
> +               memset(&bench_parsers[1], 0, sizeof(bench_parsers[1]));
> +
> +               pos_args = 0;
> +               if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
> +                       exit(1);
> +       }

this also feels like a big hack, why can't you just construct a
single-item array based on child parser, instead of overwriting global
array?
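
i.e. something like (untested sketch):

    struct argp_child bench_children[] = {
        { bench_argp, 0, env.bench_name, 0 },
        {},
    };

built locally in parse_cmdline_args() and pointed to by a local struct argp,
so that the global bench_parsers array can stay const.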

>  }
>
>  static void collect_measurements(long delta_ns);

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common
  2023-01-27 18:14 ` [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common Anton Protopopov
@ 2023-01-31  0:10   ` Andrii Nakryiko
  2023-01-31 10:57     ` Anton Protopopov
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2023-01-31  0:10 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> The "local-storage-tasks-trace" benchmark has a `--quiet` option. Move it to
> the list of common options, so that the main code and other benchmarks can use
> (now global) env.quiet as well.
>
> Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
> ---
>  tools/testing/selftests/bpf/bench.c               | 15 +++++++++++++++
>  tools/testing/selftests/bpf/bench.h               |  1 +
>  .../benchs/bench_local_storage_rcu_tasks_trace.c  | 14 +-------------
>  3 files changed, 17 insertions(+), 13 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index ba93f1b268e1..42bf41a9385e 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -16,6 +16,7 @@ struct env env = {
>         .warmup_sec = 1,
>         .duration_sec = 5,
>         .affinity = false,
> +       .quiet = false,
>         .consumer_cnt = 1,
>         .producer_cnt = 1,
>  };
> @@ -257,6 +258,7 @@ static const struct argp_option opts[] = {
>         { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
>         { "verbose", 'v', NULL, 0, "Verbose debug output"},
>         { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> +       { "quiet", 'q', "{0,1}", OPTION_ARG_OPTIONAL, "If true, be quiet"},

given the default is not quiet, why add 0 or 1? -q for quiet, no "-q"
for not quiet? Keeping it simple?

>         { "prod-affinity", ARG_PROD_AFFINITY_SET, "CPUSET", 0,
>           "Set of CPUs for producer threads; implies --affinity"},
>         { "cons-affinity", ARG_CONS_AFFINITY_SET, "CPUSET", 0,

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups
  2023-01-27 18:14 ` [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups Anton Protopopov
@ 2023-01-31  0:16   ` Andrii Nakryiko
  2023-01-31 11:01     ` Anton Protopopov
  2023-01-31  0:22   ` Martin KaFai Lau
  1 sibling, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2023-01-31  0:16 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> Add a new benchmark which measures hashmap lookup operations speed.  A user can
> control the following parameters of the benchmark:
>
>     * key_size (max 1024): the key size to use
>     * max_entries: the hashmap max entries
>     * nr_entries: the number of entries to insert/lookup
>     * nr_loops: the number of loops for the benchmark
>     * map_flags The hashmap flags passed to BPF_MAP_CREATE
>
> The BPF program performing the benchmarks calls two nested bpf_loop:
>
>     bpf_loop(nr_loops/nr_entries)
>             bpf_loop(nr_entries)
>                      bpf_map_lookup()
>
> So the nr_loops determines the number of actual map lookups. All lookups are
> successful.
>
> Example (the output is generated on a AMD Ryzen 9 3950X machine):
>
>     for nr_entries in `seq 4096 4096 65536`; do echo -n "$((nr_entries*100/65536))% full: "; sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=4 --nr_entries=$nr_entries --max_entries=65536 --nr_loops=1000000 --map_flags=0x40 | grep cpu; done
>     6% full: cpu01: lookup 50.739M ± 0.018M events/sec (approximated from 32 samples of ~19ms)
>     12% full: cpu01: lookup 47.751M ± 0.015M events/sec (approximated from 32 samples of ~20ms)
>     18% full: cpu01: lookup 45.153M ± 0.013M events/sec (approximated from 32 samples of ~22ms)
>     25% full: cpu01: lookup 43.826M ± 0.014M events/sec (approximated from 32 samples of ~22ms)
>     31% full: cpu01: lookup 41.971M ± 0.012M events/sec (approximated from 32 samples of ~23ms)
>     37% full: cpu01: lookup 41.034M ± 0.015M events/sec (approximated from 32 samples of ~24ms)
>     43% full: cpu01: lookup 39.946M ± 0.012M events/sec (approximated from 32 samples of ~25ms)
>     50% full: cpu01: lookup 38.256M ± 0.014M events/sec (approximated from 32 samples of ~26ms)
>     56% full: cpu01: lookup 36.580M ± 0.018M events/sec (approximated from 32 samples of ~27ms)
>     62% full: cpu01: lookup 36.252M ± 0.012M events/sec (approximated from 32 samples of ~27ms)
>     68% full: cpu01: lookup 35.200M ± 0.012M events/sec (approximated from 32 samples of ~28ms)
>     75% full: cpu01: lookup 34.061M ± 0.009M events/sec (approximated from 32 samples of ~29ms)
>     81% full: cpu01: lookup 34.374M ± 0.010M events/sec (approximated from 32 samples of ~29ms)
>     87% full: cpu01: lookup 33.244M ± 0.011M events/sec (approximated from 32 samples of ~30ms)
>     93% full: cpu01: lookup 32.182M ± 0.013M events/sec (approximated from 32 samples of ~31ms)
>     100% full: cpu01: lookup 31.497M ± 0.016M events/sec (approximated from 32 samples of ~31ms)
>
> Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
> ---
>  tools/testing/selftests/bpf/Makefile          |   5 +-
>  tools/testing/selftests/bpf/bench.c           |   7 +
>  tools/testing/selftests/bpf/bench.h           |   3 +
>  .../bpf/benchs/bench_bpf_hashmap_lookup.c     | 277 ++++++++++++++++++
>  .../selftests/bpf/progs/bpf_hashmap_lookup.c  |  61 ++++
>  5 files changed, 352 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_hashmap_lookup.c
>

[...]

> +static error_t parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +       long ret;
> +
> +       switch (key) {
> +       case ARG_KEY_SIZE:
> +               ret = strtol(arg, NULL, 10);
> +               if (ret < 1 || ret > MAX_KEY_SIZE) {
> +                       fprintf(stderr, "invalid key_size");
> +                       argp_usage(state);
> +               }
> +               args.key_size = ret;
> +               break;
> +       case ARG_MAP_FLAGS:
> +               if (!strncasecmp(arg, "0x", 2))
> +                       ret = strtol(arg, NULL, 0x10);
> +               else
> +                       ret = strtol(arg, NULL, 10);

if you pass base as zero, strtol() will do this for you
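
i.e. something like (untested):

    ret = strtol(arg, NULL, 0);

base 0 auto-detects a "0x" prefix as hex and treats plain digits as decimal.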

> +               if (ret < 0 || ret > UINT_MAX) {
> +                       fprintf(stderr, "invalid map_flags");
> +                       argp_usage(state);
> +               }
> +               args.map_flags = ret;
> +               break;

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 0/6] New benchmark for hashmap lookups
  2023-01-27 18:14 [PATCH bpf-next 0/6] New benchmark for hashmap lookups Anton Protopopov
                   ` (5 preceding siblings ...)
  2023-01-27 18:14 ` [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups Anton Protopopov
@ 2023-01-31  0:17 ` Andrii Nakryiko
  2023-01-31 10:47   ` Anton Protopopov
  6 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2023-01-31  0:17 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> Add a new benchmark for hashmap lookups and fix several typos. See individual
> commits for descriptions.
>
> One thing to mention here is that in commit 3 I've patched bench so that now
> command line options can be reused by different benchmarks.
>
> The benchmark itself is added in the last commit 6. I am using this benchmark
> to test map lookup productivity when using a different hash function (see
> https://fosdem.org/2023/schedule/event/bpf_hashing/). The results provided by
> the benchmark look reasonable and match the results of my different benchmarks
> (requiring to patch kernel to get actual statistics on map lookups).

Could you share the results with us? Curious which hash functions you
tried and which ones are the most promising :)

>
> Anton Protopopov (6):
>   selftest/bpf/benchs: fix a typo in bpf_hashmap_full_update
>   selftest/bpf/benchs: make a function static in bpf_hashmap_full_update
>   selftest/bpf/benchs: enhance argp parsing
>   selftest/bpf/benchs: make quiet option common
>   selftest/bpf/benchs: print less if the quiet option is set
>   selftest/bpf/benchs: Add benchmark for hashmap lookups
>
>  tools/testing/selftests/bpf/Makefile          |   5 +-
>  tools/testing/selftests/bpf/bench.c           | 126 +++++++-
>  tools/testing/selftests/bpf/bench.h           |  24 ++
>  .../bpf/benchs/bench_bloom_filter_map.c       |   6 -
>  .../benchs/bench_bpf_hashmap_full_update.c    |   4 +-
>  .../bpf/benchs/bench_bpf_hashmap_lookup.c     | 277 ++++++++++++++++++
>  .../selftests/bpf/benchs/bench_bpf_loop.c     |   4 -
>  .../bpf/benchs/bench_local_storage.c          |   5 -
>  .../bench_local_storage_rcu_tasks_trace.c     |  20 +-
>  .../selftests/bpf/benchs/bench_ringbufs.c     |   8 -
>  .../selftests/bpf/benchs/bench_strncmp.c      |   4 -
>  .../run_bench_bpf_hashmap_full_update.sh      |   2 +-
>  .../selftests/bpf/progs/bpf_hashmap_lookup.c  |  61 ++++
>  13 files changed, 486 insertions(+), 60 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_hashmap_lookup.c
>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups
  2023-01-27 18:14 ` [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups Anton Protopopov
  2023-01-31  0:16   ` Andrii Nakryiko
@ 2023-01-31  0:22   ` Martin KaFai Lau
  2023-01-31 11:05     ` Anton Protopopov
  1 sibling, 1 reply; 25+ messages in thread
From: Martin KaFai Lau @ 2023-01-31  0:22 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend

On 1/27/23 10:14 AM, Anton Protopopov wrote:
> +/* The number of slots to store times */
> +#define NR_SLOTS 32
> +
> +/* Configured by userspace */
> +u64 nr_entries;
> +u64 nr_loops;
> +u32 __attribute__((__aligned__(8))) key[256];
> +
> +/* Filled by us */
> +u64 __attribute__((__aligned__(256))) percpu_times_index[NR_SLOTS];
> +u64 __attribute__((__aligned__(256))) percpu_times[256][NR_SLOTS];
> +
> +static inline void patch_key(u32 i)
> +{
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> +	key[0] = i + 1;
> +#else
> +	key[0] = __builtin_bswap32(i + 1);
> +#endif
> +	/* the rest of key is random and is configured by userspace */
> +}
> +
> +static int lookup_callback(__u32 index, u32 *unused)
> +{
> +	patch_key(index);
> +	return bpf_map_lookup_elem(&hash_map_bench, key) ? 0 : 1;
> +}
> +
> +static int loop_lookup_callback(__u32 index, u32 *unused)
> +{
> +	return bpf_loop(nr_entries, lookup_callback, NULL, 0) ? 0 : 1;
> +}
> +
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")
> +int benchmark(void *ctx)
> +{
> +	u32 cpu = bpf_get_smp_processor_id();
> +	u32 times_index;
> +	u64 start_time;
> +
> +	times_index = percpu_times_index[cpu & 255] % NR_SLOTS;

percpu_times_index only has NR_SLOTS (32) elements?

> +	start_time = bpf_ktime_get_ns();
> +	bpf_loop(nr_loops, loop_lookup_callback, NULL, 0);
> +	percpu_times[cpu & 255][times_index] = bpf_ktime_get_ns() - start_time;
> +	percpu_times_index[cpu & 255] += 1;
> +	return 0;
> +}


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 0/6] New benchmark for hashmap lookups
  2023-01-31  0:17 ` [PATCH bpf-next 0/6] New " Andrii Nakryiko
@ 2023-01-31 10:47   ` Anton Protopopov
  2023-01-31 18:48     ` Andrii Nakryiko
  0 siblings, 1 reply; 25+ messages in thread
From: Anton Protopopov @ 2023-01-31 10:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On 23/01/30 04:17, Andrii Nakryiko wrote:
> On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > Add a new benchmark for hashmap lookups and fix several typos. See individual
> > commits for descriptions.
> >
> > One thing to mention here is that in commit 3 I've patched bench so that now
> > command line options can be reused by different benchmarks.
> >
> > The benchmark itself is added in the last commit 6. I am using this benchmark
> > to test map lookup productivity when using a different hash function (see
> > https://fosdem.org/2023/schedule/event/bpf_hashing/). The results provided by
> > the benchmark look reasonable and match the results of my different benchmarks
> > (requiring to patch kernel to get actual statistics on map lookups).
> 
> Could you share the results with us? Curious which hash functions did
> you try and which one are the most promising :)

For the longer version with pictures, see the talk I've referenced above (it's
at FOSDEM next Sunday, Feb 5). A short version follows.

The xxh3 hash works fine for big keys, where "big" differs between
architectures and between map sizes. On my Intel i7 machine this means
key size >= 8. On my AMD machine xxh3 works better for all keys for small maps,
but degrades for keys of size 12, 16, 20 for bigger maps (>=200K elements or so).
Example (map size 100K, 50% full, measuring M ops/second):

    hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
    orig_map  15.7 15.4 14.2 13.9 13.1 13.2 12.0 12.0 11.5 11.2 10.6 10.7 10.0 10.0  9.6  9.3
    new_map   15.5 15.9 15.2 15.3 14.3 14.6 14.0 14.2 13.3 13.6 13.1 13.4 12.7 13.1 12.3 12.8

A smaller map (10K, 50% full):

    hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
    orig_map  36.1 36.8 32.1 32.4 29.0 29.1 26.2 26.4 23.9 24.3 21.8 22.5 20.4 20.7 19.3 19.1
    new_map   37.7 39.6 34.0 36.5 31.5 34.1 30.7 32.7 28.1 29.5 27.4 28.8 27.1 28.1 26.4 27.8

Other hash functions I've tried (older xxh32/64, spooky) work _way_ worse for
small keys than jhash/xxh3. (Answering a possible question about vector instructions:
xxh3 is scalar for key sizes <= 240; above that, xxh64/xxh32 can be used, which
also provides a ~2x map lookup speed gain compared to jhash for keys > 240.)

Bloom filters for big keys (>= ~40 bytes), predictably, work way faster with xxh3
than with jhash. (Why isn't the gain similar to the hashmap? Because Bloom filters
use jhash2 for keys whose size is a multiple of 4, which works faster than jhash
for small keys; see also the patch below.)

The stacktrace map doesn't get much faster, because it spends 95% of its time in
get_perf_callchain (the hash part, though, runs ~1.5-2.0x faster with xxh, as
the hashed size is typically about 60-90 bytes, so it makes sense to use xxh3 in
stacktrace by default).

One very simple change which brings a 5-10% speed gain for all hashmaps is this:

static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
 {
+       if (likely((key_len & 0x3) == 0))
+               return jhash2(key, key_len >> 2, hashrnd);
        return jhash(key, key_len, hashrnd);
 }

I will follow up with a patch as simple as this ^ or with a combination of
jhash, jhash2, and xxh3 once I have run benchmarks on more architectures to
check that there are no degradations.
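For reference, a purely hypothetical sketch of what such a combination could
look like (the 240-byte cutoff is taken from the numbers above, and the
in-tree xxh64() stands in for xxh3, which is not in lib/xxhash.c yet; this is
not a posted patch):

#include <linux/jhash.h>
#include <linux/xxhash.h>

static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
{
	/* very long keys: roughly 2x faster lookups than jhash per the
	 * measurements above
	 */
	if (key_len > 240)
		return (u32)xxh64(key, key_len, hashrnd);
	/* keys whose size is a multiple of 4: jhash2 is 5-10% faster */
	if (likely((key_len & 0x3) == 0))
		return jhash2(key, key_len >> 2, hashrnd);
	return jhash(key, key_len, hashrnd);
}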

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common
  2023-01-31  0:10   ` Andrii Nakryiko
@ 2023-01-31 10:57     ` Anton Protopopov
  2023-01-31 18:51       ` Andrii Nakryiko
  2023-01-31 18:57       ` Anton Protopopov
  0 siblings, 2 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-31 10:57 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On 23/01/30 04:10, Andrii Nakryiko wrote:
> On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > The "local-storage-tasks-trace" benchmark has a `--quiet` option. Move it to
> > the list of common options, so that the main code and other benchmarks can use
> > (now global) env.quiet as well.
> >
> > Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
> > ---
> >  tools/testing/selftests/bpf/bench.c               | 15 +++++++++++++++
> >  tools/testing/selftests/bpf/bench.h               |  1 +
> >  .../benchs/bench_local_storage_rcu_tasks_trace.c  | 14 +-------------
> >  3 files changed, 17 insertions(+), 13 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > index ba93f1b268e1..42bf41a9385e 100644
> > --- a/tools/testing/selftests/bpf/bench.c
> > +++ b/tools/testing/selftests/bpf/bench.c
> > @@ -16,6 +16,7 @@ struct env env = {
> >         .warmup_sec = 1,
> >         .duration_sec = 5,
> >         .affinity = false,
> > +       .quiet = false,
> >         .consumer_cnt = 1,
> >         .producer_cnt = 1,
> >  };
> > @@ -257,6 +258,7 @@ static const struct argp_option opts[] = {
> >         { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
> >         { "verbose", 'v', NULL, 0, "Verbose debug output"},
> >         { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> > +       { "quiet", 'q', "{0,1}", OPTION_ARG_OPTIONAL, "If true, be quiet"},
> 
> given the default is not quiet, why add 0 or 1? -q for quiet, no "-q"
> for not quiet? Keeping it simple?

The local-storage-tasks-trace benchmark expected 0 or 1 there, so I didn't want
to break any script which utilizes this option.

The new parser accepts the old --quiet=0|1 for consistency, but also -q|--quiet
without a value, as you've suggested (I pass OPTION_ARG_OPTIONAL and set
quiet=true if arg is NULL in the new parser).
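A minimal sketch of that idea (an illustrative helper, not the exact hunk
from the patch): with OPTION_ARG_OPTIONAL, argp hands the parser arg == NULL
when -q/--quiet is given without a value, and the old --quiet=0|1 form still
parses.

#include <stdbool.h>
#include <stdlib.h>

static bool parse_quiet_arg(const char *arg)
{
	/* no value given: plain -q / --quiet means "be quiet" */
	if (!arg)
		return true;
	/* legacy form: --quiet=0 or --quiet=1 */
	return strtol(arg, NULL, 10) != 0;
}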

> >         { "prod-affinity", ARG_PROD_AFFINITY_SET, "CPUSET", 0,
> >           "Set of CPUs for producer threads; implies --affinity"},
> >         { "cons-affinity", ARG_CONS_AFFINITY_SET, "CPUSET", 0,
> 
> [...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups
  2023-01-31  0:16   ` Andrii Nakryiko
@ 2023-01-31 11:01     ` Anton Protopopov
  0 siblings, 0 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-31 11:01 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On 23/01/30 04:16, Andrii Nakryiko wrote:
> On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> [...]
> > +
> > +       switch (key) {
> > +       case ARG_KEY_SIZE:
> > +               ret = strtol(arg, NULL, 10);
> > +               if (ret < 1 || ret > MAX_KEY_SIZE) {
> > +                       fprintf(stderr, "invalid key_size");
> > +                       argp_usage(state);
> > +               }
> > +               args.key_size = ret;
> > +               break;
> > +       case ARG_MAP_FLAGS:
> > +               if (!strncasecmp(arg, "0x", 2))
> > +                       ret = strtol(arg, NULL, 0x10);
> > +               else
> > +                       ret = strtol(arg, NULL, 10);
> 
> if you pass base as zero, strtol() will do this for you

Thanks, I didn't remember this!

> > +               if (ret < 0 || ret > UINT_MAX) {
> > +                       fprintf(stderr, "invalid map_flags");
> > +                       argp_usage(state);
> > +               }
> > +               args.map_flags = ret;
> > +               break;
> 
> [...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups
  2023-01-31  0:22   ` Martin KaFai Lau
@ 2023-01-31 11:05     ` Anton Protopopov
  2023-01-31 22:50       ` Martin KaFai Lau
  0 siblings, 1 reply; 25+ messages in thread
From: Anton Protopopov @ 2023-01-31 11:05 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend

On 23/01/30 04:22, Martin KaFai Lau wrote:
> On 1/27/23 10:14 AM, Anton Protopopov wrote:
> > +/* The number of slots to store times */
> > +#define NR_SLOTS 32
> > +
> > +/* Configured by userspace */
> > +u64 nr_entries;
> > +u64 nr_loops;
> > +u32 __attribute__((__aligned__(8))) key[256];
> > +
> > +/* Filled by us */
> > +u64 __attribute__((__aligned__(256))) percpu_times_index[NR_SLOTS];
> > +u64 __attribute__((__aligned__(256))) percpu_times[256][NR_SLOTS];
> > +
> > +static inline void patch_key(u32 i)
> > +{
> > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > +	key[0] = i + 1;
> > +#else
> > +	key[0] = __builtin_bswap32(i + 1);
> > +#endif
> > +	/* the rest of key is random and is configured by userspace */
> > +}
> > +
> > +static int lookup_callback(__u32 index, u32 *unused)
> > +{
> > +	patch_key(index);
> > +	return bpf_map_lookup_elem(&hash_map_bench, key) ? 0 : 1;
> > +}
> > +
> > +static int loop_lookup_callback(__u32 index, u32 *unused)
> > +{
> > +	return bpf_loop(nr_entries, lookup_callback, NULL, 0) ? 0 : 1;
> > +}
> > +
> > +SEC("fentry/" SYS_PREFIX "sys_getpgid")
> > +int benchmark(void *ctx)
> > +{
> > +	u32 cpu = bpf_get_smp_processor_id();
> > +	u32 times_index;
> > +	u64 start_time;
> > +
> > +	times_index = percpu_times_index[cpu & 255] % NR_SLOTS;
> 
> percpu_times_index only has NR_SLOTS (32) elements?

Yes, the idea was the following. One measurement (bpf prog execution) takes
about 20-80 ms (depending on the key/map size). So in 2-3 seconds we can get
about NR_SLOTS elements. To me, 32 looked like enough to get stats for this
benchmark. Do you think it would be better to make NR_SLOTS
bigger/configurable?

> > +	start_time = bpf_ktime_get_ns();
> > +	bpf_loop(nr_loops, loop_lookup_callback, NULL, 0);
> > +	percpu_times[cpu & 255][times_index] = bpf_ktime_get_ns() - start_time;
> > +	percpu_times_index[cpu & 255] += 1;
> > +	return 0;
> > +}
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 3/6] selftest/bpf/benchs: enhance argp parsing
  2023-01-31  0:07   ` Andrii Nakryiko
@ 2023-01-31 13:35     ` Anton Protopopov
  0 siblings, 0 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-31 13:35 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On 23/01/30 04:07, Andrii Nakryiko wrote:
> On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > To parse command line the bench utility uses the argp_parse() function. This
> > function takes as an argument a parent 'struct argp' structure which defines
> > common command line options and an array of children 'struct argp' structures
> > which defines additional command line options for particular benchmarks. This
> > implementation doesn't allow benchmarks to share option names, e.g., if two
> > benchmarks want to use, say, the --option option, then only one of them will
> > succeed (the first one encountered in the array).  This will be convenient if
> > we could use the same option names in different benchmarks (with the same
> > semantics, e.g., --nr_loops=N).
> >
> > Fix this by calling the argp_parse() function twice. The first call is needed
> > to find the benchmark name. Given the name, we can patch the list of argp
> > children to only include the correct list of option. This way the right parser
> > will be executed. (If there's no a specific list of arguments, then only one
> > call is enough.) As was mentioned above, arguments with same names should have
> > the same semantics (which means in this case "taking a parameter or not"), but
> > can have different description and will have different parsers.
> >
> > As we now can share option names, this also makes sense to share the option ids.
> > Previously every benchmark defined a separate enum, like
> >
> >     enum {
> >            ARG_SMTH = 9000,
> >            ARG_SMTH_ELSE = 9001,
> >     };
> >
> > These ids were later used to distinguish between command line options:
> >
> >     static const struct argp_option opts[] = {
> >             { "smth", ARG_SMTH, "SMTH", 0,
> >                     "Set number of smth to configure smth"},
> >             { "smth_else", ARG_SMTH_ELSE, "SMTH_ELSE", 0,
> >                     "Set number of smth_else to configure smth else"},
> >             ...
> >
> > Move arguments id definitions to bench.h such that they are visible to every
> > benchmark (and such that there's no need to grep if this number is defined
> > already or not).
> 
> On the other hand, if each benchmark defines their own set of IDs and
> parser, that keeps benchmarks more self-contained. Is there a real
> need to centralize everything? I don't see much benefit, tbh.
>
> If we want to centralize, then we can just bypass all the child parser
> machinery and have one centralized list of arguments. But I think it's
> good to have each benchmark self-contained and independent from other
> benchmarks.

When I was adding a new bench, it looked simpler to just add IDs to a global
list than to grep for which IDs are not yet taken. This doesn't matter much,
though; I can switch back to local IDs.

> >
> > Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
> > ---
> >  tools/testing/selftests/bpf/bench.c           | 98 +++++++++++++++++--
> >  tools/testing/selftests/bpf/bench.h           | 20 ++++
> >  .../bpf/benchs/bench_bloom_filter_map.c       |  6 --
> >  .../selftests/bpf/benchs/bench_bpf_loop.c     |  4 -
> >  .../bpf/benchs/bench_local_storage.c          |  5 -
> >  .../bench_local_storage_rcu_tasks_trace.c     |  6 --
> >  .../selftests/bpf/benchs/bench_ringbufs.c     |  8 --
> >  .../selftests/bpf/benchs/bench_strncmp.c      |  4 -
> >  8 files changed, 110 insertions(+), 41 deletions(-)
> >
> 
> [...]
> 
> > +struct argp *bench_name_to_argp(const char *bench_name)
> > +{
> > +
> > +#define _SCMP(NAME) (!strcmp(bench_name, NAME))
> > +
> > +       if (_SCMP("bloom-lookup") ||
> > +           _SCMP("bloom-update") ||
> > +           _SCMP("bloom-false-positive") ||
> > +           _SCMP("hashmap-without-bloom") ||
> > +           _SCMP("hashmap-with-bloom"))
> > +               return &bench_bloom_map_argp;
> > +
> > +       if (_SCMP("rb-libbpf") ||
> > +           _SCMP("rb-custom") ||
> > +           _SCMP("pb-libbpf") ||
> > +           _SCMP("pb-custom"))
> > +               return &bench_ringbufs_argp;
> > +
> > +       if (_SCMP("local-storage-cache-seq-get") ||
> > +           _SCMP("local-storage-cache-int-get") ||
> > +           _SCMP("local-storage-cache-hashmap-control"))
> > +               return &bench_local_storage_argp;
> > +
> > +       if (_SCMP("local-storage-tasks-trace"))
> > +               return &bench_local_storage_rcu_tasks_trace_argp;
> > +
> > +       if (_SCMP("strncmp-no-helper") ||
> > +           _SCMP("strncmp-helper"))
> > +               return &bench_strncmp_argp;
> > +
> > +       if (_SCMP("bpf-loop"))
> > +               return &bench_bpf_loop_argp;
> > +
> > +       /* no extra arguments */
> > +       if (_SCMP("count-global") ||
> > +           _SCMP("count-local") ||
> > +           _SCMP("rename-base") ||
> > +           _SCMP("rename-kprobe") ||
> > +           _SCMP("rename-kretprobe") ||
> > +           _SCMP("rename-rawtp") ||
> > +           _SCMP("rename-fentry") ||
> > +           _SCMP("rename-fexit") ||
> > +           _SCMP("trig-base") ||
> > +           _SCMP("trig-tp") ||
> > +           _SCMP("trig-rawtp") ||
> > +           _SCMP("trig-kprobe") ||
> > +           _SCMP("trig-fentry") ||
> > +           _SCMP("trig-fentry-sleep") ||
> > +           _SCMP("trig-fmodret") ||
> > +           _SCMP("trig-uprobe-base") ||
> > +           _SCMP("trig-uprobe-with-nop") ||
> > +           _SCMP("trig-uretprobe-with-nop") ||
> > +           _SCMP("trig-uprobe-without-nop") ||
> > +           _SCMP("trig-uretprobe-without-nop") ||
> > +           _SCMP("bpf-hashmap-full-update"))
> > +               return NULL;
> > +
> > +#undef _SCMP
> > +
> 
> it's not good to have to maintain a separate list of benchmark names
> here. Let's maybe extend struct bench to specify extra parser and use
> that to figure out if we need to run nested child parser?

Yes, right, so then I can do something like

struct argp *bench_name_to_argp(const char *bench_name)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(benchs); i++)
                if (!strcmp(bench_name, benchs[i]->name))
                        return benchs[i]->argp;

        fprintf(stderr, "benchmark '%s' not found\n", bench_name);
        exit(1);
}

> 
> > +       fprintf(stderr, "%s: bench %s is unknown\n", __func__, bench_name);
> > +       exit(1);
> > +}
> > +
> >  static void parse_cmdline_args(int argc, char **argv)
> >  {
> >         static const struct argp argp = {
> > @@ -367,12 +426,35 @@ static void parse_cmdline_args(int argc, char **argv)
> >                 .doc = argp_program_doc,
> >                 .children = bench_parsers,
> >         };
> > +       static struct argp *bench_argp;
> 
> nit: do you need static?

No, thanks for pointing that out.

> 
> > +
> > +       /* Parse args for the first time to get bench name */
> >         if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
> >                 exit(1);
> > -       if (!env.list && !env.bench_name) {

Also, I will switch to a simpler flow here: just find the bench_name, then
construct the correct `struct argp`, then call argp_parse() once.
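A rough sketch of that simpler flow (the helper name find_bench() and the new
`argp` member of struct bench are assumptions for illustration, not code from
the posted patches):

static void parse_cmdline_args(int argc, char **argv)
{
	/* Scan argv for the benchmark name first (assumed helper). */
	const struct bench *bench = find_bench(argc, argv);
	struct argp_child children[2] = {};	/* zeroed entry terminates the list */
	struct argp argp = {
		.options = opts,
		.parser = parse_arg,
		.doc = argp_program_doc,
	};

	/* Attach the benchmark's own parser, if any, as the only child. */
	if (bench && bench->argp) {
		children[0].argp = bench->argp;
		children[0].header = bench->name;
		argp.children = children;
	}

	/* A single argp_parse() call handles both common and per-bench options. */
	if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
		exit(1);
}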

> > +
> > +       if (env.list)
> > +               return;
> > +
> > +       if (!env.bench_name) {
> >                 argp_help(&argp, stderr, ARGP_HELP_DOC, "bench");
> >                 exit(1);
> >         }
> > +
> > +       /* Now check if there are custom options available. If not, then
> > +        * everything is done, if yes, then we need to patch bench_parsers
> > +        * so that bench_parsers[0] points to the right 'struct argp', and
> > +        * bench_parsers[1] terminates the list.
> > +        */
> > +       bench_argp = bench_name_to_argp(env.bench_name);
> > +       if (bench_argp) {
> > +               bench_parsers[0].argp = bench_argp;
> > +               bench_parsers[0].header = env.bench_name;
> > +               memset(&bench_parsers[1], 0, sizeof(bench_parsers[1]));
> > +
> > +               pos_args = 0;
> > +               if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
> > +                       exit(1);
> > +       }
> 
> this also feels like a big hack, why can't you just construct a
> single-item array based on child parser, instead of overwriting global
> array?

Sure, thanks.

> >  }
> >
> >  static void collect_measurements(long delta_ns);
> 
> [...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 0/6] New benchmark for hashmap lookups
  2023-01-31 10:47   ` Anton Protopopov
@ 2023-01-31 18:48     ` Andrii Nakryiko
  2023-01-31 19:18       ` Anton Protopopov
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2023-01-31 18:48 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On Tue, Jan 31, 2023 at 2:47 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> On 23/01/30 04:17, Andrii Nakryiko wrote:
> > On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > Add a new benchmark for hashmap lookups and fix several typos. See individual
> > > commits for descriptions.
> > >
> > > One thing to mention here is that in commit 3 I've patched bench so that now
> > > command line options can be reused by different benchmarks.
> > >
> > > The benchmark itself is added in the last commit 6. I am using this benchmark
> > > to test map lookup productivity when using a different hash function (see
> > > https://fosdem.org/2023/schedule/event/bpf_hashing/). The results provided by
> > > the benchmark look reasonable and match the results of my different benchmarks
> > > (requiring to patch kernel to get actual statistics on map lookups).
> >
> > Could you share the results with us? Curious which hash functions did
> > you try and which one are the most promising :)
>
> For the longer version with pictures see the talk I've referenced above (it's
> at FOSDEM next Sunday Feb 5). A short version follows.

Yep, I'll try to watch it.

>
> The xxh3 hash works fine for big keys, where "big" is different for different
> architectures and for different maps sizes. On my Intel i7 machine this means
> key size >= 8. On my AMD machine xxh3 works better for all keys for small maps,
> but degrades for keys of size 12,16,20 for bigger maps (>=200K elements or so).
> Example (map size 100K, 50% full, measuring M ops/second):

Nice, I was hoping you would look at xxh3, as I've been meaning to try
it out as well (have dirty patches to introduce xxh3 into
lib/xxhash.c, but didn't get to actual benchmarking).


Despite this AMD-specific degradation (which is interesting in its own
right, could it be some fluke in testing?), I think it's a good idea
to switch from jhash to xxh3, as it seems almost universally better.
See also below.

>
>     hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
>     orig_map  15.7 15.4 14.2 13.9 13.1 13.2 12.0 12.0 11.5 11.2 10.6 10.7 10.0 10.0  9.6  9.3
>     new_map   15.5 15.9 15.2 15.3 14.3 14.6 14.0 14.2 13.3 13.6 13.1 13.4 12.7 13.1 12.3 12.8
>
> A smaller map (10K, 50% full):
>
>     hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
>     orig_map  36.1 36.8 32.1 32.4 29.0 29.1 26.2 26.4 23.9 24.3 21.8 22.5 20.4 20.7 19.3 19.1
>     new_map   37.7 39.6 34.0 36.5 31.5 34.1 30.7 32.7 28.1 29.5 27.4 28.8 27.1 28.1 26.4 27.8
>
> Other hash functions I've tried (older xxh32/64, spooky) work _way_ worse for
> small keys than jhash/xxh3. (Answering a possible question about vector instructions:
> xxh3 is scalar until key size <= 240, then the xxh64/xxh32 can be used which
> also provides ~2x map lookup speed gain comparing to jhash for keys >240.)

Yeah, not surprising. xxh32/64 were optimized for long byte arrays, not
for short keys, while xxh3 puts a lot of attention on short keys. Do
you see xxh64 being faster than xxh3 for long keys, as implemented in
the kernel? The kernel doesn't use the SSE2/AVX versions, just the pure
scalar code, so from reading the xxh3/xxh64 author's benchmarks, xxh3
should win in all situations.

>
> Bloom filters for big >= ~40 keys, predictably, work way faster with xxh3 than
> with jhash. (Why not similar to hashmap? Because Bloom filters use jhash2 for
> keys % 4 which works faster than jhash for small keys, see also a patch below.)
>
> The stacktrace map doesn't work much faster, because 95% of time it spends in
> get_perf_callchain (the hash part, though, runs ~1.5-2.0 faster with xxh, as
> hash size is typically about 60-90 bytes, so this makes sense to use xxh3 in
> stacktrace by default).

For stacktrace a very important aspect would be to pay attention to (and
minimize) hash collisions, though. This was a big problem with
bpf_get_stackid() and the STACK_TRACE map (and what motivated
bpf_get_stack()). Even with a big, sparsely populated map we'd get a
lot of collisions between stack traces. xxh3 should have a much better
distribution, so in production it should result in fewer
dropped/replaced stack traces. If you get a chance, it would be nice
to collect these stats for the jhash and xxh3-based implementations. Note
that the kernel's jhash2 seems to be what SMHasher ([0]) denotes as
lookup3 (as does Jenkins himself). It's not a very good hash anymore
in terms of distribution (and throughput as well), compared to xxh3
(and lots of other more modern hashes).

  [0] https://github.com/rurban/smhasher

>
> One very simple change which brings 5-10% speed gain for all hashmaps is this:
>
> static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
>  {
> +       if (likely((key_len & 0x3) == 0))
> +               return jhash2(key, key_len >> 2, hashrnd);
>         return jhash(key, key_len, hashrnd);
>  }
>
> I will follow up with a patch as simple as this ^ or with a combination of
> jhash, jhash2, and xxh3 once I will run benchmarks on more architectures to
> check that there are no degradations.


Sounds good, looking forward to it!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common
  2023-01-31 10:57     ` Anton Protopopov
@ 2023-01-31 18:51       ` Andrii Nakryiko
  2023-01-31 18:57       ` Anton Protopopov
  1 sibling, 0 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2023-01-31 18:51 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On Tue, Jan 31, 2023 at 2:57 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> On 23/01/30 04:10, Andrii Nakryiko wrote:
> > On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > The "local-storage-tasks-trace" benchmark has a `--quiet` option. Move it to
> > > the list of common options, so that the main code and other benchmarks can use
> > > (now global) env.quiet as well.
> > >
> > > Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
> > > ---
> > >  tools/testing/selftests/bpf/bench.c               | 15 +++++++++++++++
> > >  tools/testing/selftests/bpf/bench.h               |  1 +
> > >  .../benchs/bench_local_storage_rcu_tasks_trace.c  | 14 +-------------
> > >  3 files changed, 17 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > > index ba93f1b268e1..42bf41a9385e 100644
> > > --- a/tools/testing/selftests/bpf/bench.c
> > > +++ b/tools/testing/selftests/bpf/bench.c
> > > @@ -16,6 +16,7 @@ struct env env = {
> > >         .warmup_sec = 1,
> > >         .duration_sec = 5,
> > >         .affinity = false,
> > > +       .quiet = false,
> > >         .consumer_cnt = 1,
> > >         .producer_cnt = 1,
> > >  };
> > > @@ -257,6 +258,7 @@ static const struct argp_option opts[] = {
> > >         { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
> > >         { "verbose", 'v', NULL, 0, "Verbose debug output"},
> > >         { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> > > +       { "quiet", 'q', "{0,1}", OPTION_ARG_OPTIONAL, "If true, be quiet"},
> >
> > given the default is not quiet, why add 0 or 1? -q for quiet, no "-q"
> > for not quiet? Keeping it simple?
>
> The local-storage-tasks-trace benchmark expected 0 or 1 there, so I didn't want
> to break any script which utilize this option.
>
> The new parser accepts the old --quiet=0|1 for consistency, but also -q|--quiet
> without value, as you've suggested (I pass OPTION_ARG_OPTIONAL and set
> quiet=true if arg is NULL in the new parser).
>

I think it was mostly due to copy/pasting some other integer-based
argument handling. We don't need that for boolean flags. Let's just
fix benchs/run_bench_local_storage_rcu_tasks_trace.sh to do -q and
keep it simple?

> > >         { "prod-affinity", ARG_PROD_AFFINITY_SET, "CPUSET", 0,
> > >           "Set of CPUs for producer threads; implies --affinity"},
> > >         { "cons-affinity", ARG_CONS_AFFINITY_SET, "CPUSET", 0,
> >
> > [...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common
  2023-01-31 10:57     ` Anton Protopopov
  2023-01-31 18:51       ` Andrii Nakryiko
@ 2023-01-31 18:57       ` Anton Protopopov
  1 sibling, 0 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-01-31 18:57 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On 23/01/31 11:57, Anton Protopopov wrote:
> On 23/01/30 04:10, Andrii Nakryiko wrote:
> > On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > The "local-storage-tasks-trace" benchmark has a `--quiet` option. Move it to
> > > the list of common options, so that the main code and other benchmarks can use
> > > (now global) env.quiet as well.
> > >
> > > Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
> > > ---
> > >  tools/testing/selftests/bpf/bench.c               | 15 +++++++++++++++
> > >  tools/testing/selftests/bpf/bench.h               |  1 +
> > >  .../benchs/bench_local_storage_rcu_tasks_trace.c  | 14 +-------------
> > >  3 files changed, 17 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > > index ba93f1b268e1..42bf41a9385e 100644
> > > --- a/tools/testing/selftests/bpf/bench.c
> > > +++ b/tools/testing/selftests/bpf/bench.c
> > > @@ -16,6 +16,7 @@ struct env env = {
> > >         .warmup_sec = 1,
> > >         .duration_sec = 5,
> > >         .affinity = false,
> > > +       .quiet = false,
> > >         .consumer_cnt = 1,
> > >         .producer_cnt = 1,
> > >  };
> > > @@ -257,6 +258,7 @@ static const struct argp_option opts[] = {
> > >         { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
> > >         { "verbose", 'v', NULL, 0, "Verbose debug output"},
> > >         { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> > > +       { "quiet", 'q', "{0,1}", OPTION_ARG_OPTIONAL, "If true, be quiet"},
> > 
> > given the default is not quiet, why add 0 or 1? -q for quiet, no "-q"
> > for not quiet? Keeping it simple?
> 
> The local-storage-tasks-trace benchmark expected 0 or 1 there, so I didn't want
> to break any script which utilize this option.
> 
> The new parser accepts the old --quiet=0|1 for consistency, but also -q|--quiet
> without value, as you've suggested (I pass OPTION_ARG_OPTIONAL and set
> quiet=true if arg is NULL in the new parser).

Sure, will do; --quiet=1 looks weird to me as well.

> > >         { "prod-affinity", ARG_PROD_AFFINITY_SET, "CPUSET", 0,
> > >           "Set of CPUs for producer threads; implies --affinity"},
> > >         { "cons-affinity", ARG_CONS_AFFINITY_SET, "CPUSET", 0,
> > 
> > [...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 0/6] New benchmark for hashmap lookups
  2023-01-31 18:48     ` Andrii Nakryiko
@ 2023-01-31 19:18       ` Anton Protopopov
  2023-02-01  0:02         ` Andrii Nakryiko
  0 siblings, 1 reply; 25+ messages in thread
From: Anton Protopopov @ 2023-01-31 19:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On 23/01/31 10:48, Andrii Nakryiko wrote:
> On Tue, Jan 31, 2023 at 2:47 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > On 23/01/30 04:17, Andrii Nakryiko wrote:
> > > On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > >
> > > > Add a new benchmark for hashmap lookups and fix several typos. See individual
> > > > commits for descriptions.
> > > >
> > > > One thing to mention here is that in commit 3 I've patched bench so that now
> > > > command line options can be reused by different benchmarks.
> > > >
> > > > The benchmark itself is added in the last commit 6. I am using this benchmark
> > > > to test map lookup productivity when using a different hash function (see
> > > > https://fosdem.org/2023/schedule/event/bpf_hashing/). The results provided by
> > > > the benchmark look reasonable and match the results of my different benchmarks
> > > > (requiring to patch kernel to get actual statistics on map lookups).
> > >
> > > Could you share the results with us? Curious which hash functions did
> > > you try and which one are the most promising :)
> >
> > For the longer version with pictures see the talk I've referenced above (it's
> > at FOSDEM next Sunday Feb 5). A short version follows.
> 
> Yep, I'll try to watch it.
> 
> >
> > The xxh3 hash works fine for big keys, where "big" is different for different
> > architectures and for different maps sizes. On my Intel i7 machine this means
> > key size >= 8. On my AMD machine xxh3 works better for all keys for small maps,
> > but degrades for keys of size 12,16,20 for bigger maps (>=200K elements or so).
> > Example (map size 100K, 50% full, measuring M ops/second):
> 
> Nice, I was hoping you would look at xxh3, as I've been meaning to try
> it out as well (have dirty patches to introduce xxh3 into
> lib/xxhash.c, but didn't get to actual benchmarking).

My first attempt was with lib/xxhash.c, and it looked good at first glance
(it outperformed every other hash in my hash benchmark). However, when used inside
the hashmap, it behaved way worse than expected, so I had to inline it.

> Despite this AMD-specific degradation (which is interesting in its own
> right, could it be some fluke in testing?), I think it's a good idea
> to switch from jhash to xxh3, as it seems almost universally better.
> See also below.
> >
> >     hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
> >     orig_map  15.7 15.4 14.2 13.9 13.1 13.2 12.0 12.0 11.5 11.2 10.6 10.7 10.0 10.0  9.6  9.3
> >     new_map   15.5 15.9 15.2 15.3 14.3 14.6 14.0 14.2 13.3 13.6 13.1 13.4 12.7 13.1 12.3 12.8
> >
> > A smaller map (10K, 50% full):
> >
> >     hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
> >     orig_map  36.1 36.8 32.1 32.4 29.0 29.1 26.2 26.4 23.9 24.3 21.8 22.5 20.4 20.7 19.3 19.1
> >     new_map   37.7 39.6 34.0 36.5 31.5 34.1 30.7 32.7 28.1 29.5 27.4 28.8 27.1 28.1 26.4 27.8
> >
> > Other hash functions I've tried (older xxh32/64, spooky) work _way_ worse for
> > small keys than jhash/xxh3. (Answering a possible question about vector instructions:
> > xxh3 is scalar until key size <= 240, then the xxh64/xxh32 can be used which
> > also provides ~2x map lookup speed gain comparing to jhash for keys >240.)
> 
> Yeah, not suprising. xxh32/64 were optimized for long byte arrays, not
> for short keys. While xxh3 puts a lot of attention on short keys. Do
> you see xxh64 being faster than xxh3 for longs keys, as implemented in
> kernel? Kernel doesn't use SSE2/AVX versions, just purely scalars, so
> from reading benchmarks of xxh3/xxh64 author, xxh3 should win in all
> situations.

For keys longer than 240 the scalar xxh3 works way worse than xxhash. BTW, do
you know of use cases where hashmap keys are > 240? (For cilium/tetragon the most
interesting use cases are keys of size ~4-40.)

> >
> > Bloom filters for big >= ~40 keys, predictably, work way faster with xxh3 than
> > with jhash. (Why not similar to hashmap? Because Bloom filters use jhash2 for
> > keys % 4 which works faster than jhash for small keys, see also a patch below.)
> >
> > The stacktrace map doesn't work much faster, because 95% of time it spends in
> > get_perf_callchain (the hash part, though, runs ~1.5-2.0 faster with xxh, as
> > hash size is typically about 60-90 bytes, so this makes sense to use xxh3 in
> > stacktrace by default).
> 
> For stacktrace very important aspect would be to pay attention (and
> minimize) hash collisions, though. This was a big problem with
> bpf_get_stackid() and STACK_TRACE map (and what motivated
> bpf_get_stack()). Even with a big sparsely populated map we'd get a
> lot of collisions between stack traces. xxh3 should have much better
> distribution, so in production it should result in less
> dropped/replaced stack traces. If you get a chance, it would be nice
> to collect these stats for jhash and xxh3-based implementations. Note
> that kernel's jhash2 seems to be what SMHasher ([0]) denotes as
> lookup3 (as does Jenkins himself). It's not a very good hash anymore
> in terms of distribution (and throughput as well), compared to xxh3
> (and lots of other more modern hashes).
> 
>   [0] https://github.com/rurban/smhasher

Ok, this makes sense. Given that for stacktrace xxh3 also works
about twice as fast (for stack depths of 10 and more), I see no problem just
using it as is (with the caveat that for key sizes of 240 and more we
might prefer xxh64; this shouldn't break the stacktrace algorithm if we use
different hash algorithms, right?).

> >
> > One very simple change which brings 5-10% speed gain for all hashmaps is this:
> >
> > static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
> >  {
> > +       if (likely((key_len & 0x3) == 0))
> > +               return jhash2(key, key_len >> 2, hashrnd);
> >         return jhash(key, key_len, hashrnd);
> >  }
> >
> > I will follow up with a patch as simple as this ^ or with a combination of
> > jhash, jhash2, and xxh3 once I will run benchmarks on more architectures to
> > check that there are no degradations.
> 
> 
> Sounds good, looking forward to it!

Benchmarks for "the better hash" are running already!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups
  2023-01-31 11:05     ` Anton Protopopov
@ 2023-01-31 22:50       ` Martin KaFai Lau
  2023-02-01  9:12         ` Anton Protopopov
  0 siblings, 1 reply; 25+ messages in thread
From: Martin KaFai Lau @ 2023-01-31 22:50 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend

On 1/31/23 3:05 AM, Anton Protopopov wrote:
> On 23/01/30 04:22, Martin KaFai Lau wrote:
>> On 1/27/23 10:14 AM, Anton Protopopov wrote:
>>> +/* The number of slots to store times */
>>> +#define NR_SLOTS 32
>>> +
>>> +/* Configured by userspace */
>>> +u64 nr_entries;
>>> +u64 nr_loops;
>>> +u32 __attribute__((__aligned__(8))) key[256];
>>> +
>>> +/* Filled by us */
>>> +u64 __attribute__((__aligned__(256))) percpu_times_index[NR_SLOTS];
>>> +u64 __attribute__((__aligned__(256))) percpu_times[256][NR_SLOTS];
>>> +
>>> +static inline void patch_key(u32 i)
>>> +{
>>> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
>>> +	key[0] = i + 1;
>>> +#else
>>> +	key[0] = __builtin_bswap32(i + 1);
>>> +#endif
>>> +	/* the rest of key is random and is configured by userspace */
>>> +}
>>> +
>>> +static int lookup_callback(__u32 index, u32 *unused)
>>> +{
>>> +	patch_key(index);
>>> +	return bpf_map_lookup_elem(&hash_map_bench, key) ? 0 : 1;
>>> +}
>>> +
>>> +static int loop_lookup_callback(__u32 index, u32 *unused)
>>> +{
>>> +	return bpf_loop(nr_entries, lookup_callback, NULL, 0) ? 0 : 1;
>>> +}
>>> +
>>> +SEC("fentry/" SYS_PREFIX "sys_getpgid")
>>> +int benchmark(void *ctx)
>>> +{
>>> +	u32 cpu = bpf_get_smp_processor_id();
>>> +	u32 times_index;
>>> +	u64 start_time;
>>> +
>>> +	times_index = percpu_times_index[cpu & 255] % NR_SLOTS;
>>
>> percpu_times_index only has NR_SLOTS (32) elements?
> 
> Yes, the idea was the following. One measurement (bpf prog execution) takes
> about 20-80 ms (depending on the key/map size). So in 2-3 seconds we can get
> about NR_SLOTS elements. For me 32 looked like enough to get stats for this
> benchmark. Do you think this is better to make the NR_SLOTS
> bigger/configurable?

I thought percpu_times_index[] holds the next slot to use for a particular cpu in
percpu_times[256][NR_SLOTS]. Is 256 the max number of CPUs supported? It is also
doing "cpu & 255". Should it be sized as percpu_times_index[256] instead, then?

Maybe #define what 256 means here so that it has a self-describing name.

> 
>>> +	start_time = bpf_ktime_get_ns();
>>> +	bpf_loop(nr_loops, loop_lookup_callback, NULL, 0);
>>> +	percpu_times[cpu & 255][times_index] = bpf_ktime_get_ns() - start_time;
>>> +	percpu_times_index[cpu & 255] += 1;
>>> +	return 0;
>>> +}
>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 0/6] New benchmark for hashmap lookups
  2023-01-31 19:18       ` Anton Protopopov
@ 2023-02-01  0:02         ` Andrii Nakryiko
  2023-02-01  9:41           ` Anton Protopopov
  0 siblings, 1 reply; 25+ messages in thread
From: Andrii Nakryiko @ 2023-02-01  0:02 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On Tue, Jan 31, 2023 at 11:18 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> On 23/01/31 10:48, Andrii Nakryiko wrote:
> > On Tue, Jan 31, 2023 at 2:47 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > On 23/01/30 04:17, Andrii Nakryiko wrote:
> > > > On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > >
> > > > > Add a new benchmark for hashmap lookups and fix several typos. See individual
> > > > > commits for descriptions.
> > > > >
> > > > > One thing to mention here is that in commit 3 I've patched bench so that now
> > > > > command line options can be reused by different benchmarks.
> > > > >
> > > > > The benchmark itself is added in the last commit 6. I am using this benchmark
> > > > > to test map lookup productivity when using a different hash function (see
> > > > > https://fosdem.org/2023/schedule/event/bpf_hashing/). The results provided by
> > > > > the benchmark look reasonable and match the results of my different benchmarks
> > > > > (requiring to patch kernel to get actual statistics on map lookups).
> > > >
> > > > Could you share the results with us? Curious which hash functions did
> > > > you try and which one are the most promising :)
> > >
> > > For the longer version with pictures see the talk I've referenced above (it's
> > > at FOSDEM next Sunday Feb 5). A short version follows.
> >
> > Yep, I'll try to watch it.
> >
> > >
> > > The xxh3 hash works fine for big keys, where "big" is different for different
> > > architectures and for different maps sizes. On my Intel i7 machine this means
> > > key size >= 8. On my AMD machine xxh3 works better for all keys for small maps,
> > > but degrades for keys of size 12,16,20 for bigger maps (>=200K elements or so).
> > > Example (map size 100K, 50% full, measuring M ops/second):
> >
> > Nice, I was hoping you would look at xxh3, as I've been meaning to try
> > it out as well (have dirty patches to introduce xxh3 into
> > lib/xxhash.c, but didn't get to actual benchmarking).
>
> My first attempt was with lib/xxhash.c, and it looked well on the first glance
> (outperformed every other hash in my hash benchmark). However, when used inside
> the hashmap, it behaved way worse than expected, so I had to inline it.
>
> > Despite this AMD-specific degradation (which is interesting in its own
> > right, could it be some fluke in testing?), I think it's a good idea
> > to switch from jhash to xxh3, as it seems almost universally better.
> > See also below.
> > >
> > >     hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
> > >     orig_map  15.7 15.4 14.2 13.9 13.1 13.2 12.0 12.0 11.5 11.2 10.6 10.7 10.0 10.0  9.6  9.3
> > >     new_map   15.5 15.9 15.2 15.3 14.3 14.6 14.0 14.2 13.3 13.6 13.1 13.4 12.7 13.1 12.3 12.8
> > >
> > > A smaller map (10K, 50% full):
> > >
> > >     hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
> > >     orig_map  36.1 36.8 32.1 32.4 29.0 29.1 26.2 26.4 23.9 24.3 21.8 22.5 20.4 20.7 19.3 19.1
> > >     new_map   37.7 39.6 34.0 36.5 31.5 34.1 30.7 32.7 28.1 29.5 27.4 28.8 27.1 28.1 26.4 27.8
> > >
> > > Other hash functions I've tried (older xxh32/64, spooky) work _way_ worse for
> > > small keys than jhash/xxh3. (Answering a possible question about vector instructions:
> > > xxh3 is scalar until key size <= 240, then the xxh64/xxh32 can be used which
> > > also provides ~2x map lookup speed gain comparing to jhash for keys >240.)
> >
> > Yeah, not suprising. xxh32/64 were optimized for long byte arrays, not
> > for short keys. While xxh3 puts a lot of attention on short keys. Do
> > you see xxh64 being faster than xxh3 for longs keys, as implemented in
> > kernel? Kernel doesn't use SSE2/AVX versions, just purely scalars, so
> > from reading benchmarks of xxh3/xxh64 author, xxh3 should win in all
> > situations.
>
> For keys longer than 240 the scalar xxh3 works way worse than xxhash. BTW, do
> you know use cases when hashmap keys are > 240? (For cilium/tetragon the most
> interesting use cases are keys of sizes ~4-40.)

Can't recall such big keys, but if someone were to use bpf_get_stack()
and use stack traces as hashmap keys, we'd be looking at 1KB+
keys in that case.

When you say "way worse", how much do you mean? This is surprising,
given the official benchmark results. But then, I don't think the official
benchmark shows scalar xxh3 on some of the graphs.

>
> > >
> > > Bloom filters for big >= ~40 keys, predictably, work way faster with xxh3 than
> > > with jhash. (Why not similar to hashmap? Because Bloom filters use jhash2 for
> > > keys % 4 which works faster than jhash for small keys, see also a patch below.)
> > >
> > > The stacktrace map doesn't work much faster, because 95% of time it spends in
> > > get_perf_callchain (the hash part, though, runs ~1.5-2.0 faster with xxh, as
> > > hash size is typically about 60-90 bytes, so this makes sense to use xxh3 in
> > > stacktrace by default).
> >
> > For stacktrace very important aspect would be to pay attention (and
> > minimize) hash collisions, though. This was a big problem with
> > bpf_get_stackid() and STACK_TRACE map (and what motivated
> > bpf_get_stack()). Even with a big sparsely populated map we'd get a
> > lot of collisions between stack traces. xxh3 should have much better
> > distribution, so in production it should result in less
> > dropped/replaced stack traces. If you get a chance, it would be nice
> > to collect these stats for jhash and xxh3-based implementations. Note
> > that kernel's jhash2 seems to be what SMHasher ([0]) denotes as
> > lookup3 (as does Jenkins himself). It's not a very good hash anymore
> > in terms of distribution (and throughput as well), compared to xxh3
> > (and lots of other more modern hashes).
> >
> >   [0] https://github.com/rurban/smhasher
>
> Ok, this makes sense. Based on the fact that for stacktrace xxh3 also works
> about twice faster (for stack depths of 10 and more), I see no problem just
> using it as is (corrected by the fact that for key sizes of 240 and more we
> might prefer xxh64; this shouldn't break the stacktrace algorithms if we use
> different hash algorithms, right?).

Yep, it shouldn't; it's an implementation detail. But of course it would
be nice to stick to just one hashing algorithm.

>
> > >
> > > One very simple change which brings 5-10% speed gain for all hashmaps is this:
> > >
> > > static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
> > >  {
> > > +       if (likely((key_len & 0x3) == 0))
> > > +               return jhash2(key, key_len >> 2, hashrnd);
> > >         return jhash(key, key_len, hashrnd);
> > >  }
> > >
> > > I will follow up with a patch as simple as this ^ or with a combination of
> > > jhash, jhash2, and xxh3 once I will run benchmarks on more architectures to
> > > check that there are no degradations.
> >
> >
> > Sounds good, looking forward to it!
>
> Benchmarks for "the better hash" are running already!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups
  2023-01-31 22:50       ` Martin KaFai Lau
@ 2023-02-01  9:12         ` Anton Protopopov
  0 siblings, 0 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-02-01  9:12 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend

On 23/01/31 02:50, Martin KaFai Lau wrote:
> On 1/31/23 3:05 AM, Anton Protopopov wrote:
> > On 23/01/30 04:22, Martin KaFai Lau wrote:
> > > On 1/27/23 10:14 AM, Anton Protopopov wrote:
> > > > +/* The number of slots to store times */
> > > > +#define NR_SLOTS 32
> > > > +
> > > > +/* Configured by userspace */
> > > > +u64 nr_entries;
> > > > +u64 nr_loops;
> > > > +u32 __attribute__((__aligned__(8))) key[256];
> > > > +
> > > > +/* Filled by us */
> > > > +u64 __attribute__((__aligned__(256))) percpu_times_index[NR_SLOTS];
> > > > +u64 __attribute__((__aligned__(256))) percpu_times[256][NR_SLOTS];
> > > > +
> > > > +static inline void patch_key(u32 i)
> > > > +{
> > > > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > > > +	key[0] = i + 1;
> > > > +#else
> > > > +	key[0] = __builtin_bswap32(i + 1);
> > > > +#endif
> > > > +	/* the rest of key is random and is configured by userspace */
> > > > +}
> > > > +
> > > > +static int lookup_callback(__u32 index, u32 *unused)
> > > > +{
> > > > +	patch_key(index);
> > > > +	return bpf_map_lookup_elem(&hash_map_bench, key) ? 0 : 1;
> > > > +}
> > > > +
> > > > +static int loop_lookup_callback(__u32 index, u32 *unused)
> > > > +{
> > > > +	return bpf_loop(nr_entries, lookup_callback, NULL, 0) ? 0 : 1;
> > > > +}
> > > > +
> > > > +SEC("fentry/" SYS_PREFIX "sys_getpgid")
> > > > +int benchmark(void *ctx)
> > > > +{
> > > > +	u32 cpu = bpf_get_smp_processor_id();
> > > > +	u32 times_index;
> > > > +	u64 start_time;
> > > > +
> > > > +	times_index = percpu_times_index[cpu & 255] % NR_SLOTS;
> > > 
> > > percpu_times_index only has NR_SLOTS (32) elements?
> > 
> > Yes, the idea was the following. One measurement (bpf prog execution) takes
> > about 20-80 ms (depending on the key/map size). So in 2-3 seconds we can get
> > about NR_SLOTS elements. For me 32 looked like enough to get stats for this
> > benchmark. Do you think this is better to make the NR_SLOTS
> > bigger/configurable?
> 
> I thought percpu_times_index[] is the next slot to use for a particular cpu
> in percpu_times[256][NR_SLOTS]. 256 is the max number of cpu supported? It
> is doing "cpu" & 255 also. Should it be sized as percpu_times_index[256]
> instead then?
> 
> May be #define what 256 is here such that it can have a self describe name.

Oh, thanks! Of course, percpu_times_index is the wrong size;
I will fix this and use names instead of numbers.
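For reference, a sketch of what the fixed declarations could look like
(MAX_NR_CPUS is an illustrative name, not taken from the patch):

/* Must stay in sync with the "cpu & 255" masking in benchmark(). */
#define MAX_NR_CPUS 256

u64 __attribute__((__aligned__(256))) percpu_times_index[MAX_NR_CPUS];
u64 __attribute__((__aligned__(256))) percpu_times[MAX_NR_CPUS][NR_SLOTS];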

> > 
> > > > +	start_time = bpf_ktime_get_ns();
> > > > +	bpf_loop(nr_loops, loop_lookup_callback, NULL, 0);
> > > > +	percpu_times[cpu & 255][times_index] = bpf_ktime_get_ns() - start_time;
> > > > +	percpu_times_index[cpu & 255] += 1;
> > > > +	return 0;
> > > > +}
> > > 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 0/6] New benchmark for hashmap lookups
  2023-02-01  0:02         ` Andrii Nakryiko
@ 2023-02-01  9:41           ` Anton Protopopov
  0 siblings, 0 replies; 25+ messages in thread
From: Anton Protopopov @ 2023-02-01  9:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, John Fastabend

On 23/01/31 04:02, Andrii Nakryiko wrote:
> On Tue, Jan 31, 2023 at 11:18 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > On 23/01/31 10:48, Andrii Nakryiko wrote:
> > > On Tue, Jan 31, 2023 at 2:47 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > >
> > > > On 23/01/30 04:17, Andrii Nakryiko wrote:
> > > > > On Fri, Jan 27, 2023 at 10:14 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > > >
> > > > > > Add a new benchmark for hashmap lookups and fix several typos. See individual
> > > > > > commits for descriptions.
> > > > > >
> > > > > > One thing to mention here is that in commit 3 I've patched bench so that now
> > > > > > command line options can be reused by different benchmarks.
> > > > > >
> > > > > > The benchmark itself is added in the last commit 6. I am using this benchmark
> > > > > > to test map lookup productivity when using a different hash function (see
> > > > > > https://fosdem.org/2023/schedule/event/bpf_hashing/). The results provided by
> > > > > > the benchmark look reasonable and match the results of my different benchmarks
> > > > > > (requiring to patch kernel to get actual statistics on map lookups).
> > > > >
> > > > > Could you share the results with us? Curious which hash functions did
> > > > > you try and which one are the most promising :)
> > > >
> > > > For the longer version with pictures see the talk I've referenced above (it's
> > > > at FOSDEM next Sunday Feb 5). A short version follows.
> > >
> > > Yep, I'll try to watch it.
> > >
> > > >
> > > > The xxh3 hash works fine for big keys, where "big" differs between
> > > > architectures and map sizes. On my Intel i7 machine this means a key
> > > > size >= 8. On my AMD machine xxh3 works better for all key sizes on small
> > > > maps, but degrades for keys of sizes 12, 16, and 20 on bigger maps
> > > > (>=200K elements or so).
> > > > Example (map size 100K, 50% full, measuring M ops/second):
> > >
> > > Nice, I was hoping you would look at xxh3, as I've been meaning to try
> > > it out as well (have dirty patches to introduce xxh3 into
> > > lib/xxhash.c, but didn't get to actual benchmarking).
> >
> > My first attempt was with lib/xxhash.c, and it looked good at first glance
> > (it outperformed every other hash in my hash benchmark). However, when used
> > inside the hashmap, it behaved way worse than expected, so I had to inline it.
> >
> > > Despite this AMD-specific degradation (which is interesting in its own
> > > right, could it be some fluke in testing?), I think it's a good idea
> > > to switch from jhash to xxh3, as it seems almost universally better.
> > > See also below.
> > > >
> > > >     hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
> > > >     orig_map  15.7 15.4 14.2 13.9 13.1 13.2 12.0 12.0 11.5 11.2 10.6 10.7 10.0 10.0  9.6  9.3
> > > >     new_map   15.5 15.9 15.2 15.3 14.3 14.6 14.0 14.2 13.3 13.6 13.1 13.4 12.7 13.1 12.3 12.8
> > > >
> > > > A smaller map (10K, 50% full):
> > > >
> > > >     hash_size    4    8   12   16   20   24   28   32   36   40   44   48   52   56   60   64
> > > >     orig_map  36.1 36.8 32.1 32.4 29.0 29.1 26.2 26.4 23.9 24.3 21.8 22.5 20.4 20.7 19.3 19.1
> > > >     new_map   37.7 39.6 34.0 36.5 31.5 34.1 30.7 32.7 28.1 29.5 27.4 28.8 27.1 28.1 26.4 27.8
> > > >
> > > > Other hash functions I've tried (the older xxh32/64, spooky) work _way_
> > > > worse for small keys than jhash/xxh3. (Answering a possible question about
> > > > vector instructions: xxh3 stays scalar for key sizes <= 240; above that,
> > > > xxh64/xxh32 can be used, which also provides a ~2x map lookup speed gain
> > > > compared to jhash for keys > 240.)
> > >
> > > Yeah, not surprising. xxh32/64 were optimized for long byte arrays, not
> > > for short keys, while xxh3 pays a lot of attention to short keys. Do
> > > you see xxh64 being faster than xxh3 for long keys, as implemented in
> > > the kernel? The kernel doesn't use the SSE2/AVX versions, just the purely
> > > scalar ones, so from reading the xxh3/xxh64 author's benchmarks, xxh3
> > > should win in all situations.
> >
> > For keys longer than 240 bytes the scalar xxh3 works way worse than xxh64.
> > BTW, do you know of use cases where hashmap keys are > 240? (For
> > cilium/tetragon the most interesting use cases are keys of sizes ~4-40.)
> 
> Can't recall such big keys, but if someone were to use bpf_get_stack()
> and use stack traces as hashmap keys, we'd be looking at 1KB+
> keys in that case.
> 
> When you say "way worse", roughly how much do you mean? This is surprising,
> given the official benchmark results. But then, I don't think the official
> benchmarks include scalar xxh3 on some of the graphs.

Here are numbers for xxh3 vs xxh64 (hash_size/cycles):

    hash_size 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272 288 304 320 336 352 368 384 400
    xxh3      17 25 33 33 42 42  47  47  68  73  78  83  88  93  98 184 218 218 218 218 252 252 252 252 286
    xxh64     26 41 49 48 57 55  63  62  71  70  79  77  86  85  93  92 107 101 110 107 117 118 135 125 134

Note how the xxh3 numbers jump right after hash_size=240. Probably this is
because the scalar xxh3 path was never actually benchmarked by the original
author? I've asked them about this case, see
https://github.com/Cyan4973/xxHash/issues/793
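
For illustration only (this is not the actual follow-up patch): the
"combination of jhash, jhash2, and xxh3" mentioned further down in the quoted
text could look roughly like the sketch below. The xxh3_scalar() helper is
hypothetical, since the kernel tree has no inlined scalar xxh3 yet, and the
cut-over sizes are assumptions taken from the numbers in this thread.

    /* Sketch of a size-based hash dispatch for the BPF hashtab.
     * xxh3_scalar() is a hypothetical inlined scalar XXH3 helper (no such
     * helper exists in the kernel tree); xxh64() is the existing helper
     * from <linux/xxhash.h>.
     */
    static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
    {
    	if (key_len > 240)	/* scalar xxh3 degrades past 240 bytes */
    		return (u32)xxh64(key, key_len, hashrnd);
    	if (key_len >= 8)	/* xxh3 looked best for medium-sized keys */
    		return (u32)xxh3_scalar(key, key_len, hashrnd);
    	if ((key_len & 0x3) == 0)	/* multiples of 4: jhash2 beats jhash */
    		return jhash2(key, key_len >> 2, hashrnd);
    	return jhash(key, key_len, hashrnd);
    }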

> > > > Bloom filters with big keys (>= ~40 bytes), predictably, work way faster
> > > > with xxh3 than with jhash. (Why is this not similar to the hashmap?
> > > > Because Bloom filters use jhash2 for key sizes that are a multiple of 4,
> > > > which works faster than jhash for small keys; see also the patch below.)
> > > >
> > > > The stacktrace map doesn't get much faster, because it spends 95% of its
> > > > time in get_perf_callchain (the hash part, though, runs ~1.5-2.0x faster
> > > > with xxh3, as the hashed size is typically about 60-90 bytes, so it makes
> > > > sense to use xxh3 in stacktrace by default).
> > >
> > > For stacktrace, a very important aspect would be paying attention to (and
> > > minimizing) hash collisions, though. This was a big problem with
> > > bpf_get_stackid() and the STACK_TRACE map (and what motivated
> > > bpf_get_stack()). Even with a big, sparsely populated map we'd get a
> > > lot of collisions between stack traces. xxh3 should have much better
> > > distribution, so in production it should result in fewer
> > > dropped/replaced stack traces. If you get a chance, it would be nice
> > > to collect these stats for the jhash- and xxh3-based implementations. Note
> > > that the kernel's jhash2 seems to be what SMHasher ([0]) denotes as
> > > lookup3 (as does Jenkins himself). It's no longer a very good hash
> > > in terms of distribution (or throughput), compared to xxh3
> > > (and lots of other more modern hashes).
> > >
> > >   [0] https://github.com/rurban/smhasher
> >
> > Ok, this makes sense. Given that for stacktrace xxh3 also runs about twice
> > as fast (for stack depths of 10 and more), I see no problem just using it
> > as is (with the caveat that for key sizes of 240 and more we might prefer
> > xxh64; using different hash algorithms shouldn't break the stacktrace
> > logic, right?).
> 
> Yep, it shouldn't; it's an implementation detail. But of course it would
> be nice to stick to just one hashing algorithm.
> 
> >
> > > >
> > > > One very simple change which brings a 5-10% speed gain for all hashmaps is this:
> > > >
> > > > static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
> > > >  {
> > > > +       if (likely((key_len & 0x3) == 0))
> > > > +               return jhash2(key, key_len >> 2, hashrnd);
> > > >         return jhash(key, key_len, hashrnd);
> > > >  }
> > > >
> > > > I will follow up with a patch as simple as this ^, or with a combination of
> > > > jhash, jhash2, and xxh3, once I have run benchmarks on more architectures to
> > > > check that there are no regressions.
> > >
> > >
> > > Sounds good, looking forward to it!
> >
> > Benchmarks for "the better hash" are running already!

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-02-01  9:41 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-27 18:14 [PATCH bpf-next 0/6] New benchmark for hashmap lookups Anton Protopopov
2023-01-27 18:14 ` [PATCH bpf-next 1/6] selftest/bpf/benchs: fix a typo in bpf_hashmap_full_update Anton Protopopov
2023-01-27 18:14 ` [PATCH bpf-next 2/6] selftest/bpf/benchs: make a function static " Anton Protopopov
2023-01-27 18:14 ` [PATCH bpf-next 3/6] selftest/bpf/benchs: enhance argp parsing Anton Protopopov
2023-01-31  0:07   ` Andrii Nakryiko
2023-01-31 13:35     ` Anton Protopopov
2023-01-27 18:14 ` [PATCH bpf-next 4/6] selftest/bpf/benchs: make quiet option common Anton Protopopov
2023-01-31  0:10   ` Andrii Nakryiko
2023-01-31 10:57     ` Anton Protopopov
2023-01-31 18:51       ` Andrii Nakryiko
2023-01-31 18:57       ` Anton Protopopov
2023-01-27 18:14 ` [PATCH bpf-next 5/6] selftest/bpf/benchs: print less if the quiet option is set Anton Protopopov
2023-01-27 18:14 ` [PATCH bpf-next 6/6] selftest/bpf/benchs: Add benchmark for hashmap lookups Anton Protopopov
2023-01-31  0:16   ` Andrii Nakryiko
2023-01-31 11:01     ` Anton Protopopov
2023-01-31  0:22   ` Martin KaFai Lau
2023-01-31 11:05     ` Anton Protopopov
2023-01-31 22:50       ` Martin KaFai Lau
2023-02-01  9:12         ` Anton Protopopov
2023-01-31  0:17 ` [PATCH bpf-next 0/6] New " Andrii Nakryiko
2023-01-31 10:47   ` Anton Protopopov
2023-01-31 18:48     ` Andrii Nakryiko
2023-01-31 19:18       ` Anton Protopopov
2023-02-01  0:02         ` Andrii Nakryiko
2023-02-01  9:41           ` Anton Protopopov
