* [PATCH v3 0/2] Optimize performance of update hash-map when free is zero
@ 2022-05-30  9:13 Feng zhou
  2022-05-30  9:13 ` [PATCH v3 1/2] bpf: avoid grabbing spin_locks of all cpus when no free elems Feng zhou
  2022-05-30  9:13 ` [PATCH v3 2/2] selftest/bpf/benchs: Add bpf_map benchmark Feng zhou
  0 siblings, 2 replies; 5+ messages in thread
From: Feng zhou @ 2022-05-30  9:13 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh
  Cc: netdev, bpf, linux-kernel, duanxiongchun, songmuchun,
	wangdongdong.6, cong.wang, zhouchengming, zhoufeng.zf

From: Feng Zhou <zhoufeng.zf@bytedance.com>

We encountered a bad case on a big system with 96 CPUs where
alloc_htab_elem() could take up to 1ms. The reason is that once the
prealloc hashtab has no free elems, an update attempt still grabs the
spin_locks of all cpus. With multiple concurrent updaters, the lock
contention becomes severe.
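
For illustration, the pre-patch pop path scans every per-cpu freelist
under its lock even when the list is empty. Roughly (a simplified
sketch of ___pcpu_freelist_pop(), not the verbatim kernel code):

	orig_cpu = cpu = raw_smp_processor_id();
	while (1) {
		head = per_cpu_ptr(s->freelist, cpu);
		raw_spin_lock(&head->lock);	/* taken even if this list is empty */
		node = head->first;
		if (node) {
			head->first = node->next;
			raw_spin_unlock(&head->lock);
			return node;
		}
		raw_spin_unlock(&head->lock);
		cpu = cpumask_next(cpu, cpu_possible_mask);
		if (cpu >= nr_cpu_ids)
			cpu = 0;
		if (cpu == orig_cpu)
			break;
	}
	/* all per cpu lists empty: fall back to the extralist */

When the map is full, every update pays one lock round-trip per cpu
before failing.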

0001: Add an is_empty flag so we can check whether a free list is empty
before taking its lock.
0002: Add a benchmark that reproduces this worst case.

Changelog:
v2->v3: Addressed comments from Alexei Starovoitov and Andrii Nakryiko.
- Adjust how the benchmark is run.
- Adjust the code format.
Some details here:
https://lore.kernel.org/all/20220524075306.32306-1-zhoufeng.zf@bytedance.com/T/

v1->v2: Addressed comments from Alexei Starovoitov.
- Add a benchmark to reproduce the issue.
- Adjust the code format to avoid adding extra indentation.
Some details here:
https://lore.kernel.org/all/877ac441-045b-1844-6938-fcaee5eee7f2@bytedance.com/T/

Feng Zhou (2):
  bpf: avoid grabbing spin_locks of all cpus when no free elems
  selftest/bpf/benchs: Add bpf_map benchmark

 kernel/bpf/percpu_freelist.c                  | 28 +++++-
 kernel/bpf/percpu_freelist.h                  |  1 +
 tools/testing/selftests/bpf/Makefile          |  4 +-
 tools/testing/selftests/bpf/bench.c           |  2 +
 .../benchs/bench_bpf_hashmap_full_update.c    | 96 +++++++++++++++++++
 .../run_bench_bpf_hashmap_full_update.sh      | 11 +++
 .../bpf/progs/bpf_hashmap_full_update_bench.c | 40 ++++++++
 7 files changed, 178 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_hashmap_full_update_bench.c

-- 
2.20.1



* [PATCH v3 1/2] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-30  9:13 [PATCH v3 0/2] Optimize performance of update hash-map when free is zero Feng zhou
@ 2022-05-30  9:13 ` Feng zhou
  2022-05-30 21:20   ` Daniel Borkmann
  2022-05-30  9:13 ` [PATCH v3 2/2] selftest/bpf/benchs: Add bpf_map benchmark Feng zhou
  1 sibling, 1 reply; 5+ messages in thread
From: Feng zhou @ 2022-05-30  9:13 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh
  Cc: netdev, bpf, linux-kernel, duanxiongchun, songmuchun,
	wangdongdong.6, cong.wang, zhouchengming, zhoufeng.zf

From: Feng Zhou <zhoufeng.zf@bytedance.com>

This patch adds an is_empty flag to pcpu_freelist_head so we can check
whether a freelist has free elems before taking its lock. If it does,
grab the spin_lock; otherwise, move on to the next cpu's freelist.
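
The resulting fast path is (sketch; the full change is in the diff
below):

	head = per_cpu_ptr(s->freelist, cpu);
	if (head->is_empty)		/* lock-free hint, no spinlock taken */
		goto next_cpu;
	raw_spin_lock(&head->lock);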

Before patch: hash_map performance
./map_perf_test 1
0:hash_map_perf pre-alloc 975345 events per sec
4:hash_map_perf pre-alloc 855367 events per sec
12:hash_map_perf pre-alloc 860862 events per sec
8:hash_map_perf pre-alloc 849561 events per sec
3:hash_map_perf pre-alloc 849074 events per sec
6:hash_map_perf pre-alloc 847120 events per sec
10:hash_map_perf pre-alloc 845047 events per sec
5:hash_map_perf pre-alloc 841266 events per sec
14:hash_map_perf pre-alloc 849740 events per sec
2:hash_map_perf pre-alloc 839598 events per sec
9:hash_map_perf pre-alloc 838695 events per sec
11:hash_map_perf pre-alloc 845390 events per sec
7:hash_map_perf pre-alloc 834865 events per sec
13:hash_map_perf pre-alloc 842619 events per sec
1:hash_map_perf pre-alloc 804231 events per sec
15:hash_map_perf pre-alloc 795314 events per sec

hash_map worst case: no free elems
./map_perf_test 2048
6:worse hash_map_perf pre-alloc 28628 events per sec
5:worse hash_map_perf pre-alloc 28553 events per sec
11:worse hash_map_perf pre-alloc 28543 events per sec
3:worse hash_map_perf pre-alloc 28444 events per sec
1:worse hash_map_perf pre-alloc 28418 events per sec
7:worse hash_map_perf pre-alloc 28427 events per sec
13:worse hash_map_perf pre-alloc 28330 events per sec
14:worse hash_map_perf pre-alloc 28263 events per sec
9:worse hash_map_perf pre-alloc 28211 events per sec
15:worse hash_map_perf pre-alloc 28193 events per sec
12:worse hash_map_perf pre-alloc 28190 events per sec
10:worse hash_map_perf pre-alloc 28129 events per sec
8:worse hash_map_perf pre-alloc 28116 events per sec
4:worse hash_map_perf pre-alloc 27906 events per sec
2:worse hash_map_perf pre-alloc 27801 events per sec
0:worse hash_map_perf pre-alloc 27416 events per sec
3:worse hash_map_perf pre-alloc 28188 events per sec

ftrace trace

0)               |  htab_map_update_elem() {
0)   0.198 us    |    migrate_disable();
0)               |    _raw_spin_lock_irqsave() {
0)   0.157 us    |      preempt_count_add();
0)   0.538 us    |    }
0)   0.260 us    |    lookup_elem_raw();
0)               |    alloc_htab_elem() {
0)               |      __pcpu_freelist_pop() {
0)               |        _raw_spin_lock() {
0)   0.152 us    |          preempt_count_add();
0)   0.352 us    |          native_queued_spin_lock_slowpath();
0)   1.065 us    |        }
		 |	  ...
0)               |        _raw_spin_unlock() {
0)   0.254 us    |          preempt_count_sub();
0)   0.555 us    |        }
0) + 25.188 us   |      }
0) + 25.486 us   |    }
0)               |    _raw_spin_unlock_irqrestore() {
0)   0.155 us    |      preempt_count_sub();
0)   0.454 us    |    }
0)   0.148 us    |    migrate_enable();
0) + 28.439 us   |  }

The test machine has 16 CPUs, so each failed pop tries to take a
spin_lock 17 times: once per cpu plus once for the extralist.

After patch: hash_map performance
./map_perf_test 1
0:hash_map_perf pre-alloc 969348 events per sec
10:hash_map_perf pre-alloc 906526 events per sec
11:hash_map_perf pre-alloc 904557 events per sec
9:hash_map_perf pre-alloc 902384 events per sec
15:hash_map_perf pre-alloc 912287 events per sec
14:hash_map_perf pre-alloc 905689 events per sec
12:hash_map_perf pre-alloc 903680 events per sec
13:hash_map_perf pre-alloc 902631 events per sec
8:hash_map_perf pre-alloc 875369 events per sec
4:hash_map_perf pre-alloc 862808 events per sec
1:hash_map_perf pre-alloc 857218 events per sec
2:hash_map_perf pre-alloc 852875 events per sec
5:hash_map_perf pre-alloc 846497 events per sec
6:hash_map_perf pre-alloc 828467 events per sec
3:hash_map_perf pre-alloc 812542 events per sec
7:hash_map_perf pre-alloc 805336 events per sec

hash_map worst case: no free elems
./map_perf_test 2048
7:worse hash_map_perf pre-alloc 391104 events per sec
4:worse hash_map_perf pre-alloc 388073 events per sec
5:worse hash_map_perf pre-alloc 387038 events per sec
1:worse hash_map_perf pre-alloc 386546 events per sec
0:worse hash_map_perf pre-alloc 384590 events per sec
11:worse hash_map_perf pre-alloc 379378 events per sec
10:worse hash_map_perf pre-alloc 375480 events per sec
12:worse hash_map_perf pre-alloc 372394 events per sec
6:worse hash_map_perf pre-alloc 367692 events per sec
3:worse hash_map_perf pre-alloc 363970 events per sec
9:worse hash_map_perf pre-alloc 364008 events per sec
8:worse hash_map_perf pre-alloc 363759 events per sec
2:worse hash_map_perf pre-alloc 360743 events per sec
14:worse hash_map_perf pre-alloc 361195 events per sec
13:worse hash_map_perf pre-alloc 360276 events per sec
15:worse hash_map_perf pre-alloc 360057 events per sec
0:worse hash_map_perf pre-alloc 378177 events per sec

ftrace trace
0)               |  htab_map_update_elem() {
0)   0.317 us    |    migrate_disable();
0)               |    _raw_spin_lock_irqsave() {
0)   0.260 us    |      preempt_count_add();
0)   1.803 us    |    }
0)   0.276 us    |    lookup_elem_raw();
0)               |    alloc_htab_elem() {
0)   0.586 us    |      __pcpu_freelist_pop();
0)   0.945 us    |    }
0)               |    _raw_spin_unlock_irqrestore() {
0)   0.160 us    |      preempt_count_sub();
0)   0.972 us    |    }
0)   0.657 us    |    migrate_enable();
0)   8.669 us    |  }

It can be seen that with this patch applied, map performance is almost
unchanged in the normal case, and when free=0 we check is_empty first
instead of directly acquiring the spin_lock.

As for why we add is_empty instead of checking head->first directly:
head->first is modified frequently while the map is being updated,
which keeps invalidating the other cpus' cache lines, whereas is_empty
only changes when the freelist transitions between empty and non-empty,
so performance is better.
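
In cache-line terms (a sketch of the access pattern; the struct change
itself is in the percpu_freelist.h hunk below):

	struct pcpu_freelist_head {
		struct pcpu_freelist_node *first;	/* written on every push/pop,
							 * so its cache line keeps
							 * bouncing between cpus
							 */
		raw_spinlock_t lock;
		bool is_empty;				/* written only on empty <->
							 * non-empty transitions, so
							 * remote scanners mostly read
							 * a stable cache line
							 */
	};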

Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
---
 kernel/bpf/percpu_freelist.c | 28 +++++++++++++++++++++++++---
 kernel/bpf/percpu_freelist.h |  1 +
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c
index 3d897de89061..f83eb63720d4 100644
--- a/kernel/bpf/percpu_freelist.c
+++ b/kernel/bpf/percpu_freelist.c
@@ -16,9 +16,11 @@ int pcpu_freelist_init(struct pcpu_freelist *s)
 
 		raw_spin_lock_init(&head->lock);
 		head->first = NULL;
+		head->is_empty = true;
 	}
 	raw_spin_lock_init(&s->extralist.lock);
 	s->extralist.first = NULL;
+	s->extralist.is_empty = true;
 	return 0;
 }
 
@@ -32,6 +34,8 @@ static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
 {
 	node->next = head->first;
 	head->first = node;
+	if (head->is_empty)
+		head->is_empty = false;
 }
 
 static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head,
@@ -130,14 +134,19 @@ static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
 	orig_cpu = cpu = raw_smp_processor_id();
 	while (1) {
 		head = per_cpu_ptr(s->freelist, cpu);
+		if (head->is_empty)
+			goto next_cpu;
 		raw_spin_lock(&head->lock);
 		node = head->first;
 		if (node) {
 			head->first = node->next;
+			if (!head->first)
+				head->is_empty = true;
 			raw_spin_unlock(&head->lock);
 			return node;
 		}
 		raw_spin_unlock(&head->lock);
+next_cpu:
 		cpu = cpumask_next(cpu, cpu_possible_mask);
 		if (cpu >= nr_cpu_ids)
 			cpu = 0;
@@ -146,10 +155,15 @@ static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
 	}
 
 	/* per cpu lists are all empty, try extralist */
+	if (s->extralist.is_empty)
+		return NULL;
 	raw_spin_lock(&s->extralist.lock);
 	node = s->extralist.first;
-	if (node)
+	if (node) {
 		s->extralist.first = node->next;
+		if (!s->extralist.first)
+			s->extralist.is_empty = true;
+	}
 	raw_spin_unlock(&s->extralist.lock);
 	return node;
 }
@@ -164,15 +178,20 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
 	orig_cpu = cpu = raw_smp_processor_id();
 	while (1) {
 		head = per_cpu_ptr(s->freelist, cpu);
+		if (head->is_empty)
+			goto next_cpu;
 		if (raw_spin_trylock(&head->lock)) {
 			node = head->first;
 			if (node) {
 				head->first = node->next;
+				if (!head->first)
+					head->is_empty = true;
 				raw_spin_unlock(&head->lock);
 				return node;
 			}
 			raw_spin_unlock(&head->lock);
 		}
+next_cpu:
 		cpu = cpumask_next(cpu, cpu_possible_mask);
 		if (cpu >= nr_cpu_ids)
 			cpu = 0;
@@ -181,11 +200,14 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
 	}
 
 	/* cannot pop from per cpu lists, try extralist */
-	if (!raw_spin_trylock(&s->extralist.lock))
+	if (s->extralist.is_empty || !raw_spin_trylock(&s->extralist.lock))
 		return NULL;
 	node = s->extralist.first;
-	if (node)
+	if (node) {
 		s->extralist.first = node->next;
+		if (!s->extralist.first)
+			s->extralist.is_empty = true;
+	}
 	raw_spin_unlock(&s->extralist.lock);
 	return node;
 }
diff --git a/kernel/bpf/percpu_freelist.h b/kernel/bpf/percpu_freelist.h
index 3c76553cfe57..9e4545631ed5 100644
--- a/kernel/bpf/percpu_freelist.h
+++ b/kernel/bpf/percpu_freelist.h
@@ -9,6 +9,7 @@
 struct pcpu_freelist_head {
 	struct pcpu_freelist_node *first;
 	raw_spinlock_t lock;
+	bool is_empty;
 };
 
 struct pcpu_freelist {
-- 
2.20.1



* [PATCH v3 2/2] selftest/bpf/benchs: Add bpf_map benchmark
  2022-05-30  9:13 [PATCH v3 0/2] Optimize performance of update hash-map when free is zero Feng zhou
  2022-05-30  9:13 ` [PATCH v3 1/2] bpf: avoid grabbing spin_locks of all cpus when no free elems Feng zhou
@ 2022-05-30  9:13 ` Feng zhou
  1 sibling, 0 replies; 5+ messages in thread
From: Feng zhou @ 2022-05-30  9:13 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh
  Cc: netdev, bpf, linux-kernel, duanxiongchun, songmuchun,
	wangdongdong.6, cong.wang, zhouchengming, zhoufeng.zf

From: Feng Zhou <zhoufeng.zf@bytedance.com>

Add a hash_map benchmark that reproduces the worst case: non-stop
updates while the map has no free elems. The map is pre-filled to
max_entries and the BPF program updates keys outside the pre-filled
range, so every update takes the no-free-elems path.
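
To run it (assuming the usual selftests build flow):

	$ cd tools/testing/selftests/bpf
	$ make
	$ ./benchs/run_bench_bpf_hashmap_full_update.sh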

before patch:
Setting up benchmark 'bpf-hashmap-ful-update'...
Benchmark 'bpf-hashmap-ful-update' started.
1:hash_map_full_perf 107796 events per sec
2:hash_map_full_perf 108072 events per sec
3:hash_map_full_perf 112169 events per sec
4:hash_map_full_perf 111423 events per sec
5:hash_map_full_perf 110778 events per sec
6:hash_map_full_perf 121336 events per sec
7:hash_map_full_perf 98676 events per sec
8:hash_map_full_perf 105860 events per sec
9:hash_map_full_perf 109930 events per sec
10:hash_map_full_perf 123434 events per sec
11:hash_map_full_perf 125374 events per sec
12:hash_map_full_perf 121979 events per sec
13:hash_map_full_perf 123014 events per sec
14:hash_map_full_perf 126219 events per sec
15:hash_map_full_perf 104793 events per sec

after patch:
Setting up benchmark 'bpf-hashmap-ful-update'...
Benchmark 'bpf-hashmap-ful-update' started.
0:hash_map_full_perf 1219230 events per sec
1:hash_map_full_perf 1320256 events per sec
2:hash_map_full_perf 1196550 events per sec
3:hash_map_full_perf 1375684 events per sec
4:hash_map_full_perf 1365551 events per sec
5:hash_map_full_perf 1318432 events per sec
6:hash_map_full_perf 1222007 events per sec
7:hash_map_full_perf 1240786 events per sec
8:hash_map_full_perf 1190005 events per sec
9:hash_map_full_perf 1562336 events per sec
10:hash_map_full_perf 1385241 events per sec
11:hash_map_full_perf 1387909 events per sec
12:hash_map_full_perf 1371877 events per sec
13:hash_map_full_perf 1561836 events per sec
14:hash_map_full_perf 1388895 events per sec
15:hash_map_full_perf 1579054 events per sec

Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
---
 tools/testing/selftests/bpf/Makefile          |  4 +-
 tools/testing/selftests/bpf/bench.c           |  2 +
 .../benchs/bench_bpf_hashmap_full_update.c    | 96 +++++++++++++++++++
 .../run_bench_bpf_hashmap_full_update.sh      | 11 +++
 .../bpf/progs/bpf_hashmap_full_update_bench.c | 40 ++++++++
 5 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_hashmap_full_update_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3820608faf57..b968649c7aa1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -549,6 +549,7 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
 $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
+$(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -560,7 +561,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_ringbufs.o \
 		 $(OUTPUT)/bench_bloom_filter_map.o \
 		 $(OUTPUT)/bench_bpf_loop.o \
-		 $(OUTPUT)/bench_strncmp.o
+		 $(OUTPUT)/bench_strncmp.o \
+		 $(OUTPUT)/bench_bpf_hashmap_full_update.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index f973320e6dbf..35de886d9a05 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -397,6 +397,7 @@ extern const struct bench bench_hashmap_with_bloom;
 extern const struct bench bench_bpf_loop;
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
+extern const struct bench bench_bpf_hashmap_full_update;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -431,6 +432,7 @@ static const struct bench *benchs[] = {
 	&bench_bpf_loop,
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
+	&bench_bpf_hashmap_full_update,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
new file mode 100644
index 000000000000..cec51e0ff4b8
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Bytedance */
+
+#include <argp.h>
+#include "bench.h"
+#include "bpf_hashmap_full_update_bench.skel.h"
+#include "bpf_util.h"
+
+/* BPF triggering benchmarks */
+static struct ctx {
+	struct bpf_hashmap_full_update_bench *skel;
+} ctx;
+
+#define MAX_LOOP_NUM 10000
+
+static void validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static void *producer(void *input)
+{
+	while (true) {
+		/* trigger the bpf program */
+		syscall(__NR_getpgid);
+	}
+
+	return NULL;
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+static void measure(struct bench_res *res)
+{
+}
+
+static void setup(void)
+{
+	struct bpf_link *link;
+	int map_fd, i, max_entries;
+
+	setup_libbpf();
+
+	ctx.skel = bpf_hashmap_full_update_bench__open_and_load();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	ctx.skel->bss->nr_loops = MAX_LOOP_NUM;
+
+	link = bpf_program__attach(ctx.skel->progs.benchmark);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+
+	/* fill hash_map */
+	map_fd = bpf_map__fd(ctx.skel->maps.hash_map_bench);
+	max_entries = bpf_map__max_entries(ctx.skel->maps.hash_map_bench);
+	for (i = 0; i < max_entries; i++)
+		bpf_map_update_elem(map_fd, &i, &i, BPF_ANY);
+}
+
+void hashmap_report_final(struct bench_res res[], int res_cnt)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	int i;
+
+	for (i = 0; i < nr_cpus; i++) {
+		u64 time = ctx.skel->bss->percpu_time[i];
+
+		if (!time)
+			continue;
+
+		printf("%d:hash_map_full_perf %lld events per sec\n",
+		       i, ctx.skel->bss->nr_loops * 1000000000ll / time);
+	}
+}
+
+const struct bench bench_bpf_hashmap_full_update = {
+	.name = "bpf-hashmap-ful-update",
+	.validate = validate,
+	.setup = setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = NULL,
+	.report_final = hashmap_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh b/tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh
new file mode 100755
index 000000000000..1e2de838f9fa
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_bpf_hashmap_full_update.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+nr_threads=`expr $(cat /proc/cpuinfo | grep "processor"| wc -l) - 1`
+summary=$($RUN_BENCH -p $nr_threads bpf-hashmap-ful-update)
+printf "$summary"
+printf "\n"
diff --git a/tools/testing/selftests/bpf/progs/bpf_hashmap_full_update_bench.c b/tools/testing/selftests/bpf/progs/bpf_hashmap_full_update_bench.c
new file mode 100644
index 000000000000..aa93a03f961d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_hashmap_full_update_bench.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Bytedance */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define MAX_ENTRIES 1000
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, u32);
+	__type(value, u64);
+	__uint(max_entries, MAX_ENTRIES);
+} hash_map_bench SEC(".maps");
+
+u64 __attribute__((__aligned__(256))) percpu_time[256];
+u64 nr_loops;
+
+static int loop_update_callback(__u32 index, u32 *key)
+{
+	u64 init_val = 1;
+
+	bpf_map_update_elem(&hash_map_bench, key, &init_val, BPF_ANY);
+	return 0;
+}
+
+SEC("fentry/" SYS_PREFIX "sys_getpgid")
+int benchmark(void *ctx)
+{
+	u32 key = bpf_get_prandom_u32() % MAX_ENTRIES + MAX_ENTRIES;
+	u32 cpu = bpf_get_smp_processor_id();
+	u64 start_time = bpf_ktime_get_ns();
+
+	bpf_loop(nr_loops, loop_update_callback, &key, 0);
+	percpu_time[cpu & 255] = bpf_ktime_get_ns() - start_time;
+	return 0;
+}
-- 
2.20.1



* Re: [PATCH v3 1/2] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-30  9:13 ` [PATCH v3 1/2] bpf: avoid grabbing spin_locks of all cpus when no free elems Feng zhou
@ 2022-05-30 21:20   ` Daniel Borkmann
  2022-05-31  3:24     ` Feng Zhou
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Borkmann @ 2022-05-30 21:20 UTC (permalink / raw)
  To: Feng zhou, ast, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh
  Cc: netdev, bpf, linux-kernel, duanxiongchun, songmuchun,
	wangdongdong.6, cong.wang, zhouchengming

On 5/30/22 11:13 AM, Feng zhou wrote:
> From: Feng Zhou <zhoufeng.zf@bytedance.com>
> 
> This patch adds an is_empty flag to pcpu_freelist_head so we can check
> whether a freelist has free elems before taking its lock. If it does,
> grab the spin_lock; otherwise, move on to the next cpu's freelist.
> 
[...]
>   	/* per cpu lists are all empty, try extralist */
> +	if (s->extralist.is_empty)
> +		return NULL;
>   	raw_spin_lock(&s->extralist.lock);
>   	node = s->extralist.first;
> -	if (node)
> +	if (node) {
>   		s->extralist.first = node->next;
> +		if (!s->extralist.first)
> +			s->extralist.is_empty = true;
> +	}
>   	raw_spin_unlock(&s->extralist.lock);
>   	return node;
>   }
> @@ -164,15 +178,20 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
>   	orig_cpu = cpu = raw_smp_processor_id();
>   	while (1) {
>   		head = per_cpu_ptr(s->freelist, cpu);
> +		if (head->is_empty)

This should use READ_ONCE/WRITE_ONCE pair for head->is_empty.
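
i.e. something along these lines (sketch):

	if (READ_ONCE(head->is_empty))
		goto next_cpu;
	[...]
	WRITE_ONCE(head->is_empty, true);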

> +			goto next_cpu;
>   		if (raw_spin_trylock(&head->lock)) {
>   			node = head->first;
>   			if (node) {
>   				head->first = node->next;
> +				if (!head->first)
> +					head->is_empty = true;
>   				raw_spin_unlock(&head->lock);
>   				return node;
>   			}
>   			raw_spin_unlock(&head->lock);
>   		}
> +next_cpu:
>   		cpu = cpumask_next(cpu, cpu_possible_mask);
>   		if (cpu >= nr_cpu_ids)
>   			cpu = 0;


* Re: [PATCH v3 1/2] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-30 21:20   ` Daniel Borkmann
@ 2022-05-31  3:24     ` Feng Zhou
  0 siblings, 0 replies; 5+ messages in thread
From: Feng Zhou @ 2022-05-31  3:24 UTC (permalink / raw)
  To: Daniel Borkmann, ast, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh
  Cc: netdev, bpf, linux-kernel, duanxiongchun, songmuchun,
	wangdongdong.6, cong.wang, zhouchengming

On 2022/5/31 5:20 AM, Daniel Borkmann wrote:
> On 5/30/22 11:13 AM, Feng zhou wrote:
>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>
>> This patch adds an is_empty flag to pcpu_freelist_head so we can check
>> whether a freelist has free elems before taking its lock. If it does,
>> grab the spin_lock; otherwise, move on to the next cpu's freelist.
>>
> [...]
>>       /* per cpu lists are all empty, try extralist */
>> +    if (s->extralist.is_empty)
>> +        return NULL;
>>       raw_spin_lock(&s->extralist.lock);
>>       node = s->extralist.first;
>> -    if (node)
>> +    if (node) {
>>           s->extralist.first = node->next;
>> +        if (!s->extralist.first)
>> +            s->extralist.is_empty = true;
>> +    }
>>       raw_spin_unlock(&s->extralist.lock);
>>       return node;
>>   }
>> @@ -164,15 +178,20 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
>>       orig_cpu = cpu = raw_smp_processor_id();
>>       while (1) {
>>           head = per_cpu_ptr(s->freelist, cpu);
>> +        if (head->is_empty)
>
> This should use READ_ONCE/WRITE_ONCE pair for head->is_empty.

Yes, will do. Thanks.

