* [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
@ 2023-06-06  3:53 Hou Tao
  2023-06-06  3:53 ` [RFC PATCH bpf-next v4 1/3] bpf: Factor out a common helper free_all() Hou Tao
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Hou Tao @ 2023-06-06  3:53 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

The implementation of v4 is mainly based on suggestions from Alexei [0].
There are still pending problems in the current implementation, as shown
by the benchmark results in patch #3, but since a long time has passed
since the posting of v3, v4 is posted here for further discussion and
more suggestions.

The first problem is the huge memory usage compared with the bpf memory
allocator which does immediate reuse:

htab-mem-benchmark (reuse-after-RCU-GP):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1159.18   | 0.99                | 0.99             |
| overwrite          | 11.00     | 2288                | 4109             |
| batch_add_batch_del| 8.86      | 1558                | 2763             |
| add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |

htab-mem-benchmark (immediate-reuse):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1160.66   | 0.99                | 1.00             |
| overwrite          | 28.52     | 2.46                | 2.73             |
| batch_add_batch_del| 11.50     | 2.69                | 2.95             |
| add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |

It seems the direct reason is the slow RCU grace period. During the
benchmark, the elapsed time before the reuse_rcu() callback is invoked is
about 100ms or even more (e.g., 2 seconds). I suspect that the global
per-bpf-ma spin-lock and the irq-work running in the context of the
freeing process increase the overhead of the bpf program: the running
time of getpgid() is increased, context switches are slowed down and the
RCU grace period grows [1], but I am still digging into it.

Another problem is the performance degradation compared with immediate
reuse; the output of perf report shows that the per-bpf-ma spin-lock is a
top hotspot:

map_perf_test (reuse-after-RCU-GP)
0:hash_map_perf kmalloc 194677 events per sec

map_perf_test (immediate reuse)
2:hash_map_perf kmalloc 384527 events per sec

The purpose of introducing the per-bpf-ma reusable list is to handle the
case in which allocation and free are done on different CPUs (e.g.,
add_del_on_diff_cpu), while a per-cpu reuse list would be enough for the
overwrite & batch_add_batch_del cases. So maybe we could implement a
hybrid of a global reusable list and per-cpu reusable lists and switch
between these two kinds of list according to the history of allocation
and free frequency.
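
A rough sketch of what such a switch might look like is shown below.
Everything in it (the field and helper names, the 1/4 threshold) is
hypothetical and only illustrates the idea; it is not code from this
series:

  /* Hypothetical sketch: track how often an object freed on this CPU was
   * allocated on another CPU, and only route frees through the shared
   * (per-bpf-ma) list when cross-CPU frees dominate.
   */
  struct hybrid_stats {
          unsigned int local_free_cnt;    /* freed on the allocating CPU */
          unsigned int remote_free_cnt;   /* freed on a different CPU */
  };

  static bool prefer_shared_list(const struct hybrid_stats *st)
  {
          /* Use the shared list only when more than 1/4 of the recent
           * frees were cross-CPU; otherwise stay on the lock-free
           * per-cpu list and avoid the global spin-lock.
           */
          return st->remote_free_cnt * 4 >
                 st->local_free_cnt + st->remote_free_cnt;
  }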

As usual, suggestions and comments are always welcome.

[0]: https://lore.kernel.org/bpf/20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local
[1]: https://lore.kernel.org/bpf/1b64fc4e-d92e-de2f-4895-2e0c36427425@huaweicloud.com

Change Log:
v4:
 * no kworker (Alexei)
 * Use a global reusable list in bpf memory allocator (Alexei)
 * Remove BPF_MA_FREE_AFTER_RCU_GP flag and do reuse-after-rcu-gp
   by default in bpf memory allocator (Alexei)
 * add benchmark results from map_perf_test (Alexei)

v3: https://lore.kernel.org/bpf/20230429101215.111262-1-houtao@huaweicloud.com/
 * add BPF_MA_FREE_AFTER_RCU_GP bpf memory allocator
 * Update htab memory benchmark
   * move the benchmark patch to the last patch
   * remove array and useless bpf_map_lookup_elem(&array, ...) in bpf
     programs
   * add synchronization between addition CPU and deletion CPU for
     add_del_on_diff_cpu case to prevent unnecessary loop
   * add the benchmark result for "extra call_rcu + bpf ma"

v2: https://lore.kernel.org/bpf/20230408141846.1878768-1-houtao@huaweicloud.com/
 * add a benchmark for bpf memory allocator to compare between different
   flavors of bpf memory allocator.
 * implement BPF_MA_REUSE_AFTER_RCU_GP for bpf memory allocator.

v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

Hou Tao (3):
  bpf: Factor out a common helper free_all()
  selftests/bpf: Add benchmark for bpf memory allocator
  bpf: Only reuse after one RCU GP in bpf memory allocator

 include/linux/bpf_mem_alloc.h                 |   4 +
 kernel/bpf/memalloc.c                         | 385 ++++++++++++------
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++
 .../bpf/benchs/run_bench_htab_mem.sh          |  42 ++
 .../selftests/bpf/progs/htab_mem_bench.c      | 135 ++++++
 7 files changed, 809 insertions(+), 116 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

-- 
2.29.2



* [RFC PATCH bpf-next v4 1/3] bpf: Factor out a common helper free_all()
  2023-06-06  3:53 [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator Hou Tao
@ 2023-06-06  3:53 ` Hou Tao
  2023-06-06  3:53 ` [RFC PATCH bpf-next v4 2/3] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Hou Tao @ 2023-06-06  3:53 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Factor out a common helper free_all() to free all normal elements or
per-cpu elements on a lock-less list.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 410637c225fb..0668bcd7c926 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -211,9 +211,9 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	mem_cgroup_put(memcg);
 }
 
-static void free_one(struct bpf_mem_cache *c, void *obj)
+static void free_one(void *obj, bool percpu)
 {
-	if (c->percpu_size) {
+	if (percpu) {
 		free_percpu(((void **)obj)[1]);
 		kfree(obj);
 		return;
@@ -222,14 +222,19 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
 	kfree(obj);
 }
 
-static void __free_rcu(struct rcu_head *head)
+static void free_all(struct llist_node *llnode, bool percpu)
 {
-	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
-	struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
 	struct llist_node *pos, *t;
 
 	llist_for_each_safe(pos, t, llnode)
-		free_one(c, pos);
+		free_one(pos, percpu);
+}
+
+static void __free_rcu(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+
+	free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
 	atomic_set(&c->call_rcu_in_progress, 0);
 }
 
@@ -432,7 +437,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 
 static void drain_mem_cache(struct bpf_mem_cache *c)
 {
-	struct llist_node *llnode, *t;
+	bool percpu = !!c->percpu_size;
 
 	/* No progs are using this bpf_mem_cache, but htab_map_free() called
 	 * bpf_mem_cache_free() for all remaining elements and they can be in
@@ -441,14 +446,10 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	 * Except for waiting_for_gp list, there are no concurrent operations
 	 * on these lists, so it is safe to use __llist_del_all().
 	 */
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
-		free_one(c, llnode);
+	free_all(__llist_del_all(&c->free_by_rcu), percpu);
+	free_all(llist_del_all(&c->waiting_for_gp), percpu);
+	free_all(__llist_del_all(&c->free_llist), percpu);
+	free_all(__llist_del_all(&c->free_llist_extra), percpu);
 }
 
 static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
-- 
2.29.2



* [RFC PATCH bpf-next v4 2/3] selftests/bpf: Add benchmark for bpf memory allocator
  2023-06-06  3:53 [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator Hou Tao
  2023-06-06  3:53 ` [RFC PATCH bpf-next v4 1/3] bpf: Factor out a common helper free_all() Hou Tao
@ 2023-06-06  3:53 ` Hou Tao
  2023-06-06 21:13   ` Alexei Starovoitov
  2023-06-06  3:53 ` [RFC PATCH bpf-next v4 3/3] bpf: Only reuse after one RCU GP in " Hou Tao
  2023-06-06 12:30 ` [RFC PATCH bpf-next v4 0/3] Handle immediate reuse " Hou Tao
  3 siblings, 1 reply; 32+ messages in thread
From: Hou Tao @ 2023-06-06  3:53 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

The benchmark can be used to compare the performance of hash map
operations and the memory usage between different flavors of the bpf
memory allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It
can also be used to check the performance improvement or the memory
saving provided by an optimization.

The benchmark creates a non-preallocated hash map which uses the bpf
memory allocator and shows the operation performance and the memory
usage of the hash map under different use cases:
(1) no_op
Only create the hash map; there are no operations on the hash map. It is
used as the baseline. When each CPU completes the iteration of
nr_entries / nr_threads elements in the hash map, the loop count is
increased.
(2) overwrite
Each CPU overwrites a nonoverlapping part of the hash map. When each CPU
completes overwriting nr_entries / nr_threads elements in the hash map,
the loop count is increased.
(3) batch_add_batch_del
Each CPU adds and then deletes a nonoverlapping part of the hash map in
batches. When each CPU adds and deletes nr_entries / nr_threads elements
in the hash map, the loop count is increased.
(4) add_del_on_diff_cpu
Each pair of CPUs adds and deletes a nonoverlapping part of the map
cooperatively. When each two-CPU pair adds and deletes
nr_entries / nr_threads * 2 elements in the hash map, the loop count is
increased twice.

The following are the benchmark results when comparing different flavors
of the bpf memory allocator. These tests were conducted on a KVM guest
with 8 CPUs and 16 GB of memory. The command line below is used for all
of the following benchmarks:

  ./bench htab-mem --use-case $name --max-entries 16384 ${OPTS} \
          --full 50 -d 10 --producers=8 --prod-affinity=0-7

These results show:
* the preallocated case has both better performance and better memory
  efficiency.
* the normal bpf memory allocator doesn't handle add_del_on_diff_cpu very
  well. The large memory usage is due to the slow RCU tasks trace grace
  period.

(1) non-preallocated + no bpf memory allocator (v6.0.19)
use kmalloc() + call_rcu

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1214.42    | 0.92                 | 0.92              |
| overwrite           | 3.21       | 40.47                | 67.98             |
| batch_add_batch_del | 2.32       | 24.31                | 49.33             |
| add_del_on_diff_cpu | 2.92       | 4.03                 | 6.00              |

(2) preallocated
OPTS=--preallocated

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1156.59    | 1.88                 | 1.88              |
| overwrite           | 36.19      | 1.88                 | 1.88              |
| batch_add_batch_del | 22.27      | 1.88                 | 1.88              |
| add_del_on_diff_cpu | 4.68       | 1.95                 | 2.05              |

(3) normal bpf memory allocator

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1273.55    | 0.98                 | 0.98              |
| overwrite           | 26.57      | 2.59                 | 2.74              |
| batch_add_batch_del | 11.13      | 2.59                 | 2.99              |
| add_del_on_diff_cpu | 3.72       | 15.15                | 26.04             |

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++++
 .../bpf/benchs/run_bench_htab_mem.sh          |  42 +++
 .../selftests/bpf/progs/htab_mem_bench.c      | 135 +++++++
 5 files changed, 536 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 538df8fb8c42..add018823ebd 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -644,11 +644,13 @@ $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
 $(OUTPUT)/bench_local_storage_create.o: $(OUTPUT)/bench_local_storage_create.skel.h
 $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
+$(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(TESTING_HELPERS) \
 		 $(TRACE_HELPERS) \
+		 $(CGROUP_HELPERS) \
 		 $(OUTPUT)/bench_count.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
@@ -661,6 +663,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
 		 $(OUTPUT)/bench_bpf_hashmap_lookup.o \
 		 $(OUTPUT)/bench_local_storage_create.o \
+		 $(OUTPUT)/bench_htab_mem.o \
 		 #
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index d9c080ac1796..d3d9ae321b74 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -279,6 +279,7 @@ extern struct argp bench_local_storage_rcu_tasks_trace_argp;
 extern struct argp bench_strncmp_argp;
 extern struct argp bench_hashmap_lookup_argp;
 extern struct argp bench_local_storage_create_argp;
+extern struct argp bench_htab_mem_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -290,6 +291,7 @@ static const struct argp_child bench_parsers[] = {
 		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
 	{ &bench_hashmap_lookup_argp, 0, "Hashmap lookup benchmark", 0 },
 	{ &bench_local_storage_create_argp, 0, "local-storage-create benchmark", 0 },
+	{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
 	{},
 };
 
@@ -518,6 +520,7 @@ extern const struct bench bench_local_storage_cache_hashmap_control;
 extern const struct bench bench_local_storage_tasks_trace;
 extern const struct bench bench_bpf_hashmap_lookup;
 extern const struct bench bench_local_storage_create;
+extern const struct bench bench_htab_mem;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -559,6 +562,7 @@ static const struct bench *benchs[] = {
 	&bench_local_storage_tasks_trace,
 	&bench_bpf_hashmap_lookup,
 	&bench_local_storage_create,
+	&bench_htab_mem,
 };
 
 static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_htab_mem.c b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
new file mode 100644
index 000000000000..f0c2505c0868
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
@@ -0,0 +1,352 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
+#include <argp.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#include "bench.h"
+#include "cgroup_helpers.h"
+#include "htab_mem_bench.skel.h"
+
+static struct htab_mem_ctx {
+	struct htab_mem_bench *skel;
+	pthread_barrier_t *notify;
+	int fd;
+	bool do_notify_wait;
+} ctx;
+
+static struct htab_mem_args {
+	u32 max_entries;
+	u32 value_size;
+	u32 full;
+	const char *use_case;
+	bool preallocated;
+} args = {
+	.max_entries = 16384,
+	.full = 50,
+	.value_size = 8,
+	.use_case = "overwrite",
+	.preallocated = false,
+};
+
+enum {
+	ARG_MAX_ENTRIES = 10000,
+	ARG_FULL_PERCENT = 10001,
+	ARG_VALUE_SIZE = 10002,
+	ARG_USE_CASE = 10003,
+	ARG_PREALLOCATED = 10004,
+};
+
+static const struct argp_option opts[] = {
+	{ "max-entries", ARG_MAX_ENTRIES, "MAX_ENTRIES", 0,
+	  "Set the max entries of hash map (default 16384)" },
+	{ "full", ARG_FULL_PERCENT, "FULL", 0,
+	  "Set the full percent of hash map (default 50)" },
+	{ "value-size", ARG_VALUE_SIZE, "VALUE_SIZE", 0,
+	  "Set the value size of hash map (default 8)" },
+	{ "use-case", ARG_USE_CASE, "USE_CASE", 0,
+	  "Set the use case of hash map: no_op|overwrite|batch_add_batch_del|add_del_on_diff_cpu" },
+	{ "preallocated", ARG_PREALLOCATED, NULL, 0, "use preallocated hash map" },
+	{},
+};
+
+static error_t htab_mem_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_MAX_ENTRIES:
+		args.max_entries = strtoul(arg, NULL, 10);
+		break;
+	case ARG_FULL_PERCENT:
+		args.full = strtoul(arg, NULL, 10);
+		if (!args.full || args.full > 100) {
+			fprintf(stderr, "invalid full percent %u\n", args.full);
+			argp_usage(state);
+		}
+		break;
+	case ARG_VALUE_SIZE:
+		args.value_size = strtoul(arg, NULL, 10);
+		if (args.value_size > 4096) {
+			fprintf(stderr, "too big value size %u\n", args.value_size);
+			argp_usage(state);
+		}
+		break;
+	case ARG_USE_CASE:
+		args.use_case = strdup(arg);
+		break;
+	case ARG_PREALLOCATED:
+		args.preallocated = true;
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_htab_mem_argp = {
+	.options = opts,
+	.parser = htab_mem_parse_arg,
+};
+
+static void htab_mem_validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "htab mem benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static int setup_and_join_cgroup(const char *path)
+{
+	int err, fd;
+
+	err = setup_cgroup_environment();
+	if (err) {
+		fprintf(stderr, "setup cgroup env failed\n");
+		return -1;
+	}
+
+	err = create_and_get_cgroup(path);
+	if (err < 0) {
+		fprintf(stderr, "create cgroup %s failed\n", path);
+		goto out;
+	}
+	fd = err;
+
+	err = join_cgroup(path);
+	if (err) {
+		fprintf(stderr, "join cgroup %s failed\n", path);
+		close(fd);
+		goto out;
+	}
+
+	return fd;
+out:
+	cleanup_cgroup_environment();
+	return -1;
+}
+
+static int htab_mem_bench_init_barriers(void)
+{
+	unsigned int i, nr = (env.producer_cnt + 1) / 2;
+	pthread_barrier_t *barriers;
+
+	barriers = calloc(nr, sizeof(*barriers));
+	if (!barriers)
+		return -1;
+
+	/* Used for synchronization between two threads */
+	for (i = 0; i < nr; i++)
+		pthread_barrier_init(&barriers[i], NULL, 2);
+
+	ctx.notify = barriers;
+	return 0;
+}
+
+static void htab_mem_bench_exit_barriers(void)
+{
+	unsigned int i, nr;
+
+	if (!ctx.notify)
+		return;
+
+	nr = (env.producer_cnt + 1) / 2;
+	for (i = 0; i < nr; i++)
+		pthread_barrier_destroy(&ctx.notify[i]);
+	free(ctx.notify);
+}
+
+static void htab_mem_setup(void)
+{
+	struct bpf_program *prog;
+	struct bpf_map *map;
+	int err;
+
+	setup_libbpf();
+
+	err = setup_and_join_cgroup("/htab_mem");
+	if (err < 0)
+		exit(1);
+	ctx.fd = err;
+
+	ctx.skel = htab_mem_bench__open();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		goto cleanup;
+	}
+
+	err = htab_mem_bench_init_barriers();
+	if (err) {
+		fprintf(stderr, "failed to init barrier\n");
+		goto cleanup;
+	}
+
+	map = ctx.skel->maps.htab;
+	bpf_map__set_max_entries(map, args.max_entries);
+	bpf_map__set_value_size(map, args.value_size);
+	if (args.preallocated)
+		bpf_map__set_map_flags(map, bpf_map__map_flags(map) & ~BPF_F_NO_PREALLOC);
+
+	/* Do synchronization between addition thread and deletion thread */
+	if (!strcmp("add_del_on_diff_cpu", args.use_case))
+		ctx.do_notify_wait = true;
+
+	prog = bpf_object__find_program_by_name(ctx.skel->obj, args.use_case);
+	if (!prog) {
+		fprintf(stderr, "no such use-case: %s\n", args.use_case);
+		fprintf(stderr, "available use case:");
+		bpf_object__for_each_program(prog, ctx.skel->obj)
+			fprintf(stderr, " %s", bpf_program__name(prog));
+		fprintf(stderr, "\n");
+		goto cleanup;
+	}
+	bpf_program__set_autoload(prog, true);
+
+	ctx.skel->bss->nr_thread = env.producer_cnt;
+	ctx.skel->bss->nr_entries = (uint64_t)args.max_entries * args.full / 100;
+
+	err = htab_mem_bench__load(ctx.skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		goto cleanup;
+	}
+	err = htab_mem_bench__attach(ctx.skel);
+	if (err) {
+		fprintf(stderr, "failed to attach skeleton\n");
+		goto cleanup;
+	}
+	return;
+cleanup:
+	close(ctx.fd);
+	cleanup_cgroup_environment();
+	htab_mem_bench_exit_barriers();
+	htab_mem_bench__destroy(ctx.skel);
+	exit(1);
+}
+
+static void htab_mem_notify_wait_producer(pthread_barrier_t *notify)
+{
+	while (true) {
+		(void)syscall(__NR_getpgid);
+		/* Notify for start */
+		pthread_barrier_wait(notify);
+		/* Wait for completion */
+		pthread_barrier_wait(notify);
+	}
+}
+
+static void htab_mem_wait_notify_producer(pthread_barrier_t *notify)
+{
+	while (true) {
+		/* Wait for start */
+		pthread_barrier_wait(notify);
+		(void)syscall(__NR_getpgid);
+		/* Notify for completion */
+		pthread_barrier_wait(notify);
+	}
+}
+
+static void *htab_mem_producer(void *arg)
+{
+	pthread_barrier_t *notify;
+	int seq;
+
+	if (!ctx.do_notify_wait) {
+		while (true)
+			(void)syscall(__NR_getpgid);
+		return NULL;
+	}
+
+	seq = (long)arg;
+	notify = &ctx.notify[seq / 2];
+	if (seq & 1)
+		htab_mem_notify_wait_producer(notify);
+	else
+		htab_mem_wait_notify_producer(notify);
+	return NULL;
+}
+
+static void *htab_mem_consumer(void *arg)
+{
+	return NULL;
+}
+
+static void htab_mem_read_mem_cgrp_file(const char *name, unsigned long *value)
+{
+	char buf[32];
+	int fd;
+
+	fd = openat(ctx.fd, name, O_RDONLY);
+	if (fd < 0) {
+		fprintf(stderr, "no %s\n", name);
+		*value = 0;
+		return;
+	}
+
+	buf[sizeof(buf) - 1] = 0;
+	read(fd, buf, sizeof(buf) - 1);
+	*value = strtoull(buf, NULL, 0);
+
+	close(fd);
+}
+
+static void htab_mem_measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->loop_cnt, 0);
+	htab_mem_read_mem_cgrp_file("memory.current", &res->gp_ct);
+}
+
+static void htab_mem_report_progress(int iter, struct bench_res *res, long delta_ns)
+{
+	double loop, mem;
+
+	loop = res->hits / 1000.0 / (delta_ns / 1000000000.0);
+	mem = res->gp_ct / 1048576.0;
+	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
+	printf("loop %7.2lfk/s, memory usage %7.2lfMiB\n", loop, mem);
+}
+
+static void htab_mem_report_final(struct bench_res res[], int res_cnt)
+{
+	double mem_mean = 0.0, mem_stddev = 0.0;
+	double loop_mean = 0.0, loop_stddev = 0.0;
+	unsigned long peak_mem;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		loop_mean += res[i].hits / 1000.0 / (0.0 + res_cnt);
+		mem_mean += res[i].gp_ct / 1048576.0 / (0.0 + res_cnt);
+	}
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++) {
+			loop_stddev += (loop_mean - res[i].hits / 1000.0) *
+				       (loop_mean - res[i].hits / 1000.0) /
+				       (res_cnt - 1.0);
+			mem_stddev += (mem_mean - res[i].gp_ct / 1048576.0) *
+				      (mem_mean - res[i].gp_ct / 1048576.0) /
+				      (res_cnt - 1.0);
+		}
+		loop_stddev = sqrt(loop_stddev);
+		mem_stddev = sqrt(mem_stddev);
+	}
+
+	htab_mem_read_mem_cgrp_file("memory.peak", &peak_mem);
+	printf("Summary: loop %7.2lf \u00B1 %7.2lfk/s, memory usage %7.2lf \u00B1 %7.2lfMiB, "
+	       "peak memory usage %7.2lfMiB\n",
+	       loop_mean, loop_stddev, mem_mean, mem_stddev, peak_mem / 1048576.0);
+}
+
+const struct bench bench_htab_mem = {
+	.name = "htab-mem",
+	.argp = &bench_htab_mem_argp,
+	.validate = htab_mem_validate,
+	.setup = htab_mem_setup,
+	.producer_thread = htab_mem_producer,
+	.consumer_thread = htab_mem_consumer,
+	.measure = htab_mem_measure,
+	.report_progress = htab_mem_report_progress,
+	.report_final = htab_mem_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh b/tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
new file mode 100755
index 000000000000..b488cd7ec646
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
@@ -0,0 +1,42 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+htab_mem()
+{
+	echo -n "loop : "
+	echo -n "$*" | sed -E "s/.* loop\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+k\/s).*/\1/"
+	echo -n -e ", avg mem: "
+	echo -n "$*" | sed -E "s/.* memory usage\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+MiB).*/\1/"
+	echo -n ", peak mem: "
+	echo "$*" | sed -E "s/.* peak memory usage\s+([0-9]+\.[0-9]+MiB).*/\1/"
+}
+
+summarize_htab_mem()
+{
+	local bench="$1"
+	local summary=$(echo $2 | tail -n1)
+
+	printf "%-20s %s\n" "$bench" "$(htab_mem $summary)"
+}
+
+htab_mem_bench()
+{
+	local name
+
+	for name in no_op overwrite batch_add_batch_del add_del_on_diff_cpu
+	do
+		summarize_htab_mem "$name" "$(sudo ./bench htab-mem --use-case $name \
+			--max-entries 16384 --full 50 -d 10 \
+			--producers=8 --prod-affinity=0-7 "$@")"
+	done
+}
+
+header "preallocated"
+htab_mem_bench "--preallocated"
+
+header "normal bpf ma"
+htab_mem_bench
diff --git a/tools/testing/selftests/bpf/progs/htab_mem_bench.c b/tools/testing/selftests/bpf/progs/htab_mem_bench.c
new file mode 100644
index 000000000000..fe5c0edb262e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/htab_mem_bench.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
+#include <stdbool.h>
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+struct update_ctx {
+	unsigned int from;
+	unsigned int step;
+	unsigned int max;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
+
+unsigned char zeroed_value[4096];
+unsigned int nr_entries = 0;
+unsigned int nr_thread = 0;
+long loop_cnt = 0;
+
+static int noop_htab(unsigned int i, struct update_ctx *ctx)
+{
+	if (ctx->from >= ctx->max)
+		return 1;
+
+	ctx->from += ctx->step;
+	return 0;
+}
+
+static int write_htab(unsigned int i, struct update_ctx *ctx, unsigned int flags)
+{
+	if (ctx->from >= ctx->max)
+		return 1;
+
+	bpf_map_update_elem(&htab, &ctx->from, zeroed_value, flags);
+	ctx->from += ctx->step;
+
+	return 0;
+}
+
+static int overwrite_htab(unsigned int i, struct update_ctx *ctx)
+{
+	return write_htab(i, ctx, 0);
+}
+
+static int newwrite_htab(unsigned int i, struct update_ctx *ctx)
+{
+	return write_htab(i, ctx, BPF_NOEXIST);
+}
+
+static int del_htab(unsigned int i, struct update_ctx *ctx)
+{
+	if (ctx->from >= ctx->max)
+		return 1;
+
+	bpf_map_delete_elem(&htab, &ctx->from);
+	ctx->from += ctx->step;
+
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int no_op(void *ctx)
+{
+	struct update_ctx update;
+
+	update.from = bpf_get_smp_processor_id();
+	update.step = nr_thread;
+	update.max = nr_entries;
+	bpf_loop(update.max, noop_htab, &update, 0);
+	__sync_fetch_and_add(&loop_cnt, 1);
+
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int overwrite(void *ctx)
+{
+	struct update_ctx update;
+
+	update.from = bpf_get_smp_processor_id();
+	update.step = nr_thread;
+	update.max = nr_entries;
+	bpf_loop(update.max, overwrite_htab, &update, 0);
+	/* Increase when nr_entries / nr_thread elements are deleted and then added */
+	__sync_fetch_and_add(&loop_cnt, 1);
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int batch_add_batch_del(void *ctx)
+{
+	struct update_ctx update;
+
+	update.from = bpf_get_smp_processor_id();
+	update.step = nr_thread;
+	update.max = nr_entries;
+	bpf_loop(update.max, overwrite_htab, &update, 0);
+
+	update.from = bpf_get_smp_processor_id();
+	bpf_loop(update.max, del_htab, &update, 0);
+
+	/* Increase when nr_entries / nr_thread elements are added and then deleted */
+	__sync_fetch_and_add(&loop_cnt, 1);
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int add_del_on_diff_cpu(void *ctx)
+{
+	struct update_ctx update;
+	unsigned int from;
+
+	from = bpf_get_smp_processor_id();
+	update.from = from / 2;
+	update.step = nr_thread / 2;
+	update.max = nr_entries;
+
+	if (from & 1)
+		bpf_loop(update.max, newwrite_htab, &update, 0);
+	else
+		bpf_loop(update.max, del_htab, &update, 0);
+
+	/* Increase when nr_entries / nr_thread * 2 elements are added or deleted */
+	__sync_fetch_and_add(&loop_cnt, 1);
+	return 0;
+}
-- 
2.29.2



* [RFC PATCH bpf-next v4 3/3] bpf: Only reuse after one RCU GP in bpf memory allocator
  2023-06-06  3:53 [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator Hou Tao
  2023-06-06  3:53 ` [RFC PATCH bpf-next v4 1/3] bpf: Factor out a common helper free_all() Hou Tao
  2023-06-06  3:53 ` [RFC PATCH bpf-next v4 2/3] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
@ 2023-06-06  3:53 ` Hou Tao
  2023-06-06 12:30 ` [RFC PATCH bpf-next v4 0/3] Handle immediate reuse " Hou Tao
  3 siblings, 0 replies; 32+ messages in thread
From: Hou Tao @ 2023-06-06  3:53 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Currently the freed objects in the bpf memory allocator may be reused
immediately by a new allocation. This introduces a use-after-bpf-ma-free
problem for non-preallocated hash maps and makes the lookup procedure
return incorrect results. The immediate reuse also makes introducing new
use cases (e.g., qp-trie) more difficult.

So implement reuse-after-RCU-GP to solve these problems. With
reuse-after-RCU-GP, freed objects are reused only after one RCU grace
period and may be returned back to the slab system after another
RCU-tasks-trace grace period. So bpf programs which care about the reuse
problem can use bpf_rcu_read_{lock,unlock}() to access these freed
objects safely, and for those which don't care, use-after-bpf-ma-free
remains safe because these objects have not yet been freed back to the
slab by the bpf memory allocator.
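
For illustration only, below is a minimal sketch of a sleepable program
relying on this guarantee; the map definition, value layout and attach
point are hypothetical and not part of this series:

  /* Hypothetical example, not included in this series. */
  #include <linux/types.h>
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  void bpf_rcu_read_lock(void) __ksym;
  void bpf_rcu_read_unlock(void) __ksym;

  struct val {
          __u64 counter;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, __u32);
          __type(value, struct val);
          __uint(map_flags, BPF_F_NO_PREALLOC);
  } htab SEC(".maps");

  char _license[] SEC("license") = "GPL";

  /* x86-64 specific attach point, chosen only for illustration */
  SEC("fentry.s/__x64_sys_getpgid")
  int read_elem(void *ctx)
  {
          __u32 key = 0;
          struct val *v;

          /* Within the RCU read-side critical section the element
           * returned by the lookup cannot be reused by another
           * allocation, thanks to reuse-after-RCU-GP.
           */
          bpf_rcu_read_lock();
          v = bpf_map_lookup_elem(&htab, &key);
          if (v)
                  __sync_fetch_and_add(&v->counter, 1);
          bpf_rcu_read_unlock();
          return 0;
  }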

To handle the use case which does allocation and free on different CPUs,
a per-bpf-mem-alloc list is introduced to keep these reusable objects.
In order to reduce the risk of OOM, part of these reusable objects will
be freed and returned back to the slab through an RCU-tasks-trace
callback. Until that callback runs, these objects are still available
for reuse.

As shown in the following benchmark results, the memory usage increases
a lot and the performance of the overwrite and batch_add_batch_del cases
is also degraded. The benchmark was conducted on a KVM VM with 8 CPUs
and 16 GB of memory. The command line for htab-mem-benchmark is:

  ./bench htab-mem --use-case $name --max-entries 16384 \
          --full 50 -d 10 --producers=8 --prod-affinity=0-7

And the command line for map_perf_test benchmark is:
  ./map_perf_test 4 8 16384

htab-mem-benchmark (before):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1160.66   | 0.99                | 1.00             |
| overwrite          | 28.52     | 2.46                | 2.73             |
| batch_add_batch_del| 11.50     | 2.69                | 2.95             |
| add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |

map_perf_test (before)
2:hash_map_perf kmalloc 384527 events per sec
7:hash_map_perf kmalloc 359707 events per sec
6:hash_map_perf kmalloc 314229 events per sec
0:hash_map_perf kmalloc 306743 events per sec
3:hash_map_perf kmalloc 309987 events per sec
4:hash_map_perf kmalloc 309012 events per sec
1:hash_map_perf kmalloc 295757 events per sec
5:hash_map_perf kmalloc 292229 events per sec

htab-mem-benchmark (after):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1159.18   | 0.99                | 0.99             |
| overwrite          | 11.00     | 2288                | 4109             |
| batch_add_batch_del| 8.86      | 1558                | 2763             |
| add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |

map_perf_test (after)
0:hash_map_perf kmalloc 194677 events per sec
4:hash_map_perf kmalloc 194177 events per sec
1:hash_map_perf kmalloc 180662 events per sec
6:hash_map_perf kmalloc 181310 events per sec
5:hash_map_perf kmalloc 177213 events per sec
2:hash_map_perf kmalloc 173069 events per sec
3:hash_map_perf kmalloc 166792 events per sec
7:hash_map_perf kmalloc 165253 events per sec

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_mem_alloc.h |   4 +
 kernel/bpf/memalloc.c         | 366 ++++++++++++++++++++++++----------
 2 files changed, 263 insertions(+), 107 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 3929be5743f4..69f84a21520b 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -7,10 +7,14 @@
 
 struct bpf_mem_cache;
 struct bpf_mem_caches;
+struct bpf_mem_shared_cache;
+struct bpf_mem_shared_caches;
 
 struct bpf_mem_alloc {
 	struct bpf_mem_caches __percpu *caches;
 	struct bpf_mem_cache __percpu *cache;
+	struct bpf_mem_shared_cache *s_cache;
+	struct bpf_mem_shared_caches *s_caches;
 	struct work_struct work;
 };
 
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 0668bcd7c926..fea1cb0c78bb 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -74,6 +74,17 @@ static int bpf_mem_cache_idx(size_t size)
 	return fls(size - 1) - 2;
 }
 
+struct bpf_mem_shared_cache {
+	raw_spinlock_t reuse_lock;
+	bool percpu;
+	bool direct_free;
+	struct llist_head reuse_ready_head;
+	struct llist_node *reuse_ready_tail;
+	struct llist_head wait_for_free;
+	atomic_t call_rcu_in_progress;
+	struct rcu_head rcu;
+};
+
 #define NUM_CACHES 11
 
 struct bpf_mem_cache {
@@ -98,10 +109,17 @@ struct bpf_mem_cache {
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
 	int percpu_size;
+	struct llist_head prepare_reuse_head;
+	struct llist_node *prepare_reuse_tail;
+	unsigned int prepare_reuse_cnt;
+	raw_spinlock_t lock;
 
+	struct bpf_mem_shared_cache *sc;
 	struct rcu_head rcu;
 	struct llist_head free_by_rcu;
+	struct llist_node *free_by_rcu_tail;
 	struct llist_head waiting_for_gp;
+	struct llist_node *waiting_for_gp_tail;
 	atomic_t call_rcu_in_progress;
 };
 
@@ -109,6 +127,10 @@ struct bpf_mem_caches {
 	struct bpf_mem_cache cache[NUM_CACHES];
 };
 
+struct bpf_mem_shared_caches {
+	struct bpf_mem_shared_cache cache[NUM_CACHES];
+};
+
 static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 {
 	struct llist_node *entry, *next;
@@ -153,6 +175,52 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
 #endif
 }
 
+static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
+{
+	struct bpf_mem_shared_cache *sc = c->sc;
+	struct llist_node *head, *tail, *obj;
+	unsigned long flags;
+	int alloc;
+
+	if (llist_empty(&sc->reuse_ready_head) && llist_empty(&sc->wait_for_free))
+		return 0;
+
+	alloc = 0;
+	head = NULL;
+	tail = NULL;
+	raw_spin_lock_irqsave(&sc->reuse_lock, flags);
+	while (alloc < cnt) {
+		obj = __llist_del_first(&sc->reuse_ready_head);
+		if (obj) {
+			if (llist_empty(&sc->reuse_ready_head))
+				sc->reuse_ready_tail = NULL;
+		} else {
+			obj = __llist_del_first(&sc->wait_for_free);
+			if (!obj)
+				break;
+		}
+		if (!tail)
+			tail = obj;
+		obj->next = head;
+		head = obj;
+		alloc++;
+	}
+	raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
+
+	if (alloc) {
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_save(flags);
+		WARN_ON_ONCE(local_inc_return(&c->active) != 1);
+		__llist_add_batch(head, tail, &c->free_llist);
+		c->free_cnt += alloc;
+		local_dec(&c->active);
+		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+			local_irq_restore(flags);
+	}
+
+	return alloc;
+}
+
 /* Mostly runs from irq_work except __init phase. */
 static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 {
@@ -161,32 +229,19 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	void *obj;
 	int i;
 
+	i = bpf_ma_get_reusable_obj(c, cnt);
+
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
-	for (i = 0; i < cnt; i++) {
-		/*
-		 * free_by_rcu is only manipulated by irq work refill_work().
-		 * IRQ works on the same CPU are called sequentially, so it is
-		 * safe to use __llist_del_first() here. If alloc_bulk() is
-		 * invoked by the initial prefill, there will be no running
-		 * refill_work(), so __llist_del_first() is fine as well.
-		 *
-		 * In most cases, objects on free_by_rcu are from the same CPU.
-		 * If some objects come from other CPUs, it doesn't incur any
-		 * harm because NUMA_NO_NODE means the preference for current
-		 * numa node and it is not a guarantee.
+	for (; i < cnt; i++) {
+		/* Allocate, but don't deplete atomic reserves that typical
+		 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
+		 * will allocate from the current numa node which is what we
+		 * want here.
 		 */
-		obj = __llist_del_first(&c->free_by_rcu);
-		if (!obj) {
-			/* Allocate, but don't deplete atomic reserves that typical
-			 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
-			 * will allocate from the current numa node which is what we
-			 * want here.
-			 */
-			obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
-			if (!obj)
-				break;
-		}
+		obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
+		if (!obj)
+			break;
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			/* In RT irq_work runs in per-cpu kthread, so disable
 			 * interrupts to avoid preemption and interrupts and
@@ -230,100 +285,125 @@ static void free_all(struct llist_node *llnode, bool percpu)
 		free_one(pos, percpu);
 }
 
-static void __free_rcu(struct rcu_head *head)
+static void free_rcu(struct rcu_head *rcu)
 {
-	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+	struct bpf_mem_shared_cache *sc = container_of(rcu, struct bpf_mem_shared_cache, rcu);
+	struct llist_node *head;
+	unsigned long flags;
 
-	free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
-	atomic_set(&c->call_rcu_in_progress, 0);
+	raw_spin_lock_irqsave(&sc->reuse_lock, flags);
+	head = __llist_del_all(&sc->wait_for_free);
+	raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
+	free_all(head, sc->percpu);
+	atomic_set(&sc->call_rcu_in_progress, 0);
 }
 
-static void __free_rcu_tasks_trace(struct rcu_head *head)
+static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
 {
-	/* If RCU Tasks Trace grace period implies RCU grace period,
-	 * there is no need to invoke call_rcu().
-	 */
-	if (rcu_trace_implies_rcu_gp())
-		__free_rcu(head);
-	else
-		call_rcu(head, __free_rcu);
-}
+	struct bpf_mem_shared_cache *sc = c->sc;
+	struct llist_node *head, *tail;
+	unsigned long flags;
 
-static void enque_to_free(struct bpf_mem_cache *c, void *obj)
-{
-	struct llist_node *llnode = obj;
+	/* Draining could be running concurrently with reuse_rcu() */
+	raw_spin_lock_irqsave(&c->lock, flags);
+	head = __llist_del_all(&c->waiting_for_gp);
+	tail = c->waiting_for_gp_tail;
+	c->waiting_for_gp_tail = NULL;
+	raw_spin_unlock_irqrestore(&c->lock, flags);
+	/* Draining is in progress ? */
+	if (!head)
+		return;
 
-	/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
-	 * Nothing races to add to free_by_rcu list.
+	raw_spin_lock_irqsave(&sc->reuse_lock, flags);
+	/* Don't move these objects to reuse_ready list and free
+	 * these objects directly.
 	 */
-	__llist_add(llnode, &c->free_by_rcu);
+	if (sc->direct_free) {
+		raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
+		free_all(head, sc->percpu);
+		return;
+	}
+
+	if (llist_empty(&sc->reuse_ready_head))
+		sc->reuse_ready_tail = tail;
+	__llist_add_batch(head, tail, &sc->reuse_ready_head);
+
+	if (!atomic_xchg(&sc->call_rcu_in_progress, 1)) {
+		head = __llist_del_all(&sc->reuse_ready_head);
+		tail = sc->reuse_ready_tail;
+		sc->reuse_ready_tail = NULL;
+		WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
+		__llist_add_batch(head, tail, &sc->wait_for_free);
+		call_rcu_tasks_trace(&sc->rcu, free_rcu);
+	}
+	raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
 }
 
-static void do_call_rcu(struct bpf_mem_cache *c)
+static void reuse_rcu(struct rcu_head *rcu)
 {
-	struct llist_node *llnode, *t;
-
-	if (atomic_xchg(&c->call_rcu_in_progress, 1))
-		return;
+	struct bpf_mem_cache *c = container_of(rcu, struct bpf_mem_cache, rcu);
 
-	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
-		/* There is no concurrent __llist_add(waiting_for_gp) access.
-		 * It doesn't race with llist_del_all either.
-		 * But there could be two concurrent llist_del_all(waiting_for_gp):
-		 * from __free_rcu() and from drain_mem_cache().
-		 */
-		__llist_add(llnode, &c->waiting_for_gp);
-	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
-	 * If RCU Tasks Trace grace period implies RCU grace period, free
-	 * these elements directly, else use call_rcu() to wait for normal
-	 * progs to finish and finally do free_one() on each element.
-	 */
-	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
+	bpf_ma_add_to_reuse_ready_or_free(c);
+	atomic_set(&c->call_rcu_in_progress, 0);
 }
 
-static void free_bulk(struct bpf_mem_cache *c)
+static void reuse_bulk(struct bpf_mem_cache *c)
 {
-	struct llist_node *llnode, *t;
+	struct llist_node *head, *tail, *llnode, *tmp;
 	unsigned long flags;
-	int cnt;
 
-	do {
-		if (IS_ENABLED(CONFIG_PREEMPT_RT))
-			local_irq_save(flags);
-		WARN_ON_ONCE(local_inc_return(&c->active) != 1);
-		llnode = __llist_del_first(&c->free_llist);
-		if (llnode)
-			cnt = --c->free_cnt;
-		else
-			cnt = 0;
-		local_dec(&c->active);
-		if (IS_ENABLED(CONFIG_PREEMPT_RT))
-			local_irq_restore(flags);
-		if (llnode)
-			enque_to_free(c, llnode);
-	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_save(flags);
+	WARN_ON_ONCE(local_inc_return(&c->active) != 1);
+	head = __llist_del_all(&c->prepare_reuse_head);
+	tail = c->prepare_reuse_tail;
+	c->prepare_reuse_tail = NULL;
+	c->prepare_reuse_cnt = 0;
+	local_dec(&c->active);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_restore(flags);
+
+	llist_for_each_safe(llnode, tmp, llist_del_all(&c->free_llist_extra)) {
+		if (!head)
+			tail = llnode;
+		llnode->next = head;
+		head = llnode;
+	}
+	WARN_ON_ONCE(!head);
+
+	/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
+	 * Nothing races to add to free_by_rcu list.
+	 */
+	if (llist_empty(&c->free_by_rcu))
+		c->free_by_rcu_tail = tail;
+	__llist_add_batch(head, tail, &c->free_by_rcu);
+
+	if (atomic_xchg(&c->call_rcu_in_progress, 1))
+		return;
 
-	/* and drain free_llist_extra */
-	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
-		enque_to_free(c, llnode);
-	do_call_rcu(c);
+	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
+	head = __llist_del_all(&c->free_by_rcu);
+	tail = c->free_by_rcu_tail;
+	c->free_by_rcu_tail = NULL;
+	if (llist_empty(&c->waiting_for_gp))
+		c->waiting_for_gp_tail = tail;
+	__llist_add_batch(head, tail, &c->waiting_for_gp);
+	call_rcu(&c->rcu, reuse_rcu);
 }
 
 static void bpf_mem_refill(struct irq_work *work)
 {
 	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
-	int cnt;
 
 	/* Racy access to free_cnt. It doesn't need to be 100% accurate */
-	cnt = c->free_cnt;
-	if (cnt < c->low_watermark)
+	if (c->free_cnt < c->low_watermark)
 		/* irq_work runs on this cpu and kmalloc will allocate
 		 * from the current numa node which is what we want here.
 		 */
 		alloc_bulk(c, c->batch, NUMA_NO_NODE);
-	else if (cnt > c->high_watermark)
-		free_bulk(c);
+
+	if (c->prepare_reuse_cnt > c->high_watermark)
+		reuse_bulk(c);
 }
 
 static void notrace irq_work_raise(struct bpf_mem_cache *c)
@@ -370,6 +450,12 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu));
 }
 
+static inline void shared_cache_init(struct bpf_mem_shared_cache *cache, bool percpu)
+{
+	cache->percpu = percpu;
+	raw_spin_lock_init(&cache->reuse_lock);
+}
+
 /* When size != 0 bpf_mem_cache for each cpu.
  * This is typical bpf hash map use case when all elements have equal size.
  *
@@ -382,13 +468,22 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
+	struct bpf_mem_shared_caches *scc;
+	struct bpf_mem_shared_cache *sc;
 	struct obj_cgroup *objcg = NULL;
 	int cpu, i, unit_size, percpu_size = 0;
 
 	if (size) {
+		sc = kzalloc(sizeof(*sc), GFP_KERNEL);
+		if (!sc)
+			return -ENOMEM;
+		shared_cache_init(sc, percpu);
+
 		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
-		if (!pc)
+		if (!pc) {
+			kfree(sc);
 			return -ENOMEM;
+		}
 
 		if (percpu)
 			/* room for llist_node and per-cpu pointer */
@@ -406,9 +501,12 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 			c->unit_size = unit_size;
 			c->objcg = objcg;
 			c->percpu_size = percpu_size;
+			raw_spin_lock_init(&c->lock);
+			c->sc = sc;
 			prefill_mem_cache(c, cpu);
 		}
 		ma->cache = pc;
+		ma->s_cache = sc;
 		return 0;
 	}
 
@@ -416,28 +514,56 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 	if (WARN_ON_ONCE(percpu))
 		return -EINVAL;
 
+	scc = kzalloc(sizeof(*scc), GFP_KERNEL);
+	if (!scc)
+		return -ENOMEM;
 	pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL);
-	if (!pcc)
+	if (!pcc) {
+		kfree(scc);
 		return -ENOMEM;
+	}
 #ifdef CONFIG_MEMCG_KMEM
 	objcg = get_obj_cgroup_from_current();
 #endif
+	for (i = 0; i < NUM_CACHES; i++)
+		shared_cache_init(&scc->cache[i], 0);
 	for_each_possible_cpu(cpu) {
 		cc = per_cpu_ptr(pcc, cpu);
 		for (i = 0; i < NUM_CACHES; i++) {
 			c = &cc->cache[i];
 			c->unit_size = sizes[i];
 			c->objcg = objcg;
+			raw_spin_lock_init(&c->lock);
+			c->sc = &scc->cache[i];
 			prefill_mem_cache(c, cpu);
 		}
 	}
 	ma->caches = pcc;
+	ma->s_caches = scc;
 	return 0;
 }
 
+static void drain_shared_mem_cache(struct bpf_mem_shared_cache *sc)
+{
+	struct llist_node *head[2];
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&sc->reuse_lock, flags);
+	sc->direct_free = true;
+	head[0] = __llist_del_all(&sc->reuse_ready_head);
+	sc->reuse_ready_tail = NULL;
+	head[1] = __llist_del_all(&sc->wait_for_free);
+	raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
+
+	free_all(head[0], sc->percpu);
+	free_all(head[1], sc->percpu);
+}
+
 static void drain_mem_cache(struct bpf_mem_cache *c)
 {
 	bool percpu = !!c->percpu_size;
+	struct llist_node *head;
+	unsigned long flags;
 
 	/* No progs are using this bpf_mem_cache, but htab_map_free() called
 	 * bpf_mem_cache_free() for all remaining elements and they can be in
@@ -447,34 +573,44 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	 * on these lists, so it is safe to use __llist_del_all().
 	 */
 	free_all(__llist_del_all(&c->free_by_rcu), percpu);
-	free_all(llist_del_all(&c->waiting_for_gp), percpu);
+	c->free_by_rcu_tail = NULL;
+
+	/* Use lock to update head and tail atomically */
+	raw_spin_lock_irqsave(&c->lock, flags);
+	head = __llist_del_all(&c->waiting_for_gp);
+	c->waiting_for_gp_tail = NULL;
+	raw_spin_unlock_irqrestore(&c->lock, flags);
+	free_all(head, percpu);
+
 	free_all(__llist_del_all(&c->free_llist), percpu);
 	free_all(__llist_del_all(&c->free_llist_extra), percpu);
+
+	head = __llist_del_all(&c->prepare_reuse_head);
+	c->prepare_reuse_tail = NULL;
+	free_all(head, percpu);
 }
 
 static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
 {
 	free_percpu(ma->cache);
 	free_percpu(ma->caches);
+	kfree(ma->s_cache);
+	kfree(ma->s_caches);
 	ma->cache = NULL;
 	ma->caches = NULL;
+	ma->s_cache = NULL;
+	ma->s_caches = NULL;
 }
 
 static void free_mem_alloc(struct bpf_mem_alloc *ma)
 {
-	/* waiting_for_gp lists was drained, but __free_rcu might
-	 * still execute. Wait for it now before we freeing percpu caches.
-	 *
-	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
-	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
-	 * to wait for the pending __free_rcu_tasks_trace() and __free_rcu(),
-	 * so if call_rcu(head, __free_rcu) is skipped due to
-	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
-	 * using rcu_trace_implies_rcu_gp() as well.
+	/* Use rcu_barrier() to wait for the pending reuse_rcu() and use
+	 * rcu_barrier_tasks_trace() to wait for the pending free_rcu().
+	 * direct_free has already been set to prevent reuse_rcu() from
+	 * calling free_rcu() again.
 	 */
+	rcu_barrier();
 	rcu_barrier_tasks_trace();
-	if (!rcu_trace_implies_rcu_gp())
-		rcu_barrier();
 	free_mem_alloc_no_barrier(ma);
 }
 
@@ -507,15 +643,20 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
 
 	/* Defer barriers into worker to let the rest of map memory to be freed */
 	copy->cache = ma->cache;
+	copy->s_cache = ma->s_cache;
 	ma->cache = NULL;
+	ma->s_cache = NULL;
 	copy->caches = ma->caches;
+	copy->s_caches = ma->s_caches;
 	ma->caches = NULL;
+	ma->s_caches = NULL;
 	INIT_WORK(&copy->work, free_mem_alloc_deferred);
 	queue_work(system_unbound_wq, &copy->work);
 }
 
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 {
+	struct bpf_mem_shared_cache *sc;
 	struct bpf_mem_caches *cc;
 	struct bpf_mem_cache *c;
 	int cpu, i, rcu_in_progress;
@@ -537,6 +678,9 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			drain_mem_cache(c);
 			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
+		sc = ma->s_cache;
+		drain_shared_mem_cache(sc);
+		rcu_in_progress += atomic_read(&sc->call_rcu_in_progress);
 		/* objcg is the same across cpus */
 		if (c->objcg)
 			obj_cgroup_put(c->objcg);
@@ -553,6 +697,11 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 			}
 		}
+		for (i = 0; i < NUM_CACHES; i++) {
+			sc = &ma->s_caches->cache[i];
+			drain_shared_mem_cache(sc);
+			rcu_in_progress += atomic_read(&sc->call_rcu_in_progress);
+		}
 		if (c->objcg)
 			obj_cgroup_put(c->objcg);
 		destroy_mem_alloc(ma, rcu_in_progress);
@@ -595,8 +744,8 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 }
 
 /* Though 'ptr' object could have been allocated on a different cpu
- * add it to the free_llist of the current cpu.
- * Let kfree() logic deal with it when it's later called from irq_work.
+ * add it to the prepare_reuse list of the current cpu.
+ * Let kfree() logic deal with it when it's later called from RCU cb.
  */
 static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 {
@@ -607,12 +756,15 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
 
 	local_irq_save(flags);
+	/* In case a NMI-context bpf program is also freeing object. */
 	if (local_inc_return(&c->active) == 1) {
-		__llist_add(llnode, &c->free_llist);
-		cnt = ++c->free_cnt;
+		if (llist_empty(&c->prepare_reuse_head))
+			c->prepare_reuse_tail = llnode;
+		__llist_add(llnode, &c->prepare_reuse_head);
+		cnt = ++c->prepare_reuse_cnt;
 	} else {
 		/* unit_free() cannot fail. Therefore add an object to atomic
-		 * llist. free_bulk() will drain it. Though free_llist_extra is
+		 * llist. reuse_bulk() will drain it. Though free_llist_extra is
 		 * a per-cpu list we have to use atomic llist_add here, since
 		 * it also can be interrupted by bpf nmi prog that does another
 		 * unit_free() into the same free_llist_extra.
-- 
2.29.2



* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-06  3:53 [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator Hou Tao
                   ` (2 preceding siblings ...)
  2023-06-06  3:53 ` [RFC PATCH bpf-next v4 3/3] bpf: Only reuse after one RCU GP in " Hou Tao
@ 2023-06-06 12:30 ` Hou Tao
  2023-06-06 21:04   ` Alexei Starovoitov
  3 siblings, 1 reply; 32+ messages in thread
From: Hou Tao @ 2023-06-06 12:30 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

Hi,

On 6/6/2023 11:53 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The implementation of v4 is mainly based on suggestions from Alexei [0].
> There are still pending problems in the current implementation, as shown
> by the benchmark results in patch #3, but since a long time has passed
> since the posting of v3, v4 is posted here for further discussion and
> more suggestions.
>
> The first problem is the huge memory usage compared with the bpf memory
> allocator which does immediate reuse:
>
> htab-mem-benchmark (reuse-after-RCU-GP):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | no_op              | 1159.18   | 0.99                | 0.99             |
> | overwrite          | 11.00     | 2288                | 4109             |
> | batch_add_batch_del| 8.86      | 1558                | 2763             |
> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
>
> htab-mem-benchmark (immediate-reuse):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | no_op              | 1160.66   | 0.99                | 1.00             |
> | overwrite          | 28.52     | 2.46                | 2.73             |
> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
>
> It seems the direct reason is the slow RCU grace period. During the
> benchmark, the elapsed time before the reuse_rcu() callback is invoked
> is about 100ms or even more (e.g., 2 seconds). I suspect that the global
> per-bpf-ma spin-lock and the irq-work running in the context of the
> freeing process increase the overhead of the bpf program: the running
> time of getpgid() is increased, context switches are slowed down and the
> RCU grace period grows [1], but I am still digging into it.
For the reuse-after-RCU-GP flavor, by removing the per-bpf-ma reusable
list (namely bpf_mem_shared_cache) and using a per-cpu reusable list (as
v3 did) instead, the memory usage of htab-mem-benchmark decreases a lot:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1165.38   | 0.97                | 1.00             |
| overwrite          | 17.25     | 626.41              | 781.82           |
| batch_add_batch_del| 11.51     | 398.56              | 500.29           |
| add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |

But the memory usage is still large compared with v3, and the elapsed
time of the reuse_rcu() callback is about 90~200ms. Compared with v3,
there are still two differences:
1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
accelerate the reuse of freed objects.
2) v3 uses a kworker instead of irq_work for the free procedure.

For 1), after using kmalloc() in the irq_work to allocate multiple
inflight RCU callbacks (namely reuse_rcu()), the memory usage decreases
a bit, but not by enough:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1247.00   | 0.97                | 1.00             |
| overwrite          | 16.56     | 490.18              | 557.17           |
| batch_add_batch_del| 11.31     | 276.32              | 360.89           |
| add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |

So it seems the large memory usage is due to the irq_work (reuse_bulk)
used for the free procedure. However, after increasing the threshold
for invoking the reuse_bulk irq_work (e.g., using 10 *
c->high_watermark), there is no big difference in the memory usage or
in the delayed time for the RCU callbacks. Perhaps the reason is that
although the number of reuse_bulk irq_work calls is reduced, the number
of alloc_bulk() irq_work calls is increased because there are no
reusable objects.

>
> Another problem is the performance degradation compared with immediate
> reuse and the output from perf report shown the per-bpf-ma spin-lock is a
> top-one hotspot:
>
> map_perf_test (reuse-after-RCU-GP)
> 0:hash_map_perf kmalloc 194677 events per sec
>
> map_perf_test (immediate reuse)
> 2:hash_map_perf kmalloc 384527 events per sec
>
> Considering the purpose of introducing per-bpf-ma reusable list is to
> handle the case in which the allocation and free are done on different
> CPUs (e.g., add_del_on_diff_cpu) and a per-cpu reuse list will be enough
> for overwrite & batch_add_batch_del cases. So maybe we could implement a
> hybrid of global reusable list and per-cpu reusable list and switch
> between these two kinds of list according to the history of allocation
> and free frequency.
>
> As ususal, suggestions and comments are always welcome.
>
> [0]: https://lore.kernel.org/bpf/20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local
> [1]: https://lore.kernel.org/bpf/1b64fc4e-d92e-de2f-4895-2e0c36427425@huaweicloud.com
>
> Change Log:
> v4:
>  * no kworker (Alexei)
>  * Use a global reusable list in bpf memory allocator (Alexei)
>  * Remove BPF_MA_FREE_AFTER_RCU_GP flag and do reuse-after-rcu-gp
>    defaultly in bpf memory allocator (Alexei)
>  * add benchmark results from map_perf_test (Alexei)
>
> v3: https://lore.kernel.org/bpf/20230429101215.111262-1-houtao@huaweicloud.com/
>  * add BPF_MA_FREE_AFTER_RCU_GP bpf memory allocator
>  * Update htab memory benchmark
>    * move the benchmark patch to the last patch
>    * remove array and useless bpf_map_lookup_elem(&array, ...) in bpf
>      programs
>    * add synchronization between addition CPU and deletion CPU for
>      add_del_on_diff_cpu case to prevent unnecessary loop
>    * add the benchmark result for "extra call_rcu + bpf ma"
>
> v2: https://lore.kernel.org/bpf/20230408141846.1878768-1-houtao@huaweicloud.com/
>  * add a benchmark for bpf memory allocator to compare between different
>    flavor of bpf memory allocator.
>  * implement BPF_MA_REUSE_AFTER_RCU_GP for bpf memory allocator.
>
> v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>
> Hou Tao (3):
>   bpf: Factor out a common helper free_all()
>   selftests/bpf: Add benchmark for bpf memory allocator
>   bpf: Only reuse after one RCU GP in bpf memory allocator
>
>  include/linux/bpf_mem_alloc.h                 |   4 +
>  kernel/bpf/memalloc.c                         | 385 ++++++++++++------
>  tools/testing/selftests/bpf/Makefile          |   3 +
>  tools/testing/selftests/bpf/bench.c           |   4 +
>  .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++
>  .../bpf/benchs/run_bench_htab_mem.sh          |  42 ++
>  .../selftests/bpf/progs/htab_mem_bench.c      | 135 ++++++
>  7 files changed, 809 insertions(+), 116 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-06 12:30 ` [RFC PATCH bpf-next v4 0/3] Handle immediate reuse " Hou Tao
@ 2023-06-06 21:04   ` Alexei Starovoitov
  2023-06-07  1:19     ` Hou Tao
  2023-06-07  8:42     ` Hou Tao
  0 siblings, 2 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-06 21:04 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
> Hi,
> 
> On 6/6/2023 11:53 AM, Hou Tao wrote:
> > From: Hou Tao <houtao1@huawei.com>
> >
> > Hi,
> >
> > The implementation of v4 is mainly based on suggestions from Alexi [0].
> > There are still pending problems for the current implementation as shown
> > in the benchmark result in patch #3, but there was a long time from the
> > posting of v3, so posting v4 here for further disscussions and more
> > suggestions.
> >
> > The first problem is the huge memory usage compared with bpf memory
> > allocator which does immediate reuse:
> >
> > htab-mem-benchmark (reuse-after-RCU-GP):
> > | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> > | --                 | --        | --                  | --               |
> > | no_op              | 1159.18   | 0.99                | 0.99             |
> > | overwrite          | 11.00     | 2288                | 4109             |
> > | batch_add_batch_del| 8.86      | 1558                | 2763             |
> > | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
> >
> > htab-mem-benchmark (immediate-reuse):
> > | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> > | --                 | --        | --                  | --               |
> > | no_op              | 1160.66   | 0.99                | 1.00             |
> > | overwrite          | 28.52     | 2.46                | 2.73             |
> > | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
> > | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
> >
> > It seems the direct reason is the slow RCU grace period. During
> > benchmark, the elapsed time when reuse_rcu() callback is called is about
> > 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma
> > spin-lock and the irq-work running in the contex of freeing process will
> > increase the running overhead of bpf program, the running time of
> > getpgid() is increased, the contex switch is slowed down and the RCU
> > grace period increases [1], but I am still diggin into it.
> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
> 
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | no_op              | 1165.38   | 0.97                | 1.00             |
> | overwrite          | 17.25     | 626.41              | 781.82           |
> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
> 
> But the memory usage is still large compared with v3 and the elapsed
> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
> are still two differences:
> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
> accelerate the reuse of freed objects.
> 2) v3 uses kworker instead of irq_work for free procedure.
> 
> For 1), after using kmalloc() in irq_work to allocate multiple inflight
> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
> but is not enough:
> 
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | no_op              | 1247.00   | 0.97                | 1.00             |
> | overwrite          | 16.56     | 490.18              | 557.17           |
> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
> 
> So it seems the large memory usage is due to irq_work (reuse_bulk) used
> for free procedure. However after increasing the threshold for invoking
> irq_work reuse_bulk (e.g., use 10 * c->high_watermark), but there is no
> big difference in the memory usage and the delayed time for RCU
> callbacks. Perhaps the reason is that although the number of  reuse_bulk
> irq_work calls is reduced but the time of alloc_bulk() irq_work calls is
> increased because there are no reusable objects.

The large memory usage is because the benchmark in patch 2 is abusing it.
It's doing one bpf_loop() over 16k elements (in case of 1 producer)
and 16k/8 loops for --producers=8.
That's 2k memory allocations that have to wait for RCU GP.
Of course that's a ton of memory.

As far as the implementation in patch 3 goes, please respin it asap and remove the *_tail optimization.
It makes the code hard to read and doesn't buy us anything.
Other than that the algorithm looks fine.

> > Another problem is the performance degradation compared with immediate
> > reuse and the output from perf report shown the per-bpf-ma spin-lock is a
> > top-one hotspot:

That's not what I see.
The hot spin_lock is in the generic htab code. Not in the ma.
I still believe per-bpf-ma spin-lock is fine.
The bench in patch 2 is measuring something that no real bpf prog cares about.

See how map_perf_test is doing:
        for (i = 0; i < 10; i++) {
                bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);

Even 10 map updates for the same map in a single bpf prog invocation is not realistic.
16k/8 is beyond any normal scenario.
There is no reason to optimize bpf_ma for the case of htab abuse.

> > map_perf_test (reuse-after-RCU-GP)
> > 0:hash_map_perf kmalloc 194677 events per sec
> >
> > map_perf_test (immediate reuse)
> > 2:hash_map_perf kmalloc 384527 events per sec

For some reason I cannot reproduce the slowdown with map_perf_test 4 8.
I see the same perf with/without patch 3.

I've applied patch 1.
Please respin with patch 2 doing no more than 10 map_updates under rcu lock
and remove *_tail optimization from patch 3.
Just do llist_for_each_safe() when you move elements from one list to another.
And let's brainstorm further.
Please do not delay.
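
For reference, a minimal sketch of the llist_for_each_safe() approach
suggested above (function and parameter names are made up here, not
taken from any posted patch):

/* Illustrative sketch only: move every object from @src to @dst without
 * tracking a tail pointer. llist_del_all() detaches the whole source
 * list atomically; __llist_add() is enough for @dst when the caller
 * already owns it.
 */
static void move_objs(struct llist_head *src, struct llist_head *dst)
{
	struct llist_node *llnode, *tmp;

	llist_for_each_safe(llnode, tmp, llist_del_all(src))
		__llist_add(llnode, dst);
}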

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 2/3] selftests/bpf: Add benchmark for bpf memory allocator
  2023-06-06  3:53 ` [RFC PATCH bpf-next v4 2/3] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
@ 2023-06-06 21:13   ` Alexei Starovoitov
  2023-06-07  1:32     ` Hou Tao
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-06 21:13 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, Hou Tao

On Mon, Jun 5, 2023 at 8:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> +static void htab_mem_read_mem_cgrp_file(const char *name, unsigned long *value)
> +{
> +       char buf[32];
> +       int fd;
> +
> +       fd = openat(ctx.fd, name, O_RDONLY);
> +       if (fd < 0) {
> +               fprintf(stderr, "no %s\n", name);
> +               *value = 0;
> +               return;
> +       }
> +
> +       buf[sizeof(buf) - 1] = 0;
> +       read(fd, buf, sizeof(buf) - 1);

Please check BPF CI. It's complaining about:

benchs/bench_htab_mem.c: In function ‘htab_mem_read_mem_cgrp_file’:
benchs/bench_htab_mem.c:290:2: error: ignoring return value of ‘read’,
declared with attribute warn_unused_result [-Werror=unused-result]
290 | read(fd, buf, sizeof(buf) - 1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


> +       *value = strtoull(buf, NULL, 0);
> +
> +       close(fd);
> +}
> +
> +static void htab_mem_measure(struct bench_res *res)
> +{
> +       res->hits = atomic_swap(&ctx.skel->bss->loop_cnt, 0);

This is missing:
res->hits /= env.producer_cnt;

Doubling the number of producers should double the perf metric.
Like -p 4 should be half the speed of -p 8.
In an ideal situation, of course.
Without this normalization -p 1 vs -p 2 numbers are meaningless.
Runs with different numbers of producers cannot be compared.
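
A sketch of the normalized version, reusing the ctx/skel/env names from
the quoted code (so the details may differ from what lands in v5):

static void htab_mem_measure(struct bench_res *res)
{
	res->hits = atomic_swap(&ctx.skel->bss->loop_cnt, 0);
	/* normalize so runs with different --producers are comparable */
	res->hits /= env.producer_cnt;
}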

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-06 21:04   ` Alexei Starovoitov
@ 2023-06-07  1:19     ` Hou Tao
  2023-06-07  1:39       ` Alexei Starovoitov
  2023-06-07  8:42     ` Hou Tao
  1 sibling, 1 reply; 32+ messages in thread
From: Hou Tao @ 2023-06-07  1:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

Hi,

On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> Hi,
>>>
>>> The implementation of v4 is mainly based on suggestions from Alexi [0].
>>> There are still pending problems for the current implementation as shown
>>> in the benchmark result in patch #3, but there was a long time from the
>>> posting of v3, so posting v4 here for further disscussions and more
>>> suggestions.
>>>
>>> The first problem is the huge memory usage compared with bpf memory
>>> allocator which does immediate reuse:
>>>
>>> htab-mem-benchmark (reuse-after-RCU-GP):
>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>> | --                 | --        | --                  | --               |
>>> | no_op              | 1159.18   | 0.99                | 0.99             |
>>> | overwrite          | 11.00     | 2288                | 4109             |
>>> | batch_add_batch_del| 8.86      | 1558                | 2763             |
>>> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
>>>
>>> htab-mem-benchmark (immediate-reuse):
>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>> | --                 | --        | --                  | --               |
>>> | no_op              | 1160.66   | 0.99                | 1.00             |
>>> | overwrite          | 28.52     | 2.46                | 2.73             |
>>> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
>>> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
>>>
>>> It seems the direct reason is the slow RCU grace period. During
>>> benchmark, the elapsed time when reuse_rcu() callback is called is about
>>> 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma
>>> spin-lock and the irq-work running in the contex of freeing process will
>>> increase the running overhead of bpf program, the running time of
>>> getpgid() is increased, the contex switch is slowed down and the RCU
>>> grace period increases [1], but I am still diggin into it.
>> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
>> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
>> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
>>
>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>> | --                 | --        | --                  | --               |
>> | no_op              | 1165.38   | 0.97                | 1.00             |
>> | overwrite          | 17.25     | 626.41              | 781.82           |
>> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
>> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
>>
>> But the memory usage is still large compared with v3 and the elapsed
>> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
>> are still two differences:
>> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
>> accelerate the reuse of freed objects.
>> 2) v3 uses kworker instead of irq_work for free procedure.
>>
>> For 1), after using kmalloc() in irq_work to allocate multiple inflight
>> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
>> but is not enough:
>>
>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>> | --                 | --        | --                  | --               |
>> | no_op              | 1247.00   | 0.97                | 1.00             |
>> | overwrite          | 16.56     | 490.18              | 557.17           |
>> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
>> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
>>
>> So it seems the large memory usage is due to irq_work (reuse_bulk) used
>> for free procedure. However after increasing the threshold for invoking
>> irq_work reuse_bulk (e.g., use 10 * c->high_watermark), but there is no
>> big difference in the memory usage and the delayed time for RCU
>> callbacks. Perhaps the reason is that although the number of  reuse_bulk
>> irq_work calls is reduced but the time of alloc_bulk() irq_work calls is
>> increased because there are no reusable objects.
> The large memory usage is because the benchmark in patch 2 is abusing it.
> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> and 16k/8 loops for --producers=8.
> That's 2k memory allocations that have to wait for RCU GP.
> Of course that's a ton of memory.
I don't agree with that. Because in v3 the benchmark is the same, but
both the performance and the memory usage are better than in v4. Even
compared with "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu
reusable list + multiple reuse_rcu() callbacks)" above, the memory
usage in v3 is still much smaller, as shown below. If the large memory
usage is due to the abuse in the benchmark, how do you explain the
memory usage in v3?

htab-mem-benchmark (reuse-after-rcu-gp v3)

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1199.16    | 0.97                 | 0.99              |
| overwrite           | 16.37      | 24.01                | 31.76             |
| batch_add_batch_del | 9.61       | 16.71                | 19.95             |
| add_del_on_diff_cpu | 3.62       | 22.93                | 37.02             |

>
> As far as implementation in patch 3 please respin it asap and remove *_tail optimization.
> It makes the code hard to read and doesn't buy us anything.
The reason I added a tail for each list is that there could be
thousands or even tens of thousands of elements in these lists, and
there is no need to spend CPU time traversing these lists one by one.
It may be a premature optimization. So let me remove the tails from
these lists first; I will try to add them back later and check whether
or not there is any performance improvement.
> Other than that the algorithm looks fine.
>
>>> Another problem is the performance degradation compared with immediate
>>> reuse and the output from perf report shown the per-bpf-ma spin-lock is a
>>> top-one hotspot:
> That's not what I see.
> Hot spin_lock is in generic htab code. Not it ma.
> I still believe per-bpf-ma spin-lock is fine.
> The bench in patch 2 is measuring something that no real bpf prog cares about.
>
> See how map_perf_test is doing:
>         for (i = 0; i < 10; i++) {
>                 bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
>
> Even 10 map updates for the same map in a single bpf prog invocation is not realistic.
> 16k/8 is beyond any normal scenario.
> There is no reason to optimize bpf_ma for the case of htab abuse.
I have a different view of the benchmark. Firstly, htab is not the
only user of the bpf memory allocator; secondly, we can't predict the
exact behavior of bpf programs, so I think stressing the bpf memory
allocator with various kinds of use cases is good for its broad usage.
>
>>> map_perf_test (reuse-after-RCU-GP)
>>> 0:hash_map_perf kmalloc 194677 events per sec
>>>
>>> map_perf_test (immediate reuse)
>>> 2:hash_map_perf kmalloc 384527 events per sec
> For some reason I cannot reproduce the slow down with map_perf_test 4 8.
> I see the same perf with/without patch 3.
I will double check my local setup and test results.
>
> I've applied patch 1.
> Please respin with patch 2 doing no more than 10 map_updates under rcu lock
> and remove *_tail optimization from patch 3.
> Just do llist_for_each_safe() when you move elements from one list to another.
> And let's brainstorm further.
> Please do not delay.
> .


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 2/3] selftests/bpf: Add benchmark for bpf memory allocator
  2023-06-06 21:13   ` Alexei Starovoitov
@ 2023-06-07  1:32     ` Hou Tao
  0 siblings, 0 replies; 32+ messages in thread
From: Hou Tao @ 2023-06-07  1:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, Hou Tao

Hi,

On 6/7/2023 5:13 AM, Alexei Starovoitov wrote:
> On Mon, Jun 5, 2023 at 8:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> +static void htab_mem_read_mem_cgrp_file(const char *name, unsigned long *value)
>> +{
>> +       char buf[32];
>> +       int fd;
>> +
>> +       fd = openat(ctx.fd, name, O_RDONLY);
>> +       if (fd < 0) {
>> +               fprintf(stderr, "no %s\n", name);
>> +               *value = 0;
>> +               return;
>> +       }
>> +
>> +       buf[sizeof(buf) - 1] = 0;
>> +       read(fd, buf, sizeof(buf) - 1);
> Please BPF CI. It's complaining about:
I didn't know that an RFC patch would go through BPF CI. Will check
the other checks and tests in CI.
>
> benchs/bench_htab_mem.c: In function ‘htab_mem_read_mem_cgrp_file’:
> benchs/bench_htab_mem.c:290:2: error: ignoring return value of ‘read’,
> declared with attribute warn_unused_result [-Werror=unused-result]
> 290 | read(fd, buf, sizeof(buf) - 1);
Will fix in v5.
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
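
One possible shape of the fix, checking read()'s return value before
parsing (a sketch only; the actual change in v5 may differ):

static void htab_mem_read_mem_cgrp_file(const char *name, unsigned long *value)
{
	char buf[32];
	ssize_t got;
	int fd;

	*value = 0;
	fd = openat(ctx.fd, name, O_RDONLY);
	if (fd < 0) {
		fprintf(stderr, "no %s\n", name);
		return;
	}

	/* only parse what was actually read */
	got = read(fd, buf, sizeof(buf) - 1);
	if (got < 0)
		got = 0;
	buf[got] = 0;
	*value = strtoull(buf, NULL, 0);

	close(fd);
}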
>
>
>> +       *value = strtoull(buf, NULL, 0);
>> +
>> +       close(fd);
>> +}
>> +
>> +static void htab_mem_measure(struct bench_res *res)
>> +{
>> +       res->hits = atomic_swap(&ctx.skel->bss->loop_cnt, 0);
> This is missing:
> res->hits /= env.producer_cnt;
>
> Doubling the number of producers should double the perf metric.
> Like -p 4 should be half the speed of -p 8.
> In an ideal situation, of course.
> Without this normalization -p 1 vs -p 2 numbers are meaningless.
> Runs with different numbers of producers cannot be compared.
It is a good idea to compare the metric numbers between different
producers. Will do it in v5.
> .


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07  1:19     ` Hou Tao
@ 2023-06-07  1:39       ` Alexei Starovoitov
  2023-06-07  7:56         ` Hou Tao
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-07  1:39 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

On Tue, Jun 6, 2023 at 6:19 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> > On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/6/2023 11:53 AM, Hou Tao wrote:
> >>> From: Hou Tao <houtao1@huawei.com>
> >>>
> >>> Hi,
> >>>
> >>> The implementation of v4 is mainly based on suggestions from Alexi [0].
> >>> There are still pending problems for the current implementation as shown
> >>> in the benchmark result in patch #3, but there was a long time from the
> >>> posting of v3, so posting v4 here for further disscussions and more
> >>> suggestions.
> >>>
> >>> The first problem is the huge memory usage compared with bpf memory
> >>> allocator which does immediate reuse:
> >>>
> >>> htab-mem-benchmark (reuse-after-RCU-GP):
> >>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >>> | --                 | --        | --                  | --               |
> >>> | no_op              | 1159.18   | 0.99                | 0.99             |
> >>> | overwrite          | 11.00     | 2288                | 4109             |
> >>> | batch_add_batch_del| 8.86      | 1558                | 2763             |
> >>> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
> >>>
> >>> htab-mem-benchmark (immediate-reuse):
> >>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >>> | --                 | --        | --                  | --               |
> >>> | no_op              | 1160.66   | 0.99                | 1.00             |
> >>> | overwrite          | 28.52     | 2.46                | 2.73             |
> >>> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
> >>> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
> >>>
> >>> It seems the direct reason is the slow RCU grace period. During
> >>> benchmark, the elapsed time when reuse_rcu() callback is called is about
> >>> 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma
> >>> spin-lock and the irq-work running in the contex of freeing process will
> >>> increase the running overhead of bpf program, the running time of
> >>> getpgid() is increased, the contex switch is slowed down and the RCU
> >>> grace period increases [1], but I am still diggin into it.
> >> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
> >> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
> >> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
> >>
> >> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
> >> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >> | --                 | --        | --                  | --               |
> >> | no_op              | 1165.38   | 0.97                | 1.00             |
> >> | overwrite          | 17.25     | 626.41              | 781.82           |
> >> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
> >> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
> >>
> >> But the memory usage is still large compared with v3 and the elapsed
> >> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
> >> are still two differences:
> >> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
> >> accelerate the reuse of freed objects.
> >> 2) v3 uses kworker instead of irq_work for free procedure.
> >>
> >> For 1), after using kmalloc() in irq_work to allocate multiple inflight
> >> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
> >> but is not enough:
> >>
> >> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
> >> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >> | --                 | --        | --                  | --               |
> >> | no_op              | 1247.00   | 0.97                | 1.00             |
> >> | overwrite          | 16.56     | 490.18              | 557.17           |
> >> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
> >> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
> >>
> >> So it seems the large memory usage is due to irq_work (reuse_bulk) used
> >> for free procedure. However after increasing the threshold for invoking
> >> irq_work reuse_bulk (e.g., use 10 * c->high_watermark), but there is no
> >> big difference in the memory usage and the delayed time for RCU
> >> callbacks. Perhaps the reason is that although the number of  reuse_bulk
> >> irq_work calls is reduced but the time of alloc_bulk() irq_work calls is
> >> increased because there are no reusable objects.
> > The large memory usage is because the benchmark in patch 2 is abusing it.
> > It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> > and 16k/8 loops for --producers=8.
> > That's 2k memory allocations that have to wait for RCU GP.
> > Of course that's a ton of memory.
> I don't agree that. Because in v3, the benchmark is the same, but both
> the performance and the memory usage are better than v4. Even compared
> with  "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
> multiple reuse_rcu() callbacks)" above, the memory usage in v3 is still
> much smaller as shown below. If the large memory usage is due to the
> abuse in benchmark, how do you explain the memory usage in v3 ?

There could have been implementation bugs or whatever else.
The main point is the bench test is not realistic and should not be
used to make design decisions.

> The reason I added tail for each list is that there could be thousands
> even ten thousands elements in these lists and there is no need to spend
> CPU time to traversal these list one by one. It maybe a premature
> optimization. So let me remove tails from these list first and I will
> try to add these tails back later and check whether or not there is any
> performance improvement.

There will be thousands of elements only because the bench test is wrong.
It's doing something no real prog would do.

> I have a different view for the benchmark. Firstly htab is not the only
> user of bpf memory allocator, secondly we can't predict the exact
> behavior of bpf programs, so I think to stress bpf memory allocator for
> various kinds of use case is good for its broad usage.

It is not a stress test. It's an abuse.
A stress test would be something that can happen in practice.
Doing thousands of map_updates in a forever loop is not something
useful code would do.
For example, call_rcu_tasks_trace is not designed to be called millions
of times a second. It's an anti-pattern and the rcu core won't be
optimized to do so.
rcu, srcu, rcu_task_trace have different usage patterns.
The programmer has to correctly pick one depending on the use case.
Same with bpf htab. If somebody has a real need to do thousands of
updates under the rcu lock, they should be using a preallocated map and
deal with immediate reuse.
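
For reference, a preallocated hash map is simply the default
BPF_MAP_TYPE_HASH definition, i.e. without BPF_F_NO_PREALLOC in
map_flags; a minimal sketch (key/value types and size are only
illustrative, and the usual vmlinux.h/bpf_helpers.h includes are
assumed):

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, u32);
	__type(value, u64);
} prealloc_htab SEC(".maps");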

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07  1:39       ` Alexei Starovoitov
@ 2023-06-07  7:56         ` Hou Tao
  0 siblings, 0 replies; 32+ messages in thread
From: Hou Tao @ 2023-06-07  7:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

Hi,

On 6/7/2023 9:39 AM, Alexei Starovoitov wrote:
> On Tue, Jun 6, 2023 at 6:19 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
>>> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> Hi,
>>>>>
>>>>> The implementation of v4 is mainly based on suggestions from Alexi [0].
>>>>> There are still pending problems for the current implementation as shown
>>>>> in the benchmark result in patch #3, but there was a long time from the
>>>>> posting of v3, so posting v4 here for further disscussions and more
>>>>> suggestions.
>>>>>
>>>>> The first problem is the huge memory usage compared with bpf memory
>>>>> allocator which does immediate reuse:
>>>>>
>>>>> htab-mem-benchmark (reuse-after-RCU-GP):
>>>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>>>> | --                 | --        | --                  | --               |
>>>>> | no_op              | 1159.18   | 0.99                | 0.99             |
>>>>> | overwrite          | 11.00     | 2288                | 4109             |
>>>>> | batch_add_batch_del| 8.86      | 1558                | 2763             |
>>>>> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
>>>>>
>>>>> htab-mem-benchmark (immediate-reuse):
>>>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>>>> | --                 | --        | --                  | --               |
>>>>> | no_op              | 1160.66   | 0.99                | 1.00             |
>>>>> | overwrite          | 28.52     | 2.46                | 2.73             |
>>>>> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
>>>>> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
>>>>>
>>>>> It seems the direct reason is the slow RCU grace period. During
>>>>> benchmark, the elapsed time when reuse_rcu() callback is called is about
>>>>> 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma
>>>>> spin-lock and the irq-work running in the contex of freeing process will
>>>>> increase the running overhead of bpf program, the running time of
>>>>> getpgid() is increased, the contex switch is slowed down and the RCU
>>>>> grace period increases [1], but I am still diggin into it.
>>>> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
>>>> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
>>>> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
>>>>
>>>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
>>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>>> | --                 | --        | --                  | --               |
>>>> | no_op              | 1165.38   | 0.97                | 1.00             |
>>>> | overwrite          | 17.25     | 626.41              | 781.82           |
>>>> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
>>>> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
>>>>
>>>> But the memory usage is still large compared with v3 and the elapsed
>>>> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
>>>> are still two differences:
>>>> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
>>>> accelerate the reuse of freed objects.
>>>> 2) v3 uses kworker instead of irq_work for free procedure.
>>>>
>>>> For 1), after using kmalloc() in irq_work to allocate multiple inflight
>>>> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
>>>> but is not enough:
>>>>
>>>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
>>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>>> | --                 | --        | --                  | --               |
>>>> | no_op              | 1247.00   | 0.97                | 1.00             |
>>>> | overwrite          | 16.56     | 490.18              | 557.17           |
>>>> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
>>>> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
>>>>
>>>> So it seems the large memory usage is due to irq_work (reuse_bulk) used
>>>> for free procedure. However after increasing the threshold for invoking
>>>> irq_work reuse_bulk (e.g., use 10 * c->high_watermark), but there is no
>>>> big difference in the memory usage and the delayed time for RCU
>>>> callbacks. Perhaps the reason is that although the number of  reuse_bulk
>>>> irq_work calls is reduced but the time of alloc_bulk() irq_work calls is
>>>> increased because there are no reusable objects.
>>> The large memory usage is because the benchmark in patch 2 is abusing it.
>>> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
>>> and 16k/8 loops for --producers=8.
>>> That's 2k memory allocations that have to wait for RCU GP.
>>> Of course that's a ton of memory.
>> I don't agree that. Because in v3, the benchmark is the same, but both
>> the performance and the memory usage are better than v4. Even compared
>> with  "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
>> multiple reuse_rcu() callbacks)" above, the memory usage in v3 is still
>> much smaller as shown below. If the large memory usage is due to the
>> abuse in benchmark, how do you explain the memory usage in v3 ?
> There could have been implementation bugs or whatever else.
> The main point is the bench test is not realistic and should not be
> used to make design decisions.
I see your point. I will continue to debug the memory usage difference
between v3 and v4.
>
>> The reason I added tail for each list is that there could be thousands
>> even ten thousands elements in these lists and there is no need to spend
>> CPU time to traversal these list one by one. It maybe a premature
>> optimization. So let me remove tails from these list first and I will
>> try to add these tails back later and check whether or not there is any
>> performance improvement.
> There will be thousands of elements only because the bench test is wrong.
> It's doing something no real prog would do.
I don't think so. Let's consider the per-cpu list first. Assume the
normal RCU grace period is about 30ms and we are tracing the IO latency
of a normal SSD. The iops is about 176K per second, so before one RCU
GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
For the per-ma list, when the number of CPUs increases, it is easy for
the list to contain thousands of elements.
>
>> I have a different view for the benchmark. Firstly htab is not the only
>> user of bpf memory allocator, secondly we can't predict the exact
>> behavior of bpf programs, so I think to stress bpf memory allocator for
>> various kinds of use case is good for its broad usage.
> It is not a stress test. It's an abuse.
> A stress test would be something that can happen in practice.
> Doing thousands map_updates in a forever loop is not something
> useful code would do.
> For example call_rcu_tasks_trace is not design to be called millions
> times a second. It's an anti-pattern and rcu core won't be optimized to do so.
> rcu, srcu, rcu_task_trace have different usage patterns.
> The programmer has to correctly pick one depending on the use case.
> Same with bpf htab. If somebody has a real need to do thousands
> updates under rcu lock they should be using preallocated map and deal
> with immediate reuse.
Before I post v5, I want to know the reason why the per-bpf-ma list is
introduced. Previously, I thought it was used to handle the case in
which allocation and freeing are done on different CPUs. And as we can
see from the benchmark results above and in v3, the performance and the
memory usage of v4 for add_del_on_diff_cpu are better than v3. But now
I am not sure, because as you mentioned above, it is used to reduce the
calling frequency of call_rcu_tasks_trace(). So could you tell me the
main reason for the per-bpf-ma list? As shown in the benchmark results
above, using a per-cpu reuse list (namely htab-mem-benchmark
(reuse-after-RCU-GP + per-cpu reusable list)) has better performance
and memory usage compared with the per-ma list (htab-mem-benchmark
(reuse-after-RCU-GP)). If we just want to reduce the calling frequency
of call_rcu_tasks_trace(), we could make
bpf_mem_shared_cache->reuse_ready_head per-cpu and leave wait_for_free
per-bpf-ma.
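
To make the question concrete, the hybrid layout described above would
roughly look like the sketch below (field placement only; names follow
this discussion, not a posted patch):

/* per-cpu: objects whose RCU GP has already passed, reusable locally */
struct bpf_mem_cache {
	/* ... existing per-cpu fields ... */
	struct llist_head reuse_ready_head;
};

/* per-bpf-ma: objects waiting for an RCU tasks trace GP before kfree */
struct bpf_mem_shared_cache {
	raw_spinlock_t reuse_lock;
	struct llist_head wait_for_free;
	struct rcu_head rcu;
};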


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-06 21:04   ` Alexei Starovoitov
  2023-06-07  1:19     ` Hou Tao
@ 2023-06-07  8:42     ` Hou Tao
  2023-06-07 17:52       ` Alexei Starovoitov
  1 sibling, 1 reply; 32+ messages in thread
From: Hou Tao @ 2023-06-07  8:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

Hi,

On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> Hi,
>>>
>>> The implementation of v4 is mainly based on suggestions from Alexi [0].
>>> There are still pending problems for the current implementation as shown
>>> in the benchmark result in patch #3, but there was a long time from the
>>> posting of v3, so posting v4 here for further disscussions and more
>>> suggestions.
>>>
>>> The first problem is the huge memory usage compared with bpf memory
>>> allocator which does immediate reuse:
SNIP
> The large memory usage is because the benchmark in patch 2 is abusing it.
> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> and 16k/8 loops for --producers=8.
> That's 2k memory allocations that have to wait for RCU GP.
> Of course that's a ton of memory.
>
> As far as implementation in patch 3 please respin it asap and remove *_tail optimization.
> It makes the code hard to read and doesn't buy us anything.
> Other than that the algorithm looks fine.
>
>>> Another problem is the performance degradation compared with immediate
>>> reuse and the output from perf report shown the per-bpf-ma spin-lock is a
>>> top-one hotspot:
> That's not what I see.
> Hot spin_lock is in generic htab code. Not it ma.
> I still believe per-bpf-ma spin-lock is fine.
> The bench in patch 2 is measuring something that no real bpf prog cares about.
>
> See how map_perf_test is doing:
>         for (i = 0; i < 10; i++) {
>                 bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
>
> Even 10 map updates for the same map in a single bpf prog invocation is not realistic.
> 16k/8 is beyond any normal scenario.
> There is no reason to optimize bpf_ma for the case of htab abuse.
>
>>> map_perf_test (reuse-after-RCU-GP)
>>> 0:hash_map_perf kmalloc 194677 events per sec
>>>
>>> map_perf_test (immediate reuse)
>>> 2:hash_map_perf kmalloc 384527 events per sec
> For some reason I cannot reproduce the slow down with map_perf_test 4 8.
> I see the same perf with/without patch 3.
As said in the commit message, the command line for the test is
"./map_perf_test 4 8 16384", because the default max_entries is 1000.
If using the default max_entries and the number of CPUs is greater than
15, use_percpu_counter will be false.

I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
test. For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
there is obvious performance degradation.

Before reuse-after-rcu-gp:
[root@hello bpf]# ./map_perf_test 4 8
5:hash_map_perf kmalloc 358498 events per sec
4:hash_map_perf kmalloc 306351 events per sec
1:hash_map_perf kmalloc 299175 events per sec
3:hash_map_perf kmalloc 297689 events per sec
6:hash_map_perf kmalloc 299839 events per sec
2:hash_map_perf kmalloc 292286 events per sec
7:hash_map_perf kmalloc 278138 events per sec
0:hash_map_perf kmalloc 265031 events per sec

[root@hello bpf]# ./map_perf_test 4 8 16384
2:hash_map_perf kmalloc 359201 events per sec
6:hash_map_perf kmalloc 302475 events per sec
3:hash_map_perf kmalloc 298583 events per sec
7:hash_map_perf kmalloc 301594 events per sec
0:hash_map_perf kmalloc 295210 events per sec
4:hash_map_perf kmalloc 230496 events per sec
5:hash_map_perf kmalloc 163530 events per sec
1:hash_map_perf kmalloc 153520 events per sec

After reuse-after-rcu-gp:
[root@hello bpf]# ./map_perf_test 4 8
1:hash_map_perf kmalloc 203252 events per sec
2:hash_map_perf kmalloc 181777 events per sec
6:hash_map_perf kmalloc 183467 events per sec
4:hash_map_perf kmalloc 182590 events per sec
0:hash_map_perf kmalloc 180840 events per sec
3:hash_map_perf kmalloc 179875 events per sec
5:hash_map_perf kmalloc 152250 events per sec
7:hash_map_perf kmalloc 137500 events per sec

[root@hello bpf]# ./map_perf_test 4 8 16384
4:hash_map_perf kmalloc 203983 events per sec
5:hash_map_perf kmalloc 197902 events per sec
2:hash_map_perf kmalloc 184693 events per sec
3:hash_map_perf kmalloc 185341 events per sec
1:hash_map_perf kmalloc 183064 events per sec
7:hash_map_perf kmalloc 181148 events per sec
0:hash_map_perf kmalloc 178520 events per sec
6:hash_map_perf kmalloc 179340 events per sec

I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
results for "./map_perf_test 4 8" are similar, but there is obvious
performance degradation for "./map_perf_test 4 8 16384".

Before reuse-after-rcu-gp:

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8
1:hash_map_perf kmalloc 41711 events per sec
3:hash_map_perf kmalloc 41352 events per sec
4:hash_map_perf kmalloc 41352 events per sec
0:hash_map_perf kmalloc 41008 events per sec
7:hash_map_perf kmalloc 41086 events per sec
5:hash_map_perf kmalloc 41038 events per sec
2:hash_map_perf kmalloc 40971 events per sec
6:hash_map_perf kmalloc 41008 events per sec

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
1:hash_map_perf kmalloc 388088 events per sec
7:hash_map_perf kmalloc 391843 events per sec
6:hash_map_perf kmalloc 387901 events per sec
3:hash_map_perf kmalloc 356096 events per sec
4:hash_map_perf kmalloc 356836 events per sec
2:hash_map_perf kmalloc 349728 events per sec
0:hash_map_perf kmalloc 345792 events per sec
5:hash_map_perf kmalloc 346742 events per sec

After reuse-after-rcu-gp:

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8
4:hash_map_perf kmalloc 42667 events per sec
1:hash_map_perf kmalloc 42206 events per sec
5:hash_map_perf kmalloc 42264 events per sec
6:hash_map_perf kmalloc 42196 events per sec
7:hash_map_perf kmalloc 42142 events per sec
2:hash_map_perf kmalloc 42028 events per sec
0:hash_map_perf kmalloc 41933 events per sec
3:hash_map_perf kmalloc 41986 events per sec

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
5:hash_map_perf kmalloc 655628 events per sec
7:hash_map_perf kmalloc 650691 events per sec
2:hash_map_perf kmalloc 624384 events per sec
1:hash_map_perf kmalloc 615011 events per sec
3:hash_map_perf kmalloc 618335 events per sec
4:hash_map_perf kmalloc 624072 events per sec
6:hash_map_perf kmalloc 628559 events per sec
0:hash_map_perf kmalloc 585384 events per sec

So could you please double check your setup and rerun map_perf_test?
If there is no performance degradation, could you please share your
setup and your kernel config file?
>
> I've applied patch 1.
> Please respin with patch 2 doing no more than 10 map_updates under rcu lock
> and remove *_tail optimization from patch 3.
> Just do llist_for_each_safe() when you move elements from one list to another.
> And let's brainstorm further.
> Please do not delay.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07  8:42     ` Hou Tao
@ 2023-06-07 17:52       ` Alexei Starovoitov
  2023-06-07 20:50         ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-07 17:52 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> As said in the commit message, the command line for test is
> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> using default max_entries and the number of CPUs is greater than 15,
> use_percpu_counter will be false.

Right. percpu or not depends on number of cpus.

> 
> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> there are obvious performance degradation.
...
> [root@hello bpf]# ./map_perf_test 4 8 16384
> 2:hash_map_perf kmalloc 359201 events per sec
..
> [root@hello bpf]# ./map_perf_test 4 8 16384
> 4:hash_map_perf kmalloc 203983 events per sec

this is indeed a degradation in a VM.

> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> performances for "./map_perf_test 4 8" are similar, but there is obvious
> performance degradation for "./map_perf_test 4 8 16384"

but... a degradation?

> Before reuse-after-rcu-gp:
>
> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> 1:hash_map_perf kmalloc 388088 events per sec
...
> After reuse-after-rcu-gp:
> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> 5:hash_map_perf kmalloc 655628 events per sec

This is a big improvement :) Not a degradation.
You always have to double check the numbers with perf report.

> So could you please double check your setup and rerun map_perf_test ? If
> there is no performance degradation, could you please share your setup
> and your kernel configure file ?

I'm testing on a normal no-debug kernel. No kasan. No lockdep. HZ=1000.
Playing with it a bit more I found something interesting:
map_perf_test 4 8 16348
before/after has too much noise to be conclusive.

So I did
map_perf_test 4 8 16348 1000000

and now I see a significant degradation from patch 3.
It drops from 800k to 200k.
And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
The following hack addresses most of the perf degradation:

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index fea1cb0c78bb..eeadc9359097 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
        alloc = 0;
        head = NULL;
        tail = NULL;
-       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
+       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
        while (alloc < cnt) {
                obj = __llist_del_first(&sc->reuse_ready_head);
                if (obj) {
@@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
                alloc++;
        }
        raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
+       }

        if (alloc) {
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
@@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
                sc->reuse_ready_tail = NULL;
                WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
                __llist_add_batch(head, tail, &sc->wait_for_free);
+               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
                call_rcu_tasks_trace(&sc->rcu, free_rcu);
+       } else {
+               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
        }
-       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
 }

It now drops from 800k to 450k.
And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.

Answering your other email:

> I see your point. I will continue to debug the memory usage difference
> between v3 and v4.

imo it's a waste of time to continue analyzing performance based on bench in patch 2.

> I don't think so. Let's considering the per-cpu list first. Assume the
> normal RCU grace period is about 30ms and we are tracing the IO latency
> of a normal SSD. The iops is about 176K per seconds, so before one RCU
> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> For the per-ma list, when the number of CPUs increased, it is easy to
> make the list contain thousands of elements.

That would be true only if there were no scheduling events in all of 176K ops.
Which is not the case.
I'm not sure why you're saying that RCU GP is 30ms.
In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
Every sched event is sort-of implicit rcu_read_lock/unlock.
Network and block IO doesn't process 176K packets without resched.
Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.

For small size buckets low_watermark=32 and high=96.
We typically move 32 elements at a time from one list to another.
A bunch of elements may be sitting in free_by_rcu, and moving them to
waiting_for_gp is not instant, but once __free_rcu_tasks_trace is
called we need to take elements from waiting_for_gp one at a time and
kfree them one at a time.
So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
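
For reference, the per-object free loop is roughly the free_all()
helper factored out in patch 1 (sketched from the description here, so
the exact code may differ):

static void free_all(struct llist_node *llnode, bool percpu)
{
	struct llist_node *pos, *t;

	llist_for_each_safe(pos, t, llnode)
		free_one(pos, percpu);	/* one kfree()/free_percpu() per object */
}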

> Before I post v5, I want to know the reason why per-bpf-ma list is
>introduced. Previously, I though it was used to handle the case in which
> allocation and freeing are done on different CPUs. 

Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.

> And as we can see
> from the benchmark result above and in v3, the performance and the
> memory usage of v4 for add_del_on_diff_cpu is better than v3. 

bench from patch 2 is invalid. Hence no conclusion can be made.

So far the only bench we can trust and analyze is map_perf_test.
Please make the bench in patch 2 yield the cpu after a few updates.
Earlier I suggested sticking to 10, but since NAPI can do 64 at a time,
64 updates is realistic too. A thousand is not.
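
A sketch of what such a producer could look like (map layout, section
name and the constant 64 are illustrative; the real htab_mem_bench.c
may be organized differently):

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__uint(max_entries, 16384);
	__type(key, u32);
	__type(value, u64);
} htab SEC(".maps");

long loop_cnt = 0;

static int overwrite_one(u32 i, void *ctx)
{
	u32 key = i;
	u64 val = i;

	bpf_map_update_elem(&htab, &key, &val, BPF_ANY);
	return 0;
}

SEC("tp/syscalls/sys_enter_getpgid")
int overwrite(void *ctx)
{
	/* a bounded number of updates per program run, not thousands */
	bpf_loop(64, overwrite_one, NULL, 0);
	__sync_fetch_and_add(&loop_cnt, 1);
	return 0;
}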

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07 17:52       ` Alexei Starovoitov
@ 2023-06-07 20:50         ` Alexei Starovoitov
  2023-06-07 23:23           ` Alexei Starovoitov
  2023-06-08  1:51           ` Hou Tao
  0 siblings, 2 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-07 20:50 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > As said in the commit message, the command line for test is
> > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > using default max_entries and the number of CPUs is greater than 15,
> > use_percpu_counter will be false.
>
> Right. percpu or not depends on number of cpus.
>
> >
> > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > there are obvious performance degradation.
> ...
> > [root@hello bpf]# ./map_perf_test 4 8 16384
> > 2:hash_map_perf kmalloc 359201 events per sec
> ..
> > [root@hello bpf]# ./map_perf_test 4 8 16384
> > 4:hash_map_perf kmalloc 203983 events per sec
>
> this is indeed a degration in a VM.
>
> > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > performance degradation for "./map_perf_test 4 8 16384"
>
> but... a degradation?
>
> > Before reuse-after-rcu-gp:
> >
> > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > 1:hash_map_perf kmalloc 388088 events per sec
> ...
> > After reuse-after-rcu-gp:
> > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > 5:hash_map_perf kmalloc 655628 events per sec
>
> This is a big improvement :) Not a degration.
> You always have to double check the numbers with perf report.
>
> > So could you please double check your setup and rerun map_perf_test ? If
> > there is no performance degradation, could you please share your setup
> > and your kernel configure file ?
>
> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> Playing with it a bit more I found something interesting:
> map_perf_test 4 8 16348
> before/after has too much noise to be conclusive.
>
> So I did
> map_perf_test 4 8 16348 1000000
>
> and now I see significant degration from patch 3.
> It drops from 800k to 200k.
> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> The following hack addresses most of the perf degradtion:
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index fea1cb0c78bb..eeadc9359097 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>         alloc = 0;
>         head = NULL;
>         tail = NULL;
> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>         while (alloc < cnt) {
>                 obj = __llist_del_first(&sc->reuse_ready_head);
>                 if (obj) {
> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>                 alloc++;
>         }
>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> +       }
>
>         if (alloc) {
>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>                 sc->reuse_ready_tail = NULL;
>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>                 __llist_add_batch(head, tail, &sc->wait_for_free);
> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> +       } else {
> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>         }
> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>  }
>
> It now drops from 800k to 450k.
> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.

Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.

I also noticed that the overhead of the shared reuse_ready list
comes both from the contended lock and from cache misses
when one cpu pushes to the list after the RCU GP and another
cpu removes from it.
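
A rough sketch of the split being described (field names are illustrative,
not the actual patch): keep the normal-RCU stage per-cpu and use a single
per-bpf-ma list only for the slower RCU-tasks-trace stage.

#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct bpf_mem_cache_sketch {                   /* per-cpu */
        struct llist_head free_by_rcu;          /* freed, not yet batched */
        struct llist_head waiting_for_gp;       /* waiting for a normal RCU GP */
        struct llist_head reuse_ready;          /* GP elapsed, can be handed out again */
        struct rcu_head rcu;                    /* for call_rcu(..., reuse_rcu) */
};

struct bpf_mem_alloc_shared_sketch {            /* per-bpf-ma */
        raw_spinlock_t lock;
        struct llist_head wait_for_free;        /* waiting for RCU-tasks-trace GP */
        struct rcu_head rcu;                    /* for call_rcu_tasks_trace(..., free_rcu) */
};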

Also, the low/batch/high watermarks are all wrong in patch 3:
low=32 and high=96 make no sense when it's not a single list.
I'm experimenting with 32 for all three heuristics.

Another thing I noticed is that the per-cpu prepare_reuse and free_by_rcu
are redundant:
unit_free() can push into free_by_rcu directly,
then reuse_bulk() can fill it up with free_llist_extra and
move them into waiting_for_gp.

All these _tail optimizations obscure the code and make it hard
to notice these issues.

> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>
> Answering your other email:
>
> > I see your point. I will continue to debug the memory usage difference
> > between v3 and v4.
>
> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>
> > I don't think so. Let's considering the per-cpu list first. Assume the
> > normal RCU grace period is about 30ms and we are tracing the IO latency
> > of a normal SSD. The iops is about 176K per seconds, so before one RCU
> > GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> > For the per-ma list, when the number of CPUs increased, it is easy to
> > make the list contain thousands of elements.
>
> That would be true only if there were no scheduling events in all of 176K ops.
> Which is not the case.
> I'm not sure why you're saying that RCU GP is 30ms.
> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> Every sched event is sort-of implicit rcu_read_lock/unlock.
> Network and block IO doesn't process 176K packets without resched.
> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>
> For small size buckets low_watermark=32 and high=96.
> We typically move 32 elements at a time from one list to another.
> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
> is not instant, but once __free_rcu_tasks_trace is called we need to take
> elements from waiting_for_gp one at a time and kfree it one at a time.
> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>
> > Before I post v5, I want to know the reason why per-bpf-ma list is
> >introduced. Previously, I though it was used to handle the case in which
> > allocation and freeing are done on different CPUs.
>
> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>
> > And as we can see
> > from the benchmark result above and in v3, the performance and the
> > memory usage of v4 for add_del_on_diff_cpu is better than v3.
>
> bench from patch 2 is invalid. Hence no conclusion can be made.
>
> So far the only bench we can trust and analyze is map_perf_test.
> Please make bench in patch 2 yield the cpu after few updates.
> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
> 64 updates is realistic too. A thousand is not.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07 20:50         ` Alexei Starovoitov
@ 2023-06-07 23:23           ` Alexei Starovoitov
  2023-06-07 23:30             ` Paul E. McKenney
  2023-06-08  1:57             ` Hou Tao
  2023-06-08  1:51           ` Hou Tao
  1 sibling, 2 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-07 23:23 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > As said in the commit message, the command line for test is
> > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > using default max_entries and the number of CPUs is greater than 15,
> > > use_percpu_counter will be false.
> >
> > Right. percpu or not depends on number of cpus.
> >
> > >
> > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > there are obvious performance degradation.
> > ...
> > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > 2:hash_map_perf kmalloc 359201 events per sec
> > ..
> > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > 4:hash_map_perf kmalloc 203983 events per sec
> >
> > this is indeed a degration in a VM.
> >
> > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > performance degradation for "./map_perf_test 4 8 16384"
> >
> > but... a degradation?
> >
> > > Before reuse-after-rcu-gp:
> > >
> > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > 1:hash_map_perf kmalloc 388088 events per sec
> > ...
> > > After reuse-after-rcu-gp:
> > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > 5:hash_map_perf kmalloc 655628 events per sec
> >
> > This is a big improvement :) Not a degration.
> > You always have to double check the numbers with perf report.
> >
> > > So could you please double check your setup and rerun map_perf_test ? If
> > > there is no performance degradation, could you please share your setup
> > > and your kernel configure file ?
> >
> > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > Playing with it a bit more I found something interesting:
> > map_perf_test 4 8 16348
> > before/after has too much noise to be conclusive.
> >
> > So I did
> > map_perf_test 4 8 16348 1000000
> >
> > and now I see significant degration from patch 3.
> > It drops from 800k to 200k.
> > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > The following hack addresses most of the perf degradtion:
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index fea1cb0c78bb..eeadc9359097 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >         alloc = 0;
> >         head = NULL;
> >         tail = NULL;
> > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> >         while (alloc < cnt) {
> >                 obj = __llist_del_first(&sc->reuse_ready_head);
> >                 if (obj) {
> > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >                 alloc++;
> >         }
> >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > +       }
> >
> >         if (alloc) {
> >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> >                 sc->reuse_ready_tail = NULL;
> >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > +       } else {
> > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >         }
> > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >  }
> >
> > It now drops from 800k to 450k.
> > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
>
> Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.

An update...

I tweaked patch 3 to do per-cpu reuse_ready and it addressed
the lock contention, but the cache miss on
__llist_del_first(&c->reuse_ready_head);
was still very high and performance was still at 450k, as
with the simple hack above.

Then I removed some of the _tail optimizations and added counters
to these llists.
To my surprise
map_perf_test 4 1 16348 1000000
was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
and ~400k sitting in reuse_ready_head.

Then I noticed that we should be doing:
call_rcu_hurry(&c->rcu, reuse_rcu);
instead of call_rcu(),
but my config didn't have RCU_LAZY, so that didn't help.
Obviously we cannot allow such a huge number of elements to sit
in these linked lists.
The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
To unblock the qp-trie work I suggest adding an rcu_head to each inner node
and doing call_rcu() on them before freeing them to bpf_mem_alloc.
Explicit call_rcu would disqualify qp-trie from tracing programs though :(
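
A minimal sketch of that qp-trie workaround (structure and helper names are
made up for illustration): embed an rcu_head in each inner node and give the
node back to bpf_mem_alloc only after a regular RCU grace period.

#include <linux/bpf_mem_alloc.h>
#include <linux/container_of.h>
#include <linux/rcupdate.h>

struct qp_trie_inner_node {
        struct rcu_head rcu;
        struct bpf_mem_alloc *ma;       /* back-pointer used only by this sketch */
        /* actual node payload would follow */
};

static void inner_node_free_rcu(struct rcu_head *rcu)
{
        struct qp_trie_inner_node *node =
                container_of(rcu, struct qp_trie_inner_node, rcu);

        bpf_mem_free(node->ma, node);
}

static void inner_node_free(struct qp_trie_inner_node *node)
{
        /* one normal RCU GP before the element can be reused by the allocator */
        call_rcu(&node->rcu, inner_node_free_rcu);
}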

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07 23:23           ` Alexei Starovoitov
@ 2023-06-07 23:30             ` Paul E. McKenney
  2023-06-07 23:50               ` Alexei Starovoitov
  2023-06-08  1:57             ` Hou Tao
  1 sibling, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2023-06-07 23:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > > As said in the commit message, the command line for test is
> > > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > > using default max_entries and the number of CPUs is greater than 15,
> > > > use_percpu_counter will be false.
> > >
> > > Right. percpu or not depends on number of cpus.
> > >
> > > >
> > > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > > there are obvious performance degradation.
> > > ...
> > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > 2:hash_map_perf kmalloc 359201 events per sec
> > > ..
> > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > 4:hash_map_perf kmalloc 203983 events per sec
> > >
> > > this is indeed a degration in a VM.
> > >
> > > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > > performance degradation for "./map_perf_test 4 8 16384"
> > >
> > > but... a degradation?
> > >
> > > > Before reuse-after-rcu-gp:
> > > >
> > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > 1:hash_map_perf kmalloc 388088 events per sec
> > > ...
> > > > After reuse-after-rcu-gp:
> > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > 5:hash_map_perf kmalloc 655628 events per sec
> > >
> > > This is a big improvement :) Not a degration.
> > > You always have to double check the numbers with perf report.
> > >
> > > > So could you please double check your setup and rerun map_perf_test ? If
> > > > there is no performance degradation, could you please share your setup
> > > > and your kernel configure file ?
> > >
> > > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > > Playing with it a bit more I found something interesting:
> > > map_perf_test 4 8 16348
> > > before/after has too much noise to be conclusive.
> > >
> > > So I did
> > > map_perf_test 4 8 16348 1000000
> > >
> > > and now I see significant degration from patch 3.
> > > It drops from 800k to 200k.
> > > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > > The following hack addresses most of the perf degradtion:
> > >
> > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > > index fea1cb0c78bb..eeadc9359097 100644
> > > --- a/kernel/bpf/memalloc.c
> > > +++ b/kernel/bpf/memalloc.c
> > > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > >         alloc = 0;
> > >         head = NULL;
> > >         tail = NULL;
> > > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> > >         while (alloc < cnt) {
> > >                 obj = __llist_del_first(&sc->reuse_ready_head);
> > >                 if (obj) {
> > > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > >                 alloc++;
> > >         }
> > >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > +       }
> > >
> > >         if (alloc) {
> > >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> > >                 sc->reuse_ready_tail = NULL;
> > >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> > >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > > +       } else {
> > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > >         }
> > > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > >  }
> > >
> > > It now drops from 800k to 450k.
> > > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> >
> > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> 
> An update..
> 
> I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> the lock contention, but cache miss on
> __llist_del_first(&c->reuse_ready_head);
> was still very high and performance was still at 450k as
> with a simple hack above.
> 
> Then I removed some of the _tail optimizations and added counters
> to these llists.
> To my surprise
> map_perf_test 4 1 16348 1000000
> was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> and ~400k sitting in reuse_ready_head.
> 
> Then noticed that we should be doing:
> call_rcu_hurry(&c->rcu, reuse_rcu);
> instead of call_rcu(),
> but my config didn't have RCU_LAZY, so that didn't help.
> Obviously we cannot allow such a huge number of elements to sit
> in these link lists.
> The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> To unblock qp-trie work I suggest to add rcu_head to each inner node
> and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> Explicit call_rcu would disqualify qp-tree from tracing programs though :(

I am sure that you guys have already considered and discarded this one,
but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.
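
For reference, a minimal sketch of the SLAB_TYPESAFE_BY_RCU pattern
(illustrative types and a stubbed-out lookup): the slab keeps the memory
type-stable across an RCU grace period, so a reader may see a
freed-and-reused object and must revalidate it before use.

#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/types.h>

struct example_obj {
        u32 key;
        refcount_t ref;
};

static struct kmem_cache *example_cache;

static int example_init(void)
{
        example_cache = kmem_cache_create("example_obj",
                                          sizeof(struct example_obj), 0,
                                          SLAB_TYPESAFE_BY_RCU, NULL);
        return example_cache ? 0 : -ENOMEM;
}

/* stand-in for a real RCU-protected lookup (e.g. a hash table walk) */
static struct example_obj *example_find(u32 key)
{
        return NULL;
}

static struct example_obj *example_get(u32 key)
{
        struct example_obj *obj;

        rcu_read_lock();
        obj = example_find(key);
        if (obj && !refcount_inc_not_zero(&obj->ref))
                obj = NULL;
        if (obj && obj->key != key) {
                /* the slot was reused for another object; drop it again */
                refcount_dec(&obj->ref);
                obj = NULL;
        }
        rcu_read_unlock();
        return obj;
}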

							Thanx, Paul

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07 23:30             ` Paul E. McKenney
@ 2023-06-07 23:50               ` Alexei Starovoitov
  2023-06-08  0:13                 ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-07 23:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> > On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > > > As said in the commit message, the command line for test is
> > > > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > > > using default max_entries and the number of CPUs is greater than 15,
> > > > > use_percpu_counter will be false.
> > > >
> > > > Right. percpu or not depends on number of cpus.
> > > >
> > > > >
> > > > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > > > there are obvious performance degradation.
> > > > ...
> > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > 2:hash_map_perf kmalloc 359201 events per sec
> > > > ..
> > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > 4:hash_map_perf kmalloc 203983 events per sec
> > > >
> > > > this is indeed a degration in a VM.
> > > >
> > > > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > > > performance degradation for "./map_perf_test 4 8 16384"
> > > >
> > > > but... a degradation?
> > > >
> > > > > Before reuse-after-rcu-gp:
> > > > >
> > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > 1:hash_map_perf kmalloc 388088 events per sec
> > > > ...
> > > > > After reuse-after-rcu-gp:
> > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > 5:hash_map_perf kmalloc 655628 events per sec
> > > >
> > > > This is a big improvement :) Not a degration.
> > > > You always have to double check the numbers with perf report.
> > > >
> > > > > So could you please double check your setup and rerun map_perf_test ? If
> > > > > there is no performance degradation, could you please share your setup
> > > > > and your kernel configure file ?
> > > >
> > > > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > > > Playing with it a bit more I found something interesting:
> > > > map_perf_test 4 8 16348
> > > > before/after has too much noise to be conclusive.
> > > >
> > > > So I did
> > > > map_perf_test 4 8 16348 1000000
> > > >
> > > > and now I see significant degration from patch 3.
> > > > It drops from 800k to 200k.
> > > > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > > > The following hack addresses most of the perf degradtion:
> > > >
> > > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > > > index fea1cb0c78bb..eeadc9359097 100644
> > > > --- a/kernel/bpf/memalloc.c
> > > > +++ b/kernel/bpf/memalloc.c
> > > > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > >         alloc = 0;
> > > >         head = NULL;
> > > >         tail = NULL;
> > > > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > > > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> > > >         while (alloc < cnt) {
> > > >                 obj = __llist_del_first(&sc->reuse_ready_head);
> > > >                 if (obj) {
> > > > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > >                 alloc++;
> > > >         }
> > > >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > +       }
> > > >
> > > >         if (alloc) {
> > > >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> > > >                 sc->reuse_ready_tail = NULL;
> > > >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> > > >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > > > +       } else {
> > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > >         }
> > > > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > >  }
> > > >
> > > > It now drops from 800k to 450k.
> > > > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > > > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > >
> > > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> >
> > An update..
> >
> > I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> > the lock contention, but cache miss on
> > __llist_del_first(&c->reuse_ready_head);
> > was still very high and performance was still at 450k as
> > with a simple hack above.
> >
> > Then I removed some of the _tail optimizations and added counters
> > to these llists.
> > To my surprise
> > map_perf_test 4 1 16348 1000000
> > was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> > and ~400k sitting in reuse_ready_head.
> >
> > Then noticed that we should be doing:
> > call_rcu_hurry(&c->rcu, reuse_rcu);
> > instead of call_rcu(),
> > but my config didn't have RCU_LAZY, so that didn't help.
> > Obviously we cannot allow such a huge number of elements to sit
> > in these link lists.
> > The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> > To unblock qp-trie work I suggest to add rcu_head to each inner node
> > and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> > Explicit call_rcu would disqualify qp-tree from tracing programs though :(
>
> I am sure that you guys have already considered and discarded this one,
> but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.

SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
We want to add an option to make it not do it and instead observe RCU GP
for every element freed via bpf_mem_free().
In other words, make bpf_mem_free() behave like kfree_rcu.
I just tried using rcu_expedite_gp() before the bpf prog runs
and it helps a bit.
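
A tiny illustration of that experiment (hypothetical wrapper, not the actual
change): rcu_expedite_gp()/rcu_unexpedite_gp() nest, so bracketing the run is
enough to favour expedited grace periods for its duration.

#include <linux/rcupdate.h>

static void run_with_expedited_gp(void (*run)(void *arg), void *arg)
{
        rcu_expedite_gp();      /* ask RCU to prefer expedited GPs */
        run(arg);
        rcu_unexpedite_gp();    /* restore normal grace-period behaviour */
}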

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07 23:50               ` Alexei Starovoitov
@ 2023-06-08  0:13                 ` Paul E. McKenney
  2023-06-08  0:34                   ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2023-06-08  0:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > > > > As said in the commit message, the command line for test is
> > > > > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > > > > using default max_entries and the number of CPUs is greater than 15,
> > > > > > use_percpu_counter will be false.
> > > > >
> > > > > Right. percpu or not depends on number of cpus.
> > > > >
> > > > > >
> > > > > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > > > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > > > > there are obvious performance degradation.
> > > > > ...
> > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > 2:hash_map_perf kmalloc 359201 events per sec
> > > > > ..
> > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > 4:hash_map_perf kmalloc 203983 events per sec
> > > > >
> > > > > this is indeed a degration in a VM.
> > > > >
> > > > > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > > > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > > > > performance degradation for "./map_perf_test 4 8 16384"
> > > > >
> > > > > but... a degradation?
> > > > >
> > > > > > Before reuse-after-rcu-gp:
> > > > > >
> > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > 1:hash_map_perf kmalloc 388088 events per sec
> > > > > ...
> > > > > > After reuse-after-rcu-gp:
> > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > 5:hash_map_perf kmalloc 655628 events per sec
> > > > >
> > > > > This is a big improvement :) Not a degration.
> > > > > You always have to double check the numbers with perf report.
> > > > >
> > > > > > So could you please double check your setup and rerun map_perf_test ? If
> > > > > > there is no performance degradation, could you please share your setup
> > > > > > and your kernel configure file ?
> > > > >
> > > > > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > > > > Playing with it a bit more I found something interesting:
> > > > > map_perf_test 4 8 16348
> > > > > before/after has too much noise to be conclusive.
> > > > >
> > > > > So I did
> > > > > map_perf_test 4 8 16348 1000000
> > > > >
> > > > > and now I see significant degration from patch 3.
> > > > > It drops from 800k to 200k.
> > > > > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > > > > The following hack addresses most of the perf degradtion:
> > > > >
> > > > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > > > > index fea1cb0c78bb..eeadc9359097 100644
> > > > > --- a/kernel/bpf/memalloc.c
> > > > > +++ b/kernel/bpf/memalloc.c
> > > > > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > >         alloc = 0;
> > > > >         head = NULL;
> > > > >         tail = NULL;
> > > > > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > > > > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> > > > >         while (alloc < cnt) {
> > > > >                 obj = __llist_del_first(&sc->reuse_ready_head);
> > > > >                 if (obj) {
> > > > > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > >                 alloc++;
> > > > >         }
> > > > >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > +       }
> > > > >
> > > > >         if (alloc) {
> > > > >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> > > > >                 sc->reuse_ready_tail = NULL;
> > > > >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> > > > >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > > > > +       } else {
> > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > >         }
> > > > > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > >  }
> > > > >
> > > > > It now drops from 800k to 450k.
> > > > > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > > > > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > > >
> > > > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > > > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> > >
> > > An update..
> > >
> > > I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> > > the lock contention, but cache miss on
> > > __llist_del_first(&c->reuse_ready_head);
> > > was still very high and performance was still at 450k as
> > > with a simple hack above.
> > >
> > > Then I removed some of the _tail optimizations and added counters
> > > to these llists.
> > > To my surprise
> > > map_perf_test 4 1 16348 1000000
> > > was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> > > and ~400k sitting in reuse_ready_head.
> > >
> > > Then noticed that we should be doing:
> > > call_rcu_hurry(&c->rcu, reuse_rcu);
> > > instead of call_rcu(),
> > > but my config didn't have RCU_LAZY, so that didn't help.
> > > Obviously we cannot allow such a huge number of elements to sit
> > > in these link lists.
> > > The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> > > To unblock qp-trie work I suggest to add rcu_head to each inner node
> > > and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> > > Explicit call_rcu would disqualify qp-tree from tracing programs though :(
> >
> > I am sure that you guys have already considered and discarded this one,
> > but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.
> 
> SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
> We want to add an option to make it not do it and instead observe RCU GP
> for every element freed via bpf_mem_free().
> In other words, make bpf_mem_free() behave like kfree_rcu.
> I just tried to use rcu_expedite_gp() before bpf prog runs
> and it helps a bit.

OK, got it, so you guys have considered, implemented, and are now trying
to discard SLAB_TYPESAFE_BY_RCU.  ;-)

Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
surprised that rcu_expedite_gp() makes any difference.

We do some expediting if there are huge numbers of callbacks or if one
of RCU's shrinker notifiers is invoked.  If the concern is only memory
footprint, it is possible to make the shrinkers more aggressive.  I am
not sure whether making them unconditionally more aggressive is a good
idea, however if memory footprint is the only concern and if shrink-time
expediting would suffice, it is certainly worth some investigation.

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08  0:13                 ` Paul E. McKenney
@ 2023-06-08  0:34                   ` Alexei Starovoitov
  2023-06-08  1:15                     ` Paul E. McKenney
  2023-06-08  3:35                     ` Hou Tao
  0 siblings, 2 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-08  0:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
> > On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> > > > On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > > > > > As said in the commit message, the command line for test is
> > > > > > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > > > > > using default max_entries and the number of CPUs is greater than 15,
> > > > > > > use_percpu_counter will be false.
> > > > > >
> > > > > > Right. percpu or not depends on number of cpus.
> > > > > >
> > > > > > >
> > > > > > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > > > > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > > > > > there are obvious performance degradation.
> > > > > > ...
> > > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > > 2:hash_map_perf kmalloc 359201 events per sec
> > > > > > ..
> > > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > > 4:hash_map_perf kmalloc 203983 events per sec
> > > > > >
> > > > > > this is indeed a degration in a VM.
> > > > > >
> > > > > > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > > > > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > > > > > performance degradation for "./map_perf_test 4 8 16384"
> > > > > >
> > > > > > but... a degradation?
> > > > > >
> > > > > > > Before reuse-after-rcu-gp:
> > > > > > >
> > > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > > 1:hash_map_perf kmalloc 388088 events per sec
> > > > > > ...
> > > > > > > After reuse-after-rcu-gp:
> > > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > > 5:hash_map_perf kmalloc 655628 events per sec
> > > > > >
> > > > > > This is a big improvement :) Not a degration.
> > > > > > You always have to double check the numbers with perf report.
> > > > > >
> > > > > > > So could you please double check your setup and rerun map_perf_test ? If
> > > > > > > there is no performance degradation, could you please share your setup
> > > > > > > and your kernel configure file ?
> > > > > >
> > > > > > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > > > > > Playing with it a bit more I found something interesting:
> > > > > > map_perf_test 4 8 16348
> > > > > > before/after has too much noise to be conclusive.
> > > > > >
> > > > > > So I did
> > > > > > map_perf_test 4 8 16348 1000000
> > > > > >
> > > > > > and now I see significant degration from patch 3.
> > > > > > It drops from 800k to 200k.
> > > > > > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > > > > > The following hack addresses most of the perf degradtion:
> > > > > >
> > > > > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > > > > > index fea1cb0c78bb..eeadc9359097 100644
> > > > > > --- a/kernel/bpf/memalloc.c
> > > > > > +++ b/kernel/bpf/memalloc.c
> > > > > > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > > >         alloc = 0;
> > > > > >         head = NULL;
> > > > > >         tail = NULL;
> > > > > > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > > > > > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> > > > > >         while (alloc < cnt) {
> > > > > >                 obj = __llist_del_first(&sc->reuse_ready_head);
> > > > > >                 if (obj) {
> > > > > > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > > >                 alloc++;
> > > > > >         }
> > > > > >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > +       }
> > > > > >
> > > > > >         if (alloc) {
> > > > > >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > > > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> > > > > >                 sc->reuse_ready_tail = NULL;
> > > > > >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> > > > > >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > > > > > +       } else {
> > > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > >         }
> > > > > > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > >  }
> > > > > >
> > > > > > It now drops from 800k to 450k.
> > > > > > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > > > > > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > > > >
> > > > > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > > > > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> > > >
> > > > An update..
> > > >
> > > > I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> > > > the lock contention, but cache miss on
> > > > __llist_del_first(&c->reuse_ready_head);
> > > > was still very high and performance was still at 450k as
> > > > with a simple hack above.
> > > >
> > > > Then I removed some of the _tail optimizations and added counters
> > > > to these llists.
> > > > To my surprise
> > > > map_perf_test 4 1 16348 1000000
> > > > was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> > > > and ~400k sitting in reuse_ready_head.
> > > >
> > > > Then noticed that we should be doing:
> > > > call_rcu_hurry(&c->rcu, reuse_rcu);
> > > > instead of call_rcu(),
> > > > but my config didn't have RCU_LAZY, so that didn't help.
> > > > Obviously we cannot allow such a huge number of elements to sit
> > > > in these link lists.
> > > > The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> > > > To unblock qp-trie work I suggest to add rcu_head to each inner node
> > > > and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> > > > Explicit call_rcu would disqualify qp-tree from tracing programs though :(
> > >
> > > I am sure that you guys have already considered and discarded this one,
> > > but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.
> >
> > SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
> > We want to add an option to make it not do it and instead observe RCU GP
> > for every element freed via bpf_mem_free().
> > In other words, make bpf_mem_free() behave like kfree_rcu.
> > I just tried to use rcu_expedite_gp() before bpf prog runs
> > and it helps a bit.
>
> OK, got it, so you guys have considered, implemented, and are now trying
> to discard SLAB_TYPESAFE_BY_RCU.  ;-)
>
> Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
> surprised that rcu_expedite_gp() makes any difference.
>
> We do some expediting if there are huge numbers of callbacks or if one
> of RCU's shrinker notifiers is invoked.  If the concern is only memory
> footprint, it is possible to make the shrinkers more aggressive.  I am
> not sure whether making them unconditionally more aggressive is a good
> idea, however if memory footprint is the only concern and if shrink-time
> expediting would suffice, it is certainly worth some investigation.

Right. I don't think it's a good idea to tweak RCU for this use case.
RCU parameters have to be optimized for all. Instead the bpf side needs
to understand how RCU heuristics/watermarks work and play that game.
For example, Hou's patch 3 has one pending call_rcu per-cpu.
As soon as one call_rcu_hurry is done, all future freed elements get
queued into the llist, and for the next call_rcu_hurry() that list will
contain 100k elements.
I believe from RCU pov one pending call_rcu cb is not a reason to
act right away. It's trying to batch multiple cb-s.
Right now I'm experimenting with multiple call_rcu calls from the bpf side,
so that RCU sees multiple pending cb-s and has to act.
It seems to work much better. Memory footprint is now reasonable.
Could you point me to the code in RCU where it does callback batching?
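
A rough sketch of the kind of batching being experimented with (all names
are hypothetical): give each fixed-size batch of freed elements its own
rcu_head so RCU sees several pending callbacks instead of one.

#include <linux/container_of.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct reuse_batch {
        struct rcu_head rcu;
        struct llist_node *head;        /* the elements covered by this batch */
};

static void reuse_batch_rcu(struct rcu_head *rcu)
{
        struct reuse_batch *batch = container_of(rcu, struct reuse_batch, rcu);

        /* a GP has elapsed: splice batch->head onto the reuse-ready list here */
        kfree(batch);
}

static void queue_reuse_batch(struct llist_node *head)
{
        struct reuse_batch *batch = kmalloc(sizeof(*batch), GFP_ATOMIC);

        if (!batch)
                return;         /* sketch only; real code needs a fallback path */
        batch->head = head;
        call_rcu(&batch->rcu, reuse_batch_rcu);
}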

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08  0:34                   ` Alexei Starovoitov
@ 2023-06-08  1:15                     ` Paul E. McKenney
  2023-06-08  3:35                     ` Hou Tao
  1 sibling, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2023-06-08  1:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hou Tao, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Wed, Jun 07, 2023 at 05:34:50PM -0700, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> > > > > On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > > > > > > As said in the commit message, the command line for test is
> > > > > > > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > > > > > > using default max_entries and the number of CPUs is greater than 15,
> > > > > > > > use_percpu_counter will be false.
> > > > > > >
> > > > > > > Right. percpu or not depends on number of cpus.
> > > > > > >
> > > > > > > >
> > > > > > > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > > > > > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > > > > > > there are obvious performance degradation.
> > > > > > > ...
> > > > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > > > 2:hash_map_perf kmalloc 359201 events per sec
> > > > > > > ..
> > > > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > > > 4:hash_map_perf kmalloc 203983 events per sec
> > > > > > >
> > > > > > > this is indeed a degration in a VM.
> > > > > > >
> > > > > > > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > > > > > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > > > > > > performance degradation for "./map_perf_test 4 8 16384"
> > > > > > >
> > > > > > > but... a degradation?
> > > > > > >
> > > > > > > > Before reuse-after-rcu-gp:
> > > > > > > >
> > > > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > > > 1:hash_map_perf kmalloc 388088 events per sec
> > > > > > > ...
> > > > > > > > After reuse-after-rcu-gp:
> > > > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > > > 5:hash_map_perf kmalloc 655628 events per sec
> > > > > > >
> > > > > > > This is a big improvement :) Not a degration.
> > > > > > > You always have to double check the numbers with perf report.
> > > > > > >
> > > > > > > > So could you please double check your setup and rerun map_perf_test ? If
> > > > > > > > there is no performance degradation, could you please share your setup
> > > > > > > > and your kernel configure file ?
> > > > > > >
> > > > > > > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > > > > > > Playing with it a bit more I found something interesting:
> > > > > > > map_perf_test 4 8 16348
> > > > > > > before/after has too much noise to be conclusive.
> > > > > > >
> > > > > > > So I did
> > > > > > > map_perf_test 4 8 16348 1000000
> > > > > > >
> > > > > > > and now I see significant degration from patch 3.
> > > > > > > It drops from 800k to 200k.
> > > > > > > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > > > > > > The following hack addresses most of the perf degradtion:
> > > > > > >
> > > > > > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > > > > > > index fea1cb0c78bb..eeadc9359097 100644
> > > > > > > --- a/kernel/bpf/memalloc.c
> > > > > > > +++ b/kernel/bpf/memalloc.c
> > > > > > > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > > > >         alloc = 0;
> > > > > > >         head = NULL;
> > > > > > >         tail = NULL;
> > > > > > > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > > > > > > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> > > > > > >         while (alloc < cnt) {
> > > > > > >                 obj = __llist_del_first(&sc->reuse_ready_head);
> > > > > > >                 if (obj) {
> > > > > > > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > > > >                 alloc++;
> > > > > > >         }
> > > > > > >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > > +       }
> > > > > > >
> > > > > > >         if (alloc) {
> > > > > > >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > > > > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> > > > > > >                 sc->reuse_ready_tail = NULL;
> > > > > > >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> > > > > > >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > > > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > > > > > > +       } else {
> > > > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >         }
> > > > > > > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >  }
> > > > > > >
> > > > > > > It now drops from 800k to 450k.
> > > > > > > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > > > > > > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > > > > >
> > > > > > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > > > > > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> > > > >
> > > > > An update..
> > > > >
> > > > > I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> > > > > the lock contention, but cache miss on
> > > > > __llist_del_first(&c->reuse_ready_head);
> > > > > was still very high and performance was still at 450k as
> > > > > with a simple hack above.
> > > > >
> > > > > Then I removed some of the _tail optimizations and added counters
> > > > > to these llists.
> > > > > To my surprise
> > > > > map_perf_test 4 1 16348 1000000
> > > > > was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> > > > > and ~400k sitting in reuse_ready_head.
> > > > >
> > > > > Then noticed that we should be doing:
> > > > > call_rcu_hurry(&c->rcu, reuse_rcu);
> > > > > instead of call_rcu(),
> > > > > but my config didn't have RCU_LAZY, so that didn't help.
> > > > > Obviously we cannot allow such a huge number of elements to sit
> > > > > in these link lists.
> > > > > The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> > > > > To unblock qp-trie work I suggest to add rcu_head to each inner node
> > > > > and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> > > > > Explicit call_rcu would disqualify qp-tree from tracing programs though :(
> > > >
> > > > I am sure that you guys have already considered and discarded this one,
> > > > but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.
> > >
> > > SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
> > > We want to add an option to make it not do it and instead observe RCU GP
> > > for every element freed via bpf_mem_free().
> > > In other words, make bpf_mem_free() behave like kfree_rcu.
> > > I just tried to use rcu_expedite_gp() before bpf prog runs
> > > and it helps a bit.
> >
> > OK, got it, so you guys have considered, implemented, and are now trying
> > to discard SLAB_TYPESAFE_BY_RCU.  ;-)
> >
> > Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
> > surprised that rcu_expedite_gp() makes any difference.
> >
> > We do some expediting if there are huge numbers of callbacks or if one
> > of RCU's shrinker notifiers is invoked.  If the concern is only memory
> > footprint, it is possible to make the shrinkers more aggressive.  I am
> > not sure whether making them unconditionally more aggressive is a good
> > idea, however if memory footprint is the only concern and if shrink-time
> > expediting would suffice, it is certainly worth some investigation.
> 
> Right. I don't think it's a good idea to tweak RCU for this use case.
> RCU parameters have to be optimized for all. Instead the bpf side needs
> to understand how RCU heuristics/watermarks work and play that game.
> For example, Hou's patch 3 has one pending call_rcu per-cpu.
> As soon as one call_rcu_hurry is done all future freed elements gets
> queued into llist and for the next call_rcu_hurry() that list will
> contain 100k elements.
> I believe from RCU pov one pending call_rcu cb is not a reason to
> act right away. It's trying to batch multiple cb-s.

Indeed, if a single RCU callback was that important, it would be necessary
for the caller to explicitly say so somehow.  And I agree that this
would open a can of worms that might be best left unopened.

> Right now I'm experimenting with multiple call_rcu calls from the bpf side,
> so that RCU sees multiple pending cb-s and has to act.
> It seems to work much better. Memory footprint is now reasonable.

Nice!!!

> Could you point me to a code in RCU where it's doing callback batching?

There are quite a few things in RCU that make this happen:

1.	The latency of the previous RCU grace period automatically
	batches up callbacks for the RCU grace period that is just
	now starting.

2.	More specifically, the rcutree.jiffies_till_first_fqs and
	rcutree.jiffies_till_next_fqs module parameters control
	how long RCU grace-period processing waits between checks
	for idle CPUs.

3.	The rcu_implicit_dynticks_qs() function is invoked periodically
	as specified by #2 above, and pays attention to grace-period
	progress.  If it decides that the current grace period is not
	moving along quickly enough, it will enlist help from
	cond_resched(), the context-switch code, and from a
	scheduler IPI.  Failing that, it collects information that
	can be printed out by the ensuing RCU CPU stall warning.

4.	When callbacks are not offloaded to the rcuo kthreads and the
	__call_rcu_core() function decides that there are too many
	callbacks piled up on the current CPU, it gives the current grace
	period a kick by invoking rcu_force_quiescent_state(), which
	in turn forces RCU to check for holdout CPUs.  It also forces
	the current CPU to check immediately for a quiescent state.

5.	The __call_rcu_nocb_wake() function does something similar for the case
	where callbacks are offloaded.

	Most datacenter Linux kernels do *not* offload callbacks.

As a rough rule of thumb, RCU wants to keep grace period latency
somewhere in the milliseconds.  If the grace period goes too slowly or
if there are too many callbacks piling up, it will take evasive action
as described above.

Does that help, or am I missing the point of your question?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07 20:50         ` Alexei Starovoitov
  2023-06-07 23:23           ` Alexei Starovoitov
@ 2023-06-08  1:51           ` Hou Tao
  2023-06-08  2:55             ` Paul E. McKenney
  1 sibling, 1 reply; 32+ messages in thread
From: Hou Tao @ 2023-06-08  1:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

Hi,

On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>> As said in the commit message, the command line for test is
>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>> using default max_entries and the number of CPUs is greater than 15,
>>> use_percpu_counter will be false.
>> Right. percpu or not depends on number of cpus.
>>
>>> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
>>> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
>>> there are obvious performance degradation.
>> ...
>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>> 2:hash_map_perf kmalloc 359201 events per sec
>> ..
>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>> 4:hash_map_perf kmalloc 203983 events per sec
>> this is indeed a degration in a VM.
>>
>>> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
>>> performances for "./map_perf_test 4 8" are similar, but there is obvious
>>> performance degradation for "./map_perf_test 4 8 16384"
>> but... a degradation?
Er, my bad. I mislabeled "Before" and "After". v4 indeed introduces
a big performance degradation on the physical host.
>>
>>> Before reuse-after-rcu-gp:
>>>
>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>> 1:hash_map_perf kmalloc 388088 events per sec
>> ...
>>> After reuse-after-rcu-gp:
>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>> 5:hash_map_perf kmalloc 655628 events per sec
>> This is a big improvement :) Not a degration.
>> You always have to double check the numbers with perf report.
>>
>>> So could you please double check your setup and rerun map_perf_test ? If
>>> there is no performance degradation, could you please share your setup
>>> and your kernel configure file ?
>> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
>> Playing with it a bit more I found something interesting:
>> map_perf_test 4 8 16348
>> before/after has too much noise to be conclusive.
>>
>> So I did
>> map_perf_test 4 8 16348 1000000
>>
>> and now I see significant degration from patch 3.
>> It drops from 800k to 200k.
>> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
>> The following hack addresses most of the perf degradtion:
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index fea1cb0c78bb..eeadc9359097 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>         alloc = 0;
>>         head = NULL;
>>         tail = NULL;
>> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
>> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>>         while (alloc < cnt) {
>>                 obj = __llist_del_first(&sc->reuse_ready_head);
>>                 if (obj) {
>> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>                 alloc++;
>>         }
>>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>> +       }
>>
>>         if (alloc) {
>>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
>> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>>                 sc->reuse_ready_tail = NULL;
>>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>>                 __llist_add_batch(head, tail, &sc->wait_for_free);
>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
>> +       } else {
>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>         }
>> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>  }
>>
>> It now drops from 800k to 450k.
>> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
>> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
Yes, I know, because I had just proposed it in the email yesterday.
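Just to make sure we mean the same split, here is a minimal sketch of
that layout (the struct and field names below are only illustrative,
not the ones used in the patches):

#include <linux/llist.h>

/* Illustrative sketch only: one reuse_ready list per CPU to avoid the
 * contended per-bpf-ma lock, plus one per-bpf-ma list for objects that
 * still have to wait for an RCU tasks-trace GP before going back to
 * the kernel.
 */
struct sketch_mem_cache {			/* one instance per CPU */
	struct llist_head free_llist;		/* immediately allocatable */
	struct llist_head free_by_rcu;		/* freed, call_rcu() not yet issued */
	struct llist_head waiting_for_gp;	/* call_rcu() issued, GP pending */
	struct llist_head reuse_ready;		/* normal RCU GP elapsed, reusable locally */
};

struct sketch_mem_alloc {			/* one instance per bpf_mem_alloc */
	struct llist_head wait_for_free;	/* waits for RCU tasks-trace GP before kfree */
};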
>
> Also noticed that the overhead of shared reuse_ready list
> comes both from the contended lock and from cache misses
> when one cpu pushes to the list after RCU GP and another
> cpu removes.
>
> Also low/batch/high watermark are all wrong in patch 3.
> low=32 and high=96 makes no sense when it's not a single list.
> I'm experimenting with 32 for all three heuristics.
>
> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> are redundant.
> unit_free() can push into free_by_rcu directly
> then reuse_bulk() can fill it up with free_llist_extra and
> move them into waiting_for_gp.
Yes. I indeed missed that.
>
> All these _tail optimizations are obscuring the code and make it hard
> to notice these issues.
>
>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>
>> Answering your other email:
>>
>>> I see your point. I will continue to debug the memory usage difference
>>> between v3 and v4.
>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
I don't agree with that. I still think the potential memory usage of v4
is a problem, and the memory usage difference between v3 and v4 reveals
some peculiarity in the RCU subsystem, because the difference comes from
the duration of the RCU grace period. We need to find out why and fix or
work around that.
>>
>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>> make the list contain thousands of elements.
>> That would be true only if there were no scheduling events in all of 176K ops.
>> Which is not the case.
>> I'm not sure why you're saying that RCU GP is 30ms.
Because these freed elements can only be reused after one RCU GP, and in
v4 there is only one RCU callback per CPU, so before one RCU GP expires,
the freed elements accumulate on the list.
>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>> Network and block IO doesn't process 176K packets without resched.
>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>
>> For small size buckets low_watermark=32 and high=96.
>> We typically move 32 elements at a time from one list to another.
>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>> elements from waiting_for_gp one at a time and kfree it one at a time.
>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>
>>> Before I post v5, I want to know the reason why per-bpf-ma list is
>>> introduced. Previously, I though it was used to handle the case in which
>>> allocation and freeing are done on different CPUs.
>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>
>>> And as we can see
>>> from the benchmark result above and in v3, the performance and the
>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>
>> So far the only bench we can trust and analyze is map_perf_test.
>> Please make bench in patch 2 yield the cpu after few updates.
>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>> 64 updates is realistic too. A thousand is not.
Will do that.
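Roughly along these lines, I think (a sketch only, not the actual
benchmark code; the map layout, section name and key range below are
made up to illustrate the idea of yielding after 64 updates):

/* Sketch: do at most 64 updates per program invocation so the producer
 * returns to user space frequently instead of overwriting thousands of
 * elements in a single trip through the kernel.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define UPDATES_PER_INVOCATION 64

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, __u32);
	__type(value, __u64);
} htab SEC(".maps");

static __u32 next_key;

SEC("tracepoint/syscalls/sys_enter_getpgid")
int overwrite_some(void *ctx)
{
	__u64 value = bpf_ktime_get_ns();
	__u32 key;
	int i;

	for (i = 0; i < UPDATES_PER_INVOCATION; i++) {
		/* wrap within a small key range so updates overwrite */
		key = next_key++ & 2047;
		bpf_map_update_elem(&htab, &key, &value, BPF_ANY);
	}
	return 0;
}

char _license[] SEC("license") = "GPL";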


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-07 23:23           ` Alexei Starovoitov
  2023-06-07 23:30             ` Paul E. McKenney
@ 2023-06-08  1:57             ` Hou Tao
  1 sibling, 0 replies; 32+ messages in thread
From: Hou Tao @ 2023-06-08  1:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, Paul E . McKenney, rcu, houtao1

Hi,

On 6/8/2023 7:23 AM, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>> As said in the commit message, the command line for test is
>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>> using default max_entries and the number of CPUs is greater than 15,
>>>> use_percpu_counter will be false.
>>> Right. percpu or not depends on number of cpus.
>>>
>>>> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
>>>> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
>>>> there are obvious performance degradation.
>>> ...
>>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>>> 2:hash_map_perf kmalloc 359201 events per sec
>>> ..
>>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>>> 4:hash_map_perf kmalloc 203983 events per sec
>>> this is indeed a degration in a VM.
>>>
>>>> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
>>>> performances for "./map_perf_test 4 8" are similar, but there is obvious
>>>> performance degradation for "./map_perf_test 4 8 16384"
>>> but... a degradation?
>>>
>>>> Before reuse-after-rcu-gp:
>>>>
>>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>>> 1:hash_map_perf kmalloc 388088 events per sec
>>> ...
>>>> After reuse-after-rcu-gp:
>>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>>> 5:hash_map_perf kmalloc 655628 events per sec
>>> This is a big improvement :) Not a degration.
>>> You always have to double check the numbers with perf report.
>>>
>>>> So could you please double check your setup and rerun map_perf_test ? If
>>>> there is no performance degradation, could you please share your setup
>>>> and your kernel configure file ?
>>> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
>>> Playing with it a bit more I found something interesting:
>>> map_perf_test 4 8 16348
>>> before/after has too much noise to be conclusive.
>>>
>>> So I did
>>> map_perf_test 4 8 16348 1000000
>>>
>>> and now I see significant degration from patch 3.
>>> It drops from 800k to 200k.
>>> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
>>> The following hack addresses most of the perf degradtion:
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index fea1cb0c78bb..eeadc9359097 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>>         alloc = 0;
>>>         head = NULL;
>>>         tail = NULL;
>>> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
>>> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>>>         while (alloc < cnt) {
>>>                 obj = __llist_del_first(&sc->reuse_ready_head);
>>>                 if (obj) {
>>> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>>                 alloc++;
>>>         }
>>>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>> +       }
>>>
>>>         if (alloc) {
>>>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>>>                 sc->reuse_ready_tail = NULL;
>>>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>>>                 __llist_add_batch(head, tail, &sc->wait_for_free);
>>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
>>> +       } else {
>>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>         }
>>> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>  }
>>>
>>> It now drops from 800k to 450k.
>>> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
>>> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
>> Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
>> I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> An update..
>
> I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> the lock contention, but cache miss on
> __llist_del_first(&c->reuse_ready_head);
> was still very high and performance was still at 450k as
> with a simple hack above.
>
> Then I removed some of the _tail optimizations and added counters
> to these llists.
> To my surprise
> map_perf_test 4 1 16348 1000000
> was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> and ~400k sitting in reuse_ready_head.
Yep. If you use the htab-mem benchmark in patch #2, you will find the
same results: the same long single lists and the same huge memory usage.
>
> Then noticed that we should be doing:
> call_rcu_hurry(&c->rcu, reuse_rcu);
> instead of call_rcu(),
> but my config didn't have RCU_LAZY, so that didn't help.
> Obviously we cannot allow such a huge number of elements to sit
> in these link lists.
> The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
I think the main blocker is the huge memory usage (which is the same
thing as the long wait_for_reuse and wait_for_free lists), right?
> To unblock qp-trie work I suggest to add rcu_head to each inner node
> and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> Explicit call_rcu would disqualify qp-tree from tracing programs though :(
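For reference, that suggestion would look roughly like the sketch below
(the inner-node layout and the bpf_mem_alloc instance are made up for
illustration, not taken from the qp-trie patches):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/bpf_mem_alloc.h>

/* Hypothetical qp-trie inner node that carries its own rcu_head. */
struct qptrie_inner_node {
	struct rcu_head rcu;
	/* ... node payload ... */
};

static struct bpf_mem_alloc qptrie_ma;	/* assumed initialized elsewhere */

static void qptrie_node_free_rcu(struct rcu_head *rcu)
{
	struct qptrie_inner_node *node;

	node = container_of(rcu, struct qptrie_inner_node, rcu);
	/* Only after a normal RCU GP does the node go back to
	 * bpf_mem_alloc, where it may then be reused immediately
	 * (type-safe reuse).
	 */
	bpf_mem_cache_free(&qptrie_ma, node);
}

static void qptrie_node_free(struct qptrie_inner_node *node)
{
	call_rcu(&node->rcu, qptrie_node_free_rcu);
}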


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08  1:51           ` Hou Tao
@ 2023-06-08  2:55             ` Paul E. McKenney
  2023-06-08  3:43               ` Hou Tao
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2023-06-08  2:55 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> Hi,
> 
> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>> As said in the commit message, the command line for test is
> >>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>> using default max_entries and the number of CPUs is greater than 15,
> >>> use_percpu_counter will be false.
> >> Right. percpu or not depends on number of cpus.
> >>
> >>> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> >>> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> >>> there are obvious performance degradation.
> >> ...
> >>> [root@hello bpf]# ./map_perf_test 4 8 16384
> >>> 2:hash_map_perf kmalloc 359201 events per sec
> >> ..
> >>> [root@hello bpf]# ./map_perf_test 4 8 16384
> >>> 4:hash_map_perf kmalloc 203983 events per sec
> >> this is indeed a degration in a VM.
> >>
> >>> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> >>> performances for "./map_perf_test 4 8" are similar, but there is obvious
> >>> performance degradation for "./map_perf_test 4 8 16384"
> >> but... a degradation?
> Er, My bad. I miss-labeled "Before" and "After". v4 indeed introduces
> big performance degradation in physical host.
> >>
> >>> Before reuse-after-rcu-gp:
> >>>
> >>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> >>> 1:hash_map_perf kmalloc 388088 events per sec
> >> ...
> >>> After reuse-after-rcu-gp:
> >>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> >>> 5:hash_map_perf kmalloc 655628 events per sec
> >> This is a big improvement :) Not a degration.
> >> You always have to double check the numbers with perf report.
> >>
> >>> So could you please double check your setup and rerun map_perf_test ? If
> >>> there is no performance degradation, could you please share your setup
> >>> and your kernel configure file ?
> >> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> >> Playing with it a bit more I found something interesting:
> >> map_perf_test 4 8 16348
> >> before/after has too much noise to be conclusive.
> >>
> >> So I did
> >> map_perf_test 4 8 16348 1000000
> >>
> >> and now I see significant degration from patch 3.
> >> It drops from 800k to 200k.
> >> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> >> The following hack addresses most of the perf degradtion:
> >>
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index fea1cb0c78bb..eeadc9359097 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >>         alloc = 0;
> >>         head = NULL;
> >>         tail = NULL;
> >> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> >> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> >>         while (alloc < cnt) {
> >>                 obj = __llist_del_first(&sc->reuse_ready_head);
> >>                 if (obj) {
> >> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >>                 alloc++;
> >>         }
> >>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >> +       }
> >>
> >>         if (alloc) {
> >>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> >>                 sc->reuse_ready_tail = NULL;
> >>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> >>                 __llist_add_batch(head, tail, &sc->wait_for_free);
> >> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> >> +       } else {
> >> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>         }
> >> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>  }
> >>
> >> It now drops from 800k to 450k.
> >> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> >> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> Yes, I known, because I had just proposed it in the email yesterday.
> >
> > Also noticed that the overhead of shared reuse_ready list
> > comes both from the contended lock and from cache misses
> > when one cpu pushes to the list after RCU GP and another
> > cpu removes.
> >
> > Also low/batch/high watermark are all wrong in patch 3.
> > low=32 and high=96 makes no sense when it's not a single list.
> > I'm experimenting with 32 for all three heuristics.
> >
> > Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> > are redundant.
> > unit_free() can push into free_by_rcu directly
> > then reuse_bulk() can fill it up with free_llist_extra and
> > move them into waiting_for_gp.
> Yes. Indeed missing that.
> >
> > All these _tail optimizations are obscuring the code and make it hard
> > to notice these issues.
> >
> >> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>
> >> Answering your other email:
> >>
> >>> I see your point. I will continue to debug the memory usage difference
> >>> between v3 and v4.
> >> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
> Don't agree with that. I still think the potential memory usage of v4 is
> a problem and the difference memory usage between v3 and v4 reveals that
> there is some peculiarity in RCU subsystem, because the difference comes
> from the duration of RCU grace period. We need to find out why and fix
> or workaround that.

A tight loop in the kernel can extend RCU grace periods, especially
for kernels built with CONFIG_PREEMPTION=n.  Placing things like
cond_resched() in such loops can help.  Of course, if you are in an
RCU read-side critical section (or holding a spinlock), you will need
to exit, cond_resched(), then re-enter, taking care to ensure that the
state upon re-entry is valid.  After all, having exited either type of
critical section, anything might happen.
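In other words, something like the following sketch, where the list,
process_item(), and the revalidation strategy (here, simply restarting
the scan) are placeholders for whatever the real loop does:

#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

struct item {
	struct list_head node;
	int data;
};

static LIST_HEAD(item_list);

static void process_item(struct item *it)
{
	/* placeholder for the real per-item work */
}

static void scan_items(void)
{
	struct item *it;

restart:
	rcu_read_lock();
	list_for_each_entry_rcu(it, &item_list, node) {
		process_item(it);
		if (need_resched()) {
			/* Cannot cond_resched() inside an RCU read-side
			 * critical section, so leave it first.  Anything can
			 * change while we are outside, so revalidate on
			 * re-entry, here by simply restarting the scan.
			 */
			rcu_read_unlock();
			cond_resched();
			goto restart;
		}
	}
	rcu_read_unlock();
}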

							Thanx, Paul

> >>> I don't think so. Let's considering the per-cpu list first. Assume the
> >>> normal RCU grace period is about 30ms and we are tracing the IO latency
> >>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
> >>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> >>> For the per-ma list, when the number of CPUs increased, it is easy to
> >>> make the list contain thousands of elements.
> >> That would be true only if there were no scheduling events in all of 176K ops.
> >> Which is not the case.
> >> I'm not sure why you're saying that RCU GP is 30ms.
> Because these freed elements will be freed after one RCU GP and in v4
> there is only one RCU callback per-cpu, so before one RCU GP is expired,
> these freed elements will be accumulated on the list.
> >> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >> Every sched event is sort-of implicit rcu_read_lock/unlock.
> >> Network and block IO doesn't process 176K packets without resched.
> >> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
> >>
> >> For small size buckets low_watermark=32 and high=96.
> >> We typically move 32 elements at a time from one list to another.
> >> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
> >> is not instant, but once __free_rcu_tasks_trace is called we need to take
> >> elements from waiting_for_gp one at a time and kfree it one at a time.
> >> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>
> >>> Before I post v5, I want to know the reason why per-bpf-ma list is
> >>> introduced. Previously, I though it was used to handle the case in which
> >>> allocation and freeing are done on different CPUs.
> >> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>
> >>> And as we can see
> >>> from the benchmark result above and in v3, the performance and the
> >>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
> >> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>
> >> So far the only bench we can trust and analyze is map_perf_test.
> >> Please make bench in patch 2 yield the cpu after few updates.
> >> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
> >> 64 updates is realistic too. A thousand is not.
> Will do that.
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08  0:34                   ` Alexei Starovoitov
  2023-06-08  1:15                     ` Paul E. McKenney
@ 2023-06-08  3:35                     ` Hou Tao
  2023-06-08  4:30                       ` Hou Tao
  2023-06-08  4:30                       ` Alexei Starovoitov
  1 sibling, 2 replies; 32+ messages in thread
From: Hou Tao @ 2023-06-08  3:35 UTC (permalink / raw)
  To: Paul E. McKenney, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, rcu, houtao1

Hi,

On 6/8/2023 8:34 AM, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>> On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
>>> On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
SNIP
>>>> An update..
>>>>
>>>> I tweaked patch 3 to do per-cpu reuse_ready and it addressed
>>>> the lock contention, but cache miss on
>>>> __llist_del_first(&c->reuse_ready_head);
>>>> was still very high and performance was still at 450k as
>>>> with a simple hack above.
>>>>
>>>> Then I removed some of the _tail optimizations and added counters
>>>> to these llists.
>>>> To my surprise
>>>> map_perf_test 4 1 16348 1000000
>>>> was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
>>>> and ~400k sitting in reuse_ready_head.
>>>>
>>>> Then noticed that we should be doing:
>>>> call_rcu_hurry(&c->rcu, reuse_rcu);
>>>> instead of call_rcu(),
>>>> but my config didn't have RCU_LAZY, so that didn't help.
>>>> Obviously we cannot allow such a huge number of elements to sit
>>>> in these link lists.
>>>> The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
>>>> To unblock qp-trie work I suggest to add rcu_head to each inner node
>>>> and do call_rcu() on them before free-ing them to bpf_mem_alloc.
>>>> Explicit call_rcu would disqualify qp-tree from tracing programs though :(
>>>> I am sure that you guys have already considered and discarded this one,
>>>> but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.
Thanks for the suggestion. SLAB_TYPESAFE_BY_RCU requires modifying the
reader logic to check whether the found element is still valid, and to
retry if it is not. The way to check validity is also different for each
cache. In bpf we don't want readers to have to take care of reuse and
implement the retry logic; we want the bpf memory allocator to handle it
automatically.
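To illustrate, the reader-side pattern that SLAB_TYPESAFE_BY_RCU imposes
looks roughly like the sketch below (struct obj, its fields and
obj_cache are made up, and the identity re-check differs for every
cache):

#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct obj {
	refcount_t refcnt;
	u32 key;
	struct hlist_node node;
};

static struct kmem_cache *obj_cache;	/* created with SLAB_TYPESAFE_BY_RCU */

static void obj_put(struct obj *obj)
{
	if (refcount_dec_and_test(&obj->refcnt))
		kmem_cache_free(obj_cache, obj);
}

static struct obj *obj_lookup(struct hlist_head *head, u32 key)
{
	struct obj *obj;

retry:
	rcu_read_lock();
	hlist_for_each_entry_rcu(obj, head, node) {
		if (obj->key != key)
			continue;
		/* The memory may be freed and reallocated under us, so pin
		 * the object first ...
		 */
		if (!refcount_inc_not_zero(&obj->refcnt))
			continue;	/* concurrently being freed */
		/* ... and then re-check that it is still the object we were
		 * looking for; if not, it was reused for another key.
		 */
		if (obj->key != key) {
			obj_put(obj);
			rcu_read_unlock();
			goto retry;
		}
		rcu_read_unlock();
		return obj;		/* caller drops the reference */
	}
	rcu_read_unlock();
	return NULL;
}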
>>> SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
>>> We want to add an option to make it not do it and instead observe RCU GP
>>> for every element freed via bpf_mem_free().
>>> In other words, make bpf_mem_free() behave like kfree_rcu.
>>> I just tried to use rcu_expedite_gp() before bpf prog runs
>>> and it helps a bit.
>> OK, got it, so you guys have considered, implemented, and are now trying
>> to discard SLAB_TYPESAFE_BY_RCU.  ;-)
>>
>> Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
>> surprised that rcu_expedite_gp() makes any difference.
>>
>> We do some expediting if there are huge numbers of callbacks or if one
>> of RCU's shrinker notifiers is invoked.  If the concern is only memory
>> footprint, it is possible to make the shrinkers more aggressive.  I am
>> not sure whether making them unconditionally more aggressive is a good
>> idea, however if memory footprint is the only concern and if shrink-time
>> expediting would suffice, it is certainly worth some investigation.
> Right. I don't think it's a good idea to tweak RCU for this use case.
> RCU parameters have to be optimized for all. Instead the bpf side needs
> to understand how RCU heuristics/watermarks work and play that game.
> For example, Hou's patch 3 has one pending call_rcu per-cpu.
> As soon as one call_rcu_hurry is done all future freed elements gets
> queued into llist and for the next call_rcu_hurry() that list will
> contain 100k elements.
> I believe from RCU pov one pending call_rcu cb is not a reason to
> act right away. It's trying to batch multiple cb-s.
> Right now I'm experimenting with multiple call_rcu calls from the bpf side,
> so that RCU sees multiple pending cb-s and has to act.
> It seems to work much better. Memory footprint is now reasonable.
Could you please share the memory usage of the original v4 and of the
version with reasonable memory, as measured by the htab-mem benchmark?

v3 already uses multiple RCU callbacks to accelerate the reuse of freed
objects. I also did the experiment for v4, as I replied the day before
yesterday, so I will just quote it:

htab-mem-benchmark (reuse-after-RCU-GP):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1159.18   | 0.99                | 0.99             |
| overwrite          | 11.00     | 2288                | 4109             |
| batch_add_batch_del| 8.86      | 1558                | 2763             |
| add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |

For 1), after using kmalloc() in the irq_work to allocate multiple
in-flight RCU callbacks (namely reuse_rcu()), the memory usage decreases
a bit, but not by enough:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1247.00   | 0.97                | 1.00             |
| overwrite          | 16.56     | 490.18              | 557.17           |
| batch_add_batch_del| 11.31     | 276.32              | 360.89           |
| add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |


By comparing the implementations of v3 and v4, I found one hack which
reduces the memory usage of v4 (with the per-cpu reusable list)
significantly, making the memory usage of v3 and v4 similar. If we queue
an empty work item before calling irq_work_raise(), as shown below, the
invocation latency of reuse_rcu() (a normal RCU callback) decreases from
~150ms to ~10ms. I think the reason is that the duration of the normal
RCU grace period decreases a lot, but I don't know why that happens. I
hope to get some help from Paul. Because Paul doesn't have the full
context, I will try to explain the context of this weird problem below.
And Alexei, could you please also try the hack below with your
multiple-rcu-cbs version ?

Hi Paul,

I just found that the time between the call to call_rcu(.., reuse_rcu)
and the invocation of the RCU callback (namely reuse_rcu()) decreases a
lot (from ~150ms to ~10ms) if I queue an empty kworker periodically, as
shown in the diff below. Before the diff below is applied, the benchmark
process does the following things on a VM with 8 CPUs:

1) create a pthread htab_mem_producer on each CPU and pin the thread to
that specific CPU
2) htab_mem_producer calls syscall(__NR_getpgid) repeatedly in a tight
loop
3) each getpgid() call triggers the invocation of a bpf program attached
to the getpgid() syscall
4) the bpf program overwrites 2048 elements in a bpf hash map
5) during the overwrite, it first frees the existing element
6) the free calls unit_free(), and unit_free() raises an irq_work in
batches, after 96 elements have been freed
7) in the irq_work, it allocates a new struct to hold the freed elements
and the rcu_head, and does call_rcu(..., reuse_rcu)
8) reuse_rcu() just moves these freed elements onto a per-cpu reuse list
9) after the free completes, the overwrite allocates a new element
10) the allocation may also raise an irq_work in batches, after the
preallocated elements have been exhausted
11) in that irq_work, it tries to fetch elements from the per-cpu reuse
list and, if the list is empty, falls back to kmalloc()

For the procedure described above, the latency between the call to
call_rcu() and the invocation of reuse_rcu() is about ~150ms or larger.
I have also checked the latency of syscall(__NR_getpgid), and all
latencies are less than 1ms. But after queueing an empty kworker in step
6), the callback latency decreases from ~150ms to ~10ms, and I suspect
that is because the RCU grace period becomes much shorter, but I don't
know how to debug that (e.g., how to debug why the RCU grace period is
so long in the first place), so I hope to get some help.

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks + queue_empty_work):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| overwrite          | 13.85     | 17.89               | 21.49            |
| batch_add_batch_del| 10.22     | 16.65               | 19.07            |
| add_del_on_diff_cpu| 3.82      | 21.36               | 33.05            |


+static void bpf_ma_prepare_reuse_work(struct work_struct *work)
+{
+       udelay(100);
+}
+
 /* When size != 0 bpf_mem_cache for each cpu.
  * This is typical bpf hash map use case when all elements have equal size.
  *
@@ -547,6 +559,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
                        c->cpu = cpu;
                        c->objcg = objcg;
                        c->percpu_size = percpu_size;
+                       INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
                        raw_spin_lock_init(&c->lock);
                        c->reuse.percpu = percpu;
                        c->reuse.cpu = cpu;
@@ -574,6 +587,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
                        c->unit_size = sizes[i];
                        c->cpu = cpu;
                        c->objcg = objcg;
+                       INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
                        raw_spin_lock_init(&c->lock);
                        c->reuse.percpu = percpu;
                        c->reuse.lock = &c->lock;
@@ -793,6 +807,8 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
                        c->prepare_reuse_tail = llnode;
                __llist_add(llnode, &c->prepare_reuse_head);
                cnt = ++c->prepare_reuse_cnt;
+               if (cnt > c->high_watermark && !work_pending(&c->reuse_work))
+                       queue_work(bpf_ma_wq, &c->reuse_work);
        } else {
                /* unit_free() cannot fail. Therefore add an object to atomic
                 * llist. reuse_bulk() will drain it. Though free_llist_extra is
@@ -901,3 +917,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)

        return !ret ? NULL : ret + LLIST_NODE_SZ;
 }
+
+static int __init bpf_ma_init(void)
+{
+       bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
+       BUG_ON(!bpf_ma_wq);
+       return 0;
+}
+late_initcall(bpf_ma_init);



> Could you point me to a code in RCU where it's doing callback batching?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08  2:55             ` Paul E. McKenney
@ 2023-06-08  3:43               ` Hou Tao
  2023-06-08 16:18                 ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Hou Tao @ 2023-06-08  3:43 UTC (permalink / raw)
  To: paulmck
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

Hi Paul,

On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>> As said in the commit message, the command line for test is
>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>> using default max_entries and the number of CPUs is greater than 15,
>>>>> use_percpu_counter will be false.
>>>> Right. percpu or not depends on number of cpus.
SNIP
>>>>  known, because I had just proposed it in the email yesterday.
>>> Also noticed that the overhead of shared reuse_ready list
>>> comes both from the contended lock and from cache misses
>>> when one cpu pushes to the list after RCU GP and another
>>> cpu removes.
>>>
>>> Also low/batch/high watermark are all wrong in patch 3.
>>> low=32 and high=96 makes no sense when it's not a single list.
>>> I'm experimenting with 32 for all three heuristics.
>>>
>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
>>> are redundant.
>>> unit_free() can push into free_by_rcu directly
>>> then reuse_bulk() can fill it up with free_llist_extra and
>>> move them into waiting_for_gp.
>> Yes. Indeed missing that.
>>> All these _tail optimizations are obscuring the code and make it hard
>>> to notice these issues.
>>>
>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>
>>>> Answering your other email:
>>>>
>>>>> I see your point. I will continue to debug the memory usage difference
>>>>> between v3 and v4.
>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>> Don't agree with that. I still think the potential memory usage of v4 is
>> a problem and the difference memory usage between v3 and v4 reveals that
>> there is some peculiarity in RCU subsystem, because the difference comes
>> from the duration of RCU grace period. We need to find out why and fix
>> or workaround that.
> A tight loop in the kernel can extend RCU grace periods, especially
> for kernels built with CONFIG_PREEPTION=n.  Placing things like
> cond_resched() in such loops can help.  Of course, if you are in an
> RCU read-side critical section (or holding a spinlock), you will need
> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
> state upon re-entry is valid.  After all, having exited either type of
> critical section, anything might happen.

As said in the help-wanted email I just sent out, it is weird that the
RCU grace period is extended a lot, up to ~150ms or more, but queueing a
dummy kworker periodically, one which does nothing, reduces the grace
period to ~10ms. I have explained the context of the problem in that
email, and I hope we can get some help from you and the RCU experts in
the community.

Regards,
Tao
>
> 							Thanx, Paul
>
>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>>>> make the list contain thousands of elements.
>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>> Which is not the case.
>>>> I'm not sure why you're saying that RCU GP is 30ms.
>> Because these freed elements will be freed after one RCU GP and in v4
>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
>> these freed elements will be accumulated on the list.
>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>>>> Network and block IO doesn't process 176K packets without resched.
>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>
>>>> For small size buckets low_watermark=32 and high=96.
>>>> We typically move 32 elements at a time from one list to another.
>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>
>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
>>>>> introduced. Previously, I though it was used to handle the case in which
>>>>> allocation and freeing are done on different CPUs.
>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>
>>>>> And as we can see
>>>>> from the benchmark result above and in v3, the performance and the
>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>
>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>> Please make bench in patch 2 yield the cpu after few updates.
>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>>>> 64 updates is realistic too. A thousand is not.
>> Will do that.
>>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08  3:35                     ` Hou Tao
@ 2023-06-08  4:30                       ` Hou Tao
  2023-06-08  4:30                       ` Alexei Starovoitov
  1 sibling, 0 replies; 32+ messages in thread
From: Hou Tao @ 2023-06-08  4:30 UTC (permalink / raw)
  To: Paul E. McKenney, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
	Jiri Olsa, John Fastabend, rcu, houtao1

Hi,

On 6/8/2023 11:35 AM, Hou Tao wrote:
> Hi,
>
> On 6/8/2023 8:34 AM, Alexei Starovoitov wrote:
>> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
>>>> On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
SNIP
>
> By comparing the implementation of v3 and v4, I just find one hack which
> could reduce the memory usage of v4 (with per-cpu reusabe list)
> significantly and memory usage will be similar between v3 and v4. If we
> queue a empty work before calling irq_work_raise() as shown below, the
> calling latency of reuse_rcu (a normal RCU callback) will decreased from
> ~150ms to ~10 ms. I think the reason is that the duration of normal RCU
> grace period is decreased a lot, but I don't know why did it happen.
> Hope to get some help from Paul. Because Paul doesn't have enough
> context, so I will try to explain the context of the weird problem
> below. And Alexei, could you please also try the hack below for your
> multiple-rcu-cbs version ?
An update on the queue_work() hack: it works for both the
CONFIG_PREEMPT=y and CONFIG_PREEMPT=n cases. I will try to enable RCU
tracing to check whether or not there is any difference.
>
> Hi Paul,
>
> I just found out the time between the calling of call_rcu(..,
> reuse_rcub) and the calling of RCU callback (namely resue_cb()) will
> decrease a lot (from ~150ms to ~10ms) if I queued a empty kworker
> periodically as shown in the diff below. Before the diff below applied,
> the benchmark process will do the following things on a VM with 8 CPUs:
>
> 1) create a pthread htab_mem_producer on each CPU and pinned the thread
> on the specific CPU
> 2) htab_mem_producer will call syscall(__NR_getpgid) repeatedly in a
> dead-loop
> 3) the calling of getpgid() will trigger the invocation of a bpf program
> attached on getpgid() syscall
> 4) the bpf program will overwrite 2048 elements in a bpf hash map
> 5) during the overwrite, it will free the existed element firstly
> 6) the free will call unit_free(), unit_free() will trigger a irq-work
> batchly after 96-element were freed
> 7) in the irq_work, it will allocate a new struct to save the freed
> elements and the rcu_head and do call_rcu(..., reuse_rcu)
> 8) in reuse_rcu() it just moves these freed elements into a per-cpu
> reuse list
> 9) After the free completes, the overwrite will allocate a new element
> 10) the allocation may also trigger a irq-work batchly after the
> preallocated elements were exhausted
> 11) in the irq-work, it will try to fetch elements from per-cpu reuse
> list and if it is empty, it will do kmalloc()
>
> For the procedure describe above, the calling latency between the call
> of call_rcu() and the call of reuse_rcu is about ~150ms or larger. I
> have also checked the calling latency of syscall(__NR_getpgid) and all
> latency is less than 1ms. But after do queueing a empty kwork in step
> 6), the calling latency will decreased from ~150ms to ~10ms and I
> suspect that is because the RCU grace period is decreased a lot, but I
> don't know how to debug that (e.g., to debug why the RCU grace period is
> so long), so I hope to get some help.
>
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks + queue_empty_work):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | overwrite          | 13.85     | 17.89               | 21.49            |
> | batch_add_batch_del| 10.22     | 16.65               | 19.07            |
> | add_del_on_diff_cpu| 3.82      | 21.36               | 33.05            |
>
>
> +static void bpf_ma_prepare_reuse_work(struct work_struct *work)
> +{
> +       udelay(100);
> +}
> +
>  /* When size != 0 bpf_mem_cache for each cpu.
>   * This is typical bpf hash map use case when all elements have equal size.
>   *
> @@ -547,6 +559,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int
> size, bool percpu)
>                         c->cpu = cpu;
>                         c->objcg = objcg;
>                         c->percpu_size = percpu_size;
> +                       INIT_WORK(&c->reuse_work,
> bpf_ma_prepare_reuse_work);
>                         raw_spin_lock_init(&c->lock);
>                         c->reuse.percpu = percpu;
>                         c->reuse.cpu = cpu;
> @@ -574,6 +587,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int
> size, bool percpu)
>                         c->unit_size = sizes[i];
>                         c->cpu = cpu
>                         c->objcg = objcg;
> +                       INIT_WORK(&c->reuse_work,
> bpf_ma_prepare_reuse_work);
>                         raw_spin_lock_init(&c->lock);
>                         c->reuse.percpu = percpu;
>                         c->reuse.lock = &c->lock;
> @@ -793,6 +807,8 @@ static void notrace unit_free(struct bpf_mem_cache
> *c, void *ptr)
>                         c->prepare_reuse_tail = llnode;
>                 __llist_add(llnode, &c->prepare_reuse_head);
>                 cnt = ++c->prepare_reuse_cnt;
> +               if (cnt > c->high_watermark &&
> !work_pending(&c->reuse_work))
> +                       queue_work(bpf_ma_wq, &c->reuse_work);
>         } else {
>                 /* unit_free() cannot fail. Therefore add an object to
> atomic
>                  * llist. reuse_bulk() will drain it. Though
> free_llist_extra is
> @@ -901,3 +917,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct
> bpf_mem_alloc *ma, gfp_t flags)
>
>         return !ret ? NULL : ret + LLIST_NODE_SZ;
>  }
> +
> +static int __init bpf_ma_init(void)
> +{
> +       bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
> +       BUG_ON(!bpf_ma_wq);
> +       return 0;
> +}
> +late_initcall(bpf_ma_init);
>
>
>
>> Could you point me to a code in RCU where it's doing callback batching?
> .


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08  3:35                     ` Hou Tao
  2023-06-08  4:30                       ` Hou Tao
@ 2023-06-08  4:30                       ` Alexei Starovoitov
  1 sibling, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-06-08  4:30 UTC (permalink / raw)
  To: Hou Tao
  Cc: Paul E. McKenney, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Wed, Jun 7, 2023 at 8:36 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> below. And Alexei, could you please also try the hack below for your
> multiple-rcu-cbs version ?

Before any further experiments... please fix the bench in patch 2.
Make sure it exits out of RCU CS after no more than 64 map_update_elem.
Just send that patch alone.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08  3:43               ` Hou Tao
@ 2023-06-08 16:18                 ` Paul E. McKenney
  2023-06-09  3:02                   ` Hou Tao
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2023-06-08 16:18 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
> Hi Paul,
> 
> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> > On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> >>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> >>> <alexei.starovoitov@gmail.com> wrote:
> >>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>>>> As said in the commit message, the command line for test is
> >>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>>>> using default max_entries and the number of CPUs is greater than 15,
> >>>>> use_percpu_counter will be false.
> >>>> Right. percpu or not depends on number of cpus.
> SNIP
> >>>>  known, because I had just proposed it in the email yesterday.
> >>> Also noticed that the overhead of shared reuse_ready list
> >>> comes both from the contended lock and from cache misses
> >>> when one cpu pushes to the list after RCU GP and another
> >>> cpu removes.
> >>>
> >>> Also low/batch/high watermark are all wrong in patch 3.
> >>> low=32 and high=96 makes no sense when it's not a single list.
> >>> I'm experimenting with 32 for all three heuristics.
> >>>
> >>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> >>> are redundant.
> >>> unit_free() can push into free_by_rcu directly
> >>> then reuse_bulk() can fill it up with free_llist_extra and
> >>> move them into waiting_for_gp.
> >> Yes. Indeed missing that.
> >>> All these _tail optimizations are obscuring the code and make it hard
> >>> to notice these issues.
> >>>
> >>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>>>
> >>>> Answering your other email:
> >>>>
> >>>>> I see your point. I will continue to debug the memory usage difference
> >>>>> between v3 and v4.
> >>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
> >> Don't agree with that. I still think the potential memory usage of v4 is
> >> a problem and the difference memory usage between v3 and v4 reveals that
> >> there is some peculiarity in RCU subsystem, because the difference comes
> >> from the duration of RCU grace period. We need to find out why and fix
> >> or workaround that.
> > A tight loop in the kernel can extend RCU grace periods, especially
> > for kernels built with CONFIG_PREEPTION=n.  Placing things like
> > cond_resched() in such loops can help.  Of course, if you are in an
> > RCU read-side critical section (or holding a spinlock), you will need
> > to exit, cond_resched(), then re-enter.  Taking care to ensure that the
> > state upon re-entry is valid.  After all, having exited either type of
> > critical section, anything might happen.
> 
> As said in the help-wanted email just send out, it was weird that the
> RCU grace period was extended a lot, up to ~150ms or more. But queue a
> dummy kworker periodically which does nothing will help to reduce the
> grace period to ~10ms. I have explained the context of the problem in
> that email. And hope that we could get some help from you and the RCU
> experts in the community.

OK, I will bite...  Why do you think this is weird?

RCU depends on context switches, among other things.  If you have a
long-running in-kernel task, that will naturally extend the grace period.
Scheduling the empty worker provided the context switch that RCU needed
to make forward progress.

So 150 milliseconds is an OK RCU grace period.  A bit long, but not
excessively so.  In contrast, by default in mainline, RCU starts seriously
complaining if its grace period is extended beyond 21 *seconds*.  This is
when the RCU CPU stall warning will appear.  (Yes, some Android configs
are tuning this down to 20 milliseconds, but that is a special hardware
and software configuration.)

But if you want shorter RCU grace periods, what can you do?

1.	Follow Alexei's good advice on one of your early patches.

2.	As an alternative to scheduling an empty kworker, invoke something
	like rcu_momentary_dyntick_idle() periodically.  Note well that
	it is illegal to invoke this in an RCU read-side critical section,
	with preemption disabled, from idle, ...

3.	In non-preemptible kernels, cond_resched() is a much lighter
	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
	kernels get the same effect by preempting.  In your case, this
	is also true, but it takes 150 milliseconds.)

That should do for a start.  ;-)
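For example, option 2 might look like the sketch below, assuming the
loop runs in a context that satisfies the constraints above (not in an
RCU read-side critical section, preemption enabled, not from idle);
do_unit_of_work() and the batch size are placeholders:

#include <linux/rcupdate.h>
#include <linux/sched.h>

static void do_unit_of_work(void)
{
	/* placeholder for the real per-iteration work */
}

static void long_running_loop(unsigned long nr_units)
{
	unsigned long i;

	for (i = 0; i < nr_units; i++) {
		do_unit_of_work();
		/* Every so often, explicitly report a momentary quiescent
		 * state to RCU instead of waiting for a context switch to
		 * happen naturally.
		 */
		if ((i & 1023) == 0)
			rcu_momentary_dyntick_idle();
	}
}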

							Thanx, Paul

> Regards,
> Tao
> >
> > 							Thanx, Paul
> >
> >>>>> I don't think so. Let's considering the per-cpu list first. Assume the
> >>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
> >>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
> >>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> >>>>> For the per-ma list, when the number of CPUs increased, it is easy to
> >>>>> make the list contain thousands of elements.
> >>>> That would be true only if there were no scheduling events in all of 176K ops.
> >>>> Which is not the case.
> >>>> I'm not sure why you're saying that RCU GP is 30ms.
> >> Because these freed elements will be freed after one RCU GP and in v4
> >> there is only one RCU callback per-cpu, so before one RCU GP is expired,
> >> these freed elements will be accumulated on the list.
> >>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
> >>>> Network and block IO doesn't process 176K packets without resched.
> >>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
> >>>>
> >>>> For small size buckets low_watermark=32 and high=96.
> >>>> We typically move 32 elements at a time from one list to another.
> >>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
> >>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
> >>>> elements from waiting_for_gp one at a time and kfree it one at a time.
> >>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>>>
> >>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
> >>>>> introduced. Previously, I though it was used to handle the case in which
> >>>>> allocation and freeing are done on different CPUs.
> >>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>>>
> >>>>> And as we can see
> >>>>> from the benchmark result above and in v3, the performance and the
> >>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
> >>>> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>>>
> >>>> So far the only bench we can trust and analyze is map_perf_test.
> >>>> Please make bench in patch 2 yield the cpu after few updates.
> >>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
> >>>> 64 updates is realistic too. A thousand is not.
> >> Will do that.
> >>
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-08 16:18                 ` Paul E. McKenney
@ 2023-06-09  3:02                   ` Hou Tao
  2023-06-09 14:30                     ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Hou Tao @ 2023-06-09  3:02 UTC (permalink / raw)
  To: paulmck
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

Hi Paul,

On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
> On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
>> Hi Paul,
>>
>> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
>>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>>>> As said in the commit message, the command line for test is
>>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>>>> using default max_entries and the number of CPUs is greater than 15,
>>>>>>> use_percpu_counter will be false.
>>>>>> Right. percpu or not depends on number of cpus.
>> SNIP
>>>>>>  known, because I had just proposed it in the email yesterday.
>>>>> Also noticed that the overhead of shared reuse_ready list
>>>>> comes both from the contended lock and from cache misses
>>>>> when one cpu pushes to the list after RCU GP and another
>>>>> cpu removes.
>>>>>
>>>>> Also low/batch/high watermark are all wrong in patch 3.
>>>>> low=32 and high=96 makes no sense when it's not a single list.
>>>>> I'm experimenting with 32 for all three heuristics.
>>>>>
>>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
>>>>> are redundant.
>>>>> unit_free() can push into free_by_rcu directly
>>>>> then reuse_bulk() can fill it up with free_llist_extra and
>>>>> move them into waiting_for_gp.
>>>> Yes. Indeed missing that.
>>>>> All these _tail optimizations are obscuring the code and make it hard
>>>>> to notice these issues.
>>>>>
>>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>>>
>>>>>> Answering your other email:
>>>>>>
>>>>>>> I see your point. I will continue to debug the memory usage difference
>>>>>>> between v3 and v4.
>>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>>>> Don't agree with that. I still think the potential memory usage of v4 is
>>>> a problem and the difference memory usage between v3 and v4 reveals that
>>>> there is some peculiarity in RCU subsystem, because the difference comes
>>>> from the duration of RCU grace period. We need to find out why and fix
>>>> or workaround that.
>>> A tight loop in the kernel can extend RCU grace periods, especially
>>> for kernels built with CONFIG_PREEPTION=n.  Placing things like
>>> cond_resched() in such loops can help.  Of course, if you are in an
>>> RCU read-side critical section (or holding a spinlock), you will need
>>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
>>> state upon re-entry is valid.  After all, having exited either type of
>>> critical section, anything might happen.
>> As said in the help-wanted email just send out, it was weird that the
>> RCU grace period was extended a lot, up to ~150ms or more. But queue a
>> dummy kworker periodically which does nothing will help to reduce the
>> grace period to ~10ms. I have explained the context of the problem in
>> that email. And hope that we could get some help from you and the RCU
>> experts in the community.
> OK, I will bite...  Why do you think this is weird?
>
> RCU depends on context switches, among other things.  If you have a
> long-running in-kernel task, that will naturally extend the grace period.
> Scheduling the empty worker provided the context switch that RCU needed
> to make forward progress.

Because the empty kworker trick also works for
CONFIG_PREEMPT_VOLUNTARY=y, and with CONFIG_PREEMPT_VOLUNTARY=y there
is neither an implicit nor an explicit call to schedule() in the
kernel space of the producer thread. I also don't know how to define
a long-running in-kernel task, because I have checked the latency of
the getpgid() syscall in the producer thread when
CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below, there
are indeed some outliers, but most of the latencies are less than
1ms, so the producer thread can do a context switch within at most
~1ms. But before the empty kworker trick the latency of the RCU
callback was about 100ms or more, and after the trick it drops to
~8ms.

@hist_us:
[128, 256)            60
|                                                    |
[256, 512)        101364
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)          17580
|@@@@@@@@@                                           |
[1K, 2K)              16
|                                                    |

@stat_us: count 119020, average 436, total 51957277
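
For reference, a minimal sketch of the empty-kworker trick mentioned
above: a self-rearming delayed work item whose only job is to force a
context switch on the CPU so that RCU can note a quiescent state. The
5ms period and the rcu_kick_* names are just assumptions for the
sketch, not the exact values or code used in the measurements above.

#include <linux/workqueue.h>
#include <linux/jiffies.h>

static void rcu_kick_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(rcu_kick_work, rcu_kick_fn);

/* Empty body: the context switch into the kworker is the whole point. */
static void rcu_kick_fn(struct work_struct *work)
{
        schedule_delayed_work(&rcu_kick_work, msecs_to_jiffies(5));
}

/* Start the periodic kick, e.g. from module init or benchmark setup. */
static void rcu_kick_start(void)
{
        schedule_delayed_work(&rcu_kick_work, msecs_to_jiffies(5));
}

Calling rcu_kick_start() once is enough; the work item keeps
re-queuing itself until cancel_delayed_work_sync(&rcu_kick_work) is
called.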


>
> So 150 milliseconds is an OK RCU grace period.  A bit long, but not
> excessively so.  In contrast, by default in mainline, RCU starts seriously
> complaining if its grace period is extended beyond 21 *seconds*.  This is
> when the RCU CPU stall warning will appear.  (Yes, some Android configs
> are tuning this down to 20 milliseconds, but that is a special hardware
> and software configuration.)
>
> But if you want shorter RCU grace periods, what can you do?
>
> 1.	Follow Alexei's good advice on one of your early patches.
>
> 2.	As an alternative to scheduling an empty kworker, invoke something
> 	like rcu_momentary_dyntick_idle() periodically.  Note well that
> 	it is illegal to invoke this in an RCU read-side critical section,
> 	with preemption disabled, from idle, ...
>
> 3.	In non-preemptible kernels, cond_resched() is a much lighter
> 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
> 	kernels get the same effect by preempting.  In your case, this
> 	is also true, but it takes 150 milliseconds.)
>
> That should do for a start.  ;-)
>
> 							Thanx, Paul
>
>> Regards,
>> Tao
>>> 							Thanx, Paul
>>>
>>>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>>>>>> make the list contain thousands of elements.
>>>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>>>> Which is not the case.
>>>>>> I'm not sure why you're saying that RCU GP is 30ms.
>>>> Because these freed elements will be freed after one RCU GP and in v4
>>>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
>>>> these freed elements will be accumulated on the list.
>>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>>>>>> Network and block IO doesn't process 176K packets without resched.
>>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>>>
>>>>>> For small size buckets low_watermark=32 and high=96.
>>>>>> We typically move 32 elements at a time from one list to another.
>>>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>>>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
>>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>>>
>>>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
>>>>>>> introduced. Previously, I though it was used to handle the case in which
>>>>>>> allocation and freeing are done on different CPUs.
>>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>>>
>>>>>>> And as we can see
>>>>>>> from the benchmark result above and in v3, the performance and the
>>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>>>
>>>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>>>> Please make bench in patch 2 yield the cpu after few updates.
>>>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>>>>>> 64 updates is realistic too. A thousand is not.
>>>> Will do that.
>>>>
> .


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-09  3:02                   ` Hou Tao
@ 2023-06-09 14:30                     ` Paul E. McKenney
  2023-06-12  2:03                       ` Hou Tao
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2023-06-09 14:30 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Fri, Jun 09, 2023 at 11:02:59AM +0800, Hou Tao wrote:
> Hi Paul,
> 
> On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
> > On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
> >> Hi Paul,
> >>
> >> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> >>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> >>>> Hi,
> >>>>
> >>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> >>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> >>>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>>>>>> As said in the commit message, the command line for test is
> >>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>>>>>> using default max_entries and the number of CPUs is greater than 15,
> >>>>>>> use_percpu_counter will be false.
> >>>>>> Right. percpu or not depends on number of cpus.
> >> SNIP
> >>>>>>  known, because I had just proposed it in the email yesterday.
> >>>>> Also noticed that the overhead of shared reuse_ready list
> >>>>> comes both from the contended lock and from cache misses
> >>>>> when one cpu pushes to the list after RCU GP and another
> >>>>> cpu removes.
> >>>>>
> >>>>> Also low/batch/high watermark are all wrong in patch 3.
> >>>>> low=32 and high=96 makes no sense when it's not a single list.
> >>>>> I'm experimenting with 32 for all three heuristics.
> >>>>>
> >>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> >>>>> are redundant.
> >>>>> unit_free() can push into free_by_rcu directly
> >>>>> then reuse_bulk() can fill it up with free_llist_extra and
> >>>>> move them into waiting_for_gp.
> >>>> Yes. Indeed missing that.
> >>>>> All these _tail optimizations are obscuring the code and make it hard
> >>>>> to notice these issues.
> >>>>>
> >>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>>>>>
> >>>>>> Answering your other email:
> >>>>>>
> >>>>>>> I see your point. I will continue to debug the memory usage difference
> >>>>>>> between v3 and v4.
> >>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
> >>>> Don't agree with that. I still think the potential memory usage of v4 is
> >>>> a problem and the difference memory usage between v3 and v4 reveals that
> >>>> there is some peculiarity in RCU subsystem, because the difference comes
> >>>> from the duration of RCU grace period. We need to find out why and fix
> >>>> or workaround that.
> >>> A tight loop in the kernel can extend RCU grace periods, especially
> >>> for kernels built with CONFIG_PREEPTION=n.  Placing things like
> >>> cond_resched() in such loops can help.  Of course, if you are in an
> >>> RCU read-side critical section (or holding a spinlock), you will need
> >>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
> >>> state upon re-entry is valid.  After all, having exited either type of
> >>> critical section, anything might happen.
> >> As said in the help-wanted email just send out, it was weird that the
> >> RCU grace period was extended a lot, up to ~150ms or more. But queue a
> >> dummy kworker periodically which does nothing will help to reduce the
> >> grace period to ~10ms. I have explained the context of the problem in
> >> that email. And hope that we could get some help from you and the RCU
> >> experts in the community.
> > OK, I will bite...  Why do you think this is weird?
> >
> > RCU depends on context switches, among other things.  If you have a
> > long-running in-kernel task, that will naturally extend the grace period.
> > Scheduling the empty worker provided the context switch that RCU needed
> > to make forward progress.
> 
> Because the empty kworker trick also works for
> CONFIG_PREEMPT_VOLUNTARY=y, and with CONFIG_PREEMPT_VOLUNTARY=y there
> is neither an implicit nor an explicit call to schedule() in the
> kernel space of the producer thread. I also don't know how to define
> a long-running in-kernel task, because I have checked the latency of
> the getpgid() syscall in the producer thread when
> CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below, there
> are indeed some outliers, but most of the latencies are less than
> 1ms, so the producer thread can do a context switch within at most
> ~1ms. But before the empty kworker trick the latency of the RCU
> callback was about 100ms or more, and after the trick it drops to
> ~8ms.
> 
> @hist_us:
> [128, 256)            60
> |                                                    |
> [256, 512)        101364
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K)          17580
> |@@@@@@@@@                                           |
> [1K, 2K)              16
> |                                                    |
> 
> @stat_us: count 119020, average 436, total 51957277

Ah!

The trick is that if you have a non-nohz_full CPU spending most of its
time in kernel space (in your case, due to repeated system calls), with
high probability this CPU will appear to RCU as if it were spending all
of its time in kernel space.  This probability will increase with the
fraction of the time this CPU spends in kernel space.

If I am reading your histogram correctly, the overhead of your system
call is usually between 256 and 512 microseconds.  If you are invoking
that system call in a tight loop, that CPU could easily be spending
way more than 99% of its time in the kernel.  Unless you have similar
statistics for the average length of your CPU's times in userspace,
let's assume exactly 99%.

On a non-nohz_full CPU, RCU's only hint that a CPU is executing in
userspace occurs when the once-per-jiffy scheduling-clock interrupt
is taken from userspace.  If the CPU is spending 99% of its time in
the kernel, we can expect only one in 100 interrupts being taken from
userspace.  On a HZ=1000 system, that will give you 100 milliseconds
right there.

But it gets worse.  The variance is quite large.  For example, for a
given grace period, there is about a 5% chance that the delay will be
300 milliseconds instead of the expected 100 milliseconds.  (You can
easily check this by taking 0.99 to the 300th power.  Or whatever other
power you desire.)
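
A quick sketch of that arithmetic, assuming each scheduling-clock
interrupt independently has a 1% chance of landing in userspace and
HZ=1000, so one jiffy is one millisecond (compile with -lm):

#include <math.h>
#include <stdio.h>

/* Probability that none of the first n ticks is taken from userspace. */
static double no_userspace_tick(double p, int n)
{
        return pow(1.0 - p, n);
}

int main(void)
{
        printf("beyond 100 jiffies: %.3f\n", no_userspace_tick(0.01, 100)); /* ~0.366 */
        printf("beyond 300 jiffies: %.3f\n", no_userspace_tick(0.01, 300)); /* ~0.049 */
        return 0;
}

That second number is the roughly 5% chance of a 300-millisecond delay.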

So the problem is that simple statistics is biting pretty hard here.

As you may have noticed, life is like that sometimes.  ;-)

							Thanx, Paul

> > So 150 milliseconds is an OK RCU grace period.  A bit long, but not
> > excessively so.  In contrast, by default in mainline, RCU starts seriously
> > complaining if its grace period is extended beyond 21 *seconds*.  This is
> > when the RCU CPU stall warning will appear.  (Yes, some Android configs
> > are tuning this down to 20 milliseconds, but that is a special hardware
> > and software configuration.)
> >
> > But if you want shorter RCU grace periods, what can you do?
> >
> > 1.	Follow Alexei's good advice on one of your early patches.
> >
> > 2.	As an alternative to scheduling an empty kworker, invoke something
> > 	like rcu_momentary_dyntick_idle() periodically.  Note well that
> > 	it is illegal to invoke this in an RCU read-side critical section,
> > 	with preemption disabled, from idle, ...
> >
> > 3.	In non-preemptible kernels, cond_resched() is a much lighter
> > 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
> > 	kernels get the same effect by preempting.  In your case, this
> > 	is also true, but it takes 150 milliseconds.)
> >
> > That should do for a start.  ;-)
> >
> > 							Thanx, Paul
> >
> >> Regards,
> >> Tao
> >>> 							Thanx, Paul
> >>>
> >>>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
> >>>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
> >>>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
> >>>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> >>>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
> >>>>>>> make the list contain thousands of elements.
> >>>>>> That would be true only if there were no scheduling events in all of 176K ops.
> >>>>>> Which is not the case.
> >>>>>> I'm not sure why you're saying that RCU GP is 30ms.
> >>>> Because these freed elements will be freed after one RCU GP and in v4
> >>>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
> >>>> these freed elements will be accumulated on the list.
> >>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
> >>>>>> Network and block IO doesn't process 176K packets without resched.
> >>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
> >>>>>>
> >>>>>> For small size buckets low_watermark=32 and high=96.
> >>>>>> We typically move 32 elements at a time from one list to another.
> >>>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
> >>>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
> >>>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
> >>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>>>>>
> >>>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
> >>>>>>> introduced. Previously, I though it was used to handle the case in which
> >>>>>>> allocation and freeing are done on different CPUs.
> >>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>>>>>
> >>>>>>> And as we can see
> >>>>>>> from the benchmark result above and in v3, the performance and the
> >>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
> >>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>>>>>
> >>>>>> So far the only bench we can trust and analyze is map_perf_test.
> >>>>>> Please make bench in patch 2 yield the cpu after few updates.
> >>>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
> >>>>>> 64 updates is realistic too. A thousand is not.
> >>>> Will do that.
> >>>>
> > .
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-09 14:30                     ` Paul E. McKenney
@ 2023-06-12  2:03                       ` Hou Tao
  2023-06-12  3:40                         ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Hou Tao @ 2023-06-12  2:03 UTC (permalink / raw)
  To: paulmck
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

Hi Paul,

On 6/9/2023 10:30 PM, Paul E. McKenney wrote:
> On Fri, Jun 09, 2023 at 11:02:59AM +0800, Hou Tao wrote:
>> Hi Paul,
>>
>> On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
>>> On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
>>>> Hi Paul,
>>>>
>>>> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
>>>>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>>>>>> As said in the commit message, the command line for test is
>>>>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>>>>>> using default max_entries and the number of CPUs is greater than 15,
>>>>>>>>> use_percpu_counter will be false.
>>>>>>>> Right. percpu or not depends on number of cpus.
>>>> SNIP
>>>>>>>>  known, because I had just proposed it in the email yesterday.
>>>>>>> Also noticed that the overhead of shared reuse_ready list
>>>>>>> comes both from the contended lock and from cache misses
>>>>>>> when one cpu pushes to the list after RCU GP and another
>>>>>>> cpu removes.
>>>>>>>
>>>>>>> Also low/batch/high watermark are all wrong in patch 3.
>>>>>>> low=32 and high=96 makes no sense when it's not a single list.
>>>>>>> I'm experimenting with 32 for all three heuristics.
>>>>>>>
>>>>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
>>>>>>> are redundant.
>>>>>>> unit_free() can push into free_by_rcu directly
>>>>>>> then reuse_bulk() can fill it up with free_llist_extra and
>>>>>>> move them into waiting_for_gp.
>>>>>> Yes. Indeed missing that.
>>>>>>> All these _tail optimizations are obscuring the code and make it hard
>>>>>>> to notice these issues.
>>>>>>>
>>>>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>>>>>
>>>>>>>> Answering your other email:
>>>>>>>>
>>>>>>>>> I see your point. I will continue to debug the memory usage difference
>>>>>>>>> between v3 and v4.
>>>>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>>>>>> Don't agree with that. I still think the potential memory usage of v4 is
>>>>>> a problem and the difference memory usage between v3 and v4 reveals that
>>>>>> there is some peculiarity in RCU subsystem, because the difference comes
>>>>>> from the duration of RCU grace period. We need to find out why and fix
>>>>>> or workaround that.
>>>>> A tight loop in the kernel can extend RCU grace periods, especially
>>>>> for kernels built with CONFIG_PREEPTION=n.  Placing things like
>>>>> cond_resched() in such loops can help.  Of course, if you are in an
>>>>> RCU read-side critical section (or holding a spinlock), you will need
>>>>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
>>>>> state upon re-entry is valid.  After all, having exited either type of
>>>>> critical section, anything might happen.
>>>> As said in the help-wanted email just send out, it was weird that the
>>>> RCU grace period was extended a lot, up to ~150ms or more. But queue a
>>>> dummy kworker periodically which does nothing will help to reduce the
>>>> grace period to ~10ms. I have explained the context of the problem in
>>>> that email. And hope that we could get some help from you and the RCU
>>>> experts in the community.
>>> OK, I will bite...  Why do you think this is weird?
>>>
>>> RCU depends on context switches, among other things.  If you have a
>>> long-running in-kernel task, that will naturally extend the grace period.
>>> Scheduling the empty worker provided the context switch that RCU needed
>>> to make forward progress.
>> Because the empty kworker trick also works for
>> CONFIG_PREEMPT_VOLUNTARY=y, and with CONFIG_PREEMPT_VOLUNTARY=y there
>> is neither an implicit nor an explicit call to schedule() in the
>> kernel space of the producer thread. I also don't know how to define
>> a long-running in-kernel task, because I have checked the latency of
>> the getpgid() syscall in the producer thread when
>> CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below, there
>> are indeed some outliers, but most of the latencies are less than
>> 1ms, so the producer thread can do a context switch within at most
>> ~1ms. But before the empty kworker trick the latency of the RCU
>> callback was about 100ms or more, and after the trick it drops to
>> ~8ms.
>>
>> @hist_us:
>> [128, 256)            60
>> |                                                    |
>> [256, 512)        101364
>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [512, 1K)          17580
>> |@@@@@@@@@                                           |
>> [1K, 2K)              16
>> |                                                    |
>>
>> @stat_us: count 119020, average 436, total 51957277
> Ah!
>
> The trick is that if you have a non-nohz_full CPU spending most of its
> time in kernel space (in your case, due to repeated system calls), with
> high probability this CPU will appear to RCU as if it were spending all
> of its time in kernel space.  This probability will increase with the
> fraction of the time this CPU spends in kernel space.
>
> If I am reading your histogram correctly, the overhead of your system
> call is usually between 256 and 512 microseconds.  If you are invoking
> that system call in a tight loop, that CPU could easily be spending
> way more than 99% of its time in the kernel.  Unless you have similar
> statistics for the average length of your CPU's times in userspace,
> let's assume exactly 99%.
>
> On a non-nohz_full CPU, RCU's only hint that a CPU is executing in
> userspace occurs when the once-per-jiffy scheduling-clock interrupt
> is taken from userspace.  If the CPU is spending 99% of its time in
> the kernel, we can expect only one in 100 interrupts being taken from
> userspace.  On a HZ=1000 system, that will give you 100 milliseconds
> right there.
Thanks for the detailed explanation. It seems my previous
understanding of RCU was wrong. My original thought was that when a
syscall returns from kernel space, the process doing the syscall
would be considered to be in a quiescent state, and so an RCU grace
period would expire soon afterwards.
>
> But it gets worse.  The variance is quite large.  For example, for a
> given grace period, there is about a 5% chance that the delay will be
> 300 milliseconds instead of the expected 100 milliseconds.  (You can
> easily check this by taking 0.99 to the 300th power.  Or whatever other
> power you desire.)
>
> So the problem is that simple statistics is biting pretty hard here.
>
> As you may have noticed, life is like that sometimes.  ;-)
>
> 							Thanx, Paul
>
>>> So 150 milliseconds is an OK RCU grace period.  A bit long, but not
>>> excessively so.  In contrast, by default in mainline, RCU starts seriously
>>> complaining if its grace period is extended beyond 21 *seconds*.  This is
>>> when the RCU CPU stall warning will appear.  (Yes, some Android configs
>>> are tuning this down to 20 milliseconds, but that is a special hardware
>>> and software configuration.)
>>>
>>> But if you want shorter RCU grace periods, what can you do?
>>>
>>> 1.	Follow Alexei's good advice on one of your early patches.
>>>
>>> 2.	As an alternative to scheduling an empty kworker, invoke something
>>> 	like rcu_momentary_dyntick_idle() periodically.  Note well that
>>> 	it is illegal to invoke this in an RCU read-side critical section,
>>> 	with preemption disabled, from idle, ...
>>>
>>> 3.	In non-preemptible kernels, cond_resched() is a much lighter
>>> 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
>>> 	kernels get the same effect by preempting.  In your case, this
>>> 	is also true, but it takes 150 milliseconds.)
>>>
>>> That should do for a start.  ;-)
>>>
>>> 							Thanx, Paul
>>>
>>>> Regards,
>>>> Tao
>>>>> 							Thanx, Paul
>>>>>
>>>>>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>>>>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>>>>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>>>>>>>> make the list contain thousands of elements.
>>>>>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>>>>>> Which is not the case.
>>>>>>>> I'm not sure why you're saying that RCU GP is 30ms.
>>>>>> Because these freed elements will be freed after one RCU GP and in v4
>>>>>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
>>>>>> these freed elements will be accumulated on the list.
>>>>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>>>>>>>> Network and block IO doesn't process 176K packets without resched.
>>>>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>>>>>
>>>>>>>> For small size buckets low_watermark=32 and high=96.
>>>>>>>> We typically move 32 elements at a time from one list to another.
>>>>>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>>>>>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>>>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
>>>>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>>>>>
>>>>>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
>>>>>>>>> introduced. Previously, I though it was used to handle the case in which
>>>>>>>>> allocation and freeing are done on different CPUs.
>>>>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>>>>>
>>>>>>>>> And as we can see
>>>>>>>>> from the benchmark result above and in v3, the performance and the
>>>>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>>>>>
>>>>>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>>>>>> Please make bench in patch 2 yield the cpu after few updates.
>>>>>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>>>>>>>> 64 updates is realistic too. A thousand is not.
>>>>>> Will do that.
>>>>>>
>>> .


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
  2023-06-12  2:03                       ` Hou Tao
@ 2023-06-12  3:40                         ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2023-06-12  3:40 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

On Mon, Jun 12, 2023 at 10:03:15AM +0800, Hou Tao wrote:
> Hi Paul,
> 
> On 6/9/2023 10:30 PM, Paul E. McKenney wrote:
> > On Fri, Jun 09, 2023 at 11:02:59AM +0800, Hou Tao wrote:
> >> Hi Paul,
> >>
> >> On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
> >>> On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
> >>>> Hi Paul,
> >>>>
> >>>> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> >>>>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> >>>>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> >>>>>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>>>>>>>> As said in the commit message, the command line for test is
> >>>>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>>>>>>>> using default max_entries and the number of CPUs is greater than 15,
> >>>>>>>>> use_percpu_counter will be false.
> >>>>>>>> Right. percpu or not depends on number of cpus.
> >>>> SNIP
> >>>>>>>>  known, because I had just proposed it in the email yesterday.
> >>>>>>> Also noticed that the overhead of shared reuse_ready list
> >>>>>>> comes both from the contended lock and from cache misses
> >>>>>>> when one cpu pushes to the list after RCU GP and another
> >>>>>>> cpu removes.
> >>>>>>>
> >>>>>>> Also low/batch/high watermark are all wrong in patch 3.
> >>>>>>> low=32 and high=96 makes no sense when it's not a single list.
> >>>>>>> I'm experimenting with 32 for all three heuristics.
> >>>>>>>
> >>>>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> >>>>>>> are redundant.
> >>>>>>> unit_free() can push into free_by_rcu directly
> >>>>>>> then reuse_bulk() can fill it up with free_llist_extra and
> >>>>>>> move them into waiting_for_gp.
> >>>>>> Yes. Indeed missing that.
> >>>>>>> All these _tail optimizations are obscuring the code and make it hard
> >>>>>>> to notice these issues.
> >>>>>>>
> >>>>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>>>>>>>
> >>>>>>>> Answering your other email:
> >>>>>>>>
> >>>>>>>>> I see your point. I will continue to debug the memory usage difference
> >>>>>>>>> between v3 and v4.
> >>>>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
> >>>>>> Don't agree with that. I still think the potential memory usage of v4 is
> >>>>>> a problem and the difference memory usage between v3 and v4 reveals that
> >>>>>> there is some peculiarity in RCU subsystem, because the difference comes
> >>>>>> from the duration of RCU grace period. We need to find out why and fix
> >>>>>> or workaround that.
> >>>>> A tight loop in the kernel can extend RCU grace periods, especially
> >>>>> for kernels built with CONFIG_PREEPTION=n.  Placing things like
> >>>>> cond_resched() in such loops can help.  Of course, if you are in an
> >>>>> RCU read-side critical section (or holding a spinlock), you will need
> >>>>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
> >>>>> state upon re-entry is valid.  After all, having exited either type of
> >>>>> critical section, anything might happen.
> >>>> As said in the help-wanted email just send out, it was weird that the
> >>>> RCU grace period was extended a lot, up to ~150ms or more. But queue a
> >>>> dummy kworker periodically which does nothing will help to reduce the
> >>>> grace period to ~10ms. I have explained the context of the problem in
> >>>> that email. And hope that we could get some help from you and the RCU
> >>>> experts in the community.
> >>> OK, I will bite...  Why do you think this is weird?
> >>>
> >>> RCU depends on context switches, among other things.  If you have a
> >>> long-running in-kernel task, that will naturally extend the grace period.
> >>> Scheduling the empty worker provided the context switch that RCU needed
> >>> to make forward progress.
> >> Because the empty kworker trick also works for
> >> CONFIG_PREEMPT_VOLUNTARY=y, and with CONFIG_PREEMPT_VOLUNTARY=y there
> >> is neither an implicit nor an explicit call to schedule() in the
> >> kernel space of the producer thread. I also don't know how to define
> >> a long-running in-kernel task, because I have checked the latency of
> >> the getpgid() syscall in the producer thread when
> >> CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below, there
> >> are indeed some outliers, but most of the latencies are less than
> >> 1ms, so the producer thread can do a context switch within at most
> >> ~1ms. But before the empty kworker trick the latency of the RCU
> >> callback was about 100ms or more, and after the trick it drops to
> >> ~8ms.
> >>
> >> @hist_us:
> >> [128, 256)            60
> >> |                                                    |
> >> [256, 512)        101364
> >> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> [512, 1K)          17580
> >> |@@@@@@@@@                                           |
> >> [1K, 2K)              16
> >> |                                                    |
> >>
> >> @stat_us: count 119020, average 436, total 51957277
> > Ah!
> >
> > The trick is that if you have a non-nohz_full CPU spending most of its
> > time in kernel space (in your case, due to repeated system calls), with
> > high probability this CPU will appear to RCU as if it were spending all
> > of its time in kernel space.  This probability will increase with the
> > fraction of the time this CPU spends in kernel space.
> >
> > If I am reading your histogram correctly, the overhead of your system
> > call is usually between 256 and 512 microseconds.  If you are invoking
> > that system call in a tight loop, that CPU could easily be spending
> > way more than 99% of its time in the kernel.  Unless you have similar
> > statistics for the average length of your CPU's times in userspace,
> > let's assume exactly 99%.
> >
> > On a non-nohz_full CPU, RCU's only hint that a CPU is executing in
> > userspace occurs when the once-per-jiffy scheduling-clock interrupt
> > is taken from userspace.  If the CPU is spending 99% of its time in
> > the kernel, we can expect only one in 100 interrupts being taken from
> > userspace.  On a HZ=1000 system, that will give you 100 milliseconds
> > right there.
> 
> Thanks for the detailed explanation. It seems my previous
> understanding of RCU was wrong. My original thought was that when a
> syscall returns from kernel space, the process doing the syscall
> would be considered to be in a quiescent state, and so an RCU grace
> period would expire soon afterwards.

On a nohz_full CPU, that does in fact happen, courtesy of the
context-tracking code's ct_user_enter() and ct_user_exit() functions.

So your understanding was not entirely wrong.

But context tracking adds overhead, so it is enabled only on kernels
that require it, namely those running certain types of HPC and/or
real-time workloads.  On other kernels, there is no context tracking
for user/kernel transitions, which (as discussed above) forces RCU to
rely on the scheduling-clock interrupt, with the resulting grace-period
delays.
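
For concreteness, a sketch of what options 2 and 3 quoted below could
look like in a kernel-side update loop; do_one_update() and the batch
size of 64 are hypothetical stand-ins, not code from this series:

#include <linux/sched.h>

/* Hypothetical stand-in for one map update driven by the benchmark. */
static void do_one_update(void) { }

static void update_loop(unsigned long nr)
{
        unsigned long i;

        for (i = 0; i < nr; i++) {
                do_one_update();
                if ((i & 63) == 63) {
                        /* Option 3; option 2 would instead call
                         * rcu_momentary_dyntick_idle() here.
                         */
                        cond_resched();
                }
        }
}

Yielding every 64 updates mirrors the NAPI budget discussed earlier in
the thread and bounds how long the CPU stays in the kernel between
quiescent states.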

							Thanx, Paul

> > But it gets worse.  The variance is quite large.  For example, for a
> > given grace period, there is about a 5% chance that the delay will be
> > 300 milliseconds instead of the expected 100 milliseconds.  (You can
> > easily check this by taking 0.99 to the 300th power.  Or whatever other
> > power you desire.)
> >
> > So the problem is that simple statistics is biting pretty hard here.
> >
> > As you may have noticed, life is like that sometimes.  ;-)
> >
> > 							Thanx, Paul
> >
> >>> So 150 milliseconds is an OK RCU grace period.  A bit long, but not
> >>> excessively so.  In contrast, by default in mainline, RCU starts seriously
> >>> complaining if its grace period is extended beyond 21 *seconds*.  This is
> >>> when the RCU CPU stall warning will appear.  (Yes, some Android configs
> >>> are tuning this down to 20 milliseconds, but that is a special hardware
> >>> and software configuration.)
> >>>
> >>> But if you want shorter RCU grace periods, what can you do?
> >>>
> >>> 1.	Follow Alexei's good advice on one of your early patches.
> >>>
> >>> 2.	As an alternative to scheduling an empty kworker, invoke something
> >>> 	like rcu_momentary_dyntick_idle() periodically.  Note well that
> >>> 	it is illegal to invoke this in an RCU read-side critical section,
> >>> 	with preemption disabled, from idle, ...
> >>>
> >>> 3.	In non-preemptible kernels, cond_resched() is a much lighter
> >>> 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
> >>> 	kernels get the same effect by preempting.  In your case, this
> >>> 	is also true, but it takes 150 milliseconds.)
> >>>
> >>> That should do for a start.  ;-)
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>> Regards,
> >>>> Tao
> >>>>> 							Thanx, Paul
> >>>>>
> >>>>>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
> >>>>>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
> >>>>>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
> >>>>>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> >>>>>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
> >>>>>>>>> make the list contain thousands of elements.
> >>>>>>>> That would be true only if there were no scheduling events in all of 176K ops.
> >>>>>>>> Which is not the case.
> >>>>>>>> I'm not sure why you're saying that RCU GP is 30ms.
> >>>>>> Because these freed elements will be freed after one RCU GP and in v4
> >>>>>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
> >>>>>> these freed elements will be accumulated on the list.
> >>>>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >>>>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
> >>>>>>>> Network and block IO doesn't process 176K packets without resched.
> >>>>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
> >>>>>>>>
> >>>>>>>> For small size buckets low_watermark=32 and high=96.
> >>>>>>>> We typically move 32 elements at a time from one list to another.
> >>>>>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
> >>>>>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
> >>>>>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
> >>>>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>>>>>>>
> >>>>>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
> >>>>>>>>> introduced. Previously, I though it was used to handle the case in which
> >>>>>>>>> allocation and freeing are done on different CPUs.
> >>>>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>>>>>>>
> >>>>>>>>> And as we can see
> >>>>>>>>> from the benchmark result above and in v3, the performance and the
> >>>>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
> >>>>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>>>>>>>
> >>>>>>>> So far the only bench we can trust and analyze is map_perf_test.
> >>>>>>>> Please make bench in patch 2 yield the cpu after few updates.
> >>>>>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
> >>>>>>>> 64 updates is realistic too. A thousand is not.
> >>>>>> Will do that.
> >>>>>>
> >>> .
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2023-06-12  3:40 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-06  3:53 [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator Hou Tao
2023-06-06  3:53 ` [RFC PATCH bpf-next v4 1/3] bpf: Factor out a common helper free_all() Hou Tao
2023-06-06  3:53 ` [RFC PATCH bpf-next v4 2/3] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
2023-06-06 21:13   ` Alexei Starovoitov
2023-06-07  1:32     ` Hou Tao
2023-06-06  3:53 ` [RFC PATCH bpf-next v4 3/3] bpf: Only reuse after one RCU GP in " Hou Tao
2023-06-06 12:30 ` [RFC PATCH bpf-next v4 0/3] Handle immediate reuse " Hou Tao
2023-06-06 21:04   ` Alexei Starovoitov
2023-06-07  1:19     ` Hou Tao
2023-06-07  1:39       ` Alexei Starovoitov
2023-06-07  7:56         ` Hou Tao
2023-06-07  8:42     ` Hou Tao
2023-06-07 17:52       ` Alexei Starovoitov
2023-06-07 20:50         ` Alexei Starovoitov
2023-06-07 23:23           ` Alexei Starovoitov
2023-06-07 23:30             ` Paul E. McKenney
2023-06-07 23:50               ` Alexei Starovoitov
2023-06-08  0:13                 ` Paul E. McKenney
2023-06-08  0:34                   ` Alexei Starovoitov
2023-06-08  1:15                     ` Paul E. McKenney
2023-06-08  3:35                     ` Hou Tao
2023-06-08  4:30                       ` Hou Tao
2023-06-08  4:30                       ` Alexei Starovoitov
2023-06-08  1:57             ` Hou Tao
2023-06-08  1:51           ` Hou Tao
2023-06-08  2:55             ` Paul E. McKenney
2023-06-08  3:43               ` Hou Tao
2023-06-08 16:18                 ` Paul E. McKenney
2023-06-09  3:02                   ` Hou Tao
2023-06-09 14:30                     ` Paul E. McKenney
2023-06-12  2:03                       ` Hou Tao
2023-06-12  3:40                         ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).