rcu.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP
@ 2023-04-08 14:18 Hou Tao
  2023-04-08 14:18 ` [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Hou Tao @ 2023-04-08 14:18 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

As discussed in v1, currently the freed objects in bpf memory allocator
may be reused immediately by the new allocation, it introduces
use-after-bpf-ma-free problem for non-preallocated hash map and makes
lookup procedure return incorrect result. The immediate reuse also makes
introducing new use case more difficult (e.g. qp-trie).

The patch series tries to introduce BPF_MA_REUSE_AFTER_RCU_GP to solve
these problems. For BPF_MA_REUSE_AFTER_GP, the freed objects are reused
only after one RCU grace period and may be freed by bpf memory allocator
after another RCU-tasks-trace grace period. So for bpf programs which
care about reuse problem, these programs can use
bpf_rcu_read_{lock,unlock}() to access these freed objects safely and
for those which doesn't care, there will be safely use-after-bpf-ma-free
because these objects have not been freed by bpf memory allocator.

The current implementation is far from perfect, but I think it is ready
for get some feedbacks before putting in more effort. The implementation
mainly focus on how to speed up the transition from freed elements to
reusable elements and try to reduce the risk of OOM.

To accelerate the transition, it dynamically allocates rcu_head and call
call_rcu() in a kworker to do the transition. The frequency of call_rcu()
invocation could be improved by calling call_rcu() in irq work, but after
did that, I found the RCU grace period increased a lot and I still could
not figure out why. To reduce the risk of OOM, these reusable elements need
to be free as well, but we can not dynamically allocate rcu_head to do
that, because compared with RCU grace period RCU-tasks-trace grace
period is slower, so the freeing of reusable elements is just like the
freeing in normal bpf memory allocator, but these is one difference: for
BPF_MA_REUSE_AFTER_GP bpf ma these freeing elements are still available
for reuse in unit_alloc(). Please see individual patches for more details.

Comments and suggestions are always welcome.

Change Log:
v2:
 * add a benchmark for bpf memory allocator to compare between different
   flavor of bpf memory allocator.
 * implement BPF_MA_REUSE_AFTER_RCU_GP for bpf memory allocator.
v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

Hou Tao (4):
  selftests/bpf: Add benchmark for bpf memory allocator
  bpf: Factor out a common helper free_all()
  bpf: Pass bitwise flags to bpf_mem_alloc_init()
  bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

 include/linux/bpf_mem_alloc.h                 |   9 +-
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/cpumask.c                          |   2 +-
 kernel/bpf/hashtab.c                          |   5 +-
 kernel/bpf/memalloc.c                         | 390 ++++++++++++++++--
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 273 ++++++++++++
 .../selftests/bpf/progs/htab_mem_bench.c      | 145 +++++++
 9 files changed, 785 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

-- 
2.29.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator
  2023-04-08 14:18 [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
@ 2023-04-08 14:18 ` Hou Tao
  2023-04-22  2:59   ` Alexei Starovoitov
  2023-04-08 14:18 ` [RFC bpf-next v2 2/4] bpf: Factor out a common helper free_all() Hou Tao
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2023-04-08 14:18 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

The benchmark could be used to compare the performance of hash map
operations and the memory usage between different flavors of bpf memory
allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It also
could be used to check the performance improvement or the memory saving
of bpf memory allocator optimization and check whether or not a specific
use case is suitable for bpf memory allocator.

The benchmark creates a non-preallocated hash map which uses bpf memory
allocator and shows the operation performance and the memory usage of
the hash map under different use cases:
(1) no_op
Only create the hash map and there is no operations on hash map. It is
used as the baseline. When each CPUs complete the iteartion of
nonoverlapping part of hash map, the loop count is increased.
(2) overwrite
Each CPU overwrites nonoverlapping part of hash map. When each CPU
completes one round of iteration, the loop count is increased.
(3) batch_add_batch_del
Each CPU adds then deletes nonoverlapping part of hash map in batch.
When each CPU completes one round of iteration, the loop count is
increased.
(4) add_del_on_diff_cpu
Each two CPUs add and delete nonoverlapping part of map concurrently.
When each CPU completes one round of iteration, the loop count is
increased.

The following benchmark results show that bpf memory allocator doesn't
handle add_del_on_diff_cpu scenario very well. Because map deletion
always happen on a different CPU than the map addition and the freed
memory can never be reused.

./bench htab-mem --use-case $name --max-entries 16384 \
	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1129       | 1.15                 | 1.15              |
| overwrite           | 24.37      | 2.07                 | 2.97              |
| batch_add_batch_del | 10.58      | 2.91                 | 3.36              |
| add_del_on_diff_cpu | 13.14      | 380.66               | 633.99            |

./bench htab-mem --preallocated --use-case $name --max-entries 16384 \
	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1195       | 2.11                 | 2.16              |
| overwrite           | 34.02      | 1.96                 | 2.00              |
| batch_add_batch_del | 19.25      | 1.96                 | 2.00              |
| add_del_on_diff_cpu | 8.70       | 1.96                 | 2.00              |

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 273 ++++++++++++++++++
 .../selftests/bpf/progs/htab_mem_bench.c      | 145 ++++++++++
 4 files changed, 425 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c02184eaae69..74a45c790d4a 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -647,11 +647,13 @@ $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
 $(OUTPUT)/bench_local_storage_create.o: $(OUTPUT)/bench_local_storage_create.skel.h
 $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
+$(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(TESTING_HELPERS) \
 		 $(TRACE_HELPERS) \
+		 $(CGROUP_HELPERS) \
 		 $(OUTPUT)/bench_count.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
@@ -664,6 +666,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
 		 $(OUTPUT)/bench_bpf_hashmap_lookup.o \
 		 $(OUTPUT)/bench_local_storage_create.o \
+		 $(OUTPUT)/bench_htab_mem.o \
 		 #
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index d9c080ac1796..d3d9ae321b74 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -279,6 +279,7 @@ extern struct argp bench_local_storage_rcu_tasks_trace_argp;
 extern struct argp bench_strncmp_argp;
 extern struct argp bench_hashmap_lookup_argp;
 extern struct argp bench_local_storage_create_argp;
+extern struct argp bench_htab_mem_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -290,6 +291,7 @@ static const struct argp_child bench_parsers[] = {
 		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
 	{ &bench_hashmap_lookup_argp, 0, "Hashmap lookup benchmark", 0 },
 	{ &bench_local_storage_create_argp, 0, "local-storage-create benchmark", 0 },
+	{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
 	{},
 };
 
@@ -518,6 +520,7 @@ extern const struct bench bench_local_storage_cache_hashmap_control;
 extern const struct bench bench_local_storage_tasks_trace;
 extern const struct bench bench_bpf_hashmap_lookup;
 extern const struct bench bench_local_storage_create;
+extern const struct bench bench_htab_mem;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -559,6 +562,7 @@ static const struct bench *benchs[] = {
 	&bench_local_storage_tasks_trace,
 	&bench_bpf_hashmap_lookup,
 	&bench_local_storage_create,
+	&bench_htab_mem,
 };
 
 static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_htab_mem.c b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
new file mode 100644
index 000000000000..116821a2a7dd
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
+#include <argp.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#include "bench.h"
+#include "cgroup_helpers.h"
+#include "htab_mem_bench.skel.h"
+
+static struct htab_mem_ctx {
+	struct htab_mem_bench *skel;
+	int fd;
+} ctx;
+
+static struct htab_mem_args {
+	u32 max_entries;
+	u32 value_size;
+	u32 full;
+	const char *use_case;
+	bool preallocated;
+} args = {
+	.max_entries = 16384,
+	.full = 50,
+	.value_size = 8,
+	.use_case = "overwrite",
+	.preallocated = false,
+};
+
+enum {
+	ARG_MAX_ENTRIES = 10000,
+	ARG_FULL_PERCENT = 10001,
+	ARG_VALUE_SIZE = 10002,
+	ARG_USE_CASE = 10003,
+	ARG_PREALLOCATED = 10004,
+};
+
+static const struct argp_option opts[] = {
+	{ "max-entries", ARG_MAX_ENTRIES, "MAX_ENTRIES", 0,
+	  "Set the max entries of hash map (default 16384)" },
+	{ "full", ARG_FULL_PERCENT, "FULL", 0,
+	  "Set the full percent of hash map (default 50)" },
+	{ "value-size", ARG_VALUE_SIZE, "VALUE_SIZE", 0,
+	  "Set the value size of hash map (default 8)" },
+	{ "use-case", ARG_USE_CASE, "USE_CASE", 0,
+	  "Set the use case of hash map: no_op|overwrite|batch_add_batch_del|add_del_on_diff_cpu" },
+	{ "preallocated", ARG_PREALLOCATED, NULL, 0, "use preallocated hash map" },
+	{},
+};
+
+static error_t htab_mem_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_MAX_ENTRIES:
+		args.max_entries = strtoul(arg, NULL, 10);
+		break;
+	case ARG_FULL_PERCENT:
+		args.full = strtoul(arg, NULL, 10);
+		if (!args.full || args.full > 100) {
+			fprintf(stderr, "invalid full percent %u\n", args.full);
+			argp_usage(state);
+		}
+		break;
+	case ARG_VALUE_SIZE:
+		args.value_size = strtoul(arg, NULL, 10);
+		break;
+	case ARG_USE_CASE:
+		args.use_case = strdup(arg);
+		break;
+	case ARG_PREALLOCATED:
+		args.preallocated = true;
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_htab_mem_argp = {
+	.options = opts,
+	.parser = htab_mem_parse_arg,
+};
+
+static void htab_mem_validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "htab mem benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static int setup_and_join_cgroup(const char *path)
+{
+	int err, fd;
+
+	err = setup_cgroup_environment();
+	if (err) {
+		fprintf(stderr, "setup cgroup env failed\n");
+		return -1;
+	}
+
+	err = create_and_get_cgroup(path);
+	if (err < 0) {
+		fprintf(stderr, "create cgroup %s failed\n", path);
+		goto out;
+	}
+	fd = err;
+
+	err = join_cgroup(path);
+	if (err) {
+		fprintf(stderr, "join cgroup %s failed\n", path);
+		close(fd);
+		goto out;
+	}
+
+	return fd;
+out:
+	cleanup_cgroup_environment();
+	return -1;
+}
+
+static void htab_mem_setup(void)
+{
+	struct bpf_program *prog;
+	struct bpf_map *map;
+	int err;
+
+	setup_libbpf();
+
+	err = setup_and_join_cgroup("/htab_mem");
+	if (err < 0)
+		exit(1);
+	ctx.fd = err;
+
+	ctx.skel = htab_mem_bench__open();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		goto cleanup;
+	}
+
+	map = ctx.skel->maps.htab;
+	bpf_map__set_max_entries(map, args.max_entries);
+	bpf_map__set_value_size(map, args.value_size);
+	if (args.preallocated)
+		bpf_map__set_map_flags(map, bpf_map__map_flags(map) & ~BPF_F_NO_PREALLOC);
+
+	map = ctx.skel->maps.array;
+	bpf_map__set_max_entries(map, args.max_entries);
+	bpf_map__set_value_size(map, args.value_size);
+
+	prog = bpf_object__find_program_by_name(ctx.skel->obj, args.use_case);
+	if (!prog) {
+		fprintf(stderr, "no such use-case: %s\n", args.use_case);
+		fprintf(stderr, "available use case:");
+		bpf_object__for_each_program(prog, ctx.skel->obj)
+			fprintf(stderr, " %s", bpf_program__name(prog));
+		fprintf(stderr, "\n");
+		goto cleanup;
+	}
+	bpf_program__set_autoload(prog, true);
+
+	ctx.skel->bss->nr_thread = env.producer_cnt;
+	ctx.skel->bss->nr_entries = (uint64_t)args.max_entries * args.full / 100;
+
+	err = htab_mem_bench__load(ctx.skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		goto cleanup;
+	}
+	err = htab_mem_bench__attach(ctx.skel);
+	if (err) {
+		fprintf(stderr, "failed to attach skeleton\n");
+		goto cleanup;
+	}
+	return;
+cleanup:
+	close(ctx.fd);
+	cleanup_cgroup_environment();
+	htab_mem_bench__destroy(ctx.skel);
+	exit(1);
+}
+
+static void *htab_mem_producer(void *ctx)
+{
+	while (true)
+		(void)syscall(__NR_getpgid);
+	return NULL;
+}
+
+static void *htab_mem_consumer(void *ctx)
+{
+	return NULL;
+}
+
+static void htab_mem_read_mem_cgrp_file(const char *name, unsigned long *value)
+{
+	char buf[32];
+	int fd;
+
+	fd = openat(ctx.fd, name, O_RDONLY);
+	if (fd < 0) {
+		fprintf(stderr, "no %s\n", name);
+		*value = 0;
+		return;
+	}
+
+	buf[sizeof(buf) - 1] = 0;
+	read(fd, buf, sizeof(buf) - 1);
+	*value = strtoull(buf, NULL, 0);
+
+	close(fd);
+}
+
+static void htab_mem_measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->loop_cnt, 0);
+	htab_mem_read_mem_cgrp_file("memory.current", &res->gp_ct);
+}
+
+static void htab_mem_report_progress(int iter, struct bench_res *res, long delta_ns)
+{
+	double loop, mem;
+
+	loop = res->hits / 1000.0 / (delta_ns / 1000000000.0);
+	mem = res->gp_ct / 1048576.0;
+	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
+	printf("loop %7.2lfk/s, memory usage %7.2lfMiB\n", loop, mem);
+}
+
+static void htab_mem_report_final(struct bench_res res[], int res_cnt)
+{
+	double mem_mean = 0.0, mem_stddev = 0.0;
+	double loop_mean = 0.0, loop_stddev = 0.0;
+	unsigned long peak_mem;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		loop_mean += res[i].hits / 1000.0 / (0.0 + res_cnt);
+		mem_mean += res[i].gp_ct / 1048576.0 / (0.0 + res_cnt);
+	}
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++) {
+			loop_stddev += (loop_mean - res[i].hits / 1000.0) *
+				       (loop_mean - res[i].hits / 1000.0) /
+				       (res_cnt - 1.0);
+			mem_stddev += (mem_mean - res[i].gp_ct / 1048576.0) *
+				      (mem_mean - res[i].gp_ct / 1048576.0) /
+				      (res_cnt - 1.0);
+		}
+		loop_stddev = sqrt(loop_stddev);
+		mem_stddev = sqrt(mem_stddev);
+	}
+
+	htab_mem_read_mem_cgrp_file("memory.peak", &peak_mem);
+	printf("Summary: loop %7.2lf \u00B1 %7.2lfk/s, memory usage %7.2lf \u00B1 %7.2lfMiB, "
+	       "peak memory usage %7.2lfMiB\n",
+	       loop_mean, loop_stddev, mem_mean, mem_stddev, peak_mem / 1048576.0);
+}
+
+const struct bench bench_htab_mem = {
+	.name = "htab-mem",
+	.argp = &bench_htab_mem_argp,
+	.validate = htab_mem_validate,
+	.setup = htab_mem_setup,
+	.producer_thread = htab_mem_producer,
+	.consumer_thread = htab_mem_consumer,
+	.measure = htab_mem_measure,
+	.report_progress = htab_mem_report_progress,
+	.report_final = htab_mem_report_final,
+};
diff --git a/tools/testing/selftests/bpf/progs/htab_mem_bench.c b/tools/testing/selftests/bpf/progs/htab_mem_bench.c
new file mode 100644
index 000000000000..f320cb3a11e8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/htab_mem_bench.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
+#include <stdbool.h>
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+struct update_ctx {
+	unsigned int from;
+	unsigned int step;
+	unsigned int max;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} array SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
+
+unsigned int nr_entries = 0;
+unsigned int nr_thread = 0;
+long loop_cnt = 0;
+
+static int noop_htab(unsigned int i, struct update_ctx *ctx)
+{
+	if (ctx->from >= ctx->max)
+		return 1;
+
+	ctx->from += ctx->step;
+	return 0;
+}
+
+static int write_htab(unsigned int i, struct update_ctx *ctx, unsigned int flags)
+{
+	__u64 *value;
+
+	if (ctx->from >= ctx->max)
+		return 1;
+
+	value = bpf_map_lookup_elem(&array, &ctx->from);
+	if (value)
+		bpf_map_update_elem(&htab, &ctx->from, value, flags);
+	ctx->from += ctx->step;
+
+	return 0;
+}
+
+static int overwrite_htab(unsigned int i, struct update_ctx *ctx)
+{
+	return write_htab(i, ctx, 0);
+}
+
+static int newwrite_htab(unsigned int i, struct update_ctx *ctx)
+{
+	return write_htab(i, ctx, BPF_NOEXIST);
+}
+
+static int del_htab(unsigned int i, struct update_ctx *ctx)
+{
+	__u64 *value;
+
+	if (ctx->from >= ctx->max)
+		return 1;
+
+	value = bpf_map_lookup_elem(&array, &ctx->from);
+	if (value)
+		bpf_map_delete_elem(&htab, &ctx->from);
+	ctx->from += ctx->step;
+
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int no_op(void *ctx)
+{
+	struct update_ctx update;
+
+	update.from = bpf_get_smp_processor_id();
+	update.step = nr_thread;
+	update.max = nr_entries;
+	bpf_loop(update.max, noop_htab, &update, 0);
+	__sync_fetch_and_add(&loop_cnt, 1);
+
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int overwrite(void *ctx)
+{
+	struct update_ctx update;
+
+	update.from = bpf_get_smp_processor_id();
+	update.step = nr_thread;
+	update.max = nr_entries;
+	bpf_loop(update.max, overwrite_htab, &update, 0);
+
+	__sync_fetch_and_add(&loop_cnt, 1);
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int batch_add_batch_del(void *ctx)
+{
+	struct update_ctx update;
+
+	update.from = bpf_get_smp_processor_id();
+	update.step = nr_thread;
+	update.max = nr_entries;
+	bpf_loop(update.max, overwrite_htab, &update, 0);
+
+	update.from = bpf_get_smp_processor_id();
+	bpf_loop(update.max, del_htab, &update, 0);
+
+	__sync_fetch_and_add(&loop_cnt, 1);
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int add_del_on_diff_cpu(void *ctx)
+{
+	struct update_ctx update;
+	unsigned int from;
+
+	from = bpf_get_smp_processor_id();
+	update.from = from / 2;
+	update.step = nr_thread / 2;
+	update.max = nr_entries;
+
+	if (from & 1)
+		bpf_loop(update.max, newwrite_htab, &update, 0);
+	else
+		bpf_loop(update.max, del_htab, &update, 0);
+
+	__sync_fetch_and_add(&loop_cnt, 1);
+	return 0;
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC bpf-next v2 2/4] bpf: Factor out a common helper free_all()
  2023-04-08 14:18 [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
  2023-04-08 14:18 ` [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
@ 2023-04-08 14:18 ` Hou Tao
  2023-04-08 14:18 ` [RFC bpf-next v2 3/4] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-04-08 14:18 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Factor out a common helper free_all() to free all normal elements or
per-cpu elements on a lock-less list.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 410637c225fb..0668bcd7c926 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -211,9 +211,9 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	mem_cgroup_put(memcg);
 }
 
-static void free_one(struct bpf_mem_cache *c, void *obj)
+static void free_one(void *obj, bool percpu)
 {
-	if (c->percpu_size) {
+	if (percpu) {
 		free_percpu(((void **)obj)[1]);
 		kfree(obj);
 		return;
@@ -222,14 +222,19 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
 	kfree(obj);
 }
 
-static void __free_rcu(struct rcu_head *head)
+static void free_all(struct llist_node *llnode, bool percpu)
 {
-	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
-	struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
 	struct llist_node *pos, *t;
 
 	llist_for_each_safe(pos, t, llnode)
-		free_one(c, pos);
+		free_one(pos, percpu);
+}
+
+static void __free_rcu(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+
+	free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
 	atomic_set(&c->call_rcu_in_progress, 0);
 }
 
@@ -432,7 +437,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 
 static void drain_mem_cache(struct bpf_mem_cache *c)
 {
-	struct llist_node *llnode, *t;
+	bool percpu = !!c->percpu_size;
 
 	/* No progs are using this bpf_mem_cache, but htab_map_free() called
 	 * bpf_mem_cache_free() for all remaining elements and they can be in
@@ -441,14 +446,10 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	 * Except for waiting_for_gp list, there are no concurrent operations
 	 * on these lists, so it is safe to use __llist_del_all().
 	 */
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
-		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
-		free_one(c, llnode);
+	free_all(__llist_del_all(&c->free_by_rcu), percpu);
+	free_all(llist_del_all(&c->waiting_for_gp), percpu);
+	free_all(__llist_del_all(&c->free_llist), percpu);
+	free_all(__llist_del_all(&c->free_llist_extra), percpu);
 }
 
 static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC bpf-next v2 3/4] bpf: Pass bitwise flags to bpf_mem_alloc_init()
  2023-04-08 14:18 [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
  2023-04-08 14:18 ` [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
  2023-04-08 14:18 ` [RFC bpf-next v2 2/4] bpf: Factor out a common helper free_all() Hou Tao
@ 2023-04-08 14:18 ` Hou Tao
  2023-04-08 14:18 ` [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
  2023-04-21  6:23 ` [RFC bpf-next v2 0/4] " Hou Tao
  4 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-04-08 14:18 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Extend a boolean argument to a bitwise flags argument for
bpf_mem_alloc_init(), so more new flags can be added later.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_mem_alloc.h | 8 +++++++-
 kernel/bpf/core.c             | 2 +-
 kernel/bpf/cpumask.c          | 2 +-
 kernel/bpf/hashtab.c          | 5 +++--
 kernel/bpf/memalloc.c         | 8 +++++++-
 5 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 3929be5743f4..148347950e16 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -12,6 +12,12 @@ struct bpf_mem_alloc {
 	struct bpf_mem_caches __percpu *caches;
 	struct bpf_mem_cache __percpu *cache;
 	struct work_struct work;
+	unsigned int flags;
+};
+
+/* flags for bpf_mem_alloc_init() */
+enum {
+	BPF_MA_PERCPU = 1U << 0,
 };
 
 /* 'size != 0' is for bpf_mem_alloc which manages fixed-size objects.
@@ -21,7 +27,7 @@ struct bpf_mem_alloc {
  * Alloc and free are done with bpf_mem_{alloc,free}() and the size of
  * the returned object is given by the size argument of bpf_mem_alloc().
  */
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags);
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index b297e9f60ca1..fb4275afd1ad 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2762,7 +2762,7 @@ static int __init bpf_global_ma_init(void)
 {
 	int ret;
 
-	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
+	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, 0);
 	bpf_global_ma_set = !ret;
 	return ret;
 }
diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
index 7efdf5d770ca..f40636796f75 100644
--- a/kernel/bpf/cpumask.c
+++ b/kernel/bpf/cpumask.c
@@ -445,7 +445,7 @@ static int __init cpumask_kfunc_init(void)
 		},
 	};
 
-	ret = bpf_mem_alloc_init(&bpf_cpumask_ma, sizeof(struct bpf_cpumask), false);
+	ret = bpf_mem_alloc_init(&bpf_cpumask_ma, sizeof(struct bpf_cpumask), 0);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &cpumask_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &cpumask_kfunc_set);
 	return  ret ?: register_btf_id_dtor_kfuncs(cpumask_dtors,
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 00c253b84bf5..93009b94ac9b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -576,12 +576,13 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 				goto free_prealloc;
 		}
 	} else {
-		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
+		err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, 0);
 		if (err)
 			goto free_map_locked;
 		if (percpu) {
 			err = bpf_mem_alloc_init(&htab->pcpu_ma,
-						 round_up(htab->map.value_size, 8), true);
+						 round_up(htab->map.value_size, 8),
+						 BPF_MA_PERCPU);
 			if (err)
 				goto free_map_locked;
 		}
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 0668bcd7c926..072102476019 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -98,6 +98,7 @@ struct bpf_mem_cache {
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
 	int percpu_size;
+	unsigned int flags;
 
 	struct rcu_head rcu;
 	struct llist_head free_by_rcu;
@@ -377,13 +378,14 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
  * kmalloc/kfree. Max allocation size is 4096 in this case.
  * This is bpf_dynptr and bpf_kptr use case.
  */
-int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
+int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
 {
 	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
 	struct obj_cgroup *objcg = NULL;
 	int cpu, i, unit_size, percpu_size = 0;
+	bool percpu = (flags & BPF_MA_PERCPU);
 
 	if (size) {
 		pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL);
@@ -406,9 +408,11 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 			c->unit_size = unit_size;
 			c->objcg = objcg;
 			c->percpu_size = percpu_size;
+			c->flags = flags;
 			prefill_mem_cache(c, cpu);
 		}
 		ma->cache = pc;
+		ma->flags = flags;
 		return 0;
 	}
 
@@ -428,10 +432,12 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 			c = &cc->cache[i];
 			c->unit_size = sizes[i];
 			c->objcg = objcg;
+			c->flags = flags;
 			prefill_mem_cache(c, cpu);
 		}
 	}
 	ma->caches = pcc;
+	ma->flags = flags;
 	return 0;
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
  2023-04-08 14:18 [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
                   ` (2 preceding siblings ...)
  2023-04-08 14:18 ` [RFC bpf-next v2 3/4] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
@ 2023-04-08 14:18 ` Hou Tao
  2023-04-22  3:12   ` Alexei Starovoitov
  2023-04-21  6:23 ` [RFC bpf-next v2 0/4] " Hou Tao
  4 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2023-04-08 14:18 UTC (permalink / raw)
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Currently the freed objects in bpf memory allocator may be reused
immediately by new allocation, it introduces use-after-bpf-ma-free
problem for non-preallocated hash map and makes lookup procedure
return incorrect result. The immediate reuse also makes introducing
new use case more difficult (e.g. qp-trie).

So introduce BPF_MA_REUSE_AFTER_RCU_GP to solve these problems. For
BPF_MA_REUSE_AFTER_GP, the freed objects are reused only after one RCU
grace period and may be returned back to slab system after another
RCU-tasks-trace grace period. So for bpf programs which care about reuse
problem, these programs can use bpf_rcu_read_{lock,unlock}() to access
these freed objects safely and for those which doesn't care, there will
be safely use-after-bpf-ma-free because these objects have not been
freed by bpf memory allocator.

To make these freed elements being reusab quickly, BPF_MA_REUSE_AFTER_GP
dynamically allocates memory to create many inflight RCU callbacks to
mark these freed element being reusable. These memories used for
bpf_reuse_batch will be freed when these RCU callbacks complete. When no
memory is available, synchronize_rcu_expedited() will be used to make
these freed element reusable. In order to reduce the risk of OOM, part
of these reusable memory will be freed through RCU-tasks-trace grace
period. Before these freeing memories are freed, these memories are also
available for reuse.

The following are the benchmark results when comparing between different
flavors of bpf memory allocator. These results show:
* The performance of reuse-after-rcu-gp bpf ma is good than no bpf ma.
  Its memory usage is also good than no bpf ma except for
  add_del_on_diff_cpu case.
* The memory usage of reuse-after-rcu-gp bpf ma increases a lot compared
  with normal bpf ma.
* The memory usage of free-after-rcu-gp bpf ma is better than
  reuse-after-rcu-gp bpf ma, but its performance is bad than
  reuse-after-ruc-gp because it doesn't do reuse.

(1) no bpf memory allocator (v6.0.19)
| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1187       | 1.05                 | 1.05              |
| overwrite           | 3.74       | 32.52                | 84.18             |
| batch_add_batch_del | 2.23       | 26.38                | 48.75             |
| add_del_on_diff_cpu | 3.92       | 33.72                | 48.96             |

(2) normal bpf memory allocator
| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1187       | 0.96                 | 1.00              |
| overwrite           | 27.12      | 2.5                  | 2.99              |
| batch_add_batch_del | 8.9        | 2.77                 | 3.24              |
| add_del_on_diff_cpu | 11.30      | 218.54               | 440.37            |

(3) reuse-after-rcu-gp bpf memory allocator
| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1276       | 0.96                 | 1.00              |
| overwrite           | 15.66      | 25.00                | 33.07             |
| batch_add_batch_del | 10.32      | 18.84                | 22.64             |
| add_del_on_diff_cpu | 13.00      | 550.50               | 748.74            |

(4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1263       | 0.96                 | 1.00              |
| overwrite           | 10.73      | 12.33                | 20.32             |
| batch_add_batch_del | 7.02       | 9.45                 | 14.07             |
| add_del_on_diff_cpu | 8.99       | 131.64               | 204.42            |

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_mem_alloc.h |   1 +
 kernel/bpf/memalloc.c         | 353 +++++++++++++++++++++++++++++++---
 2 files changed, 326 insertions(+), 28 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 148347950e16..e7f68432713b 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -18,6 +18,7 @@ struct bpf_mem_alloc {
 /* flags for bpf_mem_alloc_init() */
 enum {
 	BPF_MA_PERCPU = 1U << 0,
+	BPF_MA_REUSE_AFTER_RCU_GP = 1U << 1,
 };
 
 /* 'size != 0' is for bpf_mem_alloc which manages fixed-size objects.
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 072102476019..262100f89610 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -63,6 +63,10 @@ static u8 size_index[24] __ro_after_init = {
 	2	/* 192 */
 };
 
+static struct workqueue_struct *bpf_ma_wq;
+
+static void bpf_ma_prepare_reuse_work(struct work_struct *work);
+
 static int bpf_mem_cache_idx(size_t size)
 {
 	if (!size || size > 4096)
@@ -98,18 +102,36 @@ struct bpf_mem_cache {
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
 	int percpu_size;
+	int cpu;
 	unsigned int flags;
 
+	raw_spinlock_t reuse_lock;
+	bool abort_reuse;
+	struct llist_head reuse_ready_head;
+	struct llist_node *reuse_ready_tail;
+	struct llist_head wait_for_free;
+	struct llist_head prepare_reuse_head;
+	struct llist_node *prepare_reuse_tail;
+	unsigned int prepare_reuse_cnt;
+	atomic_t reuse_cb_in_progress;
+	struct work_struct reuse_work;
+
 	struct rcu_head rcu;
 	struct llist_head free_by_rcu;
 	struct llist_head waiting_for_gp;
-	atomic_t call_rcu_in_progress;
+	atomic_t free_cb_in_progress;
 };
 
 struct bpf_mem_caches {
 	struct bpf_mem_cache cache[NUM_CACHES];
 };
 
+struct bpf_reuse_batch {
+	struct bpf_mem_cache *c;
+	struct llist_node *head, *tail;
+	struct rcu_head rcu;
+};
+
 static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 {
 	struct llist_node *entry, *next;
@@ -154,6 +176,45 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
 #endif
 }
 
+static void *bpf_ma_get_reusable_obj(struct bpf_mem_cache *c)
+{
+	if (c->flags & BPF_MA_REUSE_AFTER_RCU_GP) {
+		unsigned long flags;
+		void *obj;
+
+		if (llist_empty(&c->reuse_ready_head) && llist_empty(&c->wait_for_free))
+			return NULL;
+
+		/* reuse_ready_head and wait_for_free may be manipulated by
+		 * kworker and RCU callbacks.
+		 */
+		raw_spin_lock_irqsave(&c->reuse_lock, flags);
+		obj = __llist_del_first(&c->reuse_ready_head);
+		if (obj) {
+			if (llist_empty(&c->reuse_ready_head))
+				c->reuse_ready_tail = NULL;
+		} else {
+			obj = __llist_del_first(&c->wait_for_free);
+		}
+		raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
+		return obj;
+	}
+
+	/*
+	 * free_by_rcu is only manipulated by irq work refill_work().
+	 * IRQ works on the same CPU are called sequentially, so it is
+	 * safe to use __llist_del_first() here. If alloc_bulk() is
+	 * invoked by the initial prefill, there will be no running
+	 * refill_work(), so __llist_del_first() is fine as well.
+	 *
+	 * In most cases, objects on free_by_rcu are from the same CPU.
+	 * If some objects come from other CPUs, it doesn't incur any
+	 * harm because NUMA_NO_NODE means the preference for current
+	 * numa node and it is not a guarantee.
+	 */
+	return __llist_del_first(&c->free_by_rcu);
+}
+
 /* Mostly runs from irq_work except __init phase. */
 static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 {
@@ -165,19 +226,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (i = 0; i < cnt; i++) {
-		/*
-		 * free_by_rcu is only manipulated by irq work refill_work().
-		 * IRQ works on the same CPU are called sequentially, so it is
-		 * safe to use __llist_del_first() here. If alloc_bulk() is
-		 * invoked by the initial prefill, there will be no running
-		 * refill_work(), so __llist_del_first() is fine as well.
-		 *
-		 * In most cases, objects on free_by_rcu are from the same CPU.
-		 * If some objects come from other CPUs, it doesn't incur any
-		 * harm because NUMA_NO_NODE means the preference for current
-		 * numa node and it is not a guarantee.
-		 */
-		obj = __llist_del_first(&c->free_by_rcu);
+		obj = bpf_ma_get_reusable_obj(c);
 		if (!obj) {
 			/* Allocate, but don't deplete atomic reserves that typical
 			 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
@@ -236,7 +285,7 @@ static void __free_rcu(struct rcu_head *head)
 	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
 
 	free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
-	atomic_set(&c->call_rcu_in_progress, 0);
+	atomic_set(&c->free_cb_in_progress, 0);
 }
 
 static void __free_rcu_tasks_trace(struct rcu_head *head)
@@ -264,7 +313,7 @@ static void do_call_rcu(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode, *t;
 
-	if (atomic_xchg(&c->call_rcu_in_progress, 1))
+	if (atomic_xchg(&c->free_cb_in_progress, 1))
 		return;
 
 	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
@@ -409,6 +458,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
 			c->objcg = objcg;
 			c->percpu_size = percpu_size;
 			c->flags = flags;
+			c->cpu = cpu;
+			INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
 			prefill_mem_cache(c, cpu);
 		}
 		ma->cache = pc;
@@ -433,6 +484,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
 			c->unit_size = sizes[i];
 			c->objcg = objcg;
 			c->flags = flags;
+			c->cpu = cpu;
+			INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
 			prefill_mem_cache(c, cpu);
 		}
 	}
@@ -444,18 +497,40 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
 static void drain_mem_cache(struct bpf_mem_cache *c)
 {
 	bool percpu = !!c->percpu_size;
+	struct llist_node *head[3];
+	unsigned long flags;
 
 	/* No progs are using this bpf_mem_cache, but htab_map_free() called
 	 * bpf_mem_cache_free() for all remaining elements and they can be in
 	 * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
 	 *
-	 * Except for waiting_for_gp list, there are no concurrent operations
-	 * on these lists, so it is safe to use __llist_del_all().
+	 * Except for waiting_for_gp and free_llist_extra list, there are no
+	 * concurrent operations on these lists, so it is safe to use
+	 * __llist_del_all().
 	 */
 	free_all(__llist_del_all(&c->free_by_rcu), percpu);
 	free_all(llist_del_all(&c->waiting_for_gp), percpu);
 	free_all(__llist_del_all(&c->free_llist), percpu);
-	free_all(__llist_del_all(&c->free_llist_extra), percpu);
+	free_all(llist_del_all(&c->free_llist_extra), percpu);
+
+	if (!(c->flags & BPF_MA_REUSE_AFTER_RCU_GP))
+		return;
+
+	raw_spin_lock_irqsave(&c->reuse_lock, flags);
+	/* Indicate kworker and RCU callback to free elements directly
+	 * instead of adding new elements into these lists.
+	 */
+	c->abort_reuse = true;
+	head[0] = __llist_del_all(&c->prepare_reuse_head);
+	c->prepare_reuse_tail = NULL;
+	head[1] = __llist_del_all(&c->reuse_ready_head);
+	c->reuse_ready_tail = NULL;
+	head[2] = __llist_del_all(&c->wait_for_free);
+	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
+
+	free_all(head[0], percpu);
+	free_all(head[1], percpu);
+	free_all(head[2], percpu);
 }
 
 static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
@@ -466,10 +541,39 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
 	ma->caches = NULL;
 }
 
+static void bpf_ma_cancel_reuse_work(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches *cc;
+	struct bpf_mem_cache *c;
+	int cpu, i;
+
+	if (ma->cache) {
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(ma->cache, cpu);
+			cancel_work_sync(&c->reuse_work);
+		}
+	}
+	if (ma->caches) {
+		for_each_possible_cpu(cpu) {
+			cc = per_cpu_ptr(ma->caches, cpu);
+			for (i = 0; i < NUM_CACHES; i++) {
+				c = &cc->cache[i];
+				cancel_work_sync(&c->reuse_work);
+			}
+		}
+	}
+}
+
 static void free_mem_alloc(struct bpf_mem_alloc *ma)
 {
-	/* waiting_for_gp lists was drained, but __free_rcu might
-	 * still execute. Wait for it now before we freeing percpu caches.
+	bool reuse_after_rcu_gp = ma->flags & BPF_MA_REUSE_AFTER_RCU_GP;
+
+	/* Cancel the inflight kworkers */
+	if (reuse_after_rcu_gp)
+		bpf_ma_cancel_reuse_work(ma);
+
+	/* For normal bpf ma, waiting_for_gp lists was drained, but __free_rcu
+	 * might still execute. Wait for it now before we freeing percpu caches.
 	 *
 	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
 	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
@@ -477,9 +581,13 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
 	 * so if call_rcu(head, __free_rcu) is skipped due to
 	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
 	 * using rcu_trace_implies_rcu_gp() as well.
+	 *
+	 * For reuse-after-rcu-gp bpf ma, use rcu_barrier_tasks_trace() to
+	 * wait for the pending bpf_ma_free_reusable_cb() and use rcu_barrier()
+	 * to wait for the pending bpf_ma_reuse_cb().
 	 */
 	rcu_barrier_tasks_trace();
-	if (!rcu_trace_implies_rcu_gp())
+	if (reuse_after_rcu_gp || !rcu_trace_implies_rcu_gp())
 		rcu_barrier();
 	free_mem_alloc_no_barrier(ma);
 }
@@ -512,6 +620,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
 	}
 
 	/* Defer barriers into worker to let the rest of map memory to be freed */
+	copy->flags = ma->flags;
 	copy->cache = ma->cache;
 	ma->cache = NULL;
 	copy->caches = ma->caches;
@@ -541,7 +650,9 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			 */
 			irq_work_sync(&c->refill_work);
 			drain_mem_cache(c);
-			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
+			rcu_in_progress += atomic_read(&c->free_cb_in_progress);
+			/* Pending kworkers or RCU callbacks */
+			rcu_in_progress += atomic_read(&c->reuse_cb_in_progress);
 		}
 		/* objcg is the same across cpus */
 		if (c->objcg)
@@ -556,7 +667,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 				c = &cc->cache[i];
 				irq_work_sync(&c->refill_work);
 				drain_mem_cache(c);
-				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
+				rcu_in_progress += atomic_read(&c->free_cb_in_progress);
+				rcu_in_progress += atomic_read(&c->reuse_cb_in_progress);
 			}
 		}
 		if (c->objcg)
@@ -600,18 +712,183 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	return llnode;
 }
 
+static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c, struct llist_node *head,
+					      struct llist_node *tail)
+{
+	unsigned long flags;
+	bool abort;
+
+	raw_spin_lock_irqsave(&c->reuse_lock, flags);
+	abort = c->abort_reuse;
+	if (!abort) {
+		if (llist_empty(&c->reuse_ready_head))
+			c->reuse_ready_tail = tail;
+		__llist_add_batch(head, tail, &c->reuse_ready_head);
+	}
+	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
+
+	/* Don't move these objects to reuse_ready list and free
+	 * these objects directly.
+	 */
+	if (abort)
+		free_all(head, !!c->percpu_size);
+}
+
+static void bpf_ma_reuse_cb(struct rcu_head *rcu)
+{
+	struct bpf_reuse_batch *batch = container_of(rcu, struct bpf_reuse_batch, rcu);
+	struct bpf_mem_cache *c = batch->c;
+
+	bpf_ma_add_to_reuse_ready_or_free(c, batch->head, batch->tail);
+	atomic_dec(&c->reuse_cb_in_progress);
+	kfree(batch);
+}
+
+static bool bpf_ma_try_free_reuse_objs(struct bpf_mem_cache *c)
+{
+	struct llist_node *head, *tail;
+	bool do_free;
+
+	if (llist_empty(&c->reuse_ready_head))
+		return false;
+
+	do_free = !atomic_xchg(&c->free_cb_in_progress, 1);
+	if (!do_free)
+		return false;
+
+	head = __llist_del_all(&c->reuse_ready_head);
+	tail = c->reuse_ready_tail;
+	c->reuse_ready_tail = NULL;
+
+	__llist_add_batch(head, tail, &c->wait_for_free);
+
+	return true;
+}
+
+static void bpf_ma_free_reusable_cb(struct rcu_head *rcu)
+{
+	struct bpf_mem_cache *c = container_of(rcu, struct bpf_mem_cache, rcu);
+	struct llist_node *head;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&c->reuse_lock, flags);
+	head = __llist_del_all(&c->wait_for_free);
+	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
+
+	free_all(head, !!c->percpu_size);
+	atomic_set(&c->free_cb_in_progress, 0);
+}
+
+static void bpf_ma_prepare_reuse_work(struct work_struct *work)
+{
+	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, reuse_work);
+	struct llist_node *head, *tail, *llnode, *tmp;
+	struct bpf_reuse_batch *batch;
+	unsigned long flags;
+	bool do_free;
+
+	local_irq_save(flags);
+	/* When CPU is offline, the running CPU may be different with
+	 * the CPU which submitted the work. When these two CPUs are the same,
+	 * kworker may be interrupted by NMI, so increase active to protect
+	 * again such concurrency.
+	 */
+	if (c->cpu == smp_processor_id())
+		WARN_ON_ONCE(local_inc_return(&c->active) != 1);
+	raw_spin_lock(&c->reuse_lock);
+	head = __llist_del_all(&c->prepare_reuse_head);
+	tail = c->prepare_reuse_tail;
+	c->prepare_reuse_tail = NULL;
+	c->prepare_reuse_cnt = 0;
+	if (c->cpu == smp_processor_id())
+		local_dec(&c->active);
+
+	/* Try to free elements in reusable list. Before these elements are
+	 * freed in RCU cb, these element will still be available for reuse.
+	 */
+	do_free = bpf_ma_try_free_reuse_objs(c);
+	raw_spin_unlock(&c->reuse_lock);
+	local_irq_restore(flags);
+
+	if (do_free)
+		call_rcu_tasks_trace(&c->rcu, bpf_ma_free_reusable_cb);
+
+	llist_for_each_safe(llnode, tmp, llist_del_all(&c->free_llist_extra)) {
+		if (!head)
+			tail = llnode;
+		llnode->next = head;
+		head = llnode->next;
+	}
+	/* Draining is in progress ? */
+	if (!head) {
+		/* kworker completes and no RCU callback */
+		atomic_dec(&c->reuse_cb_in_progress);
+		return;
+	}
+
+	batch = kmalloc(sizeof(*batch), GFP_KERNEL);
+	if (!batch) {
+		synchronize_rcu_expedited();
+		bpf_ma_add_to_reuse_ready_or_free(c, head, tail);
+		/* kworker completes and no RCU callback */
+		atomic_dec(&c->reuse_cb_in_progress);
+		return;
+	}
+
+	batch->c = c;
+	batch->head = head;
+	batch->tail = tail;
+	call_rcu(&batch->rcu, bpf_ma_reuse_cb);
+}
+
+static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* In case a NMI-context bpf program is also freeing object. */
+	if (local_inc_return(&c->active) == 1) {
+		bool try_queue_work = false;
+
+		/* kworker may remove elements from prepare_reuse_head */
+		raw_spin_lock(&c->reuse_lock);
+		if (llist_empty(&c->prepare_reuse_head))
+			c->prepare_reuse_tail = llnode;
+		__llist_add(llnode, &c->prepare_reuse_head);
+		if (++c->prepare_reuse_cnt > c->high_watermark) {
+			/* Zero out prepare_reuse_cnt early to prevent
+			 * unnecessary queue_work().
+			 */
+			c->prepare_reuse_cnt = 0;
+			try_queue_work = true;
+		}
+		raw_spin_unlock(&c->reuse_lock);
+
+		if (try_queue_work && !work_pending(&c->reuse_work)) {
+			/* Use reuse_cb_in_progress to indicate there is
+			 * inflight reuse kworker or reuse RCU callback.
+			 */
+			atomic_inc(&c->reuse_cb_in_progress);
+			/* Already queued */
+			if (!queue_work(bpf_ma_wq, &c->reuse_work))
+				atomic_dec(&c->reuse_cb_in_progress);
+		}
+	} else {
+		llist_add(llnode, &c->free_llist_extra);
+	}
+	local_dec(&c->active);
+	local_irq_restore(flags);
+}
+
 /* Though 'ptr' object could have been allocated on a different cpu
  * add it to the free_llist of the current cpu.
  * Let kfree() logic deal with it when it's later called from irq_work.
  */
-static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
+static void notrace immediate_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
 {
-	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
 	unsigned long flags;
 	int cnt = 0;
 
-	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
-
 	local_irq_save(flags);
 	if (local_inc_return(&c->active) == 1) {
 		__llist_add(llnode, &c->free_llist);
@@ -633,6 +910,18 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 		irq_work_raise(c);
 }
 
+static inline void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
+{
+	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
+
+	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
+
+	if (c->flags & BPF_MA_REUSE_AFTER_RCU_GP)
+		wait_gp_reuse_free(c, llnode);
+	else
+		immediate_reuse_free(c, llnode);
+}
+
 /* Called from BPF program or from sys_bpf syscall.
  * In both cases migration is disabled.
  */
@@ -724,3 +1013,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
 
 	return !ret ? NULL : ret + LLIST_NODE_SZ;
 }
+
+static int __init bpf_ma_init(void)
+{
+	bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
+	BUG_ON(!bpf_ma_wq);
+	return 0;
+}
+late_initcall(bpf_ma_init);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP
  2023-04-08 14:18 [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
                   ` (3 preceding siblings ...)
  2023-04-08 14:18 ` [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
@ 2023-04-21  6:23 ` Hou Tao
  4 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-04-21  6:23 UTC (permalink / raw)
  To: Hou Tao, bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu

ping ?

On 4/8/2023 10:18 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> As discussed in v1, currently the freed objects in bpf memory allocator
> may be reused immediately by the new allocation, it introduces
> use-after-bpf-ma-free problem for non-preallocated hash map and makes
> lookup procedure return incorrect result. The immediate reuse also makes
> introducing new use case more difficult (e.g. qp-trie).
>
> The patch series tries to introduce BPF_MA_REUSE_AFTER_RCU_GP to solve
> these problems. For BPF_MA_REUSE_AFTER_GP, the freed objects are reused
> only after one RCU grace period and may be freed by bpf memory allocator
> after another RCU-tasks-trace grace period. So for bpf programs which
> care about reuse problem, these programs can use
> bpf_rcu_read_{lock,unlock}() to access these freed objects safely and
> for those which doesn't care, there will be safely use-after-bpf-ma-free
> because these objects have not been freed by bpf memory allocator.
>
> The current implementation is far from perfect, but I think it is ready
> for get some feedbacks before putting in more effort. The implementation
> mainly focus on how to speed up the transition from freed elements to
> reusable elements and try to reduce the risk of OOM.
>
> To accelerate the transition, it dynamically allocates rcu_head and call
> call_rcu() in a kworker to do the transition. The frequency of call_rcu()
> invocation could be improved by calling call_rcu() in irq work, but after
> did that, I found the RCU grace period increased a lot and I still could
> not figure out why. To reduce the risk of OOM, these reusable elements need
> to be free as well, but we can not dynamically allocate rcu_head to do
> that, because compared with RCU grace period RCU-tasks-trace grace
> period is slower, so the freeing of reusable elements is just like the
> freeing in normal bpf memory allocator, but these is one difference: for
> BPF_MA_REUSE_AFTER_GP bpf ma these freeing elements are still available
> for reuse in unit_alloc(). Please see individual patches for more details.
>
> Comments and suggestions are always welcome.
>
> Change Log:
> v2:
>  * add a benchmark for bpf memory allocator to compare between different
>    flavor of bpf memory allocator.
>  * implement BPF_MA_REUSE_AFTER_RCU_GP for bpf memory allocator.
> v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>
> Hou Tao (4):
>   selftests/bpf: Add benchmark for bpf memory allocator
>   bpf: Factor out a common helper free_all()
>   bpf: Pass bitwise flags to bpf_mem_alloc_init()
>   bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
>
>  include/linux/bpf_mem_alloc.h                 |   9 +-
>  kernel/bpf/core.c                             |   2 +-
>  kernel/bpf/cpumask.c                          |   2 +-
>  kernel/bpf/hashtab.c                          |   5 +-
>  kernel/bpf/memalloc.c                         | 390 ++++++++++++++++--
>  tools/testing/selftests/bpf/Makefile          |   3 +
>  tools/testing/selftests/bpf/bench.c           |   4 +
>  .../selftests/bpf/benchs/bench_htab_mem.c     | 273 ++++++++++++
>  .../selftests/bpf/progs/htab_mem_bench.c      | 145 +++++++
>  9 files changed, 785 insertions(+), 48 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
>  create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator
  2023-04-08 14:18 ` [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
@ 2023-04-22  2:59   ` Alexei Starovoitov
  2023-04-23  1:55     ` Hou Tao
  2023-04-23  8:03     ` Hou Tao
  0 siblings, 2 replies; 17+ messages in thread
From: Alexei Starovoitov @ 2023-04-22  2:59 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

On Sat, Apr 08, 2023 at 10:18:43PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> The benchmark could be used to compare the performance of hash map
> operations and the memory usage between different flavors of bpf memory
> allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It also
> could be used to check the performance improvement or the memory saving
> of bpf memory allocator optimization and check whether or not a specific
> use case is suitable for bpf memory allocator.
> 
> The benchmark creates a non-preallocated hash map which uses bpf memory
> allocator and shows the operation performance and the memory usage of
> the hash map under different use cases:
> (1) no_op
> Only create the hash map and there is no operations on hash map. It is
> used as the baseline. When each CPUs complete the iteartion of
> nonoverlapping part of hash map, the loop count is increased.
> (2) overwrite
> Each CPU overwrites nonoverlapping part of hash map. When each CPU
> completes one round of iteration, the loop count is increased.
> (3) batch_add_batch_del
> Each CPU adds then deletes nonoverlapping part of hash map in batch.
> When each CPU completes one round of iteration, the loop count is
> increased.
> (4) add_del_on_diff_cpu
> Each two CPUs add and delete nonoverlapping part of map concurrently.
> When each CPU completes one round of iteration, the loop count is
> increased.
> 
> The following benchmark results show that bpf memory allocator doesn't
> handle add_del_on_diff_cpu scenario very well. Because map deletion
> always happen on a different CPU than the map addition and the freed
> memory can never be reused.

what do you mean "can never be reused" ?
bpf_ma frees back to slab when num of elems in per-cpu freelist
is above watermark.

> ./bench htab-mem --use-case $name --max-entries 16384 \
> 	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7
> 
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1129       | 1.15                 | 1.15              |
> | overwrite           | 24.37      | 2.07                 | 2.97              |
> | batch_add_batch_del | 10.58      | 2.91                 | 3.36              |
> | add_del_on_diff_cpu | 13.14      | 380.66               | 633.99            |

large mem for diff_cpu case needs to be investigated.

> 
> ./bench htab-mem --preallocated --use-case $name --max-entries 16384 \
> 	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7
> 
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1195       | 2.11                 | 2.16              |
> | overwrite           | 34.02      | 1.96                 | 2.00              |
> | batch_add_batch_del | 19.25      | 1.96                 | 2.00              |
> | add_del_on_diff_cpu | 8.70       | 1.96                 | 2.00              |
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  tools/testing/selftests/bpf/Makefile          |   3 +
>  tools/testing/selftests/bpf/bench.c           |   4 +
>  .../selftests/bpf/benchs/bench_htab_mem.c     | 273 ++++++++++++++++++
>  .../selftests/bpf/progs/htab_mem_bench.c      | 145 ++++++++++
>  4 files changed, 425 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
>  create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index c02184eaae69..74a45c790d4a 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -647,11 +647,13 @@ $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
>  $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
>  $(OUTPUT)/bench_local_storage_create.o: $(OUTPUT)/bench_local_storage_create.skel.h
>  $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
> +$(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
>  $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
>  $(OUTPUT)/bench: LDLIBS += -lm
>  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>  		 $(TESTING_HELPERS) \
>  		 $(TRACE_HELPERS) \
> +		 $(CGROUP_HELPERS) \
>  		 $(OUTPUT)/bench_count.o \
>  		 $(OUTPUT)/bench_rename.o \
>  		 $(OUTPUT)/bench_trigger.o \
> @@ -664,6 +666,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>  		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
>  		 $(OUTPUT)/bench_bpf_hashmap_lookup.o \
>  		 $(OUTPUT)/bench_local_storage_create.o \
> +		 $(OUTPUT)/bench_htab_mem.o \
>  		 #
>  	$(call msg,BINARY,,$@)
>  	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index d9c080ac1796..d3d9ae321b74 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -279,6 +279,7 @@ extern struct argp bench_local_storage_rcu_tasks_trace_argp;
>  extern struct argp bench_strncmp_argp;
>  extern struct argp bench_hashmap_lookup_argp;
>  extern struct argp bench_local_storage_create_argp;
> +extern struct argp bench_htab_mem_argp;
>  
>  static const struct argp_child bench_parsers[] = {
>  	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
> @@ -290,6 +291,7 @@ static const struct argp_child bench_parsers[] = {
>  		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
>  	{ &bench_hashmap_lookup_argp, 0, "Hashmap lookup benchmark", 0 },
>  	{ &bench_local_storage_create_argp, 0, "local-storage-create benchmark", 0 },
> +	{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
>  	{},
>  };
>  
> @@ -518,6 +520,7 @@ extern const struct bench bench_local_storage_cache_hashmap_control;
>  extern const struct bench bench_local_storage_tasks_trace;
>  extern const struct bench bench_bpf_hashmap_lookup;
>  extern const struct bench bench_local_storage_create;
> +extern const struct bench bench_htab_mem;
>  
>  static const struct bench *benchs[] = {
>  	&bench_count_global,
> @@ -559,6 +562,7 @@ static const struct bench *benchs[] = {
>  	&bench_local_storage_tasks_trace,
>  	&bench_bpf_hashmap_lookup,
>  	&bench_local_storage_create,
> +	&bench_htab_mem,
>  };
>  
>  static void find_benchmark(void)
> diff --git a/tools/testing/selftests/bpf/benchs/bench_htab_mem.c b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
> new file mode 100644
> index 000000000000..116821a2a7dd
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
> @@ -0,0 +1,273 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
> +#include <argp.h>
> +#include <stdbool.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +
> +#include "bench.h"
> +#include "cgroup_helpers.h"
> +#include "htab_mem_bench.skel.h"
> +
> +static struct htab_mem_ctx {
> +	struct htab_mem_bench *skel;
> +	int fd;
> +} ctx;
> +
> +static struct htab_mem_args {
> +	u32 max_entries;
> +	u32 value_size;
> +	u32 full;
> +	const char *use_case;
> +	bool preallocated;
> +} args = {
> +	.max_entries = 16384,
> +	.full = 50,
> +	.value_size = 8,
> +	.use_case = "overwrite",
> +	.preallocated = false,
> +};
> +
> +enum {
> +	ARG_MAX_ENTRIES = 10000,
> +	ARG_FULL_PERCENT = 10001,
> +	ARG_VALUE_SIZE = 10002,
> +	ARG_USE_CASE = 10003,
> +	ARG_PREALLOCATED = 10004,
> +};
> +
> +static const struct argp_option opts[] = {
> +	{ "max-entries", ARG_MAX_ENTRIES, "MAX_ENTRIES", 0,
> +	  "Set the max entries of hash map (default 16384)" },
> +	{ "full", ARG_FULL_PERCENT, "FULL", 0,
> +	  "Set the full percent of hash map (default 50)" },
> +	{ "value-size", ARG_VALUE_SIZE, "VALUE_SIZE", 0,
> +	  "Set the value size of hash map (default 8)" },
> +	{ "use-case", ARG_USE_CASE, "USE_CASE", 0,
> +	  "Set the use case of hash map: no_op|overwrite|batch_add_batch_del|add_del_on_diff_cpu" },
> +	{ "preallocated", ARG_PREALLOCATED, NULL, 0, "use preallocated hash map" },
> +	{},
> +};
> +
> +static error_t htab_mem_parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +	switch (key) {
> +	case ARG_MAX_ENTRIES:
> +		args.max_entries = strtoul(arg, NULL, 10);
> +		break;
> +	case ARG_FULL_PERCENT:
> +		args.full = strtoul(arg, NULL, 10);
> +		if (!args.full || args.full > 100) {
> +			fprintf(stderr, "invalid full percent %u\n", args.full);
> +			argp_usage(state);
> +		}
> +		break;
> +	case ARG_VALUE_SIZE:
> +		args.value_size = strtoul(arg, NULL, 10);
> +		break;
> +	case ARG_USE_CASE:
> +		args.use_case = strdup(arg);
> +		break;
> +	case ARG_PREALLOCATED:
> +		args.preallocated = true;
> +		break;
> +	default:
> +		return ARGP_ERR_UNKNOWN;
> +	}
> +
> +	return 0;
> +}
> +
> +const struct argp bench_htab_mem_argp = {
> +	.options = opts,
> +	.parser = htab_mem_parse_arg,
> +};
> +
> +static void htab_mem_validate(void)
> +{
> +	if (env.consumer_cnt != 1) {
> +		fprintf(stderr, "htab mem benchmark doesn't support multi-consumer!\n");
> +		exit(1);
> +	}
> +}
> +
> +static int setup_and_join_cgroup(const char *path)
> +{
> +	int err, fd;
> +
> +	err = setup_cgroup_environment();
> +	if (err) {
> +		fprintf(stderr, "setup cgroup env failed\n");
> +		return -1;
> +	}
> +
> +	err = create_and_get_cgroup(path);
> +	if (err < 0) {
> +		fprintf(stderr, "create cgroup %s failed\n", path);
> +		goto out;
> +	}
> +	fd = err;
> +
> +	err = join_cgroup(path);
> +	if (err) {
> +		fprintf(stderr, "join cgroup %s failed\n", path);
> +		close(fd);
> +		goto out;
> +	}
> +
> +	return fd;
> +out:
> +	cleanup_cgroup_environment();
> +	return -1;
> +}
> +
> +static void htab_mem_setup(void)
> +{
> +	struct bpf_program *prog;
> +	struct bpf_map *map;
> +	int err;
> +
> +	setup_libbpf();
> +
> +	err = setup_and_join_cgroup("/htab_mem");
> +	if (err < 0)
> +		exit(1);
> +	ctx.fd = err;
> +
> +	ctx.skel = htab_mem_bench__open();
> +	if (!ctx.skel) {
> +		fprintf(stderr, "failed to open skeleton\n");
> +		goto cleanup;
> +	}
> +
> +	map = ctx.skel->maps.htab;
> +	bpf_map__set_max_entries(map, args.max_entries);
> +	bpf_map__set_value_size(map, args.value_size);
> +	if (args.preallocated)
> +		bpf_map__set_map_flags(map, bpf_map__map_flags(map) & ~BPF_F_NO_PREALLOC);
> +
> +	map = ctx.skel->maps.array;
> +	bpf_map__set_max_entries(map, args.max_entries);
> +	bpf_map__set_value_size(map, args.value_size);
> +
> +	prog = bpf_object__find_program_by_name(ctx.skel->obj, args.use_case);
> +	if (!prog) {
> +		fprintf(stderr, "no such use-case: %s\n", args.use_case);
> +		fprintf(stderr, "available use case:");
> +		bpf_object__for_each_program(prog, ctx.skel->obj)
> +			fprintf(stderr, " %s", bpf_program__name(prog));
> +		fprintf(stderr, "\n");
> +		goto cleanup;
> +	}
> +	bpf_program__set_autoload(prog, true);
> +
> +	ctx.skel->bss->nr_thread = env.producer_cnt;
> +	ctx.skel->bss->nr_entries = (uint64_t)args.max_entries * args.full / 100;
> +
> +	err = htab_mem_bench__load(ctx.skel);
> +	if (err) {
> +		fprintf(stderr, "failed to load skeleton\n");
> +		goto cleanup;
> +	}
> +	err = htab_mem_bench__attach(ctx.skel);
> +	if (err) {
> +		fprintf(stderr, "failed to attach skeleton\n");
> +		goto cleanup;
> +	}
> +	return;
> +cleanup:
> +	close(ctx.fd);
> +	cleanup_cgroup_environment();
> +	htab_mem_bench__destroy(ctx.skel);
> +	exit(1);
> +}
> +
> +static void *htab_mem_producer(void *ctx)
> +{
> +	while (true)
> +		(void)syscall(__NR_getpgid);
> +	return NULL;
> +}
> +
> +static void *htab_mem_consumer(void *ctx)
> +{
> +	return NULL;
> +}
> +
> +static void htab_mem_read_mem_cgrp_file(const char *name, unsigned long *value)
> +{
> +	char buf[32];
> +	int fd;
> +
> +	fd = openat(ctx.fd, name, O_RDONLY);
> +	if (fd < 0) {
> +		fprintf(stderr, "no %s\n", name);
> +		*value = 0;
> +		return;
> +	}
> +
> +	buf[sizeof(buf) - 1] = 0;
> +	read(fd, buf, sizeof(buf) - 1);
> +	*value = strtoull(buf, NULL, 0);
> +
> +	close(fd);
> +}
> +
> +static void htab_mem_measure(struct bench_res *res)
> +{
> +	res->hits = atomic_swap(&ctx.skel->bss->loop_cnt, 0);
> +	htab_mem_read_mem_cgrp_file("memory.current", &res->gp_ct);
> +}
> +
> +static void htab_mem_report_progress(int iter, struct bench_res *res, long delta_ns)
> +{
> +	double loop, mem;
> +
> +	loop = res->hits / 1000.0 / (delta_ns / 1000000000.0);
> +	mem = res->gp_ct / 1048576.0;
> +	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
> +	printf("loop %7.2lfk/s, memory usage %7.2lfMiB\n", loop, mem);
> +}
> +
> +static void htab_mem_report_final(struct bench_res res[], int res_cnt)
> +{
> +	double mem_mean = 0.0, mem_stddev = 0.0;
> +	double loop_mean = 0.0, loop_stddev = 0.0;
> +	unsigned long peak_mem;
> +	int i;
> +
> +	for (i = 0; i < res_cnt; i++) {
> +		loop_mean += res[i].hits / 1000.0 / (0.0 + res_cnt);
> +		mem_mean += res[i].gp_ct / 1048576.0 / (0.0 + res_cnt);
> +	}
> +	if (res_cnt > 1)  {
> +		for (i = 0; i < res_cnt; i++) {
> +			loop_stddev += (loop_mean - res[i].hits / 1000.0) *
> +				       (loop_mean - res[i].hits / 1000.0) /
> +				       (res_cnt - 1.0);
> +			mem_stddev += (mem_mean - res[i].gp_ct / 1048576.0) *
> +				      (mem_mean - res[i].gp_ct / 1048576.0) /
> +				      (res_cnt - 1.0);
> +		}
> +		loop_stddev = sqrt(loop_stddev);
> +		mem_stddev = sqrt(mem_stddev);
> +	}
> +
> +	htab_mem_read_mem_cgrp_file("memory.peak", &peak_mem);
> +	printf("Summary: loop %7.2lf \u00B1 %7.2lfk/s, memory usage %7.2lf \u00B1 %7.2lfMiB, "
> +	       "peak memory usage %7.2lfMiB\n",
> +	       loop_mean, loop_stddev, mem_mean, mem_stddev, peak_mem / 1048576.0);
> +}
> +
> +const struct bench bench_htab_mem = {
> +	.name = "htab-mem",
> +	.argp = &bench_htab_mem_argp,
> +	.validate = htab_mem_validate,
> +	.setup = htab_mem_setup,
> +	.producer_thread = htab_mem_producer,
> +	.consumer_thread = htab_mem_consumer,
> +	.measure = htab_mem_measure,
> +	.report_progress = htab_mem_report_progress,
> +	.report_final = htab_mem_report_final,
> +};
> diff --git a/tools/testing/selftests/bpf/progs/htab_mem_bench.c b/tools/testing/selftests/bpf/progs/htab_mem_bench.c
> new file mode 100644
> index 000000000000..f320cb3a11e8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/htab_mem_bench.c
> @@ -0,0 +1,145 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
> +#include <stdbool.h>
> +#include <errno.h>
> +#include <linux/types.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +struct update_ctx {
> +	unsigned int from;
> +	unsigned int step;
> +	unsigned int max;
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(key_size, 4);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +} htab SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(key_size, 4);
> +} array SEC(".maps");
> +
> +char _license[] SEC("license") = "GPL";
> +
> +unsigned int nr_entries = 0;
> +unsigned int nr_thread = 0;
> +long loop_cnt = 0;
> +
> +static int noop_htab(unsigned int i, struct update_ctx *ctx)
> +{
> +	if (ctx->from >= ctx->max)
> +		return 1;
> +
> +	ctx->from += ctx->step;
> +	return 0;
> +}
> +
> +static int write_htab(unsigned int i, struct update_ctx *ctx, unsigned int flags)
> +{
> +	__u64 *value;
> +
> +	if (ctx->from >= ctx->max)
> +		return 1;
> +
> +	value = bpf_map_lookup_elem(&array, &ctx->from);
> +	if (value)
> +		bpf_map_update_elem(&htab, &ctx->from, value, flags);

What is a point of doing lookup from giant array of en element with zero value
to copy it into htab?
Why not to use single zero inited elem for all htab ops?

> +	ctx->from += ctx->step;
> +
> +	return 0;
> +}
> +
> +static int overwrite_htab(unsigned int i, struct update_ctx *ctx)
> +{
> +	return write_htab(i, ctx, 0);
> +}
> +
> +static int newwrite_htab(unsigned int i, struct update_ctx *ctx)
> +{
> +	return write_htab(i, ctx, BPF_NOEXIST);
> +}
> +
> +static int del_htab(unsigned int i, struct update_ctx *ctx)
> +{
> +	__u64 *value;
> +
> +	if (ctx->from >= ctx->max)
> +		return 1;
> +
> +	value = bpf_map_lookup_elem(&array, &ctx->from);
> +	if (value)
> +		bpf_map_delete_elem(&htab, &ctx->from);

even worse here.
Why lookup from array to delete the elem by key from htab?

The if (ctx->from >= ctx->max) check is done before the lookup,
so lookup will succeed and disarded immediately.
array lookup is fast, but the waste of cpu cycles is meaningless.

> +	ctx->from += ctx->step;
> +
> +	return 0;
> +}
> +
> +SEC("?tp/syscalls/sys_enter_getpgid")
> +int no_op(void *ctx)
> +{
> +	struct update_ctx update;
> +
> +	update.from = bpf_get_smp_processor_id();
> +	update.step = nr_thread;
> +	update.max = nr_entries;
> +	bpf_loop(update.max, noop_htab, &update, 0);
> +	__sync_fetch_and_add(&loop_cnt, 1);
> +
> +	return 0;
> +}
> +
> +SEC("?tp/syscalls/sys_enter_getpgid")
> +int overwrite(void *ctx)
> +{
> +	struct update_ctx update;
> +
> +	update.from = bpf_get_smp_processor_id();
> +	update.step = nr_thread;
> +	update.max = nr_entries;
> +	bpf_loop(update.max, overwrite_htab, &update, 0);
> +
> +	__sync_fetch_and_add(&loop_cnt, 1);
> +	return 0;
> +}
> +
> +SEC("?tp/syscalls/sys_enter_getpgid")
> +int batch_add_batch_del(void *ctx)
> +{
> +	struct update_ctx update;
> +
> +	update.from = bpf_get_smp_processor_id();
> +	update.step = nr_thread;
> +	update.max = nr_entries;
> +	bpf_loop(update.max, overwrite_htab, &update, 0);
> +
> +	update.from = bpf_get_smp_processor_id();
> +	bpf_loop(update.max, del_htab, &update, 0);
> +
> +	__sync_fetch_and_add(&loop_cnt, 1);
> +	return 0;
> +}
> +
> +SEC("?tp/syscalls/sys_enter_getpgid")
> +int add_del_on_diff_cpu(void *ctx)
> +{
> +	struct update_ctx update;
> +	unsigned int from;
> +
> +	from = bpf_get_smp_processor_id();
> +	update.from = from / 2;
> +	update.step = nr_thread / 2;
> +	update.max = nr_entries;
> +
> +	if (from & 1)
> +		bpf_loop(update.max, newwrite_htab, &update, 0);
> +	else
> +		bpf_loop(update.max, del_htab, &update, 0);

This is oddly shaped test.
deleter cpu may run ahead of newwrite_htab.
deleter will try to delete elems that don't exist.
Loop of few thousand iterations is not a lot for one cpu to run ahead.

Each loop will run 16k times and every time you step += 4.
So 3/4 of these 16k runs it will be hitting if (ctx->from >= ctx->max) condition.
What are you measuring?

> +
> +	__sync_fetch_and_add(&loop_cnt, 1);
> +	return 0;
> +}
> -- 
> 2.29.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
  2023-04-08 14:18 ` [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
@ 2023-04-22  3:12   ` Alexei Starovoitov
  2023-04-23  7:41     ` Hou Tao
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2023-04-22  3:12 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

On Sat, Apr 08, 2023 at 10:18:46PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Currently the freed objects in bpf memory allocator may be reused
> immediately by new allocation, it introduces use-after-bpf-ma-free
> problem for non-preallocated hash map and makes lookup procedure
> return incorrect result. The immediate reuse also makes introducing
> new use case more difficult (e.g. qp-trie).
> 
> So introduce BPF_MA_REUSE_AFTER_RCU_GP to solve these problems. For
> BPF_MA_REUSE_AFTER_GP, the freed objects are reused only after one RCU
> grace period and may be returned back to slab system after another
> RCU-tasks-trace grace period. So for bpf programs which care about reuse
> problem, these programs can use bpf_rcu_read_{lock,unlock}() to access
> these freed objects safely and for those which doesn't care, there will
> be safely use-after-bpf-ma-free because these objects have not been
> freed by bpf memory allocator.
> 
> To make these freed elements being reusab quickly, BPF_MA_REUSE_AFTER_GP
> dynamically allocates memory to create many inflight RCU callbacks to
> mark these freed element being reusable. These memories used for
> bpf_reuse_batch will be freed when these RCU callbacks complete. When no
> memory is available, synchronize_rcu_expedited() will be used to make
> these freed element reusable. In order to reduce the risk of OOM, part
> of these reusable memory will be freed through RCU-tasks-trace grace
> period. Before these freeing memories are freed, these memories are also
> available for reuse.
> 
> The following are the benchmark results when comparing between different
> flavors of bpf memory allocator. These results show:
> * The performance of reuse-after-rcu-gp bpf ma is good than no bpf ma.
>   Its memory usage is also good than no bpf ma except for
>   add_del_on_diff_cpu case.
> * The memory usage of reuse-after-rcu-gp bpf ma increases a lot compared
>   with normal bpf ma.
> * The memory usage of free-after-rcu-gp bpf ma is better than
>   reuse-after-rcu-gp bpf ma, but its performance is bad than
>   reuse-after-ruc-gp because it doesn't do reuse.
> 
> (1) no bpf memory allocator (v6.0.19)

meaning that htab is using kmalloc and call_rcu to free, right?

> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1187       | 1.05                 | 1.05              |
> | overwrite           | 3.74       | 32.52                | 84.18             |
> | batch_add_batch_del | 2.23       | 26.38                | 48.75             |
> | add_del_on_diff_cpu | 3.92       | 33.72                | 48.96             |
> 
> (2) normal bpf memory allocator
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1187       | 0.96                 | 1.00              |
> | overwrite           | 27.12      | 2.5                  | 2.99              |
> | batch_add_batch_del | 8.9        | 2.77                 | 3.24              |
> | add_del_on_diff_cpu | 11.30      | 218.54               | 440.37            |
> 
> (3) reuse-after-rcu-gp bpf memory allocator

that's the one you're implementing below, right?

> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1276       | 0.96                 | 1.00              |
> | overwrite           | 15.66      | 25.00                | 33.07             |
> | batch_add_batch_del | 10.32      | 18.84                | 22.64             |
> | add_del_on_diff_cpu | 13.00      | 550.50               | 748.74            |
> 
> (4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)

What do you mean? htab uses bpf_ma, but does call_rcu before doing bpf_mem_free ?

> 
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1263       | 0.96                 | 1.00              |
> | overwrite           | 10.73      | 12.33                | 20.32             |
> | batch_add_batch_del | 7.02       | 9.45                 | 14.07             |
> | add_del_on_diff_cpu | 8.99       | 131.64               | 204.42            |

Depending on what we care about all the extra complexity in bpf_ma with reuse-after-rcu-gp
buys us a bit better perf, but many times worse memory consumption?

> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  include/linux/bpf_mem_alloc.h |   1 +
>  kernel/bpf/memalloc.c         | 353 +++++++++++++++++++++++++++++++---
>  2 files changed, 326 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
> index 148347950e16..e7f68432713b 100644
> --- a/include/linux/bpf_mem_alloc.h
> +++ b/include/linux/bpf_mem_alloc.h
> @@ -18,6 +18,7 @@ struct bpf_mem_alloc {
>  /* flags for bpf_mem_alloc_init() */
>  enum {
>  	BPF_MA_PERCPU = 1U << 0,
> +	BPF_MA_REUSE_AFTER_RCU_GP = 1U << 1,
>  };
>  
>  /* 'size != 0' is for bpf_mem_alloc which manages fixed-size objects.
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 072102476019..262100f89610 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -63,6 +63,10 @@ static u8 size_index[24] __ro_after_init = {
>  	2	/* 192 */
>  };
>  
> +static struct workqueue_struct *bpf_ma_wq;
> +
> +static void bpf_ma_prepare_reuse_work(struct work_struct *work);
> +
>  static int bpf_mem_cache_idx(size_t size)
>  {
>  	if (!size || size > 4096)
> @@ -98,18 +102,36 @@ struct bpf_mem_cache {
>  	int free_cnt;
>  	int low_watermark, high_watermark, batch;
>  	int percpu_size;
> +	int cpu;
>  	unsigned int flags;
>  
> +	raw_spinlock_t reuse_lock;
> +	bool abort_reuse;
> +	struct llist_head reuse_ready_head;
> +	struct llist_node *reuse_ready_tail;
> +	struct llist_head wait_for_free;
> +	struct llist_head prepare_reuse_head;
> +	struct llist_node *prepare_reuse_tail;
> +	unsigned int prepare_reuse_cnt;
> +	atomic_t reuse_cb_in_progress;
> +	struct work_struct reuse_work;
> +
>  	struct rcu_head rcu;
>  	struct llist_head free_by_rcu;
>  	struct llist_head waiting_for_gp;
> -	atomic_t call_rcu_in_progress;
> +	atomic_t free_cb_in_progress;
>  };
>  
>  struct bpf_mem_caches {
>  	struct bpf_mem_cache cache[NUM_CACHES];
>  };
>  
> +struct bpf_reuse_batch {
> +	struct bpf_mem_cache *c;
> +	struct llist_node *head, *tail;
> +	struct rcu_head rcu;
> +};
> +
>  static struct llist_node notrace *__llist_del_first(struct llist_head *head)
>  {
>  	struct llist_node *entry, *next;
> @@ -154,6 +176,45 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
>  #endif
>  }
>  
> +static void *bpf_ma_get_reusable_obj(struct bpf_mem_cache *c)
> +{
> +	if (c->flags & BPF_MA_REUSE_AFTER_RCU_GP) {
> +		unsigned long flags;
> +		void *obj;
> +
> +		if (llist_empty(&c->reuse_ready_head) && llist_empty(&c->wait_for_free))
> +			return NULL;
> +
> +		/* reuse_ready_head and wait_for_free may be manipulated by
> +		 * kworker and RCU callbacks.
> +		 */
> +		raw_spin_lock_irqsave(&c->reuse_lock, flags);
> +		obj = __llist_del_first(&c->reuse_ready_head);
> +		if (obj) {
> +			if (llist_empty(&c->reuse_ready_head))
> +				c->reuse_ready_tail = NULL;
> +		} else {
> +			obj = __llist_del_first(&c->wait_for_free);
> +		}
> +		raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
> +		return obj;
> +	}
> +
> +	/*
> +	 * free_by_rcu is only manipulated by irq work refill_work().
> +	 * IRQ works on the same CPU are called sequentially, so it is
> +	 * safe to use __llist_del_first() here. If alloc_bulk() is
> +	 * invoked by the initial prefill, there will be no running
> +	 * refill_work(), so __llist_del_first() is fine as well.
> +	 *
> +	 * In most cases, objects on free_by_rcu are from the same CPU.
> +	 * If some objects come from other CPUs, it doesn't incur any
> +	 * harm because NUMA_NO_NODE means the preference for current
> +	 * numa node and it is not a guarantee.
> +	 */
> +	return __llist_del_first(&c->free_by_rcu);
> +}
> +
>  /* Mostly runs from irq_work except __init phase. */
>  static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  {
> @@ -165,19 +226,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  	memcg = get_memcg(c);
>  	old_memcg = set_active_memcg(memcg);
>  	for (i = 0; i < cnt; i++) {
> -		/*
> -		 * free_by_rcu is only manipulated by irq work refill_work().
> -		 * IRQ works on the same CPU are called sequentially, so it is
> -		 * safe to use __llist_del_first() here. If alloc_bulk() is
> -		 * invoked by the initial prefill, there will be no running
> -		 * refill_work(), so __llist_del_first() is fine as well.
> -		 *
> -		 * In most cases, objects on free_by_rcu are from the same CPU.
> -		 * If some objects come from other CPUs, it doesn't incur any
> -		 * harm because NUMA_NO_NODE means the preference for current
> -		 * numa node and it is not a guarantee.
> -		 */
> -		obj = __llist_del_first(&c->free_by_rcu);
> +		obj = bpf_ma_get_reusable_obj(c);
>  		if (!obj) {
>  			/* Allocate, but don't deplete atomic reserves that typical
>  			 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
> @@ -236,7 +285,7 @@ static void __free_rcu(struct rcu_head *head)
>  	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>  
>  	free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
> -	atomic_set(&c->call_rcu_in_progress, 0);
> +	atomic_set(&c->free_cb_in_progress, 0);
>  }
>  
>  static void __free_rcu_tasks_trace(struct rcu_head *head)
> @@ -264,7 +313,7 @@ static void do_call_rcu(struct bpf_mem_cache *c)
>  {
>  	struct llist_node *llnode, *t;
>  
> -	if (atomic_xchg(&c->call_rcu_in_progress, 1))
> +	if (atomic_xchg(&c->free_cb_in_progress, 1))
>  		return;
>  
>  	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
> @@ -409,6 +458,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
>  			c->objcg = objcg;
>  			c->percpu_size = percpu_size;
>  			c->flags = flags;
> +			c->cpu = cpu;
> +			INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
>  			prefill_mem_cache(c, cpu);
>  		}
>  		ma->cache = pc;
> @@ -433,6 +484,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
>  			c->unit_size = sizes[i];
>  			c->objcg = objcg;
>  			c->flags = flags;
> +			c->cpu = cpu;
> +			INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
>  			prefill_mem_cache(c, cpu);
>  		}
>  	}
> @@ -444,18 +497,40 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
>  static void drain_mem_cache(struct bpf_mem_cache *c)
>  {
>  	bool percpu = !!c->percpu_size;
> +	struct llist_node *head[3];
> +	unsigned long flags;
>  
>  	/* No progs are using this bpf_mem_cache, but htab_map_free() called
>  	 * bpf_mem_cache_free() for all remaining elements and they can be in
>  	 * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
>  	 *
> -	 * Except for waiting_for_gp list, there are no concurrent operations
> -	 * on these lists, so it is safe to use __llist_del_all().
> +	 * Except for waiting_for_gp and free_llist_extra list, there are no
> +	 * concurrent operations on these lists, so it is safe to use
> +	 * __llist_del_all().
>  	 */
>  	free_all(__llist_del_all(&c->free_by_rcu), percpu);
>  	free_all(llist_del_all(&c->waiting_for_gp), percpu);
>  	free_all(__llist_del_all(&c->free_llist), percpu);
> -	free_all(__llist_del_all(&c->free_llist_extra), percpu);
> +	free_all(llist_del_all(&c->free_llist_extra), percpu);
> +
> +	if (!(c->flags & BPF_MA_REUSE_AFTER_RCU_GP))
> +		return;
> +
> +	raw_spin_lock_irqsave(&c->reuse_lock, flags);
> +	/* Indicate kworker and RCU callback to free elements directly
> +	 * instead of adding new elements into these lists.
> +	 */
> +	c->abort_reuse = true;
> +	head[0] = __llist_del_all(&c->prepare_reuse_head);
> +	c->prepare_reuse_tail = NULL;
> +	head[1] = __llist_del_all(&c->reuse_ready_head);
> +	c->reuse_ready_tail = NULL;
> +	head[2] = __llist_del_all(&c->wait_for_free);
> +	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
> +
> +	free_all(head[0], percpu);
> +	free_all(head[1], percpu);
> +	free_all(head[2], percpu);
>  }
>  
>  static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
> @@ -466,10 +541,39 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
>  	ma->caches = NULL;
>  }
>  
> +static void bpf_ma_cancel_reuse_work(struct bpf_mem_alloc *ma)
> +{
> +	struct bpf_mem_caches *cc;
> +	struct bpf_mem_cache *c;
> +	int cpu, i;
> +
> +	if (ma->cache) {
> +		for_each_possible_cpu(cpu) {
> +			c = per_cpu_ptr(ma->cache, cpu);
> +			cancel_work_sync(&c->reuse_work);
> +		}
> +	}
> +	if (ma->caches) {
> +		for_each_possible_cpu(cpu) {
> +			cc = per_cpu_ptr(ma->caches, cpu);
> +			for (i = 0; i < NUM_CACHES; i++) {
> +				c = &cc->cache[i];
> +				cancel_work_sync(&c->reuse_work);
> +			}
> +		}
> +	}
> +}
> +
>  static void free_mem_alloc(struct bpf_mem_alloc *ma)
>  {
> -	/* waiting_for_gp lists was drained, but __free_rcu might
> -	 * still execute. Wait for it now before we freeing percpu caches.
> +	bool reuse_after_rcu_gp = ma->flags & BPF_MA_REUSE_AFTER_RCU_GP;
> +
> +	/* Cancel the inflight kworkers */
> +	if (reuse_after_rcu_gp)
> +		bpf_ma_cancel_reuse_work(ma);
> +
> +	/* For normal bpf ma, waiting_for_gp lists was drained, but __free_rcu
> +	 * might still execute. Wait for it now before we freeing percpu caches.
>  	 *
>  	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
>  	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
> @@ -477,9 +581,13 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
>  	 * so if call_rcu(head, __free_rcu) is skipped due to
>  	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
>  	 * using rcu_trace_implies_rcu_gp() as well.
> +	 *
> +	 * For reuse-after-rcu-gp bpf ma, use rcu_barrier_tasks_trace() to
> +	 * wait for the pending bpf_ma_free_reusable_cb() and use rcu_barrier()
> +	 * to wait for the pending bpf_ma_reuse_cb().
>  	 */
>  	rcu_barrier_tasks_trace();
> -	if (!rcu_trace_implies_rcu_gp())
> +	if (reuse_after_rcu_gp || !rcu_trace_implies_rcu_gp())
>  		rcu_barrier();
>  	free_mem_alloc_no_barrier(ma);
>  }
> @@ -512,6 +620,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
>  	}
>  
>  	/* Defer barriers into worker to let the rest of map memory to be freed */
> +	copy->flags = ma->flags;
>  	copy->cache = ma->cache;
>  	ma->cache = NULL;
>  	copy->caches = ma->caches;
> @@ -541,7 +650,9 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>  			 */
>  			irq_work_sync(&c->refill_work);
>  			drain_mem_cache(c);
> -			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> +			rcu_in_progress += atomic_read(&c->free_cb_in_progress);
> +			/* Pending kworkers or RCU callbacks */
> +			rcu_in_progress += atomic_read(&c->reuse_cb_in_progress);
>  		}
>  		/* objcg is the same across cpus */
>  		if (c->objcg)
> @@ -556,7 +667,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>  				c = &cc->cache[i];
>  				irq_work_sync(&c->refill_work);
>  				drain_mem_cache(c);
> -				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> +				rcu_in_progress += atomic_read(&c->free_cb_in_progress);
> +				rcu_in_progress += atomic_read(&c->reuse_cb_in_progress);
>  			}
>  		}
>  		if (c->objcg)
> @@ -600,18 +712,183 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>  	return llnode;
>  }
>  
> +static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c, struct llist_node *head,
> +					      struct llist_node *tail)
> +{
> +	unsigned long flags;
> +	bool abort;
> +
> +	raw_spin_lock_irqsave(&c->reuse_lock, flags);
> +	abort = c->abort_reuse;
> +	if (!abort) {
> +		if (llist_empty(&c->reuse_ready_head))
> +			c->reuse_ready_tail = tail;
> +		__llist_add_batch(head, tail, &c->reuse_ready_head);
> +	}
> +	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
> +
> +	/* Don't move these objects to reuse_ready list and free
> +	 * these objects directly.
> +	 */
> +	if (abort)
> +		free_all(head, !!c->percpu_size);
> +}
> +
> +static void bpf_ma_reuse_cb(struct rcu_head *rcu)
> +{
> +	struct bpf_reuse_batch *batch = container_of(rcu, struct bpf_reuse_batch, rcu);
> +	struct bpf_mem_cache *c = batch->c;
> +
> +	bpf_ma_add_to_reuse_ready_or_free(c, batch->head, batch->tail);
> +	atomic_dec(&c->reuse_cb_in_progress);
> +	kfree(batch);
> +}
> +
> +static bool bpf_ma_try_free_reuse_objs(struct bpf_mem_cache *c)
> +{
> +	struct llist_node *head, *tail;
> +	bool do_free;
> +
> +	if (llist_empty(&c->reuse_ready_head))
> +		return false;
> +
> +	do_free = !atomic_xchg(&c->free_cb_in_progress, 1);
> +	if (!do_free)
> +		return false;
> +
> +	head = __llist_del_all(&c->reuse_ready_head);
> +	tail = c->reuse_ready_tail;
> +	c->reuse_ready_tail = NULL;
> +
> +	__llist_add_batch(head, tail, &c->wait_for_free);
> +
> +	return true;
> +}
> +
> +static void bpf_ma_free_reusable_cb(struct rcu_head *rcu)
> +{
> +	struct bpf_mem_cache *c = container_of(rcu, struct bpf_mem_cache, rcu);
> +	struct llist_node *head;
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&c->reuse_lock, flags);
> +	head = __llist_del_all(&c->wait_for_free);
> +	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
> +
> +	free_all(head, !!c->percpu_size);
> +	atomic_set(&c->free_cb_in_progress, 0);
> +}
> +
> +static void bpf_ma_prepare_reuse_work(struct work_struct *work)
> +{
> +	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, reuse_work);
> +	struct llist_node *head, *tail, *llnode, *tmp;
> +	struct bpf_reuse_batch *batch;
> +	unsigned long flags;
> +	bool do_free;
> +
> +	local_irq_save(flags);
> +	/* When CPU is offline, the running CPU may be different with
> +	 * the CPU which submitted the work. When these two CPUs are the same,
> +	 * kworker may be interrupted by NMI, so increase active to protect
> +	 * again such concurrency.
> +	 */
> +	if (c->cpu == smp_processor_id())
> +		WARN_ON_ONCE(local_inc_return(&c->active) != 1);
> +	raw_spin_lock(&c->reuse_lock);
> +	head = __llist_del_all(&c->prepare_reuse_head);
> +	tail = c->prepare_reuse_tail;
> +	c->prepare_reuse_tail = NULL;
> +	c->prepare_reuse_cnt = 0;
> +	if (c->cpu == smp_processor_id())
> +		local_dec(&c->active);
> +
> +	/* Try to free elements in reusable list. Before these elements are
> +	 * freed in RCU cb, these element will still be available for reuse.
> +	 */
> +	do_free = bpf_ma_try_free_reuse_objs(c);
> +	raw_spin_unlock(&c->reuse_lock);
> +	local_irq_restore(flags);
> +
> +	if (do_free)
> +		call_rcu_tasks_trace(&c->rcu, bpf_ma_free_reusable_cb);
> +
> +	llist_for_each_safe(llnode, tmp, llist_del_all(&c->free_llist_extra)) {
> +		if (!head)
> +			tail = llnode;
> +		llnode->next = head;
> +		head = llnode->next;
> +	}
> +	/* Draining is in progress ? */
> +	if (!head) {
> +		/* kworker completes and no RCU callback */
> +		atomic_dec(&c->reuse_cb_in_progress);
> +		return;
> +	}
> +
> +	batch = kmalloc(sizeof(*batch), GFP_KERNEL);
> +	if (!batch) {
> +		synchronize_rcu_expedited();
> +		bpf_ma_add_to_reuse_ready_or_free(c, head, tail);
> +		/* kworker completes and no RCU callback */
> +		atomic_dec(&c->reuse_cb_in_progress);
> +		return;
> +	}
> +
> +	batch->c = c;
> +	batch->head = head;
> +	batch->tail = tail;
> +	call_rcu(&batch->rcu, bpf_ma_reuse_cb);
> +}
> +
> +static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	/* In case a NMI-context bpf program is also freeing object. */
> +	if (local_inc_return(&c->active) == 1) {
> +		bool try_queue_work = false;
> +
> +		/* kworker may remove elements from prepare_reuse_head */
> +		raw_spin_lock(&c->reuse_lock);
> +		if (llist_empty(&c->prepare_reuse_head))
> +			c->prepare_reuse_tail = llnode;
> +		__llist_add(llnode, &c->prepare_reuse_head);
> +		if (++c->prepare_reuse_cnt > c->high_watermark) {
> +			/* Zero out prepare_reuse_cnt early to prevent
> +			 * unnecessary queue_work().
> +			 */
> +			c->prepare_reuse_cnt = 0;
> +			try_queue_work = true;
> +		}
> +		raw_spin_unlock(&c->reuse_lock);
> +
> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
> +			/* Use reuse_cb_in_progress to indicate there is
> +			 * inflight reuse kworker or reuse RCU callback.
> +			 */
> +			atomic_inc(&c->reuse_cb_in_progress);
> +			/* Already queued */
> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))

how many kthreads are spawned by wq in the peak?

> +				atomic_dec(&c->reuse_cb_in_progress);
> +		}
> +	} else {
> +		llist_add(llnode, &c->free_llist_extra);
> +	}
> +	local_dec(&c->active);
> +	local_irq_restore(flags);
> +}
> +
>  /* Though 'ptr' object could have been allocated on a different cpu
>   * add it to the free_llist of the current cpu.
>   * Let kfree() logic deal with it when it's later called from irq_work.
>   */
> -static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
> +static void notrace immediate_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
>  {
> -	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
>  	unsigned long flags;
>  	int cnt = 0;
>  
> -	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
> -
>  	local_irq_save(flags);
>  	if (local_inc_return(&c->active) == 1) {
>  		__llist_add(llnode, &c->free_llist);
> @@ -633,6 +910,18 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
>  		irq_work_raise(c);
>  }
>  
> +static inline void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
> +{
> +	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
> +
> +	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
> +
> +	if (c->flags & BPF_MA_REUSE_AFTER_RCU_GP)
> +		wait_gp_reuse_free(c, llnode);
> +	else
> +		immediate_reuse_free(c, llnode);
> +}
> +
>  /* Called from BPF program or from sys_bpf syscall.
>   * In both cases migration is disabled.
>   */
> @@ -724,3 +1013,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
>  
>  	return !ret ? NULL : ret + LLIST_NODE_SZ;
>  }
> +
> +static int __init bpf_ma_init(void)
> +{
> +	bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
> +	BUG_ON(!bpf_ma_wq);
> +	return 0;
> +}
> +late_initcall(bpf_ma_init);
> -- 
> 2.29.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator
  2023-04-22  2:59   ` Alexei Starovoitov
@ 2023-04-23  1:55     ` Hou Tao
  2023-04-27  4:20       ` Alexei Starovoitov
  2023-04-23  8:03     ` Hou Tao
  1 sibling, 1 reply; 17+ messages in thread
From: Hou Tao @ 2023-04-23  1:55 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

Hi,

On 4/22/2023 10:59 AM, Alexei Starovoitov wrote:
> On Sat, Apr 08, 2023 at 10:18:43PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> The benchmark could be used to compare the performance of hash map
>> operations and the memory usage between different flavors of bpf memory
>> allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It also
>> could be used to check the performance improvement or the memory saving
>> of bpf memory allocator optimization and check whether or not a specific
>> use case is suitable for bpf memory allocator.
>>
>> The benchmark creates a non-preallocated hash map which uses bpf memory
>> allocator and shows the operation performance and the memory usage of
>> the hash map under different use cases:
>> (1) no_op
>> Only create the hash map and there is no operations on hash map. It is
>> used as the baseline. When each CPUs complete the iteartion of
>> nonoverlapping part of hash map, the loop count is increased.
>> (2) overwrite
>> Each CPU overwrites nonoverlapping part of hash map. When each CPU
>> completes one round of iteration, the loop count is increased.
>> (3) batch_add_batch_del
>> Each CPU adds then deletes nonoverlapping part of hash map in batch.
>> When each CPU completes one round of iteration, the loop count is
>> increased.
>> (4) add_del_on_diff_cpu
>> Each two CPUs add and delete nonoverlapping part of map concurrently.
>> When each CPU completes one round of iteration, the loop count is
>> increased.
>>
>> The following benchmark results show that bpf memory allocator doesn't
>> handle add_del_on_diff_cpu scenario very well. Because map deletion
>> always happen on a different CPU than the map addition and the freed
>> memory can never be reused.
> what do you mean "can never be reused" ?
> bpf_ma frees back to slab when num of elems in per-cpu freelist
> is above watermark.
Yes,  the above saying is inaccurate. I should say: the freed element can never
be immediately-reused and these elements are only available for reuse after
freeing to slab subsystem.
>
>> ./bench htab-mem --use-case $name --max-entries 16384 \
>> 	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7
>>
>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>> | --                  | --         | --                   | --                |
>> | no_op               | 1129       | 1.15                 | 1.15              |
>> | overwrite           | 24.37      | 2.07                 | 2.97              |
>> | batch_add_batch_del | 10.58      | 2.91                 | 3.36              |
>> | add_del_on_diff_cpu | 13.14      | 380.66               | 633.99            |
> large mem for diff_cpu case needs to be investigated.
The main reason is that tasks-trace RCU GP is slow and there is only one
inflight free callback, so the CPUs which only do element addition will allocate
new memory from slab continuously and the CPUs which only do element deletion
will free these elements continuously through call_tasks_trace_rcu(), but due to
the slowness of tasks-trace RCU GP, these freed elements could not be freed back
to slab subsystem timely.

I am trying to mitigate the problem by using multiple inflight free RCU
callback, but the result doesn't lookup very good and I'm still digging into it.
If it doesn't work out, maybe stealing from free_list of other CPUs is an
alternative option.
>
>> ./bench htab-mem --preallocated --use-case $name --max-entries 16384 \
>> 	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7
>>
>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>> | --                  | --         | --                   | --                |
>> | no_op               | 1195       | 2.11                 | 2.16              |
>> | overwrite           | 34.02      | 1.96                 | 2.00              |
>> | batch_add_batch_del | 19.25      | 1.96                 | 2.00              |
>> | add_del_on_diff_cpu | 8.70       | 1.96                 | 2.00              |
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  tools/testing/selftests/bpf/Makefile          |   3 +
>>  tools/testing/selftests/bpf/bench.c           |   4 +
>>  .../selftests/bpf/benchs/bench_htab_mem.c     | 273 ++++++++++++++++++
>>  .../selftests/bpf/progs/htab_mem_bench.c      | 145 ++++++++++
>>  4 files changed, 425 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c
>>
>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
>> index c02184eaae69..74a45c790d4a 100644
>> --- a/tools/testing/selftests/bpf/Makefile
>> +++ b/tools/testing/selftests/bpf/Makefile
>> @@ -647,11 +647,13 @@ $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
>>  $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
>>  $(OUTPUT)/bench_local_storage_create.o: $(OUTPUT)/bench_local_storage_create.skel.h
>>  $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
>> +$(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
>>  $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
>>  $(OUTPUT)/bench: LDLIBS += -lm
>>  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>>  		 $(TESTING_HELPERS) \
>>  		 $(TRACE_HELPERS) \
>> +		 $(CGROUP_HELPERS) \
>>  		 $(OUTPUT)/bench_count.o \
>>  		 $(OUTPUT)/bench_rename.o \
>>  		 $(OUTPUT)/bench_trigger.o \
>> @@ -664,6 +666,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>>  		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
>>  		 $(OUTPUT)/bench_bpf_hashmap_lookup.o \
>>  		 $(OUTPUT)/bench_local_storage_create.o \
>> +		 $(OUTPUT)/bench_htab_mem.o \
>>  		 #
>>  	$(call msg,BINARY,,$@)
>>  	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
>> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
>> index d9c080ac1796..d3d9ae321b74 100644
>> --- a/tools/testing/selftests/bpf/bench.c
>> +++ b/tools/testing/selftests/bpf/bench.c
>> @@ -279,6 +279,7 @@ extern struct argp bench_local_storage_rcu_tasks_trace_argp;
>>  extern struct argp bench_strncmp_argp;
>>  extern struct argp bench_hashmap_lookup_argp;
>>  extern struct argp bench_local_storage_create_argp;
>> +extern struct argp bench_htab_mem_argp;
>>  
>>  static const struct argp_child bench_parsers[] = {
>>  	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
>> @@ -290,6 +291,7 @@ static const struct argp_child bench_parsers[] = {
>>  		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
>>  	{ &bench_hashmap_lookup_argp, 0, "Hashmap lookup benchmark", 0 },
>>  	{ &bench_local_storage_create_argp, 0, "local-storage-create benchmark", 0 },
>> +	{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
>>  	{},
>>  };
>>  
>> @@ -518,6 +520,7 @@ extern const struct bench bench_local_storage_cache_hashmap_control;
>>  extern const struct bench bench_local_storage_tasks_trace;
>>  extern const struct bench bench_bpf_hashmap_lookup;
>>  extern const struct bench bench_local_storage_create;
>> +extern const struct bench bench_htab_mem;
>>  
>>  static const struct bench *benchs[] = {
>>  	&bench_count_global,
>> @@ -559,6 +562,7 @@ static const struct bench *benchs[] = {
>>  	&bench_local_storage_tasks_trace,
>>  	&bench_bpf_hashmap_lookup,
>>  	&bench_local_storage_create,
>> +	&bench_htab_mem,
>>  };
>>  
>>  static void find_benchmark(void)
>> diff --git a/tools/testing/selftests/bpf/benchs/bench_htab_mem.c b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
>> new file mode 100644
>> index 000000000000..116821a2a7dd
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
>> @@ -0,0 +1,273 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
>> +#include <argp.h>
>> +#include <stdbool.h>
>> +#include <sys/types.h>
>> +#include <sys/stat.h>
>> +#include <fcntl.h>
>> +
>> +#include "bench.h"
>> +#include "cgroup_helpers.h"
>> +#include "htab_mem_bench.skel.h"
>> +
>> +static struct htab_mem_ctx {
>> +	struct htab_mem_bench *skel;
>> +	int fd;
>> +} ctx;
>> +
>> +static struct htab_mem_args {
>> +	u32 max_entries;
>> +	u32 value_size;
>> +	u32 full;
>> +	const char *use_case;
>> +	bool preallocated;
>> +} args = {
>> +	.max_entries = 16384,
>> +	.full = 50,
>> +	.value_size = 8,
>> +	.use_case = "overwrite",
>> +	.preallocated = false,
>> +};
>> +
>> +enum {
>> +	ARG_MAX_ENTRIES = 10000,
>> +	ARG_FULL_PERCENT = 10001,
>> +	ARG_VALUE_SIZE = 10002,
>> +	ARG_USE_CASE = 10003,
>> +	ARG_PREALLOCATED = 10004,
>> +};
>> +
>> +static const struct argp_option opts[] = {
>> +	{ "max-entries", ARG_MAX_ENTRIES, "MAX_ENTRIES", 0,
>> +	  "Set the max entries of hash map (default 16384)" },
>> +	{ "full", ARG_FULL_PERCENT, "FULL", 0,
>> +	  "Set the full percent of hash map (default 50)" },
>> +	{ "value-size", ARG_VALUE_SIZE, "VALUE_SIZE", 0,
>> +	  "Set the value size of hash map (default 8)" },
>> +	{ "use-case", ARG_USE_CASE, "USE_CASE", 0,
>> +	  "Set the use case of hash map: no_op|overwrite|batch_add_batch_del|add_del_on_diff_cpu" },
>> +	{ "preallocated", ARG_PREALLOCATED, NULL, 0, "use preallocated hash map" },
>> +	{},
>> +};
>> +
>> +static error_t htab_mem_parse_arg(int key, char *arg, struct argp_state *state)
>> +{
>> +	switch (key) {
>> +	case ARG_MAX_ENTRIES:
>> +		args.max_entries = strtoul(arg, NULL, 10);
>> +		break;
>> +	case ARG_FULL_PERCENT:
>> +		args.full = strtoul(arg, NULL, 10);
>> +		if (!args.full || args.full > 100) {
>> +			fprintf(stderr, "invalid full percent %u\n", args.full);
>> +			argp_usage(state);
>> +		}
>> +		break;
>> +	case ARG_VALUE_SIZE:
>> +		args.value_size = strtoul(arg, NULL, 10);
>> +		break;
>> +	case ARG_USE_CASE:
>> +		args.use_case = strdup(arg);
>> +		break;
>> +	case ARG_PREALLOCATED:
>> +		args.preallocated = true;
>> +		break;
>> +	default:
>> +		return ARGP_ERR_UNKNOWN;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +const struct argp bench_htab_mem_argp = {
>> +	.options = opts,
>> +	.parser = htab_mem_parse_arg,
>> +};
>> +
>> +static void htab_mem_validate(void)
>> +{
>> +	if (env.consumer_cnt != 1) {
>> +		fprintf(stderr, "htab mem benchmark doesn't support multi-consumer!\n");
>> +		exit(1);
>> +	}
>> +}
>> +
>> +static int setup_and_join_cgroup(const char *path)
>> +{
>> +	int err, fd;
>> +
>> +	err = setup_cgroup_environment();
>> +	if (err) {
>> +		fprintf(stderr, "setup cgroup env failed\n");
>> +		return -1;
>> +	}
>> +
>> +	err = create_and_get_cgroup(path);
>> +	if (err < 0) {
>> +		fprintf(stderr, "create cgroup %s failed\n", path);
>> +		goto out;
>> +	}
>> +	fd = err;
>> +
>> +	err = join_cgroup(path);
>> +	if (err) {
>> +		fprintf(stderr, "join cgroup %s failed\n", path);
>> +		close(fd);
>> +		goto out;
>> +	}
>> +
>> +	return fd;
>> +out:
>> +	cleanup_cgroup_environment();
>> +	return -1;
>> +}
>> +
>> +static void htab_mem_setup(void)
>> +{
>> +	struct bpf_program *prog;
>> +	struct bpf_map *map;
>> +	int err;
>> +
>> +	setup_libbpf();
>> +
>> +	err = setup_and_join_cgroup("/htab_mem");
>> +	if (err < 0)
>> +		exit(1);
>> +	ctx.fd = err;
>> +
>> +	ctx.skel = htab_mem_bench__open();
>> +	if (!ctx.skel) {
>> +		fprintf(stderr, "failed to open skeleton\n");
>> +		goto cleanup;
>> +	}
>> +
>> +	map = ctx.skel->maps.htab;
>> +	bpf_map__set_max_entries(map, args.max_entries);
>> +	bpf_map__set_value_size(map, args.value_size);
>> +	if (args.preallocated)
>> +		bpf_map__set_map_flags(map, bpf_map__map_flags(map) & ~BPF_F_NO_PREALLOC);
>> +
>> +	map = ctx.skel->maps.array;
>> +	bpf_map__set_max_entries(map, args.max_entries);
>> +	bpf_map__set_value_size(map, args.value_size);
>> +
>> +	prog = bpf_object__find_program_by_name(ctx.skel->obj, args.use_case);
>> +	if (!prog) {
>> +		fprintf(stderr, "no such use-case: %s\n", args.use_case);
>> +		fprintf(stderr, "available use case:");
>> +		bpf_object__for_each_program(prog, ctx.skel->obj)
>> +			fprintf(stderr, " %s", bpf_program__name(prog));
>> +		fprintf(stderr, "\n");
>> +		goto cleanup;
>> +	}
>> +	bpf_program__set_autoload(prog, true);
>> +
>> +	ctx.skel->bss->nr_thread = env.producer_cnt;
>> +	ctx.skel->bss->nr_entries = (uint64_t)args.max_entries * args.full / 100;
>> +
>> +	err = htab_mem_bench__load(ctx.skel);
>> +	if (err) {
>> +		fprintf(stderr, "failed to load skeleton\n");
>> +		goto cleanup;
>> +	}
>> +	err = htab_mem_bench__attach(ctx.skel);
>> +	if (err) {
>> +		fprintf(stderr, "failed to attach skeleton\n");
>> +		goto cleanup;
>> +	}
>> +	return;
>> +cleanup:
>> +	close(ctx.fd);
>> +	cleanup_cgroup_environment();
>> +	htab_mem_bench__destroy(ctx.skel);
>> +	exit(1);
>> +}
>> +
>> +static void *htab_mem_producer(void *ctx)
>> +{
>> +	while (true)
>> +		(void)syscall(__NR_getpgid);
>> +	return NULL;
>> +}
>> +
>> +static void *htab_mem_consumer(void *ctx)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static void htab_mem_read_mem_cgrp_file(const char *name, unsigned long *value)
>> +{
>> +	char buf[32];
>> +	int fd;
>> +
>> +	fd = openat(ctx.fd, name, O_RDONLY);
>> +	if (fd < 0) {
>> +		fprintf(stderr, "no %s\n", name);
>> +		*value = 0;
>> +		return;
>> +	}
>> +
>> +	buf[sizeof(buf) - 1] = 0;
>> +	read(fd, buf, sizeof(buf) - 1);
>> +	*value = strtoull(buf, NULL, 0);
>> +
>> +	close(fd);
>> +}
>> +
>> +static void htab_mem_measure(struct bench_res *res)
>> +{
>> +	res->hits = atomic_swap(&ctx.skel->bss->loop_cnt, 0);
>> +	htab_mem_read_mem_cgrp_file("memory.current", &res->gp_ct);
>> +}
>> +
>> +static void htab_mem_report_progress(int iter, struct bench_res *res, long delta_ns)
>> +{
>> +	double loop, mem;
>> +
>> +	loop = res->hits / 1000.0 / (delta_ns / 1000000000.0);
>> +	mem = res->gp_ct / 1048576.0;
>> +	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
>> +	printf("loop %7.2lfk/s, memory usage %7.2lfMiB\n", loop, mem);
>> +}
>> +
>> +static void htab_mem_report_final(struct bench_res res[], int res_cnt)
>> +{
>> +	double mem_mean = 0.0, mem_stddev = 0.0;
>> +	double loop_mean = 0.0, loop_stddev = 0.0;
>> +	unsigned long peak_mem;
>> +	int i;
>> +
>> +	for (i = 0; i < res_cnt; i++) {
>> +		loop_mean += res[i].hits / 1000.0 / (0.0 + res_cnt);
>> +		mem_mean += res[i].gp_ct / 1048576.0 / (0.0 + res_cnt);
>> +	}
>> +	if (res_cnt > 1)  {
>> +		for (i = 0; i < res_cnt; i++) {
>> +			loop_stddev += (loop_mean - res[i].hits / 1000.0) *
>> +				       (loop_mean - res[i].hits / 1000.0) /
>> +				       (res_cnt - 1.0);
>> +			mem_stddev += (mem_mean - res[i].gp_ct / 1048576.0) *
>> +				      (mem_mean - res[i].gp_ct / 1048576.0) /
>> +				      (res_cnt - 1.0);
>> +		}
>> +		loop_stddev = sqrt(loop_stddev);
>> +		mem_stddev = sqrt(mem_stddev);
>> +	}
>> +
>> +	htab_mem_read_mem_cgrp_file("memory.peak", &peak_mem);
>> +	printf("Summary: loop %7.2lf \u00B1 %7.2lfk/s, memory usage %7.2lf \u00B1 %7.2lfMiB, "
>> +	       "peak memory usage %7.2lfMiB\n",
>> +	       loop_mean, loop_stddev, mem_mean, mem_stddev, peak_mem / 1048576.0);
>> +}
>> +
>> +const struct bench bench_htab_mem = {
>> +	.name = "htab-mem",
>> +	.argp = &bench_htab_mem_argp,
>> +	.validate = htab_mem_validate,
>> +	.setup = htab_mem_setup,
>> +	.producer_thread = htab_mem_producer,
>> +	.consumer_thread = htab_mem_consumer,
>> +	.measure = htab_mem_measure,
>> +	.report_progress = htab_mem_report_progress,
>> +	.report_final = htab_mem_report_final,
>> +};
>> diff --git a/tools/testing/selftests/bpf/progs/htab_mem_bench.c b/tools/testing/selftests/bpf/progs/htab_mem_bench.c
>> new file mode 100644
>> index 000000000000..f320cb3a11e8
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/htab_mem_bench.c
>> @@ -0,0 +1,145 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (C) 2023. Huawei Technologies Co., Ltd */
>> +#include <stdbool.h>
>> +#include <errno.h>
>> +#include <linux/types.h>
>> +#include <linux/bpf.h>
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_tracing.h>
>> +
>> +struct update_ctx {
>> +	unsigned int from;
>> +	unsigned int step;
>> +	unsigned int max;
>> +};
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_HASH);
>> +	__uint(key_size, 4);
>> +	__uint(map_flags, BPF_F_NO_PREALLOC);
>> +} htab SEC(".maps");
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_ARRAY);
>> +	__uint(key_size, 4);
>> +} array SEC(".maps");
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +unsigned int nr_entries = 0;
>> +unsigned int nr_thread = 0;
>> +long loop_cnt = 0;
>> +
>> +static int noop_htab(unsigned int i, struct update_ctx *ctx)
>> +{
>> +	if (ctx->from >= ctx->max)
>> +		return 1;
>> +
>> +	ctx->from += ctx->step;
>> +	return 0;
>> +}
>> +
>> +static int write_htab(unsigned int i, struct update_ctx *ctx, unsigned int flags)
>> +{
>> +	__u64 *value;
>> +
>> +	if (ctx->from >= ctx->max)
>> +		return 1;
>> +
>> +	value = bpf_map_lookup_elem(&array, &ctx->from);
>> +	if (value)
>> +		bpf_map_update_elem(&htab, &ctx->from, value, flags);
> What is a point of doing lookup from giant array of en element with zero value
> to copy it into htab?
> Why not to use single zero inited elem for all htab ops?
I want to check how does the different size of value effect the benchmark
result, so I choose a variable-size value.
>
>> +	ctx->from += ctx->step;
>> +
>> +	return 0;
>> +}
>> +
>> +static int overwrite_htab(unsigned int i, struct update_ctx *ctx)
>> +{
>> +	return write_htab(i, ctx, 0);
>> +}
>> +
>> +static int newwrite_htab(unsigned int i, struct update_ctx *ctx)
>> +{
>> +	return write_htab(i, ctx, BPF_NOEXIST);
>> +}
>> +
>> +static int del_htab(unsigned int i, struct update_ctx *ctx)
>> +{
>> +	__u64 *value;
>> +
>> +	if (ctx->from >= ctx->max)
>> +		return 1;
>> +
>> +	value = bpf_map_lookup_elem(&array, &ctx->from);
>> +	if (value)
>> +		bpf_map_delete_elem(&htab, &ctx->from);
> even worse here.
> Why lookup from array to delete the elem by key from htab?
>
> The if (ctx->from >= ctx->max) check is done before the lookup,
> so lookup will succeed and disarded immediately.
> array lookup is fast, but the waste of cpu cycles is meaningless.
My bad. Indeed there is no need to lookup from the array. Will remove it.
>
>> +	ctx->from += ctx->step;
>> +
>> +	return 0;
>> +}
>> +
>> +SEC("?tp/syscalls/sys_enter_getpgid")
>> +int no_op(void *ctx)
>> +{
>> +	struct update_ctx update;
>> +
>> +	update.from = bpf_get_smp_processor_id();
>> +	update.step = nr_thread;
>> +	update.max = nr_entries;
>> +	bpf_loop(update.max, noop_htab, &update, 0);
>> +	__sync_fetch_and_add(&loop_cnt, 1);
>> +
>> +	return 0;
>> +}
>> +
>> +SEC("?tp/syscalls/sys_enter_getpgid")
>> +int overwrite(void *ctx)
>> +{
>> +	struct update_ctx update;
>> +
>> +	update.from = bpf_get_smp_processor_id();
>> +	update.step = nr_thread;
>> +	update.max = nr_entries;
>> +	bpf_loop(update.max, overwrite_htab, &update, 0);
>> +
>> +	__sync_fetch_and_add(&loop_cnt, 1);
>> +	return 0;
>> +}
>> +
>> +SEC("?tp/syscalls/sys_enter_getpgid")
>> +int batch_add_batch_del(void *ctx)
>> +{
>> +	struct update_ctx update;
>> +
>> +	update.from = bpf_get_smp_processor_id();
>> +	update.step = nr_thread;
>> +	update.max = nr_entries;
>> +	bpf_loop(update.max, overwrite_htab, &update, 0);
>> +
>> +	update.from = bpf_get_smp_processor_id();
>> +	bpf_loop(update.max, del_htab, &update, 0);
>> +
>> +	__sync_fetch_and_add(&loop_cnt, 1);
>> +	return 0;
>> +}
>> +
>> +SEC("?tp/syscalls/sys_enter_getpgid")
>> +int add_del_on_diff_cpu(void *ctx)
>> +{
>> +	struct update_ctx update;
>> +	unsigned int from;
>> +
>> +	from = bpf_get_smp_processor_id();
>> +	update.from = from / 2;
>> +	update.step = nr_thread / 2;
>> +	update.max = nr_entries;
>> +
>> +	if (from & 1)
>> +		bpf_loop(update.max, newwrite_htab, &update, 0);
>> +	else
>> +		bpf_loop(update.max, del_htab, &update, 0);
> This is oddly shaped test.
> deleter cpu may run ahead of newwrite_htab.
> deleter will try to delete elems that don't exist.
> Loop of few thousand iterations is not a lot for one cpu to run ahead.
Yes. There will be empty loop if deletion CPU is running ahead of addition CPU,
but as shown by the nop column in the benchmark result, the speed of empty loop
is relative fast.
>
> Each loop will run 16k times and every time you step += 4.
> So 3/4 of these 16k runs it will be hitting if (ctx->from >= ctx->max) condition.
> What are you measuring?
As explained in the commit message, I am trying to let different deletion and
deletion CPU pairs operate on the different subsets of hash-table elements.
Assuming there are 16 elements in the htab and there are 8 CPUs and 8 threads,
the following is the operation subset for each CPU:

CPU 0:  0 4 8 12 (do deletion)
CPU 1:  0 4 8 12 (do addition)

CPU 2:  1 5 9 13
CPU 3:  1 5 9 13

CPU 4:  2 6 10 14
CPU 5:  2 6 10 14

CPU 6:  3 7 11 15
CPU 7:  3 7 11 15

>
>> +
>> +	__sync_fetch_and_add(&loop_cnt, 1);
>> +	return 0;
>> +}
>> -- 
>> 2.29.2
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
  2023-04-22  3:12   ` Alexei Starovoitov
@ 2023-04-23  7:41     ` Hou Tao
  2023-04-27  4:24       ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2023-04-23  7:41 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

Hi,

On 4/22/2023 11:12 AM, Alexei Starovoitov wrote:
> On Sat, Apr 08, 2023 at 10:18:46PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Currently the freed objects in bpf memory allocator may be reused
>> immediately by new allocation, it introduces use-after-bpf-ma-free
>> problem for non-preallocated hash map and makes lookup procedure
>> return incorrect result. The immediate reuse also makes introducing
>> new use case more difficult (e.g. qp-trie).
>>
>> So introduce BPF_MA_REUSE_AFTER_RCU_GP to solve these problems. For
>> BPF_MA_REUSE_AFTER_GP, the freed objects are reused only after one RCU
>> grace period and may be returned back to slab system after another
>> RCU-tasks-trace grace period. So for bpf programs which care about reuse
>> problem, these programs can use bpf_rcu_read_{lock,unlock}() to access
>> these freed objects safely and for those which doesn't care, there will
>> be safely use-after-bpf-ma-free because these objects have not been
>> freed by bpf memory allocator.
>>
>> To make these freed elements being reusab quickly, BPF_MA_REUSE_AFTER_GP
>> dynamically allocates memory to create many inflight RCU callbacks to
>> mark these freed element being reusable. These memories used for
>> bpf_reuse_batch will be freed when these RCU callbacks complete. When no
>> memory is available, synchronize_rcu_expedited() will be used to make
>> these freed element reusable. In order to reduce the risk of OOM, part
>> of these reusable memory will be freed through RCU-tasks-trace grace
>> period. Before these freeing memories are freed, these memories are also
>> available for reuse.
>>
>> The following are the benchmark results when comparing between different
>> flavors of bpf memory allocator. These results show:
>> * The performance of reuse-after-rcu-gp bpf ma is good than no bpf ma.
>>   Its memory usage is also good than no bpf ma except for
>>   add_del_on_diff_cpu case.
>> * The memory usage of reuse-after-rcu-gp bpf ma increases a lot compared
>>   with normal bpf ma.
>> * The memory usage of free-after-rcu-gp bpf ma is better than
>>   reuse-after-rcu-gp bpf ma, but its performance is bad than
>>   reuse-after-ruc-gp because it doesn't do reuse.
>>
>> (1) no bpf memory allocator (v6.0.19)
> meaning that htab is using kmalloc and call_rcu to free, right?
Yes.
>
>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>> | --                  | --         | --                   | --                |
>> | no_op               | 1187       | 1.05                 | 1.05              |
>> | overwrite           | 3.74       | 32.52                | 84.18             |
>> | batch_add_batch_del | 2.23       | 26.38                | 48.75             |
>> | add_del_on_diff_cpu | 3.92       | 33.72                | 48.96             |
>>
>> (2) normal bpf memory allocator
>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>> | --                  | --         | --                   | --                |
>> | no_op               | 1187       | 0.96                 | 1.00              |
>> | overwrite           | 27.12      | 2.5                  | 2.99              |
>> | batch_add_batch_del | 8.9        | 2.77                 | 3.24              |
>> | add_del_on_diff_cpu | 11.30      | 218.54               | 440.37            |
>>
>> (3) reuse-after-rcu-gp bpf memory allocator
> that's the one you're implementing below, right?
Right.
>
>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>> | --                  | --         | --                   | --                |
>> | no_op               | 1276       | 0.96                 | 1.00              |
>> | overwrite           | 15.66      | 25.00                | 33.07             |
>> | batch_add_batch_del | 10.32      | 18.84                | 22.64             |
>> | add_del_on_diff_cpu | 13.00      | 550.50               | 748.74            |
>>
>> (4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)
> What do you mean? htab uses bpf_ma, but does call_rcu before doing bpf_mem_free ?
No, there is no call_rcu() before bpf_mem_free(). bpf_mem_free() in
free-after-rcu-gp flavor will do call_rcu() in batch to free these elements back
to slab subsystem directly. The elements in this flavor of bpf_ma is not safe
for access from sleepable program except bpf_rcu_read_{lock,unlock}() are used.

But I think using call_rcu() to call bpf_mem_free() is good candidate for
comparison and I saw bpf_cpumask does that, so I modify bpf hash table to do the
similar thing and paste the benchmark result. As we can seen from the result,
the memory usage for such flavor is much bigger than reuse-after-rcu-gp and
free-after-rcu-gp:

* use call_rcu() to call bpf_mem_free()

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1273       | 0.99                 | 1.00              |
| overwrite           | 12.52      | 432.57               | 653.32            |
| batch_add_batch_del | 9.21       | 272.81               | 436.07            |
| add_del_on_diff_cpu | 14.45      | 681.58               | 881.92            |

>
>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>> | --                  | --         | --                   | --                |
>> | no_op               | 1263       | 0.96                 | 1.00              |
>> | overwrite           | 10.73      | 12.33                | 20.32             |
>> | batch_add_batch_del | 7.02       | 9.45                 | 14.07             |
>> | add_del_on_diff_cpu | 8.99       | 131.64               | 204.42            |
> Depending on what we care about all the extra complexity in bpf_ma with reuse-after-rcu-gp
> buys us a bit better perf, but many times worse memory consumption?
No. As the above benchmark result shows, the memory consumption of
reuse-after-rcu-gp or free-after-rcu-gp is better than
calling-bpf_mem_free()-in-call_rcu(). The memory consumption of
free-after-rcu-gp is better and its implementation is also simpler than
reuse-after-rcu-gp.
>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  include/linux/bpf_mem_alloc.h |   1 +
>>  kernel/bpf/memalloc.c         | 353 +++++++++++++++++++++++++++++++---
>>  2 files changed, 326 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
>> index 148347950e16..e7f68432713b 100644
>> --- a/include/linux/bpf_mem_alloc.h
>> +++ b/include/linux/bpf_mem_alloc.h
>> @@ -18,6 +18,7 @@ struct bpf_mem_alloc {
>>  /* flags for bpf_mem_alloc_init() */
>>  enum {
>>  	BPF_MA_PERCPU = 1U << 0,
>> +	BPF_MA_REUSE_AFTER_RCU_GP = 1U << 1,
>>  };
>>  
>>  /* 'size != 0' is for bpf_mem_alloc which manages fixed-size objects.
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 072102476019..262100f89610 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -63,6 +63,10 @@ static u8 size_index[24] __ro_after_init = {
>>  	2	/* 192 */
>>  };
>>  
>> +static struct workqueue_struct *bpf_ma_wq;
>> +
>> +static void bpf_ma_prepare_reuse_work(struct work_struct *work);
>> +
>>  static int bpf_mem_cache_idx(size_t size)
>>  {
>>  	if (!size || size > 4096)
>> @@ -98,18 +102,36 @@ struct bpf_mem_cache {
>>  	int free_cnt;
>>  	int low_watermark, high_watermark, batch;
>>  	int percpu_size;
>> +	int cpu;
>>  	unsigned int flags;
>>  
>> +	raw_spinlock_t reuse_lock;
>> +	bool abort_reuse;
>> +	struct llist_head reuse_ready_head;
>> +	struct llist_node *reuse_ready_tail;
>> +	struct llist_head wait_for_free;
>> +	struct llist_head prepare_reuse_head;
>> +	struct llist_node *prepare_reuse_tail;
>> +	unsigned int prepare_reuse_cnt;
>> +	atomic_t reuse_cb_in_progress;
>> +	struct work_struct reuse_work;
>> +
>>  	struct rcu_head rcu;
>>  	struct llist_head free_by_rcu;
>>  	struct llist_head waiting_for_gp;
>> -	atomic_t call_rcu_in_progress;
>> +	atomic_t free_cb_in_progress;
>>  };
>>  
>>  struct bpf_mem_caches {
>>  	struct bpf_mem_cache cache[NUM_CACHES];
>>  };
>>  
>> +struct bpf_reuse_batch {
>> +	struct bpf_mem_cache *c;
>> +	struct llist_node *head, *tail;
>> +	struct rcu_head rcu;
>> +};
>> +
>>  static struct llist_node notrace *__llist_del_first(struct llist_head *head)
>>  {
>>  	struct llist_node *entry, *next;
>> @@ -154,6 +176,45 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
>>  #endif
>>  }
>>  
>> +static void *bpf_ma_get_reusable_obj(struct bpf_mem_cache *c)
>> +{
>> +	if (c->flags & BPF_MA_REUSE_AFTER_RCU_GP) {
>> +		unsigned long flags;
>> +		void *obj;
>> +
>> +		if (llist_empty(&c->reuse_ready_head) && llist_empty(&c->wait_for_free))
>> +			return NULL;
>> +
>> +		/* reuse_ready_head and wait_for_free may be manipulated by
>> +		 * kworker and RCU callbacks.
>> +		 */
>> +		raw_spin_lock_irqsave(&c->reuse_lock, flags);
>> +		obj = __llist_del_first(&c->reuse_ready_head);
>> +		if (obj) {
>> +			if (llist_empty(&c->reuse_ready_head))
>> +				c->reuse_ready_tail = NULL;
>> +		} else {
>> +			obj = __llist_del_first(&c->wait_for_free);
>> +		}
>> +		raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
>> +		return obj;
>> +	}
>> +
>> +	/*
>> +	 * free_by_rcu is only manipulated by irq work refill_work().
>> +	 * IRQ works on the same CPU are called sequentially, so it is
>> +	 * safe to use __llist_del_first() here. If alloc_bulk() is
>> +	 * invoked by the initial prefill, there will be no running
>> +	 * refill_work(), so __llist_del_first() is fine as well.
>> +	 *
>> +	 * In most cases, objects on free_by_rcu are from the same CPU.
>> +	 * If some objects come from other CPUs, it doesn't incur any
>> +	 * harm because NUMA_NO_NODE means the preference for current
>> +	 * numa node and it is not a guarantee.
>> +	 */
>> +	return __llist_del_first(&c->free_by_rcu);
>> +}
>> +
>>  /* Mostly runs from irq_work except __init phase. */
>>  static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>  {
>> @@ -165,19 +226,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>  	memcg = get_memcg(c);
>>  	old_memcg = set_active_memcg(memcg);
>>  	for (i = 0; i < cnt; i++) {
>> -		/*
>> -		 * free_by_rcu is only manipulated by irq work refill_work().
>> -		 * IRQ works on the same CPU are called sequentially, so it is
>> -		 * safe to use __llist_del_first() here. If alloc_bulk() is
>> -		 * invoked by the initial prefill, there will be no running
>> -		 * refill_work(), so __llist_del_first() is fine as well.
>> -		 *
>> -		 * In most cases, objects on free_by_rcu are from the same CPU.
>> -		 * If some objects come from other CPUs, it doesn't incur any
>> -		 * harm because NUMA_NO_NODE means the preference for current
>> -		 * numa node and it is not a guarantee.
>> -		 */
>> -		obj = __llist_del_first(&c->free_by_rcu);
>> +		obj = bpf_ma_get_reusable_obj(c);
>>  		if (!obj) {
>>  			/* Allocate, but don't deplete atomic reserves that typical
>>  			 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
>> @@ -236,7 +285,7 @@ static void __free_rcu(struct rcu_head *head)
>>  	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>  
>>  	free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
>> -	atomic_set(&c->call_rcu_in_progress, 0);
>> +	atomic_set(&c->free_cb_in_progress, 0);
>>  }
>>  
>>  static void __free_rcu_tasks_trace(struct rcu_head *head)
>> @@ -264,7 +313,7 @@ static void do_call_rcu(struct bpf_mem_cache *c)
>>  {
>>  	struct llist_node *llnode, *t;
>>  
>> -	if (atomic_xchg(&c->call_rcu_in_progress, 1))
>> +	if (atomic_xchg(&c->free_cb_in_progress, 1))
>>  		return;
>>  
>>  	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
>> @@ -409,6 +458,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
>>  			c->objcg = objcg;
>>  			c->percpu_size = percpu_size;
>>  			c->flags = flags;
>> +			c->cpu = cpu;
>> +			INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
>>  			prefill_mem_cache(c, cpu);
>>  		}
>>  		ma->cache = pc;
>> @@ -433,6 +484,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
>>  			c->unit_size = sizes[i];
>>  			c->objcg = objcg;
>>  			c->flags = flags;
>> +			c->cpu = cpu;
>> +			INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
>>  			prefill_mem_cache(c, cpu);
>>  		}
>>  	}
>> @@ -444,18 +497,40 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
>>  static void drain_mem_cache(struct bpf_mem_cache *c)
>>  {
>>  	bool percpu = !!c->percpu_size;
>> +	struct llist_node *head[3];
>> +	unsigned long flags;
>>  
>>  	/* No progs are using this bpf_mem_cache, but htab_map_free() called
>>  	 * bpf_mem_cache_free() for all remaining elements and they can be in
>>  	 * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
>>  	 *
>> -	 * Except for waiting_for_gp list, there are no concurrent operations
>> -	 * on these lists, so it is safe to use __llist_del_all().
>> +	 * Except for waiting_for_gp and free_llist_extra list, there are no
>> +	 * concurrent operations on these lists, so it is safe to use
>> +	 * __llist_del_all().
>>  	 */
>>  	free_all(__llist_del_all(&c->free_by_rcu), percpu);
>>  	free_all(llist_del_all(&c->waiting_for_gp), percpu);
>>  	free_all(__llist_del_all(&c->free_llist), percpu);
>> -	free_all(__llist_del_all(&c->free_llist_extra), percpu);
>> +	free_all(llist_del_all(&c->free_llist_extra), percpu);
>> +
>> +	if (!(c->flags & BPF_MA_REUSE_AFTER_RCU_GP))
>> +		return;
>> +
>> +	raw_spin_lock_irqsave(&c->reuse_lock, flags);
>> +	/* Indicate kworker and RCU callback to free elements directly
>> +	 * instead of adding new elements into these lists.
>> +	 */
>> +	c->abort_reuse = true;
>> +	head[0] = __llist_del_all(&c->prepare_reuse_head);
>> +	c->prepare_reuse_tail = NULL;
>> +	head[1] = __llist_del_all(&c->reuse_ready_head);
>> +	c->reuse_ready_tail = NULL;
>> +	head[2] = __llist_del_all(&c->wait_for_free);
>> +	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
>> +
>> +	free_all(head[0], percpu);
>> +	free_all(head[1], percpu);
>> +	free_all(head[2], percpu);
>>  }
>>  
>>  static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
>> @@ -466,10 +541,39 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
>>  	ma->caches = NULL;
>>  }
>>  
>> +static void bpf_ma_cancel_reuse_work(struct bpf_mem_alloc *ma)
>> +{
>> +	struct bpf_mem_caches *cc;
>> +	struct bpf_mem_cache *c;
>> +	int cpu, i;
>> +
>> +	if (ma->cache) {
>> +		for_each_possible_cpu(cpu) {
>> +			c = per_cpu_ptr(ma->cache, cpu);
>> +			cancel_work_sync(&c->reuse_work);
>> +		}
>> +	}
>> +	if (ma->caches) {
>> +		for_each_possible_cpu(cpu) {
>> +			cc = per_cpu_ptr(ma->caches, cpu);
>> +			for (i = 0; i < NUM_CACHES; i++) {
>> +				c = &cc->cache[i];
>> +				cancel_work_sync(&c->reuse_work);
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>  static void free_mem_alloc(struct bpf_mem_alloc *ma)
>>  {
>> -	/* waiting_for_gp lists was drained, but __free_rcu might
>> -	 * still execute. Wait for it now before we freeing percpu caches.
>> +	bool reuse_after_rcu_gp = ma->flags & BPF_MA_REUSE_AFTER_RCU_GP;
>> +
>> +	/* Cancel the inflight kworkers */
>> +	if (reuse_after_rcu_gp)
>> +		bpf_ma_cancel_reuse_work(ma);
>> +
>> +	/* For normal bpf ma, waiting_for_gp lists was drained, but __free_rcu
>> +	 * might still execute. Wait for it now before we freeing percpu caches.
>>  	 *
>>  	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
>>  	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
>> @@ -477,9 +581,13 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
>>  	 * so if call_rcu(head, __free_rcu) is skipped due to
>>  	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
>>  	 * using rcu_trace_implies_rcu_gp() as well.
>> +	 *
>> +	 * For reuse-after-rcu-gp bpf ma, use rcu_barrier_tasks_trace() to
>> +	 * wait for the pending bpf_ma_free_reusable_cb() and use rcu_barrier()
>> +	 * to wait for the pending bpf_ma_reuse_cb().
>>  	 */
>>  	rcu_barrier_tasks_trace();
>> -	if (!rcu_trace_implies_rcu_gp())
>> +	if (reuse_after_rcu_gp || !rcu_trace_implies_rcu_gp())
>>  		rcu_barrier();
>>  	free_mem_alloc_no_barrier(ma);
>>  }
>> @@ -512,6 +620,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
>>  	}
>>  
>>  	/* Defer barriers into worker to let the rest of map memory to be freed */
>> +	copy->flags = ma->flags;
>>  	copy->cache = ma->cache;
>>  	ma->cache = NULL;
>>  	copy->caches = ma->caches;
>> @@ -541,7 +650,9 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>  			 */
>>  			irq_work_sync(&c->refill_work);
>>  			drain_mem_cache(c);
>> -			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>> +			rcu_in_progress += atomic_read(&c->free_cb_in_progress);
>> +			/* Pending kworkers or RCU callbacks */
>> +			rcu_in_progress += atomic_read(&c->reuse_cb_in_progress);
>>  		}
>>  		/* objcg is the same across cpus */
>>  		if (c->objcg)
>> @@ -556,7 +667,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>  				c = &cc->cache[i];
>>  				irq_work_sync(&c->refill_work);
>>  				drain_mem_cache(c);
>> -				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>> +				rcu_in_progress += atomic_read(&c->free_cb_in_progress);
>> +				rcu_in_progress += atomic_read(&c->reuse_cb_in_progress);
>>  			}
>>  		}
>>  		if (c->objcg)
>> @@ -600,18 +712,183 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>  	return llnode;
>>  }
>>  
>> +static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c, struct llist_node *head,
>> +					      struct llist_node *tail)
>> +{
>> +	unsigned long flags;
>> +	bool abort;
>> +
>> +	raw_spin_lock_irqsave(&c->reuse_lock, flags);
>> +	abort = c->abort_reuse;
>> +	if (!abort) {
>> +		if (llist_empty(&c->reuse_ready_head))
>> +			c->reuse_ready_tail = tail;
>> +		__llist_add_batch(head, tail, &c->reuse_ready_head);
>> +	}
>> +	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
>> +
>> +	/* Don't move these objects to reuse_ready list and free
>> +	 * these objects directly.
>> +	 */
>> +	if (abort)
>> +		free_all(head, !!c->percpu_size);
>> +}
>> +
>> +static void bpf_ma_reuse_cb(struct rcu_head *rcu)
>> +{
>> +	struct bpf_reuse_batch *batch = container_of(rcu, struct bpf_reuse_batch, rcu);
>> +	struct bpf_mem_cache *c = batch->c;
>> +
>> +	bpf_ma_add_to_reuse_ready_or_free(c, batch->head, batch->tail);
>> +	atomic_dec(&c->reuse_cb_in_progress);
>> +	kfree(batch);
>> +}
>> +
>> +static bool bpf_ma_try_free_reuse_objs(struct bpf_mem_cache *c)
>> +{
>> +	struct llist_node *head, *tail;
>> +	bool do_free;
>> +
>> +	if (llist_empty(&c->reuse_ready_head))
>> +		return false;
>> +
>> +	do_free = !atomic_xchg(&c->free_cb_in_progress, 1);
>> +	if (!do_free)
>> +		return false;
>> +
>> +	head = __llist_del_all(&c->reuse_ready_head);
>> +	tail = c->reuse_ready_tail;
>> +	c->reuse_ready_tail = NULL;
>> +
>> +	__llist_add_batch(head, tail, &c->wait_for_free);
>> +
>> +	return true;
>> +}
>> +
>> +static void bpf_ma_free_reusable_cb(struct rcu_head *rcu)
>> +{
>> +	struct bpf_mem_cache *c = container_of(rcu, struct bpf_mem_cache, rcu);
>> +	struct llist_node *head;
>> +	unsigned long flags;
>> +
>> +	raw_spin_lock_irqsave(&c->reuse_lock, flags);
>> +	head = __llist_del_all(&c->wait_for_free);
>> +	raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
>> +
>> +	free_all(head, !!c->percpu_size);
>> +	atomic_set(&c->free_cb_in_progress, 0);
>> +}
>> +
>> +static void bpf_ma_prepare_reuse_work(struct work_struct *work)
>> +{
>> +	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, reuse_work);
>> +	struct llist_node *head, *tail, *llnode, *tmp;
>> +	struct bpf_reuse_batch *batch;
>> +	unsigned long flags;
>> +	bool do_free;
>> +
>> +	local_irq_save(flags);
>> +	/* When CPU is offline, the running CPU may be different with
>> +	 * the CPU which submitted the work. When these two CPUs are the same,
>> +	 * kworker may be interrupted by NMI, so increase active to protect
>> +	 * again such concurrency.
>> +	 */
>> +	if (c->cpu == smp_processor_id())
>> +		WARN_ON_ONCE(local_inc_return(&c->active) != 1);
>> +	raw_spin_lock(&c->reuse_lock);
>> +	head = __llist_del_all(&c->prepare_reuse_head);
>> +	tail = c->prepare_reuse_tail;
>> +	c->prepare_reuse_tail = NULL;
>> +	c->prepare_reuse_cnt = 0;
>> +	if (c->cpu == smp_processor_id())
>> +		local_dec(&c->active);
>> +
>> +	/* Try to free elements in reusable list. Before these elements are
>> +	 * freed in RCU cb, these element will still be available for reuse.
>> +	 */
>> +	do_free = bpf_ma_try_free_reuse_objs(c);
>> +	raw_spin_unlock(&c->reuse_lock);
>> +	local_irq_restore(flags);
>> +
>> +	if (do_free)
>> +		call_rcu_tasks_trace(&c->rcu, bpf_ma_free_reusable_cb);
>> +
>> +	llist_for_each_safe(llnode, tmp, llist_del_all(&c->free_llist_extra)) {
>> +		if (!head)
>> +			tail = llnode;
>> +		llnode->next = head;
>> +		head = llnode->next;
>> +	}
>> +	/* Draining is in progress ? */
>> +	if (!head) {
>> +		/* kworker completes and no RCU callback */
>> +		atomic_dec(&c->reuse_cb_in_progress);
>> +		return;
>> +	}
>> +
>> +	batch = kmalloc(sizeof(*batch), GFP_KERNEL);
>> +	if (!batch) {
>> +		synchronize_rcu_expedited();
>> +		bpf_ma_add_to_reuse_ready_or_free(c, head, tail);
>> +		/* kworker completes and no RCU callback */
>> +		atomic_dec(&c->reuse_cb_in_progress);
>> +		return;
>> +	}
>> +
>> +	batch->c = c;
>> +	batch->head = head;
>> +	batch->tail = tail;
>> +	call_rcu(&batch->rcu, bpf_ma_reuse_cb);
>> +}
>> +
>> +static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
>> +{
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	/* In case a NMI-context bpf program is also freeing object. */
>> +	if (local_inc_return(&c->active) == 1) {
>> +		bool try_queue_work = false;
>> +
>> +		/* kworker may remove elements from prepare_reuse_head */
>> +		raw_spin_lock(&c->reuse_lock);
>> +		if (llist_empty(&c->prepare_reuse_head))
>> +			c->prepare_reuse_tail = llnode;
>> +		__llist_add(llnode, &c->prepare_reuse_head);
>> +		if (++c->prepare_reuse_cnt > c->high_watermark) {
>> +			/* Zero out prepare_reuse_cnt early to prevent
>> +			 * unnecessary queue_work().
>> +			 */
>> +			c->prepare_reuse_cnt = 0;
>> +			try_queue_work = true;
>> +		}
>> +		raw_spin_unlock(&c->reuse_lock);
>> +
>> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
>> +			/* Use reuse_cb_in_progress to indicate there is
>> +			 * inflight reuse kworker or reuse RCU callback.
>> +			 */
>> +			atomic_inc(&c->reuse_cb_in_progress);
>> +			/* Already queued */
>> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))
> how many kthreads are spawned by wq in the peak?
I think it depends on the number of bpf_ma. Because bpf_ma_wq is per-CPU
workqueue, so for each bpf_ma, there is at most one worker for each CPU. And now
the limit for the number of active workers on each CPU is 256, but it is
customizable through alloc_workqueue() API.
>
>> +				atomic_dec(&c->reuse_cb_in_progress);
>> +		}
>> +	} else {
>> +		llist_add(llnode, &c->free_llist_extra);
>> +	}
>> +	local_dec(&c->active);
>> +	local_irq_restore(flags);
>> +}
>> +
>>  /* Though 'ptr' object could have been allocated on a different cpu
>>   * add it to the free_llist of the current cpu.
>>   * Let kfree() logic deal with it when it's later called from irq_work.
>>   */
>> -static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
>> +static void notrace immediate_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
>>  {
>> -	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
>>  	unsigned long flags;
>>  	int cnt = 0;
>>  
>> -	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
>> -
>>  	local_irq_save(flags);
>>  	if (local_inc_return(&c->active) == 1) {
>>  		__llist_add(llnode, &c->free_llist);
>> @@ -633,6 +910,18 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
>>  		irq_work_raise(c);
>>  }
>>  
>> +static inline void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
>> +{
>> +	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
>> +
>> +	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
>> +
>> +	if (c->flags & BPF_MA_REUSE_AFTER_RCU_GP)
>> +		wait_gp_reuse_free(c, llnode);
>> +	else
>> +		immediate_reuse_free(c, llnode);
>> +}
>> +
>>  /* Called from BPF program or from sys_bpf syscall.
>>   * In both cases migration is disabled.
>>   */
>> @@ -724,3 +1013,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
>>  
>>  	return !ret ? NULL : ret + LLIST_NODE_SZ;
>>  }
>> +
>> +static int __init bpf_ma_init(void)
>> +{
>> +	bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
>> +	BUG_ON(!bpf_ma_wq);
>> +	return 0;
>> +}
>> +late_initcall(bpf_ma_init);
>> -- 
>> 2.29.2
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator
  2023-04-22  2:59   ` Alexei Starovoitov
  2023-04-23  1:55     ` Hou Tao
@ 2023-04-23  8:03     ` Hou Tao
  1 sibling, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-04-23  8:03 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

Hi,

On 4/22/2023 10:59 AM, Alexei Starovoitov wrote:
> On Sat, Apr 08, 2023 at 10:18:43PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> The benchmark could be used to compare the performance of hash map
>> operations and the memory usage between different flavors of bpf memory
>> allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It also
>> could be used to check the performance improvement or the memory saving
>> of bpf memory allocator optimization and check whether or not a specific
>> use case is suitable for bpf memory allocator.
>>
>> The benchmark creates a non-preallocated hash map which uses bpf memory
>> allocator and shows the operation performance and the memory usage of
>> the hash map under different use cases:
>> (1) no_op
>> Only create the hash map and there is no operations on hash map. It is
>> used as the baseline. When each CPUs complete the iteartion of
>> nonoverlapping part of hash map, the loop count is increased.
>> (2) overwrite
>> Each CPU overwrites nonoverlapping part of hash map. When each CPU
>> completes one round of iteration, the loop count is increased.
>> (3) batch_add_batch_del
>> Each CPU adds then deletes nonoverlapping part of hash map in batch.
>> When each CPU completes one round of iteration, the loop count is
>> increased.
>> (4) add_del_on_diff_cpu
>> Each two CPUs add and delete nonoverlapping part of map concurrently.
>> When each CPU completes one round of iteration, the loop count is
>> increased.
>>
>> The following benchmark results show that bpf memory allocator doesn't
>> handle add_del_on_diff_cpu scenario very well. Because map deletion
>> always happen on a different CPU than the map addition and the freed
>> memory can never be reused.
SNIP
>> +
>> +SEC("?tp/syscalls/sys_enter_getpgid")
>> +int add_del_on_diff_cpu(void *ctx)
>> +{
>> +	struct update_ctx update;
>> +	unsigned int from;
>> +
>> +	from = bpf_get_smp_processor_id();
>> +	update.from = from / 2;
>> +	update.step = nr_thread / 2;
>> +	update.max = nr_entries;
>> +
>> +	if (from & 1)
>> +		bpf_loop(update.max, newwrite_htab, &update, 0);
>> +	else
>> +		bpf_loop(update.max, del_htab, &update, 0);
> This is oddly shaped test.
> deleter cpu may run ahead of newwrite_htab.
> deleter will try to delete elems that don't exist.
> Loop of few thousand iterations is not a lot for one cpu to run ahead.
>
> Each loop will run 16k times and every time you step += 4.
> So 3/4 of these 16k runs it will be hitting if (ctx->from >= ctx->max) condition.
> What are you measuring?
I think it would be better to synchronize between deletion CPU and addition CPU.
Will fix it.
>
>> +
>> +	__sync_fetch_and_add(&loop_cnt, 1);
>> +	return 0;
>> +}
>> -- 
>> 2.29.2
>>
> .


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator
  2023-04-23  1:55     ` Hou Tao
@ 2023-04-27  4:20       ` Alexei Starovoitov
  2023-04-27 13:46         ` Paul E. McKenney
  2023-04-28  2:16         ` Hou Tao
  0 siblings, 2 replies; 17+ messages in thread
From: Alexei Starovoitov @ 2023-04-27  4:20 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

On Sun, Apr 23, 2023 at 09:55:24AM +0800, Hou Tao wrote:
> >
> >> ./bench htab-mem --use-case $name --max-entries 16384 \
> >> 	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7
> >>
> >> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> >> | --                  | --         | --                   | --                |
> >> | no_op               | 1129       | 1.15                 | 1.15              |
> >> | overwrite           | 24.37      | 2.07                 | 2.97              |
> >> | batch_add_batch_del | 10.58      | 2.91                 | 3.36              |
> >> | add_del_on_diff_cpu | 13.14      | 380.66               | 633.99            |
> > large mem for diff_cpu case needs to be investigated.
> The main reason is that tasks-trace RCU GP is slow and there is only one
> inflight free callback, so the CPUs which only do element addition will allocate
> new memory from slab continuously and the CPUs which only do element deletion
> will free these elements continuously through call_tasks_trace_rcu(), but due to
> the slowness of tasks-trace RCU GP, these freed elements could not be freed back
> to slab subsystem timely.

I see. Now it makes sense. It's slow call_tasks_trace_rcu and not at all "memory can never be reused."
Please explain things clearly in commit log.

> >> +{
> >> +	__u64 *value;
> >> +
> >> +	if (ctx->from >= ctx->max)
> >> +		return 1;
> >> +
> >> +	value = bpf_map_lookup_elem(&array, &ctx->from);
> >> +	if (value)
> >> +		bpf_map_update_elem(&htab, &ctx->from, value, flags);
> > What is a point of doing lookup from giant array of en element with zero value
> > to copy it into htab?
> > Why not to use single zero inited elem for all htab ops?
> I want to check how does the different size of value effect the benchmark
> result, so I choose a variable-size value.

Not following. All elements of the array have the same size.
Are you saying you were not able to figure out how to supply a single 'value'
of variable size? Try array of max_entries=1.
Do not do unnecessary and confusing bpf_map_lookup_elem(&array, &ctx->from);.

> >
> > Each loop will run 16k times and every time you step += 4.
> > So 3/4 of these 16k runs it will be hitting if (ctx->from >= ctx->max) condition.
> > What are you measuring?
> As explained in the commit message, I am trying to let different deletion and
> deletion CPU pairs operate on the different subsets of hash-table elements.
> Assuming there are 16 elements in the htab and there are 8 CPUs and 8 threads,
> the following is the operation subset for each CPU:
> 
> CPU 0:  0 4 8 12 (do deletion)
> CPU 1:  0 4 8 12 (do addition)
> 
> CPU 2:  1 5 9 13
> CPU 3:  1 5 9 13
> 
> CPU 4:  2 6 10 14
> CPU 5:  2 6 10 14
> 
> CPU 6:  3 7 11 15
> CPU 7:  3 7 11 15

That part is clear, but

> >> +	__sync_fetch_and_add(&loop_cnt, 1);

this doesn't match the rest. loop_cnt is inremented 4 times faster.
So it's not comparable to other tests.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
  2023-04-23  7:41     ` Hou Tao
@ 2023-04-27  4:24       ` Alexei Starovoitov
  2023-04-28  2:24         ` Hou Tao
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2023-04-27  4:24 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

On Sun, Apr 23, 2023 at 03:41:05PM +0800, Hou Tao wrote:
> >>
> >> (3) reuse-after-rcu-gp bpf memory allocator
> > that's the one you're implementing below, right?
> Right.
> >
> >> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> >> | --                  | --         | --                   | --                |
> >> | no_op               | 1276       | 0.96                 | 1.00              |
> >> | overwrite           | 15.66      | 25.00                | 33.07             |
> >> | batch_add_batch_del | 10.32      | 18.84                | 22.64             |
> >> | add_del_on_diff_cpu | 13.00      | 550.50               | 748.74            |
> >>
> >> (4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)
> > What do you mean? htab uses bpf_ma, but does call_rcu before doing bpf_mem_free ?
> No, there is no call_rcu() before bpf_mem_free(). bpf_mem_free() in
> free-after-rcu-gp flavor will do call_rcu() in batch to free these elements back
> to slab subsystem directly. The elements in this flavor of bpf_ma is not safe
> for access from sleepable program except bpf_rcu_read_{lock,unlock}() are used.
> 
> But I think using call_rcu() to call bpf_mem_free() is good candidate for
> comparison and I saw bpf_cpumask does that, so I modify bpf hash table to do the
> similar thing and paste the benchmark result. As we can seen from the result,
> the memory usage for such flavor is much bigger than reuse-after-rcu-gp and
> free-after-rcu-gp:

I don't follow what exactly you're doing and what you're measuring.
Please provide patches for both reuse-after-rcu-gp and free-after-rcu-gp to
have meaningful conversation.
Rigth now we're stuck at what bench tool is actually measuring.

> >> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
> >> +			/* Use reuse_cb_in_progress to indicate there is
> >> +			 * inflight reuse kworker or reuse RCU callback.
> >> +			 */
> >> +			atomic_inc(&c->reuse_cb_in_progress);
> >> +			/* Already queued */
> >> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))
> > how many kthreads are spawned by wq in the peak?
> I think it depends on the number of bpf_ma. Because bpf_ma_wq is per-CPU
> workqueue, so for each bpf_ma, there is at most one worker for each CPU. And now
> the limit for the number of active workers on each CPU is 256, but it is
> customizable through alloc_workqueue() API.

Which means that on 8 cpu system there will be 8 * 256 kthreads ?
That's a lot. Please provide num_of_all_threads before/after/at_peak during bench.

Pls trim your replies. Mailers like mutt have a hard time navigating.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator
  2023-04-27  4:20       ` Alexei Starovoitov
@ 2023-04-27 13:46         ` Paul E. McKenney
  2023-04-28  6:13           ` Hou Tao
  2023-04-28  2:16         ` Hou Tao
  1 sibling, 1 reply; 17+ messages in thread
From: Paul E. McKenney @ 2023-04-27 13:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hou Tao, bpf, Martin KaFai Lau, Alexei Starovoitov,
	Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, rcu, houtao1

On Wed, Apr 26, 2023 at 09:20:49PM -0700, Alexei Starovoitov wrote:
> On Sun, Apr 23, 2023 at 09:55:24AM +0800, Hou Tao wrote:
> > >
> > >> ./bench htab-mem --use-case $name --max-entries 16384 \
> > >> 	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7
> > >>
> > >> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> > >> | --                  | --         | --                   | --                |
> > >> | no_op               | 1129       | 1.15                 | 1.15              |
> > >> | overwrite           | 24.37      | 2.07                 | 2.97              |
> > >> | batch_add_batch_del | 10.58      | 2.91                 | 3.36              |
> > >> | add_del_on_diff_cpu | 13.14      | 380.66               | 633.99            |
> > > large mem for diff_cpu case needs to be investigated.
> > The main reason is that tasks-trace RCU GP is slow and there is only one
> > inflight free callback, so the CPUs which only do element addition will allocate
> > new memory from slab continuously and the CPUs which only do element deletion
> > will free these elements continuously through call_tasks_trace_rcu(), but due to
> > the slowness of tasks-trace RCU GP, these freed elements could not be freed back
> > to slab subsystem timely.
> 
> I see. Now it makes sense. It's slow call_tasks_trace_rcu and not at all "memory can never be reused."
> Please explain things clearly in commit log.

Is this a benchmarking issue, or is this happening in real workloads?

If the former, one trick I use in rcutorture's callback-flooding code is
to pass the ready-to-be-freed memory directly back to the allocating CPU.
Which might be what you were getting at with your "maybe stealing from
free_list of other CPUs".

If this is happening in real workloads, then I would like to better
understand that workload.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator
  2023-04-27  4:20       ` Alexei Starovoitov
  2023-04-27 13:46         ` Paul E. McKenney
@ 2023-04-28  2:16         ` Hou Tao
  1 sibling, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-04-28  2:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

Hi,

On 4/27/2023 12:20 PM, Alexei Starovoitov wrote:
> On Sun, Apr 23, 2023 at 09:55:24AM +0800, Hou Tao wrote:
>>>> ./bench htab-mem --use-case $name --max-entries 16384 \
>>>> 	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7
>>>>
>>>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>>>> | --                  | --         | --                   | --                |
>>>> | no_op               | 1129       | 1.15                 | 1.15              |
>>>> | overwrite           | 24.37      | 2.07                 | 2.97              |
>>>> | batch_add_batch_del | 10.58      | 2.91                 | 3.36              |
>>>> | add_del_on_diff_cpu | 13.14      | 380.66               | 633.99            |
>>> large mem for diff_cpu case needs to be investigated.
>> The main reason is that tasks-trace RCU GP is slow and there is only one
>> inflight free callback, so the CPUs which only do element addition will allocate
>> new memory from slab continuously and the CPUs which only do element deletion
>> will free these elements continuously through call_tasks_trace_rcu(), but due to
>> the slowness of tasks-trace RCU GP, these freed elements could not be freed back
>> to slab subsystem timely.
> I see. Now it makes sense. It's slow call_tasks_trace_rcu and not at all "memory can never be reused."
> Please explain things clearly in commit log.
Will fix the commit message.
>
>>>> +{
>>>> +	__u64 *value;
>>>> +
>>>> +	if (ctx->from >= ctx->max)
>>>> +		return 1;
>>>> +
>>>> +	value = bpf_map_lookup_elem(&array, &ctx->from);
>>>> +	if (value)
>>>> +		bpf_map_update_elem(&htab, &ctx->from, value, flags);
>>> What is a point of doing lookup from giant array of en element with zero value
>>> to copy it into htab?
>>> Why not to use single zero inited elem for all htab ops?
>> I want to check how does the different size of value effect the benchmark
>> result, so I choose a variable-size value.
> Not following. All elements of the array have the same size.
> Are you saying you were not able to figure out how to supply a single 'value'
> of variable size? Try array of max_entries=1.
> Do not do unnecessary and confusing bpf_map_lookup_elem(&array, &ctx->from);.
My bad. I misunderstood your meaning. Yes, even though the value size is
variable, but using an array with only one element is enough for this
benchmark.
>
>>> Each loop will run 16k times and every time you step += 4.
>>> So 3/4 of these 16k runs it will be hitting if (ctx->from >= ctx->max) condition.
>>> What are you measuring?
>> As explained in the commit message, I am trying to let different deletion and
>> deletion CPU pairs operate on the different subsets of hash-table elements.
>> Assuming there are 16 elements in the htab and there are 8 CPUs and 8 threads,
>> the following is the operation subset for each CPU:
>>
>> CPU 0:  0 4 8 12 (do deletion)
>> CPU 1:  0 4 8 12 (do addition)
>>
>> CPU 2:  1 5 9 13
>> CPU 3:  1 5 9 13
>>
>> CPU 4:  2 6 10 14
>> CPU 5:  2 6 10 14
>>
>> CPU 6:  3 7 11 15
>> CPU 7:  3 7 11 15
> That part is clear, but
>
>>>> +	__sync_fetch_and_add(&loop_cnt, 1);
> this doesn't match the rest. loop_cnt is inremented 4 times faster.
> So it's not comparable to other tests.
In the previous two cases, loop_cnt is increased when nr_entries /
nr_thread elements are deleted and then added (or opposite). For
add_del_on_diff_cpu case, loop_cnt will be increased twice when
nr_entries / nr_thread * 2 are added and then deleted. So I think the
result is roughly comparable to other tests.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
  2023-04-27  4:24       ` Alexei Starovoitov
@ 2023-04-28  2:24         ` Hou Tao
  0 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-04-28  2:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E . McKenney,
	rcu, houtao1

Hi Alexei,

On 4/27/2023 12:24 PM, Alexei Starovoitov wrote:
> On Sun, Apr 23, 2023 at 03:41:05PM +0800, Hou Tao wrote:
>>>> (3) reuse-after-rcu-gp bpf memory allocator
>>> that's the one you're implementing below, right?
>> Right.
>>>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>>>> | --                  | --         | --                   | --                |
>>>> | no_op               | 1276       | 0.96                 | 1.00              |
>>>> | overwrite           | 15.66      | 25.00                | 33.07             |
>>>> | batch_add_batch_del | 10.32      | 18.84                | 22.64             |
>>>> | add_del_on_diff_cpu | 13.00      | 550.50               | 748.74            |
>>>>
>>>> (4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)
>>> What do you mean? htab uses bpf_ma, but does call_rcu before doing bpf_mem_free ?
>> No, there is no call_rcu() before bpf_mem_free(). bpf_mem_free() in
>> free-after-rcu-gp flavor will do call_rcu() in batch to free these elements back
>> to slab subsystem directly. The elements in this flavor of bpf_ma is not safe
>> for access from sleepable program except bpf_rcu_read_{lock,unlock}() are used.
>>
>> But I think using call_rcu() to call bpf_mem_free() is good candidate for
>> comparison and I saw bpf_cpumask does that, so I modify bpf hash table to do the
>> similar thing and paste the benchmark result. As we can seen from the result,
>> the memory usage for such flavor is much bigger than reuse-after-rcu-gp and
>> free-after-rcu-gp:
> I don't follow what exactly you're doing and what you're measuring.
> Please provide patches for both reuse-after-rcu-gp and free-after-rcu-gp to
> have meaningful conversation.
OK. Will add a new flavor of FREE_AFTER_RCU_GP bpf memory allocator in v3.
> Rigth now we're stuck at what bench tool is actually measuring.
>
>>>> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
>>>> +			/* Use reuse_cb_in_progress to indicate there is
>>>> +			 * inflight reuse kworker or reuse RCU callback.
>>>> +			 */
>>>> +			atomic_inc(&c->reuse_cb_in_progress);
>>>> +			/* Already queued */
>>>> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))
>>> how many kthreads are spawned by wq in the peak?
>> I think it depends on the number of bpf_ma. Because bpf_ma_wq is per-CPU
>> workqueue, so for each bpf_ma, there is at most one worker for each CPU. And now
>> the limit for the number of active workers on each CPU is 256, but it is
>> customizable through alloc_workqueue() API.
> Which means that on 8 cpu system there will be 8 * 256 kthreads ?
> That's a lot. Please provide num_of_all_threads before/after/at_peak during bench.
Yes, 8 * 256 is a lot, but there are at most 8 kworkers during
benchmark, because there is only one bpf_memory_allocator is used.
>
> Pls trim your replies. Mailers like mutt have a hard time navigating.
Do you mean the email content didn't wrap automatically ? or the wrap
length is too lengthy (my current setting is 80) ?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator
  2023-04-27 13:46         ` Paul E. McKenney
@ 2023-04-28  6:13           ` Hou Tao
  0 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2023-04-28  6:13 UTC (permalink / raw)
  To: paulmck, Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, rcu, houtao1

Hi Paul,

On 4/27/2023 9:46 PM, Paul E. McKenney wrote:
> On Wed, Apr 26, 2023 at 09:20:49PM -0700, Alexei Starovoitov wrote:
>> On Sun, Apr 23, 2023 at 09:55:24AM +0800, Hou Tao wrote:
>>>>> ./bench htab-mem --use-case $name --max-entries 16384 \
>>>>> 	--full 50 -d 7 -w 3 --producers=8 --prod-affinity=0-7
>>>>>
>>>>> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>>>>> | --                  | --         | --                   | --                |
>>>>> | no_op               | 1129       | 1.15                 | 1.15              |
>>>>> | overwrite           | 24.37      | 2.07                 | 2.97              |
>>>>> | batch_add_batch_del | 10.58      | 2.91                 | 3.36              |
>>>>> | add_del_on_diff_cpu | 13.14      | 380.66               | 633.99            |
>>>> large mem for diff_cpu case needs to be investigated.
>>> The main reason is that tasks-trace RCU GP is slow and there is only one
>>> inflight free callback, so the CPUs which only do element addition will allocate
>>> new memory from slab continuously and the CPUs which only do element deletion
>>> will free these elements continuously through call_tasks_trace_rcu(), but due to
>>> the slowness of tasks-trace RCU GP, these freed elements could not be freed back
>>> to slab subsystem timely.
>> I see. Now it makes sense. It's slow call_tasks_trace_rcu and not at all "memory can never be reused."
>> Please explain things clearly in commit log.
> Is this a benchmarking issue, or is this happening in real workloads?
It is just a benchmark issue. The add_del_on_diff_cpu case in the
benchmark simulates the hypothetical workload which will do hash map
addition and deletion on different CPUs.
>
> If the former, one trick I use in rcutorture's callback-flooding code is
> to pass the ready-to-be-freed memory directly back to the allocating CPU.
> Which might be what you were getting at with your "maybe stealing from
> free_list of other CPUs".
Thanks, it is a good idea. I could try it later.
>
> If this is happening in real workloads, then I would like to better
> understand that workload.
>
> 							Thanx, Paul
> .


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-04-28  6:14 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-08 14:18 [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
2023-04-08 14:18 ` [RFC bpf-next v2 1/4] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
2023-04-22  2:59   ` Alexei Starovoitov
2023-04-23  1:55     ` Hou Tao
2023-04-27  4:20       ` Alexei Starovoitov
2023-04-27 13:46         ` Paul E. McKenney
2023-04-28  6:13           ` Hou Tao
2023-04-28  2:16         ` Hou Tao
2023-04-23  8:03     ` Hou Tao
2023-04-08 14:18 ` [RFC bpf-next v2 2/4] bpf: Factor out a common helper free_all() Hou Tao
2023-04-08 14:18 ` [RFC bpf-next v2 3/4] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
2023-04-08 14:18 ` [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
2023-04-22  3:12   ` Alexei Starovoitov
2023-04-23  7:41     ` Hou Tao
2023-04-27  4:24       ` Alexei Starovoitov
2023-04-28  2:24         ` Hou Tao
2023-04-21  6:23 ` [RFC bpf-next v2 0/4] " Hou Tao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).