bpf.vger.kernel.org archive mirror
From: Hou Tao <houtao@huaweicloud.com>
To: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	rcu@vger.kernel.org, houtao1@huawei.com
Subject: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
Date: Tue,  6 Jun 2023 11:53:07 +0800	[thread overview]
Message-ID: <20230606035310.4026145-1-houtao@huaweicloud.com> (raw)

From: Hou Tao <houtao1@huawei.com>

Hi,

The implementation of v4 is mainly based on suggestions from Alexei [0].
There are still pending problems in the current implementation, as shown
by the benchmark results in patch #3, but a long time has passed since
the posting of v3, so I am posting v4 here for further discussion and
more suggestions.

The first problem is the huge memory usage compared with the bpf memory
allocator that does immediate reuse:

htab-mem-benchmark (reuse-after-RCU-GP):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1159.18   | 0.99                | 0.99             |
| overwrite          | 11.00     | 2288                | 4109             |
| batch_add_batch_del| 8.86      | 1558                | 2763             |
| add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |

htab-mem-benchmark (immediate-reuse):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1160.66   | 0.99                | 1.00             |
| overwrite          | 28.52     | 2.46                | 2.73             |
| batch_add_batch_del| 11.50     | 2.69                | 2.95             |
| add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |

The direct cause seems to be the slow RCU grace period. During the
benchmark, the elapsed time before the reuse_rcu() callback is invoked
is about 100ms or even more (e.g., 2 seconds). I suspect that the global
per-bpf-ma spin-lock and the irq-work running in the context of the
freeing process increase the overhead of running the bpf program: the
running time of getpgid() increases, context switching slows down, and
the RCU grace period grows [1]. I am still digging into it.

Another problem is the performance degradation compared with immediate
reuse; the output of perf report shows that the per-bpf-ma spin-lock is
the top hotspot:

map_perf_test (reuse-after-RCU-GP)
0:hash_map_perf kmalloc 194677 events per sec

map_perf_test (immediate reuse)
2:hash_map_perf kmalloc 384527 events per sec

The purpose of introducing the per-bpf-ma reusable list is to handle
the case in which allocation and free happen on different CPUs (e.g.,
add_del_on_diff_cpu), while a per-cpu reuse list is enough for the
overwrite and batch_add_batch_del cases. So maybe we could implement a
hybrid of a global reusable list and per-cpu reusable lists, and switch
between the two kinds of list according to the history of allocation
and free frequency.

As usual, suggestions and comments are always welcome.

[0]: https://lore.kernel.org/bpf/20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local
[1]: https://lore.kernel.org/bpf/1b64fc4e-d92e-de2f-4895-2e0c36427425@huaweicloud.com

Change Log:
v4:
 * no kworker (Alexei)
 * Use a global reusable list in bpf memory allocator (Alexei)
 * Remove the BPF_MA_FREE_AFTER_RCU_GP flag and do reuse-after-rcu-gp
   by default in the bpf memory allocator (Alexei)
 * Add benchmark results from map_perf_test (Alexei)

v3: https://lore.kernel.org/bpf/20230429101215.111262-1-houtao@huaweicloud.com/
 * add the BPF_MA_FREE_AFTER_RCU_GP flag to the bpf memory allocator
 * Update htab memory benchmark
   * move the benchmark patch to the last patch
   * remove array and useless bpf_map_lookup_elem(&array, ...) in bpf
     programs
   * add synchronization between addition CPU and deletion CPU for
     add_del_on_diff_cpu case to prevent unnecessary loop
   * add the benchmark result for "extra call_rcu + bpf ma"

v2: https://lore.kernel.org/bpf/20230408141846.1878768-1-houtao@huaweicloud.com/
 * add a benchmark for bpf memory allocator to compare between different
   flavor of bpf memory allocator.
 * implement BPF_MA_REUSE_AFTER_RCU_GP for bpf memory allocator.

v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

Hou Tao (3):
  bpf: Factor out a common helper free_all()
  selftests/bpf: Add benchmark for bpf memory allocator
  bpf: Only reuse after one RCU GP in bpf memory allocator

 include/linux/bpf_mem_alloc.h                 |   4 +
 kernel/bpf/memalloc.c                         | 385 ++++++++++++------
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++
 .../bpf/benchs/run_bench_htab_mem.sh          |  42 ++
 .../selftests/bpf/progs/htab_mem_bench.c      | 135 ++++++
 7 files changed, 809 insertions(+), 116 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

-- 
2.29.2


Thread overview: 32+ messages
2023-06-06  3:53 Hou Tao [this message]
2023-06-06  3:53 ` [RFC PATCH bpf-next v4 1/3] bpf: Factor out a common helper free_all() Hou Tao
2023-06-06  3:53 ` [RFC PATCH bpf-next v4 2/3] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
2023-06-06 21:13   ` Alexei Starovoitov
2023-06-07  1:32     ` Hou Tao
2023-06-06  3:53 ` [RFC PATCH bpf-next v4 3/3] bpf: Only reuse after one RCU GP in " Hou Tao
2023-06-06 12:30 ` [RFC PATCH bpf-next v4 0/3] Handle immediate reuse " Hou Tao
2023-06-06 21:04   ` Alexei Starovoitov
2023-06-07  1:19     ` Hou Tao
2023-06-07  1:39       ` Alexei Starovoitov
2023-06-07  7:56         ` Hou Tao
2023-06-07  8:42     ` Hou Tao
2023-06-07 17:52       ` Alexei Starovoitov
2023-06-07 20:50         ` Alexei Starovoitov
2023-06-07 23:23           ` Alexei Starovoitov
2023-06-07 23:30             ` Paul E. McKenney
2023-06-07 23:50               ` Alexei Starovoitov
2023-06-08  0:13                 ` Paul E. McKenney
2023-06-08  0:34                   ` Alexei Starovoitov
2023-06-08  1:15                     ` Paul E. McKenney
2023-06-08  3:35                     ` Hou Tao
2023-06-08  4:30                       ` Hou Tao
2023-06-08  4:30                       ` Alexei Starovoitov
2023-06-08  1:57             ` Hou Tao
2023-06-08  1:51           ` Hou Tao
2023-06-08  2:55             ` Paul E. McKenney
2023-06-08  3:43               ` Hou Tao
2023-06-08 16:18                 ` Paul E. McKenney
2023-06-09  3:02                   ` Hou Tao
2023-06-09 14:30                     ` Paul E. McKenney
2023-06-12  2:03                       ` Hou Tao
2023-06-12  3:40                         ` Paul E. McKenney
