From: Hou Tao <houtao@huaweicloud.com>
To: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
Daniel Borkmann <daniel@iogearbox.net>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
"Paul E . McKenney" <paulmck@kernel.org>,
rcu@vger.kernel.org, "houtao1@huawei.com" <houtao1@huawei.com>
Subject: Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
Date: Tue, 6 Jun 2023 20:30:58 +0800 [thread overview]
Message-ID: <f0e77d34-7459-8375-d844-4b0c8d79eb8f@huaweicloud.com> (raw)
In-Reply-To: <20230606035310.4026145-1-houtao@huaweicloud.com>
Hi,
On 6/6/2023 11:53 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The implementation of v4 is mainly based on suggestions from Alexei [0].
> There are still pending problems in the current implementation, as shown
> in the benchmark results in patch #3, but a long time has passed since
> the posting of v3, so I am posting v4 here for further discussion and
> more suggestions.
>
> The first problem is the huge memory usage compared with the bpf memory
> allocator flavor which does immediate reuse:
>
> htab-mem-benchmark (reuse-after-RCU-GP):
> | name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | -- | -- | -- | -- |
> | no_op | 1159.18 | 0.99 | 0.99 |
> | overwrite | 11.00 | 2288 | 4109 |
> | batch_add_batch_del| 8.86 | 1558 | 2763 |
> | add_del_on_diff_cpu| 4.74 | 11.39 | 14.77 |
>
> htab-mem-benchmark (immediate-reuse):
> | name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | -- | -- | -- | -- |
> | no_op | 1160.66 | 0.99 | 1.00 |
> | overwrite | 28.52 | 2.46 | 2.73 |
> | batch_add_batch_del| 11.50 | 2.69 | 2.95 |
> | add_del_on_diff_cpu| 3.75 | 15.85 | 24.24 |
>
> It seems the direct reason is the slow RCU grace period. During the
> benchmark, the elapsed time before the reuse_rcu() callback is invoked
> is about 100ms or even more (e.g., 2 seconds). I suspect that the global
> per-bpf-ma spin-lock and the irq-work running in the context of the
> freeing process increase the overhead of running the bpf program: the
> running time of getpgid() increases, context switching slows down, and
> the RCU grace period lengthens [1], but I am still digging into it.
For the reuse-after-RCU-GP flavor, removing the per-bpf-ma reusable list
(namely bpf_mem_shared_cache) and using a per-cpu reusable list (as v3
did) instead decreases the memory usage of htab-mem-benchmark a lot:
htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
| name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| -- | -- | -- | -- |
| no_op | 1165.38 | 0.97 | 1.00 |
| overwrite | 17.25 | 626.41 | 781.82 |
| batch_add_batch_del| 11.51 | 398.56 | 500.29 |
| add_del_on_diff_cpu| 4.21 | 31.06 | 48.84 |
But the memory usage is still large compared with v3, and the elapsed
time of the reuse_rcu() callback is about 90~200ms. Compared with v3,
there are still two differences:
1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
accelerate the reuse of freed objects.
2) v3 uses kworker instead of irq_work for free procedure.
For 1), after using kmalloc() in the irq_work to allocate multiple
inflight RCU callbacks (namely reuse_rcu()), the memory usage decreases
a bit, but not enough:
htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
| name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| -- | -- | -- | -- |
| no_op | 1247.00 | 0.97 | 1.00 |
| overwrite | 16.56 | 490.18 | 557.17 |
| batch_add_batch_del| 11.31 | 276.32 | 360.89 |
| add_del_on_diff_cpu| 4.00 | 24.76 | 42.58 |
So it seems the large memory usage is due to the irq_work (reuse_bulk)
used for the free procedure. However, after increasing the threshold for
invoking the reuse_bulk irq_work (e.g., using 10 * c->high_watermark),
there is no big difference in the memory usage or in the delay of the
RCU callbacks. Perhaps the reason is that although the number of
reuse_bulk irq_work calls is reduced, the number of alloc_bulk()
irq_work calls increases because there are no reusable objects.
>
> Another problem is the performance degradation compared with immediate
> reuse and the output from perf report shown the per-bpf-ma spin-lock is a
> top-one hotspot:
>
> map_perf_test (reuse-after-RCU-GP)
> 0:hash_map_perf kmalloc 194677 events per sec
>
> map_perf_test (immediate reuse)
> 2:hash_map_perf kmalloc 384527 events per sec
>
> The purpose of introducing the per-bpf-ma reusable list is to handle
> the case in which allocation and free are done on different CPUs
> (e.g., add_del_on_diff_cpu), while a per-cpu reusable list is enough
> for the overwrite & batch_add_batch_del cases. So maybe we could
> implement a hybrid of a global reusable list and per-cpu reusable
> lists, and switch between the two kinds of list according to the
> history of allocation and free frequency.
>
> As usual, suggestions and comments are always welcome.
>
> [0]: https://lore.kernel.org/bpf/20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local
> [1]: https://lore.kernel.org/bpf/1b64fc4e-d92e-de2f-4895-2e0c36427425@huaweicloud.com
>
> Change Log:
> v4:
> * no kworker (Alexei)
> * Use a global reusable list in bpf memory allocator (Alexei)
> * Remove BPF_MA_FREE_AFTER_RCU_GP flag and do reuse-after-rcu-gp
> by default in bpf memory allocator (Alexei)
> * add benchmark results from map_perf_test (Alexei)
>
> v3: https://lore.kernel.org/bpf/20230429101215.111262-1-houtao@huaweicloud.com/
> * add BPF_MA_FREE_AFTER_RCU_GP flavor to bpf memory allocator
> * Update htab memory benchmark
> * move the benchmark patch to the last patch
> * remove array and useless bpf_map_lookup_elem(&array, ...) in bpf
> programs
> * add synchronization between addition CPU and deletion CPU for
> add_del_on_diff_cpu case to prevent unnecessary loop
> * add the benchmark result for "extra call_rcu + bpf ma"
>
> v2: https://lore.kernel.org/bpf/20230408141846.1878768-1-houtao@huaweicloud.com/
> * add a benchmark for bpf memory allocator to compare between different
> flavor of bpf memory allocator.
> * implement BPF_MA_REUSE_AFTER_RCU_GP for bpf memory allocator.
>
> v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>
> Hou Tao (3):
> bpf: Factor out a common helper free_all()
> selftests/bpf: Add benchmark for bpf memory allocator
> bpf: Only reuse after one RCU GP in bpf memory allocator
>
> include/linux/bpf_mem_alloc.h | 4 +
> kernel/bpf/memalloc.c | 385 ++++++++++++------
> tools/testing/selftests/bpf/Makefile | 3 +
> tools/testing/selftests/bpf/bench.c | 4 +
> .../selftests/bpf/benchs/bench_htab_mem.c | 352 ++++++++++++++++
> .../bpf/benchs/run_bench_htab_mem.sh | 42 ++
> .../selftests/bpf/progs/htab_mem_bench.c | 135 ++++++
> 7 files changed, 809 insertions(+), 116 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
> create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
> create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c
>