From: Hou Tao <houtao@huaweicloud.com>
To: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
	Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	rcu@vger.kernel.org, houtao1@huawei.com
Subject: [RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP
Date: Sat,  8 Apr 2023 22:18:42 +0800	[thread overview]
Message-ID: <20230408141846.1878768-1-houtao@huaweicloud.com> (raw)

From: Hou Tao <houtao1@huawei.com>


As discussed in v1, freed objects in the bpf memory allocator may
currently be reused immediately by a new allocation. This introduces a
use-after-bpf-ma-free problem for non-preallocated hash maps and makes
the lookup procedure return incorrect results. The immediate reuse also
makes it harder to introduce new use cases (e.g. qp-trie).

This patch series introduces BPF_MA_REUSE_AFTER_RCU_GP to solve these
problems. With BPF_MA_REUSE_AFTER_RCU_GP, freed objects are reused only
after one RCU grace period and may be freed by the bpf memory allocator
after a further RCU-tasks-trace grace period. So bpf programs which care
about the reuse problem can use bpf_rcu_read_{lock,unlock}() to access
these freed objects safely, and for programs which don't care, any
use-after-bpf-ma-free will be safe because these objects have not yet
been freed by the bpf memory allocator.

The current implementation is far from perfect, but I think it is ready
for some feedback before putting in more effort. The implementation
mainly focuses on speeding up the transition from freed elements to
reusable elements and on reducing the risk of OOM.

To accelerate the transition, it dynamically allocates an rcu_head and
invokes call_rcu() in a kworker to do the transition. The frequency of
call_rcu() invocation could be increased by calling call_rcu() in an irq
work, but after doing that I found the RCU grace period increased a lot
and I still could not figure out why. To reduce the risk of OOM, these
reusable elements need to be freed as well, but we cannot dynamically
allocate an rcu_head to do that, because the RCU-tasks-trace grace
period is slower than the RCU grace period. So the freeing of reusable
elements works just like the freeing in the normal bpf memory allocator,
with one difference: for a BPF_MA_REUSE_AFTER_RCU_GP bpf ma, these
to-be-freed elements are still available for reuse in unit_alloc().
Please see the individual patches for more details.

Comments and suggestions are always welcome.

Change Log:
v2:
 * add a benchmark for the bpf memory allocator to compare the
   different flavors of bpf memory allocator.
 * implement BPF_MA_REUSE_AFTER_RCU_GP for the bpf memory allocator.
v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

Hou Tao (4):
  selftests/bpf: Add benchmark for bpf memory allocator
  bpf: Factor out a common helper free_all()
  bpf: Pass bitwise flags to bpf_mem_alloc_init()
  bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

 include/linux/bpf_mem_alloc.h                 |   9 +-
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/cpumask.c                          |   2 +-
 kernel/bpf/hashtab.c                          |   5 +-
 kernel/bpf/memalloc.c                         | 390 ++++++++++++++++--
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 273 ++++++++++++
 .../selftests/bpf/progs/htab_mem_bench.c      | 145 +++++++
 9 files changed, 785 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c


