[RFC bpf-next v2 0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP

From: Hou Tao @ 2023-04-08 14:18 UTC
  To: bpf, Martin KaFai Lau, Alexei Starovoitov
  Cc: Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Paul E . McKenney, rcu, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

As discussed in v1, objects freed to the bpf memory allocator may
currently be reused immediately by a new allocation. This introduces a
use-after-bpf-ma-free problem for non-preallocated hash maps and can
make the lookup procedure return an incorrect result. Immediate reuse
also makes it harder to introduce new use cases (e.g. qp-trie).
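
The reuse problem above can be illustrated with a small userspace toy.
This is only a sketch under stated assumptions: the imm_* names, the
single-element LIFO free list and the sequential "reader" are all
hypothetical and are not taken from kernel/bpf/memalloc.c. It shows a
reader that still holds a pointer to a deleted element observing the
key written by a later insertion.

```c
#include <assert.h>
#include <stddef.h>

/* Toy element and allocator (hypothetical names, not kernel code). */
struct imm_elem {
	int key;
	int value;
	struct imm_elem *next_free;
};

struct imm_ma {
	struct imm_elem *free_list;	/* LIFO: last freed is reused first */
};

static void imm_free(struct imm_ma *ma, struct imm_elem *e)
{
	e->next_free = ma->free_list;
	ma->free_list = e;
}

static struct imm_elem *imm_alloc(struct imm_ma *ma)
{
	struct imm_elem *e = ma->free_list;

	if (e)
		ma->free_list = e->next_free;
	return e;
}

/*
 * Return the key a reader observes through a pointer it obtained
 * before the element was deleted and immediately reused.
 */
static int imm_stale_lookup_key(void)
{
	struct imm_elem slot = { .key = 0, .value = 0, .next_free = NULL };
	struct imm_ma ma = { .free_list = NULL };
	struct imm_elem *reader, *e;

	imm_free(&ma, &slot);		/* seed the allocator with one element */

	e = imm_alloc(&ma);
	assert(e == &slot);
	e->key = 1;
	e->value = 100;
	reader = e;			/* reader found key 1, keeps the pointer */

	imm_free(&ma, e);		/* "concurrent" delete of key 1 */
	e = imm_alloc(&ma);		/* insert of key 2 gets the same memory */
	e->key = 2;
	e->value = 200;

	return reader->key;		/* reader now sees the new owner's key */
}
```

In this toy, imm_stale_lookup_key() returns 2 although the reader
looked up key 1. The memory itself is still valid, which is why the
problem is a use-after-bpf-ma-free rather than a plain use-after-free.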

This patch series introduces BPF_MA_REUSE_AFTER_RCU_GP to solve these
problems. With BPF_MA_REUSE_AFTER_RCU_GP, freed objects are reused only
after one RCU grace period and may be freed by the bpf memory allocator
after another RCU-tasks-trace grace period. So bpf programs which care
about the reuse problem can use bpf_rcu_read_{lock,unlock}() to access
these freed objects safely. For programs which don't care, the
use-after-bpf-ma-free remains safe, because these objects have not yet
been freed by the bpf memory allocator.
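
The reuse-after-grace-period rule can be sketched in the same userspace
style. Everything here is an assumption made for illustration: the gp_*
names are hypothetical, and the "grace period" is a crude single-thread
stand-in that simply refuses to complete while a reader count is
non-zero (real RCU only waits for readers that started before the grace
period began). The point is only that a freed element is not handed out
again while a reader is still inside its read-side section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct gp_elem {
	int key;
	struct gp_elem *next;
};

struct gp_ma {
	int readers;			/* toy stand-in for read-side sections */
	struct gp_elem *wait_for_reuse;	/* freed, grace period not elapsed */
	struct gp_elem *reusable;	/* grace period elapsed */
};

static void gp_read_lock(struct gp_ma *ma)
{
	ma->readers++;
}

static void gp_read_unlock(struct gp_ma *ma)
{
	assert(ma->readers > 0);
	ma->readers--;
}

static void gp_free(struct gp_ma *ma, struct gp_elem *e)
{
	e->next = ma->wait_for_reuse;
	ma->wait_for_reuse = e;
}

/* Toy grace period: completes only when no reader is in a section. */
static bool gp_try_grace_period(struct gp_ma *ma)
{
	struct gp_elem *e;

	if (ma->readers)
		return false;
	while ((e = ma->wait_for_reuse)) {
		ma->wait_for_reuse = e->next;
		e->next = ma->reusable;
		ma->reusable = e;
	}
	return true;
}

/* Only elements that survived a grace period are handed out again. */
static struct gp_elem *gp_alloc(struct gp_ma *ma)
{
	struct gp_elem *e = ma->reusable;

	if (e)
		ma->reusable = e->next;
	return e;
}

/* Return the key seen by a reader that stays inside its section. */
static int gp_protected_lookup_key(void)
{
	struct gp_elem slot = { .key = 1, .next = NULL };
	struct gp_ma ma = { 0 };
	struct gp_elem *reader, *e;
	bool done;
	int seen;

	gp_read_lock(&ma);
	reader = &slot;			/* reader looked up key 1 */
	gp_free(&ma, &slot);		/* "concurrent" delete of key 1 */

	done = gp_try_grace_period(&ma);
	assert(!done);			/* reader is still inside */
	e = gp_alloc(&ma);
	assert(e == NULL);		/* nothing is reusable yet */
	seen = reader->key;		/* still the key that was looked up */
	gp_read_unlock(&ma);

	done = gp_try_grace_period(&ma);
	assert(done);
	e = gp_alloc(&ma);		/* now the element may be reused */
	assert(e == &slot);
	e->key = 2;

	return seen;
}
```

Here gp_protected_lookup_key() returns 1: the insertion of key 2 only
gets the element after the reader has left its section.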

The current implementation is far from perfect, but I think it is ready
to get some feedback before I put in more effort. The implementation
mainly focuses on how to speed up the transition from freed elements to
reusable elements and tries to reduce the risk of OOM.

To accelerate the transition, it dynamically allocates an rcu_head and
calls call_rcu() in a kworker to do the transition. The frequency of
call_rcu() invocations could be improved by calling call_rcu() in irq
work, but after doing that, I found the RCU grace period increased a lot
and I still could not figure out why. To reduce the risk of OOM, these
reusable elements need to be freed as well, but we can not dynamically
allocate an rcu_head to do that, because the RCU-tasks-trace grace
period is slower than the RCU grace period. So the freeing of reusable
elements is just like the freeing in the normal bpf memory allocator,
with one difference: for a BPF_MA_REUSE_AFTER_RCU_GP bpf ma, these
freeing elements are still available for reuse in unit_alloc(). Please
see individual patches for more details.
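
One possible reading of the element life cycle described above is the
small state machine below. It is a sketch, not the implementation in
the patches: the lc_* states, events and function names are
hypothetical, and the "shrink" event is only an assumed trigger for
starting to give memory back. It encodes the one difference mentioned
above, namely that an element waiting for the RCU-tasks-trace grace
period can still be picked up by an allocation.

```c
#include <assert.h>
#include <stdbool.h>

enum lc_state {
	LC_IN_USE,		/* owned by the caller, e.g. a map element */
	LC_WAIT_REUSE,		/* freed, waiting for one RCU grace period */
	LC_REUSABLE,		/* may be handed out by the allocator again */
	LC_FREEING,		/* waiting for an RCU-tasks-trace grace period */
	LC_RELEASED,		/* given back to the underlying allocator */
};

enum lc_event {
	LC_EV_FREE,		/* element is freed to the bpf ma */
	LC_EV_RCU_GP,		/* one RCU grace period has elapsed */
	LC_EV_SHRINK,		/* bpf ma starts to give the element back */
	LC_EV_TASKS_TRACE_GP,	/* RCU-tasks-trace grace period has elapsed */
	LC_EV_ALLOC,		/* allocation wants to take the element */
};

/* Elements being freed are still candidates for reuse. */
static bool lc_can_alloc(enum lc_state s)
{
	return s == LC_REUSABLE || s == LC_FREEING;
}

/* Return the next state; events that do not apply leave it unchanged. */
static enum lc_state lc_next(enum lc_state s, enum lc_event ev)
{
	/* A released element no longer belongs to the allocator. */
	assert(s != LC_RELEASED);

	switch (ev) {
	case LC_EV_FREE:
		return s == LC_IN_USE ? LC_WAIT_REUSE : s;
	case LC_EV_RCU_GP:
		return s == LC_WAIT_REUSE ? LC_REUSABLE : s;
	case LC_EV_SHRINK:
		return s == LC_REUSABLE ? LC_FREEING : s;
	case LC_EV_TASKS_TRACE_GP:
		return s == LC_FREEING ? LC_RELEASED : s;
	case LC_EV_ALLOC:
		return lc_can_alloc(s) ? LC_IN_USE : s;
	}
	return s;
}
```

For example, lc_next(LC_WAIT_REUSE, LC_EV_ALLOC) stays in LC_WAIT_REUSE,
while lc_next(LC_FREEING, LC_EV_ALLOC) goes back to LC_IN_USE instead of
waiting for the element to be released.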

Comments and suggestions are always welcome.

Change Log:
v2:
 * add a benchmark for bpf memory allocator to compare different
   flavors of bpf memory allocator.
 * implement BPF_MA_REUSE_AFTER_RCU_GP for bpf memory allocator.
v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

Hou Tao (4):
  selftests/bpf: Add benchmark for bpf memory allocator
  bpf: Factor out a common helper free_all()
  bpf: Pass bitwise flags to bpf_mem_alloc_init()
  bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

 include/linux/bpf_mem_alloc.h                 |   9 +-
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/cpumask.c                          |   2 +-
 kernel/bpf/hashtab.c                          |   5 +-
 kernel/bpf/memalloc.c                         | 390 ++++++++++++++++--
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 273 ++++++++++++
 .../selftests/bpf/progs/htab_mem_bench.c      | 145 +++++++
 9 files changed, 785 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

-- 
2.29.2
