bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Hou Tao <houtao@huaweicloud.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>,
	bpf <bpf@vger.kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	rcu@vger.kernel.org, Hou Tao <houtao1@huawei.com>
Subject: Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
Date: Tue, 3 Jan 2023 11:38:16 -0800	[thread overview]
Message-ID: <CAADnVQ+z-Y6Yv2i-icAUy=Uyh9yiN4S1AOrLd=K8mu32TXORkw@mail.gmail.com> (raw)
In-Reply-To: <2f875cf9-88ac-1406-4ad0-f7647fb92883@huaweicloud.com>

On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@huaweicloud.com> wrote:
> Hi,
> On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
> > On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Hi,
> >>
> >> The patchset tries to fix the problems found when checking how htab map
> >> handles element reuse in bpf memory allocator. The immediate reuse of
> >> freed elements may lead to two problems in htab map:
> >>
> >> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> >>     htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> >>     flag which acquires bpf-spin-lock during value copying. The
> >>     corruption of bpf-spin-lock may result in hard lock-up.
> >> (2) lookup procedure may get incorrect map value if the found element is
> >>     freed and then reused.
> >>
> >> Because the type of htab map elements are the same, so problem #1 can be
> >> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> >> these special fields in map element only when the map element is newly
> >> allocated. If it is just a reused element, there will be no
> >> reinitialization.
> > Instead of adding the overhead of ctor callback let's just
> > add __GFP_ZERO to flags in __alloc().
> > That will address the issue 1 and will make bpf_mem_alloc behave just
> > like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> > will behave the same way.
> Use __GPF_ZERO will be simpler, but the overhead of memset() for every allocated
> object may be bigger than ctor callback when the size of allocated object is
> large. And it also introduces unnecessary memory zeroing if there is no special
> field in the map value.

Small memset is often faster than an indirect call.
I doubt that adding GFP_ZERO will have a measurable perf difference
in map_perf_test and other benchmarks.

> >> Problem #2 exists for both non-preallocated and preallocated htab map.
> >> By adding seq in htab element, doing reuse check and retrying the
> >> lookup procedure may be a feasible solution, but it will make the
> >> lookup API being hard to use, because the user needs to check whether
> >> the found element is reused or not and repeat the lookup procedure if it
> >> is reused. A simpler solution would be just disabling freed elements
> >> reuse and freeing these elements after lookup procedure ends.
> > You've proposed this 'solution' twice already in qptrie thread and both
> > times the answer was 'no, we cannot do this' with reasons explained.
> > The 3rd time the answer is still the same.
> This time a workable demo which calls call_rcu_task_trace() in batch is provided
> :) Also because I can not find a better solution for the reuse problem. But you
> are right, although don't reuse the freed element will make the implementation
> of map simpler, the potential OOM problem is hard to solve specially when RCU
> tasks trace grace period is slow. Hope Paul can provide some insight about the
> problem.

OOM is exactly the reason why we cannot do this delaying logic
in the general case. We don't control what progs do and rcu tasks trace
may take a long time.

> > This 'issue 2' existed in hashmap since very beginning for many years.
> > It's a known quirk. There is nothing to fix really.
> Do we need to document the unexpected behavior somewhere, because I really don't
> know nothing about the quirk ?

Yeah. It's not documented in Documentation/bpf/map_hash.rst.
Please send a patch to add it.

> >
> > The graph apis (aka new gen data structs) with link list and rbtree are
> > in active development. Soon bpf progs will be able to implement their own
> > hash maps with explicit bpf_rcu_read_lock. At that time the progs will
> > be making the trade off between performance and lookup/delete race.
> It seems these new gen data struct also need to solve the reuse problem because
> a global bpf memory allocator is used.

Currently the graph api is single owner and kptr_xchg is used to allow
single cpu at a time to observe the object.
In the future we will add bpf_refcount and multi owner.
Then multiple cpus will be able to access the same object concurrently.
They will race to read/write the fields and it will be prog decision
to arbitrate the access.
In other words the bpf prog needs to have mechanisms to deal with reuse.

  reply	other threads:[~2023-01-03 19:39 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 2/6] bpf: Factor out a common helper free_llist() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 3/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 4/6] bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 5/6] bpf: Use BPF_MA_NO_REUSE in htab map Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 6/6] selftests/bpf: Add test case for element reuse " Hou Tao
2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
2023-01-01 18:48   ` Yonghong Song
2023-01-03 13:47     ` Hou Tao
2023-01-04  6:10       ` Yonghong Song
2023-01-04  6:30         ` Hou Tao
2023-01-04  7:14           ` Yonghong Song
2023-01-04 18:26             ` Alexei Starovoitov
2023-02-10 16:32               ` Kumar Kartikeya Dwivedi
2023-02-10 21:06                 ` Alexei Starovoitov
2023-02-11  1:09                   ` Hou Tao
2023-02-11 16:33                     ` Alexei Starovoitov
2023-02-11 16:34                       ` Alexei Starovoitov
2023-02-15  1:54                         ` Martin KaFai Lau
2023-02-15  4:02                           ` Hou Tao
2023-02-15  7:22                             ` Martin KaFai Lau
2023-02-16  2:11                               ` Hou Tao
2023-02-16  7:47                                 ` Martin KaFai Lau
2023-02-16  8:18                                   ` Hou Tao
2023-02-16 13:55                         ` Hou Tao
2023-02-16 16:35                           ` Alexei Starovoitov
2023-02-17  1:19                             ` Hou Tao
2023-02-22 19:30                               ` Alexei Starovoitov
2023-02-15  2:35                       ` Hou Tao
2023-02-15  2:42                         ` Alexei Starovoitov
2023-02-15  3:00                           ` Hou Tao
2023-01-03 13:40   ` Hou Tao
2023-01-03 19:38     ` Alexei Starovoitov [this message]
2023-01-10  6:26       ` Martin KaFai Lau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAADnVQ+z-Y6Yv2i-icAUy=Uyh9yiN4S1AOrLd=K8mu32TXORkw@mail.gmail.com' \
    --to=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=haoluo@google.com \
    --cc=houtao1@huawei.com \
    --cc=houtao@huaweicloud.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=paulmck@kernel.org \
    --cc=rcu@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=yhs@fb.com \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).