bpf.vger.kernel.org archive mirror
From: Martin KaFai Lau <martin.lau@linux.dev>
To: Hou Tao <houtao@huaweicloud.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>,
	bpf <bpf@vger.kernel.org>, Andrii Nakryiko <andrii@kernel.org>,
	Song Liu <song@kernel.org>, Hao Luo <haoluo@google.com>,
	Yonghong Song <yhs@fb.com>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	rcu@vger.kernel.org, Hou Tao <houtao1@huawei.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Subject: Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
Date: Mon, 9 Jan 2023 22:26:53 -0800	[thread overview]
Message-ID: <476f705b-8501-dfd8-d62e-30d00b4a7d64@linux.dev> (raw)
In-Reply-To: <CAADnVQ+z-Y6Yv2i-icAUy=Uyh9yiN4S1AOrLd=K8mu32TXORkw@mail.gmail.com>

On 1/3/23 11:38 AM, Alexei Starovoitov wrote:
> On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Hi,
>>>>
>>>> The patchset tries to fix the problems found when checking how htab map
>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>> freed elements may lead to two problems in htab map:
>>>>
>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>      the htab map value, and it may corrupt a lookup with the BPF_F_LOCK
>>>>      flag, which acquires the bpf spin-lock while copying the value. The
>>>>      corruption of the bpf spin-lock may result in a hard lock-up.
>>>> (2) a lookup may return an incorrect map value if the found element is
>>>>      freed and then reused.
>>>>
>>>> Because all elements of an htab map have the same type, problem #1 can
>>>> be fixed by supporting a ctor in the bpf memory allocator. The ctor
>>>> initializes the special fields in a map element only when the element
>>>> is newly allocated; if it is a reused element, there is no
>>>> reinitialization.
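To illustrate the ctor idea, here is a minimal userspace sketch (the names and the trivial free list are illustrative only, not the actual bpf_mem_alloc code): the constructor runs only on fresh allocations, so a special field is initialized exactly once and its state survives free/alloc reuse cycles.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element: a lock word standing in for bpf_spin_lock,
 * plus ordinary map value data. */
struct elem {
	struct elem *next;	/* free-list linkage */
	int lock_word;		/* stand-in for the special field */
	int value;		/* ordinary map value data */
};

struct cache {
	struct elem *free_list;
	void (*ctor)(struct elem *);	/* runs on fresh allocations only */
};

static void elem_ctor(struct elem *e)
{
	e->lock_word = 1;	/* one-time init of the special field */
}

static struct elem *cache_alloc(struct cache *c)
{
	struct elem *e = c->free_list;

	if (e) {
		c->free_list = e->next;	/* reuse: ctor is NOT re-run */
		return e;
	}
	e = calloc(1, sizeof(*e));
	if (e && c->ctor)
		c->ctor(e);		/* fresh: initialize special fields */
	return e;
}

static void cache_free(struct cache *c, struct elem *e)
{
	e->next = c->free_list;		/* immediate reuse, no zeroing */
	c->free_list = e;
}
```

The key property is that a reused element comes back with whatever state its special field last had, rather than being reinitialized underneath a concurrent reader.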
>>> Instead of adding the overhead of ctor callback let's just
>>> add __GFP_ZERO to flags in __alloc().
>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>> will behave the same way.
>> Using __GFP_ZERO will be simpler, but the overhead of memset() for every
>> allocated object may be bigger than a ctor callback when the allocated
>> object is large. It also introduces unnecessary memory zeroing when there
>> is no special field in the map value.
> 
> Small memset is often faster than an indirect call.
> I doubt that adding GFP_ZERO will have a measurable perf difference
> in map_perf_test and other benchmarks.
> 
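For contrast, a sketch of the __GFP_ZERO alternative being discussed (again a hypothetical userspace free list, not the kernel allocator): every allocation is zeroed, whether fresh or reused, so a reused element is indistinguishable from a fresh one and no ctor is needed, at the cost of a memset per allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical zeroing cache: free slots store the next pointer in the
 * object itself; obj_size must be at least sizeof(void *). */
struct zcache {
	void **free_list;
	size_t obj_size;
};

static void *zcache_alloc(struct zcache *c)
{
	void *p = c->free_list;

	if (p)
		c->free_list = *(void **)p;	/* pop a reused object */
	else
		p = malloc(c->obj_size);
	if (p)
		memset(p, 0, c->obj_size);	/* mimic __GFP_ZERO */
	return p;
}

static void zcache_free(struct zcache *c, void *p)
{
	*(void **)p = c->free_list;
	c->free_list = p;
}
```

This is the behavior Alexei refers to: reuse becomes semantically identical to a fresh allocation, matching what percpu_freelist-backed preallocated maps see.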
>>>> Problem #2 exists for both non-preallocated and preallocated htab maps.
>>>> Adding a seq count to each htab element, then checking for reuse and
>>>> retrying the lookup, may be a feasible solution, but it makes the
>>>> lookup API hard to use: the user needs to check whether the found
>>>> element has been reused and repeat the lookup if it has. A simpler
>>>> solution would be to disable reuse of freed elements and free them
>>>> only after the lookup procedure ends.
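The seq-based retry idea mentioned above could look roughly like a seqcount: each element carries a sequence number bumped when the element is freed for recycling, and a lookup snapshots the seq, copies the value, and rejects the copy if the seq changed. A hypothetical sketch (C11 atomics instead of kernel primitives):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical element with a reuse sequence counter: odd while the
 * element is being recycled, even while it is stable. */
struct seq_elem {
	_Atomic unsigned int seq;
	int value;
};

static void elem_begin_reuse(struct seq_elem *e)
{
	atomic_fetch_add_explicit(&e->seq, 1, memory_order_release); /* -> odd */
}

static void elem_end_reuse(struct seq_elem *e)
{
	atomic_fetch_add_explicit(&e->seq, 1, memory_order_release); /* -> even */
}

/* Returns 0 on a stable copy, -1 if the caller must retry the lookup. */
static int lookup_copy(struct seq_elem *e, int *out)
{
	unsigned int start = atomic_load_explicit(&e->seq, memory_order_acquire);

	if (start & 1)
		return -1;		/* element is being recycled */
	*out = e->value;		/* tentative copy */
	if (atomic_load_explicit(&e->seq, memory_order_acquire) != start)
		return -1;		/* freed/reused underneath us, retry */
	return 0;
}
```

This shows why the API would be hard to use: every caller has to handle the retry path, which is exactly the usability objection raised above.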
>>> You've proposed this 'solution' twice already in qptrie thread and both
>>> times the answer was 'no, we cannot do this' with reasons explained.
>>> The 3rd time the answer is still the same.
>> This time a workable demo which calls call_rcu_tasks_trace() in batch is
>> provided :) Also, I could not find a better solution for the reuse
>> problem. But you are right: although not reusing freed elements would
>> make the map implementation simpler, the potential OOM problem is hard to
>> solve, especially when the RCU tasks trace grace period is slow. Hope
>> Paul can provide some insight on the problem.
> 
> OOM is exactly the reason why we cannot do this delaying logic
> in the general case. We don't control what progs do and rcu tasks trace
> may take a long time.
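The batching approach Hou mentions can be sketched as follows (a hypothetical userspace model, with the grace period simulated by an explicit callback rather than call_rcu_tasks_trace()): freed elements accumulate on a pending list, and a single callback retires the whole batch, instead of queueing one RCU callback per element. The OOM risk is visible in the model too: nothing bounds `npending` while the grace period is outstanding.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct defer_batch {
	struct node *pending;	/* freed, waiting for the grace period */
	struct node *reusable;	/* grace period elapsed, safe to reuse */
	int npending;
};

static void defer_free(struct defer_batch *b, struct node *n)
{
	n->next = b->pending;
	b->pending = n;
	b->npending++;		/* unbounded if the grace period is slow */
}

/* Stand-in for the RCU callback firing: one callback retires the whole
 * batch rather than one element at a time. */
static void grace_period_elapsed(struct defer_batch *b)
{
	struct node *n = b->pending;

	while (n && n->next)
		n = n->next;		/* find the tail of the batch */
	if (n) {
		n->next = b->reusable;	/* splice batch onto reusable list */
		b->reusable = b->pending;
	}
	b->pending = NULL;
	b->npending = 0;
}
```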

I haven't looked at the details of this patch set yet. Since Hou asked in
https://lore.kernel.org/bpf/6e4ec7a4-9ac9-417c-c11a-de59e72a6e42@huawei.com/
about the local storage use case (thanks!), let's continue the discussion in
this thread.

Some of my current thoughts: the local storage map is a little different from
the more generic bpf maps (e.g. htab). The common local storage use case is
less demanding on the alloc/free path. The storage is usually allocated once
and then stays for the whole lifetime of its owner (sk, task, cgrp, or
inode). There is no update helper, which encourages a get() followed by an
in-place modification. Besides, the current local storage is also freed only
after going through an RCU tasks trace grace period.

That said, I am checking to see if the common local storage use case can be
modified to safely reuse elements without waiting for the grace period. That
would be an improvement over the existing implementation. The bottom line is
not to slow down the current get() (i.e. lookup) performance. Not 100% sure
yet if it is doable.

The delete() helper likely still needs to go through a grace period. Being
able to avoid immediate reuse in bpf_mem_alloc will at least be needed for
the local storage delete() code path.

> 
>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>> It's a known quirk. There is nothing to fix really.
>> Do we need to document the unexpected behavior somewhere? I really didn't
>> know anything about this quirk.
> 
> Yeah. It's not documented in Documentation/bpf/map_hash.rst.
> Please send a patch to add it.
> 
>>>
>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>> in active development. Soon bpf progs will be able to implement their own
>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>> be making the trade off between performance and lookup/delete race.
>> It seems these new gen data structs also need to solve the reuse problem
>> because a global bpf memory allocator is used.
> 
> Currently the graph api is single owner and kptr_xchg is used to allow
> single cpu at a time to observe the object.
> In the future we will add bpf_refcount and multi owner.
> Then multiple cpus will be able to access the same object concurrently.
> They will race to read/write the fields and it will be prog decision
> to arbitrate the access.
> In other words the bpf prog needs to have mechanisms to deal with reuse.


Thread overview: 35+ messages
2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 2/6] bpf: Factor out a common helper free_llist() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 3/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 4/6] bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 5/6] bpf: Use BPF_MA_NO_REUSE in htab map Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 6/6] selftests/bpf: Add test case for element reuse " Hou Tao
2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
2023-01-01 18:48   ` Yonghong Song
2023-01-03 13:47     ` Hou Tao
2023-01-04  6:10       ` Yonghong Song
2023-01-04  6:30         ` Hou Tao
2023-01-04  7:14           ` Yonghong Song
2023-01-04 18:26             ` Alexei Starovoitov
2023-02-10 16:32               ` Kumar Kartikeya Dwivedi
2023-02-10 21:06                 ` Alexei Starovoitov
2023-02-11  1:09                   ` Hou Tao
2023-02-11 16:33                     ` Alexei Starovoitov
2023-02-11 16:34                       ` Alexei Starovoitov
2023-02-15  1:54                         ` Martin KaFai Lau
2023-02-15  4:02                           ` Hou Tao
2023-02-15  7:22                             ` Martin KaFai Lau
2023-02-16  2:11                               ` Hou Tao
2023-02-16  7:47                                 ` Martin KaFai Lau
2023-02-16  8:18                                   ` Hou Tao
2023-02-16 13:55                         ` Hou Tao
2023-02-16 16:35                           ` Alexei Starovoitov
2023-02-17  1:19                             ` Hou Tao
2023-02-22 19:30                               ` Alexei Starovoitov
2023-02-15  2:35                       ` Hou Tao
2023-02-15  2:42                         ` Alexei Starovoitov
2023-02-15  3:00                           ` Hou Tao
2023-01-03 13:40   ` Hou Tao
2023-01-03 19:38     ` Alexei Starovoitov
2023-01-10  6:26       ` Martin KaFai Lau [this message]
