From: Yonghong Song <yhs@meta.com>
To: Hou Tao <houtao@huaweicloud.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	rcu@vger.kernel.org, houtao1@huawei.com
Subject: Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
Date: Tue, 3 Jan 2023 23:14:41 -0800	[thread overview]
Message-ID: <e205d4a3-a885-93c7-5d02-2e9fd87348e8@meta.com> (raw)
In-Reply-To: <1d97a5c0-d1fb-a625-8e8d-25ef799ee9e2@huaweicloud.com>



On 1/3/23 10:30 PM, Hou Tao wrote:
> Hi,
> 
> On 1/4/2023 2:10 PM, Yonghong Song wrote:
>>
>>
>> On 1/3/23 5:47 AM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>>>
>>>>
>>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The patchset tries to fix problems found when checking how the htab map
>>>>>> handles element reuse in the bpf memory allocator. The immediate reuse
>>>>>> of freed elements may lead to two problems in the htab map:
>>>>>>
>>>>>> (1) reuse reinitializes the special fields (e.g., bpf_spin_lock) in the
>>>>>>        htab map value, which may corrupt a concurrent lookup done with
>>>>>>        the BPF_F_LOCK flag, since such a lookup acquires the
>>>>>>        bpf_spin_lock while copying the value. Corruption of the
>>>>>>        bpf_spin_lock may result in a hard lock-up. (A short
>>>>>>        illustration follows this list.)
>>>>>> (2) a lookup may return an incorrect map value if the found element is
>>>>>>        freed and then reused.
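
For reference, a minimal user-space sketch of the BPF_F_LOCK lookup in
question. This is illustrative only: bpf_map_lookup_elem_flags() is the
libbpf helper that exercises this path, and the wrapper name is made up.

    /* A lookup with BPF_F_LOCK takes the element's bpf_spin_lock while
     * copying the value, so re-zeroing that lock on element reuse can
     * corrupt the lock word a concurrent lookup is spinning on.
     */
    #include <bpf/bpf.h>   /* also pulls in BPF_F_LOCK via linux/bpf.h */

    static int lookup_locked(int map_fd, const void *key, void *value)
    {
            return bpf_map_lookup_elem_flags(map_fd, key, value, BPF_F_LOCK);
    }
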
>>>>>>
>>>>>> Because all elements of an htab map have the same type, problem #1 can
>>>>>> be fixed by supporting a ctor in the bpf memory allocator. The ctor
>>>>>> initializes the special fields in a map element only when the element
>>>>>> is newly allocated. If it is just a reused element, there will be no
>>>>>> reinitialization.
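
A minimal sketch of the ctor idea, assuming a hypothetical 'ctor' callback
field in the allocator (the actual patch may wire this up differently):

    /* Sketch: the ctor runs only when a fresh object comes from the
     * slab allocator, never when an object is recycled through the
     * allocator's free list, so special fields are initialized once.
     */
    static void *__alloc(struct bpf_mem_cache *c, int node)
    {
            void *obj;

            obj = kmalloc_node(c->unit_size, GFP_NOWAIT | __GFP_NOWARN, node);
            if (obj && c->ctor)
                    c->ctor(c, obj);    /* hypothetical callback */
            return obj;
    }
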
>>>>>
>>>>> Instead of adding the overhead of a ctor callback let's just
>>>>> add __GFP_ZERO to the flags in __alloc().
>>>>> That will address issue #1 and will make bpf_mem_alloc behave just
>>>>> like percpu_freelist, so a hashmap with BPF_F_NO_PREALLOC and the
>>>>> default one will behave the same way.
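
Roughly what that change amounts to, illustratively (the exact gfp flags
and call site in kernel/bpf/memalloc.c may differ):

    /* With __GFP_ZERO every object obtained from the slab allocator
     * starts out zeroed.  Elements recycled through the per-cpu free
     * lists are not re-zeroed, so fields such as bpf_spin_lock keep
     * their state across reuse, matching preallocated-map behaviour.
     */
    return kmalloc_node(c->unit_size,
                        GFP_NOWAIT | __GFP_NOWARN | __GFP_ZERO, node);
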
>>>>
>>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>>>> tried to address a similar issue for the LRU hash table.
>>>> Maybe we need to do something similar after bpf_mem_cache_alloc() for
>>>> the hash table?
>>> IMO either a ctor or __GFP_ZERO will fix the issue. Did I miss something here?
>>
>> The following is my understanding:
>> in function alloc_htab_elem() (hashtab.c), we have
>>
>>                  if (is_map_full(htab))
>>                          if (!old_elem)
>>                                  /* when map is full and update() is replacing
>>                                   * old element, it's ok to allocate, since
>>                                   * old element will be freed immediately.
>>                                   * Otherwise return an error
>>                                   */
>>                                  return ERR_PTR(-E2BIG);
>>                  inc_elem_count(htab);
>>                  l_new = bpf_mem_cache_alloc(&htab->ma);
>>                  if (!l_new) {
>>                          l_new = ERR_PTR(-ENOMEM);
>>                          goto dec_count;
>>                  }
>>                  check_and_init_map_value(&htab->map,
>>                                           l_new->key + round_up(key_size, 8));
>>
>> In the above, check_and_init_map_value() initializes an element returned
>> by bpf_mem_cache_alloc() (which could be one reused from the free list).
>>
>> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
>>
>> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
>> {
>>          int i;
>>
>>          if (!foffs)
>>                  return;
>>          for (i = 0; i < foffs->cnt; i++)
>>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
>> }
>>
>> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>> {
>>          bpf_obj_init(map->field_offs, dst);
>> }
>>
>> IIUC, bpf_obj_init() will bzero fields like spin_lock, timer,
>> list_head, list_node, etc.
>>
>> This is exactly the cause of problem #1 above.
>> Maybe I missed something?
> Yes, it is exactly the problem patch #1 tries to fix. Patch #1 fixes it
> by calling check_and_init_map_value() only once, for the newly-allocated
> element, so if a freed element is reused, its special fields will not be
> zeroed again. Are there any other cases which are not covered by the
> solution, or any other similar problems in hashtab?
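
Condensed, that fix amounts to something like the following sketch (the
ctor registration and its exact signature are assumptions based on the
cover letter, not code taken from the patch):

    /* Sketch: htab registers a ctor with its bpf_mem_alloc, so the
     * special fields in the map value are zeroed exactly once, when
     * the element is first allocated.  alloc_htab_elem() then drops
     * its unconditional check_and_init_map_value() call, and reused
     * elements keep their bpf_spin_lock/timer state intact.
     */
    static void htab_elem_ctor(struct bpf_mem_alloc *ma, void *obj)
    {
            struct bpf_htab *htab = container_of(ma, struct bpf_htab, ma);
            struct htab_elem *l = obj;

            check_and_init_map_value(&htab->map,
                                     l->key + round_up(htab->map.key_size, 8));
    }
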

No, I checked all the call sites of check_and_init_map_value() and didn't
find any other instances.

>>
>>>>
>>>>
>>>>>
>>>>>> Problem #2 exists for both non-preallocated and preallocated htab maps.
>>>>>> Adding a seq field to the htab element, checking it for reuse, and
>>>>>> retrying the lookup may be a feasible solution, but it would make the
>>>>>> lookup API hard to use, because the user would need to check whether
>>>>>> the found element had been reused and repeat the lookup if so. A
>>>>>> simpler solution would be to just disable the reuse of freed elements
>>>>>> and free them only after the lookup procedure ends.
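
For context, such a seq-based check would look roughly like a seqcount
retry loop (purely illustrative; htab_elem has no 'seq' field today and
'roundup_key_size' stands in for the usual local variable):

    /* Illustrative: a per-element counter bumped on free.  A lookup
     * snapshots it, copies the value, then re-checks; a mismatch means
     * the element was freed (and possibly reused) underneath us, so
     * the caller must retry the whole lookup.
     */
    again:
            seq = READ_ONCE(l->seq);
            smp_rmb();
            copy_map_value(&htab->map, dst, l->key + roundup_key_size);
            smp_rmb();
            if (READ_ONCE(l->seq) != seq)
                    goto again;
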
>>>>>
>>>>> You've proposed this 'solution' twice already in the qptrie thread, and
>>>>> both times the answer was 'no, we cannot do this' with reasons explained.
>>>>> The 3rd time the answer is still the same.
>>>>> This 'issue 2' has existed in the hashmap since the very beginning, for
>>>>> many years. It's a known quirk. There is nothing to fix really.
>>>>>
>>>>> The graph APIs (aka new-gen data structs) with linked list and rbtree are
>>>>> in active development. Soon bpf progs will be able to implement their own
>>>>> hash maps with an explicit bpf_rcu_read_lock. At that time the progs will
>>>>> be making the trade-off between performance and the lookup/delete race.
>>>>> So please respin with just __GFP_ZERO and update patch 6
>>>>> to check for the lock-up only.
>>>
> 

Thread overview: 35+ messages
2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 2/6] bpf: Factor out a common helper free_llist() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 3/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 4/6] bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 5/6] bpf: Use BPF_MA_NO_REUSE in htab map Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 6/6] selftests/bpf: Add test case for element reuse " Hou Tao
2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
2023-01-01 18:48   ` Yonghong Song
2023-01-03 13:47     ` Hou Tao
2023-01-04  6:10       ` Yonghong Song
2023-01-04  6:30         ` Hou Tao
2023-01-04  7:14           ` Yonghong Song [this message]
2023-01-04 18:26             ` Alexei Starovoitov
2023-02-10 16:32               ` Kumar Kartikeya Dwivedi
2023-02-10 21:06                 ` Alexei Starovoitov
2023-02-11  1:09                   ` Hou Tao
2023-02-11 16:33                     ` Alexei Starovoitov
2023-02-11 16:34                       ` Alexei Starovoitov
2023-02-15  1:54                         ` Martin KaFai Lau
2023-02-15  4:02                           ` Hou Tao
2023-02-15  7:22                             ` Martin KaFai Lau
2023-02-16  2:11                               ` Hou Tao
2023-02-16  7:47                                 ` Martin KaFai Lau
2023-02-16  8:18                                   ` Hou Tao
2023-02-16 13:55                         ` Hou Tao
2023-02-16 16:35                           ` Alexei Starovoitov
2023-02-17  1:19                             ` Hou Tao
2023-02-22 19:30                               ` Alexei Starovoitov
2023-02-15  2:35                       ` Hou Tao
2023-02-15  2:42                         ` Alexei Starovoitov
2023-02-15  3:00                           ` Hou Tao
2023-01-03 13:40   ` Hou Tao
2023-01-03 19:38     ` Alexei Starovoitov
2023-01-10  6:26       ` Martin KaFai Lau
