bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hou Tao <houtao@huaweicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Yonghong Song <yhs@meta.com>, bpf <bpf@vger.kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	rcu@vger.kernel.org, Hou Tao <houtao1@huawei.com>
Subject: Re: [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc
Date: Sat, 11 Feb 2023 09:09:46 +0800	[thread overview]
Message-ID: <bf936f22-f8b7-c4a3-41a1-c3f2f115e67a@huaweicloud.com> (raw)
In-Reply-To: <CAADnVQLVi7CcW9ci62Dps4mxCEqHOYvYJ-Fant-0kSy0vPZ3AA@mail.gmail.com>



On 2/11/2023 5:06 AM, Alexei Starovoitov wrote:
> On Fri, Feb 10, 2023 at 8:33 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
>> On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote:
>>> On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
>>>>
>>>>
>>>> On 1/3/23 10:30 PM, Hou Tao wrote:
>>>>> Hi,
>>>>>
>>>>> On 1/4/2023 2:10 PM, Yonghong Song wrote:
>>>>>>
>>>>>> On 1/3/23 5:47 AM, Hou Tao wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>>>>>>>
>>>>>>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>>>>>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>>>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> The patchset tries to fix the problems found when checking how htab map
>>>>>>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>>>>>>>> freed elements may lead to two problems in htab map:
>>>>>>>>>>
>>>>>>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>>>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>>>>>>>        flag which acquires bpf-spin-lock during value copying. The
>>>>>>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
>>>>>>>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>>>>>>>        freed and then reused.
>>>>>>>>>>
>>>>>>>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>>>>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>>>>>>>> these special fields in map element only when the map element is newly
>>>>>>>>>> allocated. If it is just a reused element, there will be no
>>>>>>>>>> reinitialization.
>>>>>>>>> Instead of adding the overhead of ctor callback let's just
>>>>>>>>> add __GFP_ZERO to flags in __alloc().
>>>>>>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>>>>>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>>>>>>>> will behave the same way.
>>>>>>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>>>>>>>> tried to address a similar issue for lru hash table.
>>>>>>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>>>>>>>> hash table?
>>>>>>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
>>>>>> The following is my understanding:
>>>>>> in function alloc_htab_elem() (hashtab.c), we have
>>>>>>
>>>>>>                  if (is_map_full(htab))
>>>>>>                          if (!old_elem)
>>>>>>                                  /* when map is full and update() is replacing
>>>>>>                                   * old element, it's ok to allocate, since
>>>>>>                                   * old element will be freed immediately.
>>>>>>                                   * Otherwise return an error
>>>>>>                                   */
>>>>>>                                  return ERR_PTR(-E2BIG);
>>>>>>                  inc_elem_count(htab);
>>>>>>                  l_new = bpf_mem_cache_alloc(&htab->ma);
>>>>>>                  if (!l_new) {
>>>>>>                          l_new = ERR_PTR(-ENOMEM);
>>>>>>                          goto dec_count;
>>>>>>                  }
>>>>>>                  check_and_init_map_value(&htab->map,
>>>>>>                                           l_new->key + round_up(key_size, 8));
>>>>>>
>>>>>> In the above check_and_init_map_value() intends to do initializing
>>>>>> for an element from bpf_mem_cache_alloc (could be reused from the free list).
>>>>>>
>>>>>> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
>>>>>>
>>>>>> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
>>>>>> {
>>>>>>          int i;
>>>>>>
>>>>>>          if (!foffs)
>>>>>>                  return;
>>>>>>          for (i = 0; i < foffs->cnt; i++)
>>>>>>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
>>>>>> }
>>>>>>
>>>>>> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>>>>>> {
>>>>>>          bpf_obj_init(map->field_offs, dst);
>>>>>> }
>>>>>>
>>>>>> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
>>>>>> list_head, list_node, etc.
>>>>>>
>>>>>> This is the problem for above problem #1.
>>>>>> Maybe I missed something?
>>>>> Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
>>>>> problem by only calling check_and_init_map_value() once for the newly-allocated
>>>>> element, so if a freed element is reused, its special fields will not be zeroed
>>>>> again. Is there any other cases which are not covered by the solution or any
>>>>> other similar problems in hash-tab ?
>>>> No, I checked all cases of check_and_init_map_value() and didn't find
>>>> any other instances.
>>> check_and_init_map_value() is called in two other cases:
>>> lookup_and_delete[_batch].
>>> There the zeroing of the fields is necessary because the 'value'
>>> is a temp buffer that is going to be copied to user space.
>>> I think the way forward is to add GFP_ZERO to mem_alloc
>>> (to make it equivalent to prealloc), remove one case
>>> of check_and_init_map_value from hashmap, add short comments
>>> to two other cases and add a big comment to check_and_init_map_value()
>>> that should say that 'dst' must be a temp buffer and should not
>>> point to memory that could be used in parallel by a bpf prog.
>>> It feels like we've dealt with this issue a couple times already
>>> and keep repeating this mistake, so the more comments the better.
>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>> changes on bpf-next.
> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> The former will take a long time to settle.
> The latter is trivial.
> To unblock yourself just add GFP_ZERO in an extra patch?
Sorry for the long delay. Just find find out time to do some tests to compare
the performance of bzero and ctor. After it is done, will resubmit on next week.
> .


  reply	other threads:[~2023-02-11  1:09 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-30  4:11 [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 1/6] bpf: Support ctor in bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 2/6] bpf: Factor out a common helper free_llist() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 3/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 4/6] bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 5/6] bpf: Use BPF_MA_NO_REUSE in htab map Hou Tao
2022-12-30  4:11 ` [RFC PATCH bpf-next 6/6] selftests/bpf: Add test case for element reuse " Hou Tao
2023-01-01  1:26 ` [RFC PATCH bpf-next 0/6] bpf: Handle reuse in bpf memory alloc Alexei Starovoitov
2023-01-01 18:48   ` Yonghong Song
2023-01-03 13:47     ` Hou Tao
2023-01-04  6:10       ` Yonghong Song
2023-01-04  6:30         ` Hou Tao
2023-01-04  7:14           ` Yonghong Song
2023-01-04 18:26             ` Alexei Starovoitov
2023-02-10 16:32               ` Kumar Kartikeya Dwivedi
2023-02-10 21:06                 ` Alexei Starovoitov
2023-02-11  1:09                   ` Hou Tao [this message]
2023-02-11 16:33                     ` Alexei Starovoitov
2023-02-11 16:34                       ` Alexei Starovoitov
2023-02-15  1:54                         ` Martin KaFai Lau
2023-02-15  4:02                           ` Hou Tao
2023-02-15  7:22                             ` Martin KaFai Lau
2023-02-16  2:11                               ` Hou Tao
2023-02-16  7:47                                 ` Martin KaFai Lau
2023-02-16  8:18                                   ` Hou Tao
2023-02-16 13:55                         ` Hou Tao
2023-02-16 16:35                           ` Alexei Starovoitov
2023-02-17  1:19                             ` Hou Tao
2023-02-22 19:30                               ` Alexei Starovoitov
2023-02-15  2:35                       ` Hou Tao
2023-02-15  2:42                         ` Alexei Starovoitov
2023-02-15  3:00                           ` Hou Tao
2023-01-03 13:40   ` Hou Tao
2023-01-03 19:38     ` Alexei Starovoitov
2023-01-10  6:26       ` Martin KaFai Lau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bf936f22-f8b7-c4a3-41a1-c3f2f115e67a@huaweicloud.com \
    --to=houtao@huaweicloud.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=haoluo@google.com \
    --cc=houtao1@huawei.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=memxor@gmail.com \
    --cc=paulmck@kernel.org \
    --cc=rcu@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=yhs@fb.com \
    --cc=yhs@meta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).