From: Hou Tao <houtao@huaweicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
Daniel Borkmann <daniel@iogearbox.net>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
"Paul E . McKenney" <paulmck@kernel.org>,
rcu@vger.kernel.org, houtao1@huawei.com
Subject: Re: [RFC bpf-next v3 3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
Date: Thu, 4 May 2023 09:35:17 +0800 [thread overview]
Message-ID: <986216a3-437a-5219-fd9a-341786e9264b@huaweicloud.com> (raw)
In-Reply-To: <20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local>
Hi,
On 5/4/2023 2:48 AM, Alexei Starovoitov wrote:
> On Sat, Apr 29, 2023 at 06:12:12PM +0800, Hou Tao wrote:
>> +
>> +static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
>> +{
>> + unsigned long flags;
>> +
>> + local_irq_save(flags);
>> + /* In case a NMI-context bpf program is also freeing object. */
>> + if (local_inc_return(&c->active) == 1) {
>> + bool try_queue_work = false;
>> +
>> + /* kworker may remove elements from prepare_reuse_head */
>> + raw_spin_lock(&c->reuse_lock);
>> + if (llist_empty(&c->prepare_reuse_head))
>> + c->prepare_reuse_tail = llnode;
>> + __llist_add(llnode, &c->prepare_reuse_head);
>> + if (++c->prepare_reuse_cnt > c->high_watermark) {
>> + /* Zero out prepare_reuse_cnt early to prevent
>> + * unnecessary queue_work().
>> + */
>> + c->prepare_reuse_cnt = 0;
>> + try_queue_work = true;
>> + }
>> + raw_spin_unlock(&c->reuse_lock);
>> +
>> + if (try_queue_work && !work_pending(&c->reuse_work)) {
>> + /* Use reuse_cb_in_progress to indicate there is
>> + * inflight reuse kworker or reuse RCU callback.
>> + */
>> + atomic_inc(&c->reuse_cb_in_progress);
>> + /* Already queued */
>> + if (!queue_work(bpf_ma_wq, &c->reuse_work))
> As Martin pointed out queue_work() is not safe here.
> The raw_spin_lock(&c->reuse_lock); earlier is not safe either.
I see. Didn't recognize these problems.
> For the next version please drop workers and spin_lock from unit_free/alloc paths.
> If lock has to be taken it should be done from irq_work.
> Under no circumstances we can use alloc_workqueue(). No new kthreads.
Is there any reason to prohibit the use of new kthread in irq_work ?
>
> We can avoid adding new flag to bpf_mem_alloc to reduce the complexity
> and do roughly equivalent of REUSE_AFTER_RCU_GP unconditionally in the following way:
>
> - alloc_bulk() won't be trying to steal from c->free_by_rcu.
>
> - do_call_rcu() does call_rcu(&c->rcu, __free_rcu) instead of task-trace version.
No sure whether or not one inflight RCU callback is enough. Will check.
If one is not enough, I may use kmalloc(__GFP_NOWAIT) in irq work to
allocate multiple RCU callbacks.
> - rcu_trace_implies_rcu_gp() is never used.
>
> - after RCU_GP __free_rcu() moves all waiting_for_gp elements into
> a size specific link list per bpf_mem_alloc (not per bpf_mem_cache which is per-cpu)
> and does call_rcu_tasks_trace
>
> - Let's call this list ma->free_by_rcu_tasks_trace
> (only one list for bpf_mem_alloc with known size or NUM_CACHES such lists when size == 0 at init)
>
> - any cpu alloc_bulk() can steal from size specific ma->free_by_rcu_tasks_trace list that
> is protected by ma->spin_lock (1 or NUM_CACHES such locks)
To reduce the lock contention, alloc_bulk() can steal from the global
list in batch. Had tried the global list before but I didn't do the
concurrent freeing, I think it could reduce the risk of OOM for
add_del_on_diff_cpu.
>
> - ma->waiting_for_gp_tasks_trace will be freeing elements into slab
>
> What it means that sleepable progs using hashmap will be able to avoid uaf with bpf_rcu_read_lock().
> Without explicit bpf_rcu_read_lock() it's still safe and equivalent to existing behavior of bpf_mem_alloc.
> (while your proposed BPF_MA_FREE_AFTER_RCU_GP flavor is not safe to use in hashtab with sleepable progs)
>
> After that we can unconditionally remove rcu_head/call_rcu from bpf_cpumask and improve usability of bpf_obj_drop.
> Probably usage of bpf_mem_alloc in local storage can be simplified as well.
> Martin wdyt?
>
> I think this approach adds minimal complexity to bpf_mem_alloc while solving all existing pain points
> including needs of qp-trie.
Thanks for these great suggestions. Will try to do it in v4.
next prev parent reply other threads:[~2023-05-04 1:35 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-29 10:12 [RFC bpf-next v3 0/6] Handle immediate reuse in bpf memory allocator Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 1/6] bpf: Factor out a common helper free_all() Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 2/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
2023-05-01 23:59 ` Martin KaFai Lau
2023-05-03 18:48 ` Alexei Starovoitov
2023-05-03 21:57 ` Martin KaFai Lau
2023-05-03 23:06 ` Alexei Starovoitov
2023-05-03 23:39 ` Martin KaFai Lau
2023-05-04 1:42 ` Alexei Starovoitov
2023-05-04 2:08 ` Hou Tao
2023-05-04 1:35 ` Hou Tao [this message]
2023-05-04 2:00 ` Alexei Starovoitov
2023-05-04 2:30 ` Hou Tao
2023-06-01 17:36 ` Alexei Starovoitov
2023-06-02 2:39 ` Hou Tao
2023-06-02 16:25 ` Alexei Starovoitov
2023-04-29 10:12 ` [RFC bpf-next v3 4/6] bpf: Introduce BPF_MA_FREE_AFTER_RCU_GP Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 5/6] bpf: Add two module parameters in htab for memory benchmark Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 6/6] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=986216a3-437a-5219-fd9a-341786e9264b@huaweicloud.com \
--to=houtao@huaweicloud.com \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=haoluo@google.com \
--cc=houtao1@huawei.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=martin.lau@linux.dev \
--cc=paulmck@kernel.org \
--cc=rcu@vger.kernel.org \
--cc=sdf@google.com \
--cc=song@kernel.org \
--cc=yhs@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).