bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hou Tao <houtao@huaweicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	rcu@vger.kernel.org, houtao1@huawei.com
Subject: Re: [RFC bpf-next v3 3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
Date: Thu, 4 May 2023 09:35:17 +0800	[thread overview]
Message-ID: <986216a3-437a-5219-fd9a-341786e9264b@huaweicloud.com> (raw)
In-Reply-To: <20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local>

Hi,

On 5/4/2023 2:48 AM, Alexei Starovoitov wrote:
> On Sat, Apr 29, 2023 at 06:12:12PM +0800, Hou Tao wrote:
>> +
>> +static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
>> +{
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	/* In case a NMI-context bpf program is also freeing object. */
>> +	if (local_inc_return(&c->active) == 1) {
>> +		bool try_queue_work = false;
>> +
>> +		/* kworker may remove elements from prepare_reuse_head */
>> +		raw_spin_lock(&c->reuse_lock);
>> +		if (llist_empty(&c->prepare_reuse_head))
>> +			c->prepare_reuse_tail = llnode;
>> +		__llist_add(llnode, &c->prepare_reuse_head);
>> +		if (++c->prepare_reuse_cnt > c->high_watermark) {
>> +			/* Zero out prepare_reuse_cnt early to prevent
>> +			 * unnecessary queue_work().
>> +			 */
>> +			c->prepare_reuse_cnt = 0;
>> +			try_queue_work = true;
>> +		}
>> +		raw_spin_unlock(&c->reuse_lock);
>> +
>> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
>> +			/* Use reuse_cb_in_progress to indicate there is
>> +			 * inflight reuse kworker or reuse RCU callback.
>> +			 */
>> +			atomic_inc(&c->reuse_cb_in_progress);
>> +			/* Already queued */
>> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))
> As Martin pointed out queue_work() is not safe here.
> The raw_spin_lock(&c->reuse_lock); earlier is not safe either.
I see. Didn't recognize these problems.
> For the next version please drop workers and spin_lock from unit_free/alloc paths.
> If lock has to be taken it should be done from irq_work.
> Under no circumstances we can use alloc_workqueue(). No new kthreads.
Is there any reason to prohibit the use of new kthread in irq_work ?
>
> We can avoid adding new flag to bpf_mem_alloc to reduce the complexity
> and do roughly equivalent of REUSE_AFTER_RCU_GP unconditionally in the following way:
>
> - alloc_bulk() won't be trying to steal from c->free_by_rcu.
>
> - do_call_rcu() does call_rcu(&c->rcu, __free_rcu) instead of task-trace version.
No sure whether or not one inflight RCU callback is enough. Will check.
If one is not enough, I may use kmalloc(__GFP_NOWAIT) in irq work to
allocate multiple RCU callbacks.
> - rcu_trace_implies_rcu_gp() is never used.
>
> - after RCU_GP __free_rcu() moves all waiting_for_gp elements into 
>   a size specific link list per bpf_mem_alloc (not per bpf_mem_cache which is per-cpu)
>   and does call_rcu_tasks_trace
>
> - Let's call this list ma->free_by_rcu_tasks_trace
>   (only one list for bpf_mem_alloc with known size or NUM_CACHES such lists when size == 0 at init)
>
> - any cpu alloc_bulk() can steal from size specific ma->free_by_rcu_tasks_trace list that
>   is protected by ma->spin_lock (1 or NUM_CACHES such locks)
To reduce the lock contention, alloc_bulk() can steal from the global
list in batch. Had tried the global list before but I didn't do the
concurrent freeing, I think it could reduce the risk of OOM for
add_del_on_diff_cpu.
>
> - ma->waiting_for_gp_tasks_trace will be freeing elements into slab
>
> What it means that sleepable progs using hashmap will be able to avoid uaf with bpf_rcu_read_lock().
> Without explicit bpf_rcu_read_lock() it's still safe and equivalent to existing behavior of bpf_mem_alloc.
> (while your proposed BPF_MA_FREE_AFTER_RCU_GP flavor is not safe to use in hashtab with sleepable progs)
>
> After that we can unconditionally remove rcu_head/call_rcu from bpf_cpumask and improve usability of bpf_obj_drop.
> Probably usage of bpf_mem_alloc in local storage can be simplified as well.
> Martin wdyt?
>
> I think this approach adds minimal complexity to bpf_mem_alloc while solving all existing pain points
> including needs of qp-trie.
Thanks for these great suggestions. Will try to do it in v4.


  parent reply	other threads:[~2023-05-04  1:35 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-29 10:12 [RFC bpf-next v3 0/6] Handle immediate reuse in bpf memory allocator Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 1/6] bpf: Factor out a common helper free_all() Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 2/6] bpf: Pass bitwise flags to bpf_mem_alloc_init() Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP Hou Tao
2023-05-01 23:59   ` Martin KaFai Lau
2023-05-03 18:48   ` Alexei Starovoitov
2023-05-03 21:57     ` Martin KaFai Lau
2023-05-03 23:06       ` Alexei Starovoitov
2023-05-03 23:39         ` Martin KaFai Lau
2023-05-04  1:42           ` Alexei Starovoitov
2023-05-04  2:08           ` Hou Tao
2023-05-04  1:35     ` Hou Tao [this message]
2023-05-04  2:00       ` Alexei Starovoitov
2023-05-04  2:30         ` Hou Tao
2023-06-01 17:36           ` Alexei Starovoitov
2023-06-02  2:39             ` Hou Tao
2023-06-02 16:25               ` Alexei Starovoitov
2023-04-29 10:12 ` [RFC bpf-next v3 4/6] bpf: Introduce BPF_MA_FREE_AFTER_RCU_GP Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 5/6] bpf: Add two module parameters in htab for memory benchmark Hou Tao
2023-04-29 10:12 ` [RFC bpf-next v3 6/6] selftests/bpf: Add benchmark for bpf memory allocator Hou Tao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=986216a3-437a-5219-fd9a-341786e9264b@huaweicloud.com \
    --to=houtao@huaweicloud.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=haoluo@google.com \
    --cc=houtao1@huawei.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=paulmck@kernel.org \
    --cc=rcu@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).