Re: [RFC bpf-next v3 3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

From: Hou Tao <houtao@huaweicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	rcu@vger.kernel.org, houtao1@huawei.com
Subject: Re: [RFC bpf-next v3 3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
Date: Thu, 4 May 2023 09:35:17 +0800	[thread overview]
Message-ID: <986216a3-437a-5219-fd9a-341786e9264b@huaweicloud.com> (raw)
In-Reply-To: <20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local>

Hi,

On 5/4/2023 2:48 AM, Alexei Starovoitov wrote:
> On Sat, Apr 29, 2023 at 06:12:12PM +0800, Hou Tao wrote:
>> +
>> +static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
>> +{
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	/* In case a NMI-context bpf program is also freeing object. */
>> +	if (local_inc_return(&c->active) == 1) {
>> +		bool try_queue_work = false;
>> +
>> +		/* kworker may remove elements from prepare_reuse_head */
>> +		raw_spin_lock(&c->reuse_lock);
>> +		if (llist_empty(&c->prepare_reuse_head))
>> +			c->prepare_reuse_tail = llnode;
>> +		__llist_add(llnode, &c->prepare_reuse_head);
>> +		if (++c->prepare_reuse_cnt > c->high_watermark) {
>> +			/* Zero out prepare_reuse_cnt early to prevent
>> +			 * unnecessary queue_work().
>> +			 */
>> +			c->prepare_reuse_cnt = 0;
>> +			try_queue_work = true;
>> +		}
>> +		raw_spin_unlock(&c->reuse_lock);
>> +
>> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
>> +			/* Use reuse_cb_in_progress to indicate there is
>> +			 * inflight reuse kworker or reuse RCU callback.
>> +			 */
>> +			atomic_inc(&c->reuse_cb_in_progress);
>> +			/* Already queued */
>> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))
> As Martin pointed out queue_work() is not safe here.
> The raw_spin_lock(&c->reuse_lock); earlier is not safe either.
I see. Didn't recognize these problems.
> For the next version please drop workers and spin_lock from unit_free/alloc paths.
> If lock has to be taken it should be done from irq_work.
> Under no circumstances we can use alloc_workqueue(). No new kthreads.
Is there any reason to prohibit the use of new kthread in irq_work ?
>
> We can avoid adding new flag to bpf_mem_alloc to reduce the complexity
> and do roughly equivalent of REUSE_AFTER_RCU_GP unconditionally in the following way:
>
> - alloc_bulk() won't be trying to steal from c->free_by_rcu.
>
> - do_call_rcu() does call_rcu(&c->rcu, __free_rcu) instead of task-trace version.
No sure whether or not one inflight RCU callback is enough. Will check.
If one is not enough, I may use kmalloc(__GFP_NOWAIT) in irq work to
allocate multiple RCU callbacks.
> - rcu_trace_implies_rcu_gp() is never used.
>
> - after RCU_GP __free_rcu() moves all waiting_for_gp elements into 
>   a size specific link list per bpf_mem_alloc (not per bpf_mem_cache which is per-cpu)
>   and does call_rcu_tasks_trace
>
> - Let's call this list ma->free_by_rcu_tasks_trace
>   (only one list for bpf_mem_alloc with known size or NUM_CACHES such lists when size == 0 at init)
>
> - any cpu alloc_bulk() can steal from size specific ma->free_by_rcu_tasks_trace list that
>   is protected by ma->spin_lock (1 or NUM_CACHES such locks)
To reduce the lock contention, alloc_bulk() can steal from the global
list in batch. Had tried the global list before but I didn't do the
concurrent freeing, I think it could reduce the risk of OOM for
add_del_on_diff_cpu.
>
> - ma->waiting_for_gp_tasks_trace will be freeing elements into slab
>
> What it means that sleepable progs using hashmap will be able to avoid uaf with bpf_rcu_read_lock().
> Without explicit bpf_rcu_read_lock() it's still safe and equivalent to existing behavior of bpf_mem_alloc.
> (while your proposed BPF_MA_FREE_AFTER_RCU_GP flavor is not safe to use in hashtab with sleepable progs)
>
> After that we can unconditionally remove rcu_head/call_rcu from bpf_cpumask and improve usability of bpf_obj_drop.
> Probably usage of bpf_mem_alloc in local storage can be simplified as well.
> Martin wdyt?
>
> I think this approach adds minimal complexity to bpf_mem_alloc while solving all existing pain points
> including needs of qp-trie.
Thanks for these great suggestions. Will try to do it in v4.