From: Yonghong Song <yhs@meta.com>
To: David Vernet <void@manifault.com>
Cc: Yonghong Song <yhs@fb.com>,
	bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	kernel-team@fb.com, KP Singh <kpsingh@kernel.org>,
	Martin KaFai Lau <martin.lau@kernel.org>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next v2 2/6] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Date: Sun, 23 Oct 2022 09:45:35 -0700	[thread overview]
Message-ID: <95ff1fa3-124b-6886-64e0-adcf40085e55@meta.com> (raw)
In-Reply-To: <Y1NdLah/c38isGT+@maniforge.dhcp.thefacebook.com>



On 10/21/22 8:02 PM, David Vernet wrote:
> On Fri, Oct 21, 2022 at 03:57:15PM -0700, Yonghong Song wrote:
> 
> [...]
> 
>>>>>> +	 * could be modifying the local_storage->list now.
>>>>>> +	 * Thus, no elem can be added-to or deleted-from the
>>>>>> +	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
>>>>>> +	 *
>>>>>> +	 * It is racing with bpf_local_storage_map_free() alone
>>>>>> +	 * when unlinking elem from the local_storage->list and
>>>>>> +	 * the map's bucket->list.
>>>>>> +	 */
>>>>>> +	bpf_cgrp_storage_lock();
>>>>>> +	raw_spin_lock_irqsave(&local_storage->lock, flags);
>>>>>> +	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
>>>>>> +		bpf_selem_unlink_map(selem);
>>>>>> +		free_cgroup_storage =
>>>>>> +			bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>>>>>
>>>>> This still requires a comment explaining why it's OK to overwrite
>>>>> free_cgroup_storage with a previous value from calling
>>>>> bpf_selem_unlink_storage_nolock(). Even if that is safe, this looks like
>>>>> a pretty weird programming pattern, and IMO doing this feels more
>>>>> intentional and future-proof:
>>>>>
>>>>> if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false))
>>>>> 	free_cgroup_storage = true;
>>>>
>>>> We have a comment a few lines below.
>>>>     /* free_cgroup_storage should always be true as long as
>>>>      * local_storage->list was non-empty.
>>>>      */
>>>>     if (free_cgroup_storage)
>>>> 	kfree_rcu(local_storage, rcu);
>>>
>>> IMO that comment doesn't provide much useful information -- it states an
>>> assumption, but doesn't give a reason for it.
>>>
>>>> I will add more explanation in the above code like
>>>>
>>>> 	bpf_selem_unlink_map(selem);
>>>> 	/* If the local_storage list has only one element,
>>>> 	 * bpf_selem_unlink_storage_nolock() will return true.
>>>> 	 * Otherwise, it will return false. This loop intends to
>>>> 	 * remove all local storage, so the last iteration of the
>>>> 	 * loop will set free_cgroup_storage to true.
>>>> 	 */
>>>> 	free_cgroup_storage =
>>>> 		bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>>>
>>> Thanks, this is the type of comment I was looking for.
>>>
>>> Also, I realize this was copy-pasted from a number of other possible
>>> locations in the codebase which are doing the same thing, but I still
>>> think this pattern is an odd and brittle way to do this. We're relying
>>> on an abstracted implementation detail of
>>> bpf_selem_unlink_storage_nolock() for correctness, which IMO is a signal
>>> that bpf_selem_unlink_storage_nolock() should probably be the one
>>> invoking kfree_rcu() on behalf of callers in the first place.  It looks
>>> like all of the callers end up calling kfree_rcu() on the struct
>>> bpf_local_storage * if bpf_selem_unlink_storage_nolock() returns true,
>>> so can we just move the responsibility of freeing the local storage
>>> object down into bpf_selem_unlink_storage_nolock() where it's unlinked?
>>
>> We probably cannot do this. bpf_selem_unlink_storage_nolock()
>> is inside the rcu_read_lock() region. We do kfree_rcu() outside
>> the rcu_read_lock() region.
> 
> kfree_rcu() is non-blocking and is safe to invoke from within an RCU
> read region. If you invoke it within an RCU read region, the object will
> not be kfree'd until (at least) you exit the current read region, so I
> believe that the net effect here should be the same whether it's done in
> bpf_selem_unlink_storage_nolock(), or in the caller after the RCU read
> region is exited.

Okay. We probably still want to do kfree_rcu() outside
bpf_selem_unlink_storage_nolock(), since that function's purpose is to
unlink the storage for a particular selem.

We could move
	if (free_cgroup_storage)
		kfree_rcu(local_storage, rcu);
immediately after the hlist_for_each_entry_safe() loop, but I think
putting that 'if' statement after rcu_read_unlock() is slightly better,
as it avoids adding code inside the lock region.
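
In other words, the free path keeps the shape it already has in the
patch. A condensed sketch of the intended ordering, assembled from the
code quoted above:

	rcu_read_lock();
	...
	bpf_cgrp_storage_lock();
	raw_spin_lock_irqsave(&local_storage->lock, flags);
	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
		bpf_selem_unlink_map(selem);
		/* Becomes true on the iteration that unlinks the last selem. */
		free_cgroup_storage =
			bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
	}
	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
	bpf_cgrp_storage_unlock();
	rcu_read_unlock();

	/* Outside both the lock region and the RCU read section. */
	if (free_cgroup_storage)
		kfree_rcu(local_storage, rcu);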

> 
>>> IMO this can be done in a separate patch set, if we decide it's worth
>>> doing at all.
>>>
>>>>>
>>>>>> +	}
>>>>>> +	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>>>>>> +	bpf_cgrp_storage_unlock();
>>>>>> +	rcu_read_unlock();
>>>>>> +
>>>>>> +	/* free_cgroup_storage should always be true as long as
>>>>>> +	 * local_storage->list was non-empty.
>>>>>> +	 */
>>>>>> +	if (free_cgroup_storage)
>>>>>> +		kfree_rcu(local_storage, rcu);
>>>>>> +}
>>>>>> +
>>>>>> +static struct bpf_local_storage_data *
>>>>>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
>>>>>> +{
>>>>>> +	struct bpf_local_storage *cgroup_storage;
>>>>>> +	struct bpf_local_storage_map *smap;
>>>>>> +
>>>>>> +	cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
>>>>>> +					       bpf_rcu_lock_held());
>>>>>> +	if (!cgroup_storage)
>>>>>> +		return NULL;
>>>>>> +
>>>>>> +	smap = (struct bpf_local_storage_map *)map;
>>>>>> +	return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
>>>>>> +}
>>>>>> +
>>>>>> +static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
>>>>>> +{
>>>>>> +	struct bpf_local_storage_data *sdata;
>>>>>> +	struct cgroup *cgroup;
>>>>>> +	int fd;
>>>>>> +
>>>>>> +	fd = *(int *)key;
>>>>>> +	cgroup = cgroup_get_from_fd(fd);
>>>>>> +	if (IS_ERR(cgroup))
>>>>>> +		return ERR_CAST(cgroup);
>>>>>> +
>>>>>> +	bpf_cgrp_storage_lock();
>>>>>> +	sdata = cgroup_storage_lookup(cgroup, map, true);
>>>>>> +	bpf_cgrp_storage_unlock();
>>>>>> +	cgroup_put(cgroup);
>>>>>> +	return sdata ? sdata->data : NULL;
>>>>>> +}
>>>>>
>>>>> Stanislav pointed out in the v1 revision that there's a lot of very
>>>>> similar logic in task storage, and I think you'd mentioned that you were
>>>>> going to think about generalizing some of that. Have you had a chance to
>>>>> consider?
>>>>
>>>> It is hard to have a common function for
>>>> lookup_elem/update_elem/delete_elem(). They are quite different, as
>>>> each heavily involves task/cgroup-specific functions.
>>>
>>> Yes agreed, each implementation is acquiring their own references, and
>>> finding the backing element in whatever way it was implemented, etc.
>>>
>>>> but map_alloc and map_free could have common helpers.
>>>
>>> Agreed, and many of the static functions that are invoked on those
>>> paths, such as bpf_cgrp_storage_free(), bpf_cgrp_storage_lock(), etc.,
>>> possibly could be shared as well. In general this feels like something
>>> we could pretty easily simplify using something like a structure with
>>> callbacks to implement the pieces of logic that are specific to each
>>> local storage type, such as getting the struct bpf_local_storage __rcu *
>>> pointer from some context (e.g. cgroup_storage_ptr()). It doesn't
>>> necessarily need to block this change, but IMO we should clean this up
>>> soon, because a lot of this is nearly a 100% copy-paste of other local
>>> storage implementations.
>>
>> Further refactoring is possible. Martin is working to simplify the
>> locking mechanism. We can wait for that to be done before doing the
>> refactoring.
> 
> Sounds great, thanks!
> 
> - David
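
For the record, the callback-based generalization described above could
look something along these lines. This is only a sketch of the
direction; the struct and member names below are hypothetical, not an
existing kernel interface:

	/* Hypothetical per-storage-type ops, so that the generic code in
	 * bpf_local_storage.c could share map_alloc/map_free and the
	 * surrounding lock/unlock and owner-pointer plumbing.
	 */
	struct bpf_local_storage_type_ops {
		/* Resolve the owner's storage pointer, e.g. what
		 * cgroup_storage_ptr() does for a struct cgroup.
		 */
		struct bpf_local_storage __rcu **(*storage_ptr)(void *owner);
		/* Type-specific locking, e.g. bpf_cgrp_storage_lock() and
		 * bpf_cgrp_storage_unlock() for cgroup storage.
		 */
		void (*lock)(void);
		void (*unlock)(void);
	};

Each of task, inode and cgroup local storage would then fill in its own
ops, and the shared code would call through them instead of each type
duplicating the boilerplate.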

Thread overview: 23+ messages
2022-10-20 22:12 [PATCH bpf-next v2 0/6] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-20 22:13 ` [PATCH bpf-next v2 1/6] bpf: Make struct cgroup btf id global Yonghong Song
2022-10-20 22:13 ` [PATCH bpf-next v2 2/6] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-21  5:22   ` David Vernet
2022-10-21  5:26     ` David Vernet
2022-10-21 17:33     ` Yonghong Song
2022-10-21 19:57       ` David Vernet
2022-10-21 22:57         ` Yonghong Song
2022-10-22  3:02           ` David Vernet
2022-10-23 16:45             ` Yonghong Song [this message]
2022-10-23 21:14               ` David Vernet
     [not found]   ` <202210210932.nHqTyTmx-lkp@intel.com>
2022-10-21 16:51     ` Yonghong Song
2022-10-21 19:29   ` Yosry Ahmed
2022-10-21 21:05     ` Yonghong Song
2022-10-20 22:13 ` [PATCH bpf-next v2 3/6] libbpf: Support new cgroup local storage Yonghong Song
2022-10-21 23:10   ` Andrii Nakryiko
2022-10-22  0:32     ` Yonghong Song
2022-10-22  1:05       ` Tejun Heo
2022-10-20 22:13 ` [PATCH bpf-next v2 4/6] bpftool: " Yonghong Song
2022-10-20 22:13 ` [PATCH bpf-next v2 5/6] selftests/bpf: Add selftests for " Yonghong Song
2022-10-20 22:13 ` [PATCH bpf-next v2 6/6] docs/bpf: Add documentation for map type BPF_MAP_TYPE_CGRP_STORAGE Yonghong Song
2022-10-21  7:12   ` David Vernet
2022-10-21 17:46     ` Yonghong Song
