Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure

From: Dave Marchevsky <davemarchevsky@meta.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Dave Marchevsky <davemarchevsky@fb.com>,
	bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Kernel Team <kernel-team@fb.com>, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
Date: Wed, 7 Dec 2022 20:18:25 -0500
Message-ID: <33b0c075-3551-b57a-76e4-bc40452b3253@meta.com>
In-Reply-To: <20221207230602.logjjjv3kwiiy6u3@macbook-pro-6.dhcp.thefacebook.com>

On 12/7/22 6:06 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 05:28:34PM -0500, Dave Marchevsky wrote:
>> On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
>>> On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
>>>> This series adds a rbtree datastructure following the "next-gen
>>>> datastructure" precedent set by recently-added linked-list [0]. This is
>>>> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
>>>> instead of adding a new map type. This series adds a smaller set of API
>>>> functions than that RFC - just the minimum needed to support current
>>>> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>>>>
>>>>   bpf_rbtree_add
>>>>   bpf_rbtree_remove
>>>>   bpf_rbtree_first
>>>>
>>>> [...]
>>>>
>>>> Future work:
>>>>   Enabling writes to release_on_unlock refs should be done before the
>>>>   functionality of BPF rbtree can truly be considered complete.
>>>>   Implementing this proved more complex than expected so it's been
>>>>   pushed off to a future patch.
>>>>
>>
>>>
>>> TBH, I think we need to revisit whether there's a strong need for this. I would
>>> even argue that we should simply make the release semantics of rbtree_add,
>>> list_push helpers stronger and remove release_on_unlock logic entirely,
>>> releasing the node immediately. I don't see why it is so critical to have read,
>>> and more importantly, write access to nodes after losing their ownership. And
>>> that too is only available until the lock is unlocked.
>>>
>>
>> Moved the next paragraph here to ease reply, it was the last paragraph
>> in your response.
>>
>>>
>>> Can you elaborate on actual use cases where immediate release or not having
>>> write support makes it hard or impossible to support a certain use case, so that
>>> it is easier to understand the requirements and design things accordingly?
>>>
>>
>> Sure, the main usecase and impetus behind this for me is the sched_ext work
>> Tejun and others are doing (https://lwn.net/Articles/916291/ ). One of the
>> things they'd like to be able to do is implement a CFS-like scheduler using
>> rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
>> implement complicated scheduling logic.
>>
>> If we can implement such complicated scheduling logic, but it has so much
>> BPF-specific twisting of program logic that it's incomprehensible to scheduler
>> folks, that's not great. The overlap between "BPF experts" and "scheduler
>> experts" is small, and we want the latter group to be able to read BPF
>> scheduling logic without too much struggle. Lower learning curve makes folks
>> more likely to experiment with sched_ext.
>>
>> When 'rbtree map' was in brainstorming / prototyping, non-owning reference
>> semantics were called out as moving BPF datastructures closer to their kernel
>> equivalents from a UX perspective.
> 
> Our emails crossed. See my previous email.
> Agree on the above.
> 
>> If the "it makes BPF code better resemble normal kernel code" argument was the
>> only reason to do this I wouldn't feel so strongly, but there are practical
>> concerns as well:
>>
>> If we could only read / write from rbtree node if it isn't in a tree, the common
>> operation of "find this node and update its data" would require removing and
>> re-adding it. For rbtree, these unnecessary remove and add operations could
> 
> Not really. See my previous email.
> 
>> result in unnecessary rebalancing. Going back to the sched_ext usecase,
>> if we have a rbtree with task or cgroup stats that need to be updated often,
>> unnecessary rebalancing would make this update slower than if non-owning refs
>> allowed in-place read/write of node data.
> 
> Agree. Read/write from non-owning refs is necessary.
> In the other email I'm arguing that PTR_TRUSTED with ref_obj_id == 0
> (your non-owning ref) should not be mixed with release_on_unlock logic.
> 
> KF_RELEASE should still accept as args and release only ptrs with ref_obj_id > 0.
> 
>>
>> Also, we eventually want to be able to have a node that's part of both a
>> list and rbtree. Likely adding such a node to both would require calling
>> kfunc for adding to list, and separate kfunc call for adding to rbtree.
>> Once the node has been added to list, we need some way to represent a reference
>> to that node so that we can pass it to rbtree add kfunc. Sounds like a
>> non-owning reference to me, albeit with different semantics than current
>> release_on_unlock.
> 
> A node with both link list and rbtree would be a new concept.
> We'd need to introduce 'struct bpf_refcnt' and make sure prog does the right thing.
> That's a future discussion.
> 
>>
>>> I think this relaxed release logic and write support is the wrong direction to
>>> take, as it has a direct bearing on what can be done with a node inside the
>>> critical section. There's already the problem with not being able to do
>>> bpf_obj_drop easily inside the critical section with this. That might be useful
>>> for draining operations while holding the lock.
>>>
>>
>> The bpf_obj_drop case is similar to your "can't pass non-owning reference
>> to bpf_rbtree_remove" concern from patch 1's thread. If we have:
>>
>>   n = bpf_obj_new(...); // n is owning ref
>>   bpf_rbtree_add(&tree, &n->node); // n is non-owning ref
> 
> what I proposed in the other email...
> n should be untrusted here.
> That's != 'n is non-owning ref'
> 
>>   res = bpf_rbtree_first(&tree);
>>   if (!res) {...}
>>   m = container_of(res, struct node_data, node); // m is non-owning ref
> 
> agree. m == PTR_TRUSTED with ref_obj_id == 0.
> 
>>   res = bpf_rbtree_remove(&tree, &n->node);
> 
> a typo here? Did you mean 'm->node' ?
> 
> and after 'if (res)' ...
>>   n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory
> 
> agree. n -> ref_obj_id > 0
> 
>>   bpf_obj_drop(n);
> 
> above is ok to do.
> 'n' becomes UNTRUSTED or invalid.
> 
>>   // Not safe to use m anymore
> 
> 'm' should have become UNTRUSTED after bpf_rbtree_remove.
> 
>> Datastructures which support bpf_obj_drop in the critical section can
>> do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
>> references after bpf_obj_drop.
> 
> 'invalidate all' sounds suspicious.
> I don't think we need to do sweeping search after bpf_obj_drop.
> 
>> Then there's no potential use-after-free.
>> (For the above example, pretend bpf_rbtree_remove didn't already invalidate
>> 'm', or that there's some other way to obtain non-owning ref to 'n''s node
>> after rbtree_remove)
>>
>> I think that, in practice, operations where the BPF program wants to remove
>> / delete nodes will be distinct from operations where program just wants to 
>> obtain some non-owning refs and do read / write. At least for sched_ext usecase
>> this is true. So all the additional clobbers won't require program writer
>> to do special workarounds to deal with verifier in the common case.
>>
>>> Semantically in other languages, once you move an object, accessing it is
>>> usually a bug, and in most of the cases it is sufficient to prepare it before
>>> insertion. We are certainly in the same territory here with these APIs.
>>
>> Sure, but 'add'/'remove' for these intrusive linked datastructures is
>> _not_ a 'move'. Obscuring this from the user and forcing them to use
>> less performant patterns for the sake of some verifier complexity, or desire
>> to mimic semantics of languages w/o reference stability, doesn't make sense to
>> me.
> 
> I agree, but everything we discuss in the above looks orthogonal
> to release_on_unlock that myself and Kumar are proposing to drop.
> 
>> If we were to add some datastructures without reference stability, sure, let's
>> not do non-owning references for those. So let's make this non-owning reference
>> stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
>> which will coincidentally make it very easy to remove if we later decide that
>> the complexity isn't worth it. 
> 
> You mean KF_RELEASE_NON_OWN would be applied to bpf_rbtree_remove() ?
> So it accepts PTR_TRUSTED ref_obj_id == 0 arg and makes it PTR_UNTRUSTED ?
> If so then I agree. The 'release' part of the name was confusing.
> It's also not clear which arg it applies to.
> bpf_rbtree_remove has two args. Both are PTR_TRUSTED.
> I wouldn't introduce a new flag for this just yet.
> We can hard code bpf_rbtree_remove, bpf_list_pop for now
> or use our name suffix hack.

Before replying to specific things in this email, I think it would be useful
to have a subthread clearing up definitions and semantics, as I think we're
talking past each other a bit.

On a conceptual level I've still been using "owning reference" and "non-owning
reference" to understand rbtree operations. I'll use those here and try to map
them to actual verifier concepts later.

owning reference

  * This reference controls the lifetime of the pointee
  * Ownership of pointee must be 'released' by passing it to some rbtree
    API kfunc - rbtree_add in our case - or via bpf_obj_drop, which frees it
    * If not released before program ends, verifier considers prog invalid
  * Access to the memory ref is pointing at will not page fault

non-owning reference

  * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
    to bpf_obj_drop
  * No control of lifetime, but can infer memory safety based on context
    (see explanation below)
  * Access to the memory ref is pointing at will not page fault
    (see explanation below)

2) From verifier's perspective non-owning references can only exist
between spin_lock and spin_unlock. Why? After spin_unlock another program
can do arbitrary operations on the rbtree, like removing a node and freeing
it via bpf_obj_drop. A non-owning ref to some chunk of memory that was
remove'd, free'd, and reused via bpf_obj_new would point to an entirely
different thing. Or the memory could go away.

To prevent this logic violation all non-owning references are invalidated by
verifier after critical section ends. This is necessary to ensure "will
not page fault" property of non-owning reference. So if verifier hasn't
invalidated a non-owning ref, accessing it will not page fault.

Currently bpf_obj_drop is not allowed in the critical section, so similarly,
if there's a valid non-owning ref, we must be in critical section, and can
conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
and-reused.
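
To make the critical-section rule concrete, here's a rough userspace C sketch
of it. This is a toy model, not verifier code: struct cs_model and all the
cs_* names are made up for illustration, and it tracks only a single
non-owning ref.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; struct and function names are hypothetical. */
struct cs_model {
	bool lock_held;        /* between spin_lock and spin_unlock? */
	bool non_owning_valid; /* is the one tracked non-owning ref usable? */
};

static void cs_lock(struct cs_model *m)
{
	m->lock_held = true;
}

/* A non-owning ref can only come into existence while the lock is held. */
static bool cs_get_non_owning(struct cs_model *m)
{
	if (!m->lock_held)
		return false;
	m->non_owning_valid = true;
	return true;
}

/* bpf_obj_drop is currently not allowed inside the critical section. */
static bool cs_obj_drop_allowed(const struct cs_model *m)
{
	return !m->lock_held;
}

/* Leaving the critical section invalidates the non-owning ref. */
static void cs_unlock(struct cs_model *m)
{
	m->lock_held = false;
	m->non_owning_valid = false;
}

/* Access is only permitted through a ref that hasn't been invalidated. */
static bool cs_can_access(const struct cs_model *m)
{
	return m->non_owning_valid;
}
```

In this model a valid non-owning ref implies the lock is held, and holding the
lock implies no drop can have happened, which is the "will not page fault"
argument in miniature.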

1) Any reference to a node that is in a rbtree _must_ be non-owning, since
the tree has control of pointee lifetime. Similarly, any ref to a node
that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
node in map_val for now)

Moving on to rbtree API:

bpf_rbtree_add(&tree, &node);
  'node' is an owning ref, becomes a non-owning ref.

bpf_rbtree_first(&tree);
  retval is a non-owning ref, since first() node is still in tree

bpf_rbtree_remove(&tree, &node);
  'node' is a non-owning ref, retval is an owning ref

All of the above can only be called when rbtree's lock is held, so invalidation
of all non-owning refs on spin_unlock is fine for rbtree_remove.

Nice property of paragraph marked with 1) above is the ability to use the
type system to prevent rbtree_add of node that's already in rbtree and
rbtree_remove of node that's not in one. So we can forego runtime
checking of "already in tree", "already not in tree".
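
As a rough illustration of that "type system" argument, the transitions for a
single reference can be written as a small state machine. Again this is a
userspace toy model with made-up names (ref_state, model_rbtree_*), not how
the verifier actually tracks register state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one reference's state; NOT verifier code. */
enum ref_state { REF_INVALID, REF_OWNING, REF_NON_OWNING };

/* rbtree_add: needs an owning ref, which becomes a non-owning ref. */
static bool model_rbtree_add(enum ref_state *node)
{
	if (*node != REF_OWNING)
		return false; /* rejected: node might already be in a tree */
	*node = REF_NON_OWNING;
	return true;
}

/* rbtree_first: retval is non-owning, or no ref at all for an empty tree. */
static enum ref_state model_rbtree_first(bool tree_empty)
{
	return tree_empty ? REF_INVALID : REF_NON_OWNING;
}

/*
 * rbtree_remove: needs a non-owning ref, retval is an owning ref.
 * The argument itself is invalidated in this model.
 */
static enum ref_state model_rbtree_remove(enum ref_state *node)
{
	if (*node != REF_NON_OWNING)
		return REF_INVALID; /* rejected: node might not be in a tree */
	*node = REF_INVALID;
	return REF_OWNING;
}
```

A double add or a remove of a node that isn't in a tree is rejected purely by
looking at the state of the argument, with no runtime check on the node.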

But, as you and Kumar talked about in the past and referenced in patch 1's
thread, non-owning refs may alias each other, or an owning ref, and the
verifier has no way of knowing whether this is the case. So if X and Y are two
non-owning refs that alias each other, and bpf_rbtree_remove(tree, X) is
called, a subsequent call to bpf_rbtree_remove(tree, Y) would try to remove a
node which is no longer in any tree (since prog now has an owning ref to it).
But verifier doesn't know X and Y alias each other. So previous paragraph's
"forego runtime checks" statement can only hold if we invalidate all non-owning
refs after 'destructive' rbtree_remove operation.
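
The X / Y aliasing case can be sketched the same way. This is a userspace toy
model under the assumption that the program state is just a handful of
register slots; model_state, model_remove and invalidate_non_owning are
hypothetical names, not anything in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ref_kind { REF_INVALID, REF_OWNING, REF_NON_OWNING };

#define NREGS 4

/* Hypothetical stand-in for per-program state: one slot per register. */
struct model_state {
	enum ref_kind reg[NREGS];
};

/* Invalidate every non-owning ref; owning refs are left alone. */
static void invalidate_non_owning(struct model_state *st)
{
	for (size_t i = 0; i < NREGS; i++)
		if (st->reg[i] == REF_NON_OWNING)
			st->reg[i] = REF_INVALID;
}

/*
 * Model of reg[dst] = bpf_rbtree_remove(tree, reg[arg]).
 * Returns false if the model rejects the call.
 */
static bool model_remove(struct model_state *st, size_t arg, size_t dst)
{
	if (arg >= NREGS || dst >= NREGS || st->reg[arg] != REF_NON_OWNING)
		return false;
	/* Any non-owning ref might alias reg[arg], so all of them go. */
	invalidate_non_owning(st);
	st->reg[dst] = REF_OWNING;
	return true;
}
```

With X in slot 0 and Y in slot 1, removing via X invalidates Y as well, so the
second remove is rejected without the model ever needing to know whether X and
Y actually alias.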

It doesn't matter to me which combination of type flags, ref_obj_id, other
reg state stuff, and special-casing is used to implement owning and non-owning
refs. Specific ones chosen in this series for rbtree node:

owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
            ref_obj_id > 0

non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
                PTR_UNTRUSTED
                  - used for "can't pass ownership", not PROBE_MEM
                  - this is why I mentioned "decomposing UNTRUSTED into more
                    granular reg traits" in another thread
                ref_obj_id > 0
                release_on_unlock = true
                  - used due to paragraphs starting with 2) above                

Any other combination of type and reg state that gives me the semantics def'd
above works4me.

Based on this reply and others from today, I think you're saying that these
concepts should be implemented using:

owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
            PTR_TRUSTED
            ref_obj_id > 0

non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
                PTR_TRUSTED
                ref_obj_id == 0
                 - used for "can't pass ownership", since funcs that expect
                   owning ref need ref_obj_id > 0

And you're also adding 'untrusted' here, mainly as a result of
bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
instead of becoming a non-owning ref. 'untrusted' would have state like:

PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
PTR_UNTRUSTED
ref_obj_id == 0?

I think your "non-owning ref" definition also differs from mine, specifically
yours doesn't seem to have "will not page fault". For this reason, you don't
see the need for release_on_unlock logic, since that's used to prevent refs
escaping critical section and potentially referring to free'd memory.

This is where I start to get confused. Some questions:

  * If we get rid of release_on_unlock, and with it mass invalidation of
    non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?

  * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
    in this scheme, since non-owning ref can escape spin_unlock b/c no mass
    invalidation? PTR_UNTRUSTED isn't sufficient here

  * If non-owning ref can live past spin_unlock, do we expect read from
    such ref after _unlock to go through bpf_probe_read()? Otherwise direct
    read might fault and silently write 0.

  * For your 'untrusted', but not non-owning ref concept, I'm not sure
    what this gives us that's better than just invalidating the ref which
    gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)

I'm also not sure if you agree with my paragraph marked 1) above. But IMO the
release_on_unlock difference, and the perhaps-differing non-owning ref concept
are where we're really talking past each other.