From: Dave Marchevsky <davemarchevsky@meta.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Dave Marchevsky <davemarchevsky@fb.com>,
	bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Kernel Team <kernel-team@fb.com>, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
Date: Thu, 8 Dec 2022 18:35:24 -0500	[thread overview]
Message-ID: <3e6af95f-1d8b-aaf9-7e65-002b8fff19b6@meta.com> (raw)
In-Reply-To: <20221208203654.zwwqxzjhx563d3z3@macbook-pro-6.dhcp.thefacebook.com>

On 12/8/22 3:36 PM, Alexei Starovoitov wrote:
> On Thu, Dec 08, 2022 at 06:27:29PM +0530, Kumar Kartikeya Dwivedi wrote:
>>
>> I don't mind using active_lock.id for invalidation, but using reg->id to
>> associate it with the reg is a bad idea IMO. It's already preserved and set
>> when the object has a bpf_spin_lock in it, and it would allow doing
>> bpf_spin_unlock with that non-owning ref if it has a spin lock, essentially
>> unlocking a different spin lock whenever the reg->btf of the already-locked
>> spin lock reg is the same, due to the same active_lock.id.
> 
> Right. Overwriting reg->id was a bad idea.
> 
>> Even if you prevent it somehow it's more confusing to overload reg->id again for
>> this purpose.
>>
>> It makes more sense to introduce a new nonref_obj_id instead dedicated for this
>> purpose, to associate it back to the reg->id of the collection it is coming from.
> 
> nonref_obj_id name sounds too generic and I'm not sure that it shouldn't be
> connected to reg->id the way we do it for ref_obj_id.
> 
>> Also, there are two cases of invalidation, one is on remove from rbtree, which
>> should only invalidate non-owning references into the rbtree, and one is on
>> unlock, which should invalidate all non-owning references.
> 
> Two cases only if we're going to do invalidation on rbtree_remove.
> 
>> bpf_rbtree_remove shouldn't invalidate non-owning refs into a list protected
>> by the same lock, but unlocking should do it for both rbtree and list
>> non-owning refs it is protecting.
>>
>> So it seems you will have to maintain two IDs for non-owning references, one
>> for the collection it comes from, and one for the lock region it is obtained in.
> 
> Right. Like this ?
> collection_id = rbroot->reg->id; // to track the collection it came from
> active_lock_id = cur_state->active_lock.id // to track the lock region
> 
> but before we proceed let me demonstrate an example where
> cleanup on rbtree_remove is not user friendly:
> 
> bpf_spin_lock
> x = bpf_list_first(); if (!x) ..
> y = bpf_list_last(); if (!y) ..
> 
> n = bpf_list_remove(x); if (!n) ..
> 
> bpf_list_add_after(n, y); // we should allow this
> bpf_spin_unlock
> 
> We don't have such apis right now.
> The point here is that cleanup after bpf_list_remove/bpf_rbtree_remove will
> destroy all regs that point somewhere in the collection.
> This way we save a run-time check in bpf_rbtree_remove, but sacrifice usability.
> 
> x and y could be pointing to the same thing.
> In such case bpf_list_add_after() should fail in runtime after discovering
> that 'y' is unlinked.
> 
> Similarly with bpf_rbtree_add().
> Currently it cannot fail. It takes owning ref and will release it.
> We can mark it as KF_RELEASE and no extra verifier changes necessary.
> 
> But in the future we might have failing add/insert operations on lists and rbtree.
> If they're failing we'd need to struggle with 'conditional release' verifier additions,
> the bpf prog would need to check return value, etc.
> 
> I think we better deal with it in run-time.
> The verifier could supply bpf_list_add_after() with two hidden args:
> - container_of offset (delta between rb_node and beginning of prog's struct)
> - struct btf_struct_meta *meta
> Then inside bpf_list_add_after, or any failing KF_RELEASE kfunc,
> it can call bpf_obj_drop_impl() on that element.
> Then from the verifier pov the KF_RELEASE function did the release
> and 'owning ref' became 'non-owning ref'.
> 
>>>>> And you're also adding 'untrusted' here, mainly as a result of
>>>>> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
>>>>> instead of becoming a non-owning ref. 'untrusted' would have state like:
>>>>>
>>>>> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>>>>> PTR_UNTRUSTED
>>>>> ref_obj_id == 0?
>>>>
>>>> I'm not sure whether we really need full untrusted after going through
>>>> bpf_rbtree_add(), or whether 'non-owning' is enough.
>>>> If it's full untrusted it will be:
>>>> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
>>>>
>>>
>>> Yeah, I don't see what this "full untrusted" is giving us either. Let's have
>>> "cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
>>> instead of converting to "full untrusted"?
>>>
>>
>> +1, I prefer invalidating completely on unlock.
> 
> fine by me.
> 
>>
>> I think it's better to clean by invalidating. We have better tools to form
>> untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs
>> such an escape hatch for some reason. It's also easier to review where an
>> untrusted pointer is being used in a program, and has zero cost at runtime.
> 
> ok. Since it's more strict we can relax to untrusted later if necessary.
> 
>> So far I'm leaning towards:
>>
>> bpf_rbtree_add(node) : node becomes non-owning ref
>> bpf_spin_unlock(lock) : node is invalidated
> 
> ok
> 
>>>> Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
>>>> and non-owning after bpf_rbtree_add.
>>>>
>>>> Walking the example from previous email:
>>>>
>>>> struct bpf_rbtree_iter it;
>>>> struct bpf_rb_node * node;
>>>> struct bpf_rb_node *n, *m;
>>>>
>>>> bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree; works as bpf_spin_lock
>>>> while ((node = bpf_rbtree_iter_next(&it)) {
>>>>   // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
>>>>   if (node && node->field == condition) {
>>>>
>>>>     n = bpf_rbtree_remove(rb_root, node);
>>>>     if (!n) ...;
>>>>     // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
>>>>     m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
>>>>     if (!m) ...;
>>>>     // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
>>>>
>>
>> This second remove I would simply disallow as Dave is suggesting during
>> verification, by invalidating non-owning refs for rb_root.
> 
> Looks like cleanup from non-owning to untrusted|unknown on bpf_rbtree_remove is
> our only remaining disagreement.
> I feel run-time checks will be fast enough and will improve usability.
> 
> Also it feels that not doing cleanup on rbtree_remove is simpler to
> implement and reason about.
> 
> Here is the proposal with one new field 'active_lock_id':
> 
> first = bpf_rbtree_first(root) KF_RET_NULL
>   check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
>   R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
>   R0->active_lock_id = root->reg->id
>   R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog.
> 
> same way we can add rb_find, rb_find_first,
> but not rb_next, rb_prev, since they don't have 'root' argument.
> 
> bpf_rbtree_add(root, node, cb); KF_RELEASE.
>   needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
>   check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
>   calls release_reference(node->ref_obj_id)
>   converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
>   node->active_lock_id = root->reg->id
> 
> 'node' is equivalent to 'first'. They both point to some element
> inside rbtree and valid inside spin_locked region.
> It's ok to read|write to both under lock.
> 
> removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
>   needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and the
>   usual check_reg_allocation_locked(root)
>   R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL
>   R0->ref_obj_id = R0->id = acquire_reference_state();
>   R0->active_lock_id should stay 0
>   mark_reg_unknown(node)
> 
> bpf_spin_unlock(lock);
>   checks lock->id == cur->active_lock.id
>   for all regs in state 
>     if (reg->active_lock_id == lock->id)
>        mark_reg_unknown(reg)

OK, so it sounds like a few more points of agreement, regardless of whether
we go the runtime-checking route or the other one:

  * We're tossing 'full untrusted' for now. non-owning references will not be
    allowed to escape critical section. They'll be clobbered w/
    mark_reg_unknown.
    * No pressing need to make bpf_obj_drop callable from critical section.
      As a result no owning or non-owning ref access can page fault.

  * When spin_lock is unlocked, verifier needs to know about all non-owning
    references so that it can clobber them. Current implementation -
    ref_obj_id + release_on_unlock - is bad for a number of reasons, should
    be replaced with something that doesn't use ref_obj_id or reg->id.
    * Specific better approach was proposed above: new field + keep track
      of lock and datastructure identity.


Differences in proposed approaches:

"Type System checks + invalidation on 'destructive' rbtree ops"

  * This approach tries to prevent aliasing problems by invalidating
    non-owning refs after 'destructive' rbtree ops - like rbtree_remove -
    in addition to invalidation on spin_unlock

  * Type system guarantees invariants:
    * "if it's an owning ref, the node is guaranteed to not be in an rbtree"
    * "if it's a non-owning ref, the node is guaranteed to be in an rbtree"

  * Downside: mass non-owning ref invalidation on rbtree_remove will cause some
    programs that logically don't have an aliasing problem to be rejected by
    the verifier. Will affect usability depending on how bad this is.


"Runtime checks + spin_unlock invalidation only"

  * This approach allows for the possibility of the aliasing problem. As a result
    the invariants guaranteed in point 2 above don't necessarily hold.
    * Helpers that add or remove need to account for the possibility that the
      node they're operating on has already been added / removed. Need to check
      this at runtime and nop if so.

  * non-owning refs are only invalidated on spin_unlock.
    * As a result, usability issues of previous approach don't happen here.

  * Downside: Need to do runtime checks, some additional verifier complexity
    to deal with "runtime check failed" case due to prev approach's invariant
    not holding

Conversion of non-owning refs to 'untrusted' at an invalidation point (unlock
or remove) can be added to either approach (maybe - at least it was specifically
discussed for "runtime checks"). Such untrusted refs, by virtue of being
PTR_UNTRUSTED, can fault, and aren't accepted by rbtree_{add, remove} as input.
For the "type system" approach this might ameliorate some of the usability
issues. For the "runtime checks" approach it would only be useful to let
such refs escape spin_unlock.

But we're not going to do non-owning -> 'untrusted' for now, just listing for
completeness.


The distance between what I have now and the "type system" approach is smaller
than the distance to the "runtime checks" approach. To get from "type system"
to "runtime checks" I'd need to:

  * Remove 'destructive op' invalidation points
  * Add runtime checks to rbtree_{add,remove}
  * Add verifier handling of runtime check failure possibility

Of which only the first point is getting rid of something added for the
"type system" approach, and won't be much work relative to all the refactoring
and other improvements that are common between the two approaches.

So for V2 I will do the "type system + invalidation on 'destructive' ops"
approach, as it'll take less time. This'll get eyes on the common improvements
faster. Then I can do a "runtime checks" v3 and we can compare usability of
both on the same base.
