Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Dave Marchevsky <davemarchevsky@meta.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Dave Marchevsky <davemarchevsky@fb.com>,
	bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Kernel Team <kernel-team@fb.com>, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
Date: Wed, 7 Dec 2022 19:51:40 -0800	[thread overview]
Message-ID: <20221208035140.skuadnybf5aqb4o5@macbook-pro-6.dhcp.thefacebook.com> (raw)
In-Reply-To: <33b0c075-3551-b57a-76e4-bc40452b3253@meta.com>

On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
> 
> Before replying to specific things in this email, I think it would be useful
> to have a subthread clearing up definitions and semantics, as I think we're
> talking past each other a bit.

Yeah. We were not on the same page.
The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
and I thought we agreed that both are not necessary and with that
I assumed that anything 'non-owning' as a concept is gone too.
So the only thing left (in my mind) was the 'owning' concept.
Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.

Please have this detailed explanation in the commit log next time to
avoid this back and forth.
Now to the fun part...

> 
> On a conceptual level I've still been using "owning reference" and "non-owning
> reference" to understand rbtree operations. I'll use those here and try to map
> them to actual verifier concepts later.
> 
> owning reference
> 
>   * This reference controls the lifetime of the pointee
>   * Ownership of pointee must be 'released' by passing it to some rbtree
>     API kfunc - rbtree_add in our case -  or via bpf_obj_drop, which free's
>     * If not released before program ends, verifier considers prog invalid
>   * Access to the memory ref is pointing at will not page fault

agree.

> non-owning reference
> 
>   * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
>     to bpf_obj_drop
>   * No control of lifetime, but can infer memory safety based on context
>     (see explanation below)
>   * Access to the memory ref is pointing at will not page fault
>     (see explanation below)

agree with addition that both read and write should be allowed into this
'non-owning' ptr.
Which breaks if you map this to something that ORs with PTR_UNTRUSTED.

> 2) From verifier's perspective non-owning references can only exist
> between spin_lock and spin_unlock. Why? After spin_unlock another program
> can do arbitrary operations on the rbtree like removing and free-ing
> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
> free'd, and reused via bpf_obj_new would point to an entirely different thing.
> Or the memory could go away.

agree that spin_unlock needs to clean up 'non-owning'.

> To prevent this logic violation all non-owning references are invalidated by
> verifier after critical section ends. This is necessary to ensure "will
> not page fault" property of non-owning reference. So if verifier hasn't
> invalidated a non-owning ref, accessing it will not page fault.
> 
> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
> if there's a valid non-owning ref, we must be in critical section, and can
> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
> and-reused.

I don't understand why is that a problem.

> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
> the tree has control of pointee lifetime. Similarly, any ref to a node
> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
> node in map_val for now)

Also not clear why such restriction is necessary.

> Moving on to rbtree API:
> 
> bpf_rbtree_add(&tree, &node);
>   'node' is an owning ref, becomes a non-owning ref.
> 
> bpf_rbtree_first(&tree);
>   retval is a non-owning ref, since first() node is still in tree
> 
> bpf_rbtree_remove(&tree, &node);
>   'node' is a non-owning ref, retval is an owning ref

agree on the above definition.

> All of the above can only be called when rbtree's lock is held, so invalidation
> of all non-owning refs on spin_unlock is fine for rbtree_remove.
> 
> Nice property of paragraph marked with 1) above is the ability to use the
> type system to prevent rbtree_add of node that's already in rbtree and
> rbtree_remove of node that's not in one. So we can forego runtime
> checking of "already in tree", "already not in tree".

I think it's easier to add runtime check inside bpf_rbtree_remove()
since it already returns MAYBE_NULL. No 'conditional release' necessary.
And with that we don't need to worry about aliases.

> But, as you and Kumar talked about in the past and referenced in patch 1's
> thread, non-owning refs may alias each other, or an owning ref, and have no
> way of knowing whether this is the case. So if X and Y are two non-owning refs
> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
> already isn't in any tree (since prog has an owning ref to it). But verifier
> doesn't know X and Y alias each other. So previous paragraph's "forego
> runtime checks" statement can only hold if we invalidate all non-owning refs
> after 'destructive' rbtree_remove operation.

right. we either invalidate all non-owning after bpf_rbtree_remove
or do run-time check in bpf_rbtree_remove.
Consider the following:
bpf_spin_lock
n = bpf_rbtree_first(root);
m = bpf_rbtree_first(root);
x = bpf_rbtree_remove(root, n)
y = bpf_rbtree_remove(root, m)
bpf_spin_unlock
if (x)
   bpf_obj_drop(x)
if (y)
   bpf_obj_drop(y)

If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
If we do run-time check the above will be accepted and will work without crashing.

The problem with release_on_unlock is that it marks 'n' after 1st remove
as UNTRUSTED which means 'no write' and 'read via probe_read'.
That's not good imo.

> 
> It doesn't matter to me which combination of type flags, ref_obj_id, other
> reg state stuff, and special-casing is used to implement owning and non-owning
> refs. Specific ones chosen in this series for rbtree node:
> 
> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>             ref_obj_id > 0
> 
> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>                 PTR_UNTRUSTED
>                   - used for "can't pass ownership", not PROBE_MEM
>                   - this is why I mentioned "decomposing UNTRUSTED into more
>                     granular reg traits" in another thread

Now I undestand, but that was very hard to grasp.
UNTRUSTED means 'no write' and 'read via probe_read'.
ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
pointing out below:
>                 ref_obj_id > 0
>                 release_on_unlock = true
>                   - used due to paragraphs starting with 2) above                

but the problem with ref_set_release_on_unlock() that it mixes real ref-d
pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0.
And the latter is a quite confusing combination in my mind,
since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.

> Any other combination of type and reg state that gives me the semantics def'd
> above works4me.
> 
> 
> Based on this reply and others from today, I think you're saying that these
> concepts should be implemented using:
> 
> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>             PTR_TRUSTED
>             ref_obj_id > 0

Almost.
I propose:
PTR_TO_BTF_ID | MEM_ALLOC  && ref_obj_id > 0

See the definition of is_trusted_reg().
It's ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)

I was saying 'trusted' because of is_trusted_reg() definition.
Sorry for confusion.

> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>                 PTR_TRUSTED
>                 ref_obj_id == 0
>                  - used for "can't pass ownership", since funcs that expect
>                    owning ref need ref_obj_id > 0

I propose:
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.

And we will be able to pass 'non-owning' under spin_lock into other kfuncs
and owning outside of spin_lock into other kfuncs.
Which is a good thing.

> And you're also adding 'untrusted' here, mainly as a result of
> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> instead of becoming a non-owning ref. 'untrusted' would have state like:
> 
> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> PTR_UNTRUSTED
> ref_obj_id == 0?

I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
or doing 'non-owning' is enough.
If it's full untrusted it will be:
PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0

tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.

> I think your "non-owning ref" definition also differs from mine, specifically
> yours doesn't seem to have "will not page fault". For this reason, you don't
> see the need for release_on_unlock logic, since that's used to prevent refs
> escaping critical section and potentially referring to free'd memory.

Not quite.
We should be able to read/write directly through
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
the way release_reference() is doing.
I'm just not happy with using acquire_reference/release_reference() logic
(as release_on_unlock is doing) for cleaning after unlock.
Since we need to clean 'non-owning' ptrs in unlock it's confusing
to call the process 'release'.
I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
every reg where 
reg->id == cur_state->active_lock.id &&
flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

By deleting relase_on_unlock I meant delete release_on_unlock flag
and remove ref_set_release_on_unlock.

> This is where I start to get confused. Some questions:
> 
>   * If we get rid of release_on_unlock, and with mass invalidation of
>     non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?

Since we'll be cleaning all
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.

The verifier already enforces that bpf_spin_unlock will be present
at the right place in bpf prog.
When the verifier sees it it will clean all non-owning refs with this spinlock 'id'.
So no concerns of leaking 'non-owning' outside.

While processing bpf_rbtree_first we need to:
regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
regs[BPF_REG_0].id = active_lock.id;
regs[BPF_REG_0].ref_obj_id = 0;

>   * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
>     in this scheme, since non-owning ref can escape spin_unlock b/c no mass
>     invalidation? PTR_UNTRUSTED isn't sufficient here

run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
should address it, no?

>   * If non-owning ref can live past spin_unlock, do we expect read from
>     such ref after _unlock to go through bpf_probe_read()? Otherwise direct
>     read might fault and silently write 0.

unlock has to clean them.

>   * For your 'untrusted', but not non-owning ref concept, I'm not sure
>     what this gives us that's better than just invalidating the ref which
>     gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)

Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one.
Untrusted will allow prog to do read only access (via probe_read) into the node
but might hide bugs.
The cleanup after bpf_spin_unlock of non-owning and clean up after
bpf_rbtree_add() does not have to be the same.
Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
and non-owning after bpf_rbtree_add.

Walking the example from previous email:

struct bpf_rbtree_iter it;
struct bpf_rb_node * node;
struct bpf_rb_node *n, *m;

bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock
while ((node = bpf_rbtree_iter_next(&it)) {
  // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
  if (node && node->field == condition) {

    n = bpf_rbtree_remove(rb_root, node);
    if (!n) ...;
    // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
    m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
    if (!m) ...;
    // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y

    // node is still:
    // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id

    // assume we allow double locks one day
    bpf_spin_lock(another_rb_root);
    bpf_rbtree_add(another_rb_root, n);
    // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id
    bpf_spin_unlock(another_rb_root);
    // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
    break;
  }
}
// node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
bpf_rbtree_iter_destroy(&it); // does unlock
// node -> PTR_TO_BTF_ID | PTR_UNTRUSTED
// n -> PTR_TO_BTF_ID | PTR_UNTRUSTED
// m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
bpf_obj_drop(m);