Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure

From: Dave Marchevsky <davemarchevsky@meta.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Dave Marchevsky <davemarchevsky@fb.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Kernel Team <kernel-team@fb.com>, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
Date: Wed, 7 Dec 2022 17:28:34 -0500	[thread overview]
Message-ID: <5756f37f-c61a-71e1-5559-e6e009b606d6@meta.com> (raw)
In-Reply-To: <20221207193616.y7n4lmufztjsq6tr@apollo>

On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
>> This series adds a rbtree datastructure following the "next-gen
>> datastructure" precedent set by recently-added linked-list [0]. This is
>> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
>> instead of adding a new map type. This series adds a smaller set of API
>> functions than that RFC - just the minimum needed to support current
>> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>>
>>   bpf_rbtree_add
>>   bpf_rbtree_remove
>>   bpf_rbtree_first
>>
>> [...]
>>
>> Future work:
>>   Enabling writes to release_on_unlock refs should be done before the
>>   functionality of BPF rbtree can truly be considered complete.
>>   Implementing this proved more complex than expected so it's been
>>   pushed off to a future patch.
>>

> 
> TBH, I think we need to revisit whether there's a strong need for this. I would
> even argue that we should simply make the release semantics of rbtree_add,
> list_push helpers stronger and remove release_on_unlock logic entirely,
> releasing the node immediately. I don't see why it is so critical to have read,
> and more importantly, write access to nodes after losing their ownership. And
> that too is only available until the lock is unlocked.
> 

Moved the next paragraph here to ease reply, it was the last paragraph
in your response.

> 
> Can you elaborate on actual use cases where immediate release or not having
> write support makes it hard or impossible to support a certain use case, so that
> it is easier to understand the requirements and design things accordingly?
>

Sure, the main usecase and impetus behind this for me is the sched_ext work
Tejun and others are doing (https://lwn.net/Articles/916291/). One of the
things they'd like to be able to do is implement a CFS-like scheduler using
rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
implement complicated scheduling logic.

If we can implement such complicated scheduling logic, but it has so much
BPF-specific twisting of program logic that it's incomprehensible to scheduler
folks, that's not great. The overlap between "BPF experts" and "scheduler
experts" is small, and we want the latter group to be able to read BPF
scheduling logic without too much struggle. Lower learning curve makes folks
more likely to experiment with sched_ext.

When 'rbtree map' was in brainstorming / prototyping, non-owning reference
semantics were called out as moving BPF datastructures closer to their kernel
equivalents from a UX perspective.

If the "it makes BPF code better resemble normal kernel code" argumentwas the
only reason to do this I wouldn't feel so strongly, but there are practical
concerns as well:

If we could only read / write from rbtree node if it isn't in a tree, the common
operation of "find this node and update its data" would require removing and
re-adding it. For rbtree, these unnecessary remove and add operations could
result in unnecessary rebalancing. Going back to the sched_ext usecase,
if we have a rbtree with task or cgroup stats that need to be updated often,
unnecessary rebalancing would make this update slower than if non-owning refs
allowed in-place read/write of node data.

Also, we eventually want to be able to have a node that's part of both a
list and rbtree. Likely adding such a node to both would require calling
kfunc for adding to list, and separate kfunc call for adding to rbtree.
Once the node has been added to list, we need some way to represent a reference
to that node so that we can pass it to rbtree add kfunc. Sounds like a
non-owning reference to me, albeit with different semantics than current
release_on_unlock.

> I think this relaxed release logic and write support is the wrong direction to
> take, as it has a direct bearing on what can be done with a node inside the
> critical section. There's already the problem with not being able to do
> bpf_obj_drop easily inside the critical section with this. That might be useful
> for draining operations while holding the lock.
> 

The bpf_obj_drop case is similar to your "can't pass non-owning reference
to bpf_rbtree_remove" concern from patch 1's thread. If we have:

  n = bpf_obj_new(...); // n is owning ref
  bpf_rbtree_add(&tree, &n->node); // n is non-owning ref

  res = bpf_rbtree_first(&tree);
  if (!res) {...}
  m = container_of(res, struct node_data, node); // m is non-owning ref

  res = bpf_rbtree_remove(&tree, &n->node);
  n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory

  bpf_obj_drop(n);
  // Not safe to use m anymore

Datastructures which support bpf_obj_drop in the critical section can
do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
references after bpf_obj_drop. Then there's no potential use-after-free.
(For the above example, pretend bpf_rbtree_remove didn't already invalidate
'm', or that there's some other way to obtain non-owning ref to 'n''s node
after rbtree_remove)

I think that, in practice, operations where the BPF program wants to remove
/ delete nodes will be distinct from operations where program just wants to 
obtain some non-owning refs and do read / write. At least for sched_ext usecase
this is true. So all the additional clobbers won't require program writer
to do special workarounds to deal with verifier in the common case.

> Semantically in other languages, once you move an object, accessing it is
> usually a bug, and in most of the cases it is sufficient to prepare it before
> insertion. We are certainly in the same territory here with these APIs.

Sure, but 'add'/'remove' for these intrusive linked datastructures is
_not_ a 'move'. Obscuring this from the user and forcing them to use
less performant patterns for the sake of some verifier complexity, or desire
to mimic semantics of languages w/o reference stability, doesn't make sense to
me.

If we were to add some datastructures without reference stability, sure, let's
not do non-owning references for those. So let's make this non-owning reference
stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
which will coincidentally make it very easy to remove if we later decide that
the complexity isn't worth it.