From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: Dave Marchevsky <davemarchevsky@fb.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs
Date: Sat, 22 Apr 2023 04:03:45 +0200 [thread overview]
Message-ID: <d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2> (raw)
In-Reply-To: <20230415201811.343116-1-davemarchevsky@fb.com>
On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> This series adds support for refcounted local kptrs to the verifier. A local
> kptr is 'refcounted' if its type contains a struct bpf_refcount field:
>
> struct refcounted_node {
> long data;
> struct bpf_list_node ll;
> struct bpf_refcount ref;
> };
>
> bpf_refcount is used to implement shared ownership for local kptrs.
>
> Motivating usecase
> ==================
>
> If a struct has two collection node fields, e.g.:
>
> struct node {
> long key;
> long val;
> struct bpf_rb_node rb;
> struct bpf_list_node ll;
> };
>
> It's not currently possible to add a node to both the list and rbtree:
>
> long bpf_prog(void *ctx)
> {
> struct node *n = bpf_obj_new(typeof(*n));
> if (!n) { /* ... */ }
>
> bpf_spin_lock(&lock);
>
> bpf_list_push_back(&head, &n->ll);
> bpf_rbtree_add(&root, &n->rb, less); /* Assume a resonable less() */
> bpf_spin_unlock(&lock);
> }
>
> The above program will fail verification due to current owning / non-owning ref
> logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
> passed to bpf_rbtree_add. The only way to get an owning reference for the node
> that was added is to bpf_list_pop_{front,back} it.
>
> More generally, verifier ownership semantics expect that a node has one
> owner (program, collection, or stashed in map) with exclusive ownership
> of the node's lifetime. The owner free's the node's underlying memory when it
> itself goes away.
>
> Without a shared ownership concept it's impossible to express many real-world
> usecases such that they pass verification.
>
> Semantic Changes
> ================
>
> Before this series, the verifier could make this statement: "whoever has the
> owning reference has exclusive ownership of the referent's lifetime". As
> demonstrated in the previous section, this implies that a BPF program can't
> have an owning reference to some node if that node is in a collection. If
> such a state were possible, the node would have multiple owners, each thinking
> they have exclusive ownership. In order to support shared ownership it's
> necessary to modify the exclusive ownership semantic.
>
> After this series' changes, an owning reference has ownership of the referent's
> lifetime, but it's not necessarily exclusive. The referent's underlying memory
> is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
> used for collection insert.
>
> This change doesn't affect UX of owning or non-owning references much:
>
> * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
> an owning reference arg, as ownership still must be passed to the
> collection in a shared-ownership world.
>
> * non-owning references still refer to valid memory without claiming
> any ownership.
> [...]
I think there are a series of major issues right now. I am not sure everything
can be addressed using bug fixes.
If I had to summarize the main problems in one liners:
The mutation of the node fields of an object can be racy.
Lack of collection identity either at runtime or verification.
--
It is possible for multiple CPUs to get owned references to an object containing
a rbtree or list node, and they can attempt to modify those fields in parallel
without any synchronization.
CPU 0 CPU 1
n = bpf_obj_new(...)
m = bpf_refcount_acquire(n)
kptr_xchg(map, m)
m = kptr_xchg(map, NULL)
// m == n
bpf_spin_lock(lock1) bpf_spin_lock(lock2)
bpf_rbtree_add(rbtree1, m) bpf_rbtree_add(rbtree2, n)
if (!RB_EMPTY_NODE) if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
bpf_spin_unlock(lock1) bpf_spin_unlock(lock2)
--
Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
sufficient. Consider this:
Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):
CPU 0 CPU 1
n = bpf_obj_new(...)
bpf_spin_lock(lock1)
bpf_rbtree_add(rbtree1, n)
m = bpf_refcount_acquire(n)
bpf_spin_unlock(lock1)
bpf_spin_lock(lock1)
n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
bpf_spin_unlock(lock1)
// let m == n
bpf_spin_lock(lock1) bpf_spin_lock(lock2)
bpf_rbtree_add(rbtree1, m) bpf_rbtree_add(rbtree2, n)
if (!RB_EMPTY_NODE) if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
bpf_spin_unlock(lock1) bpf_spin_unlock(lock2)
--
There's also no discussion on the problem with collection identities we
discussed before (maybe you plan to address it later):
https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com
But static tracaking won't be sufficient any longer. Considering another case
where the runtime will be confused about which rbtree a node belongs to.
CPU 0 CPU 1
n = bpf_obj_new(...)
m = bpf_refcount_acquire(n)
kptr_xchg(map, m)
p = kptr_xchg(map, NULL)
lock(lock2)
bpf_rbtree_add(rbtree2, p->rnode)
unlock(lock2)
lock(lock1)
bpf_list_push_back(head1, n->lnode)
// make n non-owning ref
bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
unlock(lock1)
--
I can come up with multiple other examples. The point I'm trying to drive home
is that it's a more fundamental issue in the way things are set up right now,
not something that was overlooked during the implementation.
The root cause is that access to a node's fields is going to be racy once
multiple CPUs have *mutable* references to it. The lack of ownership information
mapped to the collection either at runtime or during verification is also a
separate problem.
When we were discussing this a few months ago, we tried to form consensus on
synchronizing updates over a node using an 'owner' pointer to eliminate similar
races. Every node field has an associated owner field, which each updater
modifying that node synchronizes over.
In short:
Node's owner describes its state at runtime.
node.owner == ptr_of_ds // part of DS
node.owner == NULL // not part of DS
node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)
cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
bpf_rbtree_add and such will do this to claim node ownership before trying to
link it in, and then store owner to ptr_of_ds. The _store_ will be the
*linearization* point of bpf_rbtree_add, not cmpxchg.
READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
remain in this state until the end of CS, since ptr_to_ds to NULL only happens
in remove under a held lock for the ds. bpf_rbtree_remove will do this check
before WRITE_ONCE of NULL to unlink a node.
Ofcourse, this is slower, and requires extra space in the object, but unless
this or something else is used to eliminate races, there will be bugs.
next prev parent reply other threads:[~2023-04-22 2:03 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-15 20:18 [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 1/9] bpf: Remove btf_field_offs, use btf_record's fields instead Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 2/9] bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 3/9] bpf: Support refcounted local kptrs in existing semantics Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 4/9] bpf: Add bpf_refcount_acquire kfunc Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail Dave Marchevsky
2023-04-16 1:11 ` Alexei Starovoitov
2023-04-17 18:08 ` Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 6/9] selftests/bpf: Modify linked_list tests to work with macro-ified inserts Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 7/9] bpf: Migrate bpf_rbtree_remove to possibly fail Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 8/9] bpf: Centralize btf_field-specific initialization logic Dave Marchevsky
2023-04-15 20:18 ` [PATCH v2 bpf-next 9/9] selftests/bpf: Add refcounted_kptr tests Dave Marchevsky
2023-04-21 22:17 ` Kumar Kartikeya Dwivedi
2023-04-21 23:49 ` Alexei Starovoitov
2023-04-22 2:06 ` Kumar Kartikeya Dwivedi
2023-04-22 2:18 ` Alexei Starovoitov
2023-04-25 6:53 ` Dave Marchevsky
2023-04-16 0:50 ` [PATCH v2 bpf-next 0/9] Shared ownership for local kptrs patchwork-bot+netdevbpf
2023-04-22 2:03 ` Kumar Kartikeya Dwivedi [this message]
2023-04-22 3:21 ` Alexei Starovoitov
2023-04-22 18:42 ` Kumar Kartikeya Dwivedi
2023-04-22 21:25 ` Alexei Starovoitov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2 \
--to=memxor@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davemarchevsky@fb.com \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).