From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: Alexei Starovoitov <ast@fb.com>
Cc: Dave Marchevsky <davemarchevsky@fb.com>,
	bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Kernel Team <kernel-team@fb.com>, Tejun Heo <tj@kernel.org>
Subject: Re: [RFC PATCH bpf-next 05/11] bpf: Add bpf_spin_lock member to rbtree
Date: Wed, 10 Aug 2022 23:46:50 +0200
Message-ID: <CAP01T75nt69=jgGPGXYXHSGc5EDHejgLQpyY8TMeUy2U4Prxvg@mail.gmail.com>
In-Reply-To: <61209d3a-bc15-e4f2-9079-7bdcfdd13cd0@fb.com>

On Tue, 2 Aug 2022 at 00:23, Alexei Starovoitov <ast@fb.com> wrote:
>
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> > This patch adds a struct bpf_spin_lock *lock member to bpf_rbtree, as
> > well as a bpf_rbtree_get_lock helper which allows bpf programs to access
> > the lock.
> >
> > Ideally the bpf_spin_lock would be created independently outside of the
> > tree and associated with it before the tree is used, either as part of
> > map definition or via some call like rbtree_init(&rbtree, &lock). Doing
> > this in an ergonomic way is proving harder than expected, so for now use
> > this workaround.
> >
> > Why is creating the bpf_spin_lock independently and associating it with
> > the tree preferable? Because we want to be able to transfer nodes
> > between trees atomically, and for this to work we need the same lock
> > associated with both trees.
>
> Right. We need one lock to protect multiple rbtrees.
> Since add/find/remove helpers will look into rbtree->lock
> the two different rbtrees (== maps) have to point to the same lock.
> Other than rbtree_init(&rbtree, &lock), what would be an alternative?
>
> >
> > Further locking-related patches will make it possible for the lock to be
> > used in BPF programs and add code which enforces that the lock is held
> > when doing any operation on the tree.
> >
> > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> > ---
> >   include/uapi/linux/bpf.h       |  7 +++++++
> >   kernel/bpf/helpers.c           |  3 +++
> >   kernel/bpf/rbtree.c            | 24 ++++++++++++++++++++++++
> >   tools/include/uapi/linux/bpf.h |  7 +++++++
> >   4 files changed, 41 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4688ce88caf4..c677d92de3bc 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5385,6 +5385,12 @@ union bpf_attr {
> >    *  Return
> >    *          0
> >    *
> > + * void *bpf_rbtree_get_lock(struct bpf_map *map)
> > + *   Description
> > + *           Return the bpf_spin_lock associated with the rbtree
> > + *
> > + *   Return
> > + *           Ptr to lock
> >    */
> >   #define __BPF_FUNC_MAPPER(FN)               \
> >       FN(unspec),                     \
> > @@ -5600,6 +5606,7 @@ union bpf_attr {
> >       FN(rbtree_find),                \
> >       FN(rbtree_remove),              \
> >       FN(rbtree_free_node),           \
> > +     FN(rbtree_get_lock),            \
> >       /* */
> >
> >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 35eb66d11bf6..257a808bb767 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1587,6 +1587,7 @@ const struct bpf_func_proto bpf_rbtree_add_proto __weak;
> >   const struct bpf_func_proto bpf_rbtree_find_proto __weak;
> >   const struct bpf_func_proto bpf_rbtree_remove_proto __weak;
> >   const struct bpf_func_proto bpf_rbtree_free_node_proto __weak;
> > +const struct bpf_func_proto bpf_rbtree_get_lock_proto __weak;
> >
> >   const struct bpf_func_proto *
> >   bpf_base_func_proto(enum bpf_func_id func_id)
> > @@ -1686,6 +1687,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> >               return &bpf_rbtree_remove_proto;
> >       case BPF_FUNC_rbtree_free_node:
> >               return &bpf_rbtree_free_node_proto;
> > +     case BPF_FUNC_rbtree_get_lock:
> > +             return &bpf_rbtree_get_lock_proto;
> >       default:
> >               break;
> >       }
> > diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
> > index 250d62210804..c6f0a2a083f6 100644
> > --- a/kernel/bpf/rbtree.c
> > +++ b/kernel/bpf/rbtree.c
> > @@ -9,6 +9,7 @@
> >   struct bpf_rbtree {
> >       struct bpf_map map;
> >       struct rb_root_cached root;
> > +     struct bpf_spin_lock *lock;
> >   };
> >
> >   BTF_ID_LIST_SINGLE(bpf_rbtree_btf_ids, struct, rb_node);
> > @@ -39,6 +40,14 @@ static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
> >
> >       tree->root = RB_ROOT_CACHED;
> >       bpf_map_init_from_attr(&tree->map, attr);
> > +
> > +     tree->lock = bpf_map_kzalloc(&tree->map, sizeof(struct bpf_spin_lock),
> > +                                  GFP_KERNEL | __GFP_NOWARN);
> > +     if (!tree->lock) {
> > +             bpf_map_area_free(tree);
> > +             return ERR_PTR(-ENOMEM);
> > +     }
> > +
> >       return &tree->map;
> >   }
> >
> > @@ -139,6 +148,7 @@ static void rbtree_map_free(struct bpf_map *map)
> >
> >       bpf_rbtree_postorder_for_each_entry_safe(pos, n, &tree->root.rb_root)
> >               kfree(pos);
> > +     kfree(tree->lock);
> >       bpf_map_area_free(tree);
> >   }
> >
> > @@ -238,6 +248,20 @@ static int rbtree_map_get_next_key(struct bpf_map *map, void *key,
> >       return -ENOTSUPP;
> >   }
> >
> > +BPF_CALL_1(bpf_rbtree_get_lock, struct bpf_map *, map)
> > +{
> > +     struct bpf_rbtree *tree = container_of(map, struct bpf_rbtree, map);
> > +
> > +     return (u64)tree->lock;
> > +}
> > +
> > +const struct bpf_func_proto bpf_rbtree_get_lock_proto = {
> > +     .func = bpf_rbtree_get_lock,
> > +     .gpl_only = true,
> > +     .ret_type = RET_PTR_TO_MAP_VALUE,
>
> This hack and
>
> +const struct bpf_func_proto bpf_rbtree_unlock_proto = {
> +       .func = bpf_rbtree_unlock,
> +       .gpl_only = true,
> +       .ret_type = RET_INTEGER,
> +       .arg1_type = ARG_PTR_TO_SPIN_LOCK,
>
> may be too much arm twisting to reuse bpf_spin_lock.
>
> Maybe return ptr_to_btf_id here and bpf_rbtree_lock
> should match the type?
> It could be new 'struct bpf_lock' in kernel's BTF.
>
> Let's figure out how to associate locks with rbtrees.
>
> Reiterating the proposal I made earlier in the context
> of Delyan's irq_work, but for a different type:
> How about:
> struct bpf_lock *l;
> l = bpf_mem_alloc(allocator, bpf_core_type_id_kernel(*l));
>
> that would allocate a ptr_to_btf_id object from the kernel's BTF.
> The bpf_lock would have a constructor and destructor provided by the
> kernel code.
> The constructor will set bpf_lock's refcnt to 1.
> Then bpf_rbtree_init(&map, lock) will bump the refcnt,
> and the dtor will eventually free it once all rbtrees are freed.
> That would be similar to the kptr logic, with kptr_get and dtors
> associated with kernel btf_ids.
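
If I read that right, from the program side the flow would be roughly
the following (just transcribing the proposal above; none of these
helpers exist yet):

struct bpf_lock *l;

l = bpf_mem_alloc(allocator, bpf_core_type_id_kernel(*l)); /* ctor sets refcnt to 1 */
if (!l)
	return 0;
bpf_rbtree_init(&rbtree1, l); /* refcnt 2 */
bpf_rbtree_init(&rbtree2, l); /* refcnt 3 */
/* dtor frees l only once both rbtrees are freed and the program's
 * own reference is dropped */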

Just to continue brainstorming, here is an alternative; comments welcome.

Instead of an rbtree map, you have a struct bpf_rbtree global variable
which works like an rbtree. To associate a lock with multiple
bpf_rbtrees, you use clang-style thread safety annotations in the BPF
program:

#define __guarded_by(lock) __attribute__((btf_type_tag("guarded_by:" #lock)))

struct bpf_spin_lock shared_lock;
struct bpf_rbtree rbtree1 __guarded_by(shared_lock);
struct bpf_rbtree rbtree2 __guarded_by(shared_lock);

The guarded_by tag is mandatory for the rbtree. The verifier ensures
the shared_lock spin lock is held whenever rbtree1 or rbtree2 is
accessed, and whitelists the add/remove helpers inside the critical
section.
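
Concretely, usage might look like this (a sketch; I am assuming the
add/remove helper shapes from this series carry over to global
struct bpf_rbtree variables):

bpf_spin_lock(&shared_lock);
bpf_rbtree_add(&rbtree1, node, node_cmp); /* ok: shared_lock held */
bpf_rbtree_remove(&rbtree2, node2);       /* ok: same lock guards both */
bpf_spin_unlock(&shared_lock);

/* The same bpf_rbtree_add() here, outside the critical section,
 * would be rejected at load time. */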

I don't think associating locks with rbtrees dynamically is a hard
requirement for your use case? But if you need that, you could also
allocate sets of rbtrees that belong to the same lock "class"
dynamically using bpf_kptr_alloc, and apply a similar annotation to
the struct being allocated:
struct rbtree_set {
  struct bpf_spin_lock lock;
  struct bpf_rbtree rbtree1 __guarded_by(lock);
  struct bpf_rbtree rbtree2 __guarded_by(lock);
};

struct rbtree_set *s = bpf_kptr_alloc(sizeof(*s), btf_local_type_id(*s));
// Just stash the pointer somewhere with kptr_xchg
On bpf_kptr_free, the verifier knows this is not a "trivial" struct,
so it inserts destructor calls for the bpf_rbtree fields during
program fixup.

The main insight is that the lock and the rbtree are part of the same
allocation (the map value in the global case, the ptr_to_btf_id in the
dynamic case), so the locked state can be bound to the register state
in the verifier. Then we can also make this new rbtree API use kfuncs
instead of UAPI helpers, to get some field experience before baking it
into UAPI.

Any opinions? Any brainos or deficiencies in the scheme above?

Background: I have been thinking about how to combine kptr and normal
data synchronization without paying an unneeded atomic xchg cost when
a lock already protects the kptr. In my tests this guarded_by
annotation has proven very useful: you mark the data and kptrs
protected by a lock as guarded_by some spin lock member in the same
map value; the verifier ensures the lock is held during access;
kptr_xchg for a guarded kptr is lowered to a non-atomic load/store
(since no concurrent access is possible); and kptr_xchg from outside
the lock section is rejected.
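
As an illustration of that last part, in my experiments map values
look like this (a sketch; the field names and the exact kptr tag are
illustrative):

struct map_value {
	struct bpf_spin_lock lock;
	struct prog_data __kptr_ref *pd __guarded_by(lock);
};

...
bpf_spin_lock(&v->lock);
old = bpf_kptr_xchg(&v->pd, new); /* lowered to plain load/store:
				   * the lock already excludes
				   * concurrent access */
bpf_spin_unlock(&v->lock);
/* bpf_kptr_xchg(&v->pd, ...) outside the lock section is rejected */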
