* BPF Linked Lists discussion
@ 2022-08-17  9:04 Kumar Kartikeya Dwivedi
  2022-08-17  9:09 ` Kumar Kartikeya Dwivedi
  2022-08-19  8:55 ` Dave Marchevsky
  0 siblings, 2 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-17  9:04 UTC (permalink / raw)
  To: bpf; +Cc: ast, daniel, andrii, davemarchevsky

Alexei and I had a meeting where we discussed some of my ideas related
to BPF linked lists. I am sharing the discussion with everyone to get
wider feedback, and document what we discussed.

The hard stuff is the shared ownership case, hence we can discuss this
while we work on landing single ownership lists. I will be sharing my
patches for that as an RFC.

1. Definition

We can use BTF declaration tags to annotate a common structure like
struct bpf_list_head, struct bpf_rb_root, etc.

#define __value(kind, name, node) \
    __attribute__((btf_decl_tag(#kind ":" #name ":" #node)))

struct foo {
    unsigned long data;
    struct bpf_list_node node;
};

struct map_value {
    struct bpf_spin_lock lock;
    struct bpf_list_head head __value(struct, foo, node);
};

This allows us to parameterize any kind of intrusive collection.

For the map-in-map use case:

struct bar {
    unsigned long data;
    struct bpf_list_node node;
};
// Only two levels of types allowed, to ensure no cycles and to
// prevent ownership cycles
struct foo {
    unsigned long data;
    struct bpf_spin_lock lock;
    struct bpf_list_node node;
    struct bpf_list_head head __value(struct, bar, node);
};

struct map_value {
    struct bpf_spin_lock lock;
    struct bpf_list_head head __value(struct, foo, node);
};

2. Building Blocks - A type safe allocator

Add bpf_kptr_alloc, bpf_kptr_free
This will use bpf_mem_alloc infra, allocator maps.
Alexei mentioned that Delyan is working on support for exposing
bpf_mem_alloc using an allocator map.
Allocates referenced PTR_TO_BTF_ID (should we call these local kptr?):
reg->btf == prog->aux->btf
reg->btf_id = bpf_core_type_id_local(...)
btf_struct_access allows writing to these objects.
Due to type visibility, we can embed objects with special semantics
inside these user defined types.
Add a concept of constructing/destructing states for a kptr.
Lifecycle: constructing -> normal kptr (escapable) -> destructing.
In the constructing and destructing states, the pointer cannot escape the
program. Hence, only one CPU is guaranteed to observe the object in
those states. So when we have access to a single ownership kptr, we know
nobody else can access it, hence we can also move its state from
normal to destructing.
In case of shared ownership, we will have to rely on the result of
bpf_refcount_put for this to work.
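
A rough sketch of how this could look from a program (nothing here is a
settled API; the helper names and signatures are only for illustration):

    struct foo *f;

    f = bpf_kptr_alloc(bpf_core_type_id_local(struct foo)); /* proposed helper */
    if (!f)
        return 0;
    /* f is a referenced local kptr, still in the constructing state; it
     * cannot escape the program until it is fully constructed. */
    f->data = 42;      /* plain fields are writable via btf_struct_access */
    bpf_kptr_free(f);  /* proposed helper; or construct f fully and move it out */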

3. Embedding special fields inside such allocated kptr

We must allow programmers to compose their own user-defined BPF objects
out of the building blocks provided by BPF.
BPF users may have certain special objects inside this allocated
object, e.g. bpf_list_node, bpf_spin_lock, even bpf_list_head
(map-in-map use case).
btf_struct_access won’t allow direct reads/writes to these fields.
Each of them needs to be constructed before the object is considered
fully constructed.
An unconstructed object’s kptr cannot escape the program; it can only be
destructed and freed.
This has multiple advantages. We can add fields which have complex
initialization requirements.
It also allows safe recycling of memory without having to zero-init or
have the verifier insert constructor calls automatically.
It allows constructors to take parameters in the future, and also allows
complex multi-step initialization of fields in the future.
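
Continuing the sketch above, the construction step could look something like
this (the per-field constructor name is made up purely to illustrate the flow):

    /* f was just allocated and is still in the constructing state. */
    bpf_list_node_init(&f->node);   /* hypothetical per-field constructor */
    /* ... construct any other special fields the type has ... */
    /* Only now is f fully constructed and allowed to escape the program,
     * e.g. by being added to a list or stored as a kptr in a map. */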

4. Single Ownership Linked Lists

The kptr has single ownership.
The program has to release it before BPF_EXIT, either by freeing it or by
moving it out of the program.
Once passed to a list, the program loses ownership.
But the verifier can track that until the spin_lock is released, nobody else
can touch it, so we can technically still list_remove a node we added
using list_add, and then we will own it again after unlock.
list_add marks the reference_state as ‘release_on_unlock’.
list_remove unmarks the reference_state.
Alexei: Similar to Dave’s approach, but different implementation.
bpf_spin_unlock walks acquired_refs and calls release_reference on the
marked ones.
No other function calls are allowed in the critical section, hence the
reference_state remains the same.
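
As a sketch of the intended flow (helper names are again only illustrative,
and v is assumed to point to the map_value from section 1):

    struct foo *f = bpf_kptr_alloc(bpf_core_type_id_local(struct foo));
    if (!f)
        return 0;
    bpf_spin_lock(&v->lock);
    bpf_list_add(&f->node, &v->head);   /* f's ref becomes release_on_unlock */
    /* We could still bpf_list_remove(&f->node) here and regain ownership. */
    bpf_spin_unlock(&v->lock);          /* f's reference is released here */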

----------

5. Shared Ownership

Idea: Add bpf_refcount as special field embeddable in allocated kptrs.
bpf_refcount_set(const), bpf_refcount_inc(const), bpf_refcount_put(ptr).
If combined with RCU, this can allow safe kptr_get operations for such objects.
Each rb_root or list_head requires ownership of the node.
The caller transfers its reference to them.
If it holds only a single reference, it does an inc before the transfer.
It is a generic concept, and can apply to kernel types as well.
When linking right after allocation, it is extremely cheap to set, add, add, add…

We add a ‘static_ref’ count to each reference_state to track incs/decs:
acq     => static_ref = 1
set(K)  => static_ref = K (must be in [1, …] range)
inc(K)  => static_ref += K
rel/put => static_ref -= 1 (may allow K, dunno)

Alexei suggested that he prefers the helpers doing the increment on their
own in the case where the bpf_refcount field exists in the object. I.e.
instead of the caller incrementing and then passing its reference to the
list or rbtree, the add helpers receive a hidden parameter with the refcount
field's address automatically and bump the refcount when adding. In that
case, they won't be releasing the caller's reference_state.
Then this static_ref code is not required.

Kartikeya: No strong opinions, this is also a valid way. One advantage
of managing the refcount on the caller side and keeping the helpers move-only
(regardless of single owner or shared owner kptr) is that the helpers
themselves have the same semantics: they always move ownership of a
reference. Also, one inc(K) and multiple adds is a little cheaper than
multiple inc(1)s, one on each add.
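
To make the two schemes concrete, the caller-managed variant could look
roughly like this (helper names illustrative; both containers are assumed to
be protected by the same lock, and f, head, rb_root, lock set up as in
section 1):

    /* One allocation to be linked into two containers: take K = 2
     * references up front, then each add consumes one. */
    bpf_refcount_set(&f->refcount, 2);
    bpf_spin_lock(&lock);
    bpf_list_add(&f->node, &head);
    bpf_rbtree_add(&f->rb_node, &rb_root);
    bpf_spin_unlock(&lock);

In Alexei's variant, the add helpers would receive the address of f->refcount
as a hidden parameter and do the increments themselves, so the explicit
bpf_refcount_set would not be needed and the caller would keep its own
reference.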

6. How does the verifier reason about shared kptr we don't know the state of?

Consider a case where we load a kptr which has shared ownership from a
map using kptr_get.

Now, it may have a list_node and a rb_node. We don't know whether this
node is already part of some list (so that list_node is occupied),
same for rb_node.

There can be races like two CPUs having access to the node:

CPU 0                              CPU 1
lock(&list1_lock)                  lock(&list2_lock)
list_add(&node, &list2)
    next.prev = node;
    node.next = next;
                                   list_remove(&node)
                                       node.next = NULL;
                                       node.prev = NULL;
    node.prev = prev;
    prev.next = node;
unlock(&list1_lock);               unlock(&list2_lock);

Interleavings can leave nodes in inconsistent states.
We need to ensure that when we are doing list_add or list_remove for a
kptr whose state we don't know, it only happens in a safe context with
ownership of that operation.

Remove:

When inside list_for_each helper, list_remove is safe for nodes since
we are protected by lock.

Idea: An owner field alongside the list_node and rb_node.
list_add sets it to the address of list_head, list_remove sets it to
NULL. This will be done under spinlock of the list.

When we get access to the object in an unknown state for these fields,
we first lock the list we want to remove it from, check the owner
field, and only remove it when we see that owner matches locked list.

Each list_add updates owner, list_remove sets to NULL.
    bpf_spin_lock(&lock);
    if (bpf_list_owns_node(&i->node, &list)) { // checks owner
        list_remove(&i->node);
    }
    bpf_spin_unlock(&lock);

bpf_list_owns_node poisons the pointer in the false branch, so the user can
only list_remove in the true branch.

If the owner is not the locked list pointer, it will be either NULL or
some other value (because of a previous list_remove while holding the same
lock, or a list_add while holding some other list's lock).
If the owner is our list pointer, we can be sure this is safe, as we
have already locked the list.
Otherwise, a previous critical section must have modified the owner.
So a single load (after inlining this helper) allows safely unlinking a
random kptr we have a reference to.

Cost: 8 bytes per object. Advantages: prevents bugs like racy
list_remove and double list_add, and doesn't need fallible helpers (the
check that would have been inside the helper now has to be done by the
user), so we don't need the abort logic.

We can also use this to jump back to the owner, lock it, check
list_owns_node, and remove safely without having to iterate the owner list.

Idea: Make it optional, so the cost is only paid by those who need this
dynamic removal of kptrs obtained from kptr_get; only this helper would
require the associated owner field.

7. How to add such a randomly acquired kptr?

Same idea, but it requires a cmpxchg of the owner to BPF_PTR_POISON. This
value can’t be set by list_add or list_remove. list_owns_node will see it and
not return true. Hence, we will have exclusive ownership of the list_node.

Safety: The verifier will force the program to either add the node to some
list (making the pointer unescapable), or reset the owner to NULL (needs a
store, which makes it escapable again). Otherwise, if you move it to a map,
it will become unlinkable on the next prog execution, as we lose this info
when we reload it from the map.
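
Sketched as code (bpf_list_node_acquire is a made-up name for the
cmpxchg-owner-to-BPF_PTR_POISON step, only to show the intended shape):

    /* i is a shared ownership kptr obtained e.g. via kptr_get. */
    if (bpf_list_node_acquire(&i->node)) {  /* owner: NULL -> BPF_PTR_POISON */
        bpf_spin_lock(&lock);
        bpf_list_add(&i->node, &list);      /* owner: POISON -> &list */
        bpf_spin_unlock(&lock);
    } else {
        /* Someone else owns the node; the verifier poisons &i->node here
         * so it cannot be added or removed in this branch. */
    }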

6,7: Alternative approach: node locking

Be able to lock kptr so that its list_node, rb_node fields are
protected from concurrent mutation. This requires teaching the
verifier to take 2 levels of locks.

Requires building a type ownership graph and detecting cycles to avoid
ABBA deadlocks when loading program BTF.

It would still require checking the owner field (i.e.
bpf_list_owns_node operation) inside the CS (as someone before us
might have taken the lock and added it or removed it), but it becomes
a simple branch, instead of having to do N cmpxchg for N fields to be
able to add them.

8. Insight: Need a generic concept of helpers that poison/unpoison a
pointer argument depending on how the result they return is checked.

if (use_only_in_one_branch(ptr)) { … } else { … }
The call poisons all copies of ptr; checking the returned pointer unpoisons
them. The returned pointer stores a ref_obj_id (currently only needed for
refcounted registers), which can be used to find candidates for un-poisoning.
This is a generic concept, similar to the CONDITIONAL_RELEASE approach from
Dave, but it can be applied to all kinds of other things.
E.g. if (bpf_refcount_single(...)) does one load internally; a simple check
downgrades to a single ownership pointer in one branch. Some operations then
don’t need the lock (like adding to the list_head, since only we can access
it).
Same for bpf_refcount_put pattern.
if (bpf_refcount_put(kptr)) { destruct(kptr); kptr_free(kptr); }

9. RCU protection for single & shared ownership kptrs

Idea: Expose a bpf_call_rcu kfunc. An RCU protected kptr cannot be
destructed, and cannot be bpf_kptr_free'd directly. Only bpf_call_rcu can be
called once the refcount reaches 0; it will then invoke the callback, give
ownership of the kptr to the callback, and force destruction + free (by
setting the destructing state of the pointer).

For single ownership, once we remove visibility using kptr_xchg (it
can only be in one field, because single ownership allows one
move from/to a map), we can call this helper as well.


In shared ownership we rely on bpf_refcount_put's true branch to call
bpf_call_rcu.

The callback runs once after an RCU grace period; it is only allowed to
destruct the kptr and then call bpf_kptr_free, not let it escape the program.

I _think_ we can avoid taking a prog reference if we do an rcu_barrier
after synchronize_rcu in the prog free path? That waits for all call_rcu
callbacks invoked from read sections that may be executing the prog.

Inside callbacks, regardless of whether it is a single ownership kptr or a
kptr with a refcount (possible shared ownership), we know we have single
ownership, and we set the destructing state (with all possible destructible
fields marked as constructed in the callback's func_state, so the user has to
call all destructors and then free, and can do nothing else).
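
A rough sketch of how this could look (bpf_call_rcu, the callback contract,
and the destructor name are all proposals or placeholders, not existing APIs):

    static int foo_free_cb(void *obj)
    {
        struct foo *f = obj;

        /* The callback has exclusive ownership after the RCU grace period;
         * it may only destruct the special fields and then free. */
        bpf_list_node_fini(&f->node);   /* hypothetical destructor */
        bpf_kptr_free(f);
        return 0;
    }

    ...

    if (bpf_refcount_put(&f->refcount))
        bpf_call_rcu(f, foo_free_cb);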

Alexei: Instead of open coding bpf_call_rcu plus destruction like
this, associate destructor callback with user kptr. Then
bpf_kptr_free_rcu automatically invokes this using call_rcu, and BPF
map also invokes it on map_free.

Kartikeya: One big limitation is that the BPF map must now take a reference
to the prog, as it needs the callback reference to stay stable; but to break
the reference cycle it must release these kptrs on map_release_uref. We
then also need to check this atomic count, like the timer helper does, before
setting the kptr in the map, so that such a kptr is not set again after
map_release_uref, which is a bit limiting. It would also be limiting
to have to pin maps to make this work; we may not even have user
visible fds for such maps in the future (as we are moving towards BPF
powered kernel programming).

10. Verifier callback handling is broken

Loop callback verification is broken
Same issues exist now as were anticipated in open coded iteration
Fixing this will open clear path to enabling open coded iteration
https://lore.kernel.org/bpf/20220815051540.18791-1-memxor@gmail.com
Alexei mentioned that Andrii is working on how to do inline nested loops.

--

This concludes what we discussed yesterday. Apologies in advance if I
forgot to mention anything else.


* Re: BPF Linked Lists discussion
  2022-08-17  9:04 BPF Linked Lists discussion Kumar Kartikeya Dwivedi
@ 2022-08-17  9:09 ` Kumar Kartikeya Dwivedi
  2022-08-19  8:55 ` Dave Marchevsky
  1 sibling, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-17  9:09 UTC (permalink / raw)
  To: bpf; +Cc: ast, andrii, davemarchevsky, daniel

+Cc Daniel (now correct email address, typo'd it in the original mail).


* Re: BPF Linked Lists discussion
  2022-08-17  9:04 BPF Linked Lists discussion Kumar Kartikeya Dwivedi
  2022-08-17  9:09 ` Kumar Kartikeya Dwivedi
@ 2022-08-19  8:55 ` Dave Marchevsky
  2022-08-19 16:00   ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Marchevsky @ 2022-08-19  8:55 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf; +Cc: ast, daniel, andrii

Hi Kumar,

Alexei and I talked about locking and a few other things today in regards to my
rbtree work. Some of this isn't a direct response to your ideas/notes here, 
but hoping to summarize today's discussion inline with your code samples and
get your opinion.

Also, some inline comments more directly addressing your notes.

On 8/17/22 5:04 AM, Kumar Kartikeya Dwivedi wrote:   
> Alexei and I had a meeting where we discussed some of my ideas related
> to BPF linked lists. I am sharing the discussion with everyone to get
> wider feedback, and document what we discussed.
> 
> The hard stuff is the shared ownership case, hence we can discuss this
> while we work on landing single ownership lists. I will be sharing my
> patches for that as an RFC.
> 
> 1. Definition
> 
> We can use BTF declaration tags to annotate a common structure like
> struct bpf_list_head, struct bpf_rb_root, etc.
> 
> #define __value(kind, name, node) __attribute__((btf_decl_tag(#kind
> ":" #name ":" #node))
> 
> struct foo {
>     unsigned long data;
>     struct bpf_list_node node;
> };
> 
> struct map_value {
>     struct bpf_spin_lock lock;
>     struct bpf_list_head head __value(struct, foo, node);
> };
> 
> This allows us to parameterize any kind of intrusive collection.
> 
> For the map-in-map use case:
> 
> struct bar {
>     unsigned long data;
>     struct bpf_list_node node;
> };
> // Only two levels of types allowed, to ensure no cycles, and to
> prevent ownership cycles
> struct foo {
>     unsigned long data;
>     struct bpf_spin_lock lock;
>     struct bpf_list_node node;
>     struct bpf_list_head head __value(struct, bar, node);
> };
> 
> struct map_value {
>     struct bpf_spin_lock lock;
>     struct bpf_list_head head __value(struct, foo, node);
> };
> 

Will these still be 'bpf maps' under the hood? If the list were to use
convention similar to the rbtree RFC, the first (non map-in-map) def could be
written like:

struct foo {
    unsigned long data;
    struct bpf_list_node node;
};

struct {
    __uint(type, BPF_MAP_TYPE_LINKED_LIST);
    __type(value, struct foo);
} list SEC(".maps");

I think we're thinking along similar lines with regards to the BTF tag, but I
was thinking of tagging the value type instead of the container, something like:

struct foo {
    unsigned long data;
    struct bpf_list_node node __protected_list_node;
};

'protected' meaning verifier knows to prevent prog from touching the
bpf_list_node. Currently my rbtree RFCv1 is just assuming that value type will
have rb_node at offset 0. BTF tag would eliminate this offset requirement
and allow value types to be part of multiple data structures:

struct foo {
    unsigned long data;
    struct bpf_list_node node __protected_list_node;
    struct bpf_rb_node rb_node __protected_rb_node;
};

Not a hard requirement for me, but nice-to-have: support for heterogeneous value
types in the same list / rbtree. Map def's __type(value, struct foo) wouldn't work
in this case; I think your __value(struct, foo, node) would have the same issue.

But I think this should be possible with BTF tags somehow. List
helpers are ostensibly only touching the list_{head,node} - and similar for
rbtree, and we're both planning on explicit in-prog typed allocation.
If type can be propagated from alloc -> reg state -> helper input ->
helper output, helpers could use reg BTF info w/ properly tagged field
to manipulate the right field in the value struct.

In that case the tag would have to be on the value struct's field, not the
container. I do like that your __value(struct, foo, node) is teaching the
container what named field to manipulate. If value struct were to be part of
two lists this would make it possible to disambiguate.

When we discussed this Alexei mentioned existing pointer casting helper pattern
(e.g. 'bpf_skc_to_tcp_sock') as potentially being helpful here.

> 2. Building Blocks - A type safe allocator
> 
> Add bpf_kptr_alloc, bpf_kptr_free
> This will use bpf_mem_alloc infra, allocator maps.
> Alexei mentioned that Delyan is working on support for exposing
> bpf_mem_alloc using an allocator map.
> Allocates referenced PTR_TO_BTF_ID (should we call these local kptr?):
> reg->btf == prog->aux->btf
> reg->btf_id = bpf_core_type_id_local(...)
> btf_struct_access allows writing to these objects.
> Due to type visibility, we can embed objects with special semantics
> inside these user defined types.
> Add a concept of constructing/destructing kptr.
> constructing -> normal kptr, escapable -> destructing
> In constructing and destructing state, pointer cannot escape the
> program. Hence, only one CPU is guaranteed to observe the object in
> those states. So when we have access to single ownership kptr, we know
> nobody else can access it. Hence we can also move its state from
> normal to destructing state.
> In case of shared ownership, we will have to rely on the result of
> bpf_refcount_put for this to work.
> 
> 3. Embedding special fields inside such allocated kptr
> 
> We must allow programmer to compose their own user defined BPF object
> out of building blocks provided by BPF.
> BPF users may have certain special objects inside this allocated
> object. E.g. bpf_list_node, bpf_spin_lock, even bpf_list_head
> (map-in-map use case).
> btf_struct_access won’t allow direct reads/writes to these fields.
> Each of them needs to be constructed before the object is considered
> fully constructed.
> An unconstructed object’s kptr cannot escape a program, it can only be
> destructed and freed.
> This has multiple advantages. We can add fields which have complex
> initialization requirements.
> This also allows safe recycling of memory without having to do zero
> init or inserting constructor calls automatically from verifier.
> Allows constructors to have parameters in future, also allows complex
> multi-step initialization of fields in future.
> 

I don't fully understand "shared ownership" from 2) and don't have a use case
for complex constructors in 3), but broadly agree with everything else. Will
do another pass.

> 4. Single Ownership Linked Lists
> 
> The kptr has single ownership.
> Program has to release it before BPF_EXIT, either free or move it out
> of program.
> Once passed to list, the program loses ownership.
> But BPF can track that until spin_lock is released, nobody else can
> touch it, so we can technically still list_remove a node we added
> using list_add, and then we will be owning it after unlock.
> list_add marks reference_state as ‘release_on_unlock’
> list_remove unmark reference_state
> Alexei: Similar to Dave’s approach, but different implementation.
> bpf_spin_unlock walks acquired_refs and release_reference marked ones.
> No other function calls allows in critical section, hence
> reference_state remains same.
> 
> ----------
> 
> 5. Shared Ownership
> 
> Idea: Add bpf_refcount as special field embeddable in allocated kptrs.
> bpf_refcount_set(const), bpf_refcount_inc(const), bpf_refcount_put(ptr).
> If combined with RCU, can allow safe kptr_get operations for such objects.
> Each rb_root, list_head requires ownership of node.
> Caller will transfer its reference to them.
> If having only a single reference, do inc before transfer.
> It is a generic concept, and can apply to kernel types as well.
> When linking after allocation, it is extremely cheap to set, add, add, add…
> 
> We add ‘static_ref’ to each reference_state to track incs/decs
> acq = static_ref = 1
> set  = static_ref = K (must be in [1, …] range)
> inc  = static_ref += K
> rel/put = static_ref -=  1 (may allow K, dunno)
> 
> Alexei suggested that he prefers if helpers did the increment on their
> own in case where the bpf_refcount field exists in the object. I.e.
> instead of caller incrementing and then passing their reference to
> lists or rbtree, the add helpers receive hidden parameter to refcount
> field address automatically and bump the refcount when adding. In that
> case, they won't be releasing caller's reference_state.
> Then this static_ref code is not required.
> 
> Kartikeya: No strong opinions, this is also another way. One advantage
> of managing refcount on caller side and just keeping helpers move only
> (regardless of single owner or shared owner kptr) is that helpers
> themselves have the same semantics. It always moves ownership of a
> reference. Also, one inc(K) and multiple add is a little cheaper than
> multiple inc(1) on each add.
> 
> 6. How does the verifier reason about shared kptr we don't know the state of?
> 
> Consider a case where we load a kptr which has shared ownership from a
> map using kptr_get.
> 
> Now, it may have a list_node and a rb_node. We don't know whether this
> node is already part of some list (so that list_node is occupied),
> same for rb_node.
> 
> There can be races like two CPUs having access to the node:
> 
> CPU 0                         CPU 1
> lock(&list1_lock)            lock(&list2_lock)
> list_add(&node, &list2)
>     next.prev = node;
>     node.next = next;      list_remove(&node)
>                                          node.next = NULL;
>                                          node.prev = NULL;
>     node.prev = prev;
>     prev.next = node;
> unlock(&list1_lock);         unlock(&list2_lock);
> 
> Interleavings can leave nodes in inconsistent states.
> We need to ensure that when we are doing list_add or list_remove for
> kptr we don't know the state of, it is only in a safe context with
> ownership of that operation.
> 
> Remove:
> 
> When inside list_for_each helper, list_remove is safe for nodes since
> we are protected by lock.
> 
> Idea: An owner field alongside the list_node and rb_node.
> list_add sets it to the address of list_head, list_remove sets it to
> NULL. This will be done under spinlock of the list.
> 
> When we get access to the object in an unknown state for these fields,
> we first lock the list we want to remove it from, check the owner
> field, and only remove it when we see that owner matches locked list.
> 
> Each list_add updates owner, list_remove sets to NULL.
>     bpf_spin_lock(&lock);
>     if (bpf_list_owns_node(&i->node, &list)) { // checks owner
> list_remove(&i->node);
>     }
>     bpf_spin_unlock(&lock);
> 
> bpf_list_owns_node poisons pointer in false branch, so user can only
> list_remove in true branch.
> 
> If the owner is not a locked list pointer, it will be either NULL or
> some other value (because of previous list_remove while holding same
> lock, or list_add while holding some other list lock).
> If the owner is our list pointer, we can be sure this is safe, as we
> have already locked list.
> Otherwise, previous critical section must have modified owner.
> So one single load (after inlining this helper) allows unlinking
> random kptr we have reference to, safely.
> 
> Cost: 8-bytes per object. Advantages: Prevents bugs like racy
> list_remove and double list_add, doesn't need fallible helpers (the
> check that would have been inside has to be done by the user now).
> Don't need the abort logic.
> 

I agree, keeping track of the owner seems necessary. It also seems harder to
verify statically than the lock. Alexei mentioned today that a combined
"grab lock and take ownership" helper for the dynamic check might make
sense.

Tangentially, I've been poking at ergonomics of
libbpf lock definition this week and think I have something reasonable:

struct node_data {
        struct rb_node node;
        __u32 one;
        __u32 two;
};

struct l {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __type(key, u32);
        __type(value, struct bpf_spin_lock);
        __uint(max_entries, 1);
} lock_arr SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_RBTREE);
        __type(value, struct node_data);
        __array(lock, struct l);
} rbtree1 SEC(".maps") = {
        .lock = {
                [0] = &lock_arr,
        },
};

struct {
        __uint(type, BPF_MAP_TYPE_RBTREE);
        __type(value, struct node_data);
        __array(lock, struct l);
} rbtree2 SEC(".maps") = {
        .lock = {
                [0] = &lock_arr,
        },
};

... in BPF prog

  bpf_spin_lock(&lock_arr[0]);

  // Can safely operate on either tree, move nodes between them, etc.

  bpf_spin_unlock(&lock_arr[0]);


Notes:
  * Verifier knows which lock is supposed to be used at map creation time
    * Can reuse bpf_verifier_state's 'active_spin_lock' member, so no addt'l
      bookkeeping needed to verify that rbtree_add or similar is happening
      in critical section
  * Can benefit from relo goodness (e.g. rbtree3 using extern lock in another
    file)
  * If necessary, similar dynamic verification behavior as just keeping lock
    internal
  * Implementation similarities with map_of_map 'inner_map'. Similarly to
    inner_map_fd, kernel needs to know about lock_map_fd. Can use map_extra for
    this to avoid uapi changes

Alexei and I discussed possibly allowing raw 'struct bpf_spin_lock' global var,
which would require some additional libbpf changes as bpf_spin_lock can't be
mmap'd and libbpf tries to mmap all .data maps currently. Perhaps a separate
.data.no_mmap section.

This ergonomics idea doesn't solve the map-in-map issue, I'm still unsure
how to statically verify lock in that case. Have you had a chance to think
about it further? And does the above seem reasonable to you?

> We can also use this to jump back to owner, lock it, list_owns_node,
> remove safely without having to iterate owner list.
> 
> Idea: Make it optional, so cost only paid by those who need this
> dynamic removal of kptr from kptr_gets, only this helper would require
> associated owner field.
> 
> 7. How to add such a randomly acquired kptr?
> 
> Same idea, but requires a cmpxchg owner to BPF_PTR_POISON. This can’t
> be set from list_add or list_remove. list_owns_node will see it and
> not return true. Hence, we will have exclusive ownership of list_node.
> 
> Safety: Verifier will force to either add to some list, and make
> pointer unescapable, or reset to NULL (needs a store, makes escapable)
> (otherwise if you move it to map, it will become unlinkable on next
> prog execution, as we lose this info when we reload from map).
> 
> 6,7: Alternative approach: node locking
> 
> Be able to lock kptr so that its list_node, rb_node fields are
> protected from concurrent mutation. This requires teaching the
> verifier to take 2 levels of locks.
> 
> Requires building a type ownership graph and detecting cycles to avoid
> ABBA deadlocks when loading program BTF.
> 
> It would still require checking the owner field (i.e.
> bpf_list_owns_node operation) inside the CS (as someone before us
> might have taken the lock and added it or removed it), but it becomes
> a simple branch, instead of having to do N cmpxchg for N fields to be
> able to add them.
> 
> 8. Insight: Need a generic concept of helpers that poison/unpoison  a
> pointer argument depending on the checking of the result they return.
> 
> if (use_only_in_one_branch(ptr)) { … } else { … }
> Poisons all copies of ptr. Checking the returned pointer unpoisons.
> Returned pointer stores ref_obj_id (currently only needed for
> refcounted registers), which can be used to find candidates for
> un-poisoning.
> Generic concept, similar to CONDITIONAL_RELEASE approach from Dave,
> but can apply to do all kinds of other things.
> E.g. if (bpf_refcount_single(...)) does one load internally, simply
> check to downgrade to single ownership pointer in one branch. Some
> operations then don’t need lock (like adding to list_head, since only
> we can access it).
> Same for bpf_refcount_put pattern.
> if (bpf_refcount_put(kptr)) { destruct(kptr); kptr_free(kptr); }
> 
> 9. RCU protection for single & shared ownershipkptrs
> 
> Idea: Expose bpf_call_rcu kfunc. RCU protected kptr cannot be
> destructed, cannot be bpf_kptr_free directly. Only bpf_call_rcu can be
> called once refcount reaches 0, then it will invoke callback and give
> ownership of kptr to the callback and force destruction + free (by
> setting destructing state of pointer).
> 
> For single ownership, once we remove visibility using kptr_xchg (it
> can be only in one field, because of single ownership allowing one
> move from/to map), we can call this helper as well.
> 
> 
> In shared ownership we rely on bpf_refcount_put's true branch to call
> bpf_call_rcu.
> 
> Callback runs once after RCU gp, it will only be allowed to destruct
> kptr and then call bpf_kptr_free, not escape program.
> 
> I _think_ we can avoid taking prog reference, if we do RCU barrier
> after synchronize_rcu in prog free path? That waits for all call_rcu
> invoked from read sections that may be executing the prog.
> 
> Inside callbacks, regardless of single ownership kptr or kptr with
> refcount (possible shared ownership), we know we have single
> ownership, and set destructing state (with all possible destructible
> fields marked as constructed in callback func_state, so user has to
> call all destructors and then free, can do nothing else).
> 
> Alexei: Instead of open coding bpf_call_rcu plus destruction like
> this, associate destructor callback with user kptr. Then
> bpf_kptr_free_rcu automatically invokes this using call_rcu, and BPF
> map also invokes it on map_free.
> 
> Kartikeya: One big limitation is that now BPF map must take reference
> to prog as it needs callback reference to stay stable, but to solve
> the reference cycle it must release these kptr on map_release_uref. We
> then also need to check this atomic count like the timer helper before
> setting the kptr in the map, so that such kptr is not set after
> map_release_uref again, which is a bit limiting. It would be limiting
> to have to pin maps to make this work, we may not even have user
> visible fds for such maps in future (as we are moving towards BPF
> powered kernel programming).
> 
> 10. Verifier callback handling is broken
> 
> Loop callback verification is broken
> Same issues exist now as were anticipated in open coded iteration
> Fixing this will open clear path to enabling open coded iteration
> https://lore.kernel.org/bpf/20220815051540.18791-1-memxor@gmail.com
> Alexei mentioned that Andrii is working on how to do inline nested loops.
> 
> --
> 
> This concludes what we discussed yesterday. Apologies in advance if I
> forgot to mention anything else.


* Re: BPF Linked Lists discussion
  2022-08-19  8:55 ` Dave Marchevsky
@ 2022-08-19 16:00   ` Kumar Kartikeya Dwivedi
  2022-08-19 19:03     ` Alexei Starovoitov
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-19 16:00 UTC (permalink / raw)
  To: Dave Marchevsky; +Cc: bpf, ast, andrii, daniel

On Fri, 19 Aug 2022 at 10:55, Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> Hi Kumar,
>
> Alexei and I talked about locking and a few other things today in regards to my
> rbtree work. Some of this isn't a direct response to your ideas/notes here,
> but hoping to summarize today's discussion inline with your code samples and
> get your opinion.
>
> Also, some inline comments more directly addressing your notes.

Hi Dave, thanks for sharing the notes.

>
> On 8/17/22 5:04 AM, Kumar Kartikeya Dwivedi wrote:
> > Alexei and I had a meeting where we discussed some of my ideas related
> > to BPF linked lists. I am sharing the discussion with everyone to get
> > wider feedback, and document what we discussed.
> >
> > The hard stuff is the shared ownership case, hence we can discuss this
> > while we work on landing single ownership lists. I will be sharing my
> > patches for that as an RFC.
> >
> > 1. Definition
> >
> > We can use BTF declaration tags to annotate a common structure like
> > struct bpf_list_head, struct bpf_rb_root, etc.
> >
> > #define __value(kind, name, node) __attribute__((btf_decl_tag(#kind
> > ":" #name ":" #node))
> >
> > struct foo {
> >     unsigned long data;
> >     struct bpf_list_node node;
> > };
> >
> > struct map_value {
> >     struct bpf_spin_lock lock;
> >     struct bpf_list_head head __value(struct, foo, node);
> > };
> >
> > This allows us to parameterize any kind of intrusive collection.
> >
> > For the map-in-map use case:
> >
> > struct bar {
> >     unsigned long data;
> >     struct bpf_list_node node;
> > };
> > // Only two levels of types allowed, to ensure no cycles, and to
> > prevent ownership cycles
> > struct foo {
> >     unsigned long data;
> >     struct bpf_spin_lock lock;
> >     struct bpf_list_node node;
> >     struct bpf_list_head head __value(struct, bar, node);
> > };
> >
> > struct map_value {
> >     struct bpf_spin_lock lock;
> >     struct bpf_list_head head __value(struct, foo, node);
> > };
> >
>
> Will these still be 'bpf maps' under the hood? If the list were to use

Nope, my idea was to get rid of maps for intrusive collections, and
you always put bpf_list_head, bpf_rb_root in a map value. For global
map-like use cases, you instead use global variables, which are also
inside the map value of a BPF_MAP_TYPE_ARRAY. IMO there is no need
anymore to add more and more map types with this new style of data
structures. It is much more ergonomic to just use the head structure,
either as a global variable, or as an allocated object, but again,
that's my opinion. There will be some pros and cons for either
approach :).

I am aware of the problem with global bpf_spin_lock (thanks to Alexei
for spotting it), but as you described it can be solved by moving them
into a different section.

> convention similar to the rbtree RFC, the first (non map-in-map) def could be
> written like:
>
> struct foo {
>     unsigned long data;
>     struct bpf_list_node node;
> };
>
> struct {
>     __uint(type, BPF_MAP_TYPE_LINKED_LIST);
>     __type(value, struct foo);
> } list SEC(".maps");
>
> I think we're thinking along similar lines with regards to the BTF tag, but I
> was thinking of tagging the value type instead of the container, something like:
>
> struct foo {
>     unsigned long data;
>     struct bpf_list_node node __protected_list_node;
> };
>
> 'protected' meaning verifier knows to prevent prog from touching the
> bpf_list_node. Currently my rbtree RFCv1 is just assuming that value type will
> have rb_node at offset 0. BTF tag would eliminate this offset requirement
> and allow value types to be part of multiple data structures:
>
> struct foo {
>     unsigned long data;
>     struct bpf_list_node node __protected_list_node;
>     struct bpf_rb_node rb_node __protected_rb_node;
> };
>
> Not a hard requirement for me, but nice-to-have: support for heterogenous value
> types in same list / rbtree. Map def's __type(value, struct foo) wouldn't work
> in this case, I think your __value(struct, foo, node) would have same issue.
>
> But I think this should be possible with BTF tags somehow. List
> helpers are ostensibly only touching the list_{head,node} - and similar for
> rbtree, and we're both planning on explicit in-prog typed allocation.
> If type can be propagated from alloc -> reg state -> helper input ->
> helper output, helpers could use reg BTF info w/ properly tagged field
> to manipulate the right field in the value struct.

We can have multiple __value annotations on a list_head and add some way to
disambiguate what the type will be on list_remove. The problem is not
during list_add, as you know the type of the node being added. It is
when you list_remove that you need to be able to disambiguate the
type, so that we set the reg with the correct btf and btf_id and then set the
right reg->off (so that container_of can give the entry). Until the
disambiguation step is done, it is unknown what the type might be.

When thinking of heterogeneous values, we should probably look to add
a way to do generic variant types in BPF, such that e.g. the first
field is a tag, and then the actual data of that type follows. This
would allow us to use them not only in intrusive collections, but
anywhere else, and probably even store them in map values as kptrs.

I think it is much simpler to start with homogenous types first, though.

>
> In that case the tag would have to be on the value struct's field, not the
> container. I do like that your __value(struct, foo, node) is teaching the
> container what named field to manipulate. If value struct were to be part of
> two lists this would make it possible to disambiguate.
>
> When we discussed this Alexei mentioned existing pointer casting helper pattern
> (e.g. 'bpf_skc_to_tcp_sock') as potentially being helpful here.
>

Indeed, but I think you need some bit of info at runtime to be able to do this.

> > 2. Building Blocks - A type safe allocator
> >
> > Add bpf_kptr_alloc, bpf_kptr_free
> > This will use bpf_mem_alloc infra, allocator maps.
> > Alexei mentioned that Delyan is working on support for exposing
> > bpf_mem_alloc using an allocator map.
> > Allocates referenced PTR_TO_BTF_ID (should we call these local kptr?):
> > reg->btf == prog->aux->btf
> > reg->btf_id = bpf_core_type_id_local(...)
> > btf_struct_access allows writing to these objects.
> > Due to type visibility, we can embed objects with special semantics
> > inside these user defined types.
> > Add a concept of constructing/destructing kptr.
> > constructing -> normal kptr, escapable -> destructing
> > In constructing and destructing state, pointer cannot escape the
> > program. Hence, only one CPU is guaranteed to observe the object in
> > those states. So when we have access to single ownership kptr, we know
> > nobody else can access it. Hence we can also move its state from
> > normal to destructing state.
> > In case of shared ownership, we will have to rely on the result of
> > bpf_refcount_put for this to work.
> >
> > 3. Embedding special fields inside such allocated kptr
> >
> > We must allow programmer to compose their own user defined BPF object
> > out of building blocks provided by BPF.
> > BPF users may have certain special objects inside this allocated
> > object. E.g. bpf_list_node, bpf_spin_lock, even bpf_list_head
> > (map-in-map use case).
> > btf_struct_access won’t allow direct reads/writes to these fields.
> > Each of them needs to be constructed before the object is considered
> > fully constructed.
> > An unconstructed object’s kptr cannot escape a program, it can only be
> > destructed and freed.
> > This has multiple advantages. We can add fields which have complex
> > initialization requirements.
> > This also allows safe recycling of memory without having to do zero
> > init or inserting constructor calls automatically from verifier.
> > Allows constructors to have parameters in future, also allows complex
> > multi-step initialization of fields in future.
> >
>
> I don't fully understand "shared ownership" from 2) and don't have a use case

Shared ownership is explained further later in section 5.

> for complex constructors in 3), but broadly agree with everything else. Will
> do another pass.
>
> > 4. Single Ownership Linked Lists
> >
> > The kptr has single ownership.
> > Program has to release it before BPF_EXIT, either free or move it out
> > of program.
> > Once passed to list, the program loses ownership.
> > But BPF can track that until spin_lock is released, nobody else can
> > touch it, so we can technically still list_remove a node we added
> > using list_add, and then we will be owning it after unlock.
> > list_add marks reference_state as ‘release_on_unlock’
> > list_remove unmark reference_state
> > Alexei: Similar to Dave’s approach, but different implementation.
> > bpf_spin_unlock walks acquired_refs and release_reference marked ones.
> > No other function calls allows in critical section, hence
> > reference_state remains same.
> >
> > ----------
> >
> > 5. Shared Ownership
> >
> > Idea: Add bpf_refcount as special field embeddable in allocated kptrs.
> > bpf_refcount_set(const), bpf_refcount_inc(const), bpf_refcount_put(ptr).
> > If combined with RCU, can allow safe kptr_get operations for such objects.
> > Each rb_root, list_head requires ownership of node.
> > Caller will transfer its reference to them.
> > If having only a single reference, do inc before transfer.
> > It is a generic concept, and can apply to kernel types as well.
> > When linking after allocation, it is extremely cheap to set, add, add, add…
> >
> > We add ‘static_ref’ to each reference_state to track incs/decs
> > acq = static_ref = 1
> > set  = static_ref = K (must be in [1, …] range)
> > inc  = static_ref += K
> > rel/put = static_ref -=  1 (may allow K, dunno)
> >
> > Alexei suggested that he prefers if helpers did the increment on their
> > own in case where the bpf_refcount field exists in the object. I.e.
> > instead of caller incrementing and then passing their reference to
> > lists or rbtree, the add helpers receive hidden parameter to refcount
> > field address automatically and bump the refcount when adding. In that
> > case, they won't be releasing caller's reference_state.
> > Then this static_ref code is not required.
> >
> > Kartikeya: No strong opinions, this is also another way. One advantage
> > of managing refcount on caller side and just keeping helpers move only
> > (regardless of single owner or shared owner kptr) is that helpers
> > themselves have the same semantics. It always moves ownership of a
> > reference. Also, one inc(K) and multiple add is a little cheaper than
> > multiple inc(1) on each add.
> >
> > 6. How does the verifier reason about shared kptr we don't know the state of?
> >
> > Consider a case where we load a kptr which has shared ownership from a
> > map using kptr_get.
> >
> > Now, it may have a list_node and a rb_node. We don't know whether this
> > node is already part of some list (so that list_node is occupied),
> > same for rb_node.
> >
> > There can be races like two CPUs having access to the node:
> >
> > CPU 0                         CPU 1
> > lock(&list1_lock)            lock(&list2_lock)
> > list_add(&node, &list2)
> >     next.prev = node;
> >     node.next = next;      list_remove(&node)
> >                                          node.next = NULL;
> >                                          node.prev = NULL;
> >     node.prev = prev;
> >     prev.next = node;
> > unlock(&list1_lock);         unlock(&list2_lock);
> >
> > Interleavings can leave nodes in inconsistent states.
> > We need to ensure that when we are doing list_add or list_remove for
> > kptr we don't know the state of, it is only in a safe context with
> > ownership of that operation.
> >
> > Remove:
> >
> > When inside list_for_each helper, list_remove is safe for nodes since
> > we are protected by lock.
> >
> > Idea: An owner field alongside the list_node and rb_node.
> > list_add sets it to the address of list_head, list_remove sets it to
> > NULL. This will be done under spinlock of the list.
> >
> > When we get access to the object in an unknown state for these fields,
> > we first lock the list we want to remove it from, check the owner
> > field, and only remove it when we see that owner matches locked list.
> >
> > Each list_add updates owner, list_remove sets to NULL.
> >     bpf_spin_lock(&lock);
> >     if (bpf_list_owns_node(&i->node, &list)) { // checks owner
> > list_remove(&i->node);
> >     }
> >     bpf_spin_unlock(&lock);
> >
> > bpf_list_owns_node poisons pointer in false branch, so user can only
> > list_remove in true branch.
> >
> > If the owner is not a locked list pointer, it will be either NULL or
> > some other value (because of previous list_remove while holding same
> > lock, or list_add while holding some other list lock).
> > If the owner is our list pointer, we can be sure this is safe, as we
> > have already locked list.
> > Otherwise, previous critical section must have modified owner.
> > So one single load (after inlining this helper) allows unlinking
> > random kptr we have reference to, safely.
> >
> > Cost: 8-bytes per object. Advantages: Prevents bugs like racy
> > list_remove and double list_add, doesn't need fallible helpers (the
> > check that would have been inside has to be done by the user now).
> > Don't need the abort logic.
> >
>
> I agree, keeping track of owner seems necessary. Seems harder to verify
> statically than lock as well. Alexei mentioned today that combination
> "grab lock and take ownership" helper for dynamic check might make
> sense.
>
> Tangentially, I've been poking at ergonomics of
> libbpf lock definition this week and think I have something reasonable:
>
> struct node_data {
>         struct rb_node node;
>         __u32 one;
>         __u32 two;
> };
>
> struct l {
>         __uint(type, BPF_MAP_TYPE_ARRAY);
>         __type(key, u32);
>         __type(value, struct bpf_spin_lock);
>         __uint(max_entries, 1);
> } lock_arr SEC(".maps");
>
> struct {
>         __uint(type, BPF_MAP_TYPE_RBTREE);
>         __type(value, struct node_data);
>         __array(lock, struct l);
> } rbtree1 SEC(".maps") = {
>         .lock = {
>                 [0] = &lock_arr,
>         },
> };
>
> struct {
>         __uint(type, BPF_MAP_TYPE_RBTREE);
>         __type(value, struct node_data);
>         __array(lock, struct l);
> } rbtree2 SEC(".maps") = {
>         .lock = {
>                 [0] = &lock_arr,
>         },
> };
>
> ... in BPF prog
>
>   bpf_spin_lock(&lock_arr[0]);
>
>   // Can safely operate on either tree, move nodes between them, etc.
>
>   bpf_spin_unlock(&lock_arr[0]);
>
>
> Notes:
>   * Verifier knows which lock is supposed to be used at map creation time
>     * Can reuse bpf_verifier_state's 'active_spin_lock' member, so no addt'l
>       bookkeeping needed to verify that rbtree_add or similar is happening
>       in critical section

Yes, this is similar to my approach, except what I'm doing is (suppose
we fix the bpf_spin_lock in mmap-able map value problem):

The list_head and lock protecting it are global variables, hence in
the same map value for the global variable's array map (for now only
one lock is allowed in a map value, but we may allow some guarded_by
annotation to associate different locks to different containers).
Now, you can use the same active_spin_lock infra to track whether I
hold the one in the same map value as the list_head. More on that
below.

>   * Can benefit from relo goodness (e.g. rbtree3 using extern lock in another
>     file)
>   * If necessary, similar dynamic verification behavior as just keeping lock
>     internal
>   * Implementation similarities with map_of_map 'inner_map'. Similarly to
>     inner_map_fd, kernel needs to know about lock_map_fd. Can use map_extra for
>     this to avoid uapi changes
>
> Alexei and I discussed possibly allowing raw 'struct bpf_spin_lock' global var,
> which would require some additional libbpf changes as bpf_spin_lock can't be
> mmap'd and libbpf tries to mmap all .data maps currently. Perhaps a separate
> .data.no_mmap section.
>
> This ergonomics idea doesn't solve the map-in-map issue, I'm still unsure
> how to statically verify lock in that case. Have you had a chance to think
> about it further?
>

You rely on the lock being in the same allocation, and manipulation
done on an object from the same 'lookup'. See below:

struct foo {
        struct bpf_spin_lock lock;
        struct bpf_list_head head __value(...);
};

struct map_value {
        struct foo __local_kptr *ptr;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __type(key, int);
        __type(value, struct map_value);
        __uint(max_entries, 8);
} array_of_lists SEC(".maps");

In my case, the structure is the map, so pointer to the structure
inside a map makes it map-in-map (now common, the existing map-in-maps
just hide this from you, so it's pretty much the same thing
anyway...).

This is just an example, it can be one more level deep, but anyway.

When I do a map lookup, there is a check in
verifier.c:reg_may_point_to_spin_lock; this preserves reg->id on NULL
unmarking. This reg->id is then remembered when you take the lock inside
this map value, to associate it back to the unlock correctly.

Now, suppose you load the kptr. You know the kptr has a lock, so you will
update this check to also consider local kptrs with locks. The reg->id
is preserved after the loaded kptr is NULL checked, but it is unique for
each load of the kptr. You lock the spin_lock in the kptr, you then add to
the list, and the list_add verifier check sees whether the currently held
lock and the current list_head come from the same reg->id (you know the reg
of the list_head, so you know the id as well, and you match that against
cur->active_spin_lock_id). If so, it is the correct lock: we locked the lock
in the same loaded kptr as the one whose list_head we are list_add-ing to.
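
In program terms, the flow I have in mind is roughly this (the kptr load
helper below is a placeholder; the only point is the reg->id matching):

    __u32 key = 0;
    struct map_value *v = bpf_map_lookup_elem(&array_of_lists, &key);
    if (!v)
        return 0;
    struct foo *p = bpf_kptr_get(&v->ptr);  /* placeholder for loading the kptr */
    if (!p)
        return 0;                           /* p gets a fresh, unique reg->id */
    bpf_spin_lock(&p->lock);                /* active_spin_lock id := p's id */
    bpf_list_add(&n->node, &p->head);       /* verifier: id of &p->head must
                                             * match the held lock's id */
    bpf_spin_unlock(&p->lock);
    /* (n is a node constructed earlier; reference handling for p elided.) */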

For global variables, the check needs more work. In the normal map
lookup case, we assign a fresh reg->id whenever you do a map lookup, so in
the case of an array map, locking the spin lock for the same key will set a
different id in cur->active_spin_lock_id for two map values obtained from
two different lookups. This is because we don't know if it is the same map
value on the second lookup, so the locks in the two map values are
considered different locks. The id is essentially the unique lock id.

Since global variables are in a direct_value_addr map with 1 max_entry,
we don't need to assign a fresh reg->id at each pseudo ldimm64 insn. We
can instead teach it to either track the lock using id (for the case of
normal map lookups and local kptrs), or using map_ptr to accommodate global
variables' non-unique ids. At any time, only one of the two is set, the
other is zero.

Then everything falls into place. We always match both map_ptr and id.
For global data and map lookups, the map_ptr is matched; the id will be 0
for global data and non-zero for normal map lookups. There is only one
map value, so the lock protects everything in it. For the other case I
described above (local kptr, PTR_TO_BTF_ID), map_ptr is NULL but the id
will be different if the lock and head are not from the same 'lookup'.

We also have map_uid, which is assigned to the map_ptr of inner map
lookups. But remember that we are talking about map values above, so even
for lookups from two different inner maps of the same outer map, where we
get two map values whose map_ptr is technically the same (even if the
map_uid was different), their reg->id _will_ be different, so the above
checks are sufficient to disambiguate spin locks for all kinds of cases.

Keeping lock and data in the same allocation thus allows you to
associate locks statically even for dynamic allocations, enabling the
map-in-map use case.

If you still find it confusing, I'll be posting the RFC early next
week, please let me know if there is a flaw in the approach above
after you look at the code in detail. I might have actually missed
something.

>  And does the above seem reasonable to you?

I still think if we can solve the mmapable spin_lock problem (e.g. by
moving it to a different section), we should do it using global
variables instead of maps. We're both essentially doing the same
thing, but in different ways. The reason I prefer global variables is that
I think the stuff below looks neater and is simpler for the user, compared
to defining an array map with the lock, then having a new map type for the
container, and then setting the lock map's address in the container map:

struct bpf_spin_lock lock;
struct bpf_rb_root rb_root __guarded_by(lock);

But that is just my opinion, we can discuss the pros and cons.


> > We can also use this to jump back to owner, lock it, list_owns_node,
> > remove safely without having to iterate owner list.
> >
> > Idea: Make it optional, so cost only paid by those who need this
> > dynamic removal of kptr from kptr_gets, only this helper would require
> > associated owner field.
> >
> > 7. How to add such a randomly acquired kptr?
> >
> > Same idea, but requires a cmpxchg owner to BPF_PTR_POISON. This can’t
> > be set from list_add or list_remove. list_owns_node will see it and
> > not return true. Hence, we will have exclusive ownership of list_node.
> >
> > Safety: Verifier will force to either add to some list, and make
> > pointer unescapable, or reset to NULL (needs a store, makes escapable)
> > (otherwise if you move it to map, it will become unlinkable on next
> > prog execution, as we lose this info when we reload from map).
> >
> > 6,7: Alternative approach: node locking
> >
> > Be able to lock kptr so that its list_node, rb_node fields are
> > protected from concurrent mutation. This requires teaching the
> > verifier to take 2 levels of locks.
> >
> > Requires building a type ownership graph and detecting cycles to avoid
> > ABBA deadlocks when loading program BTF.
> >
> > It would still require checking the owner field (i.e.
> > bpf_list_owns_node operation) inside the CS (as someone before us
> > might have taken the lock and added it or removed it), but it becomes
> > a simple branch, instead of having to do N cmpxchg for N fields to be
> > able to add them.
> >
> > 8. Insight: Need a generic concept of helpers that poison/unpoison  a
> > pointer argument depending on the checking of the result they return.
> >
> > if (use_only_in_one_branch(ptr)) { … } else { … }
> > Poisons all copies of ptr. Checking the returned pointer unpoisons.
> > Returned pointer stores ref_obj_id (currently only needed for
> > refcounted registers), which can be used to find candidates for
> > un-poisoning.
> > Generic concept, similar to CONDITIONAL_RELEASE approach from Dave,
> > but can apply to do all kinds of other things.
> > E.g. if (bpf_refcount_single(...)) does one load internally, simply
> > check to downgrade to single ownership pointer in one branch. Some
> > operations then don’t need lock (like adding to list_head, since only
> > we can access it).
> > Same for bpf_refcount_put pattern.
> > if (bpf_refcount_put(kptr)) { destruct(kptr); kptr_free(kptr); }
> >
> > 9. RCU protection for single & shared ownershipkptrs
> >
> > Idea: Expose bpf_call_rcu kfunc. RCU protected kptr cannot be
> > destructed, cannot be bpf_kptr_free directly. Only bpf_call_rcu can be
> > called once refcount reaches 0, then it will invoke callback and give
> > ownership of kptr to the callback and force destruction + free (by
> > setting destructing state of pointer).
> >
> > For single ownership, once we remove visibility using kptr_xchg (it
> > can be only in one field, because of single ownership allowing one
> > move from/to map), we can call this helper as well.
> >
> >
> > In shared ownership we rely on bpf_refcount_put's true branch to call
> > bpf_call_rcu.
> >
> > Callback runs once after RCU gp, it will only be allowed to destruct
> > kptr and then call bpf_kptr_free, not escape program.
> >
> > I _think_ we can avoid taking prog reference, if we do RCU barrier
> > after synchronize_rcu in prog free path? That waits for all call_rcu
> > invoked from read sections that may be executing the prog.
> >
> > Inside callbacks, regardless of single ownership kptr or kptr with
> > refcount (possible shared ownership), we know we have single
> > ownership, and set destructing state (with all possible destructible
> > fields marked as constructed in callback func_state, so user has to
> > call all destructors and then free, can do nothing else).
> >
> > Alexei: Instead of open coding bpf_call_rcu plus destruction like
> > this, associate destructor callback with user kptr. Then
> > bpf_kptr_free_rcu automatically invokes this using call_rcu, and BPF
> > map also invokes it on map_free.
> >
> > Kartikeya: One big limitation is that now BPF map must take reference
> > to prog as it needs callback reference to stay stable, but to solve
> > the reference cycle it must release these kptr on map_release_uref. We
> > then also need to check this atomic count like the timer helper before
> > setting the kptr in the map, so that such kptr is not set after
> > map_release_uref again, which is a bit limiting. It would be limiting
> > to have to pin maps to make this work, we may not even have user
> > visible fds for such maps in future (as we are moving towards BPF
> > powered kernel programming).
> >
> > 10. Verifier callback handling is broken
> >
> > Loop callback verification is broken
> > Same issues exist now as were anticipated in open coded iteration
> > Fixing this will open clear path to enabling open coded iteration
> > https://lore.kernel.org/bpf/20220815051540.18791-1-memxor@gmail.com
> > Alexei mentioned that Andrii is working on how to do inline nested loops.
> >
> > --
> >
> > This concludes what we discussed yesterday. Apologies in advance if I
> > forgot to mention anything else.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF Linked Lists discussion
  2022-08-19 16:00   ` Kumar Kartikeya Dwivedi
@ 2022-08-19 19:03     ` Alexei Starovoitov
  2022-08-19 20:24       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2022-08-19 19:03 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: Dave Marchevsky, bpf, ast, andrii, daniel

On Fri, Aug 19, 2022 at 06:00:22PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Fri, 19 Aug 2022 at 10:55, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> >
> > Hi Kumar,
> >
> > Alexei and I talked about locking and a few other things today in regards to my
> > rbtree work. Some of this isn't a direct response to your ideas/notes here,
> > but hoping to summarize today's discussion inline with your code samples and
> > get your opinion.
> >
> > Also, some inline comments more directly addressing your notes.
> 
> Hi Dave, thanks for sharing the notes.
> 
> >
> > On 8/17/22 5:04 AM, Kumar Kartikeya Dwivedi wrote:
> > > Alexei and I had a meeting where we discussed some of my ideas related
> > > to BPF linked lists. I am sharing the discussion with everyone to get
> > > wider feedback, and document what we discussed.
> > >
> > > The hard stuff is the shared ownership case, hence we can discuss this
> > > while we work on landing single ownership lists. I will be sharing my
> > > patches for that as an RFC.
> > >
> > > 1. Definition
> > >
> > > We can use BTF declaration tags to annotate a common structure like
> > > struct bpf_list_head, struct bpf_rb_root, etc.
> > >
> > > #define __value(kind, name, node) __attribute__((btf_decl_tag(#kind
> > > ":" #name ":" #node))
> > >
> > > struct foo {
> > >     unsigned long data;
> > >     struct bpf_list_node node;
> > > };
> > >
> > > struct map_value {
> > >     struct bpf_spin_lock lock;
> > >     struct bpf_list_head head __value(struct, foo, node);
> > > };
> > >
> > > This allows us to parameterize any kind of intrusive collection.
> > >
> > > For the map-in-map use case:
> > >
> > > struct bar {
> > >     unsigned long data;
> > >     struct bpf_list_node node;
> > > };
> > > // Only two levels of types allowed, to ensure no cycles, and to
> > > prevent ownership cycles
> > > struct foo {
> > >     unsigned long data;
> > >     struct bpf_spin_lock lock;
> > >     struct bpf_list_node node;
> > >     struct bpf_list_head head __value(struct, bar, node);
> > > };
> > >
> > > struct map_value {
> > >     struct bpf_spin_lock lock;
> > >     struct bpf_list_head head __value(struct, foo, node);
> > > };
> > >
> >
> > Will these still be 'bpf maps' under the hood? If the list were to use
> 
> Nope, my idea was to get rid of maps for intrusive collections, and
> you always put bpf_list_head, bpf_rb_root in a map value. For global
> map-like use cases, you instead use global variables, which are also
> inside the map value of a BPF_MAP_TYPE_ARRAY. IMO there is no need
> anymore to add more and more map types with this new style of data
> structures. It is much more ergonomic to just use the head structure,
> either as a global variable, or as an allocated object, but again,
> that's my opinion. There will be some pros and cons for either
> approach :).

+1
If we can avoid adding new map types and instead recognize 'struct bpf_list_head'
and 'struct bpf_rb_root' in global data and in map values, that
would be the best.
The code would look the most natural to developers familiar with the kernel code.

> I am aware of the problem with global bpf_spin_lock (thanks to Alexei
> for spotting it), but as you described it can be solved by moving them
> into a different section.
> 
> > convention similar to the rbtree RFC, the first (non map-in-map) def could be
> > written like:
> >
> > struct foo {
> >     unsigned long data;
> >     struct bpf_list_node node;
> > };
> >
> > struct {
> >     __uint(type, BPF_MAP_TYPE_LINKED_LIST);
> >     __type(value, struct foo);
> > } list SEC(".maps");
> >
> > I think we're thinking along similar lines with regards to the BTF tag, but I
> > was thinking of tagging the value type instead of the container, something like:
> >
> > struct foo {
> >     unsigned long data;
> >     struct bpf_list_node node __protected_list_node;
> > };
> >
> > 'protected' meaning verifier knows to prevent prog from touching the
> > bpf_list_node. Currently my rbtree RFCv1 is just assuming that value type will
> > have rb_node at offset 0. BTF tag would eliminate this offset requirement
> > and allow value types to be part of multiple data structures:
> >
> > struct foo {
> >     unsigned long data;
> >     struct bpf_list_node node __protected_list_node;
> >     struct bpf_rb_node rb_node __protected_rb_node;
> > };

I'm not sure what '__protected_rb_node' tag buys here.
The verifier can infer the same from 'struct bpf_rb_node' type name.

> >
> > Not a hard requirement for me, but nice-to-have: support for heterogenous value
> > types in same list / rbtree. Map def's __type(value, struct foo) wouldn't work
> > in this case, I think your __value(struct, foo, node) would have same issue.

I thought Kumar's proposal to do:
  struct bpf_list_head head __value(struct, foo, node);
was to tell the verifier how to iterate the linked list with the appropriate container_of().

Same tagging is needed for:
  struct bpf_rb_root tree __value(struct, foo, node)
to tell bpf_rb_find(key, tree, cmp_cb) what btf_id to return
and how to container_of() from the rb_node to the BPF-program-supplied struct, to
pass that btf_id pointer into cmp_cb.
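
In program terms (a sketch only: the helper, key, tree and cmp_cb names
follow the proposal above and are not an existing API), the tag would let
the program skip the manual container_of():

  struct foo *f;

  f = bpf_rb_find(&key, &tree, cmp_cb);  /* verifier knows f is 'struct foo *' */
  if (f)
          bpf_printk("data=%lu", f->data);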

> >
> > But I think this should be possible with BTF tags somehow. List
> > helpers are ostensibly only touching the list_{head,node} - and similar for
> > rbtree, and we're both planning on explicit in-prog typed allocation.
> > If type can be propagated from alloc -> reg state -> helper input ->
> > helper output, helpers could use reg BTF info w/ properly tagged field
> > to manipulate the right field in the value struct.
> 
> We can have multiple __value on a list_head and add some way to
> disambiguate what the type will be on list_remove. The problem is not
> during list_add, as you know the type of the node being added. It is
> when you list_remove, is when you need to be able to disambiguate the
> type so that we set the reg as correct btf, btf_id and then set the
> right reg->off (so that container_of can give the entry). Until the
> disambiguation step is done, it is unknown what the type might be.

list_remove and list iterate/find need that btf_id.

> When thinking of heterogeneous values, we should probably look to add
> a way to do generic variant types in BPF, such that e.g. the first
> field is a tag, and then the actual data of that type follows. This
> would allow us to use them not only in intrusive collections, but
> anywhere else, and probably even store them in map values as kptrs.

Not following this idea. Could you give an example?

> 
> I think it is much simpler to start with homogenous types first, though.
> 
> >
> > In that case the tag would have to be on the value struct's field, not the
> > container. I do like that your __value(struct, foo, node) is teaching the
> > container what named field to manipulate. If value struct were to be part of
> > two lists this would make it possible to disambiguate.
> >
> > When we discussed this Alexei mentioned existing pointer casting helper pattern
> > (e.g. 'bpf_skc_to_tcp_sock') as potentially being helpful here.
> >
> 
> Indeed, but I think you need some bit of info at runtime to be able to do this.
> 
> > > 2. Building Blocks - A type safe allocator
> > >
> > > Add bpf_kptr_alloc, bpf_kptr_free
> > > This will use bpf_mem_alloc infra, allocator maps.
> > > Alexei mentioned that Delyan is working on support for exposing
> > > bpf_mem_alloc using an allocator map.
> > > Allocates referenced PTR_TO_BTF_ID (should we call these local kptr?):
> > > reg->btf == prog->aux->btf
> > > reg->btf_id = bpf_core_type_id_local(...)
> > > btf_struct_access allows writing to these objects.
> > > Due to type visibility, we can embed objects with special semantics
> > > inside these user defined types.
> > > Add a concept of constructing/destructing kptr.
> > > constructing -> normal kptr, escapable -> destructing
> > > In constructing and destructing state, pointer cannot escape the
> > > program. Hence, only one CPU is guaranteed to observe the object in
> > > those states. So when we have access to single ownership kptr, we know
> > > nobody else can access it. Hence we can also move its state from
> > > normal to destructing state.
> > > In case of shared ownership, we will have to rely on the result of
> > > bpf_refcount_put for this to work.
> > >
> > > 3. Embedding special fields inside such allocated kptr
> > >
> > > We must allow programmer to compose their own user defined BPF object
> > > out of building blocks provided by BPF.
> > > BPF users may have certain special objects inside this allocated
> > > object. E.g. bpf_list_node, bpf_spin_lock, even bpf_list_head
> > > (map-in-map use case).
> > > btf_struct_access won’t allow direct reads/writes to these fields.
> > > Each of them needs to be constructed before the object is considered
> > > fully constructed.
> > > An unconstructed object’s kptr cannot escape a program, it can only be
> > > destructed and freed.
> > > This has multiple advantages. We can add fields which have complex
> > > initialization requirements.
> > > This also allows safe recycling of memory without having to do zero
> > > init or inserting constructor calls automatically from verifier.
> > > Allows constructors to have parameters in future, also allows complex
> > > multi-step initialization of fields in future.
> > >
> >
> > I don't fully understand "shared ownership" from 2) and don't have a use case
> 
> Shared ownership is explained further later in section 5.
> 
> > for complex constructors in 3), but broadly agree with everything else. Will
> > do another pass.
> >
> > > 4. Single Ownership Linked Lists
> > >
> > > The kptr has single ownership.
> > > Program has to release it before BPF_EXIT, either free or move it out
> > > of program.
> > > Once passed to list, the program loses ownership.
> > > But BPF can track that until spin_lock is released, nobody else can
> > > touch it, so we can technically still list_remove a node we added
> > > using list_add, and then we will be owning it after unlock.
> > > list_add marks reference_state as ‘release_on_unlock’
> > > list_remove unmark reference_state
> > > Alexei: Similar to Dave’s approach, but different implementation.
> > > bpf_spin_unlock walks acquired_refs and release_reference marked ones.
> > > No other function calls allows in critical section, hence
> > > reference_state remains same.
> > >
> > > ----------
> > >
> > > 5. Shared Ownership
> > >
> > > Idea: Add bpf_refcount as special field embeddable in allocated kptrs.
> > > bpf_refcount_set(const), bpf_refcount_inc(const), bpf_refcount_put(ptr).
> > > If combined with RCU, can allow safe kptr_get operations for such objects.
> > > Each rb_root, list_head requires ownership of node.
> > > Caller will transfer its reference to them.
> > > If having only a single reference, do inc before transfer.
> > > It is a generic concept, and can apply to kernel types as well.
> > > When linking after allocation, it is extremely cheap to set, add, add, add…
> > >
> > > We add ‘static_ref’ to each reference_state to track incs/decs
> > > acq = static_ref = 1
> > > set  = static_ref = K (must be in [1, …] range)
> > > inc  = static_ref += K
> > > rel/put = static_ref -=  1 (may allow K, dunno)
> > >
> > > Alexei suggested that he prefers if helpers did the increment on their
> > > own in case where the bpf_refcount field exists in the object. I.e.
> > > instead of caller incrementing and then passing their reference to
> > > lists or rbtree, the add helpers receive hidden parameter to refcount
> > > field address automatically and bump the refcount when adding. In that
> > > case, they won't be releasing caller's reference_state.
> > > Then this static_ref code is not required.
> > >
> > > Kartikeya: No strong opinions, this is also another way. One advantage
> > > of managing refcount on caller side and just keeping helpers move only
> > > (regardless of single owner or shared owner kptr) is that helpers
> > > themselves have the same semantics. It always moves ownership of a
> > > reference. Also, one inc(K) and multiple add is a little cheaper than
> > > multiple inc(1) on each add.
> > >
> > > 6. How does the verifier reason about shared kptr we don't know the state of?
> > >
> > > Consider a case where we load a kptr which has shared ownership from a
> > > map using kptr_get.
> > >
> > > Now, it may have a list_node and a rb_node. We don't know whether this
> > > node is already part of some list (so that list_node is occupied),
> > > same for rb_node.
> > >
> > > There can be races like two CPUs having access to the node:
> > >
> > > CPU 0                         CPU 1
> > > lock(&list1_lock)            lock(&list2_lock)
> > > list_add(&node, &list2)
> > >     next.prev = node;
> > >     node.next = next;      list_remove(&node)
> > >                                          node.next = NULL;
> > >                                          node.prev = NULL;
> > >     node.prev = prev;
> > >     prev.next = node;
> > > unlock(&list1_lock);         unlock(&list2_lock);
> > >
> > > Interleavings can leave nodes in inconsistent states.
> > > We need to ensure that when we are doing list_add or list_remove for
> > > kptr we don't know the state of, it is only in a safe context with
> > > ownership of that operation.
> > >
> > > Remove:
> > >
> > > When inside list_for_each helper, list_remove is safe for nodes since
> > > we are protected by lock.
> > >
> > > Idea: An owner field alongside the list_node and rb_node.
> > > list_add sets it to the address of list_head, list_remove sets it to
> > > NULL. This will be done under spinlock of the list.
> > >
> > > When we get access to the object in an unknown state for these fields,
> > > we first lock the list we want to remove it from, check the owner
> > > field, and only remove it when we see that owner matches locked list.
> > >
> > > Each list_add updates owner, list_remove sets to NULL.
> > >     bpf_spin_lock(&lock);
> > >     if (bpf_list_owns_node(&i->node, &list)) { // checks owner
> > > list_remove(&i->node);
> > >     }
> > >     bpf_spin_unlock(&lock);
> > >
> > > bpf_list_owns_node poisons pointer in false branch, so user can only
> > > list_remove in true branch.
> > >
> > > If the owner is not a locked list pointer, it will be either NULL or
> > > some other value (because of previous list_remove while holding same
> > > lock, or list_add while holding some other list lock).
> > > If the owner is our list pointer, we can be sure this is safe, as we
> > > have already locked list.
> > > Otherwise, previous critical section must have modified owner.
> > > So one single load (after inlining this helper) allows unlinking
> > > random kptr we have reference to, safely.
> > >
> > > Cost: 8-bytes per object. Advantages: Prevents bugs like racy
> > > list_remove and double list_add, doesn't need fallible helpers (the
> > > check that would have been inside has to be done by the user now).
> > > Don't need the abort logic.
> > >
> >
> > I agree, keeping track of owner seems necessary. Seems harder to verify
> > statically than lock as well. Alexei mentioned today that combination
> > "grab lock and take ownership" helper for dynamic check might make
> > sense.
> >
> > Tangentially, I've been poking at ergonomics of
> > libbpf lock definition this week and think I have something reasonable:
> >
> > struct node_data {
> >         struct rb_node node;
> >         __u32 one;
> >         __u32 two;
> > };
> >
> > struct l {
> >         __uint(type, BPF_MAP_TYPE_ARRAY);
> >         __type(key, u32);
> >         __type(value, struct bpf_spin_lock);
> >         __uint(max_entries, 1);
> > } lock_arr SEC(".maps");
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_RBTREE);
> >         __type(value, struct node_data);
> >         __array(lock, struct l);
> > } rbtree1 SEC(".maps") = {
> >         .lock = {
> >                 [0] = &lock_arr,
> >         },
> > };
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_RBTREE);
> >         __type(value, struct node_data);
> >         __array(lock, struct l);
> > } rbtree2 SEC(".maps") = {
> >         .lock = {
> >                 [0] = &lock_arr,
> >         },
> > };
> >
> > ... in BPF prog
> >
> >   bpf_spin_lock(&lock_arr[0]);
> >
> >   // Can safely operate on either tree, move nodes between them, etc.
> >
> >   bpf_spin_unlock(&lock_arr[0]);
> >
> >
> > Notes:
> >   * Verifier knows which lock is supposed to be used at map creation time
> >     * Can reuse bpf_verifier_state's 'active_spin_lock' member, so no addt'l
> >       bookkeeping needed to verify that rbtree_add or similar is happening
> >       in critical section
> 
> Yes, this is similar to my approach, except what I'm doing is (suppose
> we fix the bpf_spin_lock in mmap-able map value problem):

We can teach libbpf to support more than 3 hard coded global maps
(bss, rodata, data). So any named section will go into its own array with max_entries=1
that won't be mmap-able and can host bpf_rb_root, bpf_spin_lock, etc.

> The list_head and lock protecting it are global variables, hence in
> the same map value for the global variable's array map (for now only
> one lock is allowed in a map value, but we may allow some guarded_by
> annotation to associate different locks to different containers).
> Now, you can use the same active_spin_lock infra to track whether I
> hold the one in the same map value as the list_head. More on that
> below.
> 
> >   * Can benefit from relo goodness (e.g. rbtree3 using extern lock in another
> >     file)
> >   * If necessary, similar dynamic verification behavior as just keeping lock
> >     internal
> >   * Implementation similarities with map_of_map 'inner_map'. Similarly to
> >     inner_map_fd, kernel needs to know about lock_map_fd. Can use map_extra for
> >     this to avoid uapi changes
> >
> > Alexei and I discussed possibly allowing raw 'struct bpf_spin_lock' global var,
> > which would require some additional libbpf changes as bpf_spin_lock can't be
> > mmap'd and libbpf tries to mmap all .data maps currently. Perhaps a separate
> > .data.no_mmap section.
> >
> > This ergonomics idea doesn't solve the map-in-map issue, I'm still unsure
> > how to statically verify lock in that case. Have you had a chance to think
> > about it further?
> >
> 
> You rely on the lock being in the same allocation, and manipulation
> done on an object from the same 'lookup'. See below:
> 
> struct foo {
>         struct bpf_spin_lock lock;
>         struct bpf_list_head head __value(...);
> };
> 
> struct map_value {
>         struct foo __local_kptr *ptr;
> };
> 
> struct {
>         __uint(type, BPF_MAP_TYPE_ARRAY);
>         __type(key, int);
>         __type(value, struct map_value);
>         __uint(max_entries, 8);
> } array_of_lists SEC(".maps");
> 
> In my case, the structure is the map, so pointer to the structure
> inside a map makes it map-in-map (now common, the existing map-in-maps
> just hide this from you, so it's pretty much the same thing
> anyway...).
> 
> This is just an example, it can be one more level deep, but anyway.
> 
> When I do a map lookup, there is check in
> verifier.c:reg_may_point_to_spin_lock, this preserves reg->id on NULL
> unmarking.  This reg->id is then remembered when you take lock inside
> this map value, to associate it back to unlock correctly.
> 
> Now, suppose you load the kptr. You know the kptr has a lock, you will
> update this check to also consider local kptr with locks. The reg->id
> is preserved after loaded kptr is NULL checked, but it is unique for
> each load of the kptr. You lock spin_lock in the kptr, you then add to
> list, the list_add verifier check goes and sees whether the current
> lock held and the current list_head come from the same reg->id (you
> know the reg of list_head, right? So you know the id as well, and you
> match that to cur->active_spin_lock_id). If so, it is the correct
> lock, we locked the lock in the same loaded kptr as the one whose
> list_head we are list_add-ing to.
> 
> For global variables, the check needs more work. In the normal map
> lookup case, we assign fresh reg->id whenever you do a map lookup, so
> in case of array map spin lock for the same key will set different id
> in cur->active_spin_lock_id for two different map values from two
> different lookups. This is because we don't know if it is the same map
> value on second lookup, so both locks in different map value are
> considered different locks. The id is the unique lock id, essentially.
> 
> Since global variables are in direct_value_addr map with 1 max_entry,
> we don't need to assign fresh reg->id and each pseudo ldimm64 insn. We
> can instead teach it to either track it using id (for the case of
> normal map lookups and local kptr), or map_ptr to accomodate global
> variables non-unique ids. At once, only one of two is set, the other
> is zero.
> 
> Then everything falls in place. We always match both map_ptr and id.
> For global data and map lookups, the map_ptr is matched, id will be 0
> for global data, non zero for normal map lookups. There is only one
> map value so the lock protects everything in it. For the other case I
> described above, map_ptr is NULL but id will be different if not from
> the same 'lookup' in case of local kptr (PTR_TO_BTF_ID).
> 
> We also have map_uid, which is assigned to map_ptr of inner map
> lookups. But remember that we are talking of map values above, so even
> if for lookups from two differ inner maps of same map, we get two map
> values whose map_ptr is technically same (even if the map_uid was
> different), their reg->id _will_ be different, so the above checks are
> sufficient to disambiguate spin locks for all kinds of cases.
> 
> Keeping lock and data in the same allocation thus allows you to
> associate locks statically even for dynamic allocations, enabling the
> map-in-map use case.

All that makes sense, but consider a use case where we need an rb_root for every cgroup.
The 'struct bpf_rb_root' will be created dynamically in cgroup local storage.
We can create a lock in the same cgroup local storage as well.
It's all nice and the verifier can do locking checks statically,
but how can the program transfer an rb_node from one rb tree to another
in a different cgroup?
Either two locks need to be held, and I don't see a way to check that
statically, or one bpf_spin_lock should be used across all cgroups.
Thoughts?
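
For concreteness, a minimal sketch of the per-cgroup layout being
described (the element type is made up, and this assumes bpf_rb_root
and bpf_spin_lock become allowed inside cgroup local storage):

  struct cg_node {
          __u64 key;
          struct bpf_rb_node rb_node;
  };

  /* value of the cgroup local storage map */
  struct cg_value {
          struct bpf_spin_lock lock;  /* protects root below */
          struct bpf_rb_root root __value(struct, cg_node, rb_node);
  };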

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF Linked Lists discussion
  2022-08-19 19:03     ` Alexei Starovoitov
@ 2022-08-19 20:24       ` Kumar Kartikeya Dwivedi
  2022-08-23 20:02         ` Alexei Starovoitov
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-19 20:24 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Dave Marchevsky, bpf, ast, andrii, daniel

On Fri, 19 Aug 2022 at 21:03, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 06:00:22PM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Fri, 19 Aug 2022 at 10:55, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> > >
> > > Hi Kumar,
> > >
> > > Alexei and I talked about locking and a few other things today in regards to my
> > > rbtree work. Some of this isn't a direct response to your ideas/notes here,
> > > but hoping to summarize today's discussion inline with your code samples and
> > > get your opinion.
> > >
> > > Also, some inline comments more directly addressing your notes.
> >
> > Hi Dave, thanks for sharing the notes.
> >
> > >
> > > On 8/17/22 5:04 AM, Kumar Kartikeya Dwivedi wrote:
> > > > Alexei and I had a meeting where we discussed some of my ideas related
> > > > to BPF linked lists. I am sharing the discussion with everyone to get
> > > > wider feedback, and document what we discussed.
> > > >
> > > > The hard stuff is the shared ownership case, hence we can discuss this
> > > > while we work on landing single ownership lists. I will be sharing my
> > > > patches for that as an RFC.
> > > >
> > > > 1. Definition
> > > >
> > > > We can use BTF declaration tags to annotate a common structure like
> > > > struct bpf_list_head, struct bpf_rb_root, etc.
> > > >
> > > > #define __value(kind, name, node) __attribute__((btf_decl_tag(#kind
> > > > ":" #name ":" #node))
> > > >
> > > > struct foo {
> > > >     unsigned long data;
> > > >     struct bpf_list_node node;
> > > > };
> > > >
> > > > struct map_value {
> > > >     struct bpf_spin_lock lock;
> > > >     struct bpf_list_head head __value(struct, foo, node);
> > > > };
> > > >
> > > > This allows us to parameterize any kind of intrusive collection.
> > > >
> > > > For the map-in-map use case:
> > > >
> > > > struct bar {
> > > >     unsigned long data;
> > > >     struct bpf_list_node node;
> > > > };
> > > > // Only two levels of types allowed, to ensure no cycles, and to
> > > > prevent ownership cycles
> > > > struct foo {
> > > >     unsigned long data;
> > > >     struct bpf_spin_lock lock;
> > > >     struct bpf_list_node node;
> > > >     struct bpf_list_head head __value(struct, bar, node);
> > > > };
> > > >
> > > > struct map_value {
> > > >     struct bpf_spin_lock lock;
> > > >     struct bpf_list_head head __value(struct, foo, node);
> > > > };
> > > >
> > >
> > > Will these still be 'bpf maps' under the hood? If the list were to use
> >
> > Nope, my idea was to get rid of maps for intrusive collections, and
> > you always put bpf_list_head, bpf_rb_root in a map value. For global
> > map-like use cases, you instead use global variables, which are also
> > inside the map value of a BPF_MAP_TYPE_ARRAY. IMO there is no need
> > anymore to add more and more map types with this new style of data
> > structures. It is much more ergonomic to just use the head structure,
> > either as a global variable, or as an allocated object, but again,
> > that's my opinion. There will be some pros and cons for either
> > approach :).
>
> +1
> If we can avoid adding new map types and instead recognize 'struct bpf_list_head'
> and 'struct bpf_rb_root' in global data and in map values that
> would be the best.
> The code would look the most natural to developers familiar with the kernel code.
>
> > I am aware of the problem with global bpf_spin_lock (thanks to Alexei
> > for spotting it), but as you described it can be solved by moving them
> > into a different section.
> >
> > > convention similar to the rbtree RFC, the first (non map-in-map) def could be
> > > written like:
> > >
> > > struct foo {
> > >     unsigned long data;
> > >     struct bpf_list_node node;
> > > };
> > >
> > > struct {
> > >     __uint(type, BPF_MAP_TYPE_LINKED_LIST);
> > >     __type(value, struct foo);
> > > } list SEC(".maps");
> > >
> > > I think we're thinking along similar lines with regards to the BTF tag, but I
> > > was thinking of tagging the value type instead of the container, something like:
> > >
> > > struct foo {
> > >     unsigned long data;
> > >     struct bpf_list_node node __protected_list_node;
> > > };
> > >
> > > 'protected' meaning verifier knows to prevent prog from touching the
> > > bpf_list_node. Currently my rbtree RFCv1 is just assuming that value type will
> > > have rb_node at offset 0. BTF tag would eliminate this offset requirement
> > > and allow value types to be part of multiple data structures:
> > >
> > > struct foo {
> > >     unsigned long data;
> > >     struct bpf_list_node node __protected_list_node;
> > >     struct bpf_rb_node rb_node __protected_rb_node;
> > > };
>
> I'm not sure what '__protected_rb_node' tag buys here.
> The verifier can infer the same from 'struct bpf_rb_node' type name.
>
> > >
> > > Not a hard requirement for me, but nice-to-have: support for heterogenous value
> > > types in same list / rbtree. Map def's __type(value, struct foo) wouldn't work
> > > in this case, I think your __value(struct, foo, node) would have same issue.
>
> I thought Kumar's proposal to do:
>   struct bpf_list_head head __value(struct, foo, node);
> was to tell the verifier how to iterate the linked list with the appropriate container_of().
>
> Same tagging is needed for:
>   struct bpf_rb_root tree __value(struct, foo, node)
> to tell bpf_rb_find(key, tree, cmp_cb) what btf_id to return
> and how to container_of() from rb_node to bpf program supplied struct to
> pass that btf_id pointer into cmp_cb.
>
> > >
> > > But I think this should be possible with BTF tags somehow. List
> > > helpers are ostensibly only touching the list_{head,node} - and similar for
> > > rbtree, and we're both planning on explicit in-prog typed allocation.
> > > If type can be propagated from alloc -> reg state -> helper input ->
> > > helper output, helpers could use reg BTF info w/ properly tagged field
> > > to manipulate the right field in the value struct.
> >
> > We can have multiple __value on a list_head and add some way to
> > disambiguate what the type will be on list_remove. The problem is not
> > during list_add, as you know the type of the node being added. It is
> > when you list_remove, is when you need to be able to disambiguate the
> > type so that we set the reg as correct btf, btf_id and then set the
> > right reg->off (so that container_of can give the entry). Until the
> > disambiguation step is done, it is unknown what the type might be.
>
> list_remove and list iterate/find need that btf_id.
>
> > When thinking of heterogeneous values, we should probably look to add
> > a way to do generic variant types in BPF, such that e.g. the first
> > field is a tag, and then the actual data of that type follows. This
> > would allow us to use them not only in intrusive collections, but
> > anywhere else, and probably even store them in map values as kptrs.
>
> Not following this idea. Could you give an example?
>

Basically a tagged union in C.

For a list_head to take values A, B, C, we define a struct type that has
an enum tag field identifying the type, some common fields that can be
assumed to be at the same offset in all variants, and then the
remaining, differing members in a union.

Basically, how we do variable type structs in the kernel with a common
initial sequence (e.g. sock_common), but with an added type field so
that safe casting can be implemented at runtime. list_remove then sets
btf_id to the variant base type, and it needs to be cast before the
non-common fields can be used.
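
A rough sketch of what such a variant type could look like (all names
invented for illustration):

enum elem_kind { ELEM_A, ELEM_B };

struct elem {
        enum elem_kind kind;            // runtime tag, checked before casting
        struct bpf_list_node node;      // common initial sequence
        union {
                struct { __u64 a_data; };               // ELEM_A payload
                struct { __u32 b_lo; __u32 b_hi; };     // ELEM_B payload
        };
};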

> >
> > I think it is much simpler to start with homogenous types first, though.
> >
> > >
> > > In that case the tag would have to be on the value struct's field, not the
> > > container. I do like that your __value(struct, foo, node) is teaching the
> > > container what named field to manipulate. If value struct were to be part of
> > > two lists this would make it possible to disambiguate.
> > >
> > > When we discussed this Alexei mentioned existing pointer casting helper pattern
> > > (e.g. 'bpf_skc_to_tcp_sock') as potentially being helpful here.
> > >
> >
> > Indeed, but I think you need some bit of info at runtime to be able to do this.
> >
> > > > 2. Building Blocks - A type safe allocator
> > > >
> > > > Add bpf_kptr_alloc, bpf_kptr_free
> > > > This will use bpf_mem_alloc infra, allocator maps.
> > > > Alexei mentioned that Delyan is working on support for exposing
> > > > bpf_mem_alloc using an allocator map.
> > > > Allocates referenced PTR_TO_BTF_ID (should we call these local kptr?):
> > > > reg->btf == prog->aux->btf
> > > > reg->btf_id = bpf_core_type_id_local(...)
> > > > btf_struct_access allows writing to these objects.
> > > > Due to type visibility, we can embed objects with special semantics
> > > > inside these user defined types.
> > > > Add a concept of constructing/destructing kptr.
> > > > constructing -> normal kptr, escapable -> destructing
> > > > In constructing and destructing state, pointer cannot escape the
> > > > program. Hence, only one CPU is guaranteed to observe the object in
> > > > those states. So when we have access to single ownership kptr, we know
> > > > nobody else can access it. Hence we can also move its state from
> > > > normal to destructing state.
> > > > In case of shared ownership, we will have to rely on the result of
> > > > bpf_refcount_put for this to work.
> > > >
> > > > 3. Embedding special fields inside such allocated kptr
> > > >
> > > > We must allow programmer to compose their own user defined BPF object
> > > > out of building blocks provided by BPF.
> > > > BPF users may have certain special objects inside this allocated
> > > > object. E.g. bpf_list_node, bpf_spin_lock, even bpf_list_head
> > > > (map-in-map use case).
> > > > btf_struct_access won’t allow direct reads/writes to these fields.
> > > > Each of them needs to be constructed before the object is considered
> > > > fully constructed.
> > > > An unconstructed object’s kptr cannot escape a program, it can only be
> > > > destructed and freed.
> > > > This has multiple advantages. We can add fields which have complex
> > > > initialization requirements.
> > > > This also allows safe recycling of memory without having to do zero
> > > > init or inserting constructor calls automatically from verifier.
> > > > Allows constructors to have parameters in future, also allows complex
> > > > multi-step initialization of fields in future.
> > > >
> > >
> > > I don't fully understand "shared ownership" from 2) and don't have a use case
> >
> > Shared ownership is explained further later in section 5.
> >
> > > for complex constructors in 3), but broadly agree with everything else. Will
> > > do another pass.
> > >
> > > > 4. Single Ownership Linked Lists
> > > >
> > > > The kptr has single ownership.
> > > > Program has to release it before BPF_EXIT, either free or move it out
> > > > of program.
> > > > Once passed to list, the program loses ownership.
> > > > But BPF can track that until spin_lock is released, nobody else can
> > > > touch it, so we can technically still list_remove a node we added
> > > > using list_add, and then we will be owning it after unlock.
> > > > list_add marks reference_state as ‘release_on_unlock’
> > > > list_remove unmark reference_state
> > > > Alexei: Similar to Dave’s approach, but different implementation.
> > > > bpf_spin_unlock walks acquired_refs and release_reference marked ones.
> > > > No other function calls allows in critical section, hence
> > > > reference_state remains same.
> > > >
> > > > ----------
> > > >
> > > > 5. Shared Ownership
> > > >
> > > > Idea: Add bpf_refcount as special field embeddable in allocated kptrs.
> > > > bpf_refcount_set(const), bpf_refcount_inc(const), bpf_refcount_put(ptr).
> > > > If combined with RCU, can allow safe kptr_get operations for such objects.
> > > > Each rb_root, list_head requires ownership of node.
> > > > Caller will transfer its reference to them.
> > > > If having only a single reference, do inc before transfer.
> > > > It is a generic concept, and can apply to kernel types as well.
> > > > When linking after allocation, it is extremely cheap to set, add, add, add…
> > > >
> > > > We add ‘static_ref’ to each reference_state to track incs/decs
> > > > acq = static_ref = 1
> > > > set  = static_ref = K (must be in [1, …] range)
> > > > inc  = static_ref += K
> > > > rel/put = static_ref -=  1 (may allow K, dunno)
> > > >
> > > > Alexei suggested that he prefers if helpers did the increment on their
> > > > own in case where the bpf_refcount field exists in the object. I.e.
> > > > instead of caller incrementing and then passing their reference to
> > > > lists or rbtree, the add helpers receive hidden parameter to refcount
> > > > field address automatically and bump the refcount when adding. In that
> > > > case, they won't be releasing caller's reference_state.
> > > > Then this static_ref code is not required.
> > > >
> > > > Kartikeya: No strong opinions, this is also another way. One advantage
> > > > of managing refcount on caller side and just keeping helpers move only
> > > > (regardless of single owner or shared owner kptr) is that helpers
> > > > themselves have the same semantics. It always moves ownership of a
> > > > reference. Also, one inc(K) and multiple add is a little cheaper than
> > > > multiple inc(1) on each add.
> > > >
> > > > 6. How does the verifier reason about shared kptr we don't know the state of?
> > > >
> > > > Consider a case where we load a kptr which has shared ownership from a
> > > > map using kptr_get.
> > > >
> > > > Now, it may have a list_node and a rb_node. We don't know whether this
> > > > node is already part of some list (so that list_node is occupied),
> > > > same for rb_node.
> > > >
> > > > There can be races like two CPUs having access to the node:
> > > >
> > > > CPU 0                         CPU 1
> > > > lock(&list1_lock)            lock(&list2_lock)
> > > > list_add(&node, &list2)
> > > >     next.prev = node;
> > > >     node.next = next;      list_remove(&node)
> > > >                                          node.next = NULL;
> > > >                                          node.prev = NULL;
> > > >     node.prev = prev;
> > > >     prev.next = node;
> > > > unlock(&list1_lock);         unlock(&list2_lock);
> > > >
> > > > Interleavings can leave nodes in inconsistent states.
> > > > We need to ensure that when we are doing list_add or list_remove for
> > > > kptr we don't know the state of, it is only in a safe context with
> > > > ownership of that operation.
> > > >
> > > > Remove:
> > > >
> > > > When inside list_for_each helper, list_remove is safe for nodes since
> > > > we are protected by lock.
> > > >
> > > > Idea: An owner field alongside the list_node and rb_node.
> > > > list_add sets it to the address of list_head, list_remove sets it to
> > > > NULL. This will be done under spinlock of the list.
> > > >
> > > > When we get access to the object in an unknown state for these fields,
> > > > we first lock the list we want to remove it from, check the owner
> > > > field, and only remove it when we see that owner matches locked list.
> > > >
> > > > Each list_add updates owner, list_remove sets to NULL.
> > > >     bpf_spin_lock(&lock);
> > > >     if (bpf_list_owns_node(&i->node, &list)) { // checks owner
> > > > list_remove(&i->node);
> > > >     }
> > > >     bpf_spin_unlock(&lock);
> > > >
> > > > bpf_list_owns_node poisons pointer in false branch, so user can only
> > > > list_remove in true branch.
> > > >
> > > > If the owner is not a locked list pointer, it will be either NULL or
> > > > some other value (because of previous list_remove while holding same
> > > > lock, or list_add while holding some other list lock).
> > > > If the owner is our list pointer, we can be sure this is safe, as we
> > > > have already locked list.
> > > > Otherwise, previous critical section must have modified owner.
> > > > So one single load (after inlining this helper) allows unlinking
> > > > random kptr we have reference to, safely.
> > > >
> > > > Cost: 8-bytes per object. Advantages: Prevents bugs like racy
> > > > list_remove and double list_add, doesn't need fallible helpers (the
> > > > check that would have been inside has to be done by the user now).
> > > > Don't need the abort logic.
> > > >
> > >
> > > I agree, keeping track of owner seems necessary. Seems harder to verify
> > > statically than lock as well. Alexei mentioned today that combination
> > > "grab lock and take ownership" helper for dynamic check might make
> > > sense.
> > >
> > > Tangentially, I've been poking at ergonomics of
> > > libbpf lock definition this week and think I have something reasonable:
> > >
> > > struct node_data {
> > >         struct rb_node node;
> > >         __u32 one;
> > >         __u32 two;
> > > };
> > >
> > > struct l {
> > >         __uint(type, BPF_MAP_TYPE_ARRAY);
> > >         __type(key, u32);
> > >         __type(value, struct bpf_spin_lock);
> > >         __uint(max_entries, 1);
> > > } lock_arr SEC(".maps");
> > >
> > > struct {
> > >         __uint(type, BPF_MAP_TYPE_RBTREE);
> > >         __type(value, struct node_data);
> > >         __array(lock, struct l);
> > > } rbtree1 SEC(".maps") = {
> > >         .lock = {
> > >                 [0] = &lock_arr,
> > >         },
> > > };
> > >
> > > struct {
> > >         __uint(type, BPF_MAP_TYPE_RBTREE);
> > >         __type(value, struct node_data);
> > >         __array(lock, struct l);
> > > } rbtree2 SEC(".maps") = {
> > >         .lock = {
> > >                 [0] = &lock_arr,
> > >         },
> > > };
> > >
> > > ... in BPF prog
> > >
> > >   bpf_spin_lock(&lock_arr[0]);
> > >
> > >   // Can safely operate on either tree, move nodes between them, etc.
> > >
> > >   bpf_spin_unlock(&lock_arr[0]);
> > >
> > >
> > > Notes:
> > >   * Verifier knows which lock is supposed to be used at map creation time
> > >     * Can reuse bpf_verifier_state's 'active_spin_lock' member, so no addt'l
> > >       bookkeeping needed to verify that rbtree_add or similar is happening
> > >       in critical section
> >
> > Yes, this is similar to my approach, except what I'm doing is (suppose
> > we fix the bpf_spin_lock in mmap-able map value problem):
>
> We can teach libbpf to support more than 3 hard coded global maps
> (bss, rodata, data). So any named section will go into its own array with max_entries=1
> that won't be mmap-able and will allow to host bpf_rb_root, bpf_spin_lock, etc.
>

Yep, this should work.

> > The list_head and lock protecting it are global variables, hence in
> > the same map value for the global variable's array map (for now only
> > one lock is allowed in a map value, but we may allow some guarded_by
> > annotation to associate different locks to different containers).
> > Now, you can use the same active_spin_lock infra to track whether I
> > hold the one in the same map value as the list_head. More on that
> > below.
> >
> > >   * Can benefit from relo goodness (e.g. rbtree3 using extern lock in another
> > >     file)
> > >   * If necessary, similar dynamic verification behavior as just keeping lock
> > >     internal
> > >   * Implementation similarities with map_of_map 'inner_map'. Similarly to
> > >     inner_map_fd, kernel needs to know about lock_map_fd. Can use map_extra for
> > >     this to avoid uapi changes
> > >
> > > Alexei and I discussed possibly allowing raw 'struct bpf_spin_lock' global var,
> > > which would require some additional libbpf changes as bpf_spin_lock can't be
> > > mmap'd and libbpf tries to mmap all .data maps currently. Perhaps a separate
> > > .data.no_mmap section.
> > >
> > > This ergonomics idea doesn't solve the map-in-map issue, I'm still unsure
> > > how to statically verify lock in that case. Have you had a chance to think
> > > about it further?
> > >
> >
> > You rely on the lock being in the same allocation, and manipulation
> > done on an object from the same 'lookup'. See below:
> >
> > struct foo {
> >         struct bpf_spin_lock lock;
> >         struct bpf_list_head head __value(...);
> > };
> >
> > struct map_value {
> >         struct foo __local_kptr *ptr;
> > };
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_ARRAY);
> >         __type(key, int);
> >         __type(value, struct map_value);
> >         __uint(max_entries, 8);
> > } array_of_lists SEC(".maps");
> >
> > In my case, the structure is the map, so pointer to the structure
> > inside a map makes it map-in-map (now common, the existing map-in-maps
> > just hide this from you, so it's pretty much the same thing
> > anyway...).
> >
> > This is just an example, it can be one more level deep, but anyway.
> >
> > When I do a map lookup, there is check in
> > verifier.c:reg_may_point_to_spin_lock, this preserves reg->id on NULL
> > unmarking.  This reg->id is then remembered when you take lock inside
> > this map value, to associate it back to unlock correctly.
> >
> > Now, suppose you load the kptr. You know the kptr has a lock, you will
> > update this check to also consider local kptr with locks. The reg->id
> > is preserved after loaded kptr is NULL checked, but it is unique for
> > each load of the kptr. You lock spin_lock in the kptr, you then add to
> > list, the list_add verifier check goes and sees whether the current
> > lock held and the current list_head come from the same reg->id (you
> > know the reg of list_head, right? So you know the id as well, and you
> > match that to cur->active_spin_lock_id). If so, it is the correct
> > lock, we locked the lock in the same loaded kptr as the one whose
> > list_head we are list_add-ing to.
> >
> > For global variables, the check needs more work. In the normal map
> > lookup case, we assign fresh reg->id whenever you do a map lookup, so
> > in case of array map spin lock for the same key will set different id
> > in cur->active_spin_lock_id for two different map values from two
> > different lookups. This is because we don't know if it is the same map
> > value on second lookup, so both locks in different map value are
> > considered different locks. The id is the unique lock id, essentially.
> >
> > Since global variables are in direct_value_addr map with 1 max_entry,
> > we don't need to assign fresh reg->id and each pseudo ldimm64 insn. We
> > can instead teach it to either track it using id (for the case of
> > normal map lookups and local kptr), or map_ptr to accomodate global
> > variables non-unique ids. At once, only one of two is set, the other
> > is zero.
> >
> > Then everything falls in place. We always match both map_ptr and id.
> > For global data and map lookups, the map_ptr is matched, id will be 0
> > for global data, non zero for normal map lookups. There is only one
> > map value so the lock protects everything in it. For the other case I
> > described above, map_ptr is NULL but id will be different if not from
> > the same 'lookup' in case of local kptr (PTR_TO_BTF_ID).
> >
> > We also have map_uid, which is assigned to map_ptr of inner map
> > lookups. But remember that we are talking of map values above, so even
> > if for lookups from two differ inner maps of same map, we get two map
> > values whose map_ptr is technically same (even if the map_uid was
> > different), their reg->id _will_ be different, so the above checks are
> > sufficient to disambiguate spin locks for all kinds of cases.
> >
> > Keeping lock and data in the same allocation thus allows you to
> > associate locks statically even for dynamic allocations, enabling the
> > map-in-map use case.
>
> All that makes sense, but consider use case where we need rb_root for every cgroup.
> The 'struct bpf_rb_root' will be created dynamically in cgroup local storage.
> We can create a lock in the same cgroup local storage as well.
> It's all nice and the verifier can do locking checks statically,
> > but how can the program transfer an rb_node from one rb tree to another
> in a different cgroup?
> Either two locks need to be held and I don't see a way to check that
> statically or one bpf_spin_lock should be used across all cgroups.
> Thoughts?

Thanks for the concrete example, much easier to reason about this :).
Here's my thought dump, comments welcome.

So it depends on your use case and the type of node (single ownership
vs shared ownership).

Do you want an atomic move or not?
If yes, for the single ownership case, the following works.

lock lock1
remove_rb_node
unlock lock1
lock lock2
add_rb_node
unlock lock2

Due to single ownership, nobody can write (reads may be safe e.g.
using RCU) to the removed rb_node between the two critical sections.
Both locks can be checked statically against their rb_root. For shared
ownership, the above won't be atomic (someone else with a ref to the
node may steal it between unlocking lock1 and locking lock2).
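
In C-ish form, the single-ownership move above might look like this
(src, dst, elem, less_cb and the rbtree helper names are illustrative,
not a settled API):

  bpf_spin_lock(&src->lock);
  bpf_rbtree_remove(&src->root, &elem->rb_node);
  bpf_spin_unlock(&src->lock);

  /* single ownership: nobody else can write to elem here */

  bpf_spin_lock(&dst->lock);
  bpf_rbtree_add(&dst->root, &elem->rb_node, less_cb);
  bpf_spin_unlock(&dst->lock);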

So the main question is: can we safely allow holding two locks at the
same time? We only need that for atomic moves in the shared ownership
case. Then the problem becomes one of deadlock avoidance.

The first-pass solution might be to make the first lock a proper
spin_lock call, and the second attempt a spin_trylock. If the trylock
succeeds, you can do the operation; if it fails, it may be due to an
ABBA deadlock, or just contention.

However, this is not good, because under contention we'll see a lot of
failed operations. The next best thing we can do is define a lock
order; to do it generically, we can impose a simple rule for all BPF
programs:
Whenever two bpf_spin_locks are taken, they are always taken in the
order of the locks' pointer addresses, i.e. you must lock l1 before
l2 if &l1 < &l2, and l2 before l1 if &l2 < &l1.
Then it becomes a problem of enforcing the correct order in the right branch.

One catch is that both locks may have the same address; in that case,
we can use spin_lock_nested with SINGLE_DEPTH_NESTING to allow taking
the same lock twice (or the user can have an else case).

This ensures that as long as we have two levels of locking supported
in the verifier, there is never an ABBA deadlock condition across all
BPF programs on the system.
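
A program-side sketch of that ordering rule (hypothetical: today's
verifier does not allow holding two bpf_spin_locks at once):

static __always_inline void lock_pair(struct bpf_spin_lock *a,
                                      struct bpf_spin_lock *b)
{
        if (a == b) {
                bpf_spin_lock(a);       /* same lock: take it only once */
        } else if (a < b) {
                bpf_spin_lock(a);       /* lower address first */
                bpf_spin_lock(b);
        } else {
                bpf_spin_lock(b);
                bpf_spin_lock(a);
        }
}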

----

Anyway, this is one approach, which I really like, but I understand
there might be cases in the future where you need to share locks
dynamically between structures. For this, I am thinking that we can
use the same idea we used in bpf_list_owns_node, but make it
convenient. It is a copy paste of Alexei's refcounted locks idea, but
with a few small tweaks.

For this case, you will have:

struct foo {
  struct bpf_spin_lock *lock_ptr;
  ...;
};

When you construct such a kptr, the verifier enforces that you "move" a
reference to an allocated lock (with a refcount to share it among
multiple structures), and it disallows mutation of this field until
you enter the single ownership phase of the object (i.e. refcount == 1
for refcounted ones, or just having ownership of the pointer for
non-refcounted objects). For RCU protected ones, we might also have to
enforce the RCU grace period. But the point is, mutation should only
be permitted once we know no other CPU can take the lock.

Then, whenever you know two nodes share the same lock, you upgrade the
unlocked structure using the following pattern, marking it in the true
branch as locked by the currently held spinlock.

p1 = foo();
p2 = foo();
lock(p1->lock_ptr);
if (p1->lock_ptr == p2->lock_ptr) { // cannot change while we can observe p2
    // Both are locked, do operations protecting structs in both,
    // like atomic moves
}
unlock(p1->lock_ptr);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF Linked Lists discussion
  2022-08-19 20:24       ` Kumar Kartikeya Dwivedi
@ 2022-08-23 20:02         ` Alexei Starovoitov
  2022-08-24 18:56           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2022-08-23 20:02 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: Dave Marchevsky, bpf, ast, andrii, daniel

On Fri, Aug 19, 2022 at 10:24:17PM +0200, Kumar Kartikeya Dwivedi wrote:
> >
> > > When thinking of heterogeneous values, we should probably look to add
> > > a way to do generic variant types in BPF, such that e.g. the first
> > > field is a tag, and then the actual data of that type follows. This
> > > would allow us to use them not only in intrusive collections, but
> > > anywhere else, and probably even store them in map values as kptrs.
> >
> > Not following this idea. Could you give an example?
> >
> 
> Basically a tagged union in C.
> 
> For list_head to take value A, B, C, we define a struct type that has
> a enum type field that tells the type, some common fields that can be
> assumed at the same offset in all types sharing, and then the various
> remaining members in a union that may differ.
> 
> Basically, how we do variable type structs in the kernel with a common
> initial sequence in structs (e.g. sock_common), but with an added type
> field to be able to implement safe casting at runtime. list_remove
> then sets btf_id to variant, and it needs to be casted before
> non-common sequence can be used.

More like class inheritance with run-time type casts?
I guess it can be useful one day, but let's put it on the back burner for now.
We're still trying to figure out the basic linked lists :)
Advanced hierarchy can wait.

> > All that makes sense, but consider use case where we need rb_root for every cgroup.
> > The 'struct bpf_rb_root' will be created dynamically in cgroup local storage.
> > We can create a lock in the same cgroup local storage as well.
> > It's all nice and the verifier can do locking checks statically,
> > but how the program can trasnfer and rb_node from one rb tree to another
> > in a different cgroup?
> > Either two locks need to be held and I don't see a way to check that
> > statically or one bpf_spin_lock should be used across all cgroups.
> > Thoughts?
> 
> Thanks for the concrete example, much easier to reason about this :).
> Here's my thought dump, comments welcome.
> 
> So it depends on your use case and the type of node (single ownership
> vs shared ownership).
> 
> Do you want an atomic move or not?
> If yes, for the single ownership case, the following works.
> 
> lock lock1
> remove_rb_node
> unlock lock1
> lock lock2
> add_rb_node
> unlock lock2
> 
> Due to single ownership, nobody can write (reads may be safe e.g.
> using RCU) to the removed rb_node between the two critical sections.
> Both locks can be checked statically for their rb_root. For shared
> ownership, the above won't be atomic (as someone else with ref to the
> node may be able to steal it while we unlock lock1 to lock lock2.
> 
> So the main question is, can we allow holding two locks at the same
> time safely, we only need it to do atomic moves for the shared
> ownership case. Then the problem becomes a deadlock avoidance problem.

We can add bpf_spin_lock_double(lock1, lock2)
that would take the locks in address order.
The verifier can enforce that only one bpf_spin_lock or bpf_spin_lock_double
is used at a time.

> The first pass solution might be to make first lock a proper spin_lock
> call, but second attempt a spin_trylock. If it succeeds, you can do
> the operation, but if it fails, it may be due to ABBA deadlock, or
> just contention.
> 
> However, this is not good, because under contention we'll see a lot of
> failed operations. The next best thing we can do is define a lock
> order, to do it generically, we can impose a simple rule for all BPF
> programs:
> Whenever two bpf_spin_locks are taken, they are always taken in the
> order of the pointer address of the lock. I.e. you must lock l1 before
> l2 if &l1 < &l2, and l2 before l1 if &l2 < &l1.
> Then it becomes a problem of enforcing the correct order in the right branch.

right, but that can only be a run-time check done inside the helper
(like bpf_spin_lock_double)

> One catch is that both locks may have the same address, in that case,
> we can use spin_lock_nested with SINGLE_DEPTH_NESTING to allow taking
> the same lock twice (or the user can have an else case).

afaik spin_lock_nested is only special to lockdep.
Not sure how it helps us.

> This ensures that as long as we have two levels of locking supported
> in the verifier, there is never an ABBA deadlock condition across all
> BPF programs on the system.
> 
> ----
> 
> Anyway, this is one approach, which I really like, but I understand
> there might be cases in the future where you need to share locks
> dynamically between structures. For this, I am thinking that we can
> use the same idea we used in bpf_list_owns_node, but make it
> convenient. It is a copy paste of Alexei's refcounted locks idea, but
> with a few small tweaks.
> 
> For this case, you will have:
> 
> struct foo {
>   struct bpf_spin_lock *lock;
>   ...;
> };
> 
> When you construct such kptr, the verifier enforces you "move" a
> reference to an allocated lock (with a refcount to share it among
> multiple structures), and it disallows mutation of this field until
> you enter the single ownership phase of the object (i.e. refcount == 1
> for refcounted ones, or just having ownership of pointer for
> non-refcounted objects). For RCU protected ones, we might have to also
> enforce the RCU grace period. But the thing is, mutation should only
> be permitted once we know no other CPU can take the lock.
> 
> Then, whenever you know two nodes share the same lock, you upgrade the
> unlocked structure using the following pattern to mark it as locked in
> the true branch by the currently held spinlock.
> 
> p1 = foo();
> p2 = foo();
> lock(p1->lock_ptr);
> if (p1->lock_ptr == p2->lock_ptr) // cannot change while we can observe p2 {
>     // Both are locked, do operations protecting structs in both, like
> atomic moves
> }
> unlock(p1->lock_ptr);

yeah. A pointer to a lock creates all this complexity.
We're discussing complex cases, yet we haven't agreed on the path for the basics.

Here is a minimal-viable-rbtree-linklist proposal composed from ideas expressed
in this thread:

- allow special structs 
  bpf_lock,
  bpf_rb_root, bpf_rb_node, 
  bpf_list_head, bpf_list_node
  bpf_refcount
  to be defined inside map value, global data, and local storage.
  (The rbtree is no longer a special map type).
  The verifier enforces that there is only one bpf_lock in the same "allocation"
  with one or more bpf_rb_root and bpf_list_head.
  To have 2 or more locks in the global data the user would have to do
  SEC("my_data_section1") struct bpf_lock lock; // protects rbtree
  SEC("my_data_section1") struct bpf_rb_root rbtree;
  SEC("my_data_section2") struct bpf_lock another_lock; // protects list
  SEC("my_data_section2") struct bpf_list_head list;
  The user would have to put such special structs into named sections,
  so that libbpf doesn't make them mmap-able.
  Only one bpf_lock will be allowed in map value and in local storage.
  It will protect all bpf_rb_root-s and bpf_list_head-s in the same map value/local storage.

- bpf_rb_root and bpf_list_head have to be tagged with a contain(struct_name, field)
  btf_decl_tag to allow the verifier to enforce container_of() type casts
  (a possible spelling of contain() is sketched after the example below).

Example:
  struct rb_elem {
     int var;
     struct bpf_rb_node node;
  };
  struct list_elem {
     int var;
     struct bpf_list_node node;
  };

  struct roots {
     struct bpf_lock lock; // lock protects both root and head
         // any operation on value->root or value->head has to be preceded by bpf_lock(&value->lock)
     struct bpf_rb_root root contain(rb_elem, node);
     struct bpf_list_head head contain(list_elem, node);
  };
  struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 123);
    __type(key, __u32);
    __type(value, struct roots);
  } hash SEC(".maps");
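
  One possible spelling of the contain() annotation above, following the
  btf_decl_tag approach; the tag string format is only illustrative, not
  a settled API:

  /* Illustrative only: contain() expands to a BTF declaration tag that
   * the verifier could parse to learn the element type and its embedded
   * node field. The attribute attaches to the root/head member, so the
   * 'struct roots' declaration above is plain C.
   */
  #define contain(struct_name, field) \
      __attribute__((btf_decl_tag("contains:" #struct_name ":" #field)))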


- allow bpf_obj_alloc(bpf_core_type_id_local(struct list_elem))
  It returns an acquired ptr_to_btf_id to a program-local btf type.
  This is the single ownership case. The ptr_to_btf_id is acquired (in the verifier terminology)
  and has to be kptr_xchg-ed into a map value or bpf_list_add-ed to a linked list.
  Same thing with bpf_obj_alloc(bpf_core_type_id_local(struct rb_elem))

- the verifier ensures a single lock is taken to allow ops on root or head.
  value = bpf_map_lookup_elem(hash, key);
  rb_elem = bpf_obj_alloc(bpf_core_type_id_local(struct rb_elem));
  list_elem = bpf_obj_alloc(bpf_core_type_id_local(struct list_elem));

  bpf_lock(&value->lock);
  bpf_rb_add(&rb_elem->node, &value->root, cb);
  bpf_list_add(&list_elem->node, &value->head);
  bpf_unlock(&value->lock);

  The verifier checks that arg1 type matches btf_decl_tag "contain" of root/head of arg2.
  In this example one lock protects rbtree and list.

- walking a linked list or bpf_rb_find does not return an acquired pointer.
  bpf_rb_erase and bpf_list_del convert the pointer to acquired,
  so it has to be inserted into another list or rbtree, freed with bpf_obj_free,
  or bpf_kptr_xchg-ed.
  bpf_rb_add doesn't fail.
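
  A sketch of the lookup/erase pattern this implies, assuming bpf_rb_find
  returns a pointer to the containing rb_elem and takes the root plus a
  comparator (the signatures and cmp_cb are illustrative):

  value = bpf_map_lookup_elem(hash, key);
  if (!value)
      return 0;

  bpf_lock(&value->lock);
  e = bpf_rb_find(&value->root, key, cmp_cb); /* not acquired, only valid under the lock */
  if (e)
      bpf_rb_erase(&e->node, &value->root);   /* now acquired, outlives the lock */
  bpf_unlock(&value->lock);

  if (e)
      bpf_obj_free(e); /* or bpf_kptr_xchg it, or add it to another tree/list */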

- one element can be a part of an rbtree and a linked list. In such cases it has to contain
  an explicit bpf_refcount field.
  Example:
  struct elem {
     int var;
     struct bpf_rb_node rb_node;
     struct bpf_list_node list_node;
     struct bpf_refcount refcount;
  };

  SEC("data1") struct bpf_lock lock1; // lock1 protects rb_root
  SEC("data1") struct bpf_rb_root rb_root contain(elem, rb_node);
  SEC("data2") struct bpf_lock lock2; // lock2 protects list_head
  SEC("data2") struct bpf_list_head list_head contain(elem, list_node);

  elem = bpf_obj_alloc(bpf_core_type_id_local(struct elem));

  bpf_lock(&lock1);
  if (bpf_rb_add(&elem->rb_node, &rb_root, cb));
  bpf_unlock(&lock1);

  bpf_lock(&lock2);
  if (bpf_list_add(&elem->list_node, &list_head));
  bpf_unlock(&lock2);
  bpf_obj_put(elem);

  bpf_obj_alloc automatically sets refcount to 1.
  bpf_rb_add() and bpf_list_add() automatically increment refcount.
  This is achieved by passing a hidden offset to the refcount into these helpers.
  Explicit operations on bpf_refcount can be allowed in theory,
  but it's simpler to start with implicit operations.
  Eventually an explicit destructor for 'struct elem' would be needed as well.
  For now it will be implicit. The verifier will recursively destroy the object
  if 'struct elem' has other containers like bpf_rb_root or bpf_list_head.

- bpf_rb_add() also checks that the owner field hidden inside bpf_rb_node is still NULL.
  Since the object is shared, multiple CPUs can obtain a pointer to elem and
  attempt to insert it into different linked lists.
  bpf_rb_add() can fail, but the bpf program can ignore the failure.
  It has to call bpf_obj_put() regardless.
  bpf_obj_put will decrement the refcnt and will call bpf_obj_free() when it reaches zero.
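
  A rough sketch of this add/put protocol, with made-up kernel-side names
  (the exact value stored into the hidden owner field gets refined later
  in the thread, but the shape stays the same):

  struct bpf_rb_node_kern {
      struct rb_node rb;
      void *owner;                 /* NULL while not in any tree */
  };

  /* Claim the node for 'root'; fails if another CPU raced us in.
   * The comparator callback and the actual rb_add() done under the
   * lock the verifier checked are elided. */
  static int __bpf_rb_add(struct bpf_rb_node_kern *node, void *root,
                          refcount_t *refcount)
  {
      if (cmpxchg(&node->owner, NULL, root) != NULL)
          return -EBUSY;
      refcount_inc(refcount);
      /* ... rb_add(&node->rb, ...) ... */
      return 0;
  }

  static void __bpf_obj_put(void *obj, refcount_t *refcount)
  {
      if (refcount_dec_and_test(refcount))
          bpf_obj_free(obj);       /* runs the implicit destructor */
  }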

- allow one level of nesting:
  struct elem {
     int var;
     struct bpf_rb_node rb_node;
     struct bpf_list_node list_node;
     struct bpf_refcount refcount;

     struct bpf_lock lock; // lock that protects nested_root
     struct bpf_rb_root nested_root contain(elem2, node);
  };
  If bpf_rb_node and bpf_rb_root are seen in the same struct,
  the contain() tag has to name a struct different from struct elem itself.
  Other circularity checks might be necessary.

Implementation steps:
1. allow single owner for lock, rbtree, link_list
2. add bpf_refcount and shared elems
3. nested

Thoughts?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF Linked Lists discussion
  2022-08-23 20:02         ` Alexei Starovoitov
@ 2022-08-24 18:56           ` Kumar Kartikeya Dwivedi
  2022-08-24 23:53             ` Alexei Starovoitov
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-24 18:56 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Dave Marchevsky, bpf, ast, andrii, daniel

On Tue, 23 Aug 2022 at 22:02, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 10:24:17PM +0200, Kumar Kartikeya Dwivedi wrote:
> > >
> > > > When thinking of heterogeneous values, we should probably look to add
> > > > a way to do generic variant types in BPF, such that e.g. the first
> > > > field is a tag, and then the actual data of that type follows. This
> > > > would allow us to use them not only in intrusive collections, but
> > > > anywhere else, and probably even store them in map values as kptrs.
> > >
> > > Not following this idea. Could you give an example?
> > >
> >
> > Basically a tagged union in C.
> >
> > For list_head to take value A, B, C, we define a struct type that has
> > a enum type field that tells the type, some common fields that can be
> > assumed at the same offset in all types sharing, and then the various
> > remaining members in a union that may differ.
> >
> > Basically, how we do variable type structs in the kernel with a common
> > initial sequence in structs (e.g. sock_common), but with an added type
> > field to be able to implement safe casting at runtime. list_remove
> > then sets btf_id to variant, and it needs to be casted before
> > non-common sequence can be used.
>
> More like class inheritance with run-time type casts ?
> I guess it can be useful one day, but let's put on back burner for now.
> We're still trying to figure out the basic link lists :)
> Advanced hierarchy can wait.
>

Yes, my only point is we should do it generically so that everything
is able to work with such types, not just these containers.

> > > All that makes sense, but consider use case where we need rb_root for every cgroup.
> > > The 'struct bpf_rb_root' will be created dynamically in cgroup local storage.
> > > We can create a lock in the same cgroup local storage as well.
> > > It's all nice and the verifier can do locking checks statically,
> > > but how the program can trasnfer and rb_node from one rb tree to another
> > > in a different cgroup?
> > > Either two locks need to be held and I don't see a way to check that
> > > statically or one bpf_spin_lock should be used across all cgroups.
> > > Thoughts?
> >
> > Thanks for the concrete example, much easier to reason about this :).
> > Here's my thought dump, comments welcome.
> >
> > So it depends on your use case and the type of node (single ownership
> > vs shared ownership).
> >
> > Do you want an atomic move or not?
> > If yes, for the single ownership case, the following works.
> >
> > lock lock1
> > remove_rb_node
> > unlock lock1
> > lock lock2
> > add_rb_node
> > unlock lock2
> >
> > Due to single ownership, nobody can write (reads may be safe e.g.
> > using RCU) to the removed rb_node between the two critical sections.
> > Both locks can be checked statically for their rb_root. For shared
> > ownership, the above won't be atomic (as someone else with ref to the
> > node may be able to steal it while we unlock lock1 to lock lock2.
> >
> > So the main question is, can we allow holding two locks at the same
> > time safely, we only need it to do atomic moves for the shared
> > ownership case. Then the problem becomes a deadlock avoidance problem.
>
> We can add bpf_spin_lock_double(lock1, lock2)
> that would take the lock in address order.
> The verifier can enforce only one bpf_spin_lock or bpf_spin_lock_double
> is used at a time.
>
> > The first pass solution might be to make first lock a proper spin_lock
> > call, but second attempt a spin_trylock. If it succeeds, you can do
> > the operation, but if it fails, it may be due to ABBA deadlock, or
> > just contention.
> >
> > However, this is not good, because under contention we'll see a lot of
> > failed operations. The next best thing we can do is define a lock
> > order, to do it generically, we can impose a simple rule for all BPF
> > programs:
> > Whenever two bpf_spin_locks are taken, they are always taken in the
> > order of the pointer address of the lock. I.e. you must lock l1 before
> > l2 if &l1 < &l2, and l2 before l1 if &l2 < &l1.
> > Then it becomes a problem of enforcing the correct order in the right branch.
>
> right, but that can only be a run-time check done inside the helper
> (like bpf_spin_lock_double)

So this is an idea, not proposing all of this now, and happy to get
more feedback:

I'm thinking of more realistic cases and whether this will cover them
in the future. This comparison idea works fine when locks are in the
same _type_.

Eventually you will have cases with lock hierarchies: when holding two
locks embedded in two _different_ types, you fix an ordering A, B, and
you can then determine that they are never taken in B, A order.

So you have lock 'domains': each domain defines an ordering rule and
comprises a set of types.

When the user doesn't specify anything, there is the default domain
with all prog BTF types, and the rule is lock address based ordering.
When the user defines an ordering between types, those locks can no
longer be taken under the address based rule; they can only be held in
A, B order (or whatever the declared order is). A lock in A cannot be
held together with anything else, and a lock in B cannot be held unless
a lock of type A is held.

Within the same type, address based ordering will still be ok.

This is not exhaustive, but it will cover the majority of the cases.
The good thing is that then you don't need comparisons, and we can extend
this to even more than two locks, as long as the constraints are
specified in BTF.
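
To make this concrete, here is a purely hypothetical way such a
constraint could be declared; neither the "lock_order" tag nor its
format exists, it only illustrates the idea:

/* Locks embedded in cgrp_data (domain A) must always be taken before
 * locks embedded in task_data (domain B); within one domain, address
 * based ordering still applies. */
struct cgrp_data {
        struct bpf_lock lock __attribute__((btf_decl_tag("lock_order:A")));
        /* ... bpf_rb_root etc. protected by this lock ... */
};
struct task_data {
        struct bpf_lock lock __attribute__((btf_decl_tag("lock_order:B,after:A")));
        /* ... bpf_list_head etc. protected by this lock ... */
};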

So I lean more towards letting the verifier understand the
relationship between two lock calls, instead of bpf_spin_lock_double.
That can either come from address comparisons, or it can come from
BTF. bpf_spin_lock_double is ok for 2 levels of locking, but with BTF
we will be able to go beyond that.

But anyway, for now if bpf_spin_lock_double is _not_ UAPI, we can
certainly go with it to reduce effort and think about all this later.

>
> > One catch is that both locks may have the same address, in that case,
> > we can use spin_lock_nested with SINGLE_DEPTH_NESTING to allow taking
> > the same lock twice (or the user can have an else case).
>
> afaik spin_lock_nested is only lockdep special.
> Not sure how it helps us.
>
> > This ensures that as long as we have two levels of locking supported
> > in the verifier, there is never an ABBA deadlock condition across all
> > BPF programs on the system.
> >
> > ----
> >
> > Anyway, this is one approach, which I really like, but I understand
> > there might be cases in the future where you need to share locks
> > dynamically between structures. For this, I am thinking that we can
> > use the same idea we used in bpf_list_owns_node, but make it
> > convenient. It is a copy paste of Alexei's refcounted locks idea, but
> > with a few small tweaks.
> >
> > For this case, you will have:
> >
> > struct foo {
> >   struct bpf_spin_lock *lock;
> >   ...;
> > };
> >
> > When you construct such kptr, the verifier enforces you "move" a
> > reference to an allocated lock (with a refcount to share it among
> > multiple structures), and it disallows mutation of this field until
> > you enter the single ownership phase of the object (i.e. refcount == 1
> > for refcounted ones, or just having ownership of pointer for
> > non-refcounted objects). For RCU protected ones, we might have to also
> > enforce the RCU grace period. But the thing is, mutation should only
> > be permitted once we know no other CPU can take the lock.
> >
> > Then, whenever you know two nodes share the same lock, you upgrade the
> > unlocked structure using the following pattern to mark it as locked in
> > the true branch by the currently held spinlock.
> >
> > p1 = foo();
> > p2 = foo();
> > lock(p1->lock_ptr);
> > if (p1->lock_ptr == p2->lock_ptr) // cannot change while we can observe p2 {
> >     // Both are locked, do operations protecting structs in both, like
> > atomic moves
> > }
> > unlock(p1->lock_ptr);
>
> yeah. A pointer to a lock creates all this complexity.
> We're discussing complex cases yet haven't agreed on the path for the basics.
>
> Here is minimal-viable-rbtree-linklist proposal composed from ideas expressed
> in this thread:
>
> - allow special structs
>   bpf_lock,
>   bpf_rb_root, bpf_rb_node,
>   bpf_list_head, bpf_list_node
>   bpf_refcount

What will be the stability story around all this? So we want to use
kfuncs and keep this 'experimental' for now, but when you add such
structs to a map value, you need a fixed size.

So we'll just reject the program load if an internal struct size changes
and the program uses the old size in a map value or allocated object, right?

>   to be defined inside map value, global data, and local storage.
>   (The rbtree is no longer special map type).
>   The verifier enforces that there is only one bpf_lock in the same "allocation"
>   with one or more bpf_rb_root and bpf_list_head.
>   To have 2 or more locks in the global data the user would have to do
>   SEC("my_data_section1") struct bpf_lock lock; // protects rbtree
>   SEC("my_data_section1") struct bpf_rb_root rbtree;
>   SEC("my_data_section2") struct bpf_lock another_lock; // protects list
>   SEC("my_data_section2") struct bpf_list_head list;
>   The user would have to put such special structs into named sections,
>   so that libbpf doesn't make them mmap-able.
>   Only one bpf_lock will be allowed in map value and in local storage.
>   It will protect all bpf_rb_root-s and bpf_list_head-s in the same map value/local storage.
>
> - bpf_rb_root and bpf_list_heade have to be tagged with contain(struct_name,field)
>   btf_decl_tag to allow the verifier enforcing contain_of() type casts.
>
> Example:
>   struct rb_elem {
>      int var;
>      struct bpf_rb_node node;
>   };
>   struct list_elem {
>      int var;
>      struct bpf_list_node node;
>   };
>
>   struct roots {
>      struct bpf_lock lock; // lock protects both root and head
>          // any operation on value->root or value->head has to precede with bpf_lock(value->lock)
>      struct bpf_rb_root root contain(rb_elem, node);
>      struct bpf_list_head head contain(list_elem, node);
>   };
>   struct {
>     __uint(type, BPF_MAP_TYPE_HASH);
>     __uint(max_entries, 123);
>     __type(key, __u32);
>     __type(value, struct roots);
>   } hash SEC(".maps");
>
>
> - allow bpf_obj_alloc(bpf_core_type_id_local(struct list_elem))
>   It returns acquired ptr_to_btf_id to program local btf type.
>   This is single owner ship case. The ptr_to_btf_id is acquired (in the verifier terminology)
>   and has to be kptr_xchg-ed into a map value or bpf_list_add-ed to a link list.
>   Same thing with bpf_obj_alloc(bpf_core_type_id_local(struct rb_elem))
>
> - the verifier ensures single lock is taken to allows ops on root or head.
>   value = bpf_map_lookup_elem(hash, key);
>   rb_elem = bpf_obj_alloc(bpf_core_type_id_local(struct rb_elem));
>   list_elem = bpf_obj_alloc(bpf_core_type_id_local(struct list_elem));
>
>   bpf_lock(&value->lock);
>   bpf_rb_add(&rb_elem->node, &value->root, cb);
>   bpf_list_add(&list_elem->node, &value->head);
>   bpf_unlock(&value->lock);
>
>   The verifier checks that arg1 type matches btf_decl_tag "contain" of root/head of arg2.
>   In this example one lock protects rbtree and list.
>
> - walking link list or bpf_rb_find do not return acuquired pointer.
>   bpf_rb_erase and bpf_list_del convert the pointer to acquired,
>   so it has to be inserted into another list or rbtree, freed with bpf_obj_free,
>   or bpf_kptr_xchg-ed.
>   bpf_rb_add doesn't fail.
>
> - one element can be a part of an rbtree and a link list. In such cases it has to contain
>   an explicit bpf_refcount field.
>   Example:
>   struct elem {
>      int var;
>      struct bpf_rb_node rb_node;
>      struct bpf_list_node list_node;
>      struct bpf_refcount refcount;
>   };
>
>   SEC("data1") struct bpf_lock lock1; // lock1 protects rb_root
>   SEC("data1") struct bpf_rb_root rb_root contain(elem, rb_node);
>   SEC("data2") struct bpf_lock lock2; // lock2 protects list_head
>   SEC("data2") struct bpf_list_head list_head contain(elem, list_node);
>
>   elem = bpf_obj_alloc(bpf_core_type_id_local(struct elem));
>
>   bpf_lock(&lock1);
>   if (bpf_rb_add(&elem->rb_node, &rb_root, cb));
>   bpf_unlock(&lock1);
>
>   bpf_lock(&lock2);
>   if (bpf_list_add(&elem->list_node, &list_head));
>   bpf_unlock(&lock2);
>   bpf_obj_put(elem);
>
>   bpf_obj_alloc automatically sets refcount to 1.
>   bpf_rb_add() and bpf_list_add() automatically increment refcount.

Weak agree; I still prefer leaving object initialization up to the
user, but let's discuss these details when the patch is posted.

>   This is achieved by passing hidden offset to refcount into these helpers.
>   Explicit operations on bpf_refcount can be allowed in theory,
>   but it's simpler to start with implicit operations.
>   Eventually explicit destructor for 'struct elem' would be needed as well.
>   For now it will be implicit. The verifier will recursively destroy the object
>   if 'struct elem' has other containers like bpf_rb_root or bpf_list_head.
>
> - bpf_rb_add() also checks that owner field hidden inside bpf_rb_node is still NULL.

Just checking that it is NULL is not enough if references have been leaked;
it will be racy. You need a cmpxchg to some pointer value that can
never be a valid one.

>   Since the object is shared the multiple CPUs can obtain a pointer to elem and
>   attemp to insert it into different link lists.
>   bpf_rb_add() can fail, but bpf program can ignore the failure.

Agree in principle, but I still prefer that the user checks whether it can add
and only allow bpf_rb_add in that branch, i.e. void bpf_rb_add vs int
bpf_rb_add. But this is a detail (check inside vs outside the helper),
and we can discuss pros/cons during code review.

>   It has to call bpf_obj_put() regardless.
>   bpf_obj_put will decrement refcnt and will call bpf_obj_free() when it reaches zero.
>
> - allow one level of nesting:
>   struct elem {
>      int var;
>      struct bpf_rb_node rb_node;
>      struct bpf_list_node list_node;
>      struct bpf_refcount refcount;
>
>      struct bpf_lock lock; // lock that protects nested_root
>      struct bpf_rb_root nested_root contain(elem2, node);
>   };
>   if bpf_rb_node and bpf_rb_root are seen in the same struct
>   the contain() tag has to be different than the struct elem.
>   Other circular checks might be necessary.
>
> Implementation steps:
> 1. allow single owner for lock, rbtree, link_list
> 2. add bpf_refcount and shared elems
> 3. nested
>
> Thoughts?

LGTM! Agree with everything, just some comments that I left above.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF Linked Lists discussion
  2022-08-24 18:56           ` Kumar Kartikeya Dwivedi
@ 2022-08-24 23:53             ` Alexei Starovoitov
  2022-08-25  1:02               ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2022-08-24 23:53 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: Dave Marchevsky, bpf, ast, andrii, daniel

On Wed, Aug 24, 2022 at 08:56:51PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Tue, 23 Aug 2022 at 22:02, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Aug 19, 2022 at 10:24:17PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > >
> > > > > When thinking of heterogeneous values, we should probably look to add
> > > > > a way to do generic variant types in BPF, such that e.g. the first
> > > > > field is a tag, and then the actual data of that type follows. This
> > > > > would allow us to use them not only in intrusive collections, but
> > > > > anywhere else, and probably even store them in map values as kptrs.
> > > >
> > > > Not following this idea. Could you give an example?
> > > >
> > >
> > > Basically a tagged union in C.
> > >
> > > For list_head to take value A, B, C, we define a struct type that has
> > > a enum type field that tells the type, some common fields that can be
> > > assumed at the same offset in all types sharing, and then the various
> > > remaining members in a union that may differ.
> > >
> > > Basically, how we do variable type structs in the kernel with a common
> > > initial sequence in structs (e.g. sock_common), but with an added type
> > > field to be able to implement safe casting at runtime. list_remove
> > > then sets btf_id to variant, and it needs to be casted before
> > > non-common sequence can be used.
> >
> > More like class inheritance with run-time type casts ?
> > I guess it can be useful one day, but let's put on back burner for now.
> > We're still trying to figure out the basic link lists :)
> > Advanced hierarchy can wait.
> >
> 
> Yes, my only point is we should do it generically so that everything
> is able to work with such types, not just these containers.

'everything to work with such types'? Not following.
Do we need to agree on it right now?
I'm trying to contain this thread to things we have to agree on and implement
as the first step.

> > > > All that makes sense, but consider use case where we need rb_root for every cgroup.
> > > > The 'struct bpf_rb_root' will be created dynamically in cgroup local storage.
> > > > We can create a lock in the same cgroup local storage as well.
> > > > It's all nice and the verifier can do locking checks statically,
> > > > but how the program can trasnfer and rb_node from one rb tree to another
> > > > in a different cgroup?
> > > > Either two locks need to be held and I don't see a way to check that
> > > > statically or one bpf_spin_lock should be used across all cgroups.
> > > > Thoughts?
> > >
> > > Thanks for the concrete example, much easier to reason about this :).
> > > Here's my thought dump, comments welcome.
> > >
> > > So it depends on your use case and the type of node (single ownership
> > > vs shared ownership).
> > >
> > > Do you want an atomic move or not?
> > > If yes, for the single ownership case, the following works.
> > >
> > > lock lock1
> > > remove_rb_node
> > > unlock lock1
> > > lock lock2
> > > add_rb_node
> > > unlock lock2
> > >
> > > Due to single ownership, nobody can write (reads may be safe e.g.
> > > using RCU) to the removed rb_node between the two critical sections.
> > > Both locks can be checked statically for their rb_root. For shared
> > > ownership, the above won't be atomic (as someone else with ref to the
> > > node may be able to steal it while we unlock lock1 to lock lock2.
> > >
> > > So the main question is, can we allow holding two locks at the same
> > > time safely, we only need it to do atomic moves for the shared
> > > ownership case. Then the problem becomes a deadlock avoidance problem.
> >
> > We can add bpf_spin_lock_double(lock1, lock2)
> > that would take the lock in address order.
> > The verifier can enforce only one bpf_spin_lock or bpf_spin_lock_double
> > is used at a time.
> >
> > > The first pass solution might be to make first lock a proper spin_lock
> > > call, but second attempt a spin_trylock. If it succeeds, you can do
> > > the operation, but if it fails, it may be due to ABBA deadlock, or
> > > just contention.
> > >
> > > However, this is not good, because under contention we'll see a lot of
> > > failed operations. The next best thing we can do is define a lock
> > > order, to do it generically, we can impose a simple rule for all BPF
> > > programs:
> > > Whenever two bpf_spin_locks are taken, they are always taken in the
> > > order of the pointer address of the lock. I.e. you must lock l1 before
> > > l2 if &l1 < &l2, and l2 before l1 if &l2 < &l1.
> > > Then it becomes a problem of enforcing the correct order in the right branch.
> >
> > right, but that can only be a run-time check done inside the helper
> > (like bpf_spin_lock_double)
> 
> So this is an idea, not proposing all of this now, and happy to get
> more feedback:
> 
> I'm thinking of more realistic cases and whether this will cover them
> in the future. This comparison idea works fine when locks are in the
> same _type_.
> 
> Eventually you will have cases where you have lock hierarchies: When
> holding two locks in two _different_ types, you have the ordering A,
> B. When holding the locks of different _types_, you can determine that
> you never go and take them in B, A ordering.
> 
> So you have lock 'domains', each domain defines an ordering rule and
> comprises a set of types.
> 
> When the user doesn't specify anything, there is the default domain
> with all prog BTF types, and the rule is lock address based ordering.
> When the user defines an ordering between types, they cannot be locked
> now with the address based rule, they can only be held in A, B order
> (or whatever it is). Lock in A cannot be held with anything else, lock
> in B cannot be held unless lock of type A is held.
> 
> Within the same type, address based ordering will still be ok.

Not following at all.
What is the 'type' of a lock? spin_lock vs mutex?
Why bother distinguishing?

> This is not exhaustive, but it will cover the majority of the cases.
> The good thing is then you don't need comparisons, and we can extend
> this to even more than two locks, as long as the constraints are
> specified in BTF.
> 
> So I lean more towards letting the verifier understand the
> relationship between two lock calls, instead of bpf_spin_lock_double.
> That can either come from address comparisons, or it can come from
> BTF. bpf_spin_lock_double is ok for 2 levels of locking, but with BTF
> we will be able to go beyond that.

It doesn't look like the address of the lock will stay known
to the verifier.
The name or whatever other attribute won't stay static either.
We will have to support allocated (fully dynamic) locks.
So run-time checks of locks will be inevitable
(whether it's just the address of the lock or other properties like the owner field).
We can cover some useful part of bpf progs with static checks, and that's the
first milestone, but after that we'll be looking hard at the trade-off of
increasing verifier complexity vs doing checks at run-time.

> But anyway, for now if bpf_spin_lock_double is _not_ UAPI, we can
> certainly go with it to reduce effort and think about all this later.

right.

> >
> > > One catch is that both locks may have the same address, in that case,
> > > we can use spin_lock_nested with SINGLE_DEPTH_NESTING to allow taking
> > > the same lock twice (or the user can have an else case).
> >
> > afaik spin_lock_nested is only lockdep special.
> > Not sure how it helps us.
> >
> > > This ensures that as long as we have two levels of locking supported
> > > in the verifier, there is never an ABBA deadlock condition across all
> > > BPF programs on the system.
> > >
> > > ----
> > >
> > > Anyway, this is one approach, which I really like, but I understand
> > > there might be cases in the future where you need to share locks
> > > dynamically between structures. For this, I am thinking that we can
> > > use the same idea we used in bpf_list_owns_node, but make it
> > > convenient. It is a copy paste of Alexei's refcounted locks idea, but
> > > with a few small tweaks.
> > >
> > > For this case, you will have:
> > >
> > > struct foo {
> > >   struct bpf_spin_lock *lock;
> > >   ...;
> > > };
> > >
> > > When you construct such kptr, the verifier enforces you "move" a
> > > reference to an allocated lock (with a refcount to share it among
> > > multiple structures), and it disallows mutation of this field until
> > > you enter the single ownership phase of the object (i.e. refcount == 1
> > > for refcounted ones, or just having ownership of pointer for
> > > non-refcounted objects). For RCU protected ones, we might have to also
> > > enforce the RCU grace period. But the thing is, mutation should only
> > > be permitted once we know no other CPU can take the lock.
> > >
> > > Then, whenever you know two nodes share the same lock, you upgrade the
> > > unlocked structure using the following pattern to mark it as locked in
> > > the true branch by the currently held spinlock.
> > >
> > > p1 = foo();
> > > p2 = foo();
> > > lock(p1->lock_ptr);
> > > if (p1->lock_ptr == p2->lock_ptr) // cannot change while we can observe p2 {
> > >     // Both are locked, do operations protecting structs in both, like
> > > atomic moves
> > > }
> > > unlock(p1->lock_ptr);
> >
> > yeah. A pointer to a lock creates all this complexity.
> > We're discussing complex cases yet haven't agreed on the path for the basics.
> >
> > Here is minimal-viable-rbtree-linklist proposal composed from ideas expressed
> > in this thread:
> >
> > - allow special structs
> >   bpf_lock,
> >   bpf_rb_root, bpf_rb_node,
> >   bpf_list_head, bpf_list_node
> >   bpf_refcount
> 
> What will be the stability story around all this? So we want to use
> kfuncs and keep this 'experimental' for now, but when you add such
> structs to map value, you need a fixed size.
> 
> So we'll just reject the program load if internal struct size changes
> and program uses the old size in map value or allocated object, right?

Good question.
We baked bpf_spin_lock into uapi.
The structs above I would put into a header file in selftests/bpf only.
At prog load the verifier would check whether sizeof(prog's type) == sizeof(kernel's type).

> >   to be defined inside map value, global data, and local storage.
> >   (The rbtree is no longer special map type).
> >   The verifier enforces that there is only one bpf_lock in the same "allocation"
> >   with one or more bpf_rb_root and bpf_list_head.
> >   To have 2 or more locks in the global data the user would have to do
> >   SEC("my_data_section1") struct bpf_lock lock; // protects rbtree
> >   SEC("my_data_section1") struct bpf_rb_root rbtree;
> >   SEC("my_data_section2") struct bpf_lock another_lock; // protects list
> >   SEC("my_data_section2") struct bpf_list_head list;
> >   The user would have to put such special structs into named sections,
> >   so that libbpf doesn't make them mmap-able.
> >   Only one bpf_lock will be allowed in map value and in local storage.
> >   It will protect all bpf_rb_root-s and bpf_list_head-s in the same map value/local storage.
> >
> > - bpf_rb_root and bpf_list_heade have to be tagged with contain(struct_name,field)
> >   btf_decl_tag to allow the verifier enforcing contain_of() type casts.
> >
> > Example:
> >   struct rb_elem {
> >      int var;
> >      struct bpf_rb_node node;
> >   };
> >   struct list_elem {
> >      int var;
> >      struct bpf_list_node node;
> >   };
> >
> >   struct roots {
> >      struct bpf_lock lock; // lock protects both root and head
> >          // any operation on value->root or value->head has to precede with bpf_lock(value->lock)
> >      struct bpf_rb_root root contain(rb_elem, node);
> >      struct bpf_list_head head contain(list_elem, node);
> >   };
> >   struct {
> >     __uint(type, BPF_MAP_TYPE_HASH);
> >     __uint(max_entries, 123);
> >     __type(key, __u32);
> >     __type(value, struct roots);
> >   } hash SEC(".maps");
> >
> >
> > - allow bpf_obj_alloc(bpf_core_type_id_local(struct list_elem))
> >   It returns acquired ptr_to_btf_id to program local btf type.
> >   This is single owner ship case. The ptr_to_btf_id is acquired (in the verifier terminology)
> >   and has to be kptr_xchg-ed into a map value or bpf_list_add-ed to a link list.
> >   Same thing with bpf_obj_alloc(bpf_core_type_id_local(struct rb_elem))
> >
> > - the verifier ensures single lock is taken to allows ops on root or head.
> >   value = bpf_map_lookup_elem(hash, key);
> >   rb_elem = bpf_obj_alloc(bpf_core_type_id_local(struct rb_elem));
> >   list_elem = bpf_obj_alloc(bpf_core_type_id_local(struct list_elem));
> >
> >   bpf_lock(&value->lock);
> >   bpf_rb_add(&rb_elem->node, &value->root, cb);
> >   bpf_list_add(&list_elem->node, &value->head);
> >   bpf_unlock(&value->lock);
> >
> >   The verifier checks that arg1 type matches btf_decl_tag "contain" of root/head of arg2.
> >   In this example one lock protects rbtree and list.
> >
> > - walking link list or bpf_rb_find do not return acuquired pointer.
> >   bpf_rb_erase and bpf_list_del convert the pointer to acquired,
> >   so it has to be inserted into another list or rbtree, freed with bpf_obj_free,
> >   or bpf_kptr_xchg-ed.
> >   bpf_rb_add doesn't fail.
> >
> > - one element can be a part of an rbtree and a link list. In such cases it has to contain
> >   an explicit bpf_refcount field.
> >   Example:
> >   struct elem {
> >      int var;
> >      struct bpf_rb_node rb_node;
> >      struct bpf_list_node list_node;
> >      struct bpf_refcount refcount;
> >   };
> >
> >   SEC("data1") struct bpf_lock lock1; // lock1 protects rb_root
> >   SEC("data1") struct bpf_rb_root rb_root contain(elem, rb_node);
> >   SEC("data2") struct bpf_lock lock2; // lock2 protects list_head
> >   SEC("data2") struct bpf_list_head list_head contain(elem, list_node);
> >
> >   elem = bpf_obj_alloc(bpf_core_type_id_local(struct elem));
> >
> >   bpf_lock(&lock1);
> >   if (bpf_rb_add(&elem->rb_node, &rb_root, cb));
> >   bpf_unlock(&lock1);
> >
> >   bpf_lock(&lock2);
> >   if (bpf_list_add(&elem->list_node, &list_head));
> >   bpf_unlock(&lock2);
> >   bpf_obj_put(elem);
> >
> >   bpf_obj_alloc automatically sets refcount to 1.
> >   bpf_rb_add() and bpf_list_add() automatically increment refcount.
> 
> Weak agree, I still prefer leaving object initialization up to the
> user, but let's discuss these details when the patch is posted.
> 
> >   This is achieved by passing hidden offset to refcount into these helpers.
> >   Explicit operations on bpf_refcount can be allowed in theory,
> >   but it's simpler to start with implicit operations.
> >   Eventually explicit destructor for 'struct elem' would be needed as well.
> >   For now it will be implicit. The verifier will recursively destroy the object
> >   if 'struct elem' has other containers like bpf_rb_root or bpf_list_head.
> >
> > - bpf_rb_add() also checks that owner field hidden inside bpf_rb_node is still NULL.
> 
> Just checking to be NULL is not enough if references have been leaked,
> it will be racy, you need cmpxchg to some random pointer that can
> never be a valid one.

Of course. It needs to be a cmpxchg.

> 
> >   Since the object is shared the multiple CPUs can obtain a pointer to elem and
> >   attemp to insert it into different link lists.
> >   bpf_rb_add() can fail, but bpf program can ignore the failure.
> 
> Agree in principle, but still prefer that user checks if it can add
> and only allowing bpf_rb_add in that branch, so void bpf_rb_add vs int
> bpf_rb_add, but this is a detail (check inside vs outside the helper),
> and we can discuss pros/cons during code review.

I've considered it, but it only adds run-time overhead and extra UX pain. Consider:
if (bpf_can_rb_add(node, root))
  bpf_rb_add(node, root, cb);

It's non-trivial for the verifier to enforce that 'root' is the same
in both calls, especially since the 'root' arg in the 2nd call is unnecessary:
it was cmpxchg-ed and stored in the first call.
So the user should write:
if (bpf_can_rb_add(node, root))
  bpf_rb_add(node, cb);

So bpf_can_rb_add did: node->owner = root; refcount_inc(node->refcnt);
the only thing it didn't do is the rb_add itself.
bpf_rb_add() is at least doing some work, but for bpf_list_add() all that is left
to do is a pointer assignment.
And now this code spawns two function calls.
And the above code has hidden logic that 'root' was remembered earlier.
These side effects make reading such code unpleasant.

But turning the same argument around... and thinking about can_add differently...

Single owner case:
bpf_lock, bpf_list_add, bpf_unlock -> cmpxchg, stores, cmpxchg

Shared owner case:
bpf_lock, bpf_list_add, bpf_unlock -> cmpxchg, cmpxchg, atomic_inc, stores, cmpxchg.

While discussing this with Tejun and Dave, the idea came up to split the association
of the node with a tree from the actual addition.
Something like:
obj = bpf_obj_alloc();
bpf_obj_assoc(&obj->rb_node, &rb_root);
bpf_spin_lock();
bpf_rbtree_add(&obj->rb_node, cb);
bpf_spin_unlock();

The assumption is that nodes will belong to the same tree for much longer
than a single add/erase takes. This will move the cmpxchg + atomic_inc out of the add/erase path.

But now the verifier has to make sure that the correct root is locked at the time
of bpf_rbtree_add... if the above example is in one function it's doable, but consider:
bpf_prog1()
{
  obj = bpf_obj_alloc();
  bpf_obj_assoc(&obj->rb_node, &rb_root);
  kptr_xchg(obj, map_value); // stash it for future use
}
bpf_prog2()
{
  obj = kptr_xchg(map_value); // get it from local storage or else
  bpf_spin_lock();
  bpf_rbtree_add(&obj->rb_node, cb);
  bpf_spin_unlock();
}

Not only does the btf_type_id need to be known in kptr_xchg, but also the specific root,
to be able to check that the root and the lock are from the same allocation.

So bpf_can_rb_add and bpf_obj_assoc are not quite dead ends yet, but
I propose to implement bpf_list_add() the way I drafted earlier as the first step.
The rest can be an optimization later.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF Linked Lists discussion
  2022-08-24 23:53             ` Alexei Starovoitov
@ 2022-08-25  1:02               ` Kumar Kartikeya Dwivedi
  2022-08-25  1:40                 ` Alexei Starovoitov
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-25  1:02 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Dave Marchevsky, bpf, ast, andrii, daniel

On Thu, 25 Aug 2022 at 01:53, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 24, 2022 at 08:56:51PM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 23 Aug 2022 at 22:02, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Aug 19, 2022 at 10:24:17PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > >
> > > > > > When thinking of heterogeneous values, we should probably look to add
> > > > > > a way to do generic variant types in BPF, such that e.g. the first
> > > > > > field is a tag, and then the actual data of that type follows. This
> > > > > > would allow us to use them not only in intrusive collections, but
> > > > > > anywhere else, and probably even store them in map values as kptrs.
> > > > >
> > > > > Not following this idea. Could you give an example?
> > > > >
> > > >
> > > > Basically a tagged union in C.
> > > >
> > > > For list_head to take value A, B, C, we define a struct type that has
> > > > a enum type field that tells the type, some common fields that can be
> > > > assumed at the same offset in all types sharing, and then the various
> > > > remaining members in a union that may differ.
> > > >
> > > > Basically, how we do variable type structs in the kernel with a common
> > > > initial sequence in structs (e.g. sock_common), but with an added type
> > > > field to be able to implement safe casting at runtime. list_remove
> > > > then sets btf_id to variant, and it needs to be casted before
> > > > non-common sequence can be used.
> > >
> > > More like class inheritance with run-time type casts ?
> > > I guess it can be useful one day, but let's put on back burner for now.
> > > We're still trying to figure out the basic link lists :)
> > > Advanced hierarchy can wait.
> > >
> >
> > Yes, my only point is we should do it generically so that everything
> > is able to work with such types, not just these containers.
>
> 'everything to work with such types'? Not following.
> Do we need to agree on it right now?

Nope.

> I'm trying to contain this thread to things we have to agree and implement
> as the first step.
>
> > > > > All that makes sense, but consider use case where we need rb_root for every cgroup.
> > > > > The 'struct bpf_rb_root' will be created dynamically in cgroup local storage.
> > > > > We can create a lock in the same cgroup local storage as well.
> > > > > It's all nice and the verifier can do locking checks statically,
> > > > > but how the program can trasnfer and rb_node from one rb tree to another
> > > > > in a different cgroup?
> > > > > Either two locks need to be held and I don't see a way to check that
> > > > > statically or one bpf_spin_lock should be used across all cgroups.
> > > > > Thoughts?
> > > >
> > > > Thanks for the concrete example, much easier to reason about this :).
> > > > Here's my thought dump, comments welcome.
> > > >
> > > > So it depends on your use case and the type of node (single ownership
> > > > vs shared ownership).
> > > >
> > > > Do you want an atomic move or not?
> > > > If yes, for the single ownership case, the following works.
> > > >
> > > > lock lock1
> > > > remove_rb_node
> > > > unlock lock1
> > > > lock lock2
> > > > add_rb_node
> > > > unlock lock2
> > > >
> > > > Due to single ownership, nobody can write (reads may be safe e.g.
> > > > using RCU) to the removed rb_node between the two critical sections.
> > > > Both locks can be checked statically for their rb_root. For shared
> > > > ownership, the above won't be atomic (as someone else with ref to the
> > > > node may be able to steal it while we unlock lock1 to lock lock2.
> > > >
> > > > So the main question is, can we allow holding two locks at the same
> > > > time safely, we only need it to do atomic moves for the shared
> > > > ownership case. Then the problem becomes a deadlock avoidance problem.
> > >
> > > We can add bpf_spin_lock_double(lock1, lock2)
> > > that would take the lock in address order.
> > > The verifier can enforce only one bpf_spin_lock or bpf_spin_lock_double
> > > is used at a time.
> > >
> > > > The first pass solution might be to make first lock a proper spin_lock
> > > > call, but second attempt a spin_trylock. If it succeeds, you can do
> > > > the operation, but if it fails, it may be due to ABBA deadlock, or
> > > > just contention.
> > > >
> > > > However, this is not good, because under contention we'll see a lot of
> > > > failed operations. The next best thing we can do is define a lock
> > > > order, to do it generically, we can impose a simple rule for all BPF
> > > > programs:
> > > > Whenever two bpf_spin_locks are taken, they are always taken in the
> > > > order of the pointer address of the lock. I.e. you must lock l1 before
> > > > l2 if &l1 < &l2, and l2 before l1 if &l2 < &l1.
> > > > Then it becomes a problem of enforcing the correct order in the right branch.
> > >
> > > right, but that can only be a run-time check done inside the helper
> > > (like bpf_spin_lock_double)
> >
> > So this is an idea, not proposing all of this now, and happy to get
> > more feedback:
> >
> > I'm thinking of more realistic cases and whether this will cover them
> > in the future. This comparison idea works fine when locks are in the
> > same _type_.
> >
> > Eventually you will have cases where you have lock hierarchies: When
> > holding two locks in two _different_ types, you have the ordering A,
> > B. When holding the locks of different _types_, you can determine that
> > you never go and take them in B, A ordering.
> >
> > So you have lock 'domains', each domain defines an ordering rule and
> > comprises a set of types.
> >
> > When the user doesn't specify anything, there is the default domain
> > with all prog BTF types, and the rule is lock address based ordering.
> > When the user defines an ordering between types, they cannot be locked
> > now with the address based rule, they can only be held in A, B order
> > (or whatever it is). Lock in A cannot be held with anything else, lock
> > in B cannot be held unless lock of type A is held.
> >
> > Within the same type, address based ordering will still be ok.
>
> Not following at all.
> What is the 'type' of lock ? spin_lock vs mutex ?

Lock within a type. Lock inside type A, Lock inside type B.

> Why bother distinguishing?
>
> > This is not exhaustive, but it will cover the majority of the cases.
> > The good thing is then you don't need comparisons, and we can extend
> > this to even more than two locks, as long as the constraints are
> > specified in BTF.
> >
> > So I lean more towards letting the verifier understand the
> > relationship between two lock calls, instead of bpf_spin_lock_double.
> > That can either come from address comparisons, or it can come from
> > BTF. bpf_spin_lock_double is ok for 2 levels of locking, but with BTF
> > we will be able to go beyond that.
>
> It doesn't look like that the address of the lock will stay known
> to the verifier.

But when you lock it, you need to pass the bpf_spin_lock pointer to
the call, right? We only need to determine the ordering between the lock
pointers passed to two spin lock calls; the address doesn't have to be
known to the verifier. It is the same thing bpf_spin_lock_double does, and we
can do inside the program whatever it can do:

if (l1 < l2) {
        lock(l1); lock(l2);
} else {
        lock(l2); lock(l1);
}

> The name or whatever other attribute won't stay static either.
> We will have to support allocated (fully dynamic) locks.
> So run-time checks of locks will be inevitable
> (whether it's just an address of the lock or other properties like owner field)
> We can cover some useful part of bpf progs with static checks and that's the
> first milestone, but after that we'll be looking hard at trade-off of
> increasing verifier complexity vs doing checks at run-time.
>
> > But anyway, for now if bpf_spin_lock_double is _not_ UAPI, we can
> > certainly go with it to reduce effort and think about all this later.
>
> right.
>
> > >
> > > > One catch is that both locks may have the same address, in that case,
> > > > we can use spin_lock_nested with SINGLE_DEPTH_NESTING to allow taking
> > > > the same lock twice (or the user can have an else case).
> > >
> > > afaik spin_lock_nested is only lockdep special.
> > > Not sure how it helps us.
> > >
> > > > This ensures that as long as we have two levels of locking supported
> > > > in the verifier, there is never an ABBA deadlock condition across all
> > > > BPF programs on the system.
> > > >
> > > > ----
> > > >
> > > > Anyway, this is one approach, which I really like, but I understand
> > > > there might be cases in the future where you need to share locks
> > > > dynamically between structures. For this, I am thinking that we can
> > > > use the same idea we used in bpf_list_owns_node, but make it
> > > > convenient. It is a copy paste of Alexei's refcounted locks idea, but
> > > > with a few small tweaks.
> > > >
> > > > For this case, you will have:
> > > >
> > > > struct foo {
> > > >   struct bpf_spin_lock *lock;
> > > >   ...;
> > > > };
> > > >
> > > > When you construct such kptr, the verifier enforces you "move" a
> > > > reference to an allocated lock (with a refcount to share it among
> > > > multiple structures), and it disallows mutation of this field until
> > > > you enter the single ownership phase of the object (i.e. refcount == 1
> > > > for refcounted ones, or just having ownership of pointer for
> > > > non-refcounted objects). For RCU protected ones, we might have to also
> > > > enforce the RCU grace period. But the thing is, mutation should only
> > > > be permitted once we know no other CPU can take the lock.
> > > >
> > > > Then, whenever you know two nodes share the same lock, you upgrade the
> > > > unlocked structure using the following pattern to mark it as locked in
> > > > the true branch by the currently held spinlock.
> > > >
> > > > p1 = foo();
> > > > p2 = foo();
> > > > lock(p1->lock_ptr);
> > > > if (p1->lock_ptr == p2->lock_ptr) // cannot change while we can observe p2 {
> > > >     // Both are locked, do operations protecting structs in both, like
> > > > atomic moves
> > > > }
> > > > unlock(p1->lock_ptr);
> > >
> > > yeah. A pointer to a lock creates all this complexity.
> > > We're discussing complex cases yet haven't agreed on the path for the basics.
> > >
> > > Here is minimal-viable-rbtree-linklist proposal composed from ideas expressed
> > > in this thread:
> > >
> > > - allow special structs
> > >   bpf_lock,
> > >   bpf_rb_root, bpf_rb_node,
> > >   bpf_list_head, bpf_list_node
> > >   bpf_refcount
> >
> > What will be the stability story around all this? So we want to use
> > kfuncs and keep this 'experimental' for now, but when you add such
> > structs to map value, you need a fixed size.
> >
> > So we'll just reject the program load if internal struct size changes
> > and program uses the old size in map value or allocated object, right?
>
> Good question.
> We baked bpf_spin_lock into uapi.
> Above structs I would put into a header file in selftests/bpf only.
> At prog load the verifier would check whether sizeof(prog's type) == sizeof(kernel).
>
> > >   to be defined inside map value, global data, and local storage.
> > >   (The rbtree is no longer special map type).
> > >   The verifier enforces that there is only one bpf_lock in the same "allocation"
> > >   with one or more bpf_rb_root and bpf_list_head.
> > >   To have 2 or more locks in the global data the user would have to do
> > >   SEC("my_data_section1") struct bpf_lock lock; // protects rbtree
> > >   SEC("my_data_section1") struct bpf_rb_root rbtree;
> > >   SEC("my_data_section2") struct bpf_lock another_lock; // protects list
> > >   SEC("my_data_section2") struct bpf_list_head list;
> > >   The user would have to put such special structs into named sections,
> > >   so that libbpf doesn't make them mmap-able.
> > >   Only one bpf_lock will be allowed in map value and in local storage.
> > >   It will protect all bpf_rb_root-s and bpf_list_head-s in the same map value/local storage.
> > >
> > > - bpf_rb_root and bpf_list_heade have to be tagged with contain(struct_name,field)
> > >   btf_decl_tag to allow the verifier enforcing contain_of() type casts.
> > >
> > > Example:
> > >   struct rb_elem {
> > >      int var;
> > >      struct bpf_rb_node node;
> > >   };
> > >   struct list_elem {
> > >      int var;
> > >      struct bpf_list_node node;
> > >   };
> > >
> > >   struct roots {
> > >      struct bpf_lock lock; // lock protects both root and head
> > >          // any operation on value->root or value->head has to precede with bpf_lock(value->lock)
> > >      struct bpf_rb_root root contain(rb_elem, node);
> > >      struct bpf_list_head head contain(list_elem, node);
> > >   };
> > >   struct {
> > >     __uint(type, BPF_MAP_TYPE_HASH);
> > >     __uint(max_entries, 123);
> > >     __type(key, __u32);
> > >     __type(value, struct roots);
> > >   } hash SEC(".maps");
> > >
> > >
> > > - allow bpf_obj_alloc(bpf_core_type_id_local(struct list_elem))
> > >   It returns acquired ptr_to_btf_id to program local btf type.
> > >   This is single owner ship case. The ptr_to_btf_id is acquired (in the verifier terminology)
> > >   and has to be kptr_xchg-ed into a map value or bpf_list_add-ed to a link list.
> > >   Same thing with bpf_obj_alloc(bpf_core_type_id_local(struct rb_elem))
> > >
> > > - the verifier ensures single lock is taken to allows ops on root or head.
> > >   value = bpf_map_lookup_elem(hash, key);
> > >   rb_elem = bpf_obj_alloc(bpf_core_type_id_local(struct rb_elem));
> > >   list_elem = bpf_obj_alloc(bpf_core_type_id_local(struct list_elem));
> > >
> > >   bpf_lock(&value->lock);
> > >   bpf_rb_add(&rb_elem->node, &value->root, cb);
> > >   bpf_list_add(&list_elem->node, &value->head);
> > >   bpf_unlock(&value->lock);
> > >
> > >   The verifier checks that arg1 type matches btf_decl_tag "contain" of root/head of arg2.
> > >   In this example one lock protects rbtree and list.
> > >
> > > - walking link list or bpf_rb_find do not return acuquired pointer.
> > >   bpf_rb_erase and bpf_list_del convert the pointer to acquired,
> > >   so it has to be inserted into another list or rbtree, freed with bpf_obj_free,
> > >   or bpf_kptr_xchg-ed.
> > >   bpf_rb_add doesn't fail.
> > >
> > > - one element can be a part of an rbtree and a link list. In such cases it has to contain
> > >   an explicit bpf_refcount field.
> > >   Example:
> > >   struct elem {
> > >      int var;
> > >      struct bpf_rb_node rb_node;
> > >      struct bpf_list_node list_node;
> > >      struct bpf_refcount refcount;
> > >   };
> > >
> > >   SEC("data1") struct bpf_lock lock1; // lock1 protects rb_root
> > >   SEC("data1") struct bpf_rb_root rb_root contain(elem, rb_node);
> > >   SEC("data2") struct bpf_lock lock2; // lock2 protects list_head
> > >   SEC("data2") struct bpf_list_head list_head contain(elem, list_node);
> > >
> > >   elem = bpf_obj_alloc(bpf_core_type_id_local(struct elem));
> > >
> > >   bpf_lock(&lock1);
> > >   if (bpf_rb_add(&elem->rb_node, &rb_root, cb));
> > >   bpf_unlock(&lock1);
> > >
> > >   bpf_lock(&lock2);
> > >   if (bpf_list_add(&elem->list_node, &list_head));
> > >   bpf_unlock(&lock2);
> > >   bpf_obj_put(elem);
> > >
> > >   bpf_obj_alloc automatically sets refcount to 1.
> > >   bpf_rb_add() and bpf_list_add() automatically increment refcount.
> >
> > Weak agree, I still prefer leaving object initialization up to the
> > user, but let's discuss these details when the patch is posted.
> >
> > >   This is achieved by passing hidden offset to refcount into these helpers.
> > >   Explicit operations on bpf_refcount can be allowed in theory,
> > >   but it's simpler to start with implicit operations.
> > >   Eventually explicit destructor for 'struct elem' would be needed as well.
> > >   For now it will be implicit. The verifier will recursively destroy the object
> > >   if 'struct elem' has other containers like bpf_rb_root or bpf_list_head.
> > >
> > > - bpf_rb_add() also checks that owner field hidden inside bpf_rb_node is still NULL.
> >
> > Just checking that it is NULL is not enough if references have been leaked;
> > it will be racy. You need cmpxchg to some random pointer that can
> > never be a valid one.
>
> Of course. It needs to be cmpxchg.
>
> >
> > >   Since the object is shared, multiple CPUs can obtain a pointer to elem and
> > >   attempt to insert it into different linked lists.
> > >   bpf_rb_add() can fail, but bpf program can ignore the failure.
> >
> > Agree in principle, but I still prefer that the user checks if it can add
> > and only allow bpf_rb_add in that branch, so void bpf_rb_add vs int
> > bpf_rb_add, but this is a detail (check inside vs outside the helper),
> > and we can discuss pros/cons during code review.
>
> I've considered it, but it only adds run-time overhead and extra UX pain. Consider:

That I agree with.

> if (bpf_can_rb_add(node, root))
>   bpf_rb_add(node, root, cb);
>
> It's non-trivial for the verifier to enforce that 'root' is the same
> in both calls. Especially when 'root' arg in 2nd call is unnecessary, since it was cmpxchg-ed
> and stored in the first call.
> So the user should write:
> if (bpf_can_rb_add(node, root))
>   bpf_rb_add(node, cb);
>
> So bpf_can_rb_add did: node->owner = root; refcount_inc(node->refcnt)

This is a bit different from what I proposed. Once you do cmpxchg
node->owner to root to 'own for add' (can_add in your email), some
other CPU can see node->owner == root and try to remove node which is
not part of the rb_tree yet. What you want to store is BPF_PTR_POISON.
Invalid non-NULL value that can never match that is_owner load.
Basically mark the node as 'busy'.

And then it is usable for rb_add in the success branch, but not to
just some single rb_tree 'root', to any rb_tree. No need for
'associating' node and root. You can do this op outside the lock. Add
will finally do a simple store of the owner to complete the add (you
can call it the linearization point of bpf_rb_add, as now the node is
visible as 'added to rb_tree' to other CPUs).

When doing rb_add, you will see the 'root' we are adding to is locked,
and we already own the node parameter for add, so we don't check
anything else. As soon as you unlock, your ownership to simply remove
or add is gone.
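
In kernel-helper pseudo-C the claim/publish steps might look roughly like
this (all struct and function names below are made up for illustration;
only BPF_PTR_POISON is from the discussion above):

struct bpf_rb_node_kern {
	struct rb_node rb;
	void *owner;	/* NULL: free, BPF_PTR_POISON: busy, otherwise: the root it is in */
};

/* "own for add"; may run outside the lock */
static bool bpf_rb_node_claim(struct bpf_rb_node_kern *node)
{
	/* claim succeeds only if nobody owns the node yet */
	return cmpxchg(&node->owner, NULL, BPF_PTR_POISON) == NULL;
}

/* runs with the destination root's lock held, after a successful claim */
static void bpf_rb_node_publish(struct bpf_rb_node_kern *node, struct rb_root *root)
{
	/* linearization point: the node becomes visible as part of this root */
	smp_store_release(&node->owner, root);
}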

But I guess it might be a bit tiring for everyone slogging through
this over and over :), so I'll focus on single ownership first, and
we'll discuss this shared case in the RFC patches next time... It's
easier to poke holes in theories then.

Because I think it is clear by now that unless we restrict usage
heavily, to ensure safety for the shared ownership case we need to
track the owner. It will just depend on what protocol is chosen to
transfer the ownership of the node from one tree to another.

> the only thing it didn't do is rb_add.
> bpf_rb_add() is at least doing some work, but for bpf_list_add() a pointer assignment
> is all that is left to do.
> And now this code spawns two function calls.
> And above code has hidden logic that 'root' was remembered earlier.
> These side effects make reading such code unpleasant.
>
> But turning the same argument around... and thinking about can_add differently...
>
> Single owner case:
> bpf_lock, bpf_list_add, bpf_unlock -> cmpxchg, stores, cmpxchg
>
> Shared owner case:
> bpf_lock, bpf_list_add, bpf_unlock -> cmpxchg, cmpxchg, atomic_inc, stores, cmpxchg.
>
> While discussing this with Tejun and Dave the idea came to split association
> of the node with a tree and actual addition.
> Something like:
> obj = bpf_obj_alloc();
> bpf_obj_assoc(&obj->rb_node, &rb_root);
> bpf_spin_lock();
> bpf_rbtree_add(&obj->rb_node, cb);
> bpf_spin_unlock();
>
> The assumption is that nodes will belong to the same tree much longer
> than add/erase from the tree. This will move cmpxchg + atomic_inc from add/erase path.

But whenever you move from one rb_root to another, then you have to
change object association again, right? I.e. that call is part of move
operation, so what changed from above where you have to own for add
because someone else might try to 'steal' it after unlock? Maybe I
missed something.

>
> But now the verifier has to make sure that correct root is locked at the time
> of bpf_rbtree_add... if above example is in one function it's doable, but consider:
> bpf_prog1()
> {
>   obj = bpf_obj_alloc();
>   bpf_obj_assoc(&obj->rb_node, &rb_root);
>   kptr_xchg(obj, map_value); // stash it for future use
> }
> bpf_prog2()
> {
>   obj = kptr_xchg(map_value); // get it from local storage or else
>   bpf_spin_lock();
>   bpf_rbtree_add(&obj->rb_node, cb);
>   bpf_spin_unlock();
> }
>
> Not only btf_type_id needs to be known in kptr_xchg, but the specific root
> to be able to check that root and lock are from the same allocation.
>
> So bpf_can_rb_add and bpf_obj_assoc are not quite dead-ends yet, but
> I propose to implement bpf_list_add() the way I drafted earlier as the first step.
> The rest can be an optimization later.

Agreed, going to work on just the single ownership case first, and
this shared case after that (and think about making it better in the
mean time).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BPF Linked Lists discussion
  2022-08-25  1:02               ` Kumar Kartikeya Dwivedi
@ 2022-08-25  1:40                 ` Alexei Starovoitov
  0 siblings, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2022-08-25  1:40 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: Dave Marchevsky, bpf, ast, andrii, daniel

On Thu, Aug 25, 2022 at 03:02:44AM +0200, Kumar Kartikeya Dwivedi wrote:
> > >
> > > I'm thinking of more realistic cases and whether this will cover them
> > > in the future. This comparison idea works fine when locks are in the
> > > same _type_.
> > >
> > > Eventually you will have cases where you have lock hierarchies: When
> > > holding two locks in two _different_ types, you have the ordering A,
> > > B. When holding the locks of different _types_, you can determine that
> > > you never go and take them in B, A ordering.
> > >
> > > So you have lock 'domains', each domain defines an ordering rule and
> > > comprises a set of types.
> > >
> > > When the user doesn't specify anything, there is the default domain
> > > with all prog BTF types, and the rule is lock address based ordering.
> > > When the user defines an ordering between types, they cannot be locked
> > > now with the address based rule, they can only be held in A, B order
> > > (or whatever it is). Lock in A cannot be held with anything else, lock
> > > in B cannot be held unless lock of type A is held.
> > >
> > > Within the same type, address based ordering will still be ok.
> >
> > Not following at all.
> > What is the 'type' of lock ? spin_lock vs mutex ?
> 
> Lock within a type. Lock inside type A, Lock inside type B.

Hmm. Did you mean the case of:
struct map_value_A {
  struct bpf_lock lock;
  struct bpf_rb_root root;
};

struct map_value_B {
  struct bpf_lock lock;
  struct bpf_rb_root root;
};

and because the locks (and corresponding rb trees) are in different
allocations it's ok to take both locks in whatever order?
That's still a looming deadlock.
Even if bpf progs have tags that tell the verifier "no deadlock" here,
we cannot rely on that.

lockdep's concept of lock classes cannot be used here, since the verifier
has to guarantee safety.

> 
> > Why bother distinguishing?
> >
> > > This is not exhaustive, but it will cover the majority of the cases.
> > > The good thing is then you don't need comparisons, and we can extend
> > > this to even more than two locks, as long as the constraints are
> > > specified in BTF.
> > >
> > > So I lean more towards letting the verifier understand the
> > > relationship between two lock calls, instead of bpf_spin_lock_double.
> > > That can either come from address comparisons, or it can come from
> > > BTF. bpf_spin_lock_double is ok for 2 levels of locking, but with BTF
> > > we will be able to go beyond that.
> >
> > It doesn't look like the address of the lock will stay known
> > to the verifier.
> 
> But when you lock it, you need to pass the bpf_spin_lock pointer to
> the call, right? We only need to determine ordering between lock
> pointers passed to two spin lock calls, the address doesn't have to be
> known to the verifier. Same thing which bpf_spin_lock_double does. We
> can do inside the program whatever it can do.
> 
> if (l1 < l2)
>  lock(l1); lock(l2);
> else
>  lock(l2); lock(l1);

as an open-coded version of lock_double? Sure. That can be made to work,
but I'm missing why the verifier has to support that when the helper
can do the same.

The "dynamic" lock I'm talking about is the case where we do:
struct map_value {
   struct bpf_lock *lock;
   struct bpf_rb_root root;
};
and the verifier checks that the program did
 bpf_lock(elem->lock);
before bpf_rb_add(... &elem->root);

Same concept as "lock and corresponding rb tree / list head are part of
the same allocation", but now the lock is only known at run-time.
So bpf_rb_add needs to cmpxchg it and stash it, similar to the
cmpxchg(rb_node->owner, rb_root) check that we will do for shared objects.

bpf_rb_add() will do two run-time checks: one that the lock is actually the
correct lock used to protect the rbtree, and a second check that the
node-to-be-added is associated with the correct rbtree.
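
As a rough sketch of those two checks (all struct layouts and names below
are illustrative, not a proposed implementation):

struct bpf_rb_root_dyn {
	struct rb_root rbtree;
	struct bpf_lock *lock;	/* stashed on first add; the lock protecting this root */
};

struct bpf_rb_node_dyn {
	struct rb_node rb;
	struct bpf_rb_root_dyn *owner;	/* set when the node is associated with a root */
};

static int bpf_rbtree_add_dyn(struct bpf_rb_node_dyn *node, struct bpf_rb_root_dyn *root,
			      struct bpf_lock *held_lock,
			      bool (*less)(struct rb_node *, const struct rb_node *))
{
	struct bpf_lock *old = cmpxchg(&root->lock, NULL, held_lock);

	/* check 1: held_lock is the lock actually protecting this root */
	if (old && old != held_lock)
		return -EINVAL;
	/* check 2: the node was associated with this root beforehand */
	if (READ_ONCE(node->owner) != root)
		return -EINVAL;
	rb_add(&node->rb, &root->rbtree, less);
	return 0;
}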

> > I've considered it, but it only adds run-time overhead and extra UX pain. Consider:
> 
> That I agree with.
> 
> > if (bpf_can_rb_add(node, root))
> >   bpf_rb_add(node, root, cb);
> >
> > It's non-trivial for the verifier to enforce that 'root' is the same
> > in both calls. Especially when 'root' arg in 2nd call is unnecessary, since it was cmpxchg-ed
> > and stored in the first call.
> > So the user should write:
> > if (bpf_can_rb_add(node, root))
> >   bpf_rb_add(node, cb);
> >
> > So bpf_can_rb_add did: node->owner = root; refcount_inc(node->refcnt)
> 
> This is a bit different from what I proposed. Once you do cmpxchg
> node->owner to root to 'own for add' (can_add in your email), some
> other CPU can see node->owner == root and try to remove node which is
> not part of the rb_tree yet. What you want to store is BPF_PTR_POISON.
> Invalid non-NULL value that can never match that is_owner load.
> Basically mark the node as 'busy'.
> 
> And then it is usable for rb_add in the success branch, but not to
> just some single rb_tree 'root', to any rb_tree. No need for
> 'associating' node and root. You can do this op outside the lock. Add
> will finally do a simple store of the owner to complete the add (you
> can call it the linearization point of bpf_rb_add, as now the node is
> visible as 'added to rb_tree' to other CPUs).
> 
> When doing rb_add, you will see the 'root' we are adding to is locked,
> and we already own the node parameter for add, so we don't check
> anything else. As soon as you unlock, your ownership to simply remove
> or add is gone.

I see. So it will be:
obj = bpf_obj_alloc();
if (bpf_can_rb_add(&obj->rb_node)) {
  bpf_spin_lock();
  bpf_rb_add(&obj->rb_node, &rb_root, cb);
  bpf_spin_unlock();
} else {
  bpf_obj_free(obj);
}
and that has to be done every time we do bpf_rb_add?
Because the verifier has to see this data flow within the same program,
it cannot be done once earlier or outside of the iterator loop.
This 'if' has to be there for every obj->rb_node.

And the cost of rb_add for the shared owner case is still the same:
cmpxchg, cmpxchg, atomic_inc, stores, cmpxchg.

That doesn't help performance and comes down to UX preferences.

> But I guess it might be a bit tiring for everyone slogging through
> this over and over :), 

Brainstorming consumes more calories than physical exercise :)

> so I'll focus on single ownership first, and
> we'll discuss this shared case in the RFC patches next time... It's
> easier to poke holes in theories then.
> 
> Because I think it is clear by now that unless we restrict usage
> heavily, to ensure safety for the shared ownership case we need to
> track the owner. It will just depend on what protocol is chosen to
> transfer the ownership of the node from one tree to another.
> 
> > the only thing it didn't do is rb_add.
> > bpf_rb_add() is at least doing some work, but for bpf_list_add() a pointer assignment
> > is all that is left to do.
> > And now this code spawns two function calls.
> > And above code has hidden logic that 'root' was remembered earlier.
> > These side effects make reading such code unpleasant.
> >
> > But turning the same argument around... and thinking about can_add differently...
> >
> > Single owner case:
> > bpf_lock, bpf_list_add, bpf_unlock -> cmpxchg, stores, cmpxchg
> >
> > Shared owner case:
> > bpf_lock, bpf_list_add, bpf_unlock -> cmpxchg, cmpxchg, atomic_inc, stores, cmpxchg.
> >
> > While discussing this with Tejun and Dave the idea came to split association
> > of the node with a tree and actual addition.
> > Something like:
> > obj = bpf_obj_alloc();
> > bpf_obj_assoc(&obj->rb_node, &rb_root);
> > bpf_spin_lock();
> > bpf_rbtree_add(&obj->rb_node, cb);
> > bpf_spin_unlock();
> >
> > The assumption is that nodes will belong to the same tree much longer
> > than add/erase from the tree. This will move cmpxchg + atomic_inc from add/erase path.
> 
> But whenever you move from one rb_root to another, then you have to
> change object association again, right? I.e. that call is part of move
> operation, so what changed from above where you have to own for add
> because someone else might try to 'steal' it after unlock? Maybe I
> missed something.

Correct. To move the node from one tree to another the prog has to
do another bpf_obj_assoc.
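
Roughly, the move would then look like (a sketch; the erase helper name
below is only a guess):

bpf_spin_lock(&lock1);
bpf_rbtree_erase(&obj->rb_node, &root1); /* obj is owned by the program again */
bpf_spin_unlock(&lock1);

bpf_obj_assoc(&obj->rb_node, &root2);    /* re-associate with the destination tree */

bpf_spin_lock(&lock2);
bpf_rbtree_add(&obj->rb_node, cb);
bpf_spin_unlock(&lock2);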

> >
> > But now the verifier has to make sure that correct root is locked at the time
> > of bpf_rbtree_add... if above example is in one function it's doable, but consider:
> > bpf_prog1()
> > {
> >   obj = bpf_obj_alloc();
> >   bpf_obj_assoc(&obj->rb_node, &rb_root);
> >   kptr_xchg(obj, map_value); // stash it for future use
> > }
> > bpf_prog2()
> > {
> >   obj = kptr_xchg(map_value); // get it from local storage or else
> >   bpf_spin_lock();
> >   bpf_rbtree_add(&obj->rb_node, cb);
> >   bpf_spin_unlock();
> > }
> >
> > Not only btf_type_id needs to be known in kptr_xchg, but the specific root
> > to be able to check that root and lock are from the same allocation.
> >
> > So bpf_can_rb_add and bpf_obj_assoc are not quite dead-ends yet, but
> > I propose to implement bpf_list_add() the way I drafted earlier as the first step.
> > The rest can be an optimization later.
> 
> Agreed, going to work on just the single ownership case first, and
> this shared case after that (and think about making it better in the
> mean time).

+1.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-08-25  1:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-17  9:04 BPF Linked Lists discussion Kumar Kartikeya Dwivedi
2022-08-17  9:09 ` Kumar Kartikeya Dwivedi
2022-08-19  8:55 ` Dave Marchevsky
2022-08-19 16:00   ` Kumar Kartikeya Dwivedi
2022-08-19 19:03     ` Alexei Starovoitov
2022-08-19 20:24       ` Kumar Kartikeya Dwivedi
2022-08-23 20:02         ` Alexei Starovoitov
2022-08-24 18:56           ` Kumar Kartikeya Dwivedi
2022-08-24 23:53             ` Alexei Starovoitov
2022-08-25  1:02               ` Kumar Kartikeya Dwivedi
2022-08-25  1:40                 ` Alexei Starovoitov
