From: Martin KaFai Lau <kafai@fb.com>
To: KP Singh <kpsingh@kernel.org>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Jann Horn <jannh@google.com>,
	Florent Revest <revest@chromium.org>,
	Brendan Jackman <jackmanb@chromium.org>
Subject: Re: [PATCH bpf-next 1/2] bpf: Allow bpf_local_storage to be used by sleepable programs
Date: Mon, 30 Aug 2021 19:11:32 -0700	[thread overview]
Message-ID: <20210831021132.sehzvrudvcjbzmwt@kafai-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <CACYkzJ6sgJ+PV3SUMtsg=8Xuun2hfYHn8szQ6Rdps7rpWmPP_g@mail.gmail.com>

On Sun, Aug 29, 2021 at 11:52:03PM +0200, KP Singh wrote:
[ ... ]

> > > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> > > index b305270b7a4b..7760bc4e9565 100644
> > > --- a/kernel/bpf/bpf_local_storage.c
> > > +++ b/kernel/bpf/bpf_local_storage.c
> > > @@ -11,6 +11,8 @@
> > >  #include <net/sock.h>
> > >  #include <uapi/linux/sock_diag.h>
> > >  #include <uapi/linux/btf.h>
> > > +#include <linux/rcupdate_trace.h>
> > > +#include <linux/rcupdate_wait.h>
> > >
> > >  #define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
> > >
> > > @@ -81,6 +83,22 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> > >       return NULL;
> > >  }
> > >
> > > +void bpf_local_storage_free_rcu(struct rcu_head *rcu)
> > > +{
> > > +     struct bpf_local_storage *local_storage;
> > > +
> > > +     local_storage = container_of(rcu, struct bpf_local_storage, rcu);
> > > +     kfree_rcu(local_storage, rcu);
> > > +}
> > > +
> > > +static void bpf_selem_free_rcu(struct rcu_head *rcu)
> > > +{
> > > +     struct bpf_local_storage_elem *selem;
> > > +
> > > +     selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> > > +     kfree_rcu(selem, rcu);
> > > +}
> > > +
> > >  /* local_storage->lock must be held and selem->local_storage == local_storage.
> > >   * The caller must ensure selem->smap is still valid to be
> > >   * dereferenced for its smap->elem_size and smap->cache_idx.
> > > @@ -118,12 +136,12 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
> > >                *
> > >                * Although the unlock will be done under
> > >                * rcu_read_lock(),  it is more intuitive to
> > > -              * read if kfree_rcu(local_storage, rcu) is done
> > > +              * read if the freeing of the storage is done
> > >                * after the raw_spin_unlock_bh(&local_storage->lock).
> > >                *
> > >                * Hence, a "bool free_local_storage" is returned
> > > -              * to the caller which then calls the kfree_rcu()
> > > -              * after unlock.
> > > +              * to the caller which then calls then frees the storage after
> > > +              * all the RCU grace periods have expired.
> > >                */
> > >       }
> > >       hlist_del_init_rcu(&selem->snode);
> > > @@ -131,7 +149,7 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
> > >           SDATA(selem))
> > >               RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
> > >
> > > -     kfree_rcu(selem, rcu);
> > > +     call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> > Although the common use case is usually storage_get() much more often
> > than storage_delete(), do you aware any performance impact for
> > the bpf prog that does a lot of storage_delete()?
> 
> I have not really measured the impact on deletes. My understanding is
> that it should not impact the BPF program, but yes, if there are some
> critical sections that are prolonged due to a sleepable program
> "sleeping" too long, then the callbacks would pile up.
> 
> But this is not something new, as we have a similar thing in BPF
> trampolines. If this really becomes an issue, we could add a flag
> BPF_F_SLEEPABLE_STORAGE and only maps with this flag would be allowed
> in sleepable progs.
Agree that it is similar to trampoline updates, but I am not sure it is
comparable in terms of the frequency of elems being deleted here, e.g. many
short-lived tcp connections created by external traffic.

Adding a BPF_F_SLEEPABLE_STORAGE later won't work.  It will break
existing sleepable bpf prog.

I don't know enough about call_rcu_tasks_trace() here, hence the
earlier question on the perf/callback-pile-up implications, in order to
decide whether extra logic or a knob is needed here or not.
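[Editor's note: the two-stage free being discussed can be restated as a short
sketch based on the quoted patch (not a verbatim copy). call_rcu_tasks_trace()
first waits out the sleepable readers under rcu_read_lock_trace(); the
kfree_rcu() issued from inside that callback then waits out the classic
rcu_read_lock() readers, so memory is only returned after both grace periods.]

```c
/* Sketch of the chained-free pattern from the patch above.  The elem
 * is returned to the allocator only after BOTH a tasks-trace RCU grace
 * period (sleepable readers) and a vanilla RCU grace period elapse.
 */
static void bpf_selem_free_rcu(struct rcu_head *rcu)
{
	struct bpf_local_storage_elem *selem;

	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
	/* Stage 2: queue the actual kfree() behind a vanilla RCU gp. */
	kfree_rcu(selem, rcu);
}

/* Stage 1: wait for all rcu_read_lock_trace() readers first. */
call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
```

Reusing the same rcu_head for the second stage is safe here because the
tasks-trace callback has already been invoked by the time kfree_rcu() queues
it again.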

> 
> We could then wait for both critical sections only when this flag is
> set on the map.
> 
> >
> > >
> > >       return free_local_storage;
> > >  }
> > > @@ -154,7 +172,8 @@ static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem)
> > >       raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > >
> > >       if (free_local_storage)
> > > -             kfree_rcu(local_storage, rcu);
> > > +             call_rcu_tasks_trace(&local_storage->rcu,
> > > +                                  bpf_local_storage_free_rcu);
> > >  }
> > >
> > >  void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
> > > @@ -213,7 +232,8 @@ bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
> > >       struct bpf_local_storage_elem *selem;
> > >
> > >       /* Fast path (cache hit) */
> > > -     sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
> > > +     sdata = rcu_dereference_protected(local_storage->cache[smap->cache_idx],
> > > +                                       bpf_local_storage_rcu_lock_held());
> > There are other places using rcu_dereference() also.
> > e.g. in bpf_local_storage_update().
> > Should they be changed also?
> 
> From what I saw, the other usages of rcu_dereference are in a nested
> (w.r.t. the RCU section in bpf_prog_enter/exit) RCU read-side critical
> section/rcu_read_{lock, unlock}, so it should not be required.
hmm... not sure what nested or not has to do here.
It is likely we are talking about different things.

Did you turn on the lockdep rcu debugging in kconfig?

afaik, lockdep uses the second "check" argument to WARN on incorrect usage.
Even if the right check is passed here, the later rcu_dereference() will still
complain because it only checks for rcu_read_lock_held().

Also, after another look, why is rcu_dereference_protected() used
here instead of rcu_dereference_check()?  The spinlock is not acquired
here.  The same comment applies to the similar rcu_dereference_protected() usages.
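[Editor's note: as a concrete illustration of the point above (a sketch,
assuming the bpf_local_storage_rcu_lock_held() helper introduced by this
series): rcu_dereference_check() takes a lockdep condition and still performs
the READ_ONCE-style read-side fetch, whereas rcu_dereference_protected()
asserts that the update-side lock is held and elides the read-side
protection.]

```c
/* local_storage->lock is NOT held on this lookup path, so the
 * reader-side form with an explicit lockdep condition is the fitting
 * one; rcu_dereference_protected() would wrongly claim updater-side
 * exclusion and skip the READ_ONCE().
 */
sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
			      bpf_local_storage_rcu_lock_held());
```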

> 
> If there are some that are not, then they need to be updated. Did I miss any?
> 
> >
> > [ ... ]
> >
> > > --- a/net/core/bpf_sk_storage.c
> > > +++ b/net/core/bpf_sk_storage.c
> > > @@ -13,6 +13,7 @@
> > >  #include <net/sock.h>
> > >  #include <uapi/linux/sock_diag.h>
> > >  #include <uapi/linux/btf.h>
> > > +#include <linux/rcupdate_trace.h>
> > >
> > >  DEFINE_BPF_STORAGE_CACHE(sk_cache);
> > >
> > > @@ -22,7 +23,8 @@ bpf_sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool cacheit_lockit)
> > >       struct bpf_local_storage *sk_storage;
> > >       struct bpf_local_storage_map *smap;
> > >
> > > -     sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > > +     sk_storage = rcu_dereference_protected(sk->sk_bpf_storage,
> > > +                                            bpf_local_storage_rcu_lock_held());
> > >       if (!sk_storage)
> > >               return NULL;
> > >
> > > @@ -258,6 +260,7 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> > >  {
> > >       struct bpf_local_storage_data *sdata;
> > >
> > > +     WARN_ON_ONCE(!bpf_local_storage_rcu_lock_held());
> > >       if (!sk || !sk_fullsock(sk) || flags > BPF_SK_STORAGE_GET_F_CREATE)
> > sk is protected by rcu_read_lock here.
> > Is it always safe to access it with the rcu_read_lock_trace alone ?
> 
> We don't dereference sk with an rcu_dereference though. Is that still
> the case for tracing and LSM programs? Or is it somehow implicitly
> protected even though we don't use rcu_dereference, since that's just
> a READ_ONCE + some checks?
e.g. the bpf_prog (currently run under rcu_read_lock()) may read the sk from
req_sk->sk, which I don't think the verifier will optimize out, so it is as
good as READ_ONCE(), iiuc.

The sk here is obtained from the bpf_lsm_socket_* hooks?  Those sk should have
a refcnt, right?  If that is the case, it should be good enough for now.

> 
> Would grabbing a refcount earlier in the code be helpful?
Grab the refcount earlier in the bpf prog itself?
Right, that may be what eventually needs to be done
if somehow there is a usecase where the sleepable bpf prog can get
hold of an rcu-protected sk.
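[Editor's note: a hypothetical sketch of what "grabbing the refcount earlier"
could look like. refcount_inc_not_zero() and sock_put() are real kernel APIs;
whether this exact path is needed depends on how the hook hands out the sk.]

```c
/* Take a reference before any sleepable section so the sk cannot be
 * freed while the program sleeps; drop it once done.  The _not_zero
 * variant guards against an sk that is already being destroyed.
 */
if (!refcount_inc_not_zero(&sk->sk_refcnt))
	return NULL;	/* sk is going away; bail out */

/* ... sleepable work using sk ... */

sock_put(sk);
```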
