Re: [PATCH bpf-next 1/2] bpf: Allow bpf_local_storage to be used by sleepable programs

From: Martin KaFai Lau <kafai@fb.com>
To: KP Singh <kpsingh@kernel.org>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Jann Horn <jannh@google.com>,
	Florent Revest <revest@chromium.org>,
	Brendan Jackman <jackmanb@chromium.org>,
	Yonghong Song <yhs@fb.com>
Subject: Re: [PATCH bpf-next 1/2] bpf: Allow bpf_local_storage to be used by sleepable programs
Date: Tue, 31 Aug 2021 23:32:17 -0700	[thread overview]
Message-ID: <20210901063217.5zpvnltvfmctrkum@kafai-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <CACYkzJ58Yp_YQBGMFCL_5UhjK3pHC5n-dcqpR-HEDz+Y-yasfw@mail.gmail.com>

On Tue, Aug 31, 2021 at 09:38:01PM +0200, KP Singh wrote:
[ ... ]

> > > > > > > @@ -131,7 +149,7 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
> > > > > > >           SDATA(selem))
> > > > > > >               RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
> > > > > > >
> > > > > > > -     kfree_rcu(selem, rcu);
> > > > > > > +     call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> > > > > > Although the common use case is usually storage_get() much more often
> > > > > > than storage_delete(), do you aware any performance impact for
> > > > > > the bpf prog that does a lot of storage_delete()?
> > > > >
> > > > > I have not really measured the impact on deletes, My understanding is
> > > > > that it should
> > > > > not impact the BPF program, but yes, if there are some critical
> > > > > sections that are prolonged
> > > > > due to a sleepable program "sleeping" too long, then it would pile up
> > > > > the callbacks.
> > > > >
> > > > > But this is not something new, as we have a similar thing in BPF
> > > > > trampolines. If this really
> > > > > becomes an issue, we could add a flag BPF_F_SLEEPABLE_STORAGE and only maps
> > > > > with this flag would be allowed in sleepable progs.
> > > > Agree that is similar to trampoline updates but not sure it is comparable
> > > > in terms of the frequency of elems being deleted here.  e.g. many
> > > > short lived tcp connections created by external traffic.
> > > >
> > > > Adding a BPF_F_SLEEPABLE_STORAGE later won't work.  It will break
> > > > existing sleepable bpf prog.
> > > >
> > > > I don't know enough on call_rcu_tasks_trace() here, so the
> > > > earlier question on perf/callback-pile-up implications in order to
> > > > decide if extra logic or knob is needed here or not.
> > >
> > > I will defer to the others, maybe Alexei and Paul,
> >
> > > we could also just
> > > add the flag to not affect existing performance characteristics?
> > I would see if it is really necessary first.  Other sleepable
> > supported maps do not need a flag.  Adding one here for local
> > storage will be confusing especially if it turns out to be
> > unnecessary.
> >
> > Could you run some tests first which can guide the decision?
> 
> I think the performance impact would happen only in the worst case which
> needs some work to simulate. What do you think about:
> 
> A bprm_committed_creds program that processes a large argv
> and also gets a storage on the inode.
> 
> A file_open program that tries to delete the local storage on the inode.
> 
> Trigger this code in parallel. i.e. lots of programs that execute with a very
> large argv and then in parallel the executable being opened to trigger the
> delete.
> 
> Do you have any other ideas? Is there something we could re-use from
> the selftests?
There is a bench framework in tools/testing/selftests/bpf/benchs/
that has a parallel thread setup which could be useful.

Don't know how to simulate the "sleeping" too long which
then pile-up callbacks.  This is not bpf specific.
Paul, I wonder if you have similar test to trigger this to
compare between call_rcu_tasks_trace() and call_rcu()?

[ ... ]

> > > > > > > @@ -213,7 +232,8 @@ bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
> > > > > > >       struct bpf_local_storage_elem *selem;
> > > > > > >
> > > > > > >       /* Fast path (cache hit) */
> > > > > > > -     sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
> > > > > > > +     sdata = rcu_dereference_protected(local_storage->cache[smap->cache_idx],
> > > > > > > +                                       bpf_local_storage_rcu_lock_held());
> > > > > > There are other places using rcu_dereference() also.
> > > > > > e.g. in bpf_local_storage_update().
> > > > > > Should they be changed also?
> > > > >
> > > > > From what I saw, the other usage of rcu_derference is in a nested
> > > > > (w.r.t to the RCU section that in bpf_prog_enter/exit) RCU
> > > > > read side critical section/rcu_read_{lock, unlock} so it should not be required.
> > > > hmm... not sure what nested or not has to do here.
> > > > It is likely we are talking different things.
> > > >
> > > Yeah, we were looking at different things.
> > >
> > > e.g. bpf_selem_unlink does not need to be changed as it is in
> > > a rcu_read_lock.
> > No.  It is not always under rcu_read_lock().  From the patch 2 test,
> > it should have a splat either from bpf_inode_storage_delete()
> > or bpf_sk_storage_delete(), depending on which one runs first.
> 
> I missed this one, but I wonder why it did not trigger a warning. The test does
> exercise the delete and rcu_dereference should have warned me that I am not
> holding an rcu_read_lock();
hmm... not sure either.  may be some kconfigs that disabled rcu_read_lock_held()?
I would also take a look at RCU_LOCKDEP_WARN().

I just quickly tried the patches to check:

[  143.376587] =============================
[  143.377068] WARNING: suspicious RCU usage
[  143.377541] 5.14.0-rc5-01271-g68e5bda2b18e #4966 Tainted: G           O
[  143.378378] -----------------------------
[  143.378857] kernel/bpf/bpf_local_storage.c:114 suspicious rcu_dereference_check() usage!
[  143.379914]
[  143.379914] other info that might help us debug this:
[  143.379914]
[  143.380838]
[  143.380838] rcu_scheduler_active = 2, debug_locks = 1
[  143.381602] 4 locks held by mv/1781:
[  143.382025]  #0: ffff888121e7c438 (sb_writers#6){.+.+}-{0:0}, at: do_renameat2+0x2f5/0xa80
[  143.383009]  #1: ffff88812ce68760 (&type->i_mutex_dir_key#5/1){+.+.}-{3:3}, at: lock_rename+0x1f4/0x250
[  143.384144]  #2: ffffffff843fbc60 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x45/0x160
[  143.385326]  #3: ffff88811d8348b8 (&storage->lock){..-.}-{2:2}, at: __bpf_selem_unlink_storage+0x7d/0x170
[  143.386459]
[  143.386459] stack backtrace:
[  143.386983] CPU: 2 PID: 1781 Comm: mv Tainted: G           O      5.14.0-rc5-01271-g68e5bda2b18e #4966
[  143.388071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[  143.389146] Call Trace:
[  143.389446]  dump_stack_lvl+0x5b/0x82
[  143.389901]  dump_stack+0x10/0x12
[  143.390302]  lockdep_rcu_suspicious+0x15c/0x167
[  143.390854]  bpf_selem_unlink_storage_nolock+0x2e1/0x6d0
[  143.391501]  __bpf_selem_unlink_storage+0xb7/0x170
[  143.392085]  bpf_selem_unlink+0x1b/0x30
[  143.392554]  bpf_inode_storage_delete+0x57/0xa0
[  143.393112]  bpf_prog_31e277fe2c132665_inode_rename+0x9c/0x268
[  143.393814]  bpf_trampoline_6442476301_0+0x4e/0x1000
[  143.394413]  bpf_lsm_inode_rename+0x5/0x10

> > > > > > > --- a/net/core/bpf_sk_storage.c
> > > > > > > +++ b/net/core/bpf_sk_storage.c
> > > > > > > @@ -13,6 +13,7 @@
> > > > > > >  #include <net/sock.h>
> > > > > > >  #include <uapi/linux/sock_diag.h>
> > > > > > >  #include <uapi/linux/btf.h>
> > > > > > > +#include <linux/rcupdate_trace.h>
> > > > > > >
> > > > > > >  DEFINE_BPF_STORAGE_CACHE(sk_cache);
> > > > > > >
> > > > > > > @@ -22,7 +23,8 @@ bpf_sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool cacheit_lockit)
> > > > > > >       struct bpf_local_storage *sk_storage;
> > > > > > >       struct bpf_local_storage_map *smap;
> > > > > > >
> > > > > > > -     sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > > > > > > +     sk_storage = rcu_dereference_protected(sk->sk_bpf_storage,
> > > > > > > +                                            bpf_local_storage_rcu_lock_held());
> > > > > > >       if (!sk_storage)
> > > > > > >               return NULL;
> > > > > > >
> > > > > > > @@ -258,6 +260,7 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> > > > > > >  {
> > > > > > >       struct bpf_local_storage_data *sdata;
> > > > > > >
> > > > > > > +     WARN_ON_ONCE(!bpf_local_storage_rcu_lock_held());
> > > > > > >       if (!sk || !sk_fullsock(sk) || flags > BPF_SK_STORAGE_GET_F_CREATE)
> > > > > > sk is protected by rcu_read_lock here.
> > > > > > Is it always safe to access it with the rcu_read_lock_trace alone ?
> > > > >
> > > > > We don't dereference sk with an rcu_dereference though, is it still the case for
> > > > > tracing and LSM programs? Or is it somehow implicity protected even
> > > > > though we don't use rcu_dereference since that's just a READ_ONCE + some checks?
> > > > e.g. the bpf_prog (currently run under rcu_read_lock()) may read the sk from
> > > > req_sk->sk which I don't think the verifier will optimize it out, so as good
> > > > as READ_ONCE(), iiuc.
> > > >
> > > > The sk here is obtained from the bpf_lsm_socket_* hooks?  Those sk should have
> > > > a refcnt, right?  If that is the case, it should be good enough for now.
> > >
> > > The one passed in the arguments yes, but if you notice the discussion in
> > >
> > > https://lore.kernel.org/bpf/20210826133913.627361-1-memxor@gmail.com/T/#me254212a125516a6c5d2fbf349b97c199e66dce0
> > >
> > > one may also get an sk in LSM and tracing progs by pointer walking.
> > Right.  There is pointer walking case.
> > e.g. "struct request_sock __rcu *fastopen_rsk" in tcp_sock.
> > I don't think it is possible for lsm to get a hold on tcp_sock
> > but agree that other similar cases could happen.
> >
> > May be for now, in sleepable program, only allow safe sk ptr
> > to be used in helpers that take sk PTR_TO_BTF_ID argument.
> > e.g. sock->sk is safe in the test in patch 2.  The same should go for other
> > storages like inode.  This needs verifier change.
> >
> 
> Sorry, I may be missing some context. Do you mean wait for Yonghong's work?
I don't think we have to wait.  Just saying Yonghong's work could fit
well in this use case in the future.

> Or is there another way to update the verifier to recognize safe sk and inode
> pointers?
I was thinking specifically for this pointer walking case.
Take a look at btf_struct_access().  It walks the struct
in the verifier and figures out reading sock->sk will get
a "struct sock *".  It marks the reg to PTR_TO_BTF_ID.
This will allow the bpf prog to directly read from sk (e.g. sk->sk_num)
or pass the sk to helper that takes a "struct sock *" pointer.
Reading from any sk pointer is fine since it is protected by BPF_PROBE_MEM
read.  However, we only allow the sk from sock->sk to be passed to the
helper here because we only know this one is refcnt-ed.

Take a look at check_ptr_to_btf_access().  An individual verifier_ops 
can also have its own btf_struct_access.  One possibility is
to introduce a (new) PTR_TO_RDONLY_BTF_ID to mean it can only
do BPR_PROBE_MEM read but cannot be used in helper.