From: John Fastabend <john.fastabend@gmail.com>
To: Jakub Sitnicki <jakub@cloudflare.com>,
John Fastabend <john.fastabend@gmail.com>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
daniel@iogearbox.net, ast@kernel.org, song@kernel.org,
jonathan.lemon@gmail.com
Subject: Re: [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down
Date: Wed, 05 Feb 2020 21:51:40 -0800
Message-ID: <5e3ba96ca7889_6b512aafe4b145b812@john-XPS-13-9370.notmuch>
In-Reply-To: <8736boor55.fsf@cloudflare.com>
Jakub Sitnicki wrote:
> On Sat, Jan 11, 2020 at 07:12 AM CET, John Fastabend wrote:
> > The sock_map_free() and sock_hash_free() paths used to delete sockmap
> > and sockhash maps walk the maps and destroy psock and bpf state associated
> > with the socks in the map. When done the socks no longer have BPF programs
> > attached and will function normally. This can happen while the socks in
> > the map are still "live" meaning data may be sent/received during the walk.
> >
> > Currently, though, we don't take the sock_lock when the psock and bpf state
> > is removed through this path. Specifically, this means we can be writing
> > into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
> > while they are also being called from the networking side. This is not
> > safe; we never used proper READ_ONCE/WRITE_ONCE semantics here, which we
> > would have if we believed it was safe. Further, it's not clear to me it's
> > even a good idea to try and do this on "live" sockets while the networking
> > side might also be using the socket. Instead of trying to reason about
> > using the socks from both sides, let's recognize that every use case I'm
> > aware of rarely deletes maps; in fact, the Kubernetes/Cilium case builds
> > the map at init and never tears it down except on errors. So let's do the
> > simple fix and grab the sock lock.
> >
> > This patch wraps sock deletes from maps in sock lock and adds some
> > annotations so we catch any other cases easier.
> >
> > Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
> > Cc: stable@vger.kernel.org
> > Acked-by: Song Liu <songliubraving@fb.com>
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> > net/core/skmsg.c | 2 ++
> > net/core/sock_map.c | 7 ++++++-
> > 2 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index ded2d5227678..3866d7e20c07 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -594,6 +594,8 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
> >
> > void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
> > {
> > + sock_owned_by_me(sk);
> > +
> > sk_psock_cork_free(psock);
> > sk_psock_zap_ingress(psock);
> >
> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> > index eb114ee419b6..8998e356f423 100644
> > --- a/net/core/sock_map.c
> > +++ b/net/core/sock_map.c
> > @@ -241,8 +241,11 @@ static void sock_map_free(struct bpf_map *map)
> > struct sock *sk;
> >
> > sk = xchg(psk, NULL);
> > - if (sk)
> > + if (sk) {
> > + lock_sock(sk);
> > sock_map_unref(sk, psk);
> > + release_sock(sk);
> > + }
> > }
> > raw_spin_unlock_bh(&stab->lock);
> > rcu_read_unlock();
>
> John, I've noticed this is triggering warnings that we might sleep in
> lock_sock while (1) in RCU read-side section, and (2) holding a spin
> lock:
OK, this patch was misguided; it makes things worse rather than better.
The real fix is to do proper READ_ONCE/WRITE_ONCE on the sk ops. That
is coming soon. Thanks for following up on it.
>
> =============================
> WARNING: suspicious RCU usage
> 5.5.0-04012-g38c811e4cd3c #443 Not tainted
> -----------------------------
> include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
>
> other info that might help us debug this:
[...]
>
> Easy to trigger on a VM with 1 vCPU, reproducer below.
When net-next opens we should add some tests to cover this; it's not
something I test often because in our use case the maps are rarely freed.
>
> Here's an idea how to change the locking. I'm still wrapping my head
> around what protects what in sock_map_free, so please bear with me:
>
> 1. synchronize_rcu before we iterate over the array is not needed,
> AFAICT. We are not free'ing the map just yet, hence any readers
> accessing the map via the psock are not in danger of use-after-free.
Agreed. We could have done this when we added 2bb90e5cc90e ("bpf: sockmap,
synchronize_rcu before free'ing map").
>
> 2. rcu_read_lock is needed to protect access to psock inside
> sock_map_unref, but we can't sleep while in RCU read-side. So push
> it down, after we grab the sock lock.
Yes, this looks better.
>
> 3. Grabbing stab->lock seems not needed, either. We get called from
> bpf_map_free_deferred, after map refcnt dropped to 0, so we're not
> racing with any other map user to modify its contents.
This I'll need to think on a bit. We have the link-lock there, so it
should probably be safe to drop. But I'll need to trace this through
git history to be sure.
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 2cbde385e1a0..7f54e0d27d32 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -259,9 +259,6 @@ static void sock_map_free(struct bpf_map *map)
> struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
> int i;
>
> - synchronize_rcu();
> - rcu_read_lock();
> - raw_spin_lock_bh(&stab->lock);
> for (i = 0; i < stab->map.max_entries; i++) {
> struct sock **psk = &stab->sks[i];
> struct sock *sk;
> @@ -269,12 +266,12 @@ static void sock_map_free(struct bpf_map *map)
> sk = xchg(psk, NULL);
> if (sk) {
> lock_sock(sk);
> + rcu_read_lock();
> sock_map_unref(sk, psk);
> + rcu_read_unlock();
> release_sock(sk);
> }
> }
> - raw_spin_unlock_bh(&stab->lock);
> - rcu_read_unlock();
>
> synchronize_rcu();
>
> > @@ -862,7 +865,9 @@ static void sock_hash_free(struct bpf_map *map)
> > raw_spin_lock_bh(&bucket->lock);
> > hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
> > hlist_del_rcu(&elem->node);
> > + lock_sock(elem->sk);
> > sock_map_unref(elem->sk, elem);
> > + release_sock(elem->sk);
> > }
> > raw_spin_unlock_bh(&bucket->lock);
> > }
>
> Similar problems here. With one extra that it seems we're missing a
> synchronize_rcu *after* walking over the htable for the same reason as
> it got added to sock_map_free in 2bb90e5cc90e ("bpf: sockmap,
> synchronize_rcu before free'ing map"):
>
> We need to have a synchronize_rcu before free'ing the sockmap because
> any outstanding psock references will have a pointer to the map and
> when they use this could trigger a use after free.
>
> WDYT?
Can you push the fix to bpf, but leave the stab->lock in place for now?
I think we can do a slightly better cleanup of stab->lock in bpf-next.
>
> Reproducer follows.
Could you push the reproducer into selftests?
Thanks.