bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
	daniel@iogearbox.net, ast@kernel.org, song@kernel.org,
	jonathan.lemon@gmail.com
Subject: Re: [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down
Date: Thu, 06 Feb 2020 13:26:30 +0100	[thread overview]
Message-ID: <871rr7oqa1.fsf@cloudflare.com> (raw)
In-Reply-To: <5e3ba96ca7889_6b512aafe4b145b812@john-XPS-13-9370.notmuch>

On Thu, Feb 06, 2020 at 06:51 AM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Sat, Jan 11, 2020 at 07:12 AM CET, John Fastabend wrote:
>> > The sock_map_free() and sock_hash_free() paths used to delete sockmap
>> > and sockhash maps walk the maps and destroy psock and bpf state associated
>> > with the socks in the map. When done the socks no longer have BPF programs
>> > attached and will function normally. This can happen while the socks in
>> > the map are still "live" meaning data may be sent/received during the walk.
>> >
>> > Currently, though we don't take the sock_lock when the psock and bpf state
>> > is removed through this path. Specifically, this means we can be writing
>> > into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
>> > while they are also being called from the networking side. This is not
>> > safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
>> > believed it was safe. Further its not clear to me its even a good idea
>> > to try and do this on "live" sockets while networking side might also
>> > be using the socket. Instead of trying to reason about using the socks
>> > from both sides lets realize that every use case I'm aware of rarely
>> > deletes maps, in fact kubernetes/Cilium case builds map at init and
>> > never tears it down except on errors. So lets do the simple fix and
>> > grab sock lock.
>> >
>> > This patch wraps sock deletes from maps in sock lock and adds some
>> > annotations so we catch any other cases easier.
>> >
>> > Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
>> > Cc: stable@vger.kernel.org
>> > Acked-by: Song Liu <songliubraving@fb.com>
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>> >  net/core/skmsg.c    | 2 ++
>> >  net/core/sock_map.c | 7 ++++++-
>> >  2 files changed, 8 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> > index ded2d5227678..3866d7e20c07 100644
>> > --- a/net/core/skmsg.c
>> > +++ b/net/core/skmsg.c
>> > @@ -594,6 +594,8 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
>> >
>> >  void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>> >  {
>> > +	sock_owned_by_me(sk);
>> > +
>> >  	sk_psock_cork_free(psock);
>> >  	sk_psock_zap_ingress(psock);
>> >
>> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> > index eb114ee419b6..8998e356f423 100644
>> > --- a/net/core/sock_map.c
>> > +++ b/net/core/sock_map.c
>> > @@ -241,8 +241,11 @@ static void sock_map_free(struct bpf_map *map)
>> >  		struct sock *sk;
>> >
>> >  		sk = xchg(psk, NULL);
>> > -		if (sk)
>> > +		if (sk) {
>> > +			lock_sock(sk);
>> >  			sock_map_unref(sk, psk);
>> > +			release_sock(sk);
>> > +		}
>> >  	}
>> >  	raw_spin_unlock_bh(&stab->lock);
>> >  	rcu_read_unlock();
>>
>> John, I've noticed this is triggering warnings that we might sleep in
>> lock_sock while (1) in RCU read-side section, and (2) holding a spin
>> lock:

[...]

>>
>> Here's an idea how to change the locking. I'm still wrapping my head
>> around what protects what in sock_map_free, so please bear with me:
>>
>> 1. synchronize_rcu before we iterate over the array is not needed,
>>    AFAICT. We are not free'ing the map just yet, hence any readers
>>    accessing the map via the psock are not in danger of use-after-free.
>
> Agreed. When we added 2bb90e5cc90e ("bpf: sockmap, synchronize_rcu before
> free'ing map") we could have done this.
>
>>
>> 2. rcu_read_lock is needed to protect access to psock inside
>>    sock_map_unref, but we can't sleep while in RCU read-side.  So push
>>    it down, after we grab the sock lock.
>
> yes this looks better.
>
>>
>> 3. Grabbing stab->lock seems not needed, either. We get called from
>>    bpf_map_free_deferred, after map refcnt dropped to 0, so we're not
>>    racing with any other map user to modify its contents.
>
> This I'll need to think on a bit. We have the link-lock there so
> probably should be safe to drop. But will need to trace this through
> git history to be sure.
>

[...]

>> WDYT?
>
> Can you push the fix to bpf but leave the stab->lock for now. I think
> we can do a slightly better cleanup on stab->lock in bpf-next.

Here it is:

https://lore.kernel.org/bpf/20200206111652.694507-1-jakub@cloudflare.com/T/#t

I left the "extra" synchronize_rcu before walking the map. On second
thought, this isn't a bug. Just adds extra wait. bpf-next material?

>
>>
>> Reproducer follows.
>
> push reproducer into selftests?

Included the reproducer with the fixes. If it gets dropped from the
series, I'll resubmit it once bpf-next reopens.

  reply	other threads:[~2020-02-06 12:26 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
2020-01-11  6:11 ` [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
2020-01-13 13:29   ` Jakub Sitnicki
2020-01-14  3:19     ` John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down John Fastabend
2020-02-05 17:55   ` Jakub Sitnicki
2020-02-06  5:51     ` John Fastabend
2020-02-06 12:26       ` Jakub Sitnicki [this message]
2020-02-06 19:04         ` John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
2020-01-12  1:00   ` Jonathan Lemon
2020-01-13 13:59   ` Jakub Sitnicki
2020-01-11  6:12 ` [bpf PATCH v2 4/8] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 5/8] bpf: sockmap/tls, msg_push_data may leave end mark in place John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 6/8] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 7/8] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 8/8] bpf: sockmap/tls, fix pop data with SK_DROP return code John Fastabend
2020-01-15 22:37 ` [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs Daniel Borkmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871rr7oqa1.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=john.fastabend@gmail.com \
    --cc=jonathan.lemon@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=song@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).