bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Fastabend <john.fastabend@gmail.com>
To: Jakub Sitnicki <jakub@cloudflare.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	John Fastabend <john.fastabend@gmail.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [bpf PATCH] bpf: sockmap, remove bucket->lock from sock_{hash|map}_free
Date: Wed, 03 Jun 2020 11:35:41 -0700	[thread overview]
Message-ID: <5ed7ed7d315bd_36aa2ab64b3c85bcd9@john-XPS-13-9370.notmuch> (raw)
In-Reply-To: <87a71k2yje.fsf@cloudflare.com>

Jakub Sitnicki wrote:
> On Wed, Jun 03, 2020 at 08:13 AM CEST, Eric Dumazet wrote:
> > On 3/10/20 9:41 AM, John Fastabend wrote:
> >> The bucket->lock is not needed in the sock_hash_free and sock_map_free
> >> calls, in fact it is causing a splat due to being inside rcu block.
> >>

[...]

>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> >> index 085cef5..b70c844 100644
> >> --- a/net/core/sock_map.c
> >> +++ b/net/core/sock_map.c
> >> @@ -233,8 +233,11 @@ static void sock_map_free(struct bpf_map *map)
> >>  	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
> >>  	int i;
> >>
> >> +	/* After the sync no updates or deletes will be in-flight so it
> >> +	 * is safe to walk map and remove entries without risking a race
> >> +	 * in EEXIST update case.
> >
> >
> > What prevents other cpus from deleting stuff in sock_hash_delete_elem() ?
> >
> > What state has been changed before the synchronize_rcu() call here,
> > that other cpus check before attempting a delete ?
> >
> > Typically, synchronize_rcu() only makes sense if readers can not start a new cycle.
> >
> > A possible fix would be to check in sock_hash_delete_elem() (and possibly others methods)
> > if map->refcnt is not zero.
> >
> > syzbot found : (no repro yet)
> >
> > general protection fault, probably for non-canonical address 0xfbd59c0000000024: 0000 [#1] PREEMPT SMP KASAN
> > KASAN: maybe wild-memory-access in range [0xdead000000000120-0xdead000000000127]
> > CPU: 2 PID: 14305 Comm: kworker/2:3 Not tainted 5.7.0-syzkaller #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > Workqueue: events bpf_map_free_deferred
> > RIP: 0010:__write_once_size include/linux/compiler.h:279 [inline]
> > RIP: 0010:__hlist_del include/linux/list.h:811 [inline]
> > RIP: 0010:hlist_del_rcu include/linux/rculist.h:485 [inline]
> > RIP: 0010:sock_hash_free+0x202/0x4a0 net/core/sock_map.c:1021
> > Code: 0f 85 15 02 00 00 4c 8d 7b 28 4c 8b 63 20 4c 89 f8 48 c1 e8 03 80 3c 28 00 0f 85 47 02 00 00 4c 8b 6b 28 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 25 02 00 00 4d 85 e4 4d 89 65 00 74 20 e8 f6 82

[...]

Thanks Eric.

> My initial reasoning behind the change was that sock_hash_delete_elem()
> callers hold a ref to sockhash [0]. Either because there is an open FD
> for the map, or the map is in use by loaded BPF program. The same
> applies to updates.
> 
> If that holds, map->refcnt is > 0, and we should not see the map being
> freed at the same time as sock_hash_delete_elem() happens.
> 
> But then there is also sock_hash_delete_from_link() that deletes from
> sockhash when a sock/psock unlinks itself from the map. This operation
> happens without holding a ref to the map, so that sockets won't keep the
> map "alive". There is no corresponding *_update_from_link() for updates
> without holding a ref.

Yep we missed this case :/

> 
> Sadly, I can't spot anything preventing list mutation, hlist_del_rcu(),
> from happening both in sock_hash_delete_elem() and sock_hash_free()
> concurrently, now that the bucket spin-lock doesn't protect it any
> more. That is what I understand syzbot is reporting.

Agreed.

> 
> synchronize_rcu() before we walk the htable doesn't rule it out, because
> as you point out, new readers can start a new cycle, and we don't change
> any state that would signal that the map is going away.
> 
> I'm not sure that the check for map->refcnt when sock is unlinking
> itself from the map will do it. I worry we will then have issues when
> sockhash is unlinking itself from socks (so the other way around) in
> sock_hash_free(). We could no longer assume that the sock & psock
> exists.
> 
> What comes to mind is to reintroduce the spin-lock protected critical
> section in sock_hash_free(), but delay the processing of sockets to be
> unlinked from sockhash. We could grab a ref to sk_psock while holding a
> spin-lock and unlink it while no longer in atomic critical section.

It seems so. In sock_hash_free we logically need,

 for (i = 0; i < htab->buckets_num; i++) {
  hlist_for_each_entryy_safe(...) {
  	hlist_del_rcu() <- detached from bucket and no longer reachable
        synchronize_rcu()
        // now element can not be reached from unhash()
	... sock_map_unref(elem->sk, elem) ...
  }
 } 

We don't actually want to stick a synchronize_rcu() in that loop
so I agree we need to collect the elements do a sync then remove them.

> 
> Either way, Eric, thank you for the report and the pointers.

+1

> 
> John, WDYT?

Want to give it a try? Or I can draft something.

> 
> [0] https://lore.kernel.org/netdev/8736boor55.fsf@cloudflare.com/

[...]

  reply	other threads:[~2020-06-03 18:35 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-10 16:41 [bpf PATCH] bpf: sockmap, remove bucket->lock from sock_{hash|map}_free John Fastabend
2020-03-11 13:10 ` Daniel Borkmann
2020-06-03  6:13 ` Eric Dumazet
2020-06-03 11:12   ` Jakub Sitnicki
2020-06-03 18:35     ` John Fastabend [this message]
2020-06-03 20:39       ` Jakub Sitnicki
2020-06-03 22:51         ` John Fastabend

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5ed7ed7d315bd_36aa2ab64b3c85bcd9@john-XPS-13-9370.notmuch \
    --to=john.fastabend@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=eric.dumazet@gmail.com \
    --cc=jakub@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).