All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
Date: Tue, 10 Mar 2020 12:30:42 +0100	[thread overview]
Message-ID: <8736agzbtp.fsf@cloudflare.com> (raw)
In-Reply-To: <5e3c6c7f8730e_22ad2af2cbd0a5b4a4@john-XPS-13-9370.notmuch>

On Thu, Feb 06, 2020 at 08:43 PM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> Couple of fixes that came from recent discussion [0] on commit
>> 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").
>>
>> This series doesn't address the sleeping while holding a spinlock
>> problem. We're still trying to decide how to fix that [1].
>>
>> Until then sockmap users might see the following warnings:
>>
>> | BUG: sleeping function called from invalid context at net/core/sock.c:2935
>> | in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
>> | 3 locks held by kworker/0:1/62:
>> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
>> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>> | Workqueue: events bpf_map_free_deferred
>> | Call Trace:
>> |  dump_stack+0x71/0xa0
>> |  ___might_sleep.cold+0xa6/0xb6
>> |  lock_sock_nested+0x28/0x90
>> |  sock_map_free+0x5f/0x180
>> |  bpf_map_free_deferred+0x58/0x80
>> |  process_one_work+0x260/0x5e0
>> |  worker_thread+0x4d/0x3e0
>> |  kthread+0x108/0x140
>> |  ? process_one_work+0x5e0/0x5e0
>> |  ? kthread_park+0x90/0x90
>> |  ret_from_fork+0x3a/0x50
>> |
>> | ======================================================
>> | WARNING: possible circular locking dependency detected
>> | 5.5.0-04008-g7b083332376e #454 Tainted: G        W
>> | ------------------------------------------------------
>> | kworker/0:1/62 is trying to acquire lock:
>> | ffff88813b280130 (sk_lock-AF_INET){+.+.}, at: sock_map_free+0x5f/0x180
>> |
>> | but task is already holding lock:
>> | ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> |
>> | which lock already depends on the new lock.
>> |
>> |
>> | the existing dependency chain (in reverse order) is:
>> |
>> | -> #1 (&stab->lock){+...}:
>> |        _raw_spin_lock_bh+0x39/0x80
>> |        sock_map_update_common+0xdc/0x300
>> |        sock_map_update_elem+0xc3/0x150
>> |        __do_sys_bpf+0x1285/0x1620
>> |        do_syscall_64+0x6d/0x690
>> |        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> |
>> | -> #0 (sk_lock-AF_INET){+.+.}:
>> |        __lock_acquire+0xe2f/0x19f0
>> |        lock_acquire+0x95/0x190
>> |        lock_sock_nested+0x6b/0x90
>> |        sock_map_free+0x5f/0x180
>> |        bpf_map_free_deferred+0x58/0x80
>> |        process_one_work+0x260/0x5e0
>> |        worker_thread+0x4d/0x3e0
>> |        kthread+0x108/0x140
>> |        ret_from_fork+0x3a/0x50
>> |
>> | other info that might help us debug this:
>> |
>> |  Possible unsafe locking scenario:
>> |
>> |        CPU0                    CPU1
>> |        ----                    ----
>> |   lock(&stab->lock);
>> |                                lock(sk_lock-AF_INET);
>> |                                lock(&stab->lock);
>> |   lock(sk_lock-AF_INET);
>> |
>> |  *** DEADLOCK ***
>> |
>> | 3 locks held by kworker/0:1/62:
>> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> |
>> | stack backtrace:
>> | CPU: 0 PID: 62 Comm: kworker/0:1 Tainted: G        W         5.5.0-04008-g7b083332376e #454
>> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>> | Workqueue: events bpf_map_free_deferred
>> | Call Trace:
>> |  dump_stack+0x71/0xa0
>> |  check_noncircular+0x176/0x190
>> |  __lock_acquire+0xe2f/0x19f0
>> |  lock_acquire+0x95/0x190
>> |  ? sock_map_free+0x5f/0x180
>> |  lock_sock_nested+0x6b/0x90
>> |  ? sock_map_free+0x5f/0x180
>> |  sock_map_free+0x5f/0x180
>> |  bpf_map_free_deferred+0x58/0x80
>> |  process_one_work+0x260/0x5e0
>> |  worker_thread+0x4d/0x3e0
>> |  kthread+0x108/0x140
>> |  ? process_one_work+0x5e0/0x5e0
>> |  ? kthread_park+0x90/0x90
>> |  ret_from_fork+0x3a/0x50
>
> Hi Jakub,
>
> Untested at the moment, but this should also be fine per your suggestion
> (if I read it correctly).  The reason we have stab->lock and bucket->locks
> here is to handle checking EEXIST in update/delete cases. We need to
> be careful that when an update happens and we check for EEXIST that the
> socket is added/removed during this check. So both map_update_common and
> sock_map_delete need to guard from being run together potentially deleting
> an entry we are checking, etc. But by the time we get here we just did
> a synchronize_rcu() in the line above so no updates/deletes should be
> in flight. So it seems safe to drop these locks because of the condition
> no updates in flight.
>
> So with patch below we keep the sync rcu but that is fine IMO these
> map free's are rare. Take a look and make sure it seems sane to you
> as well.
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index f36e13e577a3..1d56ec20330c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -233,8 +233,11 @@ static void sock_map_free(struct bpf_map *map)
>  	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
>  	int i;
>
> +	/* After the sync no updates or deletes will be in-flight so it
> +	 * is safe to walk map and remove entries without risking a race
> +	 * in EEXIST update case.
> +	 */
>  	synchronize_rcu();
> -	raw_spin_lock_bh(&stab->lock);
>  	for (i = 0; i < stab->map.max_entries; i++) {
>  		struct sock **psk = &stab->sks[i];
>  		struct sock *sk;
> @@ -248,7 +251,6 @@ static void sock_map_free(struct bpf_map *map)
>  			release_sock(sk);
>  		}
>  	}
> -	raw_spin_unlock_bh(&stab->lock);
>
>  	/* wait for psock readers accessing its map link */
>  	synchronize_rcu();
> @@ -859,10 +861,13 @@ static void sock_hash_free(struct bpf_map *map)
>  	struct hlist_node *node;
>  	int i;
>
> +	/* After the sync no updates or deletes will be in-flight so it
> +	 * is safe to walk hash and remove entries without risking a race
> +	 * in EEXIST update case.
> +	 */
>  	synchronize_rcu();
>  	for (i = 0; i < htab->buckets_num; i++) {
>  		bucket = sock_hash_select_bucket(htab, i);
> -		raw_spin_lock_bh(&bucket->lock);
>  		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
>  			hlist_del_rcu(&elem->node);
>  			lock_sock(elem->sk);
> @@ -871,7 +876,6 @@ static void sock_hash_free(struct bpf_map *map)
>  			rcu_read_unlock();
>  			release_sock(elem->sk);
>  		}
> -		raw_spin_unlock_bh(&bucket->lock);
>  	}
>
>  	/* wait for psock readers accessing its map link */

Hi John,

We would like to get rid of lockdep splats we are seeing in testing.

Mind if I submit the above fix for bpf-next on your behalf?

That is, of course, unless you have cycles to tend to it yourself.

Thanks,
-jkbs

  parent reply	other threads:[~2020-03-10 11:30 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-06 11:16 [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down Jakub Sitnicki
2020-02-06 11:16 ` [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down Jakub Sitnicki
2020-02-06 18:59   ` John Fastabend
2020-02-06 11:16 ` [PATCH bpf 2/3] bpf, sockhash: synchronize_rcu before free'ing map Jakub Sitnicki
2020-02-06 19:01   ` John Fastabend
2020-02-06 11:16 ` [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it Jakub Sitnicki
2020-02-06 19:03   ` John Fastabend
2020-02-09  2:41   ` Alexei Starovoitov
2020-02-09  4:12     ` Yonghong Song
2020-02-09 15:29     ` Jakub Sitnicki
2020-02-10  3:55       ` John Fastabend
2020-02-10 11:52         ` Jakub Sitnicki
2020-02-10 23:38           ` Daniel Borkmann
2020-02-06 19:43 ` [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down John Fastabend
2020-02-07 10:45   ` Jakub Sitnicki
2020-03-10 16:44     ` John Fastabend
2020-03-11 11:51       ` Jakub Sitnicki
2020-03-10 11:30   ` Jakub Sitnicki [this message]
2020-02-07 21:56 ` Daniel Borkmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8736agzbtp.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=bpf@vger.kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=kernel-team@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.