bpf.vger.kernel.org archive mirror
* [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
@ 2020-02-06 11:16 Jakub Sitnicki
  2020-02-06 11:16 ` [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down Jakub Sitnicki
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2020-02-06 11:16 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend

A couple of fixes that came from a recent discussion [0] on commit
7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").

This series doesn't address the sleeping while holding a spinlock
problem. We're still trying to decide how to fix that [1].

Until then sockmap users might see the following warnings:

| BUG: sleeping function called from invalid context at net/core/sock.c:2935
| in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
| 3 locks held by kworker/0:1/62:
|  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
| CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
|  dump_stack+0x71/0xa0
|  ___might_sleep.cold+0xa6/0xb6
|  lock_sock_nested+0x28/0x90
|  sock_map_free+0x5f/0x180
|  bpf_map_free_deferred+0x58/0x80
|  process_one_work+0x260/0x5e0
|  worker_thread+0x4d/0x3e0
|  kthread+0x108/0x140
|  ? process_one_work+0x5e0/0x5e0
|  ? kthread_park+0x90/0x90
|  ret_from_fork+0x3a/0x50
|
| ======================================================
| WARNING: possible circular locking dependency detected
| 5.5.0-04008-g7b083332376e #454 Tainted: G        W
| ------------------------------------------------------
| kworker/0:1/62 is trying to acquire lock:
| ffff88813b280130 (sk_lock-AF_INET){+.+.}, at: sock_map_free+0x5f/0x180
|
| but task is already holding lock:
| ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
|
| which lock already depends on the new lock.
|
|
| the existing dependency chain (in reverse order) is:
|
| -> #1 (&stab->lock){+...}:
|        _raw_spin_lock_bh+0x39/0x80
|        sock_map_update_common+0xdc/0x300
|        sock_map_update_elem+0xc3/0x150
|        __do_sys_bpf+0x1285/0x1620
|        do_syscall_64+0x6d/0x690
|        entry_SYSCALL_64_after_hwframe+0x49/0xbe
|
| -> #0 (sk_lock-AF_INET){+.+.}:
|        __lock_acquire+0xe2f/0x19f0
|        lock_acquire+0x95/0x190
|        lock_sock_nested+0x6b/0x90
|        sock_map_free+0x5f/0x180
|        bpf_map_free_deferred+0x58/0x80
|        process_one_work+0x260/0x5e0
|        worker_thread+0x4d/0x3e0
|        kthread+0x108/0x140
|        ret_from_fork+0x3a/0x50
|
| other info that might help us debug this:
|
|  Possible unsafe locking scenario:
|
|        CPU0                    CPU1
|        ----                    ----
|   lock(&stab->lock);
|                                lock(sk_lock-AF_INET);
|                                lock(&stab->lock);
|   lock(sk_lock-AF_INET);
|
|  *** DEADLOCK ***
|
| 3 locks held by kworker/0:1/62:
|  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
|
| stack backtrace:
| CPU: 0 PID: 62 Comm: kworker/0:1 Tainted: G        W         5.5.0-04008-g7b083332376e #454
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
|  dump_stack+0x71/0xa0
|  check_noncircular+0x176/0x190
|  __lock_acquire+0xe2f/0x19f0
|  lock_acquire+0x95/0x190
|  ? sock_map_free+0x5f/0x180
|  lock_sock_nested+0x6b/0x90
|  ? sock_map_free+0x5f/0x180
|  sock_map_free+0x5f/0x180
|  bpf_map_free_deferred+0x58/0x80
|  process_one_work+0x260/0x5e0
|  worker_thread+0x4d/0x3e0
|  kthread+0x108/0x140
|  ? process_one_work+0x5e0/0x5e0
|  ? kthread_park+0x90/0x90
|  ret_from_fork+0x3a/0x50

[0] https://lore.kernel.org/bpf/8736boor55.fsf@cloudflare.com/
[1] https://lore.kernel.org/bpf/5e3ba96ca7889_6b512aafe4b145b812@john-XPS-13-9370.notmuch/


Jakub Sitnicki (3):
  bpf, sockmap: Don't sleep while holding RCU lock on tear-down
  bpf, sockhash: synchronize_rcu before free'ing map
  selftests/bpf: Test freeing sockmap/sockhash with a socket in it

 net/core/sock_map.c                           | 12 ++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 74 +++++++++++++++++++
 2 files changed, 82 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_basic.c

-- 
2.24.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down
  2020-02-06 11:16 [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down Jakub Sitnicki
@ 2020-02-06 11:16 ` Jakub Sitnicki
  2020-02-06 18:59   ` John Fastabend
  2020-02-06 11:16 ` [PATCH bpf 2/3] bpf, sockhash: synchronize_rcu before free'ing map Jakub Sitnicki
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Jakub Sitnicki @ 2020-02-06 11:16 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend

rcu_read_lock is needed to protect access to psock inside sock_map_unref
when tearing down the map. However, we can't afford to sleep in lock_sock
while in RCU read-side critical section. Grab the RCU lock only after we
have locked the socket.

This fixes RCU warnings triggerable on a VM with 1 vCPU when free'ing a
sockmap/sockhash that contains at least one socket:

| =============================
| WARNING: suspicious RCU usage
| 5.5.0-04005-g8fc91b972b73 #450 Not tainted
| -----------------------------
| include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
|
| other info that might help us debug this:
|
|
| rcu_scheduler_active = 2, debug_locks = 1
| 4 locks held by kworker/0:1/62:
|  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170
|  #3: ffff8881368c5df8 (&stab->lock){+...}, at: sock_map_free+0x64/0x170
|
| stack backtrace:
| CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73 #450
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
|  dump_stack+0x71/0xa0
|  ___might_sleep+0x105/0x190
|  lock_sock_nested+0x28/0x90
|  sock_map_free+0x95/0x170
|  bpf_map_free_deferred+0x58/0x80
|  process_one_work+0x260/0x5e0
|  worker_thread+0x4d/0x3e0
|  kthread+0x108/0x140
|  ? process_one_work+0x5e0/0x5e0
|  ? kthread_park+0x90/0x90
|  ret_from_fork+0x3a/0x50

| =============================
| WARNING: suspicious RCU usage
| 5.5.0-04005-g8fc91b972b73-dirty #452 Not tainted
| -----------------------------
| include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
|
| other info that might help us debug this:
|
|
| rcu_scheduler_active = 2, debug_locks = 1
| 4 locks held by kworker/0:1/62:
|  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_hash_free+0x5/0x1d0
|  #3: ffff888139966e00 (&htab->buckets[i].lock){+...}, at: sock_hash_free+0x92/0x1d0
|
| stack backtrace:
| CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73-dirty #452
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
|  dump_stack+0x71/0xa0
|  ___might_sleep+0x105/0x190
|  lock_sock_nested+0x28/0x90
|  sock_hash_free+0xec/0x1d0
|  bpf_map_free_deferred+0x58/0x80
|  process_one_work+0x260/0x5e0
|  worker_thread+0x4d/0x3e0
|  kthread+0x108/0x140
|  ? process_one_work+0x5e0/0x5e0
|  ? kthread_park+0x90/0x90
|  ret_from_fork+0x3a/0x50

Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 8998e356f423..fd8b426dbdf3 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -234,7 +234,6 @@ static void sock_map_free(struct bpf_map *map)
 	int i;
 
 	synchronize_rcu();
-	rcu_read_lock();
 	raw_spin_lock_bh(&stab->lock);
 	for (i = 0; i < stab->map.max_entries; i++) {
 		struct sock **psk = &stab->sks[i];
@@ -243,12 +242,13 @@ static void sock_map_free(struct bpf_map *map)
 		sk = xchg(psk, NULL);
 		if (sk) {
 			lock_sock(sk);
+			rcu_read_lock();
 			sock_map_unref(sk, psk);
+			rcu_read_unlock();
 			release_sock(sk);
 		}
 	}
 	raw_spin_unlock_bh(&stab->lock);
-	rcu_read_unlock();
 
 	synchronize_rcu();
 
@@ -859,19 +859,19 @@ static void sock_hash_free(struct bpf_map *map)
 	int i;
 
 	synchronize_rcu();
-	rcu_read_lock();
 	for (i = 0; i < htab->buckets_num; i++) {
 		bucket = sock_hash_select_bucket(htab, i);
 		raw_spin_lock_bh(&bucket->lock);
 		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
 			hlist_del_rcu(&elem->node);
 			lock_sock(elem->sk);
+			rcu_read_lock();
 			sock_map_unref(elem->sk, elem);
+			rcu_read_unlock();
 			release_sock(elem->sk);
 		}
 		raw_spin_unlock_bh(&bucket->lock);
 	}
-	rcu_read_unlock();
 
 	bpf_map_area_free(htab->buckets);
 	kfree(htab);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf 2/3] bpf, sockhash: synchronize_rcu before free'ing map
  2020-02-06 11:16 [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down Jakub Sitnicki
  2020-02-06 11:16 ` [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down Jakub Sitnicki
@ 2020-02-06 11:16 ` Jakub Sitnicki
  2020-02-06 19:01   ` John Fastabend
  2020-02-06 11:16 ` [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it Jakub Sitnicki
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Jakub Sitnicki @ 2020-02-06 11:16 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend

We need a synchronize_rcu before free'ing the sockhash because any
outstanding psock references still hold a pointer to the map, and using
that pointer after the free would trigger a use-after-free.

This is a sister fix for sockhash, found by manual audit, following
commit 2bb90e5cc90e ("bpf: sockmap, synchronize_rcu before free'ing
map") which addressed sockmap.
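
The ordering this patch enforces can be summarized as follows (a
simplified sketch of the tear-down path, not the literal code):

```
sock_hash_free(map):
    for each bucket:
        unlink and unref every element   /* psock drops its link to map */
    synchronize_rcu()                    /* wait out any reader that may
                                            still dereference the map
                                            through its psock */
    bpf_map_area_free(htab->buckets)     /* only now is freeing safe */
    kfree(htab)
```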

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index fd8b426dbdf3..f36e13e577a3 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -250,6 +250,7 @@ static void sock_map_free(struct bpf_map *map)
 	}
 	raw_spin_unlock_bh(&stab->lock);
 
+	/* wait for psock readers accessing its map link */
 	synchronize_rcu();
 
 	bpf_map_area_free(stab->sks);
@@ -873,6 +874,9 @@ static void sock_hash_free(struct bpf_map *map)
 		raw_spin_unlock_bh(&bucket->lock);
 	}
 
+	/* wait for psock readers accessing its map link */
+	synchronize_rcu();
+
 	bpf_map_area_free(htab->buckets);
 	kfree(htab);
 }
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  2020-02-06 11:16 [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down Jakub Sitnicki
  2020-02-06 11:16 ` [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down Jakub Sitnicki
  2020-02-06 11:16 ` [PATCH bpf 2/3] bpf, sockhash: synchronize_rcu before free'ing map Jakub Sitnicki
@ 2020-02-06 11:16 ` Jakub Sitnicki
  2020-02-06 19:03   ` John Fastabend
  2020-02-09  2:41   ` Alexei Starovoitov
  2020-02-06 19:43 ` [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down John Fastabend
  2020-02-07 21:56 ` Daniel Borkmann
  4 siblings, 2 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2020-02-06 11:16 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend

Commit 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear
down") introduced sleeping issues inside RCU critical sections and while
holding a spinlock on sockmap/sockhash tear-down. There has to be at least
one socket in the map for the problem to surface.

This adds a test that triggers the warnings for broken locking rules. Not a
fix per se, but rather tooling to verify the accompanying fixes. Run on a
VM with 1 vCPU to reproduce the warnings.

Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 74 +++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_basic.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
new file mode 100644
index 000000000000..07f5b462c2ef
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Cloudflare
+
+#include "test_progs.h"
+
+static int connected_socket_v4(void)
+{
+	struct sockaddr_in addr = {
+		.sin_family = AF_INET,
+		.sin_port = htons(80),
+		.sin_addr = { inet_addr("127.0.0.1") },
+	};
+	socklen_t len = sizeof(addr);
+	int s, repair, err;
+
+	s = socket(AF_INET, SOCK_STREAM, 0);
+	if (CHECK_FAIL(s == -1))
+		goto error;
+
+	repair = TCP_REPAIR_ON;
+	err = setsockopt(s, SOL_TCP, TCP_REPAIR, &repair, sizeof(repair));
+	if (CHECK_FAIL(err))
+		goto error;
+
+	err = connect(s, (struct sockaddr *)&addr, len);
+	if (CHECK_FAIL(err))
+		goto error;
+
+	repair = TCP_REPAIR_OFF_NO_WP;
+	err = setsockopt(s, SOL_TCP, TCP_REPAIR, &repair, sizeof(repair));
+	if (CHECK_FAIL(err))
+		goto error;
+
+	return s;
+error:
+	perror(__func__);
+	close(s);
+	return -1;
+}
+
+/* Create a map, populate it with one socket, and free the map. */
+static void test_sockmap_create_update_free(enum bpf_map_type map_type)
+{
+	const int zero = 0;
+	int s, map, err;
+
+	s = connected_socket_v4();
+	if (CHECK_FAIL(s == -1))
+		return;
+
+	map = bpf_create_map(map_type, sizeof(int), sizeof(int), 1, 0);
+	if (CHECK_FAIL(map == -1)) {
+		perror("bpf_create_map");
+		goto out;
+	}
+
+	err = bpf_map_update_elem(map, &zero, &s, BPF_NOEXIST);
+	if (CHECK_FAIL(err)) {
+		perror("bpf_map_update");
+		goto out;
+	}
+
+out:
+	close(map);
+	close(s);
+}
+
+void test_sockmap_basic(void)
+{
+	if (test__start_subtest("sockmap create_update_free"))
+		test_sockmap_create_update_free(BPF_MAP_TYPE_SOCKMAP);
+	if (test__start_subtest("sockhash create_update_free"))
+		test_sockmap_create_update_free(BPF_MAP_TYPE_SOCKHASH);
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* RE: [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down
  2020-02-06 11:16 ` [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down Jakub Sitnicki
@ 2020-02-06 18:59   ` John Fastabend
  0 siblings, 0 replies; 19+ messages in thread
From: John Fastabend @ 2020-02-06 18:59 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend

Jakub Sitnicki wrote:
> rcu_read_lock is needed to protect access to psock inside sock_map_unref
> when tearing down the map. However, we can't afford to sleep in lock_sock
> while in RCU read-side critical section. Grab the RCU lock only after we
> have locked the socket.
> 
> This fixes RCU warnings triggerable on a VM with 1 vCPU when free'ing a
> sockmap/sockhash that contains at least one socket:
> 
> | =============================
> | WARNING: suspicious RCU usage
> | 5.5.0-04005-g8fc91b972b73 #450 Not tainted
> | -----------------------------
> | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
> |
> | other info that might help us debug this:
> |
> |
> | rcu_scheduler_active = 2, debug_locks = 1
> | 4 locks held by kworker/0:1/62:
> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
> |  #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170
> |  #3: ffff8881368c5df8 (&stab->lock){+...}, at: sock_map_free+0x64/0x170
> |
> | stack backtrace:
> | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73 #450
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> | Workqueue: events bpf_map_free_deferred
> | Call Trace:
> |  dump_stack+0x71/0xa0
> |  ___might_sleep+0x105/0x190
> |  lock_sock_nested+0x28/0x90
> |  sock_map_free+0x95/0x170
> |  bpf_map_free_deferred+0x58/0x80
> |  process_one_work+0x260/0x5e0
> |  worker_thread+0x4d/0x3e0
> |  kthread+0x108/0x140
> |  ? process_one_work+0x5e0/0x5e0
> |  ? kthread_park+0x90/0x90
> |  ret_from_fork+0x3a/0x50
> 
> | =============================
> | WARNING: suspicious RCU usage
> | 5.5.0-04005-g8fc91b972b73-dirty #452 Not tainted
> | -----------------------------
> | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
> |
> | other info that might help us debug this:
> |
> |
> | rcu_scheduler_active = 2, debug_locks = 1
> | 4 locks held by kworker/0:1/62:
> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
> |  #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_hash_free+0x5/0x1d0
> |  #3: ffff888139966e00 (&htab->buckets[i].lock){+...}, at: sock_hash_free+0x92/0x1d0
> |
> | stack backtrace:
> | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73-dirty #452
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> | Workqueue: events bpf_map_free_deferred
> | Call Trace:
> |  dump_stack+0x71/0xa0
> |  ___might_sleep+0x105/0x190
> |  lock_sock_nested+0x28/0x90
> |  sock_hash_free+0xec/0x1d0
> |  bpf_map_free_deferred+0x58/0x80
> |  process_one_work+0x260/0x5e0
> |  worker_thread+0x4d/0x3e0
> |  kthread+0x108/0x140
> |  ? process_one_work+0x5e0/0x5e0
> |  ? kthread_park+0x90/0x90
> |  ret_from_fork+0x3a/0x50
> 
> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/core/sock_map.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)

Thanks! Fixes for fixes, agh.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH bpf 2/3] bpf, sockhash: synchronize_rcu before free'ing map
  2020-02-06 11:16 ` [PATCH bpf 2/3] bpf, sockhash: synchronize_rcu before free'ing map Jakub Sitnicki
@ 2020-02-06 19:01   ` John Fastabend
  0 siblings, 0 replies; 19+ messages in thread
From: John Fastabend @ 2020-02-06 19:01 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend

Jakub Sitnicki wrote:
> We need a synchronize_rcu before free'ing the sockhash because any
> outstanding psock references still hold a pointer to the map, and using
> that pointer after the free would trigger a use-after-free.
> 
> This is a sister fix for sockhash, found by manual audit, following
> commit 2bb90e5cc90e ("bpf: sockmap, synchronize_rcu before free'ing
> map") which addressed sockmap.
> 
> Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/core/sock_map.c | 4 ++++
>  1 file changed, 4 insertions(+)

Nice catch, thanks. As far as I know I've never seen this happen, but
let's get this fixed.

Acked-by: John Fastabend <john.fastabend@gmail.com>

> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index fd8b426dbdf3..f36e13e577a3 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -250,6 +250,7 @@ static void sock_map_free(struct bpf_map *map)
>  	}
>  	raw_spin_unlock_bh(&stab->lock);
>  
> +	/* wait for psock readers accessing its map link */
>  	synchronize_rcu();
>  
>  	bpf_map_area_free(stab->sks);
> @@ -873,6 +874,9 @@ static void sock_hash_free(struct bpf_map *map)
>  		raw_spin_unlock_bh(&bucket->lock);
>  	}
>  
> +	/* wait for psock readers accessing its map link */
> +	synchronize_rcu();
> +
>  	bpf_map_area_free(htab->buckets);
>  	kfree(htab);
>  }
> -- 
> 2.24.1
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  2020-02-06 11:16 ` [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it Jakub Sitnicki
@ 2020-02-06 19:03   ` John Fastabend
  2020-02-09  2:41   ` Alexei Starovoitov
  1 sibling, 0 replies; 19+ messages in thread
From: John Fastabend @ 2020-02-06 19:03 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend

Jakub Sitnicki wrote:
> Commit 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear
> down") introduced sleeping issues inside RCU critical sections and while
> holding a spinlock on sockmap/sockhash tear-down. There has to be at least
> one socket in the map for the problem to surface.
> 
> This adds a test that triggers the warnings for broken locking rules. Not a
> fix per se, but rather tooling to verify the accompanying fixes. Run on a
> VM with 1 vCPU to reproduce the warnings.
> 
> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  .../selftests/bpf/prog_tests/sockmap_basic.c  | 74 +++++++++++++++++++
>  1 file changed, 74 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
> new file mode 100644
> index 000000000000..07f5b462c2ef
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c

Yes! This helps a lot; now we will get some coverage on these cases.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
  2020-02-06 11:16 [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down Jakub Sitnicki
                   ` (2 preceding siblings ...)
  2020-02-06 11:16 ` [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it Jakub Sitnicki
@ 2020-02-06 19:43 ` John Fastabend
  2020-02-07 10:45   ` Jakub Sitnicki
  2020-03-10 11:30   ` Jakub Sitnicki
  2020-02-07 21:56 ` Daniel Borkmann
  4 siblings, 2 replies; 19+ messages in thread
From: John Fastabend @ 2020-02-06 19:43 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend

Jakub Sitnicki wrote:
> A couple of fixes that came from a recent discussion [0] on commit
> 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").
> 
> This series doesn't address the sleeping while holding a spinlock
> problem. We're still trying to decide how to fix that [1].
> 
> Until then sockmap users might see the following warnings:
> 
> | BUG: sleeping function called from invalid context at net/core/sock.c:2935
> | in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
> | 3 locks held by kworker/0:1/62:
> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
> |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
> | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> | Workqueue: events bpf_map_free_deferred
> | Call Trace:
> |  dump_stack+0x71/0xa0
> |  ___might_sleep.cold+0xa6/0xb6
> |  lock_sock_nested+0x28/0x90
> |  sock_map_free+0x5f/0x180
> |  bpf_map_free_deferred+0x58/0x80
> |  process_one_work+0x260/0x5e0
> |  worker_thread+0x4d/0x3e0
> |  kthread+0x108/0x140
> |  ? process_one_work+0x5e0/0x5e0
> |  ? kthread_park+0x90/0x90
> |  ret_from_fork+0x3a/0x50
> |
> | ======================================================
> | WARNING: possible circular locking dependency detected
> | 5.5.0-04008-g7b083332376e #454 Tainted: G        W
> | ------------------------------------------------------
> | kworker/0:1/62 is trying to acquire lock:
> | ffff88813b280130 (sk_lock-AF_INET){+.+.}, at: sock_map_free+0x5f/0x180
> |
> | but task is already holding lock:
> | ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
> |
> | which lock already depends on the new lock.
> |
> |
> | the existing dependency chain (in reverse order) is:
> |
> | -> #1 (&stab->lock){+...}:
> |        _raw_spin_lock_bh+0x39/0x80
> |        sock_map_update_common+0xdc/0x300
> |        sock_map_update_elem+0xc3/0x150
> |        __do_sys_bpf+0x1285/0x1620
> |        do_syscall_64+0x6d/0x690
> |        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> |
> | -> #0 (sk_lock-AF_INET){+.+.}:
> |        __lock_acquire+0xe2f/0x19f0
> |        lock_acquire+0x95/0x190
> |        lock_sock_nested+0x6b/0x90
> |        sock_map_free+0x5f/0x180
> |        bpf_map_free_deferred+0x58/0x80
> |        process_one_work+0x260/0x5e0
> |        worker_thread+0x4d/0x3e0
> |        kthread+0x108/0x140
> |        ret_from_fork+0x3a/0x50
> |
> | other info that might help us debug this:
> |
> |  Possible unsafe locking scenario:
> |
> |        CPU0                    CPU1
> |        ----                    ----
> |   lock(&stab->lock);
> |                                lock(sk_lock-AF_INET);
> |                                lock(&stab->lock);
> |   lock(sk_lock-AF_INET);
> |
> |  *** DEADLOCK ***
> |
> | 3 locks held by kworker/0:1/62:
> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
> |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
> |
> | stack backtrace:
> | CPU: 0 PID: 62 Comm: kworker/0:1 Tainted: G        W         5.5.0-04008-g7b083332376e #454
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> | Workqueue: events bpf_map_free_deferred
> | Call Trace:
> |  dump_stack+0x71/0xa0
> |  check_noncircular+0x176/0x190
> |  __lock_acquire+0xe2f/0x19f0
> |  lock_acquire+0x95/0x190
> |  ? sock_map_free+0x5f/0x180
> |  lock_sock_nested+0x6b/0x90
> |  ? sock_map_free+0x5f/0x180
> |  sock_map_free+0x5f/0x180
> |  bpf_map_free_deferred+0x58/0x80
> |  process_one_work+0x260/0x5e0
> |  worker_thread+0x4d/0x3e0
> |  kthread+0x108/0x140
> |  ? process_one_work+0x5e0/0x5e0
> |  ? kthread_park+0x90/0x90
> |  ret_from_fork+0x3a/0x50

Hi Jakub,

Untested at the moment, but this should also be fine per your suggestion
(if I read it correctly). The reason we have stab->lock and bucket->locks
here is to handle checking EEXIST in the update/delete cases. We need to
be careful that when an update checks for EEXIST, the socket is not
added/removed under us during that check. So both map_update_common and
sock_map_delete need to guard against running concurrently and
potentially deleting an entry we are checking, etc. But by the time we
get here we have just done a synchronize_rcu() on the line above, so no
updates/deletes should be in flight, and it seems safe to drop these
locks on the condition that no updates are in flight.

So with the patch below we keep the sync rcu, but that is fine IMO since
map frees are rare. Take a look and make sure it seems sane to you
as well.

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index f36e13e577a3..1d56ec20330c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -233,8 +233,11 @@ static void sock_map_free(struct bpf_map *map)
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 	int i;
 
+	/* After the sync no updates or deletes will be in-flight so it
+	 * is safe to walk map and remove entries without risking a race
+	 * in EEXIST update case.
+	 */
 	synchronize_rcu();
-	raw_spin_lock_bh(&stab->lock);
 	for (i = 0; i < stab->map.max_entries; i++) {
 		struct sock **psk = &stab->sks[i];
 		struct sock *sk;
@@ -248,7 +251,6 @@ static void sock_map_free(struct bpf_map *map)
 			release_sock(sk);
 		}
 	}
-	raw_spin_unlock_bh(&stab->lock);
 
 	/* wait for psock readers accessing its map link */
 	synchronize_rcu();
@@ -859,10 +861,13 @@ static void sock_hash_free(struct bpf_map *map)
 	struct hlist_node *node;
 	int i;
 
+	/* After the sync no updates or deletes will be in-flight so it
+	 * is safe to walk hash and remove entries without risking a race
+	 * in EEXIST update case.
+	 */
 	synchronize_rcu();
 	for (i = 0; i < htab->buckets_num; i++) {
 		bucket = sock_hash_select_bucket(htab, i);
-		raw_spin_lock_bh(&bucket->lock);
 		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
 			hlist_del_rcu(&elem->node);
 			lock_sock(elem->sk);
@@ -871,7 +876,6 @@ static void sock_hash_free(struct bpf_map *map)
 			rcu_read_unlock();
 			release_sock(elem->sk);
 		}
-		raw_spin_unlock_bh(&bucket->lock);
 	}
 
 	/* wait for psock readers accessing its map link */

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
  2020-02-06 19:43 ` [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down John Fastabend
@ 2020-02-07 10:45   ` Jakub Sitnicki
  2020-03-10 16:44     ` John Fastabend
  2020-03-10 11:30   ` Jakub Sitnicki
  1 sibling, 1 reply; 19+ messages in thread
From: Jakub Sitnicki @ 2020-02-07 10:45 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, kernel-team

On Thu, Feb 06, 2020 at 08:43 PM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> A couple of fixes that came from a recent discussion [0] on commit
>> 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").
>>
>> This series doesn't address the sleeping while holding a spinlock
>> problem. We're still trying to decide how to fix that [1].
>>
>> Until then sockmap users might see the following warnings:
>>
>> | BUG: sleeping function called from invalid context at net/core/sock.c:2935
>> | in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
>> | 3 locks held by kworker/0:1/62:
>> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
>> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>> | Workqueue: events bpf_map_free_deferred
>> | Call Trace:
>> |  dump_stack+0x71/0xa0
>> |  ___might_sleep.cold+0xa6/0xb6
>> |  lock_sock_nested+0x28/0x90
>> |  sock_map_free+0x5f/0x180
>> |  bpf_map_free_deferred+0x58/0x80
>> |  process_one_work+0x260/0x5e0
>> |  worker_thread+0x4d/0x3e0
>> |  kthread+0x108/0x140
>> |  ? process_one_work+0x5e0/0x5e0
>> |  ? kthread_park+0x90/0x90
>> |  ret_from_fork+0x3a/0x50
>> |
>> | ======================================================
>> | WARNING: possible circular locking dependency detected
>> | 5.5.0-04008-g7b083332376e #454 Tainted: G        W
>> | ------------------------------------------------------
>> | kworker/0:1/62 is trying to acquire lock:
>> | ffff88813b280130 (sk_lock-AF_INET){+.+.}, at: sock_map_free+0x5f/0x180
>> |
>> | but task is already holding lock:
>> | ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> |
>> | which lock already depends on the new lock.
>> |
>> |
>> | the existing dependency chain (in reverse order) is:
>> |
>> | -> #1 (&stab->lock){+...}:
>> |        _raw_spin_lock_bh+0x39/0x80
>> |        sock_map_update_common+0xdc/0x300
>> |        sock_map_update_elem+0xc3/0x150
>> |        __do_sys_bpf+0x1285/0x1620
>> |        do_syscall_64+0x6d/0x690
>> |        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> |
>> | -> #0 (sk_lock-AF_INET){+.+.}:
>> |        __lock_acquire+0xe2f/0x19f0
>> |        lock_acquire+0x95/0x190
>> |        lock_sock_nested+0x6b/0x90
>> |        sock_map_free+0x5f/0x180
>> |        bpf_map_free_deferred+0x58/0x80
>> |        process_one_work+0x260/0x5e0
>> |        worker_thread+0x4d/0x3e0
>> |        kthread+0x108/0x140
>> |        ret_from_fork+0x3a/0x50
>> |
>> | other info that might help us debug this:
>> |
>> |  Possible unsafe locking scenario:
>> |
>> |        CPU0                    CPU1
>> |        ----                    ----
>> |   lock(&stab->lock);
>> |                                lock(sk_lock-AF_INET);
>> |                                lock(&stab->lock);
>> |   lock(sk_lock-AF_INET);
>> |
>> |  *** DEADLOCK ***
>> |
>> | 3 locks held by kworker/0:1/62:
>> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> |
>> | stack backtrace:
>> | CPU: 0 PID: 62 Comm: kworker/0:1 Tainted: G        W         5.5.0-04008-g7b083332376e #454
>> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>> | Workqueue: events bpf_map_free_deferred
>> | Call Trace:
>> |  dump_stack+0x71/0xa0
>> |  check_noncircular+0x176/0x190
>> |  __lock_acquire+0xe2f/0x19f0
>> |  lock_acquire+0x95/0x190
>> |  ? sock_map_free+0x5f/0x180
>> |  lock_sock_nested+0x6b/0x90
>> |  ? sock_map_free+0x5f/0x180
>> |  sock_map_free+0x5f/0x180
>> |  bpf_map_free_deferred+0x58/0x80
>> |  process_one_work+0x260/0x5e0
>> |  worker_thread+0x4d/0x3e0
>> |  kthread+0x108/0x140
>> |  ? process_one_work+0x5e0/0x5e0
>> |  ? kthread_park+0x90/0x90
>> |  ret_from_fork+0x3a/0x50

Hey John,

> Untested at the moment, but this should also be fine per your suggestion
> (if I read it correctly).  The reason we have stab->lock and bucket->locks
> here is to handle checking EEXIST in update/delete cases. We need to
> be careful that when an update happens and we check for EEXIST that the
> socket is not added/removed during this check. So both map_update_common and
> sock_map_delete need to guard from being run together potentially deleting
> an entry we are checking, etc.

Okay, thanks for the explanation. IOW, we're serializing map writers.
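
The writer serialization described above can be sketched as a userspace toy
model (the names `toy_map` and `toy_map_update` are hypothetical, not the
kernel code): the EEXIST check and the store must happen under one lock so a
concurrent writer cannot slip in between them:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Toy model of the sock_map_update_common() check-then-insert. The lock
 * makes the EEXIST test and the store one atomic step; without it, two
 * writers could both observe slots[i] == NULL and both "succeed",
 * silently overwriting one entry.
 */
struct toy_map {
	pthread_mutex_t lock;
	void *slots[16];
};

static int toy_map_update(struct toy_map *m, size_t i, void *sk, int noexist)
{
	int err = 0;

	pthread_mutex_lock(&m->lock);
	if (noexist && m->slots[i])
		err = -EEXIST;	/* slot taken, reject BPF_NOEXIST-style update */
	else
		m->slots[i] = sk;
	pthread_mutex_unlock(&m->lock);
	return err;
}
```

Once no writer can reach the map at all, as after the final ref drop plus
synchronize_rcu(), there is nothing left for the lock to serialize against.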

> But by the time we get here we just did a synchronize_rcu() in the
> line above so no updates/deletes should be in flight. So it seems safe
> to drop these locks because no updates are in flight.

This part is not clear to me. I might be missing something.

Here's my thinking - for any map writes (update/delete) to start,
map->refcnt needs to be > 0, and the ref is not dropped until the write
operation has finished.

Map FDs hold a ref to map until the FD gets released. And BPF progs hold
refs to maps until the prog gets unloaded.

This would mean that map_free will get scheduled from __bpf_map_put only
when no one is holding a map ref, and so no one could start a write that
would run concurrently with sock_{map,hash}_free:

/* decrement map refcnt and schedule it for freeing via workqueue
 * (underlying map implementation ops->map_free() might sleep)
 */
static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
{
	if (atomic64_dec_and_test(&map->refcnt)) {
		/* bpf_map_free_id() must be called first */
		bpf_map_free_id(map, do_idr_lock);
		btf_put(map->btf);
		INIT_WORK(&map->work, bpf_map_free_deferred);
		schedule_work(&map->work);
	}
}
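
That refcount argument can be modeled in miniature (a hypothetical `toy_ref`
sketch, not the kernel API): the deferred free is triggered only on the final
put, so no writer still holding a ref can be mid-write when it runs:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Miniature model of map refcounting: writers take a ref before writing
 * and drop it after; the owner drops the last ref on tear-down.
 * free_scheduled flips only on the 1 -> 0 transition, i.e. only once no
 * writer can still be inside a write.
 */
struct toy_ref {
	atomic_long refcnt;
	bool free_scheduled;
};

static void toy_get(struct toy_ref *r)
{
	atomic_fetch_add(&r->refcnt, 1);
}

static void toy_put(struct toy_ref *r)
{
	/* fetch_sub returns the old value; old == 1 means we dropped the
	 * last ref, mimicking schedule_work(&map->work) in __bpf_map_put.
	 */
	if (atomic_fetch_sub(&r->refcnt, 1) == 1)
		r->free_scheduled = true;
}
```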

> So with the patch below we keep the sync rcu, but that is fine IMO; these
> map frees are rare. Take a look and make sure it seems sane to you
> as well.

I can't vouch for the need to keep synchronize_rcu here because I don't
understand that part, but otherwise the change LGTM.

-jkbs

>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index f36e13e577a3..1d56ec20330c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -233,8 +233,11 @@ static void sock_map_free(struct bpf_map *map)
>  	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
>  	int i;
>
> +	/* After the sync no updates or deletes will be in-flight so it
> +	 * is safe to walk map and remove entries without risking a race
> +	 * in EEXIST update case.
> +	 */
>  	synchronize_rcu();
> -	raw_spin_lock_bh(&stab->lock);
>  	for (i = 0; i < stab->map.max_entries; i++) {
>  		struct sock **psk = &stab->sks[i];
>  		struct sock *sk;
> @@ -248,7 +251,6 @@ static void sock_map_free(struct bpf_map *map)
>  			release_sock(sk);
>  		}
>  	}
> -	raw_spin_unlock_bh(&stab->lock);
>
>  	/* wait for psock readers accessing its map link */
>  	synchronize_rcu();
> @@ -859,10 +861,13 @@ static void sock_hash_free(struct bpf_map *map)
>  	struct hlist_node *node;
>  	int i;
>
> +	/* After the sync no updates or deletes will be in-flight so it
> +	 * is safe to walk hash and remove entries without risking a race
> +	 * in EEXIST update case.
> +	 */
>  	synchronize_rcu();
>  	for (i = 0; i < htab->buckets_num; i++) {
>  		bucket = sock_hash_select_bucket(htab, i);
> -		raw_spin_lock_bh(&bucket->lock);
>  		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
>  			hlist_del_rcu(&elem->node);
>  			lock_sock(elem->sk);
> @@ -871,7 +876,6 @@ static void sock_hash_free(struct bpf_map *map)
>  			rcu_read_unlock();
>  			release_sock(elem->sk);
>  		}
> -		raw_spin_unlock_bh(&bucket->lock);
>  	}
>
>  	/* wait for psock readers accessing its map link */

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
  2020-02-06 11:16 [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down Jakub Sitnicki
                   ` (3 preceding siblings ...)
  2020-02-06 19:43 ` [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down John Fastabend
@ 2020-02-07 21:56 ` Daniel Borkmann
  4 siblings, 0 replies; 19+ messages in thread
From: Daniel Borkmann @ 2020-02-07 21:56 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend

On 2/6/20 12:16 PM, Jakub Sitnicki wrote:
> Couple of fixes that came from recent discussion [0] on commit
> 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").

Series applied, thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  2020-02-06 11:16 ` [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it Jakub Sitnicki
  2020-02-06 19:03   ` John Fastabend
@ 2020-02-09  2:41   ` Alexei Starovoitov
  2020-02-09  4:12     ` Yonghong Song
  2020-02-09 15:29     ` Jakub Sitnicki
  1 sibling, 2 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2020-02-09  2:41 UTC (permalink / raw)
  To: Jakub Sitnicki, Daniel Borkmann
  Cc: bpf, Network Development, kernel-team, John Fastabend

On Thu, Feb 6, 2020 at 3:28 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Commit 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear
> down") introduced sleeping issues inside RCU critical sections and while
> holding a spinlock on sockmap/sockhash tear-down. There has to be at least
> one socket in the map for the problem to surface.
>
> This adds a test that triggers the warnings for broken locking rules. Not a
> fix per se, but rather tooling to verify the accompanying fixes. Run on a
> VM with 1 vCPU to reproduce the warnings.
>
> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>

selftests/bpf no longer builds for me.
make
  BINARY   test_maps
  TEST-OBJ [test_progs] sockmap_basic.test.o
/data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:
In function ‘connected_socket_v4’:
/data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
error: ‘TCP_REPAIR_ON’ undeclared (first use in this function); did
you mean ‘TCP_REPAIR’?
   20 |  repair = TCP_REPAIR_ON;
      |           ^~~~~~~~~~~~~
      |           TCP_REPAIR
/data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
note: each undeclared identifier is reported only once for each
function it appears in
/data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:29:11:
error: ‘TCP_REPAIR_OFF_NO_WP’ undeclared (first use in this function);
did you mean ‘TCP_REPAIR_OPTIONS’?
   29 |  repair = TCP_REPAIR_OFF_NO_WP;
      |           ^~~~~~~~~~~~~~~~~~~~
      |           TCP_REPAIR_OPTIONS

Clearly /usr/include/linux/tcp.h is too old.
Suggestions?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  2020-02-09  2:41   ` Alexei Starovoitov
@ 2020-02-09  4:12     ` Yonghong Song
  2020-02-09 15:29     ` Jakub Sitnicki
  1 sibling, 0 replies; 19+ messages in thread
From: Yonghong Song @ 2020-02-09  4:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Jakub Sitnicki, Daniel Borkmann
  Cc: bpf, Network Development, kernel-team, John Fastabend



On 2/8/20 6:41 PM, Alexei Starovoitov wrote:
> On Thu, Feb 6, 2020 at 3:28 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Commit 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear
>> down") introduced sleeping issues inside RCU critical sections and while
>> holding a spinlock on sockmap/sockhash tear-down. There has to be at least
>> one socket in the map for the problem to surface.
>>
>> This adds a test that triggers the warnings for broken locking rules. Not a
>> fix per se, but rather tooling to verify the accompanying fixes. Run on a
>> VM with 1 vCPU to reproduce the warnings.
>>
>> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> 
> selftests/bpf no longer builds for me.
> make
>    BINARY   test_maps
>    TEST-OBJ [test_progs] sockmap_basic.test.o
> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:
> In function ‘connected_socket_v4’:
> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
> error: ‘TCP_REPAIR_ON’ undeclared (first use in this function); did
> you mean ‘TCP_REPAIR’?
>     20 |  repair = TCP_REPAIR_ON;
>        |           ^~~~~~~~~~~~~
>        |           TCP_REPAIR
> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
> note: each undeclared identifier is reported only once for each
> function it appears in
> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:29:11:
> error: ‘TCP_REPAIR_OFF_NO_WP’ undeclared (first use in this function);
> did you mean ‘TCP_REPAIR_OPTIONS’?
>     29 |  repair = TCP_REPAIR_OFF_NO_WP;
>        |           ^~~~~~~~~~~~~~~~~~~~
>        |           TCP_REPAIR_OPTIONS
> 
> Clearly /usr/include/linux/tcp.h is too old.
> Suggestions?

In the past, when such a situation happened, people typically synced the
header to the linux/tools/include/uapi/ directory. This will probably
work in this case as well.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  2020-02-09  2:41   ` Alexei Starovoitov
  2020-02-09  4:12     ` Yonghong Song
@ 2020-02-09 15:29     ` Jakub Sitnicki
  2020-02-10  3:55       ` John Fastabend
  1 sibling, 1 reply; 19+ messages in thread
From: Jakub Sitnicki @ 2020-02-09 15:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, bpf, Network Development, kernel-team, John Fastabend

On Sun, Feb 09, 2020 at 03:41 AM CET, Alexei Starovoitov wrote:
> On Thu, Feb 6, 2020 at 3:28 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Commit 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear
>> down") introduced sleeping issues inside RCU critical sections and while
>> holding a spinlock on sockmap/sockhash tear-down. There has to be at least
>> one socket in the map for the problem to surface.
>>
>> This adds a test that triggers the warnings for broken locking rules. Not a
>> fix per se, but rather tooling to verify the accompanying fixes. Run on a
>> VM with 1 vCPU to reproduce the warnings.
>>
>> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>
> selftests/bpf no longer builds for me.
> make
>   BINARY   test_maps
>   TEST-OBJ [test_progs] sockmap_basic.test.o
> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:
> In function ‘connected_socket_v4’:
> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
> error: ‘TCP_REPAIR_ON’ undeclared (first use in this function); did
> you mean ‘TCP_REPAIR’?
>    20 |  repair = TCP_REPAIR_ON;
>       |           ^~~~~~~~~~~~~
>       |           TCP_REPAIR
> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
> note: each undeclared identifier is reported only once for each
> function it appears in
> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:29:11:
> error: ‘TCP_REPAIR_OFF_NO_WP’ undeclared (first use in this function);
> did you mean ‘TCP_REPAIR_OPTIONS’?
>    29 |  repair = TCP_REPAIR_OFF_NO_WP;
>       |           ^~~~~~~~~~~~~~~~~~~~
>       |           TCP_REPAIR_OPTIONS
>
> Clearly /usr/include/linux/tcp.h is too old.
> Suggestions?

Sorry for the inconvenience. I see that tcp.h header is missing under
linux/tools/include/uapi/.

I have been building against my distro kernel headers, completely
unaware of this. This is an oversight on my side.

Can I ask for a revert? I'm traveling today with limited ability to
post patches.

I can resubmit the test with the missing header for bpf-next once it
reopens.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  2020-02-09 15:29     ` Jakub Sitnicki
@ 2020-02-10  3:55       ` John Fastabend
  2020-02-10 11:52         ` Jakub Sitnicki
  0 siblings, 1 reply; 19+ messages in thread
From: John Fastabend @ 2020-02-10  3:55 UTC (permalink / raw)
  To: Jakub Sitnicki, Alexei Starovoitov
  Cc: Daniel Borkmann, bpf, Network Development, kernel-team, John Fastabend

Jakub Sitnicki wrote:
> On Sun, Feb 09, 2020 at 03:41 AM CET, Alexei Starovoitov wrote:
> > On Thu, Feb 6, 2020 at 3:28 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> Commit 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear
> >> down") introduced sleeping issues inside RCU critical sections and while
> >> holding a spinlock on sockmap/sockhash tear-down. There has to be at least
> >> one socket in the map for the problem to surface.
> >>
> >> This adds a test that triggers the warnings for broken locking rules. Not a
> >> fix per se, but rather tooling to verify the accompanying fixes. Run on a
> >> VM with 1 vCPU to reproduce the warnings.
> >>
> >> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >
> > selftests/bpf no longer builds for me.
> > make
> >   BINARY   test_maps
> >   TEST-OBJ [test_progs] sockmap_basic.test.o
> > /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:
> > In function ‘connected_socket_v4’:
> > /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
> > error: ‘TCP_REPAIR_ON’ undeclared (first use in this function); did
> > you mean ‘TCP_REPAIR’?
> >    20 |  repair = TCP_REPAIR_ON;
> >       |           ^~~~~~~~~~~~~
> >       |           TCP_REPAIR
> > /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
> > note: each undeclared identifier is reported only once for each
> > function it appears in
> > /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:29:11:
> > error: ‘TCP_REPAIR_OFF_NO_WP’ undeclared (first use in this function);
> > did you mean ‘TCP_REPAIR_OPTIONS’?
> >    29 |  repair = TCP_REPAIR_OFF_NO_WP;
> >       |           ^~~~~~~~~~~~~~~~~~~~
> >       |           TCP_REPAIR_OPTIONS
> >
> > Clearly /usr/include/linux/tcp.h is too old.
> > Suggestions?
> 
> Sorry for the inconvenience. I see that tcp.h header is missing under
> linux/tools/include/uapi/.

How about we just add the couple of defines needed to sockmap_basic.c? I don't
see a need to pull in all of tcp.h just for a couple of defines that won't
change anyway.

> 
> I have been building against my distro kernel headers, completely
> unaware of this. This is an oversight on my side.
> 
> Can I ask for a revert? I'm traveling today with limited ability to
> post patches.

I don't think we need a full revert.

> 
> I can resubmit the test with the missing header for bpf-next once it
> reopens.

If you are traveling I'll post a patch with the defines.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  2020-02-10  3:55       ` John Fastabend
@ 2020-02-10 11:52         ` Jakub Sitnicki
  2020-02-10 23:38           ` Daniel Borkmann
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Sitnicki @ 2020-02-10 11:52 UTC (permalink / raw)
  To: Alexei Starovoitov, John Fastabend
  Cc: Daniel Borkmann, bpf, Network Development, kernel-team

On Mon, Feb 10, 2020 at 03:55 AM GMT, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Sun, Feb 09, 2020 at 03:41 AM CET, Alexei Starovoitov wrote:
>> > On Thu, Feb 6, 2020 at 3:28 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >>
>> >> Commit 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear
>> >> down") introduced sleeping issues inside RCU critical sections and while
>> >> holding a spinlock on sockmap/sockhash tear-down. There has to be at least
>> >> one socket in the map for the problem to surface.
>> >>
>> >> This adds a test that triggers the warnings for broken locking rules. Not a
>> >> fix per se, but rather tooling to verify the accompanying fixes. Run on a
>> >> VM with 1 vCPU to reproduce the warnings.
>> >>
>> >> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >
>> > selftests/bpf no longer builds for me.
>> > make
>> >   BINARY   test_maps
>> >   TEST-OBJ [test_progs] sockmap_basic.test.o
>> > /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:
>> > In function ‘connected_socket_v4’:
>> > /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
>> > error: ‘TCP_REPAIR_ON’ undeclared (first use in this function); did
>> > you mean ‘TCP_REPAIR’?
>> >    20 |  repair = TCP_REPAIR_ON;
>> >       |           ^~~~~~~~~~~~~
>> >       |           TCP_REPAIR
>> > /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
>> > note: each undeclared identifier is reported only once for each
>> > function it appears in
>> > /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:29:11:
>> > error: ‘TCP_REPAIR_OFF_NO_WP’ undeclared (first use in this function);
>> > did you mean ‘TCP_REPAIR_OPTIONS’?
>> >    29 |  repair = TCP_REPAIR_OFF_NO_WP;
>> >       |           ^~~~~~~~~~~~~~~~~~~~
>> >       |           TCP_REPAIR_OPTIONS
>> >
>> > Clearly /usr/include/linux/tcp.h is too old.
>> > Suggestions?
>>
>> Sorry for the inconvenience. I see that tcp.h header is missing under
>> linux/tools/include/uapi/.
>
> How about we just add the couple of defines needed to sockmap_basic.c? I don't
> see a need to pull in all of tcp.h just for a couple of defines that won't
> change anyway.

Looking back at how this happened: test_progs.h pulls in netinet/tcp.h:

# 19 "/home/jkbs/src/linux/tools/testing/selftests/bpf/test_progs.h" 2
# 1 "/usr/include/netinet/tcp.h" 1 3 4
# 92 "/usr/include/netinet/tcp.h" 3 4

A glibc header, which gained TCP_REPAIR_* constants in 2.29 [0]:

$ git describe --contains 5cd7dbdea13eb302620491ef44837b17e9d39c5a
glibc-2.29~510

Pulling in linux/tcp.h would conflict with struct definitions in
netinet/tcp.h. So redefining the possibly missing constants, like John
suggests, is the right way out.
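
A guard along the lines John suggests might look like this (a sketch, not the
committed patch; the constant values are copied from the kernel's
include/uapi/linux/tcp.h):

```c
#include <assert.h>
#include <netinet/tcp.h>	/* glibc header; gained these only in 2.29 */

/* Fallback definitions for the TCP repair mode constants, matching
 * include/uapi/linux/tcp.h, for builds against pre-2.29 glibc headers.
 * With glibc >= 2.29 the #ifndef guards are no-ops.
 */
#ifndef TCP_REPAIR_ON
#define TCP_REPAIR_ON		1
#endif
#ifndef TCP_REPAIR_OFF_NO_WP
#define TCP_REPAIR_OFF_NO_WP	-1	/* Turn off without window probes */
#endif
```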

I'm not sure, though, how to protect against such mistakes in the
future. Any ideas?

[0] https://sourceware.org/git/?p=glibc.git;a=commit;h=5cd7dbdea13eb302620491ef44837b17e9d39c5a

>
>>
>> I have been building against my distro kernel headers, completely
>> unaware of this. This is an oversight on my side.
>>
>> Can I ask for a revert? I'm traveling today with limited ability to
>> post patches.
>
> I don't think we need a full revert.
>
>>
>> I can resubmit the test with the missing header for bpf-next once it
>> reopens.
>
> If you are traveling I'll post a patch with the defines.

Thanks, again.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  2020-02-10 11:52         ` Jakub Sitnicki
@ 2020-02-10 23:38           ` Daniel Borkmann
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Borkmann @ 2020-02-10 23:38 UTC (permalink / raw)
  To: Jakub Sitnicki, Alexei Starovoitov, John Fastabend
  Cc: bpf, Network Development, kernel-team

On 2/10/20 12:52 PM, Jakub Sitnicki wrote:
> On Mon, Feb 10, 2020 at 03:55 AM GMT, John Fastabend wrote:
>> Jakub Sitnicki wrote:
>>> On Sun, Feb 09, 2020 at 03:41 AM CET, Alexei Starovoitov wrote:
>>>> On Thu, Feb 6, 2020 at 3:28 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>>>>
>>>>> Commit 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear
>>>>> down") introduced sleeping issues inside RCU critical sections and while
>>>>> holding a spinlock on sockmap/sockhash tear-down. There has to be at least
>>>>> one socket in the map for the problem to surface.
>>>>>
>>>>> This adds a test that triggers the warnings for broken locking rules. Not a
>>>>> fix per se, but rather tooling to verify the accompanying fixes. Run on a
>>>>> VM with 1 vCPU to reproduce the warnings.
>>>>>
>>>>> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down")
>>>>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>>>>
>>>> selftests/bpf no longer builds for me.
>>>> make
>>>>    BINARY   test_maps
>>>>    TEST-OBJ [test_progs] sockmap_basic.test.o
>>>> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:
>>>> In function ‘connected_socket_v4’:
>>>> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
>>>> error: ‘TCP_REPAIR_ON’ undeclared (first use in this function); did
>>>> you mean ‘TCP_REPAIR’?
>>>>     20 |  repair = TCP_REPAIR_ON;
>>>>        |           ^~~~~~~~~~~~~
>>>>        |           TCP_REPAIR
>>>> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:20:11:
>>>> note: each undeclared identifier is reported only once for each
>>>> function it appears in
>>>> /data/users/ast/net/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:29:11:
>>>> error: ‘TCP_REPAIR_OFF_NO_WP’ undeclared (first use in this function);
>>>> did you mean ‘TCP_REPAIR_OPTIONS’?
>>>>     29 |  repair = TCP_REPAIR_OFF_NO_WP;
>>>>        |           ^~~~~~~~~~~~~~~~~~~~
>>>>        |           TCP_REPAIR_OPTIONS
>>>>
>>>> Clearly /usr/include/linux/tcp.h is too old.
>>>> Suggestions?
>>>
>>> Sorry for the inconvenience. I see that tcp.h header is missing under
>>> linux/tools/include/uapi/.
>>
>> How about we just add the couple of defines needed to sockmap_basic.c? I don't
>> see a need to pull in all of tcp.h just for a couple of defines that won't
>> change anyway.
> 
> Looking back at how this happened. test_progs.h pulls in netinet/tcp.h:
> 
> # 19 "/home/jkbs/src/linux/tools/testing/selftests/bpf/test_progs.h" 2
> # 1 "/usr/include/netinet/tcp.h" 1 3 4
> # 92 "/usr/include/netinet/tcp.h" 3 4
> 
> A glibc header, which gained TCP_REPAIR_* constants in 2.29 [0]:
> 
> $ git describe --contains 5cd7dbdea13eb302620491ef44837b17e9d39c5a
> glibc-2.29~510
> 
> Pulling in linux/tcp.h would conflict with struct definitions in
> netinet/tcp.h. So redefining the possibly missing constants, like John
> suggests, is the right way out.
> 
> I'm not sure, though, how to protect against such mistakes in the
> future. Any ideas?

We've usually been pulling header copies into tools/include/. The last bigger
one was in commit 13a748ea6df1 ("bpf: Sync asm-generic/socket.h to tools/"), to
name one example. But redefining the way John did is also okay for smaller
things, especially if a header has further dependencies which would then also
be needed under the tools/ infra.

> [0] https://sourceware.org/git/?p=glibc.git;a=commit;h=5cd7dbdea13eb302620491ef44837b17e9d39c5a

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
  2020-02-06 19:43 ` [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down John Fastabend
  2020-02-07 10:45   ` Jakub Sitnicki
@ 2020-03-10 11:30   ` Jakub Sitnicki
  1 sibling, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2020-03-10 11:30 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, kernel-team

On Thu, Feb 06, 2020 at 08:43 PM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> Couple of fixes that came from recent discussion [0] on commit
>> 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").
>>
>> This series doesn't address the sleeping while holding a spinlock
>> problem. We're still trying to decide how to fix that [1].
>>
>> Until then sockmap users might see the following warnings:
>>
>> | BUG: sleeping function called from invalid context at net/core/sock.c:2935
>> | in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
>> | 3 locks held by kworker/0:1/62:
>> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
>> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>> | Workqueue: events bpf_map_free_deferred
>> | Call Trace:
>> |  dump_stack+0x71/0xa0
>> |  ___might_sleep.cold+0xa6/0xb6
>> |  lock_sock_nested+0x28/0x90
>> |  sock_map_free+0x5f/0x180
>> |  bpf_map_free_deferred+0x58/0x80
>> |  process_one_work+0x260/0x5e0
>> |  worker_thread+0x4d/0x3e0
>> |  kthread+0x108/0x140
>> |  ? process_one_work+0x5e0/0x5e0
>> |  ? kthread_park+0x90/0x90
>> |  ret_from_fork+0x3a/0x50
>> |
>> | ======================================================
>> | WARNING: possible circular locking dependency detected
>> | 5.5.0-04008-g7b083332376e #454 Tainted: G        W
>> | ------------------------------------------------------
>> | kworker/0:1/62 is trying to acquire lock:
>> | ffff88813b280130 (sk_lock-AF_INET){+.+.}, at: sock_map_free+0x5f/0x180
>> |
>> | but task is already holding lock:
>> | ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> |
>> | which lock already depends on the new lock.
>> |
>> |
>> | the existing dependency chain (in reverse order) is:
>> |
>> | -> #1 (&stab->lock){+...}:
>> |        _raw_spin_lock_bh+0x39/0x80
>> |        sock_map_update_common+0xdc/0x300
>> |        sock_map_update_elem+0xc3/0x150
>> |        __do_sys_bpf+0x1285/0x1620
>> |        do_syscall_64+0x6d/0x690
>> |        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> |
>> | -> #0 (sk_lock-AF_INET){+.+.}:
>> |        __lock_acquire+0xe2f/0x19f0
>> |        lock_acquire+0x95/0x190
>> |        lock_sock_nested+0x6b/0x90
>> |        sock_map_free+0x5f/0x180
>> |        bpf_map_free_deferred+0x58/0x80
>> |        process_one_work+0x260/0x5e0
>> |        worker_thread+0x4d/0x3e0
>> |        kthread+0x108/0x140
>> |        ret_from_fork+0x3a/0x50
>> |
>> | other info that might help us debug this:
>> |
>> |  Possible unsafe locking scenario:
>> |
>> |        CPU0                    CPU1
>> |        ----                    ----
>> |   lock(&stab->lock);
>> |                                lock(sk_lock-AF_INET);
>> |                                lock(&stab->lock);
>> |   lock(sk_lock-AF_INET);
>> |
>> |  *** DEADLOCK ***
>> |
>> | 3 locks held by kworker/0:1/62:
>> |  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
>> |  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
>> |
>> | stack backtrace:
>> | CPU: 0 PID: 62 Comm: kworker/0:1 Tainted: G        W         5.5.0-04008-g7b083332376e #454
>> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>> | Workqueue: events bpf_map_free_deferred
>> | Call Trace:
>> |  dump_stack+0x71/0xa0
>> |  check_noncircular+0x176/0x190
>> |  __lock_acquire+0xe2f/0x19f0
>> |  lock_acquire+0x95/0x190
>> |  ? sock_map_free+0x5f/0x180
>> |  lock_sock_nested+0x6b/0x90
>> |  ? sock_map_free+0x5f/0x180
>> |  sock_map_free+0x5f/0x180
>> |  bpf_map_free_deferred+0x58/0x80
>> |  process_one_work+0x260/0x5e0
>> |  worker_thread+0x4d/0x3e0
>> |  kthread+0x108/0x140
>> |  ? process_one_work+0x5e0/0x5e0
>> |  ? kthread_park+0x90/0x90
>> |  ret_from_fork+0x3a/0x50
>
> Hi Jakub,
>
> Untested at the moment, but this should also be fine per your suggestion
> (if I read it correctly).  The reason we have stab->lock and bucket->locks
> here is to handle checking EEXIST in update/delete cases. We need to
> be careful that when an update happens and we check for EEXIST that the
> socket is not added/removed during this check. So both map_update_common and
> sock_map_delete need to guard from being run together potentially deleting
> an entry we are checking, etc. But by the time we get here we just did
> a synchronize_rcu() in the line above so no updates/deletes should be
> in flight. So it seems safe to drop these locks, since no updates are
> in flight.
>
> So with the patch below we keep the sync RCU, but that is fine IMO since
> these map frees are rare. Take a look and make sure it seems sane to you
> as well.
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index f36e13e577a3..1d56ec20330c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -233,8 +233,11 @@ static void sock_map_free(struct bpf_map *map)
>  	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
>  	int i;
>
> +	/* After the sync no updates or deletes will be in-flight so it
> +	 * is safe to walk map and remove entries without risking a race
> +	 * in EEXIST update case.
> +	 */
>  	synchronize_rcu();
> -	raw_spin_lock_bh(&stab->lock);
>  	for (i = 0; i < stab->map.max_entries; i++) {
>  		struct sock **psk = &stab->sks[i];
>  		struct sock *sk;
> @@ -248,7 +251,6 @@ static void sock_map_free(struct bpf_map *map)
>  			release_sock(sk);
>  		}
>  	}
> -	raw_spin_unlock_bh(&stab->lock);
>
>  	/* wait for psock readers accessing its map link */
>  	synchronize_rcu();
> @@ -859,10 +861,13 @@ static void sock_hash_free(struct bpf_map *map)
>  	struct hlist_node *node;
>  	int i;
>
> +	/* After the sync no updates or deletes will be in-flight so it
> +	 * is safe to walk hash and remove entries without risking a race
> +	 * in EEXIST update case.
> +	 */
>  	synchronize_rcu();
>  	for (i = 0; i < htab->buckets_num; i++) {
>  		bucket = sock_hash_select_bucket(htab, i);
> -		raw_spin_lock_bh(&bucket->lock);
>  		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
>  			hlist_del_rcu(&elem->node);
>  			lock_sock(elem->sk);
> @@ -871,7 +876,6 @@ static void sock_hash_free(struct bpf_map *map)
>  			rcu_read_unlock();
>  			release_sock(elem->sk);
>  		}
> -		raw_spin_unlock_bh(&bucket->lock);
>  	}
>
>  	/* wait for psock readers accessing its map link */

Hi John,

We would like to get rid of the lockdep splats we are seeing in testing.

Mind if I submit the above fix for bpf-next on your behalf?

That is, of course, unless you have cycles to tend to it yourself.

Thanks,
-jkbs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
  2020-02-07 10:45   ` Jakub Sitnicki
@ 2020-03-10 16:44     ` John Fastabend
  2020-03-11 11:51       ` Jakub Sitnicki
  0 siblings, 1 reply; 19+ messages in thread
From: John Fastabend @ 2020-03-10 16:44 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend; +Cc: bpf, netdev, kernel-team

Jakub Sitnicki wrote:
> On Thu, Feb 06, 2020 at 08:43 PM CET, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> Couple of fixes that came from recent discussion [0] on commit
> >> 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").
> >>
> >> This series doesn't address the sleeping while holding a spinlock
> >> problem. We're still trying to decide how to fix that [1].
> >>
> >> Until then sockmap users might see the following warnings:
> >>
> >> | BUG: sleeping function called from invalid context at net/core/sock.c:2935
> 

[...]

> Hey John,

Patch sent.

> 
> > Untested at the moment, but this should also be fine per your suggestion
> > (if I read it correctly).  The reason we have stab->lock and bucket->locks
> > here is to handle checking EEXIST in update/delete cases. We need to
> > be careful that when an update happens and we check for EEXIST, the
> > socket is not added or removed during this check. So both map_update_common and
> > sock_map_delete need to guard against running concurrently and deleting
> > an entry we are checking, etc.
> 
> Okay, thanks for the explanation. IOW, we're serializing map writers.
> 
> > But by the time we get here we just did a synchronize_rcu() in the
> > line above so no updates/deletes should be in flight. So it seems safe
> > to drop these locks, since no updates are in flight.
> 
> This part is not clear to me. I might be missing something.
> 
> Here's my thinking - for any map writes (update/delete) to start,
> map->refcnt needs to be > 0, and the ref is not dropped until the write
> operation has finished.
> 
> Map FDs hold a ref to map until the FD gets released. And BPF progs hold
> refs to maps until the prog gets unloaded.
> 
> This would mean that map_free will get scheduled from __bpf_map_put only
> when no one is holding a map ref, so no one could start a write that
> would happen concurrently with sock_{map,hash}_free:

Sorry for bringing back this old thread; I'm not sure I followed the
couple of paragraphs here. Is this with regard to the lock or the RCU? I
didn't want to just drop this, thanks.

We can't have new updates/lookups/deletes happening while we are freeing
a map; that would cause all sorts of problems, use-after-frees, etc.

> 
> /* decrement map refcnt and schedule it for freeing via workqueue
>  * (underlying map implementation ops->map_free() might sleep)
>  */
> static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
> {
> 	if (atomic64_dec_and_test(&map->refcnt)) {
> 		/* bpf_map_free_id() must be called first */
> 		bpf_map_free_id(map, do_idr_lock);
> 		btf_put(map->btf);
> 		INIT_WORK(&map->work, bpf_map_free_deferred);
> 		schedule_work(&map->work);
> 	}
> }
> 
> > So with the patch below we keep the sync RCU, but that is fine IMO since
> > these map frees are rare. Take a look and make sure it seems sane to you
> > as well.
> 
> I can't vouch for the need to keep synchronize_rcu here because I don't
> understand that part, but otherwise the change LGTM.
> 
> -jkbs
>

[...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
  2020-03-10 16:44     ` John Fastabend
@ 2020-03-11 11:51       ` Jakub Sitnicki
  0 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2020-03-11 11:51 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, kernel-team

On Tue, Mar 10, 2020 at 05:44 PM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Thu, Feb 06, 2020 at 08:43 PM CET, John Fastabend wrote:
>> > Jakub Sitnicki wrote:
>> >> Couple of fixes that came from recent discussion [0] on commit
>> >> 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").
>> >>
>> >> This series doesn't address the sleeping while holding a spinlock
>> >> problem. We're still trying to decide how to fix that [1].
>> >>
>> >> Until then sockmap users might see the following warnings:
>> >>
>> >> | BUG: sleeping function called from invalid context at net/core/sock.c:2935
>>
>
> [...]
>
>> Hey John,
>
> Patch sent.

Thanks!

>
>>
>> > Untested at the moment, but this should also be fine per your suggestion
>> > (if I read it correctly).  The reason we have stab->lock and bucket->locks
>> > here is to handle checking EEXIST in update/delete cases. We need to
>> > be careful that when an update happens and we check for EEXIST, the
>> > socket is not added or removed during this check. So both map_update_common and
>> > sock_map_delete need to guard against running concurrently and deleting
>> > an entry we are checking, etc.
>>
>> Okay, thanks for the explanation. IOW, we're serializing map writers.
>>
>> > But by the time we get here we just did a synchronize_rcu() in the
>> > line above so no updates/deletes should be in flight. So it seems safe
>> > to drop these locks, since no updates are in flight.
>>
>> This part is not clear to me. I might be missing something.
>>
>> Here's my thinking - for any map writes (update/delete) to start,
>> map->refcnt needs to be > 0, and the ref is not dropped until the write
>> operation has finished.
>>
>> Map FDs hold a ref to map until the FD gets released. And BPF progs hold
>> refs to maps until the prog gets unloaded.
>>
>> This would mean that map_free will get scheduled from __bpf_map_put only
>> when no one is holding a map ref, so no one could start a write that
>> would happen concurrently with sock_{map,hash}_free:
>
> Sorry for bringing back this old thread; I'm not sure I followed the
> couple of paragraphs here. Is this with regard to the lock or the RCU? I
> didn't want to just drop this, thanks.
>
> We can't have new updates/lookups/deletes happening while we are freeing
> a map; that would cause all sorts of problems, use-after-frees, etc.

Happy to pick up the discussion back up.

Sorry for the delay in my reply. I wanted to take another hard look at
the code and make sure I'm not getting ahead of myself here.

Let me back up a little and try to organize the access paths to sockmap
we have, and when they happen in relation to sock_map_free.

A) Access via bpf_map_ops

When a bpf_map and its backing object, bpf_stab, are accessed via map ops
(map_update_elem, map_delete_elem, map_lookup_elem), either (i) a
process has an FD for the map, or (ii) a loaded BPF prog holds a map
reference. Also, we always grab a map ref when creating an FD for it.

This means that map->refcnt is > 0 while a call to one of the map_ops is
in progress.

Hence, bpf_map_free_deferred -> sock_map_free won't get called during
these operations. This fact allowed us to get rid of locking the stab in
sock_map_free.

B) Access via bpf_{sk|msg}_redirect_map

Similar to the previous case. A BPF prog invoking these helpers must hold
a map reference, so we know that map->refcnt is > 0, and sock_map_free
can't be in progress at the same time.

C) Access via sk_psock_link

sk_psock_link has a pointer to bpf_map (link->map) and to an entry in
stab->sks (link->link_raw), but doesn't hold a ref to the map.

We need to ensure bpf_stab doesn't go away while the tcp_bpf_remove ->
sk_psock_unlink -> sock_{map|hash}_delete_from_link call chain is in
progress.

That explains why in sock_map_free, after walking the map and destroying
all links, we wait for the RCU grace period to end with a call to
synchronize_rcu before freeing the map:

	/* wait for psock readers accessing its map link */
	synchronize_rcu();

	bpf_map_area_free(stab->sks);
	kfree(stab);


What is tripping me up, however, is that we also have another call to
synchronize_rcu before walking the map:

	/* After the sync no updates or deletes will be in-flight so it
	 * is safe to walk map and remove entries without risking a race
	 * in EEXIST update case.
	 */
	synchronize_rcu(); // <-- Is it needed?
	for (i = 0; i < stab->map.max_entries; i++) {
		// ...
	}

	/* wait for psock readers accessing its map link */
	synchronize_rcu();


I'm not grasping what purpose the 1st synchronize_rcu call serves.

New readers can start accessing the map after the 1st synchronize_rcu,
and this seems fine since the map will not be freed until after the 2nd
synchronize_rcu call.


Okay, so we can have deletes in-flight, which the explanatory comment
for the 1st synchronize_rcu mentions. What about updates in-flight?

I don't think they can happen with (A) being the only case I know of
when we update the map.

Sorry this was a bit long. So the question is what am I missing?
Can updates happen despite no refs to the map being held?

Thanks,
-jkbs

>
>>
>> /* decrement map refcnt and schedule it for freeing via workqueue
>>  * (underlying map implementation ops->map_free() might sleep)
>>  */
>> static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
>> {
>> 	if (atomic64_dec_and_test(&map->refcnt)) {
>> 		/* bpf_map_free_id() must be called first */
>> 		bpf_map_free_id(map, do_idr_lock);
>> 		btf_put(map->btf);
>> 		INIT_WORK(&map->work, bpf_map_free_deferred);
>> 		schedule_work(&map->work);
>> 	}
>> }
>>
>> > So with the patch below we keep the sync RCU, but that is fine IMO since
>> > these map frees are rare. Take a look and make sure it seems sane to you
>> > as well.
>>
>> I can't vouch for the need to keep synchronize_rcu here because I don't
>> understand that part, but otherwise the change LGTM.
>>
>> -jkbs
>>
>
> [...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-03-11 11:51 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-06 11:16 [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down Jakub Sitnicki
2020-02-06 11:16 ` [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down Jakub Sitnicki
2020-02-06 18:59   ` John Fastabend
2020-02-06 11:16 ` [PATCH bpf 2/3] bpf, sockhash: synchronize_rcu before free'ing map Jakub Sitnicki
2020-02-06 19:01   ` John Fastabend
2020-02-06 11:16 ` [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it Jakub Sitnicki
2020-02-06 19:03   ` John Fastabend
2020-02-09  2:41   ` Alexei Starovoitov
2020-02-09  4:12     ` Yonghong Song
2020-02-09 15:29     ` Jakub Sitnicki
2020-02-10  3:55       ` John Fastabend
2020-02-10 11:52         ` Jakub Sitnicki
2020-02-10 23:38           ` Daniel Borkmann
2020-02-06 19:43 ` [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down John Fastabend
2020-02-07 10:45   ` Jakub Sitnicki
2020-03-10 16:44     ` John Fastabend
2020-03-11 11:51       ` Jakub Sitnicki
2020-03-10 11:30   ` Jakub Sitnicki
2020-02-07 21:56 ` Daniel Borkmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).