bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v2 0/4] bpf: support cloning sk storage on accept()
@ 2019-08-09 16:10 Stanislav Fomichev
  2019-08-09 16:10 ` [PATCH bpf-next v2 1/4] bpf: export bpf_map_inc_not_zero Stanislav Fomichev
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Stanislav Fomichev @ 2019-08-09 16:10 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Martin KaFai Lau, Yonghong Song

Currently there is no way to propagate sk storage from the listener
socket to a newly accepted one. Consider the following use case:

        fd = socket();
        setsockopt(fd, SOL_IP, IP_TOS,...);
        /* ^^^ setsockopt BPF program triggers here and saves something
         * into sk storage of the listener.
         */
        listen(fd, ...);
        while (client = accept(fd)) {
                /* At this point all association between listener
                 * socket and newly accepted one is gone. New
                 * socket will not have any sk storage attached.
                 */
        }

Let's add a new BPF_F_CLONE flag that can be specified when creating
a socket storage map. This new flag indicates that the map's contents
should be cloned when the socket is cloned.
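
For illustration, a minimal sketch of a map declared with this flag
(BTF-style map definition, as the selftest in patch 4 uses; the map
and value names here are just placeholders):

        struct my_storage_value {
                __u32 val;
        };

        struct {
                __uint(type, BPF_MAP_TYPE_SK_STORAGE);
                __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_CLONE);
                __type(key, int);
                __type(value, struct my_storage_value);
        } cloned_map SEC(".maps");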

v2:
* remove spinlocks around selem_link_map/sk (Martin KaFai Lau)
* BPF_F_CLONE on a map, not selem (Martin KaFai Lau)
* hold a map while cloning (Martin KaFai Lau)
* use BTF maps in selftests (Yonghong Song)
* do proper cleanup in selftests; don't call close(-1) (Yonghong Song)
* export bpf_map_inc_not_zero

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>

Stanislav Fomichev (4):
  bpf: export bpf_map_inc_not_zero
  bpf: support cloning sk storage on accept()
  bpf: sync bpf.h to tools/
  selftests/bpf: add sockopt clone/inheritance test

 include/linux/bpf.h                           |   2 +
 include/net/bpf_sk_storage.h                  |  10 +
 include/uapi/linux/bpf.h                      |   3 +
 kernel/bpf/syscall.c                          |  16 +-
 net/core/bpf_sk_storage.c                     | 100 ++++++-
 net/core/sock.c                               |   9 +-
 tools/include/uapi/linux/bpf.h                |   3 +
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/progs/sockopt_inherit.c     |  97 +++++++
 .../selftests/bpf/test_sockopt_inherit.c      | 253 ++++++++++++++++++
 11 files changed, 487 insertions(+), 10 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/sockopt_inherit.c
 create mode 100644 tools/testing/selftests/bpf/test_sockopt_inherit.c

-- 
2.23.0.rc1.153.gdeed80330f-goog


* [PATCH bpf-next v2 1/4] bpf: export bpf_map_inc_not_zero
  2019-08-09 16:10 [PATCH bpf-next v2 0/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
@ 2019-08-09 16:10 ` Stanislav Fomichev
  2019-08-11 23:53   ` Yonghong Song
  2019-08-09 16:10 ` [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Stanislav Fomichev @ 2019-08-09 16:10 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Martin KaFai Lau, Yonghong Song

Rename the existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to
indicate that it is the caller's responsibility to do proper locking.
Create and export a bpf_map_inc_not_zero wrapper that properly takes
map_idr_lock. It will be used in the next commit to hold a map while
cloning a socket.
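
For reference, a rough sketch of the intended call pattern from the
socket clone path (the real code is in the next patch; smap refers to
the element's bpf_sk_storage_map):

	rcu_read_lock();
	...
	/* Succeeds only while the map's refcnt is non-zero. */
	map = bpf_map_inc_not_zero(&smap->map, false);
	if (IS_ERR(map))
		continue;	/* map is being freed, skip it */
	/* ... clone the element into the new socket ... */
	bpf_map_put(map);
	...
	rcu_read_unlock();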

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h  |  2 ++
 kernel/bpf/syscall.c | 16 +++++++++++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f9a506147c8a..15ae49862b82 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -647,6 +647,8 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
 struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref);
+struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map,
+						   bool uref);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
 int bpf_map_charge_memlock(struct bpf_map *map, u32 pages);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5d141f16f6fa..cf8052b016e7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -683,8 +683,8 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd)
 }
 
 /* map_idr_lock should have been held */
-static struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map,
-					    bool uref)
+static struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map,
+					      bool uref)
 {
 	int refold;
 
@@ -704,6 +704,16 @@ static struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map,
 	return map;
 }
 
+struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
+{
+	spin_lock_bh(&map_idr_lock);
+	map = __bpf_map_inc_not_zero(map, uref);
+	spin_unlock_bh(&map_idr_lock);
+
+	return map;
+}
+EXPORT_SYMBOL_GPL(bpf_map_inc_not_zero);
+
 int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 {
 	return -ENOTSUPP;
@@ -2177,7 +2187,7 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
 	spin_lock_bh(&map_idr_lock);
 	map = idr_find(&map_idr, id);
 	if (map)
-		map = bpf_map_inc_not_zero(map, true);
+		map = __bpf_map_inc_not_zero(map, true);
 	else
 		map = ERR_PTR(-ENOENT);
 	spin_unlock_bh(&map_idr_lock);
-- 
2.23.0.rc1.153.gdeed80330f-goog



* [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
  2019-08-09 16:10 [PATCH bpf-next v2 0/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
  2019-08-09 16:10 ` [PATCH bpf-next v2 1/4] bpf: export bpf_map_inc_not_zero Stanislav Fomichev
@ 2019-08-09 16:10 ` Stanislav Fomichev
  2019-08-11 23:54   ` Yonghong Song
                     ` (2 more replies)
  2019-08-09 16:10 ` [PATCH bpf-next v2 3/4] bpf: sync bpf.h to tools/ Stanislav Fomichev
  2019-08-09 16:10 ` [PATCH bpf-next v2 4/4] selftests/bpf: add sockopt clone/inheritance test Stanislav Fomichev
  3 siblings, 3 replies; 14+ messages in thread
From: Stanislav Fomichev @ 2019-08-09 16:10 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Martin KaFai Lau, Yonghong Song

Add a new helper, bpf_sk_storage_clone, which optionally clones sk
storage, and call it from sk_clone_lock().

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/bpf_sk_storage.h |  10 ++++
 include/uapi/linux/bpf.h     |   3 ++
 net/core/bpf_sk_storage.c    | 100 +++++++++++++++++++++++++++++++++--
 net/core/sock.c              |   9 ++--
 4 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
index b9dcb02e756b..8e4f831d2e52 100644
--- a/include/net/bpf_sk_storage.h
+++ b/include/net/bpf_sk_storage.h
@@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
 extern const struct bpf_func_proto bpf_sk_storage_get_proto;
 extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
 
+#ifdef CONFIG_BPF_SYSCALL
+int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
+#else
+static inline int bpf_sk_storage_clone(const struct sock *sk,
+				       struct sock *newsk)
+{
+	return 0;
+}
+#endif
+
 #endif /* _BPF_SK_STORAGE_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4393bd4b2419..0ef594ac3899 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -337,6 +337,9 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY_PROG	(1U << 7)
 #define BPF_F_WRONLY_PROG	(1U << 8)
 
+/* Clone map from listener for newly accepted socket */
+#define BPF_F_CLONE		(1U << 9)
+
 /* flags for BPF_PROG_QUERY */
 #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
 
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 94c7f77ecb6b..584e08ee0ca3 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -12,6 +12,9 @@
 
 static atomic_t cache_idx;
 
+#define SK_STORAGE_CREATE_FLAG_MASK					\
+	(BPF_F_NO_PREALLOC | BPF_F_CLONE)
+
 struct bucket {
 	struct hlist_head list;
 	raw_spinlock_t lock;
@@ -209,7 +212,6 @@ static void selem_unlink_sk(struct bpf_sk_storage_elem *selem)
 		kfree_rcu(sk_storage, rcu);
 }
 
-/* sk_storage->lock must be held and sk_storage->list cannot be empty */
 static void __selem_link_sk(struct bpf_sk_storage *sk_storage,
 			    struct bpf_sk_storage_elem *selem)
 {
@@ -509,7 +511,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
 	return 0;
 }
 
-/* Called by __sk_destruct() */
+/* Called by __sk_destruct() & bpf_sk_storage_clone() */
 void bpf_sk_storage_free(struct sock *sk)
 {
 	struct bpf_sk_storage_elem *selem;
@@ -557,6 +559,11 @@ static void bpf_sk_storage_map_free(struct bpf_map *map)
 
 	smap = (struct bpf_sk_storage_map *)map;
 
+	/* Note that this map might be concurrently cloned from
+	 * bpf_sk_storage_clone. Wait for any existing bpf_sk_storage_clone
+	 * RCU read section to finish before proceeding. New RCU
+	 * read sections should be prevented via bpf_map_inc_not_zero.
+	 */
 	synchronize_rcu();
 
 	/* bpf prog and the userspace can no longer access this map
@@ -601,7 +608,8 @@ static void bpf_sk_storage_map_free(struct bpf_map *map)
 
 static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr)
 {
-	if (attr->map_flags != BPF_F_NO_PREALLOC || attr->max_entries ||
+	if (attr->map_flags & ~SK_STORAGE_CREATE_FLAG_MASK ||
+	    attr->max_entries ||
 	    attr->key_size != sizeof(int) || !attr->value_size ||
 	    /* Enforce BTF for userspace sk dumping */
 	    !attr->btf_key_type_id || !attr->btf_value_type_id)
@@ -739,6 +747,92 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
 	return err;
 }
 
+static struct bpf_sk_storage_elem *
+bpf_sk_storage_clone_elem(struct sock *newsk,
+			  struct bpf_sk_storage_map *smap,
+			  struct bpf_sk_storage_elem *selem)
+{
+	struct bpf_sk_storage_elem *copy_selem;
+
+	copy_selem = selem_alloc(smap, newsk, NULL, true);
+	if (!copy_selem)
+		return NULL;
+
+	if (map_value_has_spin_lock(&smap->map))
+		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
+				      SDATA(selem)->data, true);
+	else
+		copy_map_value(&smap->map, SDATA(copy_selem)->data,
+			       SDATA(selem)->data);
+
+	return copy_selem;
+}
+
+int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
+{
+	struct bpf_sk_storage *new_sk_storage = NULL;
+	struct bpf_sk_storage *sk_storage;
+	struct bpf_sk_storage_elem *selem;
+	int ret;
+
+	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
+
+	rcu_read_lock();
+	sk_storage = rcu_dereference(sk->sk_bpf_storage);
+
+	if (!sk_storage || hlist_empty(&sk_storage->list))
+		goto out;
+
+	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
+		struct bpf_sk_storage_elem *copy_selem;
+		struct bpf_sk_storage_map *smap;
+		struct bpf_map *map;
+		int refold;
+
+		smap = rcu_dereference(SDATA(selem)->smap);
+		if (!(smap->map.map_flags & BPF_F_CLONE))
+			continue;
+
+		map = bpf_map_inc_not_zero(&smap->map, false);
+		if (IS_ERR(map))
+			continue;
+
+		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
+		if (!copy_selem) {
+			ret = -ENOMEM;
+			bpf_map_put(map);
+			goto err;
+		}
+
+		if (new_sk_storage) {
+			selem_link_map(smap, copy_selem);
+			__selem_link_sk(new_sk_storage, copy_selem);
+		} else {
+			ret = sk_storage_alloc(newsk, smap, copy_selem);
+			if (ret) {
+				kfree(copy_selem);
+				atomic_sub(smap->elem_size,
+					   &newsk->sk_omem_alloc);
+				bpf_map_put(map);
+				goto err;
+			}
+
+			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
+		}
+		bpf_map_put(map);
+	}
+
+out:
+	rcu_read_unlock();
+	return 0;
+
+err:
+	rcu_read_unlock();
+
+	bpf_sk_storage_free(newsk);
+	return ret;
+}
+
 BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
 	   void *, value, u64, flags)
 {
diff --git a/net/core/sock.c b/net/core/sock.c
index d57b0cc995a0..f5e801a9cea4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 			goto out;
 		}
 		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
-#ifdef CONFIG_BPF_SYSCALL
-		RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
-#endif
+
+		if (bpf_sk_storage_clone(sk, newsk)) {
+			sk_free_unlock_clone(newsk);
+			newsk = NULL;
+			goto out;
+		}
 
 		newsk->sk_err	   = 0;
 		newsk->sk_err_soft = 0;
-- 
2.23.0.rc1.153.gdeed80330f-goog



* [PATCH bpf-next v2 3/4] bpf: sync bpf.h to tools/
  2019-08-09 16:10 [PATCH bpf-next v2 0/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
  2019-08-09 16:10 ` [PATCH bpf-next v2 1/4] bpf: export bpf_map_inc_not_zero Stanislav Fomichev
  2019-08-09 16:10 ` [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
@ 2019-08-09 16:10 ` Stanislav Fomichev
  2019-08-09 16:10 ` [PATCH bpf-next v2 4/4] selftests/bpf: add sockopt clone/inheritance test Stanislav Fomichev
  3 siblings, 0 replies; 14+ messages in thread
From: Stanislav Fomichev @ 2019-08-09 16:10 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Martin KaFai Lau, Yonghong Song

Sync the new sk storage clone flag (BPF_F_CLONE) to tools/.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/include/uapi/linux/bpf.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4393bd4b2419..0ef594ac3899 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -337,6 +337,9 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY_PROG	(1U << 7)
 #define BPF_F_WRONLY_PROG	(1U << 8)
 
+/* Clone map from listener for newly accepted socket */
+#define BPF_F_CLONE		(1U << 9)
+
 /* flags for BPF_PROG_QUERY */
 #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
 
-- 
2.23.0.rc1.153.gdeed80330f-goog



* [PATCH bpf-next v2 4/4] selftests/bpf: add sockopt clone/inheritance test
  2019-08-09 16:10 [PATCH bpf-next v2 0/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2019-08-09 16:10 ` [PATCH bpf-next v2 3/4] bpf: sync bpf.h to tools/ Stanislav Fomichev
@ 2019-08-09 16:10 ` Stanislav Fomichev
  2019-08-11 23:54   ` Yonghong Song
  3 siblings, 1 reply; 14+ messages in thread
From: Stanislav Fomichev @ 2019-08-09 16:10 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Martin KaFai Lau, Yonghong Song

Add a test that calls setsockopt on the listener socket, which
triggers a BPF program. This BPF program writes to the sk storage and
sets the clone flag. Make sure that sk storage is cloned for a newly
accepted connection.

We have two cloned maps in the test to make sure we hit both cases
in bpf_sk_storage_clone: first element (sk_storage_alloc) and
non-first element(s) (selem_link_map).

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/progs/sockopt_inherit.c     |  97 +++++++
 .../selftests/bpf/test_sockopt_inherit.c      | 253 ++++++++++++++++++
 4 files changed, 353 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/sockopt_inherit.c
 create mode 100644 tools/testing/selftests/bpf/test_sockopt_inherit.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 90f70d2c7c22..60c9338cd9b4 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -42,4 +42,5 @@ xdping
 test_sockopt
 test_sockopt_sk
 test_sockopt_multi
+test_sockopt_inherit
 test_tcp_rtt
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3bd0f4a0336a..c875763a851a 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_cgroup_storage test_select_reuseport test_section_names \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
 	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
-	test_sockopt_multi test_tcp_rtt
+	test_sockopt_multi test_sockopt_inherit test_tcp_rtt
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
@@ -110,6 +110,7 @@ $(OUTPUT)/test_cgroup_attach: cgroup_helpers.c
 $(OUTPUT)/test_sockopt: cgroup_helpers.c
 $(OUTPUT)/test_sockopt_sk: cgroup_helpers.c
 $(OUTPUT)/test_sockopt_multi: cgroup_helpers.c
+$(OUTPUT)/test_sockopt_inherit: cgroup_helpers.c
 $(OUTPUT)/test_tcp_rtt: cgroup_helpers.c
 
 .PHONY: force
diff --git a/tools/testing/selftests/bpf/progs/sockopt_inherit.c b/tools/testing/selftests/bpf/progs/sockopt_inherit.c
new file mode 100644
index 000000000000..dede0fcd6102
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sockopt_inherit.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
+
+#define SOL_CUSTOM			0xdeadbeef
+#define CUSTOM_INHERIT1			0
+#define CUSTOM_INHERIT2			1
+#define CUSTOM_LISTENER			2
+
+struct sockopt_inherit {
+	__u8 val;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_CLONE);
+	__type(key, int);
+	__type(value, struct sockopt_inherit);
+} cloned1_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_CLONE);
+	__type(key, int);
+	__type(value, struct sockopt_inherit);
+} cloned2_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct sockopt_inherit);
+} listener_only_map SEC(".maps");
+
+static __inline struct sockopt_inherit *get_storage(struct bpf_sockopt *ctx)
+{
+	if (ctx->optname == CUSTOM_INHERIT1)
+		return bpf_sk_storage_get(&cloned1_map, ctx->sk, 0,
+					  BPF_SK_STORAGE_GET_F_CREATE);
+	else if (ctx->optname == CUSTOM_INHERIT2)
+		return bpf_sk_storage_get(&cloned2_map, ctx->sk, 0,
+					  BPF_SK_STORAGE_GET_F_CREATE);
+	else
+		return bpf_sk_storage_get(&listener_only_map, ctx->sk, 0,
+					  BPF_SK_STORAGE_GET_F_CREATE);
+}
+
+SEC("cgroup/getsockopt")
+int _getsockopt(struct bpf_sockopt *ctx)
+{
+	__u8 *optval_end = ctx->optval_end;
+	struct sockopt_inherit *storage;
+	__u8 *optval = ctx->optval;
+
+	if (ctx->level != SOL_CUSTOM)
+		return 1; /* only interested in SOL_CUSTOM */
+
+	if (optval + 1 > optval_end)
+		return 0; /* EPERM, bounds check */
+
+	storage = get_storage(ctx);
+	if (!storage)
+		return 0; /* EPERM, couldn't get sk storage */
+
+	ctx->retval = 0; /* Reset system call return value to zero */
+
+	optval[0] = storage->val;
+	ctx->optlen = 1;
+
+	return 1;
+}
+
+SEC("cgroup/setsockopt")
+int _setsockopt(struct bpf_sockopt *ctx)
+{
+	__u8 *optval_end = ctx->optval_end;
+	struct sockopt_inherit *storage;
+	__u8 *optval = ctx->optval;
+
+	if (ctx->level != SOL_CUSTOM)
+		return 1; /* only interested in SOL_CUSTOM */
+
+	if (optval + 1 > optval_end)
+		return 0; /* EPERM, bounds check */
+
+	storage = get_storage(ctx);
+	if (!storage)
+		return 0; /* EPERM, couldn't get sk storage */
+
+	storage->val = optval[0];
+	ctx->optlen = -1;
+
+	return 1;
+}
diff --git a/tools/testing/selftests/bpf/test_sockopt_inherit.c b/tools/testing/selftests/bpf/test_sockopt_inherit.c
new file mode 100644
index 000000000000..1bf699815b9b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sockopt_inherit.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <error.h>
+#include <errno.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <pthread.h>
+
+#include <linux/filter.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "bpf_rlimit.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+
+#define CG_PATH				"/sockopt_inherit"
+#define SOL_CUSTOM			0xdeadbeef
+#define CUSTOM_INHERIT1			0
+#define CUSTOM_INHERIT2			1
+#define CUSTOM_LISTENER			2
+
+static int connect_to_server(int server_fd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len = sizeof(addr);
+	int fd;
+
+	fd = socket(AF_INET, SOCK_STREAM, 0);
+	if (fd < 0) {
+		log_err("Failed to create client socket");
+		return -1;
+	}
+
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &len)) {
+		log_err("Failed to get server addr");
+		goto out;
+	}
+
+	if (connect(fd, (const struct sockaddr *)&addr, len) < 0) {
+		log_err("Fail to connect to server");
+		goto out;
+	}
+
+	return fd;
+
+out:
+	close(fd);
+	return -1;
+}
+
+static int verify_sockopt(int fd, int optname, const char *msg, char expected)
+{
+	socklen_t optlen = 1;
+	char buf = 0;
+	int err;
+
+	err = getsockopt(fd, SOL_CUSTOM, optname, &buf, &optlen);
+	if (err) {
+		log_err("%s: failed to call getsockopt", msg);
+		return 1;
+	}
+
+	printf("%s %d: got=0x%x ? expected=0x%x\n", msg, optname, buf, expected);
+
+	if (buf != expected) {
+		log_err("%s: unexpected getsockopt value %d != %d", msg,
+			buf, expected);
+		return 1;
+	}
+
+	return 0;
+}
+
+static void *server_thread(void *arg)
+{
+	struct sockaddr_storage addr;
+	socklen_t len = sizeof(addr);
+	int fd = *(int *)arg;
+	int client_fd;
+	int err = 0;
+
+	if (listen(fd, 1) < 0)
+		error(1, errno, "Failed to listen on socket");
+
+	err += verify_sockopt(fd, CUSTOM_INHERIT1, "listen", 1);
+	err += verify_sockopt(fd, CUSTOM_INHERIT2, "listen", 1);
+	err += verify_sockopt(fd, CUSTOM_LISTENER, "listen", 1);
+
+	client_fd = accept(fd, (struct sockaddr *)&addr, &len);
+	if (client_fd < 0)
+		error(1, errno, "Failed to accept client");
+
+	err += verify_sockopt(client_fd, CUSTOM_INHERIT1, "accept", 1);
+	err += verify_sockopt(client_fd, CUSTOM_INHERIT2, "accept", 1);
+	err += verify_sockopt(client_fd, CUSTOM_LISTENER, "accept", 0);
+
+	close(client_fd);
+
+	return (void *)(long)err;
+}
+
+static int start_server(void)
+{
+	struct sockaddr_in addr = {
+		.sin_family = AF_INET,
+		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
+	};
+	char buf;
+	int err;
+	int fd;
+	int i;
+
+	fd = socket(AF_INET, SOCK_STREAM, 0);
+	if (fd < 0) {
+		log_err("Failed to create server socket");
+		return -1;
+	}
+
+	for (i = CUSTOM_INHERIT1; i <= CUSTOM_LISTENER; i++) {
+		buf = 0x01;
+		err = setsockopt(fd, SOL_CUSTOM, i, &buf, 1);
+		if (err) {
+			log_err("Failed to call setsockopt(%d)", i);
+			close(fd);
+			return -1;
+		}
+	}
+
+	if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
+		log_err("Failed to bind socket");
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
+static int prog_attach(struct bpf_object *obj, int cgroup_fd, const char *title)
+{
+	enum bpf_attach_type attach_type;
+	enum bpf_prog_type prog_type;
+	struct bpf_program *prog;
+	int err;
+
+	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
+	if (err) {
+		log_err("Failed to deduct types for %s BPF program", title);
+		return -1;
+	}
+
+	prog = bpf_object__find_program_by_title(obj, title);
+	if (!prog) {
+		log_err("Failed to find %s BPF program", title);
+		return -1;
+	}
+
+	err = bpf_prog_attach(bpf_program__fd(prog), cgroup_fd,
+			      attach_type, 0);
+	if (err) {
+		log_err("Failed to attach %s BPF program", title);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int run_test(int cgroup_fd)
+{
+	struct bpf_prog_load_attr attr = {
+		.file = "./sockopt_inherit.o",
+	};
+	int server_fd = -1, client_fd;
+	struct bpf_object *obj;
+	void *server_err;
+	pthread_t tid;
+	int ignored;
+	int err;
+
+	err = bpf_prog_load_xattr(&attr, &obj, &ignored);
+	if (err) {
+		log_err("Failed to load BPF object");
+		return -1;
+	}
+
+	err = prog_attach(obj, cgroup_fd, "cgroup/getsockopt");
+	if (err)
+		goto close_bpf_object;
+
+	err = prog_attach(obj, cgroup_fd, "cgroup/setsockopt");
+	if (err)
+		goto close_bpf_object;
+
+	server_fd = start_server();
+	if (server_fd < 0) {
+		err = -1;
+		goto close_bpf_object;
+	}
+
+	pthread_create(&tid, NULL, server_thread, (void *)&server_fd);
+
+	client_fd = connect_to_server(server_fd);
+	if (client_fd < 0) {
+		err = -1;
+		goto close_server_fd;
+	}
+
+	err += verify_sockopt(client_fd, CUSTOM_INHERIT1, "connect", 0);
+	err += verify_sockopt(client_fd, CUSTOM_INHERIT2, "connect", 0);
+	err += verify_sockopt(client_fd, CUSTOM_LISTENER, "connect", 0);
+
+	pthread_join(tid, &server_err);
+
+	err += (int)(long)server_err;
+
+	close(client_fd);
+
+close_server_fd:
+	close(server_fd);
+close_bpf_object:
+	bpf_object__close(obj);
+	return err;
+}
+
+int main(int argc, char **argv)
+{
+	int cgroup_fd;
+	int err = EXIT_SUCCESS;
+
+	if (setup_cgroup_environment())
+		return err;
+
+	cgroup_fd = create_and_get_cgroup(CG_PATH);
+	if (cgroup_fd < 0)
+		goto cleanup_cgroup_env;
+
+	if (join_cgroup(CG_PATH))
+		goto cleanup_cgroup;
+
+	if (run_test(cgroup_fd))
+		err = EXIT_FAILURE;
+
+	printf("test_sockopt_inherit: %s\n",
+	       err == EXIT_SUCCESS ? "PASSED" : "FAILED");
+
+cleanup_cgroup:
+	close(cgroup_fd);
+cleanup_cgroup_env:
+	cleanup_cgroup_environment();
+	return err;
+}
-- 
2.23.0.rc1.153.gdeed80330f-goog



* Re: [PATCH bpf-next v2 1/4] bpf: export bpf_map_inc_not_zero
  2019-08-09 16:10 ` [PATCH bpf-next v2 1/4] bpf: export bpf_map_inc_not_zero Stanislav Fomichev
@ 2019-08-11 23:53   ` Yonghong Song
  0 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2019-08-11 23:53 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, bpf; +Cc: davem, ast, daniel, Martin Lau



On 8/9/19 9:10 AM, Stanislav Fomichev wrote:
> Rename the existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to
> indicate that it is the caller's responsibility to do proper locking.
> Create and export a bpf_map_inc_not_zero wrapper that properly takes
> map_idr_lock. It will be used in the next commit to hold a map while
> cloning a socket.
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Yonghong Song <yhs@fb.com>


* Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
  2019-08-09 16:10 ` [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
@ 2019-08-11 23:54   ` Yonghong Song
  2019-08-12 10:17   ` Daniel Borkmann
  2019-08-13  1:47   ` Martin Lau
  2 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2019-08-11 23:54 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, bpf; +Cc: davem, ast, daniel, Martin Lau



On 8/9/19 9:10 AM, Stanislav Fomichev wrote:
> Add a new helper, bpf_sk_storage_clone, which optionally clones sk
> storage, and call it from sk_clone_lock().
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Yonghong Song <yhs@fb.com>


* Re: [PATCH bpf-next v2 4/4] selftests/bpf: add sockopt clone/inheritance test
  2019-08-09 16:10 ` [PATCH bpf-next v2 4/4] selftests/bpf: add sockopt clone/inheritance test Stanislav Fomichev
@ 2019-08-11 23:54   ` Yonghong Song
  0 siblings, 0 replies; 14+ messages in thread
From: Yonghong Song @ 2019-08-11 23:54 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, bpf; +Cc: davem, ast, daniel, Martin Lau



On 8/9/19 9:10 AM, Stanislav Fomichev wrote:
> Add a test that calls setsockopt on the listener socket, which
> triggers a BPF program. This BPF program writes to the sk storage and
> sets the clone flag. Make sure that sk storage is cloned for a newly
> accepted connection.
> 
> We have two cloned maps in the test to make sure we hit both cases
> in bpf_sk_storage_clone: first element (sk_storage_alloc) and
> non-first element(s) (selem_link_map).
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Yonghong Song <yhs@fb.com>


* Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
  2019-08-09 16:10 ` [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
  2019-08-11 23:54   ` Yonghong Song
@ 2019-08-12 10:17   ` Daniel Borkmann
  2019-08-12 17:52     ` Stanislav Fomichev
  2019-08-13  1:47   ` Martin Lau
  2 siblings, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2019-08-12 10:17 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, bpf
  Cc: davem, ast, Martin KaFai Lau, Yonghong Song

On 8/9/19 6:10 PM, Stanislav Fomichev wrote:
> Add a new helper, bpf_sk_storage_clone, which optionally clones sk
> storage, and call it from sk_clone_lock().
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
[...]
> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> +{
> +	struct bpf_sk_storage *new_sk_storage = NULL;
> +	struct bpf_sk_storage *sk_storage;
> +	struct bpf_sk_storage_elem *selem;
> +	int ret;
> +
> +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> +
> +	rcu_read_lock();
> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> +
> +	if (!sk_storage || hlist_empty(&sk_storage->list))
> +		goto out;
> +
> +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> +		struct bpf_sk_storage_elem *copy_selem;
> +		struct bpf_sk_storage_map *smap;
> +		struct bpf_map *map;
> +		int refold;
> +
> +		smap = rcu_dereference(SDATA(selem)->smap);
> +		if (!(smap->map.map_flags & BPF_F_CLONE))
> +			continue;
> +
> +		map = bpf_map_inc_not_zero(&smap->map, false);
> +		if (IS_ERR(map))
> +			continue;
> +
> +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> +		if (!copy_selem) {
> +			ret = -ENOMEM;
> +			bpf_map_put(map);
> +			goto err;
> +		}
> +
> +		if (new_sk_storage) {
> +			selem_link_map(smap, copy_selem);
> +			__selem_link_sk(new_sk_storage, copy_selem);
> +		} else {
> +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> +			if (ret) {
> +				kfree(copy_selem);
> +				atomic_sub(smap->elem_size,
> +					   &newsk->sk_omem_alloc);
> +				bpf_map_put(map);
> +				goto err;
> +			}
> +
> +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> +		}
> +		bpf_map_put(map);

The map get/put combination /under/ RCU read lock seems a bit odd to me, could
you exactly describe the race that this would be preventing?

> +	}
> +
> +out:
> +	rcu_read_unlock();
> +	return 0;
> +
> +err:
> +	rcu_read_unlock();
> +
> +	bpf_sk_storage_free(newsk);
> +	return ret;
> +}

Thanks,
Daniel


* Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
  2019-08-12 10:17   ` Daniel Borkmann
@ 2019-08-12 17:52     ` Stanislav Fomichev
  2019-08-13 21:12       ` Daniel Borkmann
  0 siblings, 1 reply; 14+ messages in thread
From: Stanislav Fomichev @ 2019-08-12 17:52 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, Martin KaFai Lau,
	Yonghong Song

On 08/12, Daniel Borkmann wrote:
> On 8/9/19 6:10 PM, Stanislav Fomichev wrote:
> > Add a new helper, bpf_sk_storage_clone, which optionally clones sk
> > storage, and call it from sk_clone_lock().
> > 
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: Yonghong Song <yhs@fb.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> [...]
> > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > +{
> > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > +	struct bpf_sk_storage *sk_storage;
> > +	struct bpf_sk_storage_elem *selem;
> > +	int ret;
> > +
> > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > +
> > +	rcu_read_lock();
> > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > +
> > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > +		goto out;
> > +
> > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > +		struct bpf_sk_storage_elem *copy_selem;
> > +		struct bpf_sk_storage_map *smap;
> > +		struct bpf_map *map;
> > +		int refold;
> > +
> > +		smap = rcu_dereference(SDATA(selem)->smap);
> > +		if (!(smap->map.map_flags & BPF_F_CLONE))
> > +			continue;
> > +
> > +		map = bpf_map_inc_not_zero(&smap->map, false);
> > +		if (IS_ERR(map))
> > +			continue;
> > +
> > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > +		if (!copy_selem) {
> > +			ret = -ENOMEM;
> > +			bpf_map_put(map);
> > +			goto err;
> > +		}
> > +
> > +		if (new_sk_storage) {
> > +			selem_link_map(smap, copy_selem);
> > +			__selem_link_sk(new_sk_storage, copy_selem);
> > +		} else {
> > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > +			if (ret) {
> > +				kfree(copy_selem);
> > +				atomic_sub(smap->elem_size,
> > +					   &newsk->sk_omem_alloc);
> > +				bpf_map_put(map);
> > +				goto err;
> > +			}
> > +
> > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > +		}
> > +		bpf_map_put(map);
> 
> The map get/put combination /under/ RCU read lock seems a bit odd to me, could
> you exactly describe the race that this would be preventing?
There is a race between sk storage release and sk storage clone.
bpf_sk_storage_map_free uses synchronize_rcu to wait for all existing
users to finish, and new ones are prevented via the map's refcnt being
zero; we need to do something like that for the clone.
Martin suggested using bpf_map_inc_not_zero/bpf_map_put.
If I read everything correctly, I think without map_inc/map_put we
get the following race:

CPU0                                   CPU1

bpf_map_put
  bpf_sk_storage_map_free(smap)
    synchronize_rcu

    // no more users via bpf or
    // syscall, but clone
    // can still happen

    for each (bucket)
      selem_unlink
        selem_unlink_map(smap)

        // adding anything at
        // this point to the
        // bucket will leak

                                       rcu_read_lock
                                       tcp_v4_rcv
                                         tcp_v4_do_rcv
                                           // sk is lockless TCP_LISTEN
                                           tcp_v4_cookie_check
                                             tcp_v4_syn_recv_sock
                                               bpf_sk_storage_clone
                                                 rcu_dereference(sk->sk_bpf_storage)
                                                 selem_link_map(smap, copy)
                                                 // adding new element to the
                                                 // map -> leak
                                       rcu_read_unlock

      selem_unlink_sk
       sk->sk_bpf_storage = NULL

    synchronize_rcu


* Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
  2019-08-09 16:10 ` [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept() Stanislav Fomichev
  2019-08-11 23:54   ` Yonghong Song
  2019-08-12 10:17   ` Daniel Borkmann
@ 2019-08-13  1:47   ` Martin Lau
  2019-08-13  5:05     ` Stanislav Fomichev
  2 siblings, 1 reply; 14+ messages in thread
From: Martin Lau @ 2019-08-13  1:47 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: netdev, bpf, davem, ast, daniel, Yonghong Song

On Fri, Aug 09, 2019 at 09:10:36AM -0700, Stanislav Fomichev wrote:
> Add a new helper, bpf_sk_storage_clone, which optionally clones sk
> storage, and call it from sk_clone_lock().
Thanks for v2.  Sorry for the delay.  I am traveling.

> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/net/bpf_sk_storage.h |  10 ++++
>  include/uapi/linux/bpf.h     |   3 ++
>  net/core/bpf_sk_storage.c    | 100 +++++++++++++++++++++++++++++++++--
>  net/core/sock.c              |   9 ++--
>  4 files changed, 116 insertions(+), 6 deletions(-)
> 
> diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
> index b9dcb02e756b..8e4f831d2e52 100644
> --- a/include/net/bpf_sk_storage.h
> +++ b/include/net/bpf_sk_storage.h
> @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
>  extern const struct bpf_func_proto bpf_sk_storage_get_proto;
>  extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
> +#else
> +static inline int bpf_sk_storage_clone(const struct sock *sk,
> +				       struct sock *newsk)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #endif /* _BPF_SK_STORAGE_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4393bd4b2419..0ef594ac3899 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -337,6 +337,9 @@ enum bpf_attach_type {
>  #define BPF_F_RDONLY_PROG	(1U << 7)
>  #define BPF_F_WRONLY_PROG	(1U << 8)
>  
> +/* Clone map from listener for newly accepted socket */
> +#define BPF_F_CLONE		(1U << 9)
> +
>  /* flags for BPF_PROG_QUERY */
>  #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
>  
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index 94c7f77ecb6b..584e08ee0ca3 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -12,6 +12,9 @@
>  
>  static atomic_t cache_idx;
>  
> +#define SK_STORAGE_CREATE_FLAG_MASK					\
> +	(BPF_F_NO_PREALLOC | BPF_F_CLONE)
> +
>  struct bucket {
>  	struct hlist_head list;
>  	raw_spinlock_t lock;
> @@ -209,7 +212,6 @@ static void selem_unlink_sk(struct bpf_sk_storage_elem *selem)
>  		kfree_rcu(sk_storage, rcu);
>  }
>  
> -/* sk_storage->lock must be held and sk_storage->list cannot be empty */
>  static void __selem_link_sk(struct bpf_sk_storage *sk_storage,
>  			    struct bpf_sk_storage_elem *selem)
>  {
> @@ -509,7 +511,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
>  	return 0;
>  }
>  
> -/* Called by __sk_destruct() */
> +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
>  void bpf_sk_storage_free(struct sock *sk)
>  {
>  	struct bpf_sk_storage_elem *selem;
> @@ -557,6 +559,11 @@ static void bpf_sk_storage_map_free(struct bpf_map *map)
>  
>  	smap = (struct bpf_sk_storage_map *)map;
>  
> +	/* Note that this map might be concurrently cloned from
> +	 * bpf_sk_storage_clone. Wait for any existing bpf_sk_storage_clone
> +	 * RCU read section to finish before proceeding. New RCU
> +	 * read sections should be prevented via bpf_map_inc_not_zero.
> +	 */
Thanks!

>  	synchronize_rcu();
>  
>  	/* bpf prog and the userspace can no longer access this map
> @@ -601,7 +608,8 @@ static void bpf_sk_storage_map_free(struct bpf_map *map)
>  
>  static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr)
>  {
> -	if (attr->map_flags != BPF_F_NO_PREALLOC || attr->max_entries ||
> +	if (attr->map_flags & ~SK_STORAGE_CREATE_FLAG_MASK ||
> +	    attr->max_entries ||
I think "!(attr->map_flags & BPF_F_NO_PREALLOC)" should also be needed.

>  	    attr->key_size != sizeof(int) || !attr->value_size ||
>  	    /* Enforce BTF for userspace sk dumping */
>  	    !attr->btf_key_type_id || !attr->btf_value_type_id)
> @@ -739,6 +747,92 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
>  	return err;
>  }
>  
> +static struct bpf_sk_storage_elem *
> +bpf_sk_storage_clone_elem(struct sock *newsk,
> +			  struct bpf_sk_storage_map *smap,
> +			  struct bpf_sk_storage_elem *selem)
> +{
> +	struct bpf_sk_storage_elem *copy_selem;
> +
> +	copy_selem = selem_alloc(smap, newsk, NULL, true);
> +	if (!copy_selem)
> +		return NULL;
> +
> +	if (map_value_has_spin_lock(&smap->map))
> +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
> +				      SDATA(selem)->data, true);
> +	else
> +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
> +			       SDATA(selem)->data);
> +
> +	return copy_selem;
> +}
> +
> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> +{
> +	struct bpf_sk_storage *new_sk_storage = NULL;
> +	struct bpf_sk_storage *sk_storage;
> +	struct bpf_sk_storage_elem *selem;
> +	int ret;
> +
> +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> +
> +	rcu_read_lock();
> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> +
> +	if (!sk_storage || hlist_empty(&sk_storage->list))
> +		goto out;
> +
> +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> +		struct bpf_sk_storage_elem *copy_selem;
> +		struct bpf_sk_storage_map *smap;
> +		struct bpf_map *map;
> +		int refold;
> +
> +		smap = rcu_dereference(SDATA(selem)->smap);
> +		if (!(smap->map.map_flags & BPF_F_CLONE))
> +			continue;
> +
> +		map = bpf_map_inc_not_zero(&smap->map, false);
> +		if (IS_ERR(map))
> +			continue;
> +
> +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> +		if (!copy_selem) {
> +			ret = -ENOMEM;
> +			bpf_map_put(map);
> +			goto err;
> +		}
> +
> +		if (new_sk_storage) {
> +			selem_link_map(smap, copy_selem);
> +			__selem_link_sk(new_sk_storage, copy_selem);
> +		} else {
> +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> +			if (ret) {
> +				kfree(copy_selem);
> +				atomic_sub(smap->elem_size,
> +					   &newsk->sk_omem_alloc);
> +				bpf_map_put(map);
> +				goto err;
> +			}
> +
> +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> +		}
> +		bpf_map_put(map);
> +	}
> +
> +out:
> +	rcu_read_unlock();
> +	return 0;
> +
> +err:
> +	rcu_read_unlock();
> +
> +	bpf_sk_storage_free(newsk);
The later sk_free_unlock_clone(newsk) should eventually call
bpf_sk_storage_free(newsk) also?

Others LGTM.

> +	return ret;
> +}
> +
>  BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
>  	   void *, value, u64, flags)
>  {
> diff --git a/net/core/sock.c b/net/core/sock.c
> index d57b0cc995a0..f5e801a9cea4 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  			goto out;
>  		}
>  		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
> -#ifdef CONFIG_BPF_SYSCALL
> -		RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> -#endif
> +
> +		if (bpf_sk_storage_clone(sk, newsk)) {
> +			sk_free_unlock_clone(newsk);
> +			newsk = NULL;
> +			goto out;
> +		}
>  
>  		newsk->sk_err	   = 0;
>  		newsk->sk_err_soft = 0;
> -- 
> 2.23.0.rc1.153.gdeed80330f-goog
> 


* Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
  2019-08-13  1:47   ` Martin Lau
@ 2019-08-13  5:05     ` Stanislav Fomichev
  0 siblings, 0 replies; 14+ messages in thread
From: Stanislav Fomichev @ 2019-08-13  5:05 UTC (permalink / raw)
  To: Martin Lau
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel, Yonghong Song

On 08/13, Martin Lau wrote:
> On Fri, Aug 09, 2019 at 09:10:36AM -0700, Stanislav Fomichev wrote:
> > Add a new helper, bpf_sk_storage_clone, which optionally clones sk
> > storage, and call it from sk_clone_lock().
> Thanks for v2.  Sorry for the delay.  I am traveling.
No worries, take your time. If you're OOO, feel free to do another
round when you get back; it's not urgent.

> > 
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: Yonghong Song <yhs@fb.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/net/bpf_sk_storage.h |  10 ++++
> >  include/uapi/linux/bpf.h     |   3 ++
> >  net/core/bpf_sk_storage.c    | 100 +++++++++++++++++++++++++++++++++--
> >  net/core/sock.c              |   9 ++--
> >  4 files changed, 116 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
> > index b9dcb02e756b..8e4f831d2e52 100644
> > --- a/include/net/bpf_sk_storage.h
> > +++ b/include/net/bpf_sk_storage.h
> > @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
> >  extern const struct bpf_func_proto bpf_sk_storage_get_proto;
> >  extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
> >  
> > +#ifdef CONFIG_BPF_SYSCALL
> > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
> > +#else
> > +static inline int bpf_sk_storage_clone(const struct sock *sk,
> > +				       struct sock *newsk)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  #endif /* _BPF_SK_STORAGE_H */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4393bd4b2419..0ef594ac3899 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -337,6 +337,9 @@ enum bpf_attach_type {
> >  #define BPF_F_RDONLY_PROG	(1U << 7)
> >  #define BPF_F_WRONLY_PROG	(1U << 8)
> >  
> > +/* Clone map from listener for newly accepted socket */
> > +#define BPF_F_CLONE		(1U << 9)
> > +
> >  /* flags for BPF_PROG_QUERY */
> >  #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
> >  
> > diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> > index 94c7f77ecb6b..584e08ee0ca3 100644
> > --- a/net/core/bpf_sk_storage.c
> > +++ b/net/core/bpf_sk_storage.c
> > @@ -12,6 +12,9 @@
> >  
> >  static atomic_t cache_idx;
> >  
> > +#define SK_STORAGE_CREATE_FLAG_MASK					\
> > +	(BPF_F_NO_PREALLOC | BPF_F_CLONE)
> > +
> >  struct bucket {
> >  	struct hlist_head list;
> >  	raw_spinlock_t lock;
> > @@ -209,7 +212,6 @@ static void selem_unlink_sk(struct bpf_sk_storage_elem *selem)
> >  		kfree_rcu(sk_storage, rcu);
> >  }
> >  
> > -/* sk_storage->lock must be held and sk_storage->list cannot be empty */
> >  static void __selem_link_sk(struct bpf_sk_storage *sk_storage,
> >  			    struct bpf_sk_storage_elem *selem)
> >  {
> > @@ -509,7 +511,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
> >  	return 0;
> >  }
> >  
> > -/* Called by __sk_destruct() */
> > +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
> >  void bpf_sk_storage_free(struct sock *sk)
> >  {
> >  	struct bpf_sk_storage_elem *selem;
> > @@ -557,6 +559,11 @@ static void bpf_sk_storage_map_free(struct bpf_map *map)
> >  
> >  	smap = (struct bpf_sk_storage_map *)map;
> >  
> > +	/* Note that this map might be concurrently cloned from
> > +	 * bpf_sk_storage_clone. Wait for any existing bpf_sk_storage_clone
> > +	 * RCU read section to finish before proceeding. New RCU
> > +	 * read sections should be prevented via bpf_map_inc_not_zero.
> > +	 */
> Thanks!
> 
> >  	synchronize_rcu();
> >  
> >  	/* bpf prog and the userspace can no longer access this map
> > @@ -601,7 +608,8 @@ static void bpf_sk_storage_map_free(struct bpf_map *map)
> >  
> >  static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr)
> >  {
> > -	if (attr->map_flags != BPF_F_NO_PREALLOC || attr->max_entries ||
> > +	if (attr->map_flags & ~SK_STORAGE_CREATE_FLAG_MASK ||
> > +	    attr->max_entries ||
> I think "!(attr->map_flags & BPF_F_NO_PREALLOC)" should also be needed.
Makes sense, we always want to have BPF_F_NO_PREALLOC set. Will add,
thanks!

> >  	    attr->key_size != sizeof(int) || !attr->value_size ||
> >  	    /* Enforce BTF for userspace sk dumping */
> >  	    !attr->btf_key_type_id || !attr->btf_value_type_id)
> > @@ -739,6 +747,92 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
> >  	return err;
> >  }
> >  
> > +static struct bpf_sk_storage_elem *
> > +bpf_sk_storage_clone_elem(struct sock *newsk,
> > +			  struct bpf_sk_storage_map *smap,
> > +			  struct bpf_sk_storage_elem *selem)
> > +{
> > +	struct bpf_sk_storage_elem *copy_selem;
> > +
> > +	copy_selem = selem_alloc(smap, newsk, NULL, true);
> > +	if (!copy_selem)
> > +		return NULL;
> > +
> > +	if (map_value_has_spin_lock(&smap->map))
> > +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
> > +				      SDATA(selem)->data, true);
> > +	else
> > +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
> > +			       SDATA(selem)->data);
> > +
> > +	return copy_selem;
> > +}
> > +
> > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > +{
> > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > +	struct bpf_sk_storage *sk_storage;
> > +	struct bpf_sk_storage_elem *selem;
> > +	int ret;
> > +
> > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > +
> > +	rcu_read_lock();
> > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > +
> > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > +		goto out;
> > +
> > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > +		struct bpf_sk_storage_elem *copy_selem;
> > +		struct bpf_sk_storage_map *smap;
> > +		struct bpf_map *map;
> > +		int refold;
> > +
> > +		smap = rcu_dereference(SDATA(selem)->smap);
> > +		if (!(smap->map.map_flags & BPF_F_CLONE))
> > +			continue;
> > +
> > +		map = bpf_map_inc_not_zero(&smap->map, false);
> > +		if (IS_ERR(map))
> > +			continue;
> > +
> > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > +		if (!copy_selem) {
> > +			ret = -ENOMEM;
> > +			bpf_map_put(map);
> > +			goto err;
> > +		}
> > +
> > +		if (new_sk_storage) {
> > +			selem_link_map(smap, copy_selem);
> > +			__selem_link_sk(new_sk_storage, copy_selem);
> > +		} else {
> > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > +			if (ret) {
> > +				kfree(copy_selem);
> > +				atomic_sub(smap->elem_size,
> > +					   &newsk->sk_omem_alloc);
> > +				bpf_map_put(map);
> > +				goto err;
> > +			}
> > +
> > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > +		}
> > +		bpf_map_put(map);
> > +	}
> > +
> > +out:
> > +	rcu_read_unlock();
> > +	return 0;
> > +
> > +err:
> > +	rcu_read_unlock();
> > +
> > +	bpf_sk_storage_free(newsk);
> The later sk_free_unlock_clone(newsk) should eventually call
> bpf_sk_storage_free(newsk) also?
Hm, good point, I can drop it from here and rely on
sk_free_unlock_clone to call __sk_destruct/bpf_sk_storage_free.
That probably deserves a comment though :-)
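
Something along these lines, perhaps (rough sketch; the comment
wording is just a first stab):

err:
	rcu_read_unlock();

	/* Don't free the storage explicitly here: on failure the
	 * caller (sk_clone_lock) does sk_free_unlock_clone(newsk),
	 * which ends up in __sk_destruct -> bpf_sk_storage_free.
	 */
	return ret;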

> Others LGTM.
Thank you for a review!

> > +	return ret;
> > +}
> > +
> >  BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> >  	   void *, value, u64, flags)
> >  {
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index d57b0cc995a0..f5e801a9cea4 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> >  			goto out;
> >  		}
> >  		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
> > -#ifdef CONFIG_BPF_SYSCALL
> > -		RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > -#endif
> > +
> > +		if (bpf_sk_storage_clone(sk, newsk)) {
> > +			sk_free_unlock_clone(newsk);
> > +			newsk = NULL;
> > +			goto out;
> > +		}
> >  
> >  		newsk->sk_err	   = 0;
> >  		newsk->sk_err_soft = 0;
> > -- 
> > 2.23.0.rc1.153.gdeed80330f-goog
> > 


* Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
  2019-08-12 17:52     ` Stanislav Fomichev
@ 2019-08-13 21:12       ` Daniel Borkmann
  2019-08-13 21:28         ` Stanislav Fomichev
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2019-08-13 21:12 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, Martin KaFai Lau,
	Yonghong Song

On 8/12/19 7:52 PM, Stanislav Fomichev wrote:
> On 08/12, Daniel Borkmann wrote:
>> On 8/9/19 6:10 PM, Stanislav Fomichev wrote:
>>> Add a new helper, bpf_sk_storage_clone, which optionally clones sk
>>> storage, and call it from sk_clone_lock().
>>>
>>> Cc: Martin KaFai Lau <kafai@fb.com>
>>> Cc: Yonghong Song <yhs@fb.com>
>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> [...]
>>> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
>>> +{
>>> +	struct bpf_sk_storage *new_sk_storage = NULL;
>>> +	struct bpf_sk_storage *sk_storage;
>>> +	struct bpf_sk_storage_elem *selem;
>>> +	int ret;
>>> +
>>> +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
>>> +
>>> +	rcu_read_lock();
>>> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
>>> +
>>> +	if (!sk_storage || hlist_empty(&sk_storage->list))
>>> +		goto out;
>>> +
>>> +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
>>> +		struct bpf_sk_storage_elem *copy_selem;
>>> +		struct bpf_sk_storage_map *smap;
>>> +		struct bpf_map *map;
>>> +		int refold;
>>> +
>>> +		smap = rcu_dereference(SDATA(selem)->smap);
>>> +		if (!(smap->map.map_flags & BPF_F_CLONE))
>>> +			continue;
>>> +
>>> +		map = bpf_map_inc_not_zero(&smap->map, false);
>>> +		if (IS_ERR(map))
>>> +			continue;
>>> +
>>> +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
>>> +		if (!copy_selem) {
>>> +			ret = -ENOMEM;
>>> +			bpf_map_put(map);
>>> +			goto err;
>>> +		}
>>> +
>>> +		if (new_sk_storage) {
>>> +			selem_link_map(smap, copy_selem);
>>> +			__selem_link_sk(new_sk_storage, copy_selem);
>>> +		} else {
>>> +			ret = sk_storage_alloc(newsk, smap, copy_selem);
>>> +			if (ret) {
>>> +				kfree(copy_selem);
>>> +				atomic_sub(smap->elem_size,
>>> +					   &newsk->sk_omem_alloc);
>>> +				bpf_map_put(map);
>>> +				goto err;
>>> +			}
>>> +
>>> +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
>>> +		}
>>> +		bpf_map_put(map);
>>
>> The map get/put combination /under/ RCU read lock seems a bit odd to me, could
>> you exactly describe the race that this would be preventing?
> There is a race between sk storage release and sk storage clone.
> bpf_sk_storage_map_free uses synchronize_rcu to wait for all existing
> users to finish, and new ones are prevented via the map's refcnt being
> zero; we need to do something like that for the clone.
> Martin suggested using bpf_map_inc_not_zero/bpf_map_put.
> If I read everything correctly, I think without map_inc/map_put we
> get the following race:
> 
> CPU0                                   CPU1
> 
> bpf_map_put
>    bpf_sk_storage_map_free(smap)
>      synchronize_rcu
> 
>      // no more users via bpf or
>      // syscall, but clone
>      // can still happen
> 
>      for each (bucket)
>        selem_unlink
>          selem_unlink_map(smap)
> 
>          // adding anything at
>          // this point to the
>          // bucket will leak
> 
>                                         rcu_read_lock
>                                         tcp_v4_rcv
>                                           tcp_v4_do_rcv
>                                             // sk is lockless TCP_LISTEN
>                                             tcp_v4_cookie_check
>                                               tcp_v4_syn_recv_sock
>                                                 bpf_sk_storage_clone
>                                                   rcu_dereference(sk->sk_bpf_storage)
>                                                   selem_link_map(smap, copy)
>                                                   // adding new element to the
>                                                   // map -> leak
>                                         rcu_read_unlock
> 
>        selem_unlink_sk
>         sk->sk_bpf_storage = NULL
> 
>      synchronize_rcu
> 

Makes sense, thanks for clarifying. Perhaps a small comment on top of
the bpf_map_inc_not_zero() would be great as well, so that when
reading the code it's immediately clear from this location, too, why
this is done.

Thanks,
Daniel


* Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
  2019-08-13 21:12       ` Daniel Borkmann
@ 2019-08-13 21:28         ` Stanislav Fomichev
  0 siblings, 0 replies; 14+ messages in thread
From: Stanislav Fomichev @ 2019-08-13 21:28 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, Martin KaFai Lau,
	Yonghong Song

On 08/13, Daniel Borkmann wrote:
> On 8/12/19 7:52 PM, Stanislav Fomichev wrote:
> > On 08/12, Daniel Borkmann wrote:
> > > On 8/9/19 6:10 PM, Stanislav Fomichev wrote:
> > > > Add a new helper, bpf_sk_storage_clone, which optionally clones sk
> > > > storage, and call it from sk_clone_lock().
> > > > 
> > > > Cc: Martin KaFai Lau <kafai@fb.com>
> > > > Cc: Yonghong Song <yhs@fb.com>
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > [...]
> > > > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > > > +{
> > > > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > > > +	struct bpf_sk_storage *sk_storage;
> > > > +	struct bpf_sk_storage_elem *selem;
> > > > +	int ret;
> > > > +
> > > > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > > > +
> > > > +	rcu_read_lock();
> > > > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > > > +
> > > > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > > > +		goto out;
> > > > +
> > > > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > > > +		struct bpf_sk_storage_elem *copy_selem;
> > > > +		struct bpf_sk_storage_map *smap;
> > > > +		struct bpf_map *map;
> > > > +		int refold;
> > > > +
> > > > +		smap = rcu_dereference(SDATA(selem)->smap);
> > > > +		if (!(smap->map.map_flags & BPF_F_CLONE))
> > > > +			continue;
> > > > +
> > > > +		map = bpf_map_inc_not_zero(&smap->map, false);
> > > > +		if (IS_ERR(map))
> > > > +			continue;
> > > > +
> > > > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > > > +		if (!copy_selem) {
> > > > +			ret = -ENOMEM;
> > > > +			bpf_map_put(map);
> > > > +			goto err;
> > > > +		}
> > > > +
> > > > +		if (new_sk_storage) {
> > > > +			selem_link_map(smap, copy_selem);
> > > > +			__selem_link_sk(new_sk_storage, copy_selem);
> > > > +		} else {
> > > > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > > > +			if (ret) {
> > > > +				kfree(copy_selem);
> > > > +				atomic_sub(smap->elem_size,
> > > > +					   &newsk->sk_omem_alloc);
> > > > +				bpf_map_put(map);
> > > > +				goto err;
> > > > +			}
> > > > +
> > > > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > > > +		}
> > > > +		bpf_map_put(map);
> > > 
> > > The map get/put combination /under/ RCU read lock seems a bit odd to me, could
> > > you exactly describe the race that this would be preventing?
> > There is a race between sk storage release and sk storage clone.
> > bpf_sk_storage_map_free uses synchronize_rcu to wait for all existing
> > users to finish, and new ones are prevented via the map's refcnt being
> > zero; we need to do something like that for the clone.
> > Martin suggested using bpf_map_inc_not_zero/bpf_map_put.
> > If I read everything correctly, I think without map_inc/map_put we
> > get the following race:
> > 
> > CPU0                                   CPU1
> > 
> > bpf_map_put
> >    bpf_sk_storage_map_free(smap)
> >      synchronize_rcu
> > 
> >      // no more users via bpf or
> >      // syscall, but clone
> >      // can still happen
> > 
> >      for each (bucket)
> >        selem_unlink
> >          selem_unlink_map(smap)
> > 
> >          // adding anything at
> >          // this point to the
> >          // bucket will leak
> > 
> >                                         rcu_read_lock
> >                                         tcp_v4_rcv
> >                                           tcp_v4_do_rcv
> >                                             // sk is lockless TCP_LISTEN
> >                                             tcp_v4_cookie_check
> >                                               tcp_v4_syn_recv_sock
> >                                                 bpf_sk_storage_clone
> >                                                   rcu_dereference(sk->sk_bpf_storage)
> >                                                   selem_link_map(smap, copy)
> >                                                   // adding new element to the
> >                                                   // map -> leak
> >                                         rcu_read_unlock
> > 
> >        selem_unlink_sk
> >         sk->sk_bpf_storage = NULL
> > 
> >      synchronize_rcu
> > 
> 
> Makes sense, thanks for clarifying. Perhaps a small comment on top of
> the bpf_map_inc_not_zero() would be great as well, so that when
> reading the code it's immediately clear from this location, too, why
> this is done.
Sure, no problem; I'll add something similar to what I have before
synchronize_rcu in bpf_sk_storage_map_free.
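
E.g., a rough sketch:

	/* Either bpf_sk_storage_map_free() is already running and this
	 * map is skipped (its refcnt is zero), or we take a reference
	 * here and the map cannot be freed while the copied element
	 * is linked into it.
	 */
	map = bpf_map_inc_not_zero(&smap->map, false);
	if (IS_ERR(map))
		continue;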

> Thanks,
> Daniel

