* [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets
@ 2019-11-23 11:07 Jakub Sitnicki
  2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
                   ` (8 more replies)
  0 siblings, 9 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

This patch set makes SOCKMAP more flexible by allowing it to hold TCP
sockets that are in either established or listening state. With it,
SOCKMAP can act as a drop-in replacement for REUSEPORT_SOCKARRAY, which
reuseport BPF programs use. Granted, it is limited to TCP sockets only.

The idea started out at LPC '19 as feedback from John Fastabend to our
troubles with repurposing REUSEPORT_SOCKARRAY as a collection of
listening sockets accessed by a BPF program run on socket lookup [1].
Without going into details, REUSEPORT_SOCKARRAY proved to be tightly
coupled with reuseport logic. The talk from LPC (see slides [2] or
video [3]) highlights the problems we ran into when trying to make
REUSEPORT_SOCKARRAY work for our use-case.

Patches have evolved quite a bit since the RFC series from a month ago
[4]. To recap the RFC feedback, John pointed out that BPF redirect
helpers for SOCKMAP need sane semantics when used with listening
sockets [5], and that SOCKMAP lookup from BPF would be useful [6],
while Martin asked for UDP support [7].

As it happens, the patches needed more work to get SOCKMAP to actually
behave correctly with listening sockets. It turns out flexibility has
its price. The change log below outlines all the changes.

With more patches in the set than I would like, I left the new
features, lookup from BPF as well as UDP support, for another series.
I'm quite happy with how the changes turned out and with the test
coverage, so I'm boldly proposing it as v1 :-)

Curious to see what you think.

RFC -> v1:

- Switch from overriding proto->accept to af_ops->syn_recv_sock, which
  happens earlier. Clearing the psock state after accept() does not work
  for child sockets that become orphaned (never got accepted). v4-mapped
  sockets need special care.

- Return the socket cookie on SOCKMAP lookup from syscall to be on par with
  REUSEPORT_SOCKARRAY. Requires SOCKMAP to take u64 on lookup/update from
  syscall.

- Make bpf_sk_redirect_map (ingress) and bpf_msg_redirect_map (egress)
  SOCKMAP helpers fail when target socket is a listening one.

- Make bpf_sk_select_reuseport helper fail when target is a TCP established
  socket.

- Teach libbpf to recognize SK_REUSEPORT program type from section name.

- Add a dedicated set of tests for SOCKMAP holding listening sockets,
  covering map operations, overridden socket callbacks, and BPF helpers.

Thanks,
Jakub

[1] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
[2] https://linuxplumbersconf.org/event/4/contributions/487/
[3] https://www.youtube.com/watch?v=qRDoUpqvYjY
[4] https://lore.kernel.org/bpf/20191022113730.29303-1-jakub@cloudflare.com/
[5] https://lore.kernel.org/bpf/5db1da20174b1_5c282ada047205c046@john-XPS-13-9370.notmuch/
[6] https://lore.kernel.org/bpf/5db1d7a810bdb_5c282ada047205c08f@john-XPS-13-9370.notmuch/
[7] https://lore.kernel.org/bpf/20191028213804.yv3xfjjlayfghkcr@kafai-mbp/


Jakub Sitnicki (8):
  bpf, sockmap: Return socket cookie on lookup from syscall
  bpf, sockmap: Let all kernel-land lookup values in SOCKMAP
  bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP
  bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  bpf: Allow selecting reuseport socket from a SOCKMAP
  libbpf: Recognize SK_REUSEPORT programs from section name
  selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  selftests/bpf: Tests for SOCKMAP holding listening sockets

 include/linux/skmsg.h                         |  17 +-
 kernel/bpf/verifier.c                         |   6 +-
 net/core/filter.c                             |   2 +
 net/core/sock_map.c                           |  68 +-
 net/ipv4/tcp_bpf.c                            |  66 +-
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   9 +-
 .../bpf/progs/test_sockmap_listen_kern.c      |  75 ++
 tools/testing/selftests/bpf/test_maps.c       |   6 +-
 .../selftests/bpf/test_select_reuseport.c     | 141 ++-
 .../selftests/bpf/test_select_reuseport.sh    |  14 +
 .../selftests/bpf/test_sockmap_listen.c       | 820 ++++++++++++++++++
 13 files changed, 1170 insertions(+), 56 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_select_reuseport.sh
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_listen.c

-- 
2.20.1


* [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
@ 2019-11-23 11:07 ` Jakub Sitnicki
  2019-11-24  5:32   ` John Fastabend
  2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Tooling that populates the SOCKMAP with sockets from user-space needs a way
to inspect its contents. Returning the struct sock * that SOCKMAP holds to
user-space is neither safe nor useful. An approach established by
REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
instead.

Since socket cookies are u64 values, SOCKMAP needs to support such a
value size for lookup to be possible. This requires special handling on
update, though. Attempts to do a lookup on a SOCKMAP holding u32 values
will be met with an ENOSPC error.
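
For illustration, a minimal user-space sketch of a lookup after this
change (map setup omitted; function and variable names are made up):

  #include <bpf/bpf.h>

  /* Assumes a SOCKMAP created with value_size == sizeof(__u64) and a
   * socket already inserted at key 0. Lookup hands back the socket
   * cookie, never the struct sock pointer itself.
   */
  static int lookup_cookie(int map_fd)
  {
  	__u32 key = 0;
  	__u64 cookie;

  	if (bpf_map_lookup_elem(map_fd, &key, &cookie))
  		return -1; /* errno is ENOSPC for 4-byte values */

  	/* cookie matches getsockopt(SO_COOKIE) on that socket */
  	return 0;
  }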

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index eb114ee419b6..e8460fdc597d 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -10,6 +10,7 @@
 #include <linux/skmsg.h>
 #include <linux/list.h>
 #include <linux/jhash.h>
+#include <linux/sock_diag.h>
 
 struct bpf_stab {
 	struct bpf_map map;
@@ -31,7 +32,8 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-EPERM);
 	if (attr->max_entries == 0 ||
 	    attr->key_size    != 4 ||
-	    attr->value_size  != 4 ||
+	    (attr->value_size != sizeof(u32) &&
+	     attr->value_size != sizeof(u64)) ||
 	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
@@ -274,6 +276,23 @@ static void *sock_map_lookup(struct bpf_map *map, void *key)
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
+{
+	struct sock *sk;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (map->value_size != sizeof(u64))
+		return ERR_PTR(-ENOSPC);
+
+	sk = __sock_map_lookup_elem(map, *(u32 *)key);
+	if (!sk)
+		return ERR_PTR(-ENOENT);
+
+	sock_gen_cookie(sk);
+	return &sk->sk_cookie;
+}
+
 static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 			     struct sock **psk)
 {
@@ -399,12 +418,19 @@ static bool sock_map_sk_is_suitable(const struct sock *sk)
 static int sock_map_update_elem(struct bpf_map *map, void *key,
 				void *value, u64 flags)
 {
-	u32 ufd = *(u32 *)value;
 	u32 idx = *(u32 *)key;
 	struct socket *sock;
 	struct sock *sk;
+	u64 ufd;
 	int ret;
 
+	if (map->value_size == sizeof(u64))
+		ufd = *(u64 *)value;
+	else
+		ufd = *(u32 *)value;
+	if (ufd > S32_MAX)
+		return -EINVAL;
+
 	sock = sockfd_lookup(ufd, &ret);
 	if (!sock)
 		return ret;
@@ -500,6 +526,7 @@ const struct bpf_map_ops sock_map_ops = {
 	.map_alloc		= sock_map_alloc,
 	.map_free		= sock_map_free,
 	.map_get_next_key	= sock_map_get_next_key,
+	.map_lookup_elem_sys_only = sock_map_lookup_sys,
 	.map_update_elem	= sock_map_update_elem,
 	.map_delete_elem	= sock_map_delete_elem,
 	.map_lookup_elem	= sock_map_lookup,
-- 
2.20.1


* [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
  2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
@ 2019-11-23 11:07 ` Jakub Sitnicki
  2019-11-24  5:35   ` John Fastabend
  2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Don't require kernel code, like BPF helpers, that needs access to
SOCKMAP contents to live in the sock_map module. Expose the SOCKMAP
lookup operation to all of kernel-land.

Lookup from BPF context is not whitelisted yet, while syscalls have a
dedicated lookup handler.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index e8460fdc597d..9f572d56e81a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -273,7 +273,7 @@ static struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
 
 static void *sock_map_lookup(struct bpf_map *map, void *key)
 {
-	return ERR_PTR(-EOPNOTSUPP);
+	return __sock_map_lookup_elem(map, *(u32 *)key);
 }
 
 static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
@@ -899,6 +899,11 @@ static void sock_hash_free(struct bpf_map *map)
 	kfree(htab);
 }
 
+static void *sock_hash_lookup(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static void sock_hash_release_progs(struct bpf_map *map)
 {
 	psock_progs_drop(&container_of(map, struct bpf_htab, map)->progs);
@@ -978,7 +983,7 @@ const struct bpf_map_ops sock_hash_ops = {
 	.map_get_next_key	= sock_hash_get_next_key,
 	.map_update_elem	= sock_hash_update_elem,
 	.map_delete_elem	= sock_hash_delete_elem,
-	.map_lookup_elem	= sock_map_lookup,
+	.map_lookup_elem	= sock_hash_lookup,
 	.map_release_uref	= sock_hash_release_progs,
 	.map_check_btf		= map_check_no_btf,
 };
-- 
2.20.1


* [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
  2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
  2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
@ 2019-11-23 11:07 ` Jakub Sitnicki
  2019-11-24  5:38   ` John Fastabend
  2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

In order for SOCKMAP type to become a generic collection for storing TCP
sockets we need to loosen the checks in update callback.

Currently SOCKMAP requires the TCP socket to be in established state, which
prevents us from using it to keep references to listening sockets.

Change the update pre-checks so that it is sufficient for a socket to
be in a hash table, i.e. have a local address/port assigned, to be
inserted. Return -EINVAL if the condition is not met, to be consistent
with the REUSEPORT_SOCKARRAY map type.

This creates the possibility of pointing one of the BPF redirect
helpers, which splice two SOCKMAP sockets on ingress or egress, at a
listening socket, which doesn't make sense. Introduce appropriate
checks in the helpers so that only established TCP sockets can be a
target for redirects.
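
As a minimal sketch of what the relaxed pre-check permits from
user-space (error handling trimmed; names are made up):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  #include <bpf/bpf.h>

  /* A listening TCP socket is hashed, so it can now be inserted. A
   * freshly opened, or merely bound, socket is still rejected with
   * EINVAL.
   */
  static int insert_listener(int map_fd)
  {
  	struct sockaddr_in addr = {
  		.sin_family = AF_INET,
  		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
  	};
  	__u32 key = 0;
  	__u64 value;
  	int s;

  	s = socket(AF_INET, SOCK_STREAM, 0);
  	bind(s, (struct sockaddr *)&addr, sizeof(addr));
  	listen(s, 1);

  	value = s; /* fd, passed as a u64 value since patch 1 */
  	return bpf_map_update_elem(map_fd, &key, &value, BPF_NOEXIST);
  }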

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c                     | 28 ++++++++++++++++++-------
 tools/testing/selftests/bpf/test_maps.c |  6 +-----
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 9f572d56e81a..49744b344137 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -439,11 +439,14 @@ static int sock_map_update_elem(struct bpf_map *map, void *key,
 		ret = -EINVAL;
 		goto out;
 	}
-	if (!sock_map_sk_is_suitable(sk) ||
-	    sk->sk_state != TCP_ESTABLISHED) {
+	if (!sock_map_sk_is_suitable(sk)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
+	if (!sk_hashed(sk)) {
+		ret = -EINVAL;
+		goto out;
+	}
 
 	sock_map_sk_acquire(sk);
 	ret = sock_map_update_common(map, idx, sk, flags);
@@ -480,13 +483,17 @@ BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+	struct sock *sk;
 
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
-	tcb->bpf.flags = flags;
-	tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key);
-	if (!tcb->bpf.sk_redir)
+
+	sk = __sock_map_lookup_elem(map, key);
+	if (!sk || sk->sk_state != TCP_ESTABLISHED)
 		return SK_DROP;
+
+	tcb->bpf.flags = flags;
+	tcb->bpf.sk_redir = sk;
 	return SK_PASS;
 }
 
@@ -503,12 +510,17 @@ const struct bpf_func_proto bpf_sk_redirect_map_proto = {
 BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
+	struct sock *sk;
+
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
-	msg->flags = flags;
-	msg->sk_redir = __sock_map_lookup_elem(map, key);
-	if (!msg->sk_redir)
+
+	sk = __sock_map_lookup_elem(map, key);
+	if (!sk || sk->sk_state != TCP_ESTABLISHED)
 		return SK_DROP;
+
+	msg->flags = flags;
+	msg->sk_redir = sk;
 	return SK_PASS;
 }
 
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 02eae1e864c2..c6766b2cff85 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -756,11 +756,7 @@ static void test_sockmap(unsigned int tasks, void *data)
 	/* Test update without programs */
 	for (i = 0; i < 6; i++) {
 		err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
-		if (i < 2 && !err) {
-			printf("Allowed update sockmap '%i:%i' not in ESTABLISHED\n",
-			       i, sfd[i]);
-			goto out_sockmap;
-		} else if (i >= 2 && err) {
+		if (err) {
 			printf("Failed noprog update sockmap '%i:%i'\n",
 			       i, sfd[i]);
 			goto out_sockmap;
-- 
2.20.1


* [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (2 preceding siblings ...)
  2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
@ 2019-11-23 11:07 ` Jakub Sitnicki
  2019-11-24  5:56   ` John Fastabend
  2019-11-25 22:38   ` Martin Lau
  2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Sockets cloned from listening sockets that belong to a SOCKMAP must
not inherit the psock state. Otherwise child sockets unintentionally
share the SOCKMAP entry with the listening socket, which would lead to
use-after-free bugs.

Restore the child socket's psock state and its callbacks at the
earliest possible moment, that is, right after the child socket gets
created. This ensures that neither children that get accept()'ed nor
those that are left in the accept queue and will get orphaned
inadvertently inherit the parent's psock.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/skmsg.h | 17 +++++++++--
 net/ipv4/tcp_bpf.c    | 66 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 6cb077b646a5..b5ade8dac69d 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -98,6 +98,7 @@ struct sk_psock {
 	void (*saved_close)(struct sock *sk, long timeout);
 	void (*saved_write_space)(struct sock *sk);
 	struct proto			*sk_proto;
+	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	struct sk_psock_work_state	work_state;
 	struct work_struct		work;
 	union {
@@ -345,23 +346,30 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
 
 static inline void sk_psock_update_proto(struct sock *sk,
 					 struct sk_psock *psock,
-					 struct proto *ops)
+					 struct proto *ops,
+					 struct inet_connection_sock_af_ops *af_ops)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
 	psock->saved_unhash = sk->sk_prot->unhash;
 	psock->saved_close = sk->sk_prot->close;
 	psock->saved_write_space = sk->sk_write_space;
 
 	psock->sk_proto = sk->sk_prot;
 	sk->sk_prot = ops;
+
+	psock->icsk_af_ops = icsk->icsk_af_ops;
+	icsk->icsk_af_ops = af_ops;
 }
 
 static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
 	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {
-		struct inet_connection_sock *icsk = inet_csk(sk);
 		bool has_ulp = !!icsk->icsk_ulp_data;
 
 		if (has_ulp)
@@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 			sk->sk_prot = psock->sk_proto;
 		psock->sk_proto = NULL;
 	}
+
+	if (psock->icsk_af_ops) {
+		icsk->icsk_af_ops = psock->icsk_af_ops;
+		psock->icsk_af_ops = NULL;
+	}
 }
 
 static inline void sk_psock_set_state(struct sk_psock *psock,
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 8a56e09cfb0e..dc709949c8e5 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -10,6 +10,8 @@
 #include <net/inet_common.h>
 #include <net/tls.h>
 
+extern const struct inet_connection_sock_af_ops ipv4_specific;
+
 static bool tcp_bpf_stream_read(const struct sock *sk)
 {
 	struct sk_psock *psock;
@@ -535,6 +537,10 @@ static void tcp_bpf_remove(struct sock *sk, struct sk_psock *psock)
 {
 	struct sk_psock_link *link;
 
+	/* Did a child socket inadvertently inherit parent's psock? */
+	if (WARN_ON(sk != psock->sk))
+		return;
+
 	while ((link = sk_psock_link_pop(psock))) {
 		sk_psock_unlink(sk, link);
 		sk_psock_free_link(link);
@@ -582,6 +588,45 @@ static void tcp_bpf_close(struct sock *sk, long timeout)
 	saved_close(sk, timeout);
 }
 
+static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
+					  struct sk_buff *skb,
+					  struct request_sock *req,
+					  struct dst_entry *dst,
+					  struct request_sock *req_unhash,
+					  bool *own_req)
+{
+	const struct inet_connection_sock_af_ops *ops;
+	void (*write_space)(struct sock *sk);
+	struct sk_psock *psock;
+	struct proto *proto;
+	struct sock *child;
+
+	rcu_read_lock();
+	psock = sk_psock(sk);
+	if (likely(psock)) {
+		proto = psock->sk_proto;
+		write_space = psock->saved_write_space;
+		ops = psock->icsk_af_ops;
+	} else {
+		ops = inet_csk(sk)->icsk_af_ops;
+	}
+	rcu_read_unlock();
+
+	child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
+
+	/* Child must not inherit psock or its ops. */
+	if (child && psock) {
+		rcu_assign_sk_user_data(child, NULL);
+		child->sk_prot = proto;
+		child->sk_write_space = write_space;
+
+		/* v4-mapped sockets don't inherit parent ops. Don't restore. */
+		if (inet_csk(child)->icsk_af_ops == inet_csk(sk)->icsk_af_ops)
+			inet_csk(child)->icsk_af_ops = ops;
+	}
+	return child;
+}
+
 enum {
 	TCP_BPF_IPV4,
 	TCP_BPF_IPV6,
@@ -597,6 +642,7 @@ enum {
 static struct proto *tcpv6_prot_saved __read_mostly;
 static DEFINE_SPINLOCK(tcpv6_prot_lock);
 static struct proto tcp_bpf_prots[TCP_BPF_NUM_PROTS][TCP_BPF_NUM_CFGS];
+static struct inet_connection_sock_af_ops tcp_bpf_af_ops[TCP_BPF_NUM_PROTS];
 
 static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 				   struct proto *base)
@@ -612,13 +658,23 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 	prot[TCP_BPF_TX].sendpage		= tcp_bpf_sendpage;
 }
 
-static void tcp_bpf_check_v6_needs_rebuild(struct sock *sk, struct proto *ops)
+static void tcp_bpf_rebuild_af_ops(struct inet_connection_sock_af_ops *ops,
+				   const struct inet_connection_sock_af_ops *base)
+{
+	*ops = *base;
+	ops->syn_recv_sock = tcp_bpf_syn_recv_sock;
+}
+
+static void tcp_bpf_check_v6_needs_rebuild(struct sock *sk, struct proto *ops,
+					   const struct inet_connection_sock_af_ops *af_ops)
 {
 	if (sk->sk_family == AF_INET6 &&
 	    unlikely(ops != smp_load_acquire(&tcpv6_prot_saved))) {
 		spin_lock_bh(&tcpv6_prot_lock);
 		if (likely(ops != tcpv6_prot_saved)) {
 			tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV6], ops);
+			tcp_bpf_rebuild_af_ops(&tcp_bpf_af_ops[TCP_BPF_IPV6],
+					       af_ops);
 			smp_store_release(&tcpv6_prot_saved, ops);
 		}
 		spin_unlock_bh(&tcpv6_prot_lock);
@@ -628,6 +684,8 @@ static void tcp_bpf_check_v6_needs_rebuild(struct sock *sk, struct proto *ops)
 static int __init tcp_bpf_v4_build_proto(void)
 {
 	tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV4], &tcp_prot);
+	tcp_bpf_rebuild_af_ops(&tcp_bpf_af_ops[TCP_BPF_IPV4], &ipv4_specific);
+
 	return 0;
 }
 core_initcall(tcp_bpf_v4_build_proto);
@@ -637,7 +695,8 @@ static void tcp_bpf_update_sk_prot(struct sock *sk, struct sk_psock *psock)
 	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
 	int config = psock->progs.msg_parser   ? TCP_BPF_TX   : TCP_BPF_BASE;
 
-	sk_psock_update_proto(sk, psock, &tcp_bpf_prots[family][config]);
+	sk_psock_update_proto(sk, psock, &tcp_bpf_prots[family][config],
+			      &tcp_bpf_af_ops[family]);
 }
 
 static void tcp_bpf_reinit_sk_prot(struct sock *sk, struct sk_psock *psock)
@@ -677,6 +736,7 @@ void tcp_bpf_reinit(struct sock *sk)
 
 int tcp_bpf_init(struct sock *sk)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct proto *ops = READ_ONCE(sk->sk_prot);
 	struct sk_psock *psock;
 
@@ -689,7 +749,7 @@ int tcp_bpf_init(struct sock *sk)
 		rcu_read_unlock();
 		return -EINVAL;
 	}
-	tcp_bpf_check_v6_needs_rebuild(sk, ops);
+	tcp_bpf_check_v6_needs_rebuild(sk, ops, icsk->icsk_af_ops);
 	tcp_bpf_update_sk_prot(sk, psock);
 	rcu_read_unlock();
 	return 0;
-- 
2.20.1


* [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (3 preceding siblings ...)
  2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
@ 2019-11-23 11:07 ` Jakub Sitnicki
  2019-11-24  5:57   ` John Fastabend
  2019-11-25  1:24   ` Alexei Starovoitov
  2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

SOCKMAP now supports storing references to listening sockets. Nothing keeps
us from using it as an array of sockets to select from in SK_REUSEPORT
programs.

Whitelist the map type with the BPF helper for selecting a socket.
However, impose a restriction that the selected socket must be a
listening TCP socket or a bound UDP socket (connected or not).

The only other map type that works with the BPF reuseport helper,
REUSEPORT_SOCKARRAY, has a corresponding check in its update operation
handler.
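
To illustrate, a minimal sketch of the BPF side, mirroring the
selftest added later in this series (map and program names are made
up):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct {
  	__uint(type, BPF_MAP_TYPE_SOCKMAP);
  	__uint(max_entries, 1);
  	__type(key, int);
  	__type(value, __u64);
  } sock_map SEC(".maps");

  SEC("sk_reuseport")
  int select_from_sockmap(struct sk_reuseport_md *reuse)
  {
  	int key = 0;

  	/* Fails if the socket stored at key 0 is an established one */
  	if (bpf_sk_select_reuseport(reuse, &sock_map, &key, 0))
  		return SK_DROP;
  	return SK_PASS;
  }

  char _license[] SEC("license") = "GPL";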

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 kernel/bpf/verifier.c | 6 ++++--
 net/core/filter.c     | 2 ++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a0482e1c4a77..111a1eb543ab 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3685,7 +3685,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (func_id != BPF_FUNC_sk_redirect_map &&
 		    func_id != BPF_FUNC_sock_map_update &&
 		    func_id != BPF_FUNC_map_delete_elem &&
-		    func_id != BPF_FUNC_msg_redirect_map)
+		    func_id != BPF_FUNC_msg_redirect_map &&
+		    func_id != BPF_FUNC_sk_select_reuseport)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_SOCKHASH:
@@ -3766,7 +3767,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 			goto error;
 		break;
 	case BPF_FUNC_sk_select_reuseport:
-		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
+		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
+		    map->map_type != BPF_MAP_TYPE_SOCKMAP)
 			goto error;
 		break;
 	case BPF_FUNC_map_peek_elem:
diff --git a/net/core/filter.c b/net/core/filter.c
index 49ded4a7588a..e3fb77353248 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8723,6 +8723,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
 	selected_sk = map->ops->map_lookup_elem(map, key);
 	if (!selected_sk)
 		return -ENOENT;
+	if (!sock_flag(selected_sk, SOCK_RCU_FREE))
+		return -EINVAL;
 
 	reuse = rcu_dereference(selected_sk->sk_reuseport_cb);
 	if (!reuse)
-- 
2.20.1


* [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (4 preceding siblings ...)
  2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
@ 2019-11-23 11:07 ` Jakub Sitnicki
  2019-11-24  5:57   ` John Fastabend
  2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Allow loading BPF object files that contain SK_REUSEPORT programs
without having to manually set the program type before load if the
section name is set to "sk_reuseport".

This makes the user-space code needed to load an SK_REUSEPORT BPF
program more concise.
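
A minimal sketch of the resulting load path (the object file name is
hypothetical):

  #include <bpf/libbpf.h>

  static int load_reuseport_obj(struct bpf_object **pobj)
  {
  	struct bpf_object *obj;
  	int err;

  	obj = bpf_object__open("reuseport_prog.o");
  	err = libbpf_get_error(obj);
  	if (err)
  		return err;

  	/* Program type is inferred from SEC("sk_reuseport"); no
  	 * bpf_program__set_type() call is needed anymore.
  	 */
  	*pobj = obj;
  	return bpf_object__load(obj);
  }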

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/lib/bpf/libbpf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e1698461c6b3..dbb77283b0c6 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4987,6 +4987,7 @@ static const struct {
 	enum bpf_attach_type attach_type;
 } section_names[] = {
 	BPF_PROG_SEC("socket",			BPF_PROG_TYPE_SOCKET_FILTER),
+	BPF_PROG_SEC("sk_reuseport",		BPF_PROG_TYPE_SK_REUSEPORT),
 	BPF_PROG_SEC("kprobe/",			BPF_PROG_TYPE_KPROBE),
 	BPF_PROG_SEC("uprobe/",			BPF_PROG_TYPE_KPROBE),
 	BPF_PROG_SEC("kretprobe/",		BPF_PROG_TYPE_KPROBE),
-- 
2.20.1


* [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (5 preceding siblings ...)
  2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
@ 2019-11-23 11:07 ` Jakub Sitnicki
  2019-11-24  6:00   ` John Fastabend
  2019-11-25 22:30   ` Martin Lau
  2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
  2019-11-24  6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
  8 siblings, 2 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Parametrize the SK_REUSEPORT tests so that the map type for storing
sockets can be selected at run-time. Also allow choosing which L4
protocols get tested.

Run the extended reuseport program test twice: once for
REUSEPORT_SOCKARRAY, and once for SOCKMAP, but with TCP only, to cover
the newly enabled map type.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/bpf/Makefile          |   7 +-
 .../selftests/bpf/test_select_reuseport.c     | 141 ++++++++++++++----
 .../selftests/bpf/test_select_reuseport.sh    |  14 ++
 3 files changed, 131 insertions(+), 31 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/test_select_reuseport.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 085678d88ef8..51d16837b6ca 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -28,7 +28,7 @@ LDLIBS += -lcap -lelf -lrt -lpthread
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
 	test_sock test_btf test_sockmap get_cgroup_id_user test_socket_cookie \
-	test_cgroup_storage test_select_reuseport \
+	test_cgroup_storage \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
 	test_cgroup_attach test_progs-no_alu32
 
@@ -60,7 +60,8 @@ TEST_PROGS := test_kmod.sh \
 	test_tc_tunnel.sh \
 	test_tc_edt.sh \
 	test_xdping.sh \
-	test_bpftool_build.sh
+	test_bpftool_build.sh \
+	test_select_reuseport.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
 	with_tunnels.sh \
@@ -71,7 +72,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
-	test_lirc_mode2_user xdping
+	test_lirc_mode2_user xdping test_select_reuseport
 
 TEST_CUSTOM_PROGS = urandom_read
 
diff --git a/tools/testing/selftests/bpf/test_select_reuseport.c b/tools/testing/selftests/bpf/test_select_reuseport.c
index 7566c13eb51a..732cfeee189f 100644
--- a/tools/testing/selftests/bpf/test_select_reuseport.c
+++ b/tools/testing/selftests/bpf/test_select_reuseport.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2018 Facebook */
 
+#define _GNU_SOURCE
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdbool.h>
@@ -29,6 +30,12 @@
 #define TCP_FO_SYSCTL "/proc/sys/net/ipv4/tcp_fastopen"
 #define REUSEPORT_ARRAY_SIZE 32
 
+#define BIND_TO_INANY		true
+#define BIND_TO_LOOPBACK	(!BIND_TO_INANY)
+
+static enum bpf_map_type cfg_map_type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
+static unsigned int cfg_sock_types = (1 << SOCK_STREAM) | (1 << SOCK_DGRAM);
+
 static int result_map, tmp_index_ovr_map, linum_map, data_check_map;
 static enum result expected_results[NR_RESULTS];
 static int sk_fds[REUSEPORT_ARRAY_SIZE];
@@ -61,7 +68,7 @@ static void create_maps(void)
 
 	/* Creating reuseport_array */
 	attr.name = "reuseport_array";
-	attr.map_type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
+	attr.map_type = cfg_map_type;
 	attr.key_size = sizeof(__u32);
 	attr.value_size = sizeof(__u32);
 	attr.max_entries = REUSEPORT_ARRAY_SIZE;
@@ -680,53 +687,131 @@ static void cleanup(void)
 	bpf_object__close(obj);
 }
 
+static const char *family_to_str(int family)
+{
+	switch (family) {
+	case AF_INET:
+		return "IPv4";
+	case AF_INET6:
+		return "IPv6";
+	default:
+		return "unknown";
+	}
+}
+
+static const char *type_to_str(int type)
+{
+	switch (type) {
+	case SOCK_STREAM:
+		return "TCP";
+	case SOCK_DGRAM:
+		return "UDP";
+	default:
+		return "unknown";
+	}
+}
+
+static void test_one(int family, int type, bool inany)
+{
+	int err;
+
+	printf("######## %s/%s %-8s ########\n",
+	       family_to_str(family), type_to_str(type),
+	       inany ? "INANY" : "LOOPBACK");
+
+	setup_per_test(type, family, inany);
+
+	test_err_inner_map(type, family);
+
+	/* Install reuseport_array to the outer_map */
+	err = bpf_map_update_elem(outer_map, &index_zero, &reuseport_array,
+				  BPF_ANY);
+	CHECK(err == -1, "update_elem(outer_map)",
+	      "err:%d errno:%d\n", err, errno);
+
+	test_err_skb_data(type, family);
+	test_err_sk_select_port(type, family);
+	test_pass(type, family);
+	test_syncookie(type, family);
+	test_pass_on_err(type, family);
+	/* Must be the last test */
+	test_detach_bpf(type, family);
+
+	cleanup_per_test();
+	printf("\n");
+}
+
 static void test_all(void)
 {
-	/* Extra SOCK_STREAM to test bind_inany==true */
-	const int types[] = { SOCK_STREAM, SOCK_DGRAM, SOCK_STREAM };
-	const char * const type_strings[] = { "TCP", "UDP", "TCP" };
-	const char * const family_strings[] = { "IPv6", "IPv4" };
+	const int types[] = { SOCK_STREAM, SOCK_DGRAM };
 	const unsigned short families[] = { AF_INET6, AF_INET };
-	const bool bind_inany[] = { false, false, true };
-	int t, f, err;
+	int t, f;
 
 	for (f = 0; f < ARRAY_SIZE(families); f++) {
 		unsigned short family = families[f];
 
 		for (t = 0; t < ARRAY_SIZE(types); t++) {
-			bool inany = bind_inany[t];
 			int type = types[t];
 
-			printf("######## %s/%s %s ########\n",
-			       family_strings[f], type_strings[t],
-				inany ? " INANY  " : "LOOPBACK");
+			/* Socket type excluded from tests? */
+			if (~cfg_sock_types & (1 << type))
+				continue;
 
-			setup_per_test(type, family, inany);
+			test_one(family, type, BIND_TO_LOOPBACK);
+			test_one(family, type, BIND_TO_INANY);
+		}
+	}
+}
 
-			test_err_inner_map(type, family);
+static void __attribute__((noreturn)) usage(void)
+{
+	fprintf(stderr,
+		"Usage: %s [-m reuseport_sockarray|sockmap] [-t] [-u]\n",
+		program_invocation_short_name);
+	exit(1);
+}
 
-			/* Install reuseport_array to the outer_map */
-			err = bpf_map_update_elem(outer_map, &index_zero,
-						  &reuseport_array, BPF_ANY);
-			CHECK(err == -1, "update_elem(outer_map)",
-			      "err:%d errno:%d\n", err, errno);
+static enum bpf_map_type parse_map_type(const char *optarg)
+{
+	if (!strcmp(optarg, "reuseport_sockarray"))
+		return BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
+	if (!strcmp(optarg, "sockmap"))
+		return BPF_MAP_TYPE_SOCKMAP;
 
-			test_err_skb_data(type, family);
-			test_err_sk_select_port(type, family);
-			test_pass(type, family);
-			test_syncookie(type, family);
-			test_pass_on_err(type, family);
-			/* Must be the last test */
-			test_detach_bpf(type, family);
+	return BPF_MAP_TYPE_UNSPEC;
+}
 
-			cleanup_per_test();
-			printf("\n");
+static void parse_opts(int argc, char **argv)
+{
+	unsigned int sock_types = 0;
+	int c;
+
+	while ((c = getopt(argc, argv, "hm:tu")) != -1) {
+		switch (c) {
+		case 'h':
+			usage();
+			break;
+		case 'm':
+			cfg_map_type = parse_map_type(optarg);
+			break;
+		case 't':
+			sock_types |= 1 << SOCK_STREAM;
+			break;
+		case 'u':
+			sock_types |= 1 << SOCK_DGRAM;
+			break;
 		}
 	}
+
+	if (cfg_map_type == BPF_MAP_TYPE_UNSPEC)
+		usage();
+	if (sock_types != 0)
+		cfg_sock_types = sock_types;
 }
 
-int main(int argc, const char **argv)
+int main(int argc, char **argv)
 {
+	parse_opts(argc, argv);
 	create_maps();
 	prepare_bpf_obj();
 	saved_tcp_fo = read_int_sysctl(TCP_FO_SYSCTL);
diff --git a/tools/testing/selftests/bpf/test_select_reuseport.sh b/tools/testing/selftests/bpf/test_select_reuseport.sh
new file mode 100755
index 000000000000..1951b4886021
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_select_reuseport.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -eu
+
+DIR=$(dirname $0)
+
+echo "Testing reuseport with REUSEPORT_SOCKARRAY..."
+$DIR/test_select_reuseport -m reuseport_sockarray
+
+echo "Testing reuseport with SOCKMAP (TCP only)..."
+$DIR/test_select_reuseport -m sockmap -t
+
+exit 0
-- 
2.20.1


* [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (6 preceding siblings ...)
  2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
@ 2019-11-23 11:07 ` Jakub Sitnicki
  2019-11-24  6:04   ` John Fastabend
  2019-11-24  6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
  8 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-23 11:07 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Now that SOCKMAP can store listening sockets, its user-space and BPF
APIs are open to a new set of potential pitfalls. Exercise the map
operations, the code paths that trigger overridden socket callbacks,
and the BPF helpers that work with SOCKMAP to gain confidence that
everything works as expected.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../bpf/progs/test_sockmap_listen_kern.c      |  75 ++
 .../selftests/bpf/test_sockmap_listen.c       | 820 ++++++++++++++++++
 4 files changed, 897 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_listen.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 4865116b96c7..94e335b3f677 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -39,3 +39,4 @@ test_btf_dump
 xdping
 /no_alu32
 /bpf_gcc
+test_sockmap_listen
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 51d16837b6ca..bacb80a208e3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -30,7 +30,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_sock test_btf test_sockmap get_cgroup_id_user test_socket_cookie \
 	test_cgroup_storage \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
-	test_cgroup_attach test_progs-no_alu32
+	test_cgroup_attach test_progs-no_alu32 test_sockmap_listen
 
 # Also test bpf-gcc, if present
 ifneq ($(BPF_GCC),)
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_listen_kern.c b/tools/testing/selftests/bpf/progs/test_sockmap_listen_kern.c
new file mode 100644
index 000000000000..16d00de67be1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_listen_kern.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <errno.h>
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 2);
+	__type(key, int);
+	__type(value, __u64);
+} sock_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 2);
+	__type(key, int);
+	__type(value, unsigned int);
+} verdict_map SEC(".maps");
+
+SEC("sk_skb/stream_parser")
+int prog_skb_parser(struct __sk_buff *skb)
+{
+	return skb->len;
+}
+
+SEC("sk_skb/stream_verdict")
+int prog_skb_verdict(struct __sk_buff *skb)
+{
+	unsigned int *count;
+	int verdict;
+
+	verdict = bpf_sk_redirect_map(skb, &sock_map, 0, 0);
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+SEC("sk_msg")
+int prog_msg_verdict(struct sk_msg_md *msg)
+{
+	unsigned int *count;
+	int verdict;
+
+	verdict = bpf_msg_redirect_map(msg, &sock_map, 0, 0);
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+SEC("sk_reuseport")
+int prog_reuseport(struct sk_reuseport_md *reuse)
+{
+	unsigned int *count;
+	int err, verdict;
+	int key = 0;
+
+	err = bpf_sk_select_reuseport(reuse, &sock_map, &key, 0);
+	verdict = (!err || err == -ENOENT) ? SK_PASS : SK_DROP;
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_sockmap_listen.c b/tools/testing/selftests/bpf/test_sockmap_listen.c
new file mode 100644
index 000000000000..81ac4e319c5e
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sockmap_listen.c
@@ -0,0 +1,820 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Set of tests for SOCKMAP holding listening sockets covering:
+ *  - map operations,
+ *  - tcp_bpf socket callback overrides,
+ *  - BPF redirect helpers that work with SOCKMAP,
+ *  - BPF reuseport helper.
+ */
+
+#include <errno.h>
+#include <error.h>
+#include <limits.h>
+#include <netinet/ip.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define FAIL(fmt...) \
+	error_at_line(1, 0, __func__, __LINE__, fmt)
+#define FAIL_SYS(fmt...) \
+	error_at_line(1, errno, __func__, __LINE__, fmt)
+#define FAIL_LIBBPF(err, buf, str) ({					\
+	libbpf_strerror(err, buf, sizeof(buf));				\
+	FAIL(str ": %s", buf);						\
+})
+
+/* Fail-fast syscall wrappers */
+
+#define xsocket(domain, type, proto) ({					\
+	int __fd = socket(domain, type, proto);				\
+	if (__fd == -1)							\
+		FAIL_SYS("socket");					\
+	__fd;								\
+})
+
+#define xbind(fd, addr, len) ({						\
+	int __ret = bind(fd, addr, len);				\
+	if (__ret == -1)						\
+		FAIL_SYS("bind");					\
+	__ret;								\
+})
+
+#define xlisten(fd, backlog) ({						\
+	int __ret = listen(fd, backlog);				\
+	if (__ret == -1)						\
+		FAIL_SYS("listen");					\
+	__ret;								\
+})
+
+#define xgetsockname(fd, addr, len) ({					\
+	int __ret = getsockname(fd, addr, len);				\
+	if (__ret == -1)						\
+		FAIL_SYS("getsockname");				\
+	__ret;								\
+})
+
+#define xgetsockopt(fd, level, name, val, len) ({			\
+	int __ret = getsockopt(fd, level, name, val, len);		\
+	if (__ret == -1)						\
+		FAIL_SYS("getsockopt(" #name ")");			\
+	__ret;								\
+})
+
+#define xsetsockopt(fd, level, name, val, len) ({			\
+	int __ret = setsockopt(fd, level, name, val, len);		\
+	if (__ret == -1)						\
+		FAIL_SYS("setsockopt(" #name ")");			\
+	__ret;								\
+})
+
+#define xconnect(fd, addr, len) ({					\
+	int __ret = connect(fd, addr, len);				\
+	if (__ret == -1)						\
+		FAIL_SYS("connect");					\
+	__ret;								\
+})
+
+#define xaccept(fd, addr, len) ({					\
+	int __ret = accept(fd, addr, len);				\
+	if (__ret == -1)						\
+		FAIL_SYS("accept");					\
+	__ret;								\
+})
+
+#define xclose(fd) ({							\
+	int __ret = close(fd);						\
+	if (__ret == -1)						\
+		FAIL_SYS("close");					\
+	__ret;								\
+})
+
+#define xbpf_map_update_elem(fd, key, val, flags) ({			\
+	int __ret = bpf_map_update_elem(fd, key, val, flags);		\
+	if (__ret == -1)						\
+		FAIL_SYS("map_update");					\
+	__ret;								\
+})
+
+#define xbpf_map_delete_elem(fd, key) ({				\
+	int __ret = bpf_map_delete_elem(fd, key);			\
+	if (__ret == -1)						\
+		FAIL_SYS("map_delete");					\
+	__ret;								\
+})
+
+#define xbpf_map_lookup_elem(fd, key, val) ({				\
+	int __ret = bpf_map_lookup_elem(fd, key, val);			\
+	if (__ret == -1)						\
+		FAIL_SYS("map_lookup");					\
+	__ret;								\
+})
+
+#define xbpf_prog_attach(prog, target_fd, type, flags) ({		\
+	int __ret = bpf_prog_attach(prog, target_fd, type, flags);	\
+	if (__ret == -1)						\
+		FAIL_SYS("prog_attach(" #type ")");			\
+	__ret;								\
+})
+
+#define xbpf_prog_detach2(prog, target_fd, type) ({			\
+	int __ret = bpf_prog_detach2(prog, target_fd, type);		\
+	if (__ret == -1)						\
+		FAIL_SYS("prog_detach2(" #type ")");			\
+	__ret;								\
+})
+
+#define BPF_OBJECT_FILE "test_sockmap_listen_kern.o"
+#define MAX_STRERR_LEN 256
+#define ON_TX false
+#define ON_RX true
+
+/* Same order as in BPF object file. */
+enum {
+	SOCK_MAP = 0,
+	VERDICT_MAP,
+	MAX_MAP
+};
+
+enum {
+	SKB_PARSER_PROG = 0,
+	SKB_VERDICT_PROG,
+	MSG_VERDICT_PROG,
+	REUSEPORT_PROG,
+	MAX_PROG
+};
+
+static void init_addr_loopback4(struct sockaddr_storage *ss, socklen_t *len)
+{
+	struct sockaddr_in *addr4 = memset(ss, 0, sizeof(*ss));
+
+	addr4->sin_family = AF_INET;
+	addr4->sin_port = 0;
+	addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+	*len = sizeof(*addr4);
+}
+
+static void init_addr_loopback6(struct sockaddr_storage *ss, socklen_t *len)
+{
+	struct sockaddr_in6 *addr6 = memset(ss, 0, sizeof(*ss));
+
+	addr6->sin6_family = AF_INET6;
+	addr6->sin6_port = 0;
+	addr6->sin6_addr = in6addr_loopback;
+	*len = sizeof(*addr6);
+}
+
+static void init_addr_loopback(int family, struct sockaddr_storage *ss,
+			       socklen_t *len)
+{
+	switch (family) {
+	case AF_INET:
+		init_addr_loopback4(ss, len);
+		return;
+	case AF_INET6:
+		init_addr_loopback6(ss, len);
+		return;
+	default:
+		FAIL("unsupported address family %d", family);
+	}
+}
+
+static inline struct sockaddr *sockaddr(struct sockaddr_storage *ss)
+{
+	return (struct sockaddr *)ss;
+}
+
+static void test_sockmap_insert_invalid(int mapfd)
+{
+	u32 key = 0;
+	u64 value;
+	int err;
+
+	value = -1;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EINVAL)
+		FAIL_SYS("map_update: expected EINVAL");
+
+	value = INT_MAX;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EBADF)
+		FAIL_SYS("map_update: expected EBADF");
+}
+
+static void test_sockmap_insert_opened(int family, int socktype, int mapfd)
+{
+	u32 key = 0;
+	u64 value;
+	int err, s;
+
+	s = xsocket(family, socktype, 0);
+
+	errno = 0;
+	value = s;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EINVAL)
+		FAIL_SYS("map_update: expected EINVAL");
+	xclose(s);
+}
+
+static void test_sockmap_insert_bound(int family, int socktype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u32 key = 0;
+	u64 value;
+	int err, s;
+
+	s = xsocket(family, socktype, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+
+	errno = 0;
+	value = s;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EINVAL)
+		FAIL_SYS("map_update: expected EINVAL");
+	xclose(s);
+}
+
+static void test_sockmap_insert_listening(int family, int socktype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u32 key = 0;
+	u64 value;
+	int s;
+
+	s = xsocket(family, socktype, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+	xlisten(s, 1);
+
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xclose(s);
+}
+
+static void test_sockmap_delete_after_insert(int family, int socktype,
+					     int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u32 key = 0;
+	u64 value;
+	int s;
+
+	s = xsocket(family, socktype, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+	xlisten(s, 1);
+
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+	xclose(s);
+}
+
+static void test_sockmap_delete_after_close(int family, int socktype,
+					    int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u32 key = 0;
+	int err, s;
+
+	s = xsocket(family, socktype, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+	xlisten(s, 1);
+
+	xclose(s);
+	errno = 0;
+	err = bpf_map_delete_elem(mapfd, &key);
+	if (!err || errno != EINVAL)
+		FAIL_SYS("map_delete: expected EINVAL");
+}
+
+static void test_sockmap_lookup_after_insert(int family, int socktype,
+					     int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u64 cookie, value;
+	const int key = 0;
+	socklen_t len;
+	int s;
+
+	s = xsocket(family, socktype, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+	xlisten(s, 1);
+
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	len = sizeof(cookie);
+	xgetsockopt(s, SOL_SOCKET, SO_COOKIE, &cookie, &len);
+
+	xbpf_map_lookup_elem(mapfd, &key, &value);
+	if (value != cookie) {
+		FAIL("map_lookup: have %#llx, want %#llx",
+		     (unsigned long long)value,
+		     (unsigned long long)cookie);
+	}
+	xclose(s);
+}
+
+static void test_sockmap_lookup_after_delete(int family, int socktype,
+					     int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u32 key = 0;
+	int err, s;
+	u64 value;
+
+	s = xsocket(family, socktype, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+	xlisten(s, 1);
+
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+
+	errno = 0;
+	err = bpf_map_lookup_elem(mapfd, &key, &value);
+	if (!err || errno != ENOENT)
+		FAIL_SYS("map_lookup: expected ENOENT");
+	xclose(s);
+}
+
+static void test_sockmap_lookup_32_bit_value(int family, int socktype)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	int err, mapfd, s;
+	u32 key, value;
+
+	mapfd = bpf_create_map(BPF_MAP_TYPE_SOCKMAP,
+			       sizeof(key), sizeof(value), 1, 0);
+	if (mapfd == -1)
+		FAIL_SYS("map_create");
+
+	s = xsocket(family, socktype, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+	xlisten(s, 1);
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	errno = 0;
+	err = bpf_map_lookup_elem(mapfd, &key, &value);
+	if (!err || errno != ENOSPC)
+		FAIL_SYS("map_lookup: expected ENOSPC");
+
+	xclose(s);
+	xclose(mapfd);
+}
+
+static void test_sockmap_update_listening(int family, int socktype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u32 key = 0;
+	u64 value;
+	int s1, s2;
+
+	init_addr_loopback(family, &addr, &addrlen);
+
+	s1 = xsocket(family, socktype, 0);
+	xbind(s1, (struct sockaddr *)&addr, addrlen);
+	xlisten(s1, 1);
+
+	s2 = xsocket(family, socktype, 0);
+	xbind(s2, (struct sockaddr *)&addr, addrlen);
+	xlisten(s2, 1);
+
+	value = s1;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	value = s2;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_EXIST);
+
+	xclose(s1);
+	xclose(s2);
+}
+
+/*
+ * Exercise the code path where we destroy child sockets that never
+ * got accept()'ed, aka orphans, when parent socket gets closed.
+ */
+static void test_sockmap_destroy_orphan_child(int family, int socktype,
+					      int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u32 key = 0;
+	u64 value;
+	int s, c;
+
+	s = xsocket(family, socktype | SOCK_NONBLOCK, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+	xgetsockname(s, (struct sockaddr *)&addr, &addrlen);
+	xlisten(s, 1);
+
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	c = xsocket(family, socktype, 0);
+	xconnect(c, (struct sockaddr *)&addr, addrlen);
+
+	xclose(c);
+	xclose(s);
+}
+
+/*
+ * Exercise the listening socket SYN receive callback after removing
+ * it from SOCKMAP to ensure that callbacks get restored properly.
+ */
+static void test_sockmap_syn_recv_after_delete(int family, int socktype,
+					       int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+	u32 key = 0;
+	u64 value;
+	int s, c;
+
+	s = xsocket(family, socktype | SOCK_NONBLOCK, 0);
+	init_addr_loopback(family, &addr, &addrlen);
+	xbind(s, (struct sockaddr *)&addr, addrlen);
+	xgetsockname(s, (struct sockaddr *)&addr, &addrlen);
+	xlisten(s, 128);
+
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+
+	c = xsocket(family, socktype, 0);
+	xconnect(c, (struct sockaddr *)&addr, addrlen);
+
+	xclose(c);
+	xclose(s);
+}
+
+static void zero_verdict_count(int mapfd)
+{
+	unsigned int zero = 0;
+	int key;
+
+	key = SK_DROP;
+	xbpf_map_update_elem(mapfd, &key, &zero, BPF_ANY);
+	key = SK_PASS;
+	xbpf_map_update_elem(mapfd, &key, &zero, BPF_ANY);
+}
+
+static void redir_to_connected(int family, int socktype, int sock_mapfd,
+			       int verd_mapfd, bool on_rx)
+{
+	const char *test_name = on_rx ? "rx" : "tx";
+	struct sockaddr_storage addr;
+	int s, c0, c1, p0, p1;
+	unsigned int pass;
+	socklen_t addrlen;
+	u64 value;
+	u32 key;
+	char b;
+	int n;
+
+	init_addr_loopback(family, &addr, &addrlen);
+	zero_verdict_count(verd_mapfd);
+
+	s = xsocket(family, socktype | SOCK_NONBLOCK, 0);
+	xbind(s, sockaddr(&addr), addrlen);
+	xgetsockname(s, sockaddr(&addr), &addrlen);
+	xlisten(s, 1);
+
+	c0 = xsocket(family, socktype, 0);
+	xconnect(c0, sockaddr(&addr), addrlen);
+	p0 = xaccept(s, NULL, NULL);
+
+	c1 = xsocket(family, socktype, 0);
+	xconnect(c1, sockaddr(&addr), addrlen);
+	p1 = xaccept(s, NULL, NULL);
+
+	key = 0;
+	value = p0;
+	xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	key = 1;
+	value = p1;
+	xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+
+	n = write(on_rx ? c1 : p1, "a", 1);
+	if (n < 0)
+		FAIL_SYS("%s: write", test_name);
+
+	key = SK_PASS;
+	xbpf_map_lookup_elem(verd_mapfd, &key, &pass);
+	if (pass != 1)
+		FAIL("%s: want pass count 1, have %d", test_name, pass);
+
+	n = read(c0, &b, 1);
+	if (n < 0)
+		FAIL_SYS("%s: read", test_name);
+
+	xclose(s);
+	xclose(c0);
+	xclose(c1);
+	xclose(p0);
+	xclose(p1);
+}
+
+static void test_sockmap_skb_redir_to_connected(int family, int socktype,
+						int sock_mapfd, int verd_mapfd,
+						int parser_fd, int verdict_fd)
+{
+	xbpf_prog_attach(parser_fd, sock_mapfd, BPF_SK_SKB_STREAM_PARSER, 0);
+	xbpf_prog_attach(verdict_fd, sock_mapfd, BPF_SK_SKB_STREAM_VERDICT, 0);
+	redir_to_connected(family, socktype, sock_mapfd, verd_mapfd, ON_RX);
+	xbpf_prog_detach2(parser_fd, sock_mapfd, BPF_SK_SKB_STREAM_PARSER);
+	xbpf_prog_detach2(verdict_fd, sock_mapfd, BPF_SK_SKB_STREAM_VERDICT);
+}
+
+static void redir_to_listening(int family, int socktype, int sock_mapfd,
+			       int verd_mapfd, bool on_rx)
+{
+	const char *test_name = on_rx ? "rx" : "tx";
+	struct sockaddr_storage addr;
+	unsigned int drop;
+	socklen_t addrlen;
+	int s, c, p;
+	u64 value;
+	u32 key;
+
+	init_addr_loopback(family, &addr, &addrlen);
+	zero_verdict_count(verd_mapfd);
+
+	s = xsocket(family, socktype | SOCK_NONBLOCK, 0);
+	xbind(s, sockaddr(&addr), addrlen);
+	xgetsockname(s, sockaddr(&addr), &addrlen);
+	xlisten(s, 1);
+
+	c = xsocket(family, socktype, 0);
+	xconnect(c, sockaddr(&addr), addrlen);
+	p = xaccept(s, NULL, NULL);
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	key = 1;
+	value = p;
+	xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+
+	if (write(on_rx ? c : p, "a", 1) < 0)
+		FAIL_SYS("%s: write", test_name);
+
+	key = SK_DROP;
+	xbpf_map_lookup_elem(verd_mapfd, &key, &drop);
+	if (drop != 1)
+		FAIL("%s: want drop count 1, have %d", test_name, drop);
+
+	xclose(s);
+	xclose(c);
+	xclose(p);
+}
+
+static void test_sockmap_skb_redir_to_listening(int family, int socktype,
+						int sock_mapfd, int verd_mapfd,
+						int parser_fd, int verdict_fd)
+{
+	xbpf_prog_attach(parser_fd, sock_mapfd, BPF_SK_SKB_STREAM_PARSER, 0);
+	xbpf_prog_attach(verdict_fd, sock_mapfd, BPF_SK_SKB_STREAM_VERDICT, 0);
+	redir_to_listening(family, socktype, sock_mapfd, verd_mapfd, ON_RX);
+	xbpf_prog_detach2(parser_fd, sock_mapfd, BPF_SK_SKB_STREAM_PARSER);
+	xbpf_prog_detach2(verdict_fd, sock_mapfd, BPF_SK_SKB_STREAM_VERDICT);
+}
+
+static void test_sockmap_msg_redir_to_connected(int family, int socktype,
+						int sock_mapfd, int verd_mapfd,
+						int verdict_fd)
+{
+	xbpf_prog_attach(verdict_fd, sock_mapfd, BPF_SK_MSG_VERDICT, 0);
+	redir_to_connected(family, socktype, sock_mapfd, verd_mapfd, ON_TX);
+	xbpf_prog_detach2(verdict_fd, sock_mapfd, BPF_SK_MSG_VERDICT);
+}
+
+static void test_sockmap_msg_redir_to_listening(int family, int socktype,
+						int sock_mapfd, int verd_mapfd,
+						int verdict_fd)
+{
+	xbpf_prog_attach(verdict_fd, sock_mapfd, BPF_SK_MSG_VERDICT, 0);
+	redir_to_listening(family, socktype, sock_mapfd, verd_mapfd, ON_TX);
+	xbpf_prog_detach2(verdict_fd, sock_mapfd, BPF_SK_MSG_VERDICT);
+}
+
+static void test_sockmap_reuseport_select_listening(int family, int socktype,
+						    int sock_mapfd,
+						    int verd_mapfd,
+						    int reuseport_fd)
+{
+	struct sockaddr_storage addr;
+	unsigned int pass;
+	socklen_t addrlen;
+	int one = 1;
+	int s, c, p;
+	u64 value;
+	u32 key;
+
+	init_addr_loopback(family, &addr, &addrlen);
+	zero_verdict_count(verd_mapfd);
+
+	s = xsocket(family, socktype | SOCK_NONBLOCK, 0);
+	xsetsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
+	xsetsockopt(s, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+		    &reuseport_fd, sizeof(reuseport_fd));
+	xbind(s, sockaddr(&addr), addrlen);
+	xgetsockname(s, sockaddr(&addr), &addrlen);
+	xlisten(s, 1);
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+
+	c = xsocket(family, socktype, 0);
+	xconnect(c, sockaddr(&addr), addrlen);
+	p = xaccept(s, NULL, NULL);
+
+	key = SK_PASS;
+	xbpf_map_lookup_elem(verd_mapfd, &key, &pass);
+	if (pass != 1)
+		FAIL("want drop count 1, have %d", pass);
+
+	xclose(s);
+	xclose(c);
+	xclose(p);
+}
+
+static void test_sockmap_reuseport_select_connected(int family, int socktype,
+						    int sock_mapfd,
+						    int verd_mapfd,
+						    int reuseport_fd)
+{
+	struct sockaddr_storage addr;
+	int s, c0, c1, p;
+	unsigned int drop;
+	socklen_t addrlen;
+	int err, one = 1;
+	u64 value;
+	u32 key;
+
+	init_addr_loopback(family, &addr, &addrlen);
+	zero_verdict_count(verd_mapfd);
+
+	s = xsocket(family, socktype | SOCK_NONBLOCK, 0);
+	xsetsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
+	xsetsockopt(s, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+		    &reuseport_fd, sizeof(reuseport_fd));
+	xbind(s, sockaddr(&addr), addrlen);
+	xgetsockname(s, sockaddr(&addr), &addrlen);
+	xlisten(s, 1);
+
+	c0 = xsocket(family, socktype, 0);
+	xconnect(c0, sockaddr(&addr), addrlen);
+	p = xaccept(s, NULL, NULL);
+
+	key = 0;
+	value = p;
+	xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+
+	c1 = xsocket(family, socktype, 0);
+	errno = 0;
+	err = connect(c1, sockaddr(&addr), addrlen);
+	if (!err || errno != ECONNREFUSED)
+		FAIL_SYS("connect: expected ECONNREFUSED");
+
+	key = SK_DROP;
+	xbpf_map_lookup_elem(verd_mapfd, &key, &drop);
+	if (drop != 1)
+		FAIL("want drop count 1, have %d", drop);
+
+	xclose(s);
+	xclose(c0);
+	xclose(c1);
+	xclose(p);
+}
+
+static void run_tests(int family, int socktype, int *maps, int *progs)
+{
+	/* Test SOCKMAP map operations */
+	test_sockmap_insert_invalid(maps[SOCK_MAP]);
+	test_sockmap_insert_opened(family, socktype, maps[SOCK_MAP]);
+	test_sockmap_insert_bound(family, socktype, maps[SOCK_MAP]);
+	test_sockmap_insert_listening(family, socktype, maps[SOCK_MAP]);
+
+	test_sockmap_delete_after_insert(family, socktype, maps[SOCK_MAP]);
+	test_sockmap_delete_after_close(family, socktype, maps[SOCK_MAP]);
+
+	test_sockmap_lookup_after_insert(family, socktype, maps[SOCK_MAP]);
+	test_sockmap_lookup_after_delete(family, socktype, maps[SOCK_MAP]);
+	test_sockmap_lookup_32_bit_value(family, socktype);
+
+	test_sockmap_update_listening(family, socktype, maps[SOCK_MAP]);
+
+	/* Test overridden socket callbacks */
+	test_sockmap_destroy_orphan_child(family, socktype, maps[SOCK_MAP]);
+	test_sockmap_syn_recv_after_delete(family, socktype, maps[SOCK_MAP]);
+
+	/* Test redirect with SOCKMAP */
+	test_sockmap_skb_redir_to_connected(family, socktype,
+					    maps[SOCK_MAP], maps[VERDICT_MAP],
+					    progs[SKB_PARSER_PROG],
+					    progs[SKB_VERDICT_PROG]);
+	test_sockmap_skb_redir_to_listening(family, socktype,
+					    maps[SOCK_MAP], maps[VERDICT_MAP],
+					    progs[SKB_PARSER_PROG],
+					    progs[SKB_VERDICT_PROG]);
+	test_sockmap_msg_redir_to_connected(family, socktype,
+					    maps[SOCK_MAP], maps[VERDICT_MAP],
+					    progs[MSG_VERDICT_PROG]);
+	test_sockmap_msg_redir_to_listening(family, socktype,
+					    maps[SOCK_MAP], maps[VERDICT_MAP],
+					    progs[MSG_VERDICT_PROG]);
+
+	/* Test reuseport with SOCKMAP */
+	test_sockmap_reuseport_select_listening(family, socktype,
+						maps[SOCK_MAP],
+						maps[VERDICT_MAP],
+						progs[REUSEPORT_PROG]);
+	test_sockmap_reuseport_select_connected(family, socktype,
+						maps[SOCK_MAP],
+						maps[VERDICT_MAP],
+						progs[REUSEPORT_PROG]);
+}
+
+static struct bpf_object *load_bpf_object(const char *obj_path, int *maps,
+					  size_t n_maps, int *progs,
+					  size_t n_progs)
+{
+	char buf[MAX_STRERR_LEN];
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	struct bpf_map *map;
+	long err;
+	int i;
+
+	obj = bpf_object__open(obj_path);
+	err = libbpf_get_error(obj);
+	if (err)
+		FAIL_LIBBPF(err, buf, "object open");
+
+	err = bpf_object__load(obj);
+	if (err)
+		FAIL_LIBBPF(err, buf, "object load");
+
+	i = 0;
+	bpf_object__for_each_map(map, obj) {
+		if (i < n_maps)
+			maps[i] = bpf_map__fd(map);
+		i++;
+	}
+
+	i = 0;
+	bpf_object__for_each_program(prog, obj) {
+		if (i < n_progs)
+			progs[i] = bpf_program__fd(prog);
+		i++;
+	}
+
+	return obj;
+}
+
+static void unload_bpf_object(struct bpf_object *obj)
+{
+	char buf[MAX_STRERR_LEN];
+	long err;
+
+	err = bpf_object__unload(obj);
+	if (err) {
+		libbpf_strerror(err, buf, sizeof(buf));
+		FAIL("object unload: %s", buf);
+	}
+}
+
+int main(void)
+{
+	struct bpf_object *obj;
+	int maps[MAX_MAP];
+	int progs[MAX_PROG];
+
+	obj = load_bpf_object(BPF_OBJECT_FILE, maps, MAX_MAP, progs, MAX_PROG);
+	run_tests(AF_INET, SOCK_STREAM, maps, progs);
+	run_tests(AF_INET6, SOCK_STREAM, maps, progs);
+	unload_bpf_object(obj);
+
+	printf("PASS\n");
+	return 0;
+}
-- 
2.20.1



* RE: [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall
  2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
@ 2019-11-24  5:32   ` John Fastabend
  0 siblings, 0 replies; 39+ messages in thread
From: John Fastabend @ 2019-11-24  5:32 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> Tooling that populates the SOCKMAP with sockets from user-space needs a way
> to inspect its contents. Returning the struct sock * that SOCKMAP holds to
> user-space is neither safe nor useful. An approach established by
> REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
> instead.
> 
> Since socket cookies are u64 values SOCKMAP needs to support such a value
> size for lookup to be possible. This requires special handling on update,
> though. Attempts to do a lookup on SOCKMAP holding u32 values will be met
> with ENOSPC error.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP
  2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
@ 2019-11-24  5:35   ` John Fastabend
  0 siblings, 0 replies; 39+ messages in thread
From: John Fastabend @ 2019-11-24  5:35 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> Don't require the kernel code, like BPF helpers, that needs access to
> SOCKMAP map contents to live in the sock_map module. Expose the SOCKMAP
> lookup operation to all kernel-land.
> 
> Lookup from BPF context is not whitelisted yet, while syscalls have a
> dedicated lookup handler.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP
  2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
@ 2019-11-24  5:38   ` John Fastabend
  0 siblings, 0 replies; 39+ messages in thread
From: John Fastabend @ 2019-11-24  5:38 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> In order for SOCKMAP type to become a generic collection for storing TCP
> sockets we need to loosen the checks in update callback.
> 
> Currently SOCKMAP requires the TCP socket to be in established state, which
> prevents us from using it to keep references to listening sockets.
> 
> Change the update pre-checks so that it is sufficient for a socket to be in
> a hash table, i.e. have a local address/port assigned, to be inserted.
> Return -EINVAL if the condition is not met, to be consistent with the
> REUSEPORT_SOCKARRAY map type.
> 
> This creates a possibility of pointing one of the BPF redirect helpers that
> splice two SOCKMAP sockets on ingress or egress at a listening socket,
> which doesn't make sense. Introduce appropriate checks in the helpers so
> that only established TCP sockets can be a target for redirects.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
@ 2019-11-24  5:56   ` John Fastabend
  2019-11-25 22:38   ` Martin Lau
  1 sibling, 0 replies; 39+ messages in thread
From: John Fastabend @ 2019-11-24  5:56 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> Sockets cloned from the listening sockets that belong to a SOCKMAP must
> not inherit the psock state. Otherwise child sockets unintentionally share
> the SOCKMAP entry with the listening socket, which would lead to
> use-after-free bugs.
> 
> Restore the child socket psock state and its callbacks at the earliest
> possible moment, that is right after the child socket gets created. This
> ensures that neither children that get accept()'ed, nor those that are left
> in the accept queue and get orphaned, inadvertently inherit the parent's
> psock.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
@ 2019-11-24  5:57   ` John Fastabend
  2019-11-25  1:24   ` Alexei Starovoitov
  1 sibling, 0 replies; 39+ messages in thread
From: John Fastabend @ 2019-11-24  5:57 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> SOCKMAP now supports storing references to listening sockets. Nothing keeps
> us from using it as an array of sockets to select from in SK_REUSEPORT
> programs.
> 
> Whitelist the map type with the BPF helper for selecting socket. However,
> impose a restriction that the selected socket needs to be a listening TCP
> socket or a bound UDP socket (connected or not).
> 
> The only other map type that works with the BPF reuseport helper,
> REUSEPORT_SOCKARRAY, has a corresponding check in its update operation
> handler.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name
  2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
@ 2019-11-24  5:57   ` John Fastabend
  0 siblings, 0 replies; 39+ messages in thread
From: John Fastabend @ 2019-11-24  5:57 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> Allow loading BPF object files that contain SK_REUSEPORT programs without
> having to manually set the program type before load if the section name
> is set to "sk_reuseport".
> 
> This makes the user-space code needed to load an SK_REUSEPORT BPF program
> more concise.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
@ 2019-11-24  6:00   ` John Fastabend
  2019-11-25 22:30   ` Martin Lau
  1 sibling, 0 replies; 39+ messages in thread
From: John Fastabend @ 2019-11-24  6:00 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
> can be selected at run-time. Also allow choosing which L4 protocols get
> tested.
> 
> Run the extended reuseport program test two times, once for
> REUSEPORT_ARRAY, and once for SOCKMAP but just with TCP to cover the newly
> enabled map type.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets
  2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
@ 2019-11-24  6:04   ` John Fastabend
  0 siblings, 0 replies; 39+ messages in thread
From: John Fastabend @ 2019-11-24  6:04 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> Now that SOCKMAP can store listening sockets, its user-space and BPF API
> is open to a new set of potential pitfalls. Exercise the map operations,
> code paths that trigger overridden socket callbacks, and BPF helpers
> that work with SOCKMAP to gain confidence that all works as expected.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets
  2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (7 preceding siblings ...)
  2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
@ 2019-11-24  6:10 ` John Fastabend
  2019-11-25  9:22   ` Jakub Sitnicki
  8 siblings, 1 reply; 39+ messages in thread
From: John Fastabend @ 2019-11-24  6:10 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf; +Cc: netdev, kernel-team, John Fastabend, Martin KaFai Lau

Jakub Sitnicki wrote:
> This patch set makes SOCKMAP more flexible by allowing it to hold TCP
> sockets that are either in established or listening state. With it SOCKMAP
> can act as a drop-in replacement for REUSEPORT_SOCKARRAY which reuseport
> BPF programs use. Granted, it is limited to only TCP sockets.
> 
> The idea started out at LPC '19 as feedback from John Fastabend to our
> troubles with repurposing REUSEPORT_SOCKARRAY as a collection of listening
> sockets accessed by a BPF program ran on socket lookup [1]. Without going
> into details, REUSEPORT_SOCKARRAY proved to be tightly coupled with
> reuseport logic. Talk from LPC (see slides [2] or video [3]) highlights
> what problems we ran into when trying to make REUSEPORT_SOCKARRAY work for
> our use-case.
> 
> Patches have evolved quite a bit since the RFC series from a month ago
> [4]. To recap the RFC feedback, John pointed out that BPF redirect helpers
> for SOCKMAP need sane semantics when used with listening sockets [5], and
> that SOCKMAP lookup from BPF would be useful [6]. While Martin asked for
> UDP support [7].

Curious if you've started looking into UDP support. I had hoped to do
it but haven't got there yet.

> 
> As it happens, patches needed more work to get SOCKMAP to actually behave
> correctly with listening sockets. It turns out flexibility has its
> price. Change log below outlines them all.
> 

But it looks pretty clean to me; the only major change here is adding an
extra hook to remove the psock from the child socket. That looks fine to me
and cleaner than any other solution I had in mind.

The +/- changes look good as well; most of the updates are in selftests,
updating existing tests and adding some new ones. +1

> With more than I would like patches in the set, I left the new features,
> lookup from BPF as well as UDP support, for another series. I'm quite happy
> with how the changes turned out and the test coverage so I'm boldly
> proposing it as v1 :-)
> 
> Curious to see what you think.

Ack on the series from me.

> 
> RFC -> v1:
> 
> - Switch from overriding proto->accept to af_ops->syn_recv_sock, which
>   happens earlier. Clearing the psock state after accept() does not work
>   for child sockets that become orphaned (never got accepted). v4-mapped
>   sockets need special care.
> 
> - Return the socket cookie on SOCKMAP lookup from syscall to be on par with
>   REUSEPORT_SOCKARRAY. Requires SOCKMAP to take u64 on lookup/update from
>   syscall.
> 
> - Make bpf_sk_redirect_map (ingress) and bpf_msg_redirect_map (egress)
>   SOCKMAP helpers fail when target socket is a listening one.
> 
> - Make bpf_sk_select_reuseport helper fail when target is a TCP established
>   socket.
> 
> - Teach libbpf to recognize SK_REUSEPORT program type from section name.
> 
> - Add a dedicated set of tests for SOCKMAP holding listening sockets,
>   covering map operations, overridden socket callbacks, and BPF helpers.
> 
> Thanks,
> Jakub
> 
> [1] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
> [2] https://linuxplumbersconf.org/event/4/contributions/487/
> [3] https://www.youtube.com/watch?v=qRDoUpqvYjY
> [4] https://lore.kernel.org/bpf/20191022113730.29303-1-jakub@cloudflare.com/
> [5] https://lore.kernel.org/bpf/5db1da20174b1_5c282ada047205c046@john-XPS-13-9370.notmuch/
> [6] https://lore.kernel.org/bpf/5db1d7a810bdb_5c282ada047205c08f@john-XPS-13-9370.notmuch/
> [7] https://lore.kernel.org/bpf/20191028213804.yv3xfjjlayfghkcr@kafai-mbp/
> 
> 
> Jakub Sitnicki (8):
>   bpf, sockmap: Return socket cookie on lookup from syscall
>   bpf, sockmap: Let all kernel-land lookup values in SOCKMAP
>   bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP
>   bpf, sockmap: Don't let child socket inherit psock or its ops on copy
>   bpf: Allow selecting reuseport socket from a SOCKMAP
>   libbpf: Recognize SK_REUSEPORT programs from section name
>   selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
>   selftests/bpf: Tests for SOCKMAP holding listening sockets
> 
>  include/linux/skmsg.h                         |  17 +-
>  kernel/bpf/verifier.c                         |   6 +-
>  net/core/filter.c                             |   2 +
>  net/core/sock_map.c                           |  68 +-
>  net/ipv4/tcp_bpf.c                            |  66 +-
>  tools/lib/bpf/libbpf.c                        |   1 +
>  tools/testing/selftests/bpf/.gitignore        |   1 +
>  tools/testing/selftests/bpf/Makefile          |   9 +-
>  .../bpf/progs/test_sockmap_listen_kern.c      |  75 ++
>  tools/testing/selftests/bpf/test_maps.c       |   6 +-
>  .../selftests/bpf/test_select_reuseport.c     | 141 ++-
>  .../selftests/bpf/test_select_reuseport.sh    |  14 +
>  .../selftests/bpf/test_sockmap_listen.c       | 820 ++++++++++++++++++
>  13 files changed, 1170 insertions(+), 56 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen_kern.c
>  create mode 100755 tools/testing/selftests/bpf/test_select_reuseport.sh
>  create mode 100644 tools/testing/selftests/bpf/test_sockmap_listen.c
> 
> -- 
> 2.20.1
> 




* Re: [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
  2019-11-24  5:57   ` John Fastabend
@ 2019-11-25  1:24   ` Alexei Starovoitov
  2019-11-25  4:17     ` John Fastabend
  1 sibling, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2019-11-25  1:24 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: bpf, netdev, kernel-team, John Fastabend, Martin KaFai Lau

On Sat, Nov 23, 2019 at 12:07:48PM +0100, Jakub Sitnicki wrote:
> SOCKMAP now supports storing references to listening sockets. Nothing keeps
> us from using it as an array of sockets to select from in SK_REUSEPORT
> programs.
> 
> Whitelist the map type with the BPF helper for selecting socket. However,
> impose a restriction that the selected socket needs to be a listening TCP
> socket or a bound UDP socket (connected or not).
> 
> The only other map type that works with the BPF reuseport helper,
> REUSEPORT_SOCKARRAY, has a corresponding check in its update operation
> handler.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  kernel/bpf/verifier.c | 6 ++++--
>  net/core/filter.c     | 2 ++
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index a0482e1c4a77..111a1eb543ab 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3685,7 +3685,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		if (func_id != BPF_FUNC_sk_redirect_map &&
>  		    func_id != BPF_FUNC_sock_map_update &&
>  		    func_id != BPF_FUNC_map_delete_elem &&
> -		    func_id != BPF_FUNC_msg_redirect_map)
> +		    func_id != BPF_FUNC_msg_redirect_map &&
> +		    func_id != BPF_FUNC_sk_select_reuseport)
>  			goto error;
>  		break;
>  	case BPF_MAP_TYPE_SOCKHASH:
> @@ -3766,7 +3767,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  			goto error;
>  		break;
>  	case BPF_FUNC_sk_select_reuseport:
> -		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
> +		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
> +		    map->map_type != BPF_MAP_TYPE_SOCKMAP)
>  			goto error;
>  		break;
>  	case BPF_FUNC_map_peek_elem:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 49ded4a7588a..e3fb77353248 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -8723,6 +8723,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>  	selected_sk = map->ops->map_lookup_elem(map, key);
>  	if (!selected_sk)
>  		return -ENOENT;
> +	if (!sock_flag(selected_sk, SOCK_RCU_FREE))
> +		return -EINVAL;

hmm. I wonder whether this breaks existing users...
Martin,
what do you think?
Could you also take a look at other patches too?
In particular patch 7?



* Re: [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-25  1:24   ` Alexei Starovoitov
@ 2019-11-25  4:17     ` John Fastabend
  2019-11-25 10:40       ` Jakub Sitnicki
  0 siblings, 1 reply; 39+ messages in thread
From: John Fastabend @ 2019-11-25  4:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, John Fastabend, Martin KaFai Lau

Alexei Starovoitov wrote:
> On Sat, Nov 23, 2019 at 12:07:48PM +0100, Jakub Sitnicki wrote:
> > SOCKMAP now supports storing references to listening sockets. Nothing keeps
> > us from using it as an array of sockets to select from in SK_REUSEPORT
> > programs.
> > 
> > Whitelist the map type with the BPF helper for selecting socket. However,
> > impose a restriction that the selected socket needs to be a listening TCP
> > socket or a bound UDP socket (connected or not).
> > 
> > The only other map type that works with the BPF reuseport helper,
> > REUSEPORT_SOCKARRAY, has a corresponding check in its update operation
> > handler.
> > 
> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> > ---

[...]

> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 49ded4a7588a..e3fb77353248 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -8723,6 +8723,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
> >  	selected_sk = map->ops->map_lookup_elem(map, key);
> >  	if (!selected_sk)
> >  		return -ENOENT;
> > +	if (!sock_flag(selected_sk, SOCK_RCU_FREE))
> > +		return -EINVAL;
> 
> hmm. I wonder whether this breaks existing users...

There is already this check in reuseport_array_update_check()

	/*
	 * sk must be hashed (i.e. listening in the TCP case or binded
	 * in the UDP case) and
	 * it must also be a SO_REUSEPORT sk (i.e. reuse cannot be NULL).
	 *
	 * Also, sk will be used in bpf helper that is protected by
	 * rcu_read_lock().
	 */
	if (!sock_flag(nsk, SOCK_RCU_FREE) || !sk_hashed(nsk) || !nsk_reuse)
		return -EINVAL;

So I believe it should not cause any problems with existing users. Perhaps
we could consolidate the checks a bit or move it into the update paths if we
wanted. I assume Jakub was just ensuring we don't get here with SOCK_RCU_FREE
set from any of the new paths now. I'll let him answer though.

> Martin,
> what do you think?

More eyes the better.

> Could you also take a look at other patches too?
> In particular patch 7?
> 

Agreed would be good to give 7/8 a look I'm not too familiar with the
selftests there.


* Re: [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets
  2019-11-24  6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
@ 2019-11-25  9:22   ` Jakub Sitnicki
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-25  9:22 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, kernel-team, Martin KaFai Lau

On Sun, Nov 24, 2019 at 07:10 AM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> This patch set makes SOCKMAP more flexible by allowing it to hold TCP
>> sockets that are either in established or listening state. With it SOCKMAP
>> can act as a drop-in replacement for REUSEPORT_SOCKARRAY which reuseport
>> BPF programs use. Granted, it is limited to only TCP sockets.
>>
>> The idea started out at LPC '19 as feedback from John Fastabend to our
>> troubles with repurposing REUSEPORT_SOCKARRAY as a collection of listening
>> sockets accessed by a BPF program ran on socket lookup [1]. Without going
>> into details, REUSEPORT_SOCKARRAY proved to be tightly coupled with
>> reuseport logic. Talk from LPC (see slides [2] or video [3]) highlights
>> what problems we ran into when trying to make REUSEPORT_SOCKARRAY work for
>> our use-case.
>>
>> Patches have evolved quite a bit since the RFC series from a month ago
>> [4]. To recap the RFC feedback, John pointed out that BPF redirect helpers
>> for SOCKMAP need sane semantics when used with listening sockets [5], and
>> that SOCKMAP lookup from BPF would be useful [6]. While Martin asked for
>> UDP support [7].
>
> Curious if you've started looking into UDP support. I had hoped to do
> it but haven't got there yet.

No, not yet. I only made sure the newly added tests were easy to modify
to cover UDP by not hard-coding the socket type.
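
With the socket type parametrized, UDP coverage should mostly boil down
to extra invocations in main(), modulo skipping the TCP-only cases; a
sketch:

	run_tests(AF_INET, SOCK_DGRAM, maps, progs);
	run_tests(AF_INET6, SOCK_DGRAM, maps, progs);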

I expect to break ground with UDP work soon, though, right after I push
out another iteration of programmable socket lookup [1] patches adapted for
SOCKMAP, which we've been testing internally.

>> As it happens, patches needed more work to get SOCKMAP to actually behave
>> correctly with listening sockets. It turns out flexibility has its
>> price. Change log below outlines them all.
>>
>
> But it looks pretty clean to me; the only major change here is adding an
> extra hook to remove the psock from the child socket. That looks fine to me
> and cleaner than any other solution I had in mind.
>
> The +/- changes look good as well; most of the updates are in selftests,
> updating existing tests and adding some new ones. +1

Thanks for taking a look at the patches so quickly. I appreciate it.

-Jakub

[1] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/


* Re: [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-25  4:17     ` John Fastabend
@ 2019-11-25 10:40       ` Jakub Sitnicki
  2019-11-25 22:07         ` Martin Lau
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-25 10:40 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, bpf, netdev, kernel-team, Martin KaFai Lau

On Mon, Nov 25, 2019 at 05:17 AM CET, John Fastabend wrote:
> Alexei Starovoitov wrote:
>> On Sat, Nov 23, 2019 at 12:07:48PM +0100, Jakub Sitnicki wrote:
>> > SOCKMAP now supports storing references to listening sockets. Nothing keeps
>> > us from using it as an array of sockets to select from in SK_REUSEPORT
>> > programs.
>> >
>> > Whitelist the map type with the BPF helper for selecting socket. However,
>> > impose a restriction that the selected socket needs to be a listening TCP
>> > socket or a bound UDP socket (connected or not).
>> >
>> > The only other map type that works with the BPF reuseport helper,
>> > REUSEPORT_SOCKARRAY, has a corresponding check in its update operation
>> > handler.
>> >
>> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> > ---
>
> [...]
>
>> > diff --git a/net/core/filter.c b/net/core/filter.c
>> > index 49ded4a7588a..e3fb77353248 100644
>> > --- a/net/core/filter.c
>> > +++ b/net/core/filter.c
>> > @@ -8723,6 +8723,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>> >  	selected_sk = map->ops->map_lookup_elem(map, key);
>> >  	if (!selected_sk)
>> >  		return -ENOENT;
>> > +	if (!sock_flag(selected_sk, SOCK_RCU_FREE))
>> > +		return -EINVAL;
>>
>> hmm. I wonder whether this breaks existing users...
>
> There is already this check in reuseport_array_update_check()
>
> 	/*
> 	 * sk must be hashed (i.e. listening in the TCP case or binded
> 	 * in the UDP case) and
> 	 * it must also be a SO_REUSEPORT sk (i.e. reuse cannot be NULL).
> 	 *
> 	 * Also, sk will be used in bpf helper that is protected by
> 	 * rcu_read_lock().
> 	 */
> 	if (!sock_flag(nsk, SOCK_RCU_FREE) || !sk_hashed(nsk) || !nsk_reuse)
> 		return -EINVAL;
>
> So I believe it should not cause any problems with existing users. Perhaps
> we could consolidate the checks a bit or move it into the update paths if we
> wanted. I assume Jakub was just ensuring we don't get here with SOCK_RCU_FREE
> set from any of the new paths now. I'll let him answer though.

That was exactly my thinking here.

REUSEPORT_SOCKARRAY can't be populated with sockets that don't have
SOCK_RCU_FREE set. This makes the flag check in sk_select_reuseport BPF
helper redundant for this map type.

SOCKMAP, OTOH, allows storing established TCP sockets, which don't have
SOCK_RCU_FREE flag and shouldn't be used as reuseport targets. The newly
added check protects us against it.
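
To illustrate the intended usage, a minimal SK_REUSEPORT program
selecting from a SOCKMAP could look like this (just a sketch; the map
and program names are made up):

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") sock_map = {
	.type        = BPF_MAP_TYPE_SOCKMAP,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u64),
	.max_entries = 1,
};

SEC("sk_reuseport")
int select_sock(struct sk_reuseport_md *reuse_md)
{
	__u32 key = 0;

	/* Fails with -EINVAL when slot 0 holds an established,
	 * non-SOCK_RCU_FREE socket.
	 */
	if (bpf_sk_select_reuseport(reuse_md, &sock_map, &key, 0))
		return SK_DROP;

	return SK_PASS;
}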

I have a couple tests in the last patch for it -
test_sockmap_reuseport_select_{listening,connected}. Admittedly, UDP is
not covered.

Not sure how we could go about moving the checks to the update path for
SOCKMAP. At update time we don't know if the map will be used with a
reuseport or a sk_{skb,msg} program.

-Jakub

>
>> Martin,
>> what do you think?
>
> More eyes the better.
>
>> Could you also take a look at other patches too?
>> In particular patch 7?
>>
>
> Agreed would be good to give 7/8 a look I'm not too familiar with the
> selftests there.


* Re: [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-25 10:40       ` Jakub Sitnicki
@ 2019-11-25 22:07         ` Martin Lau
  2019-11-26 14:30           ` Jakub Sitnicki
  0 siblings, 1 reply; 39+ messages in thread
From: Martin Lau @ 2019-11-25 22:07 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: John Fastabend, Alexei Starovoitov, bpf, netdev, kernel-team

On Mon, Nov 25, 2019 at 11:40:41AM +0100, Jakub Sitnicki wrote:
> On Mon, Nov 25, 2019 at 05:17 AM CET, John Fastabend wrote:
> > Alexei Starovoitov wrote:
> >> On Sat, Nov 23, 2019 at 12:07:48PM +0100, Jakub Sitnicki wrote:
> >> > SOCKMAP now supports storing references to listening sockets. Nothing keeps
> >> > us from using it as an array of sockets to select from in SK_REUSEPORT
> >> > programs.
> >> >
> >> > Whitelist the map type with the BPF helper for selecting socket. However,
> >> > impose a restriction that the selected socket needs to be a listening TCP
> >> > socket or a bound UDP socket (connected or not).
> >> >
> >> > The only other map type that works with the BPF reuseport helper,
> >> > REUSEPORT_SOCKARRAY, has a corresponding check in its update operation
> >> > handler.
> >> >
> >> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> > ---
> >
> > [...]
> >
> >> > diff --git a/net/core/filter.c b/net/core/filter.c
> >> > index 49ded4a7588a..e3fb77353248 100644
> >> > --- a/net/core/filter.c
> >> > +++ b/net/core/filter.c
> >> > @@ -8723,6 +8723,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
> >> >  	selected_sk = map->ops->map_lookup_elem(map, key);
> >> >  	if (!selected_sk)
> >> >  		return -ENOENT;
> >> > +	if (!sock_flag(selected_sk, SOCK_RCU_FREE))
> >> > +		return -EINVAL;
If I read it correctly,
this is to avoid the following "if (!reuse)" returning -ENOENT,
and instead return -EINVAL for a non-TCP_LISTEN tcp_sock.
It should at least only be done under the "if (!reuse)" then.

Checking SOCK_RCU_FREE to imply TCP_LISTEN is not ideal.
It is not immediately obvious.  Why not directly check
TCP_LISTEN?
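
That is, roughly (just a sketch of where the check could live):

	reuse = rcu_dereference(selected_sk->sk_reuseport_cb);
	if (!reuse)
		/* Entry exists, but the socket is not in a reuseport
		 * group, e.g. an established TCP socket stored in a
		 * SOCKMAP.
		 */
		return selected_sk->sk_state == TCP_LISTEN ?
		       -ENOENT : -EINVAL;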

Note that the SOCK_RCU_FREE check at the 'slow-path'
reuseport_array_update_check() is because reuseport_array depends on
call_rcu(&sk->sk_rcu,...) to work, i.e. the reuseport_array
does not hold the sk_refcnt.

> >>
> >> hmm. I wonder whether this breaks existing users...
> >
> > There is already this check in reuseport_array_update_check()
> >
> > 	/*
> > 	 * sk must be hashed (i.e. listening in the TCP case or binded
> > 	 * in the UDP case) and
> > 	 * it must also be a SO_REUSEPORT sk (i.e. reuse cannot be NULL).
> > 	 *
> > 	 * Also, sk will be used in bpf helper that is protected by
> > 	 * rcu_read_lock().
> > 	 */
> > 	if (!sock_flag(nsk, SOCK_RCU_FREE) || !sk_hashed(nsk) || !nsk_reuse)
> > 		return -EINVAL;
> >
> > So I believe it should not cause any problems with existing users. Perhaps
> > we could consolidate the checks a bit or move it into the update paths if we
> > wanted. I assume Jakub was just ensuring we don't get here with SOCK_RCU_FREE
> > set from any of the new paths now. I'll let him answer though.
> 
> That was exactly my thinking here.
> 
> REUSEPORT_SOCKARRAY can't be populated with sockets that don't have
> SOCK_RCU_FREE set. This makes the flag check in sk_select_reuseport BPF
> helper redundant for this map type.
> 
> SOCKMAP, OTOH, allows storing established TCP sockets, which don't have
> SOCK_RCU_FREE flag and shouldn't be used as reuseport targets. The newly
> added check protects us against it.
> 
> I have a couple tests in the last patch for it -
> test_sockmap_reuseport_select_{listening,connected}. Admittedly, UDP is
> not covered.
> 
> Not sure how we could go about moving the checks to the update path for
> SOCKMAP. At update time we don't know if the map will be used with a
> reuseport or a sk_{skb,msg} program.
or make these checks specific to the sockmap's lookup path.



digress a little from this patch,
will the upcoming patches/examples show the use case to have both
TCP_LISTEN and TCP_ESTABLISHED sk in the same sock_map?


* Re: [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
  2019-11-24  6:00   ` John Fastabend
@ 2019-11-25 22:30   ` Martin Lau
  2019-11-26 14:32     ` Jakub Sitnicki
  2019-12-12 10:30     ` Jakub Sitnicki
  1 sibling, 2 replies; 39+ messages in thread
From: Martin Lau @ 2019-11-25 22:30 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: bpf, netdev, kernel-team, John Fastabend

On Sat, Nov 23, 2019 at 12:07:50PM +0100, Jakub Sitnicki wrote:
> Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
> can be selected at run-time. Also allow choosing which L4 protocols get
> tested.
If new cmdline args are added to select different subtests,
I would prefer to move it to the test_progs framework and reuse the subtests
support in test_progs commit 3a516a0a3a7b ("selftests/bpf: add sub-tests support for test_progs").
Its default is to run all instead of having a separate bash script to
do that.
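
I.e. something along these lines (run_with_map() is a made-up helper):

	void test_select_reuseport(void)
	{
		if (test__start_subtest("reuseport_sockarray"))
			run_with_map(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
		if (test__start_subtest("sockmap"))
			run_with_map(BPF_MAP_TYPE_SOCKMAP);
	}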

> 
> Run the extended reuseport program test two times, once for
> REUSEPORT_ARRAY, and once for SOCKMAP but just with TCP to cover the newly
> enabled map type.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>

[ ... ]
> +static const char *family_to_str(int family)
> +{
> +	switch (family) {
> +	case AF_INET:
> +		return "IPv4";
> +	case AF_INET6:
> +		return "IPv6";
> +	default:
> +		return "unknown";
> +	}
> +}
> +
> +static const char *type_to_str(int type)
> +{
> +	switch (type) {
> +	case SOCK_STREAM:
> +		return "TCP";
> +	case SOCK_DGRAM:
> +		return "UDP";
> +	default:
> +		return "unknown";
> +	}
> +}
+1

[ ... ]
> +static void parse_opts(int argc, char **argv)
> +{
> +	unsigned int sock_types = 0;
> +	int c;
> +
> +	while ((c = getopt(argc, argv, "hm:tu")) != -1) {
> +		switch (c) {
> +		case 'h':
> +			usage();
> +			break;
> +		case 'm':
> +			cfg_map_type = parse_map_type(optarg);
> +			break;
> +		case 't':
> +			sock_types |= 1 << SOCK_STREAM;
> +			break;
> +		case 'u':
> +			sock_types |= 1 << SOCK_DGRAM;
> +			break;
>  		}
>  	}
> +
> +	if (cfg_map_type == BPF_MAP_TYPE_UNSPEC)
> +		usage();
> +	if (sock_types != 0)
> +		cfg_sock_types = sock_types;
>  }
>  
> -int main(int argc, const char **argv)
> +int main(int argc, char **argv)
>  {
> +	parse_opts(argc, argv);
>  	create_maps();
>  	prepare_bpf_obj();
>  	saved_tcp_fo = read_int_sysctl(TCP_FO_SYSCTL);
> diff --git a/tools/testing/selftests/bpf/test_select_reuseport.sh b/tools/testing/selftests/bpf/test_select_reuseport.sh
> new file mode 100755
> index 000000000000..1951b4886021
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_select_reuseport.sh
> @@ -0,0 +1,14 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +set -eu
> +
> +DIR=$(dirname $0)
> +
> +echo "Testing reuseport with REUSEPORT_SOCKARRAY..."
> +$DIR/test_select_reuseport -m reuseport_sockarray
> +
> +echo "Testing reuseport with SOCKMAP (TCP only)..."
> +$DIR/test_select_reuseport -m sockmap -t
> +
> +exit 0
> -- 
> 2.20.1
> 


* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
  2019-11-24  5:56   ` John Fastabend
@ 2019-11-25 22:38   ` Martin Lau
  2019-11-26 15:54     ` Jakub Sitnicki
  1 sibling, 1 reply; 39+ messages in thread
From: Martin Lau @ 2019-11-25 22:38 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: bpf, netdev, kernel-team, John Fastabend

On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
[ ... ]

> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>  			sk->sk_prot = psock->sk_proto;
>  		psock->sk_proto = NULL;
>  	}
> +
> +	if (psock->icsk_af_ops) {
> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> +		psock->icsk_af_ops = NULL;
> +	}
>  }

[ ... ]

> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> +					  struct sk_buff *skb,
> +					  struct request_sock *req,
> +					  struct dst_entry *dst,
> +					  struct request_sock *req_unhash,
> +					  bool *own_req)
> +{
> +	const struct inet_connection_sock_af_ops *ops;
> +	void (*write_space)(struct sock *sk);
> +	struct sk_psock *psock;
> +	struct proto *proto;
> +	struct sock *child;
> +
> +	rcu_read_lock();
> +	psock = sk_psock(sk);
> +	if (likely(psock)) {
> +		proto = psock->sk_proto;
> +		write_space = psock->saved_write_space;
> +		ops = psock->icsk_af_ops;
It is not immediately clear to me what ensures
ops is not NULL here.

It is likely I missed something.  A short comment would
be very useful here.

> +	} else {
> +		ops = inet_csk(sk)->icsk_af_ops;
> +	}
> +	rcu_read_unlock();
> +
> +	child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> +
> +	/* Child must not inherit psock or its ops. */
> +	if (child && psock) {
> +		rcu_assign_sk_user_data(child, NULL);
> +		child->sk_prot = proto;
> +		child->sk_write_space = write_space;
> +
> +		/* v4-mapped sockets don't inherit parent ops. Don't restore. */
> +		if (inet_csk(child)->icsk_af_ops == inet_csk(sk)->icsk_af_ops)
> +			inet_csk(child)->icsk_af_ops = ops;
> +	}
> +	return child;
> +}
> +
>  enum {
>  	TCP_BPF_IPV4,
>  	TCP_BPF_IPV6,
> @@ -597,6 +642,7 @@ enum {
>  static struct proto *tcpv6_prot_saved __read_mostly;
>  static DEFINE_SPINLOCK(tcpv6_prot_lock);
>  static struct proto tcp_bpf_prots[TCP_BPF_NUM_PROTS][TCP_BPF_NUM_CFGS];
> +static struct inet_connection_sock_af_ops tcp_bpf_af_ops[TCP_BPF_NUM_PROTS];
>  
>  static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
>  				   struct proto *base)
> @@ -612,13 +658,23 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
>  	prot[TCP_BPF_TX].sendpage		= tcp_bpf_sendpage;
>  }
>  
> -static void tcp_bpf_check_v6_needs_rebuild(struct sock *sk, struct proto *ops)
> +static void tcp_bpf_rebuild_af_ops(struct inet_connection_sock_af_ops *ops,
> +				   const struct inet_connection_sock_af_ops *base)
> +{
> +	*ops = *base;
> +	ops->syn_recv_sock = tcp_bpf_syn_recv_sock;
> +}
> +
> +static void tcp_bpf_check_v6_needs_rebuild(struct sock *sk, struct proto *ops,
> +					   const struct inet_connection_sock_af_ops *af_ops)
>  {
>  	if (sk->sk_family == AF_INET6 &&
>  	    unlikely(ops != smp_load_acquire(&tcpv6_prot_saved))) {
>  		spin_lock_bh(&tcpv6_prot_lock);
>  		if (likely(ops != tcpv6_prot_saved)) {
>  			tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV6], ops);
> +			tcp_bpf_rebuild_af_ops(&tcp_bpf_af_ops[TCP_BPF_IPV6],
> +					       af_ops);
>  			smp_store_release(&tcpv6_prot_saved, ops);
>  		}
>  		spin_unlock_bh(&tcpv6_prot_lock);
> @@ -628,6 +684,8 @@ static void tcp_bpf_check_v6_needs_rebuild(struct sock *sk, struct proto *ops)
>  static int __init tcp_bpf_v4_build_proto(void)
>  {
>  	tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV4], &tcp_prot);
> +	tcp_bpf_rebuild_af_ops(&tcp_bpf_af_ops[TCP_BPF_IPV4], &ipv4_specific);
> +
>  	return 0;
>  }
>  core_initcall(tcp_bpf_v4_build_proto);
> @@ -637,7 +695,8 @@ static void tcp_bpf_update_sk_prot(struct sock *sk, struct sk_psock *psock)
>  	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
>  	int config = psock->progs.msg_parser   ? TCP_BPF_TX   : TCP_BPF_BASE;
>  
> -	sk_psock_update_proto(sk, psock, &tcp_bpf_prots[family][config]);
> +	sk_psock_update_proto(sk, psock, &tcp_bpf_prots[family][config],
> +			      &tcp_bpf_af_ops[family]);
>  }
>  
>  static void tcp_bpf_reinit_sk_prot(struct sock *sk, struct sk_psock *psock)
> @@ -677,6 +736,7 @@ void tcp_bpf_reinit(struct sock *sk)
>  
>  int tcp_bpf_init(struct sock *sk)
>  {
> +	struct inet_connection_sock *icsk = inet_csk(sk);
>  	struct proto *ops = READ_ONCE(sk->sk_prot);
>  	struct sk_psock *psock;
>  
> @@ -689,7 +749,7 @@ int tcp_bpf_init(struct sock *sk)
>  		rcu_read_unlock();
>  		return -EINVAL;
>  	}
> -	tcp_bpf_check_v6_needs_rebuild(sk, ops);
> +	tcp_bpf_check_v6_needs_rebuild(sk, ops, icsk->icsk_af_ops);
>  	tcp_bpf_update_sk_prot(sk, psock);
>  	rcu_read_unlock();
>  	return 0;
> -- 
> 2.20.1
> 


* Re: [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-25 22:07         ` Martin Lau
@ 2019-11-26 14:30           ` Jakub Sitnicki
  2019-11-26 19:03             ` Martin Lau
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-26 14:30 UTC (permalink / raw)
  To: Martin Lau; +Cc: John Fastabend, Alexei Starovoitov, bpf, netdev, kernel-team

On Mon, Nov 25, 2019 at 11:07 PM CET, Martin Lau wrote:
> On Mon, Nov 25, 2019 at 11:40:41AM +0100, Jakub Sitnicki wrote:
>> On Mon, Nov 25, 2019 at 05:17 AM CET, John Fastabend wrote:
>> > Alexei Starovoitov wrote:
>> >> On Sat, Nov 23, 2019 at 12:07:48PM +0100, Jakub Sitnicki wrote:
>> >> > SOCKMAP now supports storing references to listening sockets. Nothing keeps
>> >> > us from using it as an array of sockets to select from in SK_REUSEPORT
>> >> > programs.
>> >> >
>> >> > Whitelist the map type with the BPF helper for selecting socket. However,
>> >> > impose a restriction that the selected socket needs to be a listening TCP
>> >> > socket or a bound UDP socket (connected or not).
>> >> >
>> >> > The only other map type that works with the BPF reuseport helper,
>> >> > REUSEPORT_SOCKARRAY, has a corresponding check in its update operation
>> >> > handler.
>> >> >
>> >> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> > ---
>> >
>> > [...]
>> >
>> >> > diff --git a/net/core/filter.c b/net/core/filter.c
>> >> > index 49ded4a7588a..e3fb77353248 100644
>> >> > --- a/net/core/filter.c
>> >> > +++ b/net/core/filter.c
>> >> > @@ -8723,6 +8723,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>> >> >  	selected_sk = map->ops->map_lookup_elem(map, key);
>> >> >  	if (!selected_sk)
>> >> >  		return -ENOENT;
>> >> > +	if (!sock_flag(selected_sk, SOCK_RCU_FREE))
>> >> > +		return -EINVAL;
> If I read it correctly,
> this is to avoid the following "if (!reuse)" returning -ENOENT,
> and instead return -EINVAL for a non-TCP_LISTEN tcp_sock.
> It should at least only be done under the "if (!reuse)" then.

Yes, exactly. For an established TCP socket in SOCKMAP we would get
-ENOENT because sk_reuseport_cb is not set, which is a bit confusing
since the map entry exists.

Returning -EINVAL matches the REUSEPORT_SOCKARRAY update operation
semantics for established TCP sockets.

But this is just about returning an informative error so you're
completely right that this should be done under the "if (!reuse)" branch to
avoid the extra cost on the happy path.

> Checking SOCK_RCU_FREE to imply TCP_LISTEN is not ideal.
> It is not immediately obvious.  Why not directly check
> TCP_LISTEN?

I agree, it's not obvious. When I first saw this check in
reuseport_array_update_check it got me puzzled too. I should have added
an explanatory comment there.

Thing is we're not matching on just TCP_LISTEN. REUSEPORT_SOCKARRAY
allows selecting a connected UDP socket as a target as well. It takes
some effort to set up but it's possible even if obscure.
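
For reference, the setup is roughly this (addr and peer prepared
elsewhere):

	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));   /* joins group */
	connect(fd, (struct sockaddr *)&peer, sizeof(peer)); /* stays in it */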

> Note that the SOCK_RCU_FREE check at the 'slow-path'
> reuseport_array_update_check() is because reuseport_array does depend on
> call_rcu(&sk->sk_rcu,...) to work, e.g. the reuseport_array
> does not hold the sk_refcnt.

Oh, so it's not only about socket state like I thought.

This raises the question - does REUSEPORT_SOCKARRAY allow storing
connected UDP sockets by design or is it a happy accident? It doesn't
seem particularly useful.

Either way, thanks for the explanation.

>
>> >>
>> >> hmm. I wonder whether this breaks existing users...
>> >
>> > There is already this check in reuseport_array_update_check()
>> >
>> > 	/*
>> > 	 * sk must be hashed (i.e. listening in the TCP case or binded
>> > 	 * in the UDP case) and
>> > 	 * it must also be a SO_REUSEPORT sk (i.e. reuse cannot be NULL).
>> > 	 *
>> > 	 * Also, sk will be used in bpf helper that is protected by
>> > 	 * rcu_read_lock().
>> > 	 */
>> > 	if (!sock_flag(nsk, SOCK_RCU_FREE) || !sk_hashed(nsk) || !nsk_reuse)
>> > 		return -EINVAL;
>> >
>> > So I believe it should not cause any problems with existing users. Perhaps
>> > we could consolidate the checks a bit or move it into the update paths if we
>> > wanted. I assume Jakub was just ensuring we don't get here with SOCK_RCU_FREE
>> > set from any of the new paths now. I'll let him answer though.
>>
>> That was exactly my thinking here.
>>
>> REUSEPORT_SOCKARRAY can't be populated with sockets that don't have
>> SOCK_RCU_FREE set. This makes the flag check in sk_select_reuseport BPF
>> helper redundant for this map type.
>>
>> SOCKMAP, OTOH, allows storing established TCP sockets, which don't have
>> SOCK_RCU_FREE flag and shouldn't be used as reuseport targets. The newly
>> added check protects us against it.
>>
>> I have a couple tests in the last patch for it -
>> test_sockmap_reuseport_select_{listening,connected}. Admittedly, UDP is
>> not covered.
>>
>> Not sure how we could go about moving the checks to the update path for
>> SOCKMAP. At update time we don't know if the map will be used with a
>> reuseport or a sk_{skb,msg} program.
> or make these checks specific to the sockmap's lookup path.
>
> digress a little from this patch,
> will the upcoming patches/examples show the use case to have both
> TCP_LISTEN and TCP_ESTABLISHED sk in the same sock_map?

No, we have no use for a map instance that mixes listening and
established TCP sockets that I know of.

I'm guessing you would like to avoid adding a new check on the fast-path
(at socket selection time) by filtering out sockets in an invalid state on
map update, like SOCKARRAY does.

I could imagine setting a flag at map creation time to put SOCKMAP in a
certain mode, storing just listening or just established sockets.
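
Say, something like this at map creation (flag name and value are made
up, I'm not proposing it here):

	/* Hypothetical flag restricting the map to listening sockets */
	#define BPF_F_SOCKMAP_LISTEN_ONLY	(1U << 7)

	union bpf_attr attr = {
		.map_type    = BPF_MAP_TYPE_SOCKMAP,
		.key_size    = sizeof(__u32),
		.value_size  = sizeof(__u64),
		.max_entries = 2,
		.map_flags   = BPF_F_SOCKMAP_LISTEN_ONLY,
	};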

OTOH why restrict the user? If you are okay with not adding extra checks
on the happy path in sk_select_reuseport, I would opt for that.

Thanks,
Jakub


* Re: [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  2019-11-25 22:30   ` Martin Lau
@ 2019-11-26 14:32     ` Jakub Sitnicki
  2019-12-12 10:30     ` Jakub Sitnicki
  1 sibling, 0 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-26 14:32 UTC (permalink / raw)
  To: Martin Lau; +Cc: bpf, netdev, kernel-team, John Fastabend

On Mon, Nov 25, 2019 at 11:30 PM CET, Martin Lau wrote:
> On Sat, Nov 23, 2019 at 12:07:50PM +0100, Jakub Sitnicki wrote:
>> Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
>> can be selected at run-time. Also allow choosing which L4 protocols get
>> tested.
> If new cmdline args are added to select different subtests,
> I would prefer to move it to the test_progs framework and reuse the subtests
> support in test_progs commit 3a516a0a3a7b ("selftests/bpf: add sub-tests support for test_progs").
> Its default is to run all instead of having a separate bash script to
> do that.

Great idea. Let me convert these tests and the newly added ones to
test_progs framework.

-Jakub


* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-11-25 22:38   ` Martin Lau
@ 2019-11-26 15:54     ` Jakub Sitnicki
  2019-11-26 17:16       ` Martin Lau
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-26 15:54 UTC (permalink / raw)
  To: Martin Lau; +Cc: bpf, netdev, kernel-team, John Fastabend

On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> [ ... ]
>
>> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>>  			sk->sk_prot = psock->sk_proto;
>>  		psock->sk_proto = NULL;
>>  	}
>> +
>> +	if (psock->icsk_af_ops) {
>> +		icsk->icsk_af_ops = psock->icsk_af_ops;
>> +		psock->icsk_af_ops = NULL;
>> +	}
>>  }
>
> [ ... ]
>
>> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
>> +					  struct sk_buff *skb,
>> +					  struct request_sock *req,
>> +					  struct dst_entry *dst,
>> +					  struct request_sock *req_unhash,
>> +					  bool *own_req)
>> +{
>> +	const struct inet_connection_sock_af_ops *ops;
>> +	void (*write_space)(struct sock *sk);
>> +	struct sk_psock *psock;
>> +	struct proto *proto;
>> +	struct sock *child;
>> +
>> +	rcu_read_lock();
>> +	psock = sk_psock(sk);
>> +	if (likely(psock)) {
>> +		proto = psock->sk_proto;
>> +		write_space = psock->saved_write_space;
>> +		ops = psock->icsk_af_ops;
> It is not immediately clear to me what ensure
> ops is not NULL here.
>
> It is likely I missed something.  A short comment would
> be very useful here.

I can see the readability problem. Looking at it now, perhaps it should
be rewritten, to the same effect, as:

static struct sock *tcp_bpf_syn_recv_sock(...)
{
	const struct inet_connection_sock_af_ops *ops = NULL;
	...

	rcu_read_lock();
	psock = sk_psock(sk);
	if (likely(psock)) {
		proto = psock->sk_proto;
		write_space = psock->saved_write_space;
		ops = psock->icsk_af_ops;
	}
	rcu_read_unlock();

	if (!ops)
		ops = inet_csk(sk)->icsk_af_ops;
	child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);

If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
properly. To double check what happens here:

In sock_map_link we do a setup dance where we first create the psock and
later initialize the socket callbacks (tcp_bpf_init).

static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
			 struct sock *sk)
{
	...
	if (psock) {
		...
	} else {
		psock = sk_psock_init(sk, map->numa_node);
		if (!psock) {
			ret = -ENOMEM;
			goto out_progs;
		}
		sk_psock_is_new = true;
	}
	...
	if (sk_psock_is_new) {
		ret = tcp_bpf_init(sk);
		if (ret < 0)
			goto out_drop;
	} else {
		tcp_bpf_reinit(sk);
	}

The "if (sk_psock_new)" branch triggers the call chain that leads to
saving & overriding socket callbacks.

tcp_bpf_init -> tcp_bpf_update_sk_prot -> sk_psock_update_proto

Among them, icsk_af_ops.

static inline void sk_psock_update_proto(...)
{
	...
	psock->icsk_af_ops = icsk->icsk_af_ops;
	icsk->icsk_af_ops = af_ops;
}

Goes without saying that a comment is needed.

Thanks for the feedback,
Jakub


* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-11-26 15:54     ` Jakub Sitnicki
@ 2019-11-26 17:16       ` Martin Lau
  2019-11-26 18:36         ` Jakub Sitnicki
  2019-11-26 18:43         ` John Fastabend
  0 siblings, 2 replies; 39+ messages in thread
From: Martin Lau @ 2019-11-26 17:16 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: bpf, netdev, kernel-team, John Fastabend

On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> > [ ... ]
> >
> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> >>  			sk->sk_prot = psock->sk_proto;
> >>  		psock->sk_proto = NULL;
> >>  	}
> >> +
> >> +	if (psock->icsk_af_ops) {
> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> >> +		psock->icsk_af_ops = NULL;
> >> +	}
> >>  }
> >
> > [ ... ]
> >
> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> >> +					  struct sk_buff *skb,
> >> +					  struct request_sock *req,
> >> +					  struct dst_entry *dst,
> >> +					  struct request_sock *req_unhash,
> >> +					  bool *own_req)
> >> +{
> >> +	const struct inet_connection_sock_af_ops *ops;
> >> +	void (*write_space)(struct sock *sk);
> >> +	struct sk_psock *psock;
> >> +	struct proto *proto;
> >> +	struct sock *child;
> >> +
> >> +	rcu_read_lock();
> >> +	psock = sk_psock(sk);
> >> +	if (likely(psock)) {
> >> +		proto = psock->sk_proto;
> >> +		write_space = psock->saved_write_space;
> >> +		ops = psock->icsk_af_ops;
> > It is not immediately clear to me what ensures
> > ops is not NULL here.
> >
> > It is likely I missed something.  A short comment would
> > be very useful here.
> 
> I can see the readability problem. Looking at it now, perhaps it should
> be rewritten, to the same effect, as:
> 
> static struct sock *tcp_bpf_syn_recv_sock(...)
> {
> 	const struct inet_connection_sock_af_ops *ops = NULL;
>         ...
> 
>         rcu_read_lock();
> 	psock = sk_psock(sk);
> 	if (likely(psock)) {
> 		proto = psock->sk_proto;
> 		write_space = psock->saved_write_space;
> 		ops = psock->icsk_af_ops;
> 	}
> 	rcu_read_unlock();
> 
>         if (!ops)
> 		ops = inet_csk(sk)->icsk_af_ops;
>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> 
> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> properly. To double check what happens here:
I did not mean the init path.  The init path is fine since it inits
everything on psock before publishing the sk to the sock_map.

I was thinking of the delete path (e.g. sock_map_delete_elem).  It is not clear
to me what prevents the earlier pasted sk_psock_restore_proto(), which sets
psock->icsk_af_ops to NULL, from running in parallel with
tcp_bpf_syn_recv_sock().  An explanation would be useful.

> 
> In sock_map_link we do a setup dance where we first create the psock and
> later initialize the socket callbacks (tcp_bpf_init).
> 
> static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
> 			 struct sock *sk)
> {
>         ...
> 	if (psock) {
>                 ...
> 	} else {
> 		psock = sk_psock_init(sk, map->numa_node);
> 		if (!psock) {
> 			ret = -ENOMEM;
> 			goto out_progs;
> 		}
> 		sk_psock_is_new = true;
> 	}
>         ...
>         if (sk_psock_is_new) {
> 		ret = tcp_bpf_init(sk);
> 		if (ret < 0)
> 			goto out_drop;
> 	} else {
> 		tcp_bpf_reinit(sk);
> 	}
> 
> The "if (sk_psock_new)" branch triggers the call chain that leads to
> saving & overriding socket callbacks.
> 
> tcp_bpf_init -> tcp_bpf_update_sk_prot -> sk_psock_update_proto
> 
> Among them, icsk_af_ops.
> 
> static inline void sk_psock_update_proto(...)
> {
>         ...
> 	psock->icsk_af_ops = icsk->icsk_af_ops;
> 	icsk->icsk_af_ops = af_ops;
> }
> 
> Goes without saying that a comment is needed.
> 
> Thanks for the feedback,
> Jakub

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-11-26 17:16       ` Martin Lau
@ 2019-11-26 18:36         ` Jakub Sitnicki
       [not found]           ` <87sglsfdda.fsf@cloudflare.com>
  2019-11-26 18:43         ` John Fastabend
  1 sibling, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-26 18:36 UTC (permalink / raw)
  To: Martin Lau; +Cc: bpf, netdev, kernel-team, John Fastabend

On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
>> > [ ... ]
>> >
>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>> >>  			sk->sk_prot = psock->sk_proto;
>> >>  		psock->sk_proto = NULL;
>> >>  	}
>> >> +
>> >> +	if (psock->icsk_af_ops) {
>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
>> >> +		psock->icsk_af_ops = NULL;
>> >> +	}
>> >>  }
>> >
>> > [ ... ]
>> >
>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
>> >> +					  struct sk_buff *skb,
>> >> +					  struct request_sock *req,
>> >> +					  struct dst_entry *dst,
>> >> +					  struct request_sock *req_unhash,
>> >> +					  bool *own_req)
>> >> +{
>> >> +	const struct inet_connection_sock_af_ops *ops;
>> >> +	void (*write_space)(struct sock *sk);
>> >> +	struct sk_psock *psock;
>> >> +	struct proto *proto;
>> >> +	struct sock *child;
>> >> +
>> >> +	rcu_read_lock();
>> >> +	psock = sk_psock(sk);
>> >> +	if (likely(psock)) {
>> >> +		proto = psock->sk_proto;
>> >> +		write_space = psock->saved_write_space;
>> >> +		ops = psock->icsk_af_ops;
>> > It is not immediately clear to me what ensure
>> > ops is not NULL here.
>> >
>> > It is likely I missed something.  A short comment would
>> > be very useful here.
>>
>> I can see the readability problem. Looking at it now, perhaps it should
>> be rewritten, to the same effect, as:
>>
>> static struct sock *tcp_bpf_syn_recv_sock(...)
>> {
>> 	const struct inet_connection_sock_af_ops *ops = NULL;
>>         ...
>>
>>         rcu_read_lock();
>> 	psock = sk_psock(sk);
>> 	if (likely(psock)) {
>> 		proto = psock->sk_proto;
>> 		write_space = psock->saved_write_space;
>> 		ops = psock->icsk_af_ops;
>> 	}
>> 	rcu_read_unlock();
>>
>>         if (!ops)
>> 		ops = inet_csk(sk)->icsk_af_ops;
>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>>
>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
>> properly. To double check what happens here:
> I did not mean the init path.  The init path is fine since it inits
> everything on psock before publishing the sk to the sock_map.
>
> I was thinking of the delete path (e.g. sock_map_delete_elem).  It is not clear
> to me what prevents the earlier pasted sk_psock_restore_proto(), which sets
> psock->icsk_af_ops to NULL, from running in parallel with
> tcp_bpf_syn_recv_sock().  An explanation would be useful.

Ah, I misunderstood. Nothing prevents the race, AFAIK.

Setting psock->icsk_af_ops to null on restore and not checking for it
here was a bad move on my side.  Also I need to revisit what to do about
psock->sk_proto so the child socket doesn't end up with null sk_proto.

This race should be easy enough to trigger. Will give it a shot.

Thank you for bringing this up,
Jakub

>
>>
>> In sock_map_link we do a setup dance where we first create the psock and
>> later initialize the socket callbacks (tcp_bpf_init).
>>
>> static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
>> 			 struct sock *sk)
>> {
>>         ...
>> 	if (psock) {
>>                 ...
>> 	} else {
>> 		psock = sk_psock_init(sk, map->numa_node);
>> 		if (!psock) {
>> 			ret = -ENOMEM;
>> 			goto out_progs;
>> 		}
>> 		sk_psock_is_new = true;
>> 	}
>>         ...
>>         if (sk_psock_is_new) {
>> 		ret = tcp_bpf_init(sk);
>> 		if (ret < 0)
>> 			goto out_drop;
>> 	} else {
>> 		tcp_bpf_reinit(sk);
>> 	}
>>
>> The "if (sk_psock_new)" branch triggers the call chain that leads to
>> saving & overriding socket callbacks.
>>
>> tcp_bpf_init -> tcp_bpf_update_sk_prot -> sk_psock_update_proto
>>
>> Among them, icsk_af_ops.
>>
>> static inline void sk_psock_update_proto(...)
>> {
>>         ...
>> 	psock->icsk_af_ops = icsk->icsk_af_ops;
>> 	icsk->icsk_af_ops = af_ops;
>> }
>>
>> Goes without saying that a comment is needed.
>>
>> Thanks for the feedback,
>> Jakub

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-11-26 17:16       ` Martin Lau
  2019-11-26 18:36         ` Jakub Sitnicki
@ 2019-11-26 18:43         ` John Fastabend
  2019-11-27 22:18           ` Jakub Sitnicki
  1 sibling, 1 reply; 39+ messages in thread
From: John Fastabend @ 2019-11-26 18:43 UTC (permalink / raw)
  To: Martin Lau, Jakub Sitnicki; +Cc: bpf, netdev, kernel-team, John Fastabend

Martin Lau wrote:
> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> > On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> > > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> > > [ ... ]
> > >
> > >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> > >>  			sk->sk_prot = psock->sk_proto;
> > >>  		psock->sk_proto = NULL;
> > >>  	}
> > >> +
> > >> +	if (psock->icsk_af_ops) {
> > >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> > >> +		psock->icsk_af_ops = NULL;
> > >> +	}
> > >>  }
> > >
> > > [ ... ]
> > >
> > >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> > >> +					  struct sk_buff *skb,
> > >> +					  struct request_sock *req,
> > >> +					  struct dst_entry *dst,
> > >> +					  struct request_sock *req_unhash,
> > >> +					  bool *own_req)
> > >> +{
> > >> +	const struct inet_connection_sock_af_ops *ops;
> > >> +	void (*write_space)(struct sock *sk);
> > >> +	struct sk_psock *psock;
> > >> +	struct proto *proto;
> > >> +	struct sock *child;
> > >> +
> > >> +	rcu_read_lock();
> > >> +	psock = sk_psock(sk);
> > >> +	if (likely(psock)) {
> > >> +		proto = psock->sk_proto;
> > >> +		write_space = psock->saved_write_space;
> > >> +		ops = psock->icsk_af_ops;
> > > It is not immediately clear to me what ensure
> > > ops is not NULL here.
> > >
> > > It is likely I missed something.  A short comment would
> > > be very useful here.
> > 
> > I can see the readability problem. Looking at it now, perhaps it should
> > be rewritten, to the same effect, as:
> > 
> > static struct sock *tcp_bpf_syn_recv_sock(...)
> > {
> > 	const struct inet_connection_sock_af_ops *ops = NULL;
> >         ...
> > 
> >     rcu_read_lock();
> > 	psock = sk_psock(sk);
> > 	if (likely(psock)) {
> > 		proto = psock->sk_proto;
> > 		write_space = psock->saved_write_space;
> > 		ops = psock->icsk_af_ops;
> > 	}
> > 	rcu_read_unlock();
> > 
> >         if (!ops)
> > 		ops = inet_csk(sk)->icsk_af_ops;
> >         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> > 
> > If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> > properly. To double check what happens here:
> I did not mean the init path.  The init path is fine since it inits
> everything on psock before publishing the sk to the sock_map.
>
> I was thinking of the delete path (e.g. sock_map_delete_elem).  It is not clear
> to me what prevents the earlier pasted sk_psock_restore_proto(), which sets
> psock->icsk_af_ops to NULL, from running in parallel with
> tcp_bpf_syn_recv_sock().  An explanation would be useful.
> 

I'll answer. Updates are protected via sk_callback_lock, so we don't have
parallel updates in flight causing write_space and sk_proto to be out
of sync. However, access should be OK because it's a pointer write; we
never update the pointer in place, e.g.

static inline void sk_psock_restore_proto(struct sock *sk,
					  struct sk_psock *psock)
{
+       struct inet_connection_sock *icsk = inet_csk(sk);
+
	sk->sk_write_space = psock->saved_write_space;

	if (psock->sk_proto) {
		struct inet_connection_sock *icsk = inet_csk(sk);
		bool has_ulp = !!icsk->icsk_ulp_data;

		if (has_ulp)
			tcp_update_ulp(sk, psock->sk_proto);
		else
			sk->sk_prot = psock->sk_proto;
		psock->sk_proto = NULL;
	}

+
+       if (psock->icsk_af_ops) {
+               icsk->icsk_af_ops = psock->icsk_af_ops;
+               psock->icsk_af_ops = NULL;
+       }
}

In the restore case psock->icsk_af_ops is either NULL or not. If it's
NULL, the code below catches it. If it's not NULL (set on the init path),
then we have a valid pointer.

        rcu_read_lock();
	psock = sk_psock(sk);
 	if (likely(psock)) {
 		proto = psock->sk_proto;
 		write_space = psock->saved_write_space;
 		ops = psock->icsk_af_ops;
 	}
 	rcu_read_unlock();
 
        if (!ops)
		ops = inet_csk(sk)->icsk_af_ops;
        child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);


We should do this with proper READ_ONCE/WRITE_ONCE to make it clear
what is going on and to stop the compiler from breaking these assumptions. I
was going to generate that patch after this series but can do it before
as well. I didn't mention it here because it seems a bit out of scope
for this series; it's mostly a fix to older code.
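
For illustration, a minimal sketch of what such annotations could look
like (field names as in the snippets above; exact placement is up to
that follow-up patch):

	/* writer side, e.g. in sk_psock_restore_proto() */
	WRITE_ONCE(icsk->icsk_af_ops, psock->icsk_af_ops);
	WRITE_ONCE(psock->icsk_af_ops, NULL);

	/* reader side, e.g. in tcp_bpf_syn_recv_sock() */
	ops = READ_ONCE(psock->icsk_af_ops);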

Also, I started to think that write_space might be out of sync with ops, but
it seems we never actually remove psock_write_space until after an RCU
grace period, so that should be OK as well and always point to the
previous write_space.

Finally, I wondered if we could remove the ops and then add it back
quickly, which seems at least in theory possible, but that would get
hit with a grace period because we can't have conflicting psock
definitions on the same sock. So expanding the RCU block to include
the ops = inet_csk(sk)->icsk_af_ops read would fix that case.

So in summary, I think we should expand the RCU lock here to include the
ops = inet_csk(sk)->icsk_af_ops read to ensure we don't race with tear-down
and create. I'll push the necessary update with WRITE_ONCE and
READ_ONCE to fix that up. Seeing that we most likely have to wait until
the merge window opens anyway, I'll send those out sooner rather than
later, and this series can add the proper annotations as well.
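
Roughly, the expanded section would read (sketch only, not the final
patch):

	rcu_read_lock();
	psock = sk_psock(sk);
	if (likely(psock)) {
		proto = psock->sk_proto;
		write_space = psock->saved_write_space;
		ops = psock->icsk_af_ops;
	}
	if (!ops)
		ops = READ_ONCE(inet_csk(sk)->icsk_af_ops);
	rcu_read_unlock();

	child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);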

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-26 14:30           ` Jakub Sitnicki
@ 2019-11-26 19:03             ` Martin Lau
  2019-11-27 21:34               ` Jakub Sitnicki
  0 siblings, 1 reply; 39+ messages in thread
From: Martin Lau @ 2019-11-26 19:03 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: John Fastabend, Alexei Starovoitov, bpf, netdev, kernel-team

On Tue, Nov 26, 2019 at 03:30:57PM +0100, Jakub Sitnicki wrote:
> On Mon, Nov 25, 2019 at 11:07 PM CET, Martin Lau wrote:
> > On Mon, Nov 25, 2019 at 11:40:41AM +0100, Jakub Sitnicki wrote:
> >> On Mon, Nov 25, 2019 at 05:17 AM CET, John Fastabend wrote:
> >> > Alexei Starovoitov wrote:
> >> >> On Sat, Nov 23, 2019 at 12:07:48PM +0100, Jakub Sitnicki wrote:
> >> >> > SOCKMAP now supports storing references to listening sockets. Nothing keeps
> >> >> > us from using it as an array of sockets to select from in SK_REUSEPORT
> >> >> > programs.
> >> >> >
> >> >> > Whitelist the map type with the BPF helper for selecting socket. However,
> >> >> > impose a restriction that the selected socket needs to be a listening TCP
> >> >> > socket or a bound UDP socket (connected or not).
> >> >> >
> >> >> > The only other map type that works with the BPF reuseport helper,
> >> >> > REUSEPORT_SOCKARRAY, has a corresponding check in its update operation
> >> >> > handler.
> >> >> >
> >> >> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> >> > ---
> >> >
> >> > [...]
> >> >
> >> >> > diff --git a/net/core/filter.c b/net/core/filter.c
> >> >> > index 49ded4a7588a..e3fb77353248 100644
> >> >> > --- a/net/core/filter.c
> >> >> > +++ b/net/core/filter.c
> >> >> > @@ -8723,6 +8723,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
> >> >> >  	selected_sk = map->ops->map_lookup_elem(map, key);
> >> >> >  	if (!selected_sk)
> >> >> >  		return -ENOENT;
> >> >> > +	if (!sock_flag(selected_sk, SOCK_RCU_FREE))
> >> >> > +		return -EINVAL;
> > If I read it correctly,
> > this is to avoid the following "if (!reuse)" to return -ENOENT,
> > and instead returns -EINVAL for non TCP_LISTEN tcp_sock.
> > It should at least only be done under the "if (!reuse)" then.
> 
> Yes, exactly. For an established TCP socket in SOCKMAP we would get
> -ENOENT because sk_reuseport_cb is not set. Which is a bit confusing
> since the map entry exists.
> 
> Returning -EINVAL matches the REUSEPORT_SOCKARRAY update operation
> semantics for established TCP sockets.
> 
> But this is just about returning an informative error so you're
> completely right that this should be done under "if (!reuse)" branch to
> avoid the extra cost on the happy path.
> 
> > Checking SOCK_RCU_FREE to imply TCP_LISTEN is not ideal.
> > It is not immediately obvious.  Why not directly check
> > TCP_LISTEN?
> 
> I agree, it's not obvious. When I first saw this check in
> reuseport_array_update_check it got me puzzled too. I should have added
> an explanatory comment there.
> 
> Thing is we're not matching on just TCP_LISTEN. REUSEPORT_SOCKARRAY
> allows selecting a connected UDP socket as a target as well. It takes
> some effort to set up but it's possible even if obscure.
How about this instead:
if (!reuse)
 	/* reuseport_array only has sk that has a non-NULL sk_reuseport_cb.
	 * The only (!reuse) case here is that the sk has already been removed from
	 * reuseport_array, so treat it as -ENOENT.
	 *
	 * Other maps (e.g. sock_map) do not provide this guarantee and the sk may
	 * never be in the reuseport to begin with.
	 */
	return map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY ? -ENOENT : -EINVAL;

> 
> > Note that the SOCK_RCU_FREE check at the 'slow-path'
> > reuseport_array_update_check() is because reuseport_array does depend on
> > call_rcu(&sk->sk_rcu,...) to work, e.g. the reuseport_array
> > does not hold the sk_refcnt.
> 
> Oh, so it's not only about socket state like I thought.
> 
> This raises the question - does REUSEPORT_SOCKARRAY allow storing
> connected UDP sockets by design or is it a happy accident? It doesn't
> seem particularly useful.
Not by design/accident on the REUSEPORT_SOCKARRAY side ;)

The intention of REUSEPORT_SOCKARRAY is to allow sk that can be added to
reuse->socks[].

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP
  2019-11-26 19:03             ` Martin Lau
@ 2019-11-27 21:34               ` Jakub Sitnicki
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-27 21:34 UTC (permalink / raw)
  To: Martin Lau; +Cc: John Fastabend, Alexei Starovoitov, bpf, netdev, kernel-team

On Tue, Nov 26, 2019 at 08:03 PM CET, Martin Lau wrote:
> On Tue, Nov 26, 2019 at 03:30:57PM +0100, Jakub Sitnicki wrote:
>> On Mon, Nov 25, 2019 at 11:07 PM CET, Martin Lau wrote:
>> > On Mon, Nov 25, 2019 at 11:40:41AM +0100, Jakub Sitnicki wrote:

[...]

>> I agree, it's not obvious. When I first saw this check in
>> reuseport_array_update_check it got me puzzled too. I should have added
>> an explanatory comment there.
>>
>> Thing is we're not matching on just TCP_LISTEN. REUSEPORT_SOCKARRAY
>> allows selecting a connected UDP socket as a target as well. It takes
>> some effort to set up but it's possible even if obscure.
> How about this instead:
> if (!reuse)
>  	/* reuseport_array only has sk that has a non-NULL sk_reuseport_cb.
> 	 * The only (!reuse) case here is that the sk has already been removed from
> 	 * reuseport_array, so treat it as -ENOENT.
> 	 *
> 	 * Other maps (e.g. sock_map) do not provide this guarantee and the sk may
> 	 * never be in the reuseport to begin with.
> 	 */
> 	return map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY ? -ENOENT : -EINVAL;

Right, apart from established TCP sockets we must not select a listening
socket that's not in a reuseport group either. This covers both
cases. Clever. Thanks for the suggestion.

>
>>
>> > Note that the SOCK_RCU_FREE check at the 'slow-path'
>> > reuseport_array_update_check() is because reuseport_array does depend on
>> > call_rcu(&sk->sk_rcu,...) to work, e.g. the reuseport_array
>> > does not hold the sk_refcnt.
>>
>> Oh, so it's not only about socket state like I thought.
>>
>> This raises the question - does REUSEPORT_SOCKARRAY allow storing
>> connected UDP sockets by design or is it a happy accident? It doesn't
>> seem particularly useful.
> Not by design/accident on the REUSEPORT_SOCKARRAY side ;)
>
> The intention of REUSEPORT_SOCKARRAY is to allow sk that can be added to
> reuse->socks[].

Ah, makes sense. REUSEPORT_SOCKARRAY had to mimic reuseport groups.

-Jakub

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-11-26 18:43         ` John Fastabend
@ 2019-11-27 22:18           ` Jakub Sitnicki
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-11-27 22:18 UTC (permalink / raw)
  To: John Fastabend; +Cc: Martin Lau, bpf, netdev, kernel-team

On Tue, Nov 26, 2019 at 07:43 PM CET, John Fastabend wrote:
> Martin Lau wrote:
>> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
>> > On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
>> > > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
>> > > [ ... ]
>> > >
>> > >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>> > >>  			sk->sk_prot = psock->sk_proto;
>> > >>  		psock->sk_proto = NULL;
>> > >>  	}
>> > >> +
>> > >> +	if (psock->icsk_af_ops) {
>> > >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
>> > >> +		psock->icsk_af_ops = NULL;
>> > >> +	}
>> > >>  }
>> > >
>> > > [ ... ]
>> > >
>> > >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
>> > >> +					  struct sk_buff *skb,
>> > >> +					  struct request_sock *req,
>> > >> +					  struct dst_entry *dst,
>> > >> +					  struct request_sock *req_unhash,
>> > >> +					  bool *own_req)
>> > >> +{
>> > >> +	const struct inet_connection_sock_af_ops *ops;
>> > >> +	void (*write_space)(struct sock *sk);
>> > >> +	struct sk_psock *psock;
>> > >> +	struct proto *proto;
>> > >> +	struct sock *child;
>> > >> +
>> > >> +	rcu_read_lock();
>> > >> +	psock = sk_psock(sk);
>> > >> +	if (likely(psock)) {
>> > >> +		proto = psock->sk_proto;
>> > >> +		write_space = psock->saved_write_space;
>> > >> +		ops = psock->icsk_af_ops;
>> > > It is not immediately clear to me what ensure
>> > > ops is not NULL here.
>> > >
>> > > It is likely I missed something.  A short comment would
>> > > be very useful here.
>> >
>> > I can see the readability problem. Looking at it now, perhaps it should
>> > be rewritten, to the same effect, as:
>> >
>> > static struct sock *tcp_bpf_syn_recv_sock(...)
>> > {
>> > 	const struct inet_connection_sock_af_ops *ops = NULL;
>> >         ...
>> >
>> >     rcu_read_lock();
>> > 	psock = sk_psock(sk);
>> > 	if (likely(psock)) {
>> > 		proto = psock->sk_proto;
>> > 		write_space = psock->saved_write_space;
>> > 		ops = psock->icsk_af_ops;
>> > 	}
>> > 	rcu_read_unlock();
>> >
>> >         if (!ops)
>> > 		ops = inet_csk(sk)->icsk_af_ops;
>> >         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>> >
>> > If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
>> > properly. To double check what happens here:
>> I did not mean the init path.  The init path is fine since it inits
>> everything on psock before publishing the sk to the sock_map.
>>
>> I was thinking of the delete path (e.g. sock_map_delete_elem).  It is not clear
>> to me what prevents the earlier pasted sk_psock_restore_proto(), which sets
>> psock->icsk_af_ops to NULL, from running in parallel with
>> tcp_bpf_syn_recv_sock().  An explanation would be useful.
>>
>
> I'll answer. Updates are protected via sk_callback_lock so we don't have
> parallel updates in-flight causing write_space and sk_proto to be out
> of sync. However access should be OK because it's a pointer write we
> never update the pointer in place, e.g.
>
> static inline void sk_psock_restore_proto(struct sock *sk,
> 					  struct sk_psock *psock)
> {
> +       struct inet_connection_sock *icsk = inet_csk(sk);
> +
> 	sk->sk_write_space = psock->saved_write_space;
>
> 	if (psock->sk_proto) {
> 		struct inet_connection_sock *icsk = inet_csk(sk);
> 		bool has_ulp = !!icsk->icsk_ulp_data;
>
> 		if (has_ulp)
> 			tcp_update_ulp(sk, psock->sk_proto);
> 		else
> 			sk->sk_prot = psock->sk_proto;
> 		psock->sk_proto = NULL;
> 	}
>
> +
> +       if (psock->icsk_af_ops) {
> +               icsk->icsk_af_ops = psock->icsk_af_ops;
> +               psock->icsk_af_ops = NULL;
> +       }
> }
>
> In the restore case psock->icsk_af_ops is either NULL or not. If it's
> NULL, the code below catches it. If it's not NULL (set on the init path),
> then we have a valid pointer.
>
>         rcu_read_lock();
> 	psock = sk_psock(sk);
>  	if (likely(psock)) {
>  		proto = psock->sk_proto;
>  		write_space = psock->saved_write_space;
>  		ops = psock->icsk_af_ops;
>  	}
>  	rcu_read_unlock();
>
>         if (!ops)
> 		ops = inet_csk(sk)->icsk_af_ops;
>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>
>
> We should do this with proper READ_ONCE/WRITE_ONCE to make it clear
> what is going on and to stop the compiler from breaking these assumptions. I
> was going to generate that patch after this series but can do it before
> as well. I didn't mention it here because it seems a bit out of scope
> for this series; it's mostly a fix to older code.

+1, looking forward to your patch. Also, as I've recently learned, that
should enable KTSAN to reason about the psock code [0].

> Also I started to think that write_space might be out of sync with ops but
> it seems we never actually remove psock_write_space until after
> rcu grace period so that should be OK as well and always point to the
> previous write_space.
>
> Finally I wondered if we could remove the ops and then add it back
> quickly which seems at least in theory possible, but that would get
> hit with a grace period because we can't have conflicting psock
> definitions on the same sock. So expanding the rcu block to include
> the ops = inet_csk(sk)->icsk_af_ops would fix that case.

I see, ops = inet_csk(sk)->icsk_af_ops might read out an ops pointer that
has been overridden again after sock_map_unlink followed by sock_map_link. Ouch.
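
For the record, the interleaving would look roughly like this
(hypothetical sketch):

	tcp_bpf_syn_recv_sock()            map delete + re-add
	-----------------------            -------------------
	                                   sock_map_unlink() /* restores ops */
	psock = sk_psock(sk) /* NULL */
	                                   sock_map_link()   /* overrides ops */
	ops = inet_csk(sk)->icsk_af_ops
	  /* reads the tcp_bpf ops, not the original ones */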

> So in summary I think we should expand the rcu lock here to include the
> ops = inet_csk(sk)->icsk_af_ops to ensure we don't race with tear
> down and create. I'll push the necessary update with WRITE_ONCE and
> READ_ONCE to fix that up. Seeing we have to wait until the merge
> window opens most likely anyways I'll send those out sooner rather
> than later and this series can add the proper annotations as well.

Or I could leave psock->icsk_af_ops set in restore_proto, like we do for
write_space, as you noted. Restoring it twice doesn't seem harmful; it
has no side effects. Fewer state changes to think about?
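
A sketch of that variant (assuming we simply skip the clearing):

	static inline void sk_psock_restore_proto(struct sock *sk,
						  struct sk_psock *psock)
	{
		struct inet_connection_sock *icsk = inet_csk(sk);
		...
		/* Restore, but keep the saved pointer around. Restoring
		 * twice is idempotent and readers never see a NULL ops.
		 */
		if (psock->icsk_af_ops)
			icsk->icsk_af_ops = psock->icsk_af_ops;
	}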

I'll still have to apply what you suggest for saving psock->sk_proto,
though.

Thanks,
Jakub

[0] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
       [not found]           ` <87sglsfdda.fsf@cloudflare.com>
@ 2019-12-11 17:20             ` Martin Lau
  2019-12-12 11:27               ` Jakub Sitnicki
  0 siblings, 1 reply; 39+ messages in thread
From: Martin Lau @ 2019-12-11 17:20 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: John Fastabend, bpf, netdev, kernel-team

On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
> John, Martin,
> 
> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> >>> > [ ... ]
> >>> >
> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> >>> >>  			sk->sk_prot = psock->sk_proto;
> >>> >>  		psock->sk_proto = NULL;
> >>> >>  	}
> >>> >> +
> >>> >> +	if (psock->icsk_af_ops) {
> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> >>> >> +		psock->icsk_af_ops = NULL;
> >>> >> +	}
> >>> >>  }
> >>> >
> >>> > [ ... ]
> >>> >
> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> >>> >> +					  struct sk_buff *skb,
> >>> >> +					  struct request_sock *req,
> >>> >> +					  struct dst_entry *dst,
> >>> >> +					  struct request_sock *req_unhash,
> >>> >> +					  bool *own_req)
> >>> >> +{
> >>> >> +	const struct inet_connection_sock_af_ops *ops;
> >>> >> +	void (*write_space)(struct sock *sk);
> >>> >> +	struct sk_psock *psock;
> >>> >> +	struct proto *proto;
> >>> >> +	struct sock *child;
> >>> >> +
> >>> >> +	rcu_read_lock();
> >>> >> +	psock = sk_psock(sk);
> >>> >> +	if (likely(psock)) {
> >>> >> +		proto = psock->sk_proto;
> >>> >> +		write_space = psock->saved_write_space;
> >>> >> +		ops = psock->icsk_af_ops;
> >>> > It is not immediately clear to me what ensure
> >>> > ops is not NULL here.
> >>> >
> >>> > It is likely I missed something.  A short comment would
> >>> > be very useful here.
> >>>
> >>> I can see the readability problem. Looking at it now, perhaps it should
> >>> be rewritten, to the same effect, as:
> >>>
> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
> >>> {
> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
> >>>         ...
> >>>
> >>>         rcu_read_lock();
> >>> 	psock = sk_psock(sk);
> >>> 	if (likely(psock)) {
> >>> 		proto = psock->sk_proto;
> >>> 		write_space = psock->saved_write_space;
> >>> 		ops = psock->icsk_af_ops;
> >>> 	}
> >>> 	rcu_read_unlock();
> >>>
> >>>         if (!ops)
> >>> 		ops = inet_csk(sk)->icsk_af_ops;
> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> >>>
> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> >>> properly. To double check what happens here:
> >> I did not mean the init path.  The init path is fine since it inits
> >> everything on psock before publishing the sk to the sock_map.
> >>
> >> I was thinking of the delete path (e.g. sock_map_delete_elem).  It is not clear
> >> to me what prevents the earlier pasted sk_psock_restore_proto(), which sets
> >> psock->icsk_af_ops to NULL, from running in parallel with
> >> tcp_bpf_syn_recv_sock().  An explanation would be useful.
> >
> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
> >
> > Setting psock->icsk_af_ops to null on restore and not checking for it
> > here was a bad move on my side.  Also I need to revisit what to do about
> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
> >
> > This race should be easy enough to trigger. Will give it a shot.
> 
> I've convinced myself that this approach is racy beyond repair.
> 
> Once syn_recv_sock() has returned it is too late to reset the child
> sk_user_data and restore its callbacks. It has already been inserted
> into ehash and the ingress path can invoke its callbacks.
> 
> The race can be triggered with a reproducer where:
> 
> thread-1:
> 
>         p = accept(s, ...);
>         close(p);
> 
> thread-2:
> 
> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
> 	bpf_map_delete_elem(mapfd, &key);
> 
> This is a dead end because we can't have the parent and the child share the
> psock state. Even though psock itself is refcounted, and potentially we
> could grab a reference before cloning the parent, the link into the map that
> the psock holds is not.
> 
> Two ways out come to mind. Both involve touching TCP code, which I was
> hoping to avoid:
> 
> 1) reset sk_user_data when initializing the child
> 
>    This is problematic because tcp_bpf callbacks are not designed to
>    handle sockets with no psock _and_ with overridden sk_prot
>    callbacks. (Although, I think they could if the fallback was directly
>    on {tcp,tcpv6}_prot based on socket domain.)
> 
>    Also, there are other sk_user_data users like DRBD which rely on
>    sharing the sk_user_data pointer between parent and child, if I read
>    the code correctly [0]. If anything, clearing the sk_user_data on
>    clone would have to be guarded by a flag.
Can the copy/not-to-copy sk_user_data decision be made in
sk_clone_lock()?

> 
> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
> 
>    The simpler way out. tcp_bpf callbacks never get invoked on the child
>    socket so the copied psock reference is no longer a problem. We can
>    clear the pointer on accept().
> 
>    So far I wasn't able to poke any holes in it and it comes down to
>    patching tcp_create_openreq_child() with:
> 
> 	/* sk_msg and ULP frameworks can override the callbacks into
> 	 * protocol. We don't assume they are intended to be inherited
> 	 * by the child. Frameworks can re-install the callbacks on
> 	 * accept() if needed.
> 	 */
> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
> 
>    That's what I'm going with for v2.
> 
> Open to suggestions.
> 
> Thanks,
> Jakub
> 
> BTW. Reading into kTLS code, I noticed it has been limited down to just
> established sockets due to the same problem I'm struggling with here:
> 
> static int tls_init(struct sock *sk)
> {
> ...
> 	/* The TLS ulp is currently supported only for TCP sockets
> 	 * in ESTABLISHED state.
> 	 * Supporting sockets in LISTEN state will require us
> 	 * to modify the accept implementation to clone rather then
> 	 * share the ulp context.
> 	 */
> 	if (sk->sk_state != TCP_ESTABLISHED)
> 		return -ENOTCONN;
> 
> [0] https://elixir.bootlin.com/linux/v5.5-rc1/source/drivers/block/drbd/drbd_receiver.c#L682

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  2019-11-25 22:30   ` Martin Lau
  2019-11-26 14:32     ` Jakub Sitnicki
@ 2019-12-12 10:30     ` Jakub Sitnicki
  1 sibling, 0 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-12-12 10:30 UTC (permalink / raw)
  To: Martin Lau; +Cc: bpf, netdev, kernel-team, John Fastabend

On Mon, Nov 25, 2019 at 11:30 PM CET, Martin Lau wrote:
> On Sat, Nov 23, 2019 at 12:07:50PM +0100, Jakub Sitnicki wrote:
>> Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
>> can be selected at run-time. Also allow choosing which L4 protocols get
>> tested.
> If new cmdline args are added to select different subtests,
> I would prefer to move it to the test_progs framework and reuse the subtests
> support in test_progs commit 3a516a0a3a7b ("selftests/bpf: add sub-tests support for test_progs").
> Its default is to run all instead of having a separate bash script to
> do that.

This turned out to be more work than I expected, so I've split it out
into a separate series:

https://lore.kernel.org/bpf/20191212102259.418536-1-jakub@cloudflare.com/T/#t

Thanks,
Jakub

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-12-11 17:20             ` Martin Lau
@ 2019-12-12 11:27               ` Jakub Sitnicki
  2019-12-12 19:23                 ` Martin Lau
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Sitnicki @ 2019-12-12 11:27 UTC (permalink / raw)
  To: Martin Lau; +Cc: John Fastabend, bpf, netdev, kernel-team

On Wed, Dec 11, 2019 at 06:20 PM CET, Martin Lau wrote:
> On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
>> John, Martin,
>>
>> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
>> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
>> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
>> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
>> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
>> >>> > [ ... ]
>> >>> >
>> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>> >>> >>  			sk->sk_prot = psock->sk_proto;
>> >>> >>  		psock->sk_proto = NULL;
>> >>> >>  	}
>> >>> >> +
>> >>> >> +	if (psock->icsk_af_ops) {
>> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
>> >>> >> +		psock->icsk_af_ops = NULL;
>> >>> >> +	}
>> >>> >>  }
>> >>> >
>> >>> > [ ... ]
>> >>> >
>> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
>> >>> >> +					  struct sk_buff *skb,
>> >>> >> +					  struct request_sock *req,
>> >>> >> +					  struct dst_entry *dst,
>> >>> >> +					  struct request_sock *req_unhash,
>> >>> >> +					  bool *own_req)
>> >>> >> +{
>> >>> >> +	const struct inet_connection_sock_af_ops *ops;
>> >>> >> +	void (*write_space)(struct sock *sk);
>> >>> >> +	struct sk_psock *psock;
>> >>> >> +	struct proto *proto;
>> >>> >> +	struct sock *child;
>> >>> >> +
>> >>> >> +	rcu_read_lock();
>> >>> >> +	psock = sk_psock(sk);
>> >>> >> +	if (likely(psock)) {
>> >>> >> +		proto = psock->sk_proto;
>> >>> >> +		write_space = psock->saved_write_space;
>> >>> >> +		ops = psock->icsk_af_ops;
>> >>> > It is not immediately clear to me what ensure
>> >>> > ops is not NULL here.
>> >>> >
>> >>> > It is likely I missed something.  A short comment would
>> >>> > be very useful here.
>> >>>
>> >>> I can see the readability problem. Looking at it now, perhaps it should
>> >>> be rewritten, to the same effect, as:
>> >>>
>> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
>> >>> {
>> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
>> >>>         ...
>> >>>
>> >>>         rcu_read_lock();
>> >>> 	psock = sk_psock(sk);
>> >>> 	if (likely(psock)) {
>> >>> 		proto = psock->sk_proto;
>> >>> 		write_space = psock->saved_write_space;
>> >>> 		ops = psock->icsk_af_ops;
>> >>> 	}
>> >>> 	rcu_read_unlock();
>> >>>
>> >>>         if (!ops)
>> >>> 		ops = inet_csk(sk)->icsk_af_ops;
>> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>> >>>
>> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
>> >>> properly. To double check what happens here:
>> >> I did not mean the init path.  The init path is fine since it inits
>> >> everything on psock before publishing the sk to the sock_map.
>> >>
>> >> I was thinking of the delete path (e.g. sock_map_delete_elem).  It is not clear
>> >> to me what prevents the earlier pasted sk_psock_restore_proto(), which sets
>> >> psock->icsk_af_ops to NULL, from running in parallel with
>> >> tcp_bpf_syn_recv_sock().  An explanation would be useful.
>> >
>> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
>> >
>> > Setting psock->icsk_af_ops to null on restore and not checking for it
>> > here was a bad move on my side.  Also I need to revisit what to do about
>> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
>> >
>> > This race should be easy enough to trigger. Will give it a shot.
>>
>> I've convinced myself that this approach is racy beyond repair.
>>
>> Once syn_recv_sock() has returned it is too late to reset the child
>> sk_user_data and restore its callbacks. It has already been inserted
>> into ehash and the ingress path can invoke its callbacks.
>>
>> The race can be triggered with a reproducer where:
>>
>> thread-1:
>>
>>         p = accept(s, ...);
>>         close(p);
>>
>> thread-2:
>>
>> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
>> 	bpf_map_delete_elem(mapfd, &key);
>>
>> This is a dead end because we can't have the parent and the child share the
>> psock state. Even though psock itself is refcounted, and potentially we
>> could grab a reference before cloning the parent, the link into the map that
>> the psock holds is not.
>>
>> Two ways out come to mind. Both involve touching TCP code, which I was
>> hoping to avoid:
>>
>> 1) reset sk_user_data when initializing the child
>>
>>    This is problematic because tcp_bpf callbacks are not designed to
>>    handle sockets with no psock _and_ with overridden sk_prot
>>    callbacks. (Although, I think they could if the fallback was directly
>>    on {tcp,tcpv6}_prot based on socket domain.)
>>
>>    Also, there are other sk_user_data users like DRBD which rely on
>>    sharing the sk_user_data pointer between parent and child, if I read
>>    the code correctly [0]. If anything, clearing the sk_user_data on
>>    clone would have to be guarded by a flag.
> Can the copy/not-to-copy sk_user_data decision be made in
> sk_clone_lock()?

Yes, this could be pushed down to sk_clone_lock(), where we do similar
work (reset sk_reuseport_cb and clone bpf_sk_storage):

	/* User data can hold reference. Child must not
	 * inherit the pointer without acquiring a reference.
	 */
	if (sock_flag(sk, SOCK_OWNS_USER_DATA)) {
		sock_reset_flag(newsk, SOCK_OWNS_USER_DATA);
		RCU_INIT_POINTER(newsk->sk_user_data, NULL);
	}

I believe this would still need to be guarded by a flag.  Do you see
value in clearing the child's sk_user_data on clone as opposed to delaying
that work until accept() time?

-Jakub

>
>>
>> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
>>
>>    The simpler way out. tcp_bpf callbacks never get invoked on the child
>>    socket so the copied psock reference is no longer a problem. We can
>>    clear the pointer on accept().
>>
>>    So far I wasn't able to poke any holes in it and it comes down to
>>    patching tcp_create_openreq_child() with:
>>
>> 	/* sk_msg and ULP frameworks can override the callbacks into
>> 	 * protocol. We don't assume they are intended to be inherited
>> 	 * by the child. Frameworks can re-install the callbacks on
>> 	 * accept() if needed.
>> 	 */
>> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
>>
>>    That's what I'm going with for v2.
>>
>> Open to suggestions.
>>
>> Thanks,
>> Jakub
>>
>> BTW. Reading into kTLS code, I noticed it has been limited down to just
>> established sockets due to the same problem I'm struggling with here:
>>
>> static int tls_init(struct sock *sk)
>> {
>> ...
>> 	/* The TLS ulp is currently supported only for TCP sockets
>> 	 * in ESTABLISHED state.
>> 	 * Supporting sockets in LISTEN state will require us
>> 	 * to modify the accept implementation to clone rather then
>> 	 * share the ulp context.
>> 	 */
>> 	if (sk->sk_state != TCP_ESTABLISHED)
>> 		return -ENOTCONN;
>>
>> [0] https://elixir.bootlin.com/linux/v5.5-rc1/source/drivers/block/drbd/drbd_receiver.c#L682

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-12-12 11:27               ` Jakub Sitnicki
@ 2019-12-12 19:23                 ` Martin Lau
  2019-12-17 15:06                   ` Jakub Sitnicki
  0 siblings, 1 reply; 39+ messages in thread
From: Martin Lau @ 2019-12-12 19:23 UTC (permalink / raw)
  Cc: John Fastabend, bpf, netdev, kernel-team

On Thu, Dec 12, 2019 at 12:27:19PM +0100, Jakub Sitnicki wrote:
> On Wed, Dec 11, 2019 at 06:20 PM CET, Martin Lau wrote:
> > On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
> >> John, Martin,
> >>
> >> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
> >> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
> >> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> >> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> >> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> >> >>> > [ ... ]
> >> >>> >
> >> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> >> >>> >>  			sk->sk_prot = psock->sk_proto;
> >> >>> >>  		psock->sk_proto = NULL;
> >> >>> >>  	}
> >> >>> >> +
> >> >>> >> +	if (psock->icsk_af_ops) {
> >> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> >> >>> >> +		psock->icsk_af_ops = NULL;
> >> >>> >> +	}
> >> >>> >>  }
> >> >>> >
> >> >>> > [ ... ]
> >> >>> >
> >> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> >> >>> >> +					  struct sk_buff *skb,
> >> >>> >> +					  struct request_sock *req,
> >> >>> >> +					  struct dst_entry *dst,
> >> >>> >> +					  struct request_sock *req_unhash,
> >> >>> >> +					  bool *own_req)
> >> >>> >> +{
> >> >>> >> +	const struct inet_connection_sock_af_ops *ops;
> >> >>> >> +	void (*write_space)(struct sock *sk);
> >> >>> >> +	struct sk_psock *psock;
> >> >>> >> +	struct proto *proto;
> >> >>> >> +	struct sock *child;
> >> >>> >> +
> >> >>> >> +	rcu_read_lock();
> >> >>> >> +	psock = sk_psock(sk);
> >> >>> >> +	if (likely(psock)) {
> >> >>> >> +		proto = psock->sk_proto;
> >> >>> >> +		write_space = psock->saved_write_space;
> >> >>> >> +		ops = psock->icsk_af_ops;
> >> >>> > It is not immediately clear to me what ensure
> >> >>> > ops is not NULL here.
> >> >>> >
> >> >>> > It is likely I missed something.  A short comment would
> >> >>> > be very useful here.
> >> >>>
> >> >>> I can see the readability problem. Looking at it now, perhaps it should
> >> >>> be rewritten, to the same effect, as:
> >> >>>
> >> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
> >> >>> {
> >> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
> >> >>>         ...
> >> >>>
> >> >>>         rcu_read_lock();
> >> >>> 	psock = sk_psock(sk);
> >> >>> 	if (likely(psock)) {
> >> >>> 		proto = psock->sk_proto;
> >> >>> 		write_space = psock->saved_write_space;
> >> >>> 		ops = psock->icsk_af_ops;
> >> >>> 	}
> >> >>> 	rcu_read_unlock();
> >> >>>
> >> >>>         if (!ops)
> >> >>> 		ops = inet_csk(sk)->icsk_af_ops;
> >> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> >> >>>
> >> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> >> >>> properly. To double check what happens here:
> >> >> I did not mean the init path.  The init path is fine since it inits
> >> >> everything on psock before publishing the sk to the sock_map.
> >> >>
> >> >> I was thinking of the delete path (e.g. sock_map_delete_elem).  It is not clear
> >> >> to me what prevents the earlier pasted sk_psock_restore_proto(), which sets
> >> >> psock->icsk_af_ops to NULL, from running in parallel with
> >> >> tcp_bpf_syn_recv_sock().  An explanation would be useful.
> >> >
> >> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
> >> >
> >> > Setting psock->icsk_af_ops to null on restore and not checking for it
> >> > here was a bad move on my side.  Also I need to revisit what to do about
> >> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
> >> >
> >> > This race should be easy enough to trigger. Will give it a shot.
> >>
> >> I've convinced myself that this approach is racy beyond repair.
> >>
> >> Once syn_recv_sock() has returned it is too late to reset the child
> >> sk_user_data and restore its callbacks. It has already been inserted
> >> into ehash and the ingress path can invoke its callbacks.
> >>
> >> The race can be triggered with a reproducer where:
> >>
> >> thread-1:
> >>
> >>         p = accept(s, ...);
> >>         close(p);
> >>
> >> thread-2:
> >>
> >> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
> >> 	bpf_map_delete_elem(mapfd, &key);
> >>
> >> This is a dead end because we can't have the parent and the child share the
> >> psock state. Even though psock itself is refcounted, and potentially we
> >> could grab a reference before cloning the parent, the link into the map that
> >> the psock holds is not.
> >>
> >> Two ways out come to mind. Both involve touching TCP code, which I was
> >> hoping to avoid:
> >>
> >> 1) reset sk_user_data when initializing the child
> >>
> >>    This is problematic because tcp_bpf callbacks are not designed to
> >>    handle sockets with no psock _and_ with overridden sk_prot
> >>    callbacks. (Although, I think they could if the fallback was directly
> >>    on {tcp,tcpv6}_prot based on socket domain.)
> >>
> >>    Also, there are other sk_user_data users like DRBD which rely on
> >>    sharing the sk_user_data pointer between parent and child, if I read
> >>    the code correctly [0]. If anything, clearing the sk_user_data on
> >>    clone would have to be guarded by a flag.
> > Can the copy/not-to-copy sk_user_data decision be made in
> > sk_clone_lock()?
> 
> Yes, this could be pushed down to sk_clone_lock(), where we do similar
> work (reset sk_reuseport_cb and clone bpf_sk_storage):
Aha.  I missed your earlier "clearing the sk_user_data on clone would have
to be guarded by a flag..." part.  It turns out we were talking about the
same thing in (1).  sock_flag works better if there is still a bit left (and
it seems there is one), although I was thinking more along the lines of
adding something (e.g. a func ptr) to 'struct proto' to mangle sk_user_data
before returning newsk... but I'm not sure this kind of logic belongs in
'struct proto'.

> 
> 	/* User data can hold reference. Child must not
> 	 * inherit the pointer without acquiring a reference.
> 	 */
> 	if (sock_flag(sk, SOCK_OWNS_USER_DATA)) {
> 		sock_reset_flag(newsk, SOCK_OWNS_USER_DATA);
> 		RCU_INIT_POINTER(newsk->sk_user_data, NULL);
> 	}
> 
> I believe this would still need to be guarded by a flag.  Do you see
> value in clearing the child's sk_user_data on clone as opposed to delaying
> that work until accept() time?
It seems to me clearing things up front at the very beginning is more
straightforward, such that we do not have to worry about sk_user_data
being used in the wrong way before it gets a chance to be cleared in
accept().

Just something to consider; if it is obvious that there is no hole in
clearing it in accept(), that is fine too.

> >>
> >> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
> >>
> >>    The simpler way out. tcp_bpf callbacks never get invoked on the child
> >>    socket so the copied psock reference is no longer a problem. We can
> >>    clear the pointer on accept().
> >>
> >>    So far I wasn't able to poke any holes in it and it comes down to
> >>    patching tcp_create_openreq_child() with:
> >>
> >> 	/* sk_msg and ULP frameworks can override the callbacks into
> >> 	 * protocol. We don't assume they are intended to be inherited
> >> 	 * by the child. Frameworks can re-install the callbacks on
> >> 	 * accept() if needed.
> >> 	 */
> >> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
> >>
> >>    That's what I'm going with for v2.
> >>
> >> Open to suggestions.
> >>
> >> Thanks,
> >> Jakub
> >>
> >> BTW. Reading into kTLS code, I noticed it has been limited down to just
> >> established sockets due to the same problem I'm struggling with here:
> >>
> >> static int tls_init(struct sock *sk)
> >> {
> >> ...
> >> 	/* The TLS ulp is currently supported only for TCP sockets
> >> 	 * in ESTABLISHED state.
> >> 	 * Supporting sockets in LISTEN state will require us
> >> 	 * to modify the accept implementation to clone rather then
> >> 	 * share the ulp context.
> >> 	 */
> >> 	if (sk->sk_state != TCP_ESTABLISHED)
> >> 		return -ENOTCONN;
> >>
> >> [0] https://elixir.bootlin.com/linux/v5.5-rc1/source/drivers/block/drbd/drbd_receiver.c#L682

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
  2019-12-12 19:23                 ` Martin Lau
@ 2019-12-17 15:06                   ` Jakub Sitnicki
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Sitnicki @ 2019-12-17 15:06 UTC (permalink / raw)
  To: Martin Lau; +Cc: John Fastabend, bpf, netdev, kernel-team

On Thu, Dec 12, 2019 at 08:23 PM CET, Martin Lau wrote:
> On Thu, Dec 12, 2019 at 12:27:19PM +0100, Jakub Sitnicki wrote:
>> On Wed, Dec 11, 2019 at 06:20 PM CET, Martin Lau wrote:
>> > On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
>> >> John, Martin,
>> >>
>> >> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
>> >> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
>> >> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
>> >> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
>> >> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
>> >> >>> > [ ... ]
>> >> >>> >
>> >> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>> >> >>> >>  			sk->sk_prot = psock->sk_proto;
>> >> >>> >>  		psock->sk_proto = NULL;
>> >> >>> >>  	}
>> >> >>> >> +
>> >> >>> >> +	if (psock->icsk_af_ops) {
>> >> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
>> >> >>> >> +		psock->icsk_af_ops = NULL;
>> >> >>> >> +	}
>> >> >>> >>  }
>> >> >>> >
>> >> >>> > [ ... ]
>> >> >>> >
>> >> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
>> >> >>> >> +					  struct sk_buff *skb,
>> >> >>> >> +					  struct request_sock *req,
>> >> >>> >> +					  struct dst_entry *dst,
>> >> >>> >> +					  struct request_sock *req_unhash,
>> >> >>> >> +					  bool *own_req)
>> >> >>> >> +{
>> >> >>> >> +	const struct inet_connection_sock_af_ops *ops;
>> >> >>> >> +	void (*write_space)(struct sock *sk);
>> >> >>> >> +	struct sk_psock *psock;
>> >> >>> >> +	struct proto *proto;
>> >> >>> >> +	struct sock *child;
>> >> >>> >> +
>> >> >>> >> +	rcu_read_lock();
>> >> >>> >> +	psock = sk_psock(sk);
>> >> >>> >> +	if (likely(psock)) {
>> >> >>> >> +		proto = psock->sk_proto;
>> >> >>> >> +		write_space = psock->saved_write_space;
>> >> >>> >> +		ops = psock->icsk_af_ops;
>> >> >>> > It is not immediately clear to me what ensure
>> >> >>> > ops is not NULL here.
>> >> >>> >
>> >> >>> > It is likely I missed something.  A short comment would
>> >> >>> > be very useful here.
>> >> >>>
>> >> >>> I can see the readability problem. Looking at it now, perhaps it should
>> >> >>> be rewritten, to the same effect, as:
>> >> >>>
>> >> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
>> >> >>> {
>> >> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
>> >> >>>         ...
>> >> >>>
>> >> >>>         rcu_read_lock();
>> >> >>> 	psock = sk_psock(sk);
>> >> >>> 	if (likely(psock)) {
>> >> >>> 		proto = psock->sk_proto;
>> >> >>> 		write_space = psock->saved_write_space;
>> >> >>> 		ops = psock->icsk_af_ops;
>> >> >>> 	}
>> >> >>> 	rcu_read_unlock();
>> >> >>>
>> >> >>>         if (!ops)
>> >> >>> 		ops = inet_csk(sk)->icsk_af_ops;
>> >> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>> >> >>>
>> >> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
>> >> >>> properly. To double check what happens here:
>> >> >> I did not mean the init path.  The init path is fine since it inits
>> >> >> everything on psock before publishing the sk to the sock_map.
>> >> >>
>> >> >> I was thinking of the delete path (e.g. sock_map_delete_elem).  It is not clear
>> >> >> to me what prevents the earlier pasted sk_psock_restore_proto(), which sets
>> >> >> psock->icsk_af_ops to NULL, from running in parallel with
>> >> >> tcp_bpf_syn_recv_sock().  An explanation would be useful.
>> >> >
>> >> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
>> >> >
>> >> > Setting psock->icsk_af_ops to null on restore and not checking for it
>> >> > here was a bad move on my side.  Also I need to revisit what to do about
>> >> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
>> >> >
>> >> > This race should be easy enough to trigger. Will give it a shot.
>> >>
>> >> I've convinced myself that this approach is racy beyond repair.
>> >>
>> >> Once syn_recv_sock() has returned it is too late to reset the child
>> >> sk_user_data and restore its callbacks. It has already been inserted
>> >> into ehash and the ingress path can invoke its callbacks.
>> >>
>> >> The race can be triggered with a reproducer where:
>> >>
>> >> thread-1:
>> >>
>> >>         p = accept(s, ...);
>> >>         close(p);
>> >>
>> >> thread-2:
>> >>
>> >> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
>> >> 	bpf_map_delete_elem(mapfd, &key);
>> >>
>> >> This is a dead end because we can't have the parent and the child share the
>> >> psock state. Even though psock itself is refcounted, and potentially we
>> >> could grab a reference before cloning the parent, the link into the map that
>> >> the psock holds is not.
>> >>
>> >> Two ways out come to mind. Both involve touching TCP code, which I was
>> >> hoping to avoid:
>> >>
>> >> 1) reset sk_user_data when initializing the child
>> >>
>> >>    This is problematic because tcp_bpf callbacks are not designed to
>> >>    handle sockets with no psock _and_ with overridden sk_prot
>> >>    callbacks. (Although, I think they could if the fallback was directly
>> >>    on {tcp,tcpv6}_prot based on socket domain.)
>> >>
>> >>    Also, there are other sk_user_data users like DRBD which rely on
>> >>    sharing the sk_user_data pointer between parent and child, if I read
>> >>    the code correctly [0]. If anything, clearing the sk_user_data on
>> >>    clone would have to be guarded by a flag.
>> > Can the copy/not-to-copy sk_user_data decision be made in
>> > sk_clone_lock()?
>>
>> Yes, this could be pushed down to sk_clone_lock(), where we do similar
>> work (reset sk_reuseport_cb and clone bpf_sk_storage):
> Aha.  I missed your earlier "clearing the sk_user_data on clone would have
> to be guarded by a flag..." part.  It turns out we were talking about the
> same thing in (1).  sock_flag works better if there is still a bit left (and
> it seems there is one), although I was thinking more along the lines of
> adding something (e.g. a func ptr) to 'struct proto' to mangle sk_user_data
> before returning newsk... but I'm not sure this kind of logic belongs in
> 'struct proto'.

Sorry for the late reply.

We have 4 bits left by my count. The multi-line comment for SOCK_NOFCS
gets in the way of counting them one line per bit.

A callback invoked on socket clone is something I was considering too.
I'm not sure where it belongs either. At the risk of being too use-case
specific, perhaps it could live together with sk_user_data and sk_prot,
which it would mangle on sk_clone_lock():

struct sock {
        ...
	void			*sk_user_data;
	void			(*sk_clone)(struct sock *sk,
					    struct sock *newsk);
        ...
}

But I feel adding a new sock field just for this wouldn't be justified;
I can get by with a sock flag. Unless we have other uses for it?

>
>>
>> 	/* User data can hold reference. Child must not
>> 	 * inherit the pointer without acquiring a reference.
>> 	 */
>> 	if (sock_flag(sk, SOCK_OWNS_USER_DATA)) {
>> 		sock_reset_flag(newsk, SOCK_OWNS_USER_DATA);
>> 		RCU_INIT_POINTER(newsk->sk_user_data, NULL);
>> 	}
>>
>> I believe this would still need to be guarded by a flag.  Do you see
>> value in clearing the child's sk_user_data on clone, as opposed to
>> delaying that work until accept() time?
> It seems to me that clearing things up front, at the very beginning, is
> more straightforward; that way we don't have to worry about sk_user_data
> being used in a wrong way before it gets a chance to be cleared in
> accept().
>
> Just something to consider; if it is obvious that there is no hole in
> clearing it in accept(), that is fine too.

Just when I thought I could get away with lazily clearing the
sk_user_data at accept() time, it occurred to me that it is not enough.

The listening socket could get deleted from the sockmap before a child
socket that inherited a copy of the sk_user_data pointer gets
accept()'ed. In that scenario the pointer would not get NULL'ed on
accept(), because the listening socket would have had its
sk_prot->accept restored by then.
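
To spell out the interleaving I have in mind:

1) listener s is in the sockmap, so its sk_prot callbacks, including
   ->accept, are overridden,
2) a connection comes in, and the child copies s->sk_user_data on
   clone,
3) s gets deleted from the map, and its original sk_prot callbacks are
   restored,
4) accept(s) now runs the plain accept, so nothing is left in place to
   NULL the child's copied sk_user_data.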

I will need that flag after all...

-jkbs

>
>> >>
>> >> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
>> >>
>> >>    The simpler way out. tcp_bpf callbacks never get invoked on the child
>> >>    socket, so the copied psock reference is no longer a problem. We can
>> >>    clear the pointer on accept().
>> >>
>> >>    So far I wasn't able to poke any holes in it, and it comes down to
>> >>    patching tcp_create_openreq_child() with:
>> >>
>> >> 	/* sk_msg and ULP frameworks can override the callbacks into
>> >> 	 * protocol. We don't assume they are intended to be inherited
>> >> 	 * by the child. Frameworks can re-install the callbacks on
>> >> 	 * accept() if needed.
>> >> 	 */
>> >> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
>> >>
>> >>    That's what I'm going with for v2.
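>> >>
>> >>    For context, the line would go right at the top, after
>> >>    inet_csk_clone_lock() returns the child (the placement is my
>> >>    sketch against v5.4-era sources):
>> >>
>> >>         struct sock *tcp_create_openreq_child(const struct sock *sk,
>> >>                                               struct request_sock *req,
>> >>                                               struct sk_buff *skb)
>> >>         {
>> >>                 struct sock *newsk = inet_csk_clone_lock(sk, req,
>> >>                                                          GFP_ATOMIC);
>> >>
>> >>                 if (!newsk)
>> >>                         return NULL;
>> >>
>> >>                 WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
>> >>                 ...
>> >>         }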
>> >>
>> >> Open to suggestions.
>> >>
>> >> Thanks,
>> >> Jakub
>> >>
>> >> BTW. Reading the kTLS code, I noticed it has been limited to just
>> >> established sockets due to the same problem I'm struggling with here:
>> >>
>> >> static int tls_init(struct sock *sk)
>> >> {
>> >> ...
>> >> 	/* The TLS ulp is currently supported only for TCP sockets
>> >> 	 * in ESTABLISHED state.
>> >> 	 * Supporting sockets in LISTEN state will require us
>> >> 	 * to modify the accept implementation to clone rather then
>> >> 	 * share the ulp context.
>> >> 	 */
>> >> 	if (sk->sk_state != TCP_ESTABLISHED)
>> >> 		return -ENOTCONN;
>> >>
>> >> [0] https://elixir.bootlin.com/linux/v5.5-rc1/source/drivers/block/drbd/drbd_receiver.c#L682

Thread overview: 39+ messages
2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2019-11-24  5:32   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2019-11-24  5:35   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
2019-11-24  5:38   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
2019-11-24  5:56   ` John Fastabend
2019-11-25 22:38   ` Martin Lau
2019-11-26 15:54     ` Jakub Sitnicki
2019-11-26 17:16       ` Martin Lau
2019-11-26 18:36         ` Jakub Sitnicki
     [not found]           ` <87sglsfdda.fsf@cloudflare.com>
2019-12-11 17:20             ` Martin Lau
2019-12-12 11:27               ` Jakub Sitnicki
2019-12-12 19:23                 ` Martin Lau
2019-12-17 15:06                   ` Jakub Sitnicki
2019-11-26 18:43         ` John Fastabend
2019-11-27 22:18           ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-25  1:24   ` Alexei Starovoitov
2019-11-25  4:17     ` John Fastabend
2019-11-25 10:40       ` Jakub Sitnicki
2019-11-25 22:07         ` Martin Lau
2019-11-26 14:30           ` Jakub Sitnicki
2019-11-26 19:03             ` Martin Lau
2019-11-27 21:34               ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2019-11-24  6:00   ` John Fastabend
2019-11-25 22:30   ` Martin Lau
2019-11-26 14:32     ` Jakub Sitnicki
2019-12-12 10:30     ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
2019-11-24  6:04   ` John Fastabend
2019-11-24  6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
2019-11-25  9:22   ` Jakub Sitnicki
