* [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets
@ 2020-01-10 10:50 Jakub Sitnicki
  2020-01-10 10:50 ` [PATCH bpf-next v2 01/11] bpf, sk_msg: Don't reset saved sock proto on restore Jakub Sitnicki
                   ` (12 more replies)
  0 siblings, 13 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

With the realization that properly cloning listening sockets that have
psock state/callbacks is tricky, comes the second version of patches.

The spirit of the patch set stays the same - make SOCKMAP a generic
collection for listening and established sockets. This would let us use the
SOCKMAP with reuseport today, and in the future hopefully with BPF programs
that run at socket lookup time [0]. For a bit more context, please see v1
cover letter [1].

The biggest change that happened since v1 is how we deal with clearing
psock state in a copy of parent socket when cloning it (patches 3 & 4).

As much as I did not want to touch the icsk/tcp clone path, it seems
unavoidable. The changes were kept to a minimum, with care taken not to
break existing users. That said, a review from the TCP maintainer would be
invaluable (patches 3 & 4).

Patches 1 & 2 will conflict with the recently posted "Fixes for sockmap/tls
from more complex BPF progs" series [0]. I'll adapt them or split them out
of this series once the sockmap/tls fixes from John land in the bpf-next
branch.

Some food for thought - is mixing listening and established sockets in the
same BPF map a good idea? I don't know but I couldn't find a good reason to
restrict the user.

Considering how much the code evolved, I didn't carry over Acks from v1.

Thanks,
jkbs

[0] https://lore.kernel.org/bpf/157851776348.1732.12600714815781177085.stgit@ubuntu3-kvm2/T/#t
[1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/

v1 -> v2:

- af_ops->syn_recv_sock callback is no longer overridden and burdened with
  restoring sk_prot and clearing sk_user_data in the child socket. As the
  child socket is already hashed when syn_recv_sock returns, it is too late
  to put it in the right state. Instead, patches 3 & 4 restore sk_prot and
  clear sk_user_data before we hash the child socket. (Pointed out by
  Martin Lau)

- Annotate shared access to sk->sk_prot with READ_ONCE/WRITE_ONCE macros as
  we write to it from sk_msg while socket might be getting cloned on
  another CPU. (Suggested by John Fastabend)

- Convert tests for SOCKMAP holding listening sockets to return-on-error
  style, and hook them up to test_progs. Also use BPF skeleton for setup.
  Add new tests to cover the race scenario discovered during v1 review.

RFC -> v1:

- Switch from overriding proto->accept to af_ops->syn_recv_sock, which
  happens earlier. Clearing the psock state after accept() does not work
  for child sockets that become orphaned (never got accepted). v4-mapped
  sockets need special care.

- Return the socket cookie on SOCKMAP lookup from syscall to be on par with
  REUSEPORT_SOCKARRAY. Requires SOCKMAP to take u64 on lookup/update from
  syscall.

- Make bpf_sk_redirect_map (ingress) and bpf_msg_redirect_map (egress)
  SOCKMAP helpers fail when target socket is a listening one.

- Make bpf_sk_select_reuseport helper fail when target is a TCP established
  socket.

- Teach libbpf to recognize SK_REUSEPORT program type from section name.

- Add a dedicated set of tests for SOCKMAP holding listening sockets,
  covering map operations, overridden socket callbacks, and BPF helpers.


Jakub Sitnicki (11):
  bpf, sk_msg: Don't reset saved sock proto on restore
  net, sk_msg: Annotate lockless access to sk_prot on clone
  net, sk_msg: Clear sk_user_data pointer on clone if tagged
  tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  bpf, sockmap: Allow inserting listening TCP sockets into sockmap
  bpf, sockmap: Don't set up sockmap progs for listening sockets
  bpf, sockmap: Return socket cookie on lookup from syscall
  bpf, sockmap: Let all kernel-land lookup values in SOCKMAP
  bpf: Allow selecting reuseport socket from a SOCKMAP
  selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  selftests/bpf: Tests for SOCKMAP holding listening sockets

 include/linux/skmsg.h                         |   14 +-
 include/net/sock.h                            |   37 +-
 include/net/tcp.h                             |    1 +
 kernel/bpf/verifier.c                         |    6 +-
 net/core/filter.c                             |   15 +-
 net/core/skmsg.c                              |    2 +-
 net/core/sock.c                               |   11 +-
 net/core/sock_map.c                           |  120 +-
 net/ipv4/tcp_bpf.c                            |   19 +-
 net/ipv4/tcp_minisocks.c                      |    2 +
 net/ipv4/tcp_ulp.c                            |    2 +-
 net/tls/tls_main.c                            |    2 +-
 .../bpf/prog_tests/select_reuseport.c         |   60 +-
 .../selftests/bpf/prog_tests/sockmap_listen.c | 1378 +++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |   76 +
 tools/testing/selftests/bpf/test_maps.c       |    6 +-
 16 files changed, 1696 insertions(+), 55 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen.c

-- 
2.24.1



* [PATCH bpf-next v2 01/11] bpf, sk_msg: Don't reset saved sock proto on restore
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-11 22:50   ` John Fastabend
  2020-01-10 10:50 ` [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

There is no need to reset psock->sk_proto when restoring socket protocol
callbacks (sk->sk_prot). The psock is about to get detached from the sock
and eventually destroyed.

Restoring the protocol callbacks twice does no harm, and it makes reasoning
about psock state easier: once the psock has been initialized, we can
assume psock->sk_proto is set.

Also, we don't need a fallback for when socket is not using ULP.
tcp_update_ulp already does this for us.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/skmsg.h | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index ef7031f8a304..41ea1258d15e 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -359,17 +359,7 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
 	sk->sk_write_space = psock->saved_write_space;
-
-	if (psock->sk_proto) {
-		struct inet_connection_sock *icsk = inet_csk(sk);
-		bool has_ulp = !!icsk->icsk_ulp_data;
-
-		if (has_ulp)
-			tcp_update_ulp(sk, psock->sk_proto);
-		else
-			sk->sk_prot = psock->sk_proto;
-		psock->sk_proto = NULL;
-	}
+	tcp_update_ulp(sk, psock->sk_proto);
 }
 
 static inline void sk_psock_set_state(struct sk_psock *psock,
-- 
2.24.1



* [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
  2020-01-10 10:50 ` [PATCH bpf-next v2 01/11] bpf, sk_msg: Don't reset saved sock proto on restore Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-11 23:14   ` John Fastabend
  2020-01-10 10:50 ` [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

sk_msg and ULP frameworks override protocol callbacks pointer in
sk->sk_prot, while TCP accesses it locklessly when cloning the listening
socket.

Once we enable use of listening sockets with sockmap (and hence sk_msg),
there can be shared access to sk->sk_prot if socket is getting cloned while
being inserted/deleted to/from the sockmap from another CPU. Mark the
shared access with READ_ONCE/WRITE_ONCE annotations.
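
The snapshot pattern used in sk_clone_lock() can be sketched in userspace
as below. This is only an illustration under stated assumptions: the
READ_ONCE/WRITE_ONCE macros are simplified volatile-cast stand-ins for the
kernel ones, and struct proto_model, struct sock_model, struct clone_model,
model_update_proto() and model_clone() are hypothetical names, not kernel
APIs.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel macros (assumption: the real
 * definitions in the kernel tree are more elaborate). Relies on the
 * GCC/Clang __typeof__ extension.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in for struct proto. */
struct proto_model {
	const char *name;
};

/* Hypothetical stand-in for struct sock; prot models sk->sk_prot. */
struct sock_model {
	struct proto_model *prot;
};

/* What the clone path derives from the parent's protocol pointer. */
struct clone_model {
	struct proto_model *alloc_prot;	/* used for allocation */
	struct proto_model *creator;	/* models sk_prot_creator */
};

/* Writer side: models sk_psock_update_proto() swapping the callbacks. */
static void model_update_proto(struct sock_model *sk, struct proto_model *ops)
{
	WRITE_ONCE(sk->prot, ops);
}

/* Reader side: load the pointer once and reuse the snapshot, so both
 * uses agree even if a writer changes sk->prot in between.
 */
static struct clone_model model_clone(const struct sock_model *sk)
{
	struct proto_model *prot = READ_ONCE(sk->prot);
	struct clone_model c = { .alloc_prot = prot, .creator = prot };

	return c;
}
```

The point of the single load is that the allocation and the recorded
creator cannot disagree, whichever value the reader happens to observe.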

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/skmsg.h | 2 +-
 net/core/sock.c       | 5 +++--
 net/ipv4/tcp_bpf.c    | 2 +-
 net/ipv4/tcp_ulp.c    | 2 +-
 net/tls/tls_main.c    | 2 +-
 5 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 41ea1258d15e..d2d39d108354 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -352,7 +352,7 @@ static inline void sk_psock_update_proto(struct sock *sk,
 	psock->saved_write_space = sk->sk_write_space;
 
 	psock->sk_proto = sk->sk_prot;
-	sk->sk_prot = ops;
+	WRITE_ONCE(sk->sk_prot, ops);
 }
 
 static inline void sk_psock_restore_proto(struct sock *sk,
diff --git a/net/core/sock.c b/net/core/sock.c
index 8459ad579f73..96b4e8820ae8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1792,16 +1792,17 @@ static void sk_init_common(struct sock *sk)
  */
 struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 {
+	struct proto *prot = READ_ONCE(sk->sk_prot);
 	struct sock *newsk;
 	bool is_charged = true;
 
-	newsk = sk_prot_alloc(sk->sk_prot, priority, sk->sk_family);
+	newsk = sk_prot_alloc(prot, priority, sk->sk_family);
 	if (newsk != NULL) {
 		struct sk_filter *filter;
 
 		sock_copy(newsk, sk);
 
-		newsk->sk_prot_creator = sk->sk_prot;
+		newsk->sk_prot_creator = prot;
 
 		/* SANITY */
 		if (likely(newsk->sk_net_refcnt))
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index e38705165ac9..e6ffdb47b619 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -649,7 +649,7 @@ static void tcp_bpf_reinit_sk_prot(struct sock *sk, struct sk_psock *psock)
 	 * or added requiring sk_prot hook updates. We keep original saved
 	 * hooks in this case.
 	 */
-	sk->sk_prot = &tcp_bpf_prots[family][config];
+	WRITE_ONCE(sk->sk_prot, &tcp_bpf_prots[family][config]);
 }
 
 static int tcp_bpf_assert_proto_ops(struct proto *ops)
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 12ab5db2b71c..211af9759732 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -104,7 +104,7 @@ void tcp_update_ulp(struct sock *sk, struct proto *proto)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	if (!icsk->icsk_ulp_ops) {
-		sk->sk_prot = proto;
+		WRITE_ONCE(sk->sk_prot, proto);
 		return;
 	}
 
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index dac24c7aa7d4..d466b43c7eb6 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -740,7 +740,7 @@ static void tls_update(struct sock *sk, struct proto *p)
 	if (likely(ctx))
 		ctx->sk_proto = p;
 	else
-		sk->sk_prot = p;
+		WRITE_ONCE(sk->sk_prot, p);
 }
 
 static int tls_get_info(const struct sock *sk, struct sk_buff *skb)
-- 
2.24.1



* [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
  2020-01-10 10:50 ` [PATCH bpf-next v2 01/11] bpf, sk_msg: Don't reset saved sock proto on restore Jakub Sitnicki
  2020-01-10 10:50 ` [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-11 23:38   ` John Fastabend
                     ` (2 more replies)
  2020-01-10 10:50 ` [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
                   ` (9 subsequent siblings)
  12 siblings, 3 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at a reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying because we don't hold any locks at that
time.

This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned to 2 or more bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms it. Also, an RCU-protected
sk_user_data is not likely to hold a pointer to a char value or a
pathological case of "struct { char c; }". To be safe, warn when the
flag-bit is set when setting sk_user_data to catch any future misuses.
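
The tagging scheme can be illustrated with a minimal userspace sketch.
It assumes, as the text above does, that the stored object is aligned to
at least 2 bytes. All names here (NOCOPY_FLAG, PTR_MASK, struct sock_model,
set_user_data_nocopy(), get_user_data(), user_data_is_nocopy(),
clone_sock()) are hypothetical and only model the idea; they are not the
kernel helpers from this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NOCOPY_FLAG ((uintptr_t)1)	/* lowest pointer bit as the tag */
#define PTR_MASK    (~NOCOPY_FLAG)	/* strips the tag on dereference */

/* Hypothetical stand-in for struct sock. */
struct sock_model {
	void *user_data;
};

/* Store the pointer with the no-copy tag set. */
static void set_user_data_nocopy(struct sock_model *sk, void *ptr)
{
	sk->user_data = (void *)((uintptr_t)ptr | NOCOPY_FLAG);
}

/* Return the pointer with the tag bit masked off. */
static void *get_user_data(const struct sock_model *sk)
{
	return (void *)((uintptr_t)sk->user_data & PTR_MASK);
}

static bool user_data_is_nocopy(const struct sock_model *sk)
{
	return ((uintptr_t)sk->user_data & NOCOPY_FLAG) != 0;
}

/* Copy the parent, then clear user_data in the copy if it was tagged.
 * Pointer and tag travel together in a single word, so the child never
 * sees one without the other.
 */
static struct sock_model clone_sock(const struct sock_model *parent)
{
	struct sock_model child = *parent;

	if (user_data_is_nocopy(&child))
		child.user_data = NULL;
	return child;
}
```

An untagged pointer, such as one a driver wants its child sockets to
inherit, passes through clone_sock() unchanged.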

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/sock.h | 37 +++++++++++++++++++++++++++++++++++--
 net/core/skmsg.c   |  2 +-
 net/core/sock.c    |  6 ++++++
 net/ipv4/tcp_bpf.c |  4 ++++
 4 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 8dff68b4c316..071003331f55 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -518,10 +518,43 @@ enum sk_pacing {
 	SK_PACING_FQ		= 2,
 };
 
+/* Pointer stored in sk_user_data might not be suitable for copying
+ * when cloning the socket. For instance, it can point to a reference
+ * counted object. sk_user_data bottom bit is set if pointer must not
+ * be copied.
+ */
+#define SK_USER_DATA_NOCOPY	1UL
+#define SK_USER_DATA_PTRMASK	~(SK_USER_DATA_NOCOPY)
+
+/**
+ * sk_user_data_is_nocopy - Test if sk_user_data pointer must not be copied
+ * @sk: socket
+ */
+static inline bool sk_user_data_is_nocopy(const struct sock *sk)
+{
+	return ((uintptr_t)sk->sk_user_data & SK_USER_DATA_NOCOPY);
+}
+
 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
 
-#define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
-#define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
+#define rcu_dereference_sk_user_data(sk)				\
+({									\
+	void *__tmp = rcu_dereference(__sk_user_data((sk)));		\
+	(void *)((uintptr_t)__tmp & SK_USER_DATA_PTRMASK);		\
+})
+#define rcu_assign_sk_user_data(sk, ptr)				\
+({									\
+	uintptr_t __tmp = (uintptr_t)(ptr);				\
+	WARN_ON(__tmp & ~SK_USER_DATA_PTRMASK);				\
+	rcu_assign_pointer(__sk_user_data((sk)), __tmp);		\
+})
+#define rcu_assign_sk_user_data_nocopy(sk, ptr)				\
+({									\
+	uintptr_t __tmp = (uintptr_t)(ptr);				\
+	WARN_ON(__tmp & ~SK_USER_DATA_PTRMASK);				\
+	rcu_assign_pointer(__sk_user_data((sk)),			\
+			   __tmp | SK_USER_DATA_NOCOPY);		\
+})
 
 /*
  * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index ded2d5227678..eeb28cb85664 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -512,7 +512,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
 	refcount_set(&psock->refcnt, 1);
 
-	rcu_assign_sk_user_data(sk, psock);
+	rcu_assign_sk_user_data_nocopy(sk, psock);
 	sock_hold(sk);
 
 	return psock;
diff --git a/net/core/sock.c b/net/core/sock.c
index 96b4e8820ae8..4ad2bc4d4b55 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1864,6 +1864,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 			goto out;
 		}
 
+		/* Clear sk_user_data if parent had the pointer tagged
+		 * as not suitable for copying when cloning.
+		 */
+		if (sk_user_data_is_nocopy(newsk))
+			RCU_INIT_POINTER(newsk->sk_user_data, NULL);
+
 		newsk->sk_err	   = 0;
 		newsk->sk_err_soft = 0;
 		newsk->sk_priority = 0;
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index e6ffdb47b619..f6c83747c71e 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -535,6 +535,10 @@ static void tcp_bpf_remove(struct sock *sk, struct sk_psock *psock)
 {
 	struct sk_psock_link *link;
 
+	/* Did a child socket inadvertently inherit parent's psock? */
+	if (WARN_ON(sk != psock->sk))
+		return;
+
 	while ((link = sk_psock_link_pop(psock))) {
 		sk_psock_unlink(sk, link);
 		sk_psock_free_link(link);
-- 
2.24.1



* [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (2 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-11  2:42   ` kbuild test robot
                     ` (3 more replies)
  2020-01-10 10:50 ` [PATCH bpf-next v2 05/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
                   ` (8 subsequent siblings)
  12 siblings, 4 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Prepare for cloning listening sockets that have their protocol callbacks
overridden by sk_msg. Child sockets must not inherit parent callbacks that
access state stored in sk_user_data owned by the parent.

Restore the child socket protocol callbacks before it gets hashed and
any of the callbacks can get invoked.
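
As a rough userspace model of that restore step: the child is a byte copy
of the parent, so it starts out pointing at the overridden ops, and the
clone hook swaps them back to the ops the parent was created with. The
types and functions below (struct proto_model, struct sock_model,
plain_recvmsg(), bpf_recvmsg(), model_bpf_clone()) are hypothetical
stand-ins, not the kernel definitions.

```c
/* Hypothetical stand-in for struct proto with a single callback. */
struct proto_model {
	int (*recvmsg)(void);
};

static int plain_recvmsg(void) { return 0; }	/* default TCP callback */
static int bpf_recvmsg(void)   { return 1; }	/* sk_msg override */

static struct proto_model tcp_model     = { plain_recvmsg };
static struct proto_model tcp_bpf_model = { bpf_recvmsg };

/* Hypothetical stand-in for struct sock. */
struct sock_model {
	struct proto_model *prot;		/* models sk->sk_prot */
	struct proto_model *prot_creator;	/* models sk->sk_prot_creator */
};

/* If the child inherited overridden ops, fall back to the ops the
 * parent was originally created with. Sockets that were never
 * overridden are left alone.
 */
static void model_bpf_clone(const struct sock_model *parent,
			    struct sock_model *child)
{
	if (child->prot->recvmsg == bpf_recvmsg)
		child->prot = parent->prot_creator;
}
```

Detecting the override by comparing one callback pointer mirrors the check
in the patch; other detection schemes would work in this model as well.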

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/tcp.h        |  1 +
 net/ipv4/tcp_bpf.c       | 13 +++++++++++++
 net/ipv4/tcp_minisocks.c |  2 ++
 3 files changed, 16 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9dd975be7fdf..7cbf9465bb10 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2181,6 +2181,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		    int nonblock, int flags, int *addr_len);
 int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		      struct msghdr *msg, int len, int flags);
+void tcp_bpf_clone(const struct sock *sk, struct sock *child);
 
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index f6c83747c71e..6f96320fb7cf 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -586,6 +586,19 @@ static void tcp_bpf_close(struct sock *sk, long timeout)
 	saved_close(sk, timeout);
 }
 
+/* If a child got cloned from a listening socket that had tcp_bpf
+ * protocol callbacks installed, we need to restore the callbacks to
+ * the default ones because the child does not inherit the psock state
+ * that tcp_bpf callbacks expect.
+ */
+void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
+{
+	struct proto *prot = newsk->sk_prot;
+
+	if (prot->recvmsg == tcp_bpf_recvmsg)
+		newsk->sk_prot = sk->sk_prot_creator;
+}
+
 enum {
 	TCP_BPF_IPV4,
 	TCP_BPF_IPV6,
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index ad3b56d9fa71..c8274371c3d0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -548,6 +548,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->fastopen_req = NULL;
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
 
+	tcp_bpf_clone(sk, newsk);
+
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
 
 	return newsk;
-- 
2.24.1



* [PATCH bpf-next v2 05/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (3 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-11 23:59   ` John Fastabend
  2020-01-10 10:50 ` [PATCH bpf-next v2 06/11] bpf, sockmap: Don't set up sockmap progs for listening sockets Jakub Sitnicki
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

In order for sockmap type to become a generic collection for storing TCP
sockets we need to loosen the checks during map update, while tightening
the checks in redirect helpers.

Currently sockmap requires the TCP socket to be in established state (or
transitioning out of SYN_RECV into established state when done from BPF),
which prevents inserting listening sockets.

Change the update pre-checks so that the socket can also be in listening
state. If the state is not white-listed, return -EINVAL to be consistent
with REUSEPORT_SOCKARRAY map type.

Since it doesn't make sense to redirect with sockmap to listening sockets,
add appropriate socket state checks to BPF redirect helpers too.
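
The two state checks boil down to testing one bit per TCP state against a
mask. A standalone sketch follows. The enum values are an assumption meant
to mirror the kernel's TCP state numbering, and ST_*, STF(), update_okay()
and redirect_okay() are hypothetical names used only for this example.

```c
#include <stdbool.h>

/* Assumed to mirror the kernel's TCP state numbering. */
enum {
	ST_ESTABLISHED = 1,
	ST_SYN_SENT    = 2,
	ST_SYN_RECV    = 3,
	ST_LISTEN      = 10,
};

/* One bit per state, in the spirit of the kernel's TCPF_* flags. */
#define STF(state) (1u << (state))

/* States in which a socket may be inserted into the map. */
static bool update_okay(int state)
{
	return (STF(state) & (STF(ST_ESTABLISHED) |
			      STF(ST_SYN_RECV) |
			      STF(ST_LISTEN))) != 0;
}

/* States in which a socket may be the target of a redirect. */
static bool redirect_okay(int state)
{
	return (STF(state) & (STF(ST_ESTABLISHED) |
			      STF(ST_SYN_RECV))) != 0;
}
```

A listening socket therefore passes the update check but fails the
redirect check, which is the asymmetry this patch introduces.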

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c                     | 46 ++++++++++++++++++++-----
 tools/testing/selftests/bpf/test_maps.c |  6 +---
 2 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index eb114ee419b6..99daea502508 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -396,6 +396,23 @@ static bool sock_map_sk_is_suitable(const struct sock *sk)
 	       sk->sk_protocol == IPPROTO_TCP;
 }
 
+/* Is sock in a state that allows inserting into the map?
+ * SYN_RECV is needed for updates on BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB.
+ */
+static bool sock_map_update_okay(const struct sock *sk)
+{
+	return (1 << sk->sk_state) & (TCPF_ESTABLISHED |
+				      TCPF_SYN_RECV |
+				      TCPF_LISTEN);
+}
+
+/* Is sock in a state that allows redirecting into it? */
+static bool sock_map_redirect_okay(const struct sock *sk)
+{
+	return (1 << sk->sk_state) & (TCPF_ESTABLISHED |
+				      TCPF_SYN_RECV);
+}
+
 static int sock_map_update_elem(struct bpf_map *map, void *key,
 				void *value, u64 flags)
 {
@@ -413,11 +430,14 @@ static int sock_map_update_elem(struct bpf_map *map, void *key,
 		ret = -EINVAL;
 		goto out;
 	}
-	if (!sock_map_sk_is_suitable(sk) ||
-	    sk->sk_state != TCP_ESTABLISHED) {
+	if (!sock_map_sk_is_suitable(sk)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
+	if (!sock_map_update_okay(sk)) {
+		ret = -EINVAL;
+		goto out;
+	}
 
 	sock_map_sk_acquire(sk);
 	ret = sock_map_update_common(map, idx, sk, flags);
@@ -433,6 +453,7 @@ BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, sops,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
 	if (likely(sock_map_sk_is_suitable(sops->sk) &&
+		   sock_map_update_okay(sops->sk) &&
 		   sock_map_op_okay(sops)))
 		return sock_map_update_common(map, *(u32 *)key, sops->sk,
 					      flags);
@@ -454,13 +475,17 @@ BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+	struct sock *sk;
 
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
-	tcb->bpf.flags = flags;
-	tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key);
-	if (!tcb->bpf.sk_redir)
+
+	sk = __sock_map_lookup_elem(map, key);
+	if (!sk || !sock_map_redirect_okay(sk))
 		return SK_DROP;
+
+	tcb->bpf.flags = flags;
+	tcb->bpf.sk_redir = sk;
 	return SK_PASS;
 }
 
@@ -477,12 +502,17 @@ const struct bpf_func_proto bpf_sk_redirect_map_proto = {
 BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
+	struct sock *sk;
+
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
-	msg->flags = flags;
-	msg->sk_redir = __sock_map_lookup_elem(map, key);
-	if (!msg->sk_redir)
+
+	sk = __sock_map_lookup_elem(map, key);
+	if (!sk || !sock_map_redirect_okay(sk))
 		return SK_DROP;
+
+	msg->flags = flags;
+	msg->sk_redir = sk;
 	return SK_PASS;
 }
 
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 02eae1e864c2..c6766b2cff85 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -756,11 +756,7 @@ static void test_sockmap(unsigned int tasks, void *data)
 	/* Test update without programs */
 	for (i = 0; i < 6; i++) {
 		err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
-		if (i < 2 && !err) {
-			printf("Allowed update sockmap '%i:%i' not in ESTABLISHED\n",
-			       i, sfd[i]);
-			goto out_sockmap;
-		} else if (i >= 2 && err) {
+		if (err) {
 			printf("Failed noprog update sockmap '%i:%i'\n",
 			       i, sfd[i]);
 			goto out_sockmap;
-- 
2.24.1



* [PATCH bpf-next v2 06/11] bpf, sockmap: Don't set up sockmap progs for listening sockets
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (4 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 05/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-12  0:51   ` John Fastabend
  2020-01-10 10:50 ` [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Now that sockmap can hold listening sockets, when setting up the psock we
will (1) grab references to verdict/parser progs, and (2) override socket
upcalls sk_data_ready and sk_write_space.

We cannot redirect to listening sockets so we don't need to link the socket
to the BPF progs, but more importantly we don't want the listening socket
to have overridden upcalls because they would get inherited by child
sockets cloned from it.

Introduce a separate initialization path for listening sockets that does
not change the upcalls and ignores the BPF progs.
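
Why the listener must keep its original upcalls can be seen in a small
userspace model: a child is a copy of the listener, so whatever upcall the
listener holds at clone time is what the child starts with. Everything
below (struct sock_model, default_data_ready(), bpf_data_ready(),
model_link()) is a hypothetical simplification, not the sock_map code.

```c
#include <stdbool.h>

typedef int (*upcall_t)(void);

static int default_data_ready(void) { return 0; }	/* stock upcall */
static int bpf_data_ready(void)     { return 1; }	/* parser/verdict path */

/* Hypothetical stand-in for struct sock. */
struct sock_model {
	bool listening;
	bool holds_prog_refs;	/* models refs to parser/verdict progs */
	upcall_t data_ready;	/* models sk->sk_data_ready */
};

/* Two setup paths: listening sockets take neither prog references nor
 * an upcall override; all other sockets take both.
 */
static void model_link(struct sock_model *sk)
{
	if (sk->listening)
		return;

	sk->holds_prog_refs = true;
	sk->data_ready = bpf_data_ready;
}
```

With this split, copying a linked listener yields a child that still has
the stock upcall and holds no program references.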

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 99daea502508..d1a91e41ff82 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -228,6 +228,30 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 	return ret;
 }
 
+static int sock_map_link_no_progs(struct bpf_map *map, struct sock *sk)
+{
+	struct sk_psock *psock;
+	int ret;
+
+	psock = sk_psock_get_checked(sk);
+	if (IS_ERR(psock))
+		return PTR_ERR(psock);
+
+	if (psock) {
+		tcp_bpf_reinit(sk);
+		return 0;
+	}
+
+	psock = sk_psock_init(sk, map->numa_node);
+	if (!psock)
+		return -ENOMEM;
+
+	ret = tcp_bpf_init(sk);
+	if (ret < 0)
+		sk_psock_put(sk, psock);
+	return ret;
+}
+
 static void sock_map_free(struct bpf_map *map)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
@@ -352,7 +376,15 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 	if (!link)
 		return -ENOMEM;
 
-	ret = sock_map_link(map, &stab->progs, sk);
+	/* Only established or almost established sockets leaving
+	 * SYN_RECV state need to hold refs to parser/verdict progs
+	 * and have their sk_data_ready and sk_write_space callbacks
+	 * overridden.
+	 */
+	if (sk->sk_state == TCP_LISTEN)
+		ret = sock_map_link_no_progs(map, sk);
+	else
+		ret = sock_map_link(map, &stab->progs, sk);
 	if (ret < 0)
 		goto out_free;
 
-- 
2.24.1



* [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (5 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 06/11] bpf, sockmap: Don't set up sockmap progs for listening sockets Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-12  0:56   ` John Fastabend
  2020-01-13 23:12   ` Martin Lau
  2020-01-10 10:50 ` [PATCH bpf-next v2 08/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Tooling that populates the SOCKMAP with sockets from user-space needs a way
to inspect its contents. Returning the struct sock * that SOCKMAP holds to
user-space is neither safe nor useful. An approach established by
REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
instead.

Since socket cookies are u64 values, SOCKMAP needs to support such a value
size for lookup to be possible. This requires special handling on update,
though. Attempts to do a lookup on a SOCKMAP holding u32 values will be met
with an ENOSPC error.
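
The value-size handling described above can be sketched in plain C as
follows. This is a model under stated assumptions, not the kernel code:
read_sock_fd() and lookup_allowed() are hypothetical helpers, and
INT32_MAX stands in for the kernel's S32_MAX.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Update path: the map value carries a socket FD that is either 4 or 8
 * bytes wide. Widen it to 64 bits, then reject anything that cannot be
 * a valid FD.
 */
static int read_sock_fd(const void *value, size_t value_size, int *fd)
{
	uint64_t ufd;

	if (value_size == sizeof(uint64_t)) {
		uint64_t v;

		memcpy(&v, value, sizeof(v));
		ufd = v;
	} else if (value_size == sizeof(uint32_t)) {
		uint32_t v;

		memcpy(&v, value, sizeof(v));
		ufd = v;
	} else {
		return -EINVAL;
	}

	if (ufd > INT32_MAX)
		return -EINVAL;

	*fd = (int)ufd;
	return 0;
}

/* Lookup path: a u64 cookie only fits in an 8-byte value. */
static int lookup_allowed(size_t value_size)
{
	return value_size == sizeof(uint64_t) ? 0 : -ENOSPC;
}
```

A map created with 4-byte values thus stays usable for updates, while
lookups from user-space require the 8-byte value size.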

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d1a91e41ff82..3731191a7d1e 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -10,6 +10,7 @@
 #include <linux/skmsg.h>
 #include <linux/list.h>
 #include <linux/jhash.h>
+#include <linux/sock_diag.h>
 
 struct bpf_stab {
 	struct bpf_map map;
@@ -31,7 +32,8 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-EPERM);
 	if (attr->max_entries == 0 ||
 	    attr->key_size    != 4 ||
-	    attr->value_size  != 4 ||
+	    (attr->value_size != sizeof(u32) &&
+	     attr->value_size != sizeof(u64)) ||
 	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
@@ -298,6 +300,23 @@ static void *sock_map_lookup(struct bpf_map *map, void *key)
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
+{
+	struct sock *sk;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (map->value_size != sizeof(u64))
+		return ERR_PTR(-ENOSPC);
+
+	sk = __sock_map_lookup_elem(map, *(u32 *)key);
+	if (!sk)
+		return ERR_PTR(-ENOENT);
+
+	sock_gen_cookie(sk);
+	return &sk->sk_cookie;
+}
+
 static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 			     struct sock **psk)
 {
@@ -448,12 +467,19 @@ static bool sock_map_redirect_okay(const struct sock *sk)
 static int sock_map_update_elem(struct bpf_map *map, void *key,
 				void *value, u64 flags)
 {
-	u32 ufd = *(u32 *)value;
 	u32 idx = *(u32 *)key;
 	struct socket *sock;
 	struct sock *sk;
+	u64 ufd;
 	int ret;
 
+	if (map->value_size == sizeof(u64))
+		ufd = *(u64 *)value;
+	else
+		ufd = *(u32 *)value;
+	if (ufd > S32_MAX)
+		return -EINVAL;
+
 	sock = sockfd_lookup(ufd, &ret);
 	if (!sock)
 		return ret;
@@ -562,6 +588,7 @@ const struct bpf_map_ops sock_map_ops = {
 	.map_alloc		= sock_map_alloc,
 	.map_free		= sock_map_free,
 	.map_get_next_key	= sock_map_get_next_key,
+	.map_lookup_elem_sys_only = sock_map_lookup_sys,
 	.map_update_elem	= sock_map_update_elem,
 	.map_delete_elem	= sock_map_delete_elem,
 	.map_lookup_elem	= sock_map_lookup,
-- 
2.24.1


* [PATCH bpf-next v2 08/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (6 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-10 10:50 ` [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Don't require the kernel code, like BPF helpers, that needs access to
SOCKMAP map contents to live in the sock_map module. Expose the SOCKMAP
lookup operation to all kernel-land.

Lookup from BPF context is not whitelisted yet, while syscalls have a
dedicated lookup handler.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3731191a7d1e..a046c86a8362 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -297,7 +297,7 @@ static struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
 
 static void *sock_map_lookup(struct bpf_map *map, void *key)
 {
-	return ERR_PTR(-EOPNOTSUPP);
+	return __sock_map_lookup_elem(map, *(u32 *)key);
 }
 
 static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
@@ -961,6 +961,11 @@ static void sock_hash_free(struct bpf_map *map)
 	kfree(htab);
 }
 
+static void *sock_hash_lookup(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static void sock_hash_release_progs(struct bpf_map *map)
 {
 	psock_progs_drop(&container_of(map, struct bpf_htab, map)->progs);
@@ -1040,7 +1045,7 @@ const struct bpf_map_ops sock_hash_ops = {
 	.map_get_next_key	= sock_hash_get_next_key,
 	.map_update_elem	= sock_hash_update_elem,
 	.map_delete_elem	= sock_hash_delete_elem,
-	.map_lookup_elem	= sock_map_lookup,
+	.map_lookup_elem	= sock_hash_lookup,
 	.map_release_uref	= sock_hash_release_progs,
 	.map_check_btf		= map_check_no_btf,
 };
-- 
2.24.1


* [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (7 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 08/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-12  1:00   ` John Fastabend
                     ` (2 more replies)
  2020-01-10 10:50 ` [PATCH bpf-next v2 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
                   ` (3 subsequent siblings)
  12 siblings, 3 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

SOCKMAP now supports storing references to listening sockets. Nothing keeps
us from using it as an array of sockets to select from in SK_REUSEPORT
programs.

Whitelist the map type with the BPF helper for selecting a socket.

The restriction that the socket has to be a member of a reuseport group
still applies. A socket from a SOCKMAP that does not have sk_reuseport_cb
set is not a valid target and we signal it with -EINVAL.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
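
For illustration only, not part of the patch: a small model of how the
error code depends on the map type when the selected socket has no
sk_reuseport_cb. The model_* names are hypothetical, and the checks that
the real helper performs after this point are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two map types the helper accepts. */
enum model_map_type { MODEL_REUSEPORT_SOCKARRAY, MODEL_SOCKMAP };

/* Hypothetical stand-in for the looked-up socket. */
struct model_sock {
	bool has_reuseport_cb;	/* models sk->sk_reuseport_cb != NULL */
};

static int model_select_errno(enum model_map_type type,
			      const struct model_sock *sk)
{
	assert(type == MODEL_REUSEPORT_SOCKARRAY || type == MODEL_SOCKMAP);

	if (!sk)
		return -ENOENT;	/* nothing stored under this key */

	if (!sk->has_reuseport_cb) {
		/* A sockarray only ever holds reuseport sockets, so the
		 * socket must have been unhashed in the meantime. Other
		 * maps give no such guarantee.
		 */
		return type == MODEL_REUSEPORT_SOCKARRAY ? -ENOENT : -EINVAL;
	}

	return 0;	/* later checks in the real helper are not modelled */
}
```

The distinction lets a BPF program tell "the socket went away" apart from
"this socket was never a valid reuseport target".
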
 kernel/bpf/verifier.c |  6 ++++--
 net/core/filter.c     | 15 ++++++++++-----
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f5af759a8a5f..0ee5f1594b5c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3697,7 +3697,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (func_id != BPF_FUNC_sk_redirect_map &&
 		    func_id != BPF_FUNC_sock_map_update &&
 		    func_id != BPF_FUNC_map_delete_elem &&
-		    func_id != BPF_FUNC_msg_redirect_map)
+		    func_id != BPF_FUNC_msg_redirect_map &&
+		    func_id != BPF_FUNC_sk_select_reuseport)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_SOCKHASH:
@@ -3778,7 +3779,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 			goto error;
 		break;
 	case BPF_FUNC_sk_select_reuseport:
-		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
+		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
+		    map->map_type != BPF_MAP_TYPE_SOCKMAP)
 			goto error;
 		break;
 	case BPF_FUNC_map_peek_elem:
diff --git a/net/core/filter.c b/net/core/filter.c
index a702761ef369..c79c62a54167 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8677,6 +8677,7 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
 	   struct bpf_map *, map, void *, key, u32, flags)
 {
+	bool is_sockarray = map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
 	struct sock_reuseport *reuse;
 	struct sock *selected_sk;
 
@@ -8685,12 +8686,16 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
 		return -ENOENT;
 
 	reuse = rcu_dereference(selected_sk->sk_reuseport_cb);
-	if (!reuse)
-		/* selected_sk is unhashed (e.g. by close()) after the
-		 * above map_lookup_elem().  Treat selected_sk has already
-		 * been removed from the map.
+	if (!reuse) {
+		/* reuseport_array has only sk with non NULL sk_reuseport_cb.
+		 * The only (!reuse) case here is - the sk has already been
+		 * unhashed (e.g. by close()), so treat it as -ENOENT.
+		 *
+		 * Other maps (e.g. sock_map) do not provide this guarantee and
+		 * the sk may never be in the reuseport group to begin with.
 		 */
-		return -ENOENT;
+		return is_sockarray ? -ENOENT : -EINVAL;
+	}
 
 	if (unlikely(reuse->reuseport_id != reuse_kern->reuseport_id)) {
 		struct sock *sk;
-- 
2.24.1


* [PATCH bpf-next v2 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (8 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-12  1:01   ` John Fastabend
  2020-01-10 10:50 ` [PATCH bpf-next v2 11/11] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
is not hard-coded in the test setup routine.

This, together with careful state cleaning after the tests, lets us run
the test cases once with REUSEPORT_SOCKARRAY and once with SOCKMAP (TCP
only), to have test coverage for the latter as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Now a test run looks like so:

bash-5.0# ./test_progs -t reuseport
#39/1 reuseport_sockarray IPv4/TCP LOOPBACK test_err_inner_map:OK
#39/2 reuseport_sockarray IPv4/TCP LOOPBACK test_err_skb_data:OK
#39/3 reuseport_sockarray IPv4/TCP LOOPBACK test_err_sk_select_port:OK
#39/4 reuseport_sockarray IPv4/TCP LOOPBACK test_pass:OK
#39/5 reuseport_sockarray IPv4/TCP LOOPBACK test_syncookie:OK
#39/6 reuseport_sockarray IPv4/TCP LOOPBACK test_pass_on_err:OK
#39/7 reuseport_sockarray IPv4/TCP LOOPBACK test_detach_bpf:OK
#39/8 reuseport_sockarray IPv4/TCP INANY test_err_inner_map:OK
#39/9 reuseport_sockarray IPv4/TCP INANY test_err_skb_data:OK
#39/10 reuseport_sockarray IPv4/TCP INANY test_err_sk_select_port:OK
#39/11 reuseport_sockarray IPv4/TCP INANY test_pass:OK
#39/12 reuseport_sockarray IPv4/TCP INANY test_syncookie:OK
#39/13 reuseport_sockarray IPv4/TCP INANY test_pass_on_err:OK
#39/14 reuseport_sockarray IPv4/TCP INANY test_detach_bpf:OK
#39/15 reuseport_sockarray IPv6/TCP LOOPBACK test_err_inner_map:OK
#39/16 reuseport_sockarray IPv6/TCP LOOPBACK test_err_skb_data:OK
#39/17 reuseport_sockarray IPv6/TCP LOOPBACK test_err_sk_select_port:OK
#39/18 reuseport_sockarray IPv6/TCP LOOPBACK test_pass:OK
#39/19 reuseport_sockarray IPv6/TCP LOOPBACK test_syncookie:OK
#39/20 reuseport_sockarray IPv6/TCP LOOPBACK test_pass_on_err:OK
#39/21 reuseport_sockarray IPv6/TCP LOOPBACK test_detach_bpf:OK
#39/22 reuseport_sockarray IPv6/TCP INANY test_err_inner_map:OK
#39/23 reuseport_sockarray IPv6/TCP INANY test_err_skb_data:OK
#39/24 reuseport_sockarray IPv6/TCP INANY test_err_sk_select_port:OK
#39/25 reuseport_sockarray IPv6/TCP INANY test_pass:OK
#39/26 reuseport_sockarray IPv6/TCP INANY test_syncookie:OK
#39/27 reuseport_sockarray IPv6/TCP INANY test_pass_on_err:OK
#39/28 reuseport_sockarray IPv6/TCP INANY test_detach_bpf:OK
#39/29 reuseport_sockarray IPv4/UDP LOOPBACK test_err_inner_map:OK
#39/30 reuseport_sockarray IPv4/UDP LOOPBACK test_err_skb_data:OK
#39/31 reuseport_sockarray IPv4/UDP LOOPBACK test_err_sk_select_port:OK
#39/32 reuseport_sockarray IPv4/UDP LOOPBACK test_pass:OK
#39/33 reuseport_sockarray IPv4/UDP LOOPBACK test_syncookie:OK
#39/34 reuseport_sockarray IPv4/UDP LOOPBACK test_pass_on_err:OK
#39/35 reuseport_sockarray IPv4/UDP LOOPBACK test_detach_bpf:OK
#39/36 reuseport_sockarray IPv6/UDP LOOPBACK test_err_inner_map:OK
#39/37 reuseport_sockarray IPv6/UDP LOOPBACK test_err_skb_data:OK
#39/38 reuseport_sockarray IPv6/UDP LOOPBACK test_err_sk_select_port:OK
#39/39 reuseport_sockarray IPv6/UDP LOOPBACK test_pass:OK
#39/40 reuseport_sockarray IPv6/UDP LOOPBACK test_syncookie:OK
#39/41 reuseport_sockarray IPv6/UDP LOOPBACK test_pass_on_err:OK
#39/42 reuseport_sockarray IPv6/UDP LOOPBACK test_detach_bpf:OK
#39/43 sockmap IPv4/TCP LOOPBACK test_err_inner_map:OK
#39/44 sockmap IPv4/TCP LOOPBACK test_err_skb_data:OK
#39/45 sockmap IPv4/TCP LOOPBACK test_err_sk_select_port:OK
#39/46 sockmap IPv4/TCP LOOPBACK test_pass:OK
#39/47 sockmap IPv4/TCP LOOPBACK test_syncookie:OK
#39/48 sockmap IPv4/TCP LOOPBACK test_pass_on_err:OK
#39/49 sockmap IPv4/TCP LOOPBACK test_detach_bpf:OK
#39/50 sockmap IPv4/TCP INANY test_err_inner_map:OK
#39/51 sockmap IPv4/TCP INANY test_err_skb_data:OK
#39/52 sockmap IPv4/TCP INANY test_err_sk_select_port:OK
#39/53 sockmap IPv4/TCP INANY test_pass:OK
#39/54 sockmap IPv4/TCP INANY test_syncookie:OK
#39/55 sockmap IPv4/TCP INANY test_pass_on_err:OK
#39/56 sockmap IPv4/TCP INANY test_detach_bpf:OK
#39/57 sockmap IPv6/TCP LOOPBACK test_err_inner_map:OK
#39/58 sockmap IPv6/TCP LOOPBACK test_err_skb_data:OK
#39/59 sockmap IPv6/TCP LOOPBACK test_err_sk_select_port:OK
#39/60 sockmap IPv6/TCP LOOPBACK test_pass:OK
#39/61 sockmap IPv6/TCP LOOPBACK test_syncookie:OK
#39/62 sockmap IPv6/TCP LOOPBACK test_pass_on_err:OK
#39/63 sockmap IPv6/TCP LOOPBACK test_detach_bpf:OK
#39/64 sockmap IPv6/TCP INANY test_err_inner_map:OK
#39/65 sockmap IPv6/TCP INANY test_err_skb_data:OK
#39/66 sockmap IPv6/TCP INANY test_err_sk_select_port:OK
#39/67 sockmap IPv6/TCP INANY test_pass:OK
#39/68 sockmap IPv6/TCP INANY test_syncookie:OK
#39/69 sockmap IPv6/TCP INANY test_pass_on_err:OK
#39/70 sockmap IPv6/TCP INANY test_detach_bpf:OK
#39/71 sockmap IPv4/UDP LOOPBACK test_err_inner_map:OK
#39/72 sockmap IPv4/UDP LOOPBACK test_err_skb_data:OK
#39/73 sockmap IPv4/UDP LOOPBACK test_err_sk_select_port:OK
#39/74 sockmap IPv4/UDP LOOPBACK test_pass:OK
#39/75 sockmap IPv4/UDP LOOPBACK test_syncookie:OK
#39/76 sockmap IPv4/UDP LOOPBACK test_pass_on_err:OK
#39/77 sockmap IPv4/UDP LOOPBACK test_detach_bpf:OK
#39/78 sockmap IPv6/UDP LOOPBACK test_err_inner_map:OK
#39/79 sockmap IPv6/UDP LOOPBACK test_err_skb_data:OK
#39/80 sockmap IPv6/UDP LOOPBACK test_err_sk_select_port:OK
#39/81 sockmap IPv6/UDP LOOPBACK test_pass:OK
#39/82 sockmap IPv6/UDP LOOPBACK test_syncookie:OK
#39/83 sockmap IPv6/UDP LOOPBACK test_pass_on_err:OK
#39/84 sockmap IPv6/UDP LOOPBACK test_detach_bpf:OK
#39 select_reuseport:OK
Summary: 1/84 PASSED, 14 SKIPPED, 0 FAILED
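
For illustration only, not part of the patch: a sketch of how the subtest
names and the 14 skips in the run above come about. The model_* names are
hypothetical; the format string and the skip condition follow the diff
below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical stand-ins for the two map types under test. */
enum model_map_type { MODEL_REUSEPORT_SOCKARRAY, MODEL_SOCKMAP };

/* SOCKMAP doesn't support UDP yet, so those combinations are skipped. */
static bool model_should_skip(enum model_map_type type, int sotype)
{
	return type == MODEL_SOCKMAP && sotype == SOCK_DGRAM;
}

/* Build "<maptype> <family>/<sotype> <INANY|LOOPBACK> <test>". */
static int model_subtest_name(char *buf, size_t len, const char *maptype,
			      const char *family, const char *sotype,
			      bool inany, const char *test)
{
	assert(buf && len > 0);

	return snprintf(buf, len, "%s %s/%s %s %s", maptype, family, sotype,
			inany ? "INANY" : "LOOPBACK", test);
}
```

Seven test cases times two UDP configurations (IPv4 and IPv6 loopback)
account for the 14 skipped subtests in the summary line.
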

 .../bpf/prog_tests/select_reuseport.c         | 60 +++++++++++++++----
 1 file changed, 50 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/select_reuseport.c b/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
index 2c37ae7dc214..e7b4abfca2ab 100644
--- a/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
+++ b/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
@@ -36,6 +36,7 @@ static int result_map, tmp_index_ovr_map, linum_map, data_check_map;
 static enum result expected_results[NR_RESULTS];
 static int sk_fds[REUSEPORT_ARRAY_SIZE];
 static int reuseport_array = -1, outer_map = -1;
+static enum bpf_map_type inner_map_type;
 static int select_by_skb_data_prog;
 static int saved_tcp_syncookie = -1;
 static struct bpf_object *obj;
@@ -63,13 +64,15 @@ static union sa46 {
 	}								\
 })
 
-static int create_maps(void)
+static int create_maps(enum bpf_map_type inner_type)
 {
 	struct bpf_create_map_attr attr = {};
 
+	inner_map_type = inner_type;
+
 	/* Creating reuseport_array */
 	attr.name = "reuseport_array";
-	attr.map_type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
+	attr.map_type = inner_type;
 	attr.key_size = sizeof(__u32);
 	attr.value_size = sizeof(__u32);
 	attr.max_entries = REUSEPORT_ARRAY_SIZE;
@@ -694,12 +697,34 @@ static void cleanup_per_test(bool no_inner_map)
 
 static void cleanup(void)
 {
-	if (outer_map != -1)
+	if (outer_map != -1) {
 		close(outer_map);
-	if (reuseport_array != -1)
+		outer_map = -1;
+	}
+
+	if (reuseport_array != -1) {
 		close(reuseport_array);
-	if (obj)
+		reuseport_array = -1;
+	}
+
+	if (obj) {
 		bpf_object__close(obj);
+		obj = NULL;
+	}
+
+	memset(expected_results, 0, sizeof(expected_results));
+}
+
+static const char *maptype_str(enum bpf_map_type type)
+{
+	switch (type) {
+	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
+		return "reuseport_sockarray";
+	case BPF_MAP_TYPE_SOCKMAP:
+		return "sockmap";
+	default:
+		return "unknown";
+	}
 }
 
 static const char *family_str(sa_family_t family)
@@ -747,13 +772,21 @@ static void test_config(int sotype, sa_family_t family, bool inany)
 	const struct test *t;
 
 	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
-		snprintf(s, sizeof(s), "%s/%s %s %s",
+		snprintf(s, sizeof(s), "%s %s/%s %s %s",
+			 maptype_str(inner_map_type),
 			 family_str(family), sotype_str(sotype),
 			 inany ? "INANY" : "LOOPBACK", t->name);
 
 		if (!test__start_subtest(s))
 			continue;
 
+		if (sotype == SOCK_DGRAM &&
+		    inner_map_type == BPF_MAP_TYPE_SOCKMAP) {
+			/* SOCKMAP doesn't support UDP yet */
+			test__skip();
+			continue;
+		}
+
 		setup_per_test(sotype, family, inany, t->no_inner_map);
 		t->fn(sotype, family);
 		cleanup_per_test(t->no_inner_map);
@@ -782,13 +815,20 @@ static void test_all(void)
 		test_config(c->sotype, c->family, c->inany);
 }
 
-void test_select_reuseport(void)
+void test_map_type(enum bpf_map_type mt)
 {
-	if (create_maps())
+	if (create_maps(mt))
 		goto out;
 	if (prepare_bpf_obj())
 		goto out;
 
+	test_all();
+out:
+	cleanup();
+}
+
+void test_select_reuseport(void)
+{
 	saved_tcp_fo = read_int_sysctl(TCP_FO_SYSCTL);
 	saved_tcp_syncookie = read_int_sysctl(TCP_SYNCOOKIE_SYSCTL);
 	if (saved_tcp_syncookie < 0 || saved_tcp_syncookie < 0)
@@ -799,8 +839,8 @@ void test_select_reuseport(void)
 	if (disable_syncookie())
 		goto out;
 
-	test_all();
+	test_map_type(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
+	test_map_type(BPF_MAP_TYPE_SOCKMAP);
 out:
-	cleanup();
 	restore_sysctls();
 }
-- 
2.24.1


* [PATCH bpf-next v2 11/11] selftests/bpf: Tests for SOCKMAP holding listening sockets
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (9 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
@ 2020-01-10 10:50 ` Jakub Sitnicki
  2020-01-12  1:06   ` John Fastabend
  2020-01-11  0:18 ` [PATCH bpf-next v2 00/11] Extend SOCKMAP to store " Alexei Starovoitov
  2020-01-11 22:47 ` John Fastabend
  12 siblings, 1 reply; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 10:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Now that SOCKMAP can store listening sockets, the user-space and BPF APIs
are open to a new set of potential pitfalls. Exercise the map operations
(with extra attention to code paths susceptible to races between map ops
and socket cloning), and BPF helpers that work with SOCKMAP, to gain
confidence that everything works as expected.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

To run the newly added tests:

bash-5.0# ./test_progs -t sockmap
#44/1 IPv4 test_sockmap_insert_invalid:OK
#44/2 IPv4 test_sockmap_insert_opened:OK
#44/3 IPv4 test_sockmap_insert_bound:OK
#44/4 IPv4 test_sockmap_insert_listening:OK
#44/5 IPv4 test_sockmap_delete_after_insert:OK
#44/6 IPv4 test_sockmap_delete_after_close:OK
#44/7 IPv4 test_sockmap_lookup_after_insert:OK
#44/8 IPv4 test_sockmap_lookup_after_delete:OK
#44/9 IPv4 test_sockmap_lookup_32_bit_value:OK
#44/10 IPv4 test_sockmap_update_listening:OK
#44/11 IPv4 test_sockmap_destroy_orphan_child:OK
#44/12 IPv4 test_sockmap_syn_recv_insert_delete:OK
#44/13 IPv4 test_sockmap_race_insert_listen:OK
#44/14 IPv4 test_sockmap_clone_after_delete:OK
#44/15 IPv4 test_sockmap_accept_after_delete:OK
#44/16 IPv4 test_sockmap_accept_before_delete:OK
#44/17 IPv4 test_sockmap_skb_redir_to_connected:OK
#44/18 IPv4 test_sockmap_skb_redir_to_listening:OK
#44/19 IPv4 test_sockmap_msg_redir_to_connected:OK
#44/20 IPv4 test_sockmap_msg_redir_to_listening:OK
#44/21 IPv4 test_sockmap_reuseport_select_listening:OK
#44/22 IPv4 test_sockmap_reuseport_select_connected:OK
#44/23 IPv6 test_sockmap_insert_invalid:OK
#44/24 IPv6 test_sockmap_insert_opened:OK
#44/25 IPv6 test_sockmap_insert_bound:OK
#44/26 IPv6 test_sockmap_insert_listening:OK
#44/27 IPv6 test_sockmap_delete_after_insert:OK
#44/28 IPv6 test_sockmap_delete_after_close:OK
#44/29 IPv6 test_sockmap_lookup_after_insert:OK
#44/30 IPv6 test_sockmap_lookup_after_delete:OK
#44/31 IPv6 test_sockmap_lookup_32_bit_value:OK
#44/32 IPv6 test_sockmap_update_listening:OK
#44/33 IPv6 test_sockmap_destroy_orphan_child:OK
#44/34 IPv6 test_sockmap_syn_recv_insert_delete:OK
#44/35 IPv6 test_sockmap_race_insert_listen:OK
#44/36 IPv6 test_sockmap_clone_after_delete:OK
#44/37 IPv6 test_sockmap_accept_after_delete:OK
#44/38 IPv6 test_sockmap_accept_before_delete:OK
#44/39 IPv6 test_sockmap_skb_redir_to_connected:OK
#44/40 IPv6 test_sockmap_skb_redir_to_listening:OK
#44/41 IPv6 test_sockmap_msg_redir_to_connected:OK
#44/42 IPv6 test_sockmap_msg_redir_to_listening:OK
#44/43 IPv6 test_sockmap_reuseport_select_listening:OK
#44/44 IPv6 test_sockmap_reuseport_select_connected:OK
#44 sockmap_listen:OK
Summary: 1/44 PASSED, 0 SKIPPED, 0 FAILED

 .../selftests/bpf/prog_tests/sockmap_listen.c | 1378 +++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |   76 +
 2 files changed, 1454 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
new file mode 100644
index 000000000000..1e77d1854713
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -0,0 +1,1378 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Cloudflare
+/*
+ * Test suite for SOCKMAP holding listening sockets. Covers:
+ *  1. BPF map operations - bpf_map_{update,lookup,delete}_elem
+ *  2. BPF redirect helpers - bpf_{sk,msg}_redirect_map
+ *  3. BPF reuseport helper - bpf_sk_select_reuseport
+ */
+
+#include <linux/compiler.h>
+#include <errno.h>
+#include <error.h>
+#include <limits.h>
+#include <netinet/in.h>
+#include <pthread.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "bpf_util.h"
+#include "test_progs.h"
+#include "test_sockmap_listen.skel.h"
+
+#define MAX_STRERR_LEN 256
+#define MAX_TEST_NAME 80
+
+#define _FAIL(errnum, fmt...)                                                  \
+	({                                                                     \
+		error_at_line(0, (errnum), __func__, __LINE__, fmt);           \
+		CHECK_FAIL(true);                                              \
+	})
+#define FAIL(fmt...) _FAIL(0, fmt)
+#define FAIL_ERRNO(fmt...) _FAIL(errno, fmt)
+#define FAIL_LIBBPF(err, msg)                                                  \
+	({                                                                     \
+		char __buf[MAX_STRERR_LEN];                                    \
+		libbpf_strerror((err), __buf, sizeof(__buf));                  \
+		FAIL("%s: %s", (msg), __buf);                                  \
+	})
+
+/* Wrappers that fail the test on error and report it. */
+
+#define xaccept(fd, addr, len)                                                 \
+	({                                                                     \
+		int __ret = accept((fd), (addr), (len));                       \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("accept");                                  \
+		__ret;                                                         \
+	})
+
+#define xbind(fd, addr, len)                                                   \
+	({                                                                     \
+		int __ret = bind((fd), (addr), (len));                         \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("bind");                                    \
+		__ret;                                                         \
+	})
+
+#define xclose(fd)                                                             \
+	({                                                                     \
+		int __ret = close((fd));                                       \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("close");                                   \
+		__ret;                                                         \
+	})
+
+#define xconnect(fd, addr, len)                                                \
+	({                                                                     \
+		int __ret = connect((fd), (addr), (len));                      \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("connect");                                 \
+		__ret;                                                         \
+	})
+
+#define xgetsockname(fd, addr, len)                                            \
+	({                                                                     \
+		int __ret = getsockname((fd), (addr), (len));                  \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("getsockname");                             \
+		__ret;                                                         \
+	})
+
+#define xgetsockopt(fd, level, name, val, len)                                 \
+	({                                                                     \
+		int __ret = getsockopt((fd), (level), (name), (val), (len));   \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("getsockopt(" #name ")");                   \
+		__ret;                                                         \
+	})
+
+#define xlisten(fd, backlog)                                                   \
+	({                                                                     \
+		int __ret = listen((fd), (backlog));                           \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("listen");                                  \
+		__ret;                                                         \
+	})
+
+#define xsetsockopt(fd, level, name, val, len)                                 \
+	({                                                                     \
+		int __ret = setsockopt((fd), (level), (name), (val), (len));   \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("setsockopt(" #name ")");                   \
+		__ret;                                                         \
+	})
+
+#define xsocket(family, sotype, flags)                                         \
+	({                                                                     \
+		int __ret = socket(family, sotype, flags);                     \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("socket");                                  \
+		__ret;                                                         \
+	})
+
+#define xbpf_map_delete_elem(fd, key)                                          \
+	({                                                                     \
+		int __ret = bpf_map_delete_elem((fd), (key));                  \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("map_delete");                              \
+		__ret;                                                         \
+	})
+
+#define xbpf_map_lookup_elem(fd, key, val)                                     \
+	({                                                                     \
+		int __ret = bpf_map_lookup_elem((fd), (key), (val));           \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("map_lookup");                              \
+		__ret;                                                         \
+	})
+
+#define xbpf_map_update_elem(fd, key, val, flags)                              \
+	({                                                                     \
+		int __ret = bpf_map_update_elem((fd), (key), (val), (flags));  \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("map_update");                              \
+		__ret;                                                         \
+	})
+
+#define xbpf_prog_attach(prog, target, type, flags)                            \
+	({                                                                     \
+		int __ret =                                                    \
+			bpf_prog_attach((prog), (target), (type), (flags));    \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("prog_attach(" #type ")");                  \
+		__ret;                                                         \
+	})
+
+#define xbpf_prog_detach2(prog, target, type)                                  \
+	({                                                                     \
+		int __ret = bpf_prog_detach2((prog), (target), (type));        \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("prog_detach2(" #type ")");                 \
+		__ret;                                                         \
+	})
+
+#define xpthread_create(thread, attr, func, arg)                               \
+	({                                                                     \
+		int __ret = pthread_create((thread), (attr), (func), (arg));   \
+		errno = __ret;                                                 \
+		if (__ret)                                                     \
+			FAIL_ERRNO("pthread_create");                          \
+		__ret;                                                         \
+	})
+
+#define xpthread_join(thread, retval)                                          \
+	({                                                                     \
+		int __ret = pthread_join((thread), (retval));                  \
+		errno = __ret;                                                 \
+		if (__ret)                                                     \
+			FAIL_ERRNO("pthread_join");                            \
+		__ret;                                                         \
+	})
+
+static void init_addr_loopback4(struct sockaddr_storage *ss, socklen_t *len)
+{
+	struct sockaddr_in *addr4 = memset(ss, 0, sizeof(*ss));
+
+	addr4->sin_family = AF_INET;
+	addr4->sin_port = 0;
+	addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+	*len = sizeof(*addr4);
+}
+
+static void init_addr_loopback6(struct sockaddr_storage *ss, socklen_t *len)
+{
+	struct sockaddr_in6 *addr6 = memset(ss, 0, sizeof(*ss));
+
+	addr6->sin6_family = AF_INET6;
+	addr6->sin6_port = 0;
+	addr6->sin6_addr = in6addr_loopback;
+	*len = sizeof(*addr6);
+}
+
+static void init_addr_loopback(int family, struct sockaddr_storage *ss,
+			       socklen_t *len)
+{
+	switch (family) {
+	case AF_INET:
+		init_addr_loopback4(ss, len);
+		return;
+	case AF_INET6:
+		init_addr_loopback6(ss, len);
+		return;
+	default:
+		FAIL("unsupported address family %d", family);
+	}
+}
+
+static inline struct sockaddr *sockaddr(struct sockaddr_storage *ss)
+{
+	return (struct sockaddr *)ss;
+}
+
+static int enable_reuseport(int s, int progfd)
+{
+	int err, one = 1;
+
+	err = xsetsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
+	if (err)
+		return -1;
+	err = xsetsockopt(s, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF, &progfd,
+			  sizeof(progfd));
+	if (err)
+		return -1;
+
+	return 0;
+}
+
+static int listen_loopback_reuseport(int family, int sotype, int progfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len;
+	int err, s;
+
+	init_addr_loopback(family, &addr, &len);
+
+	s = xsocket(family, sotype, 0);
+	if (s == -1)
+		return -1;
+
+	if (progfd >= 0)
+		enable_reuseport(s, progfd);
+
+	err = xbind(s, sockaddr(&addr), len);
+	if (err)
+		goto close;
+
+	err = xlisten(s, SOMAXCONN);
+	if (err)
+		goto close;
+
+	return s;
+close:
+	xclose(s);
+	return -1;
+}
+
+static int listen_loopback(int family, int sotype)
+{
+	return listen_loopback_reuseport(family, sotype, -1);
+}
+
+static void test_sockmap_insert_invalid(int family, int sotype, int mapfd)
+{
+	u32 key = 0;
+	u64 value;
+	int err;
+
+	value = -1;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EINVAL)
+		FAIL_ERRNO("map_update: expected EINVAL");
+
+	value = INT_MAX;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EBADF)
+		FAIL_ERRNO("map_update: expected EBADF");
+}
+
+static void test_sockmap_insert_opened(int family, int sotype, int mapfd)
+{
+	u32 key = 0;
+	u64 value;
+	int err, s;
+
+	s = xsocket(family, sotype, 0);
+	if (s == -1)
+		return;
+
+	errno = 0;
+	value = s;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EINVAL)
+		FAIL_ERRNO("map_update: expected EINVAL");
+
+	xclose(s);
+}
+
+static void test_sockmap_insert_bound(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len;
+	u32 key = 0;
+	u64 value;
+	int err, s;
+
+	init_addr_loopback(family, &addr, &len);
+
+	s = xsocket(family, sotype, 0);
+	if (s == -1)
+		return;
+
+	err = xbind(s, sockaddr(&addr), len);
+	if (err)
+		goto close;
+
+	errno = 0;
+	value = s;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EINVAL)
+		FAIL_ERRNO("map_update: expected EINVAL");
+close:
+	xclose(s);
+}
+
+static void test_sockmap_insert_listening(int family, int sotype, int mapfd)
+{
+	u64 value;
+	u32 key;
+	int s;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xclose(s);
+}
+
+static void test_sockmap_delete_after_insert(int family, int sotype, int mapfd)
+{
+	u64 value;
+	u32 key;
+	int s;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+	xclose(s);
+}
+
+static void test_sockmap_delete_after_close(int family, int sotype, int mapfd)
+{
+	int err, s;
+	u64 value;
+	u32 key;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	xclose(s);
+
+	errno = 0;
+	err = bpf_map_delete_elem(mapfd, &key);
+	if (!err || errno != EINVAL)
+		FAIL_ERRNO("map_delete: expected EINVAL");
+}
+
+static void test_sockmap_lookup_after_insert(int family, int sotype, int mapfd)
+{
+	u64 cookie, value;
+	socklen_t len;
+	u32 key;
+	int s;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	len = sizeof(cookie);
+	xgetsockopt(s, SOL_SOCKET, SO_COOKIE, &cookie, &len);
+
+	xbpf_map_lookup_elem(mapfd, &key, &value);
+
+	if (value != cookie) {
+		FAIL("map_lookup: have %#llx, want %#llx",
+		     (unsigned long long)value, (unsigned long long)cookie);
+	}
+
+	xclose(s);
+}
+
+static void test_sockmap_lookup_after_delete(int family, int sotype, int mapfd)
+{
+	int err, s;
+	u64 value;
+	u32 key;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+
+	errno = 0;
+	err = bpf_map_lookup_elem(mapfd, &key, &value);
+	if (!err || errno != ENOENT)
+		FAIL_ERRNO("map_lookup: expected ENOENT");
+
+	xclose(s);
+}
+
+static void test_sockmap_lookup_32_bit_value(int family, int sotype, int mapfd)
+{
+	u32 key, value32;
+	int err, s;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	mapfd = bpf_create_map(BPF_MAP_TYPE_SOCKMAP, sizeof(key),
+			       sizeof(value32), 1, 0);
+	if (mapfd < 0) {
+		FAIL_ERRNO("map_create");
+		goto close;
+	}
+
+	key = 0;
+	value32 = s;
+	xbpf_map_update_elem(mapfd, &key, &value32, BPF_NOEXIST);
+
+	errno = 0;
+	err = bpf_map_lookup_elem(mapfd, &key, &value32);
+	if (!err || errno != ENOSPC)
+		FAIL_ERRNO("map_lookup: expected ENOSPC");
+
+	xclose(mapfd);
+close:
+	xclose(s);
+}
+
+static void test_sockmap_update_listening(int family, int sotype, int mapfd)
+{
+	int s1, s2;
+	u64 value;
+	u32 key;
+
+	s1 = listen_loopback(family, sotype);
+	if (s1 < 0)
+		return;
+
+	s2 = listen_loopback(family, sotype);
+	if (s2 < 0)
+		goto close_s1;
+
+	key = 0;
+	value = s1;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	value = s2;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_EXIST);
+	xclose(s2);
+close_s1:
+	xclose(s1);
+}
+
+/* Exercise the code path where we destroy child sockets that never
+ * got accept()'ed, aka orphans, when parent socket gets closed.
+ */
+static void test_sockmap_destroy_orphan_child(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len;
+	int err, s, c;
+	u64 value;
+	u32 key;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	c = xsocket(family, sotype, 0);
+	if (c == -1)
+		goto close_srv;
+
+	xconnect(c, sockaddr(&addr), len);
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+/* Perform a passive open after removing listening socket from SOCKMAP
+ * to ensure that callbacks get restored properly.
+ */
+static void test_sockmap_clone_after_delete(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len;
+	int err, s, c;
+	u64 value;
+	u32 key;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+
+	c = xsocket(family, sotype, 0);
+	if (c < 0)
+		goto close_srv;
+
+	xconnect(c, sockaddr(&addr), len);
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+/* Check that child socket that got created while parent was in a
+ * SOCKMAP, but got accept()'ed only after the parent has been removed
+ * from SOCKMAP, gets cloned without parent psock state or callbacks.
+ */
+static void test_sockmap_accept_after_delete(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	const u32 zero = 0;
+	int err, s, c, p;
+	socklen_t len;
+	u64 value;
+
+	s = listen_loopback(family, sotype);
+	if (s == -1)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	value = s;
+	err = xbpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv;
+
+	c = xsocket(family, sotype, 0);
+	if (c == -1)
+		goto close_srv;
+
+	/* Create child while parent is in sockmap */
+	err = xconnect(c, sockaddr(&addr), len);
+	if (err)
+		goto close_cli;
+
+	/* Remove parent from sockmap */
+	err = xbpf_map_delete_elem(mapfd, &zero);
+	if (err)
+		goto close_cli;
+
+	p = xaccept(s, NULL, NULL);
+	if (p == -1)
+		goto close_cli;
+
+	/* Check that child sk_user_data is not set */
+	value = p;
+	xbpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+
+	xclose(p);
+close_cli:
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+/* Check that child socket that got created and accepted while parent
+ * was in a SOCKMAP is cloned without parent psock state or callbacks.
+ */
+static void test_sockmap_accept_before_delete(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	const u32 zero = 0, one = 1;
+	int err, s, c, p;
+	socklen_t len;
+	u64 value;
+
+	s = listen_loopback(family, sotype);
+	if (s == -1)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	value = s;
+	err = xbpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv;
+
+	c = xsocket(family, sotype, 0);
+	if (c == -1)
+		goto close_srv;
+
+	/* Create & accept child while parent is in sockmap */
+	err = xconnect(c, sockaddr(&addr), len);
+	if (err)
+		goto close_cli;
+
+	p = xaccept(s, NULL, NULL);
+	if (p == -1)
+		goto close_cli;
+
+	/* Check that child sk_user_data is not set */
+	value = p;
+	xbpf_map_update_elem(mapfd, &one, &value, BPF_NOEXIST);
+
+	xclose(p);
+close_cli:
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+struct connect_accept_ctx {
+	int sockfd;
+	unsigned int done;
+	unsigned int nr_iter;
+};
+
+static bool is_thread_done(struct connect_accept_ctx *ctx)
+{
+	return READ_ONCE(ctx->done);
+}
+
+static void *connect_accept_thread(void *arg)
+{
+	struct connect_accept_ctx *ctx = arg;
+	struct sockaddr_storage addr;
+	int family, socktype;
+	socklen_t len;
+	int err, i, s;
+
+	s = ctx->sockfd;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto done;
+
+	len = sizeof(family);
+	err = xgetsockopt(s, SOL_SOCKET, SO_DOMAIN, &family, &len);
+	if (err)
+		goto done;
+
+	len = sizeof(socktype);
+	err = xgetsockopt(s, SOL_SOCKET, SO_TYPE, &socktype, &len);
+	if (err)
+		goto done;
+
+	for (i = 0; i < ctx->nr_iter; i++) {
+		int c, p;
+
+		c = xsocket(family, socktype, 0);
+		if (c < 0)
+			break;
+
+		err = xconnect(c, sockaddr(&addr), sizeof(addr));
+		if (err) {
+			xclose(c);
+			break;
+		}
+
+		p = xaccept(s, NULL, NULL);
+		if (p < 0) {
+			xclose(c);
+			break;
+		}
+
+		xclose(p);
+		xclose(c);
+	}
+done:
+	WRITE_ONCE(ctx->done, 1);
+	return NULL;
+}
+
+static void test_sockmap_syn_recv_insert_delete(int family, int sotype,
+						int mapfd)
+{
+	struct connect_accept_ctx ctx = { 0 };
+	struct sockaddr_storage addr;
+	socklen_t len;
+	u32 zero = 0;
+	pthread_t t;
+	int err, s;
+	u64 value;
+
+	s = listen_loopback(family, sotype | SOCK_NONBLOCK);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close;
+
+	ctx.sockfd = s;
+	ctx.nr_iter = 1000;
+
+	err = xpthread_create(&t, NULL, connect_accept_thread, &ctx);
+	if (err)
+		goto close;
+
+	value = s;
+	while (!is_thread_done(&ctx)) {
+		err = xbpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+		if (err)
+			break;
+
+		err = xbpf_map_delete_elem(mapfd, &zero);
+		if (err)
+			break;
+	}
+
+	xpthread_join(t, NULL);
+close:
+	xclose(s);
+}
+
+static void *listen_thread(void *arg)
+{
+	struct sockaddr unspec = { AF_UNSPEC };
+	struct connect_accept_ctx *ctx = arg;
+	int err, i, s;
+
+	s = ctx->sockfd;
+
+	for (i = 0; i < ctx->nr_iter; i++) {
+		err = xlisten(s, 1);
+		if (err)
+			break;
+		err = xconnect(s, &unspec, sizeof(unspec));
+		if (err)
+			break;
+	}
+
+	WRITE_ONCE(ctx->done, 1);
+	return NULL;
+}
+
+static void test_sockmap_race_insert_listen(int family, int socktype, int mapfd)
+{
+	struct connect_accept_ctx ctx = { 0 };
+	const u32 zero = 0;
+	const int one = 1;
+	pthread_t t;
+	int err, s;
+	u64 value;
+
+	s = xsocket(family, socktype, 0);
+	if (s < 0)
+		return;
+
+	err = xsetsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
+	if (err)
+		goto close;
+
+	ctx.sockfd = s;
+	ctx.nr_iter = 10000;
+
+	err = xpthread_create(&t, NULL, listen_thread, &ctx);
+	if (err)
+		goto close;
+
+	value = s;
+	while (!is_thread_done(&ctx)) {
+		err = bpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+		if (err && errno != EINVAL) {
+			FAIL_ERRNO("map_update");
+			break;
+		}
+		err = bpf_map_delete_elem(mapfd, &zero);
+		if (err && errno != EINVAL) {
+			FAIL_ERRNO("map_delete");
+			break;
+		}
+	}
+
+	xpthread_join(t, NULL);
+close:
+	xclose(s);
+}
+
+static void zero_verdict_count(int mapfd)
+{
+	unsigned int zero = 0;
+	int key;
+
+	key = SK_DROP;
+	xbpf_map_update_elem(mapfd, &key, &zero, BPF_ANY);
+	key = SK_PASS;
+	xbpf_map_update_elem(mapfd, &key, &zero, BPF_ANY);
+}
+
+enum redir_mode {
+	REDIR_INGRESS,
+	REDIR_EGRESS,
+};
+
+static const char *redir_mode_str(enum redir_mode mode)
+{
+	switch (mode) {
+	case REDIR_INGRESS:
+		return "ingress";
+	case REDIR_EGRESS:
+		return "egress";
+	default:
+		return "unknown";
+	}
+}
+
+static void redir_to_connected(int family, int sotype, int sock_mapfd,
+			       int verd_mapfd, enum redir_mode mode)
+{
+	const char *log_prefix = redir_mode_str(mode);
+	struct sockaddr_storage addr;
+	int s, c0, c1, p0, p1;
+	unsigned int pass;
+	socklen_t len;
+	int err, n;
+	u64 value;
+	u32 key;
+	char b;
+
+	zero_verdict_count(verd_mapfd);
+
+	s = listen_loopback(family, sotype | SOCK_NONBLOCK);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	c0 = xsocket(family, sotype, 0);
+	if (c0 < 0)
+		goto close_srv;
+	err = xconnect(c0, sockaddr(&addr), len);
+	if (err)
+		goto close_cli0;
+
+	p0 = xaccept(s, NULL, NULL);
+	if (p0 < 0)
+		goto close_cli0;
+
+	c1 = xsocket(family, sotype, 0);
+	if (c1 < 0)
+		goto close_peer0;
+	err = xconnect(c1, sockaddr(&addr), len);
+	if (err)
+		goto close_cli1;
+
+	p1 = xaccept(s, NULL, NULL);
+	if (p1 < 0)
+		goto close_cli1;
+
+	key = 0;
+	value = p0;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer1;
+
+	key = 1;
+	value = p1;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer1;
+
+	n = write(mode == REDIR_INGRESS ? c1 : p1, "a", 1);
+	if (n < 0)
+		FAIL_ERRNO("%s: write", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete write", log_prefix);
+	if (n < 1)
+		goto close_peer1;
+
+	key = SK_PASS;
+	err = xbpf_map_lookup_elem(verd_mapfd, &key, &pass);
+	if (err)
+		goto close_peer1;
+	if (pass != 1)
+		FAIL("%s: want pass count 1, have %u", log_prefix, pass);
+
+	n = read(c0, &b, 1);
+	if (n < 0)
+		FAIL_ERRNO("%s: read", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete read", log_prefix);
+
+close_peer1:
+	xclose(p1);
+close_cli1:
+	xclose(c1);
+close_peer0:
+	xclose(p0);
+close_cli0:
+	xclose(c0);
+close_srv:
+	xclose(s);
+}
+
+static void
+test_sockmap_skb_redir_to_connected(struct test_sockmap_listen *skel,
+				    int family, int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+	int parser = bpf_program__fd(skel->progs.prog_skb_parser);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(skel->maps.sock_map);
+	int err;
+
+	err = xbpf_prog_attach(parser, sock_map, BPF_SK_SKB_STREAM_PARSER, 0);
+	if (err)
+		return;
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (err)
+		goto detach;
+
+	redir_to_connected(family, sotype, sock_map, verdict_map,
+			   REDIR_INGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT);
+detach:
+	xbpf_prog_detach2(parser, sock_map, BPF_SK_SKB_STREAM_PARSER);
+}
+
+static void
+test_sockmap_msg_redir_to_connected(struct test_sockmap_listen *skel,
+				    int family, int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_msg_verdict);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(skel->maps.sock_map);
+	int err;
+
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_MSG_VERDICT, 0);
+	if (err)
+		return;
+
+	redir_to_connected(family, sotype, sock_map, verdict_map, REDIR_EGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_MSG_VERDICT);
+}
+
+static void redir_to_listening(int family, int sotype, int sock_mapfd,
+			       int verd_mapfd, enum redir_mode mode)
+{
+	const char *log_prefix = redir_mode_str(mode);
+	struct sockaddr_storage addr;
+	int s, c, p, err, n;
+	unsigned int drop;
+	socklen_t len;
+	u64 value;
+	u32 key;
+
+	zero_verdict_count(verd_mapfd);
+
+	s = listen_loopback(family, sotype | SOCK_NONBLOCK);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	c = xsocket(family, sotype, 0);
+	if (c < 0)
+		goto close_srv;
+	err = xconnect(c, sockaddr(&addr), len);
+	if (err)
+		goto close_cli;
+
+	p = xaccept(s, NULL, NULL);
+	if (p < 0)
+		goto close_cli;
+
+	key = 0;
+	value = s;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer;
+
+	key = 1;
+	value = p;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer;
+
+	n = write(mode == REDIR_INGRESS ? c : p, "a", 1);
+	if (n < 0 && errno != EACCES)
+		FAIL_ERRNO("%s: write", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete write", log_prefix);
+	if (n < 1)
+		goto close_peer;
+
+	key = SK_DROP;
+	err = xbpf_map_lookup_elem(verd_mapfd, &key, &drop);
+	if (err)
+		goto close_peer;
+	if (drop != 1)
+		FAIL("%s: want drop count 1, have %u", log_prefix, drop);
+
+close_peer:
+	xclose(p);
+close_cli:
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+static void
+test_sockmap_skb_redir_to_listening(struct test_sockmap_listen *skel,
+				    int family, int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+	int parser = bpf_program__fd(skel->progs.prog_skb_parser);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(skel->maps.sock_map);
+	int err;
+
+	err = xbpf_prog_attach(parser, sock_map, BPF_SK_SKB_STREAM_PARSER, 0);
+	if (err)
+		return;
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (err)
+		goto detach;
+
+	redir_to_listening(family, sotype, sock_map, verdict_map,
+			   REDIR_INGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT);
+detach:
+	xbpf_prog_detach2(parser, sock_map, BPF_SK_SKB_STREAM_PARSER);
+}
+
+static void
+test_sockmap_msg_redir_to_listening(struct test_sockmap_listen *skel,
+				    int family, int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_msg_verdict);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(skel->maps.sock_map);
+	int err;
+
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_MSG_VERDICT, 0);
+	if (err)
+		return;
+
+	redir_to_listening(family, sotype, sock_map, verdict_map, REDIR_EGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_MSG_VERDICT);
+}
+
+static void test_sockmap_reuseport_select_listening(int family, int sotype,
+						    int sock_map, int verd_map,
+						    int reuseport_prog)
+{
+	struct sockaddr_storage addr;
+	unsigned int pass;
+	int s, c, p, err;
+	socklen_t len;
+	u64 value;
+	u32 key;
+
+	zero_verdict_count(verd_map);
+
+	s = listen_loopback_reuseport(family, sotype, reuseport_prog);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	key = 0;
+	value = s;
+	err = xbpf_map_update_elem(sock_map, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv;
+
+	c = xsocket(family, sotype, 0);
+	if (c < 0)
+		goto close_srv;
+	err = xconnect(c, sockaddr(&addr), len);
+	if (err)
+		goto close_cli;
+
+	p = xaccept(s, NULL, NULL);
+	if (p < 0)
+		goto close_cli;
+
+	key = SK_PASS;
+	err = xbpf_map_lookup_elem(verd_map, &key, &pass);
+	if (err)
+		goto close_peer;
+	if (pass != 1)
+		FAIL("want pass count 1, have %u", pass);
+
+close_peer:
+	xclose(p);
+close_cli:
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+static void test_sockmap_reuseport_select_connected(int family, int sotype,
+						    int sock_map, int verd_map,
+						    int reuseport_prog)
+{
+	struct sockaddr_storage addr;
+	int s, c0, c1, p0, err;
+	unsigned int drop;
+	socklen_t len;
+	u64 value;
+	u32 key;
+
+	zero_verdict_count(verd_map);
+
+	s = listen_loopback_reuseport(family, sotype, reuseport_prog);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	c0 = xsocket(family, sotype, 0);
+	if (c0 < 0)
+		goto close_srv;
+
+	err = xconnect(c0, sockaddr(&addr), len);
+	if (err)
+		goto close_cli0;
+
+	p0 = xaccept(s, NULL, NULL);
+	if (p0 < 0)
+		goto close_cli0;
+
+	key = 0;
+	value = p0;
+	err = xbpf_map_update_elem(sock_map, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer0;
+
+	c1 = xsocket(family, sotype, 0);
+	if (c1 < 0)
+		goto close_peer0;
+
+	errno = 0;
+	err = connect(c1, sockaddr(&addr), len);
+	if (!err || errno != ECONNREFUSED)
+		FAIL_ERRNO("connect: expected ECONNREFUSED");
+
+	key = SK_DROP;
+	err = xbpf_map_lookup_elem(verd_map, &key, &drop);
+	if (err)
+		goto close_cli1;
+	if (drop != 1)
+		FAIL("want drop count 1, have %u", drop);
+
+close_cli1:
+	xclose(c1);
+close_peer0:
+	xclose(p0);
+close_cli0:
+	xclose(c0);
+close_srv:
+	xclose(s);
+}
+
+#define TEST(fn)                                                               \
+	{                                                                      \
+		fn, #fn                                                        \
+	}
+
+static void cleanup_sockmap_ops(int mapfd)
+{
+	int err;
+	u32 key;
+
+	for (key = 0; key < 2; key++) {
+		err = bpf_map_delete_elem(mapfd, &key);
+		if (err && errno != EINVAL)
+			FAIL_ERRNO("map_delete");
+	}
+}
+
+static const char *family_str(sa_family_t family)
+{
+	switch (family) {
+	case AF_INET:
+		return "IPv4";
+	case AF_INET6:
+		return "IPv6";
+	default:
+		return "unknown";
+	}
+}
+
+static void test_sockmap_ops(struct test_sockmap_listen *skel, int family,
+			     int sotype)
+{
+	const struct op_test {
+		void (*fn)(int family, int sotype, int sock_map);
+		const char *name;
+	} tests[] = {
+		/* insert */
+		TEST(test_sockmap_insert_invalid),
+		TEST(test_sockmap_insert_opened),
+		TEST(test_sockmap_insert_bound),
+		TEST(test_sockmap_insert_listening),
+		/* delete */
+		TEST(test_sockmap_delete_after_insert),
+		TEST(test_sockmap_delete_after_close),
+		/* lookup */
+		TEST(test_sockmap_lookup_after_insert),
+		TEST(test_sockmap_lookup_after_delete),
+		TEST(test_sockmap_lookup_32_bit_value),
+		/* update */
+		TEST(test_sockmap_update_listening),
+		/* races with insert/delete */
+		TEST(test_sockmap_destroy_orphan_child),
+		TEST(test_sockmap_syn_recv_insert_delete),
+		TEST(test_sockmap_race_insert_listen),
+		/* child clone */
+		TEST(test_sockmap_clone_after_delete),
+		TEST(test_sockmap_accept_after_delete),
+		TEST(test_sockmap_accept_before_delete),
+	};
+	const struct op_test *t;
+	char s[MAX_TEST_NAME];
+	int sock_map;
+
+	sock_map = bpf_map__fd(skel->maps.sock_map);
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		snprintf(s, sizeof(s), "%s %s", family_str(family), t->name);
+
+		if (!test__start_subtest(s))
+			continue;
+
+		t->fn(family, sotype, sock_map);
+		cleanup_sockmap_ops(sock_map);
+	}
+}
+
+static void test_sockmap_redir(struct test_sockmap_listen *skel, int family,
+			       int sotype)
+{
+	const struct redir_test {
+		void (*fn)(struct test_sockmap_listen *skel, int family,
+			   int sotype);
+		const char *name;
+	} tests[] = {
+		TEST(test_sockmap_skb_redir_to_connected),
+		TEST(test_sockmap_skb_redir_to_listening),
+		TEST(test_sockmap_msg_redir_to_connected),
+		TEST(test_sockmap_msg_redir_to_listening),
+	};
+	const struct redir_test *t;
+	char s[MAX_TEST_NAME];
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		snprintf(s, sizeof(s), "%s %s", family_str(family), t->name);
+
+		if (!test__start_subtest(s))
+			continue;
+
+		t->fn(skel, family, sotype);
+	}
+}
+
+static void test_sockmap_reuseport(struct test_sockmap_listen *skel, int family,
+				   int sotype)
+{
+	const struct reuseport_test {
+		void (*fn)(int family, int sotype, int sock_map,
+			   int verdict_map, int reuseport_prog);
+		const char *name;
+	} tests[] = {
+		TEST(test_sockmap_reuseport_select_listening),
+		TEST(test_sockmap_reuseport_select_connected),
+	};
+	int sock_map, verdict_map, reuseport_prog;
+	const struct reuseport_test *t;
+	char s[MAX_TEST_NAME];
+
+	sock_map = bpf_map__fd(skel->maps.sock_map);
+	verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	reuseport_prog = bpf_program__fd(skel->progs.prog_reuseport);
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		snprintf(s, sizeof(s), "%s %s", family_str(family), t->name);
+
+		if (!test__start_subtest(s))
+			continue;
+
+		t->fn(family, sotype, sock_map, verdict_map, reuseport_prog);
+	}
+}
+
+static void run_tests(struct test_sockmap_listen *skel, int family)
+{
+	test_sockmap_ops(skel, family, SOCK_STREAM);
+	test_sockmap_redir(skel, family, SOCK_STREAM);
+	test_sockmap_reuseport(skel, family, SOCK_STREAM);
+}
+
+void test_sockmap_listen(void)
+{
+	struct test_sockmap_listen *skel;
+
+	skel = test_sockmap_listen__open_and_load();
+	if (!skel) {
+		FAIL("skeleton open/load failed");
+		return;
+	}
+
+	run_tests(skel, AF_INET);
+	run_tests(skel, AF_INET6);
+
+	test_sockmap_listen__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
new file mode 100644
index 000000000000..b02cc3504200
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Cloudflare
+
+#include <errno.h>
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 2);
+	__type(key, __u32);
+	__type(value, __u64);
+} sock_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 2);
+	__type(key, int);
+	__type(value, unsigned int);
+} verdict_map SEC(".maps");
+
+SEC("sk_skb/stream_parser")
+int prog_skb_parser(struct __sk_buff *skb)
+{
+	return skb->len;
+}
+
+SEC("sk_skb/stream_verdict")
+int prog_skb_verdict(struct __sk_buff *skb)
+{
+	unsigned int *count;
+	int verdict;
+
+	verdict = bpf_sk_redirect_map(skb, &sock_map, 0, 0);
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+SEC("sk_msg")
+int prog_msg_verdict(struct sk_msg_md *msg)
+{
+	unsigned int *count;
+	int verdict;
+
+	verdict = bpf_msg_redirect_map(msg, &sock_map, 0, 0);
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+SEC("sk_reuseport")
+int prog_reuseport(struct sk_reuseport_md *reuse)
+{
+	unsigned int *count;
+	int err, verdict;
+	int key = 0;
+
+	err = bpf_sk_select_reuseport(reuse, &sock_map, &key, 0);
+	verdict = (!err || err == -ENOENT) ? SK_PASS : SK_DROP;
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
-- 
2.24.1
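
The selftest driver in the patch above registers each case through a
TEST(fn) macro that pairs a function pointer with its stringified name,
then walks the table and skips entries the harness did not select. A
minimal standalone sketch of that table-driven pattern follows. The names
op_test_sketch, sketch_insert, sketch_delete and run_matching are
hypothetical, and the substring filter only stands in for
test__start_subtest(), which is not reproduced here.

```c
#include <assert.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Same shape as the macro in the patch: { function, "function" } */
#define TEST(fn) { fn, #fn }

/* Hypothetical table entry; the patch passes (family, sotype, mapfd). */
struct op_test_sketch {
	void (*fn)(int *counter);
	const char *name;
};

/* Hypothetical test bodies that only record that they ran. */
static void sketch_insert(int *counter) { *counter += 1; }
static void sketch_delete(int *counter) { *counter += 10; }

static const struct op_test_sketch tests[] = {
	TEST(sketch_insert),
	TEST(sketch_delete),
};

/* Run every entry whose name contains filter; NULL selects all. */
static int run_matching(const char *filter)
{
	const struct op_test_sketch *t;
	int counter = 0;

	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
		if (filter && !strstr(t->name, filter))
			continue;
		t->fn(&counter);
	}
	return counter;
}

/* Small self-check showing how the table is meant to be used. */
static void run_matching_selfcheck(void)
{
	assert(run_matching(NULL) == 11);
	assert(run_matching("delete") == 10);
}
```

Because the name comes from the # operator, adding a case is a one-line
change to the table and the reported subtest name cannot drift from the
function name.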



* Re: [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (10 preceding siblings ...)
  2020-01-10 10:50 ` [PATCH bpf-next v2 11/11] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
@ 2020-01-11  0:18 ` Alexei Starovoitov
  2020-01-11 22:47 ` John Fastabend
  12 siblings, 0 replies; 49+ messages in thread
From: Alexei Starovoitov @ 2020-01-11  0:18 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend,
	Lorenz Bauer, Martin KaFai Lau

On Fri, Jan 10, 2020 at 11:50:16AM +0100, Jakub Sitnicki wrote:
> With the realization that properly cloning listening sockets that have
> psock state/callbacks is tricky, comes the second version of patches.
> 
> The spirit of the patch set stays the same - make SOCKMAP a generic
> collection for listening and established sockets. This would let us use the
> SOCKMAP with reuseport today, and in the future hopefully with BPF programs
> that run at socket lookup time [0]. For a bit more context, please see v1
> cover letter [1].
> 
> The biggest change that happened since v1 is how we deal with clearing
> psock state in a copy of parent socket when cloning it (patches 3 & 4).
> 
> As much as I did not want to touch icsk/tcp clone path, it seems
> unavoidable. The changes were kept down to a minimum, with attention to not
> break existing users. That said, a review from the TCP maintainer would be
> invaluable (patches 3 & 4).
> 
> Patches 1 & 2 will conflict with recently posted "Fixes for sockmap/tls
> from more complex BPF progs" series [0]. I'll adapt or split them out this
> series once sockmap/tls fixes from John land in bpf-next branch.
> 
> Some food for thought - is mixing listening and established sockets in the
> same BPF map a good idea? I don't know but I couldn't find a good reason to
> restrict the user.
> 
> Considering how much the code evolved, I didn't carry over Acks from v1.
> 
> Thanks,
> jkbs
> 
> [0] https://lore.kernel.org/bpf/157851776348.1732.12600714815781177085.stgit@ubuntu3-kvm2/T/#t
> [1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/
> 
> v1 -> v2:
> 
> - af_ops->syn_recv_sock callback is no longer overridden and burdened with
>   restoring sk_prot and clearing sk_user_data in the child socket. As child
>   socket is already hashed when syn_recv_sock returns, it is too late to
>   put it in the right state. Instead patches 3 & 4 restore sk_prot and
>   clear sk_user_data before we hash the child socket. (Pointed out by
>   Martin Lau)
> 
> - Annotate shared access to sk->sk_prot with READ_ONCE/WRITE_ONCE macros as
>   we write to it from sk_msg while socket might be getting cloned on
>   another CPU. (Suggested by John Fastabend)
> 
> - Convert tests for SOCKMAP holding listening sockets to return-on-error
>   style, and hook them up to test_progs. Also use BPF skeleton for setup.
>   Add new tests to cover the race scenario discovered during v1 review.

lgtm
Martin, John, please review.
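
For readers following the v1 -> v2 notes quoted above, here is a rough
userspace sketch of the READ_ONCE/WRITE_ONCE annotation on sk_prot. It is
not the code from patches 3 & 4. The fake_* types and functions are
hypothetical, the macros are simplified stand-ins for the kernel ones
(they rely on the GCC/Clang __typeof__ extension), and the clone step only
assumes that a child should fall back to the base protocol ops.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel macros: force one volatile access. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical miniature of struct proto and struct sock. */
struct fake_proto { const char *name; };
struct fake_sock { const struct fake_proto *sk_prot; };

static const struct fake_proto base_proto = { "tcp" };
static const struct fake_proto bpf_proto = { "tcp_bpf" };

/* Writer side: swapping protocol ops when a socket enters or leaves a map. */
static void fake_install_bpf_proto(struct fake_sock *sk)
{
	WRITE_ONCE(sk->sk_prot, &bpf_proto);
}

static void fake_restore_proto(struct fake_sock *sk)
{
	WRITE_ONCE(sk->sk_prot, &base_proto);
}

/* Reader side: the clone path samples sk_prot exactly once, then decides. */
static void fake_clone(struct fake_sock *parent, struct fake_sock *child)
{
	const struct fake_proto *prot = READ_ONCE(parent->sk_prot);

	/* Assumption: the child must not inherit the overridden ops. */
	child->sk_prot = (prot == &bpf_proto) ? &base_proto : prot;
}

/* Small self-check of the intended behaviour. */
static void fake_clone_selfcheck(void)
{
	struct fake_sock parent = { &base_proto }, child = { NULL };

	fake_install_bpf_proto(&parent);
	fake_clone(&parent, &child);
	assert(child.sk_prot == &base_proto);
}
```

Sampling the pointer into a local before testing it is what keeps the
reader from seeing two different values if another CPU rewrites sk_prot
between the comparison and the use.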


* Re: [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-01-10 10:50 ` [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
@ 2020-01-11  2:42   ` kbuild test robot
  2020-01-11  3:02   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: kbuild test robot @ 2020-01-11  2:42 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: kbuild-all, bpf, netdev, kernel-team, Eric Dumazet,
	John Fastabend, Lorenz Bauer, Martin KaFai Lau

[-- Attachment #1: Type: text/plain, Size: 1529 bytes --]

Hi Jakub,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[also build test ERROR on bpf/master net/master net-next/master linus/master ipvs/master v5.5-rc5 next-20200110]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Jakub-Sitnicki/Extend-SOCKMAP-to-store-listening-sockets/20200111-045213
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arc-defconfig (attached as .config)
compiler: arc-elf-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=arc 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arc-elf-ld: net/ipv4/tcp_minisocks.o: in function `tcp_create_openreq_child':
   tcp_minisocks.c:(.text+0x474): undefined reference to `tcp_bpf_clone'
>> arc-elf-ld: tcp_minisocks.c:(.text+0x474): undefined reference to `tcp_bpf_clone'
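
A link error of this kind usually means the caller is always built while
the file defining the symbol is only built under some config option. One
common way to resolve it, shown here only as a sketch, is to declare the
real function when the option is on and to provide a static inline no-op
stub otherwise. FAKE_CONFIG_TCP_BPF, fake_sock and the fake_* functions are
hypothetical; the actual Kconfig symbol and the fix chosen for this series
are not stated in the report above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the option that builds the defining file.
 * Left undefined here, as in the failing defconfig builds.
 */
/* #define FAKE_CONFIG_TCP_BPF 1 */

struct fake_sock { void *sk_user_data; };

#ifdef FAKE_CONFIG_TCP_BPF
/* Real implementation lives in a separately compiled file. */
void fake_tcp_bpf_clone(const struct fake_sock *sk, struct fake_sock *newsk);
#else
/* Option off: an inline no-op keeps every caller linking. */
static inline void fake_tcp_bpf_clone(const struct fake_sock *sk,
				      struct fake_sock *newsk)
{
	(void)sk;
	(void)newsk;
}
#endif

/* Hypothetical caller that is compiled unconditionally. */
static void fake_create_child(const struct fake_sock *parent,
			      struct fake_sock *child)
{
	*child = *parent;
	fake_tcp_bpf_clone(parent, child);
}

/* Small self-check: with the stub, the child is a plain copy. */
static void fake_create_child_selfcheck(void)
{
	int token;
	struct fake_sock parent = { &token }, child = { NULL };

	fake_create_child(&parent, &child);
	assert(child.sk_user_data == &token);
}
```

With the stub in the header, the call site needs no #ifdef of its own and
the compiler can drop the call entirely when the option is off.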

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 9141 bytes --]


* Re: [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-01-10 10:50 ` [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
  2020-01-11  2:42   ` kbuild test robot
@ 2020-01-11  3:02   ` kbuild test robot
  2020-01-11 23:48   ` John Fastabend
  2020-01-13 22:23   ` Martin Lau
  3 siblings, 0 replies; 49+ messages in thread
From: kbuild test robot @ 2020-01-11  3:02 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: kbuild-all, bpf, netdev, kernel-team, Eric Dumazet,
	John Fastabend, Lorenz Bauer, Martin KaFai Lau

[-- Attachment #1: Type: text/plain, Size: 1551 bytes --]

Hi Jakub,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[also build test ERROR on bpf/master net/master net-next/master linus/master ipvs/master v5.5-rc5 next-20200110]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Jakub-Sitnicki/Extend-SOCKMAP-to-store-listening-sockets/20200111-045213
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: nds32-defconfig (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=nds32 

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   nds32le-linux-ld: net/ipv4/tcp_minisocks.o: in function `tcp_create_openreq_child':
   tcp_minisocks.c:(.text+0xf30): undefined reference to `tcp_bpf_clone'
>> nds32le-linux-ld: tcp_minisocks.c:(.text+0xf34): undefined reference to `tcp_bpf_clone'

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 10747 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets
  2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
                   ` (11 preceding siblings ...)
  2020-01-11  0:18 ` [PATCH bpf-next v2 00/11] Extend SOCKMAP to store " Alexei Starovoitov
@ 2020-01-11 22:47 ` John Fastabend
  12 siblings, 0 replies; 49+ messages in thread
From: John Fastabend @ 2020-01-11 22:47 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> With the realization that properly cloning listening sockets that have
> psock state/callbacks is tricky, comes the second version of patches.
> 
> The spirit of the patch set stays the same - make SOCKMAP a generic
> collection for listening and established sockets. This would let us use the
> SOCKMAP with reuseport today, and in the future hopefully with BPF programs
> that run at socket lookup time [0]. For a bit more context, please see v1
> cover letter [1].
> 
> The biggest change that happened since v1 is how we deal with clearing
> psock state in a copy of parent socket when cloning it (patches 3 & 4).
> 
> As much as I did not want to touch icsk/tcp clone path, it seems
> unavoidable. The changes were kept down to a minimum, with attention to not
> break existing users. That said, a review from the TCP maintainer would be
> invaluable (patches 3 & 4).
> 
> Patches 1 & 2 will conflict with recently posted "Fixes for sockmap/tls
> from more complex BPF progs" series [0]. I'll adapt or split them out of this
> series once the sockmap/tls fixes from John land in the bpf-next branch.

Thanks, I just posted a v2 of that series, so once that lands we will need
to respin this series.

> 
> Some food for thought - is mixing listening and established sockets in the
> same BPF map a good idea? I don't know but I couldn't find a good reason to
> restrict the user.

+1. In general I've been trying to avoid adding arbitrary restrictions.
In this case I agree: I can't think of a good reason to do it for my use
cases, but let's not stop someone from doing it if their use case calls
for it.

> 
> Considering how much the code evolved, I didn't carry over Acks from v1.

Sounds good, thanks for keeping this series going.

> 
> Thanks,
> jkbs
> 
> [0] https://lore.kernel.org/bpf/157851776348.1732.12600714815781177085.stgit@ubuntu3-kvm2/T/#t
> [1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/
> 
> v1 -> v2:
> 
> - af_ops->syn_recv_sock callback is no longer overridden and burdened with
>   restoring sk_prot and clearing sk_user_data in the child socket. As child
>   socket is already hashed when syn_recv_sock returns, it is too late to
>   put it in the right state. Instead patches 3 & 4 restore sk_prot and
>   clear sk_user_data before we hash the child socket. (Pointed out by
>   Martin Lau)
> 
> - Annotate shared access to sk->sk_prot with READ_ONCE/WRITE_ONCE macros as
>   we write to it from sk_msg while socket might be getting cloned on
>   another CPU. (Suggested by John Fastabend)
> 
> - Convert tests for SOCKMAP holding listening sockets to return-on-error
>   style, and hook them up to test_progs. Also use BPF skeleton for setup.
>   Add new tests to cover the race scenario discovered during v1 review.
> 
> RFC -> v1:
> 
> - Switch from overriding proto->accept to af_ops->syn_recv_sock, which
>   happens earlier. Clearing the psock state after accept() does not work
>   for child sockets that become orphaned (never got accepted). v4-mapped
>   sockets need special care.
> 
> - Return the socket cookie on SOCKMAP lookup from syscall to be on par with
>   REUSEPORT_SOCKARRAY. Requires SOCKMAP to take u64 on lookup/update from
>   syscall.
> 
> - Make bpf_sk_redirect_map (ingress) and bpf_msg_redirect_map (egress)
>   SOCKMAP helpers fail when target socket is a listening one.
> 
> - Make bpf_sk_select_reuseport helper fail when target is a TCP established
>   socket.
> 
> - Teach libbpf to recognize SK_REUSEPORT program type from section name.
> 
> - Add a dedicated set of tests for SOCKMAP holding listening sockets,
>   covering map operations, overridden socket callbacks, and BPF helpers.
> 
> 
> Jakub Sitnicki (11):
>   bpf, sk_msg: Don't reset saved sock proto on restore
>   net, sk_msg: Annotate lockless access to sk_prot on clone
>   net, sk_msg: Clear sk_user_data pointer on clone if tagged
>   tcp_bpf: Don't let child socket inherit parent protocol ops on copy
>   bpf, sockmap: Allow inserting listening TCP sockets into sockmap
>   bpf, sockmap: Don't set up sockmap progs for listening sockets
>   bpf, sockmap: Return socket cookie on lookup from syscall
>   bpf, sockmap: Let all kernel-land lookup values in SOCKMAP
>   bpf: Allow selecting reuseport socket from a SOCKMAP
>   selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
>   selftests/bpf: Tests for SOCKMAP holding listening sockets
> 
>  include/linux/skmsg.h                         |   14 +-
>  include/net/sock.h                            |   37 +-
>  include/net/tcp.h                             |    1 +
>  kernel/bpf/verifier.c                         |    6 +-
>  net/core/filter.c                             |   15 +-
>  net/core/skmsg.c                              |    2 +-
>  net/core/sock.c                               |   11 +-
>  net/core/sock_map.c                           |  120 +-
>  net/ipv4/tcp_bpf.c                            |   19 +-
>  net/ipv4/tcp_minisocks.c                      |    2 +
>  net/ipv4/tcp_ulp.c                            |    2 +-
>  net/tls/tls_main.c                            |    2 +-
>  .../bpf/prog_tests/select_reuseport.c         |   60 +-
>  .../selftests/bpf/prog_tests/sockmap_listen.c | 1378 +++++++++++++++++
>  .../selftests/bpf/progs/test_sockmap_listen.c |   76 +
>  tools/testing/selftests/bpf/test_maps.c       |    6 +-
>  16 files changed, 1696 insertions(+), 55 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen.c
> 
> -- 
> 2.24.1
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 01/11] bpf, sk_msg: Don't reset saved sock proto on restore
  2020-01-10 10:50 ` [PATCH bpf-next v2 01/11] bpf, sk_msg: Don't reset saved sock proto on restore Jakub Sitnicki
@ 2020-01-11 22:50   ` John Fastabend
  0 siblings, 0 replies; 49+ messages in thread
From: John Fastabend @ 2020-01-11 22:50 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> There is no need to reset psock->sk_proto when restoring socket protocol
> callbacks (sk->sk_prot). The psock is about to get detached from the sock
> and eventually destroyed.
> 
> No harm done if we restore the protocol callbacks twice, while it makes
> reasoning about psock state easier, that is once psock was initialized, we
> can assume psock->sk_proto is set.
> 
> Also, we don't need a fallback for when socket is not using ULP.
> tcp_update_ulp already does this for us.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone
  2020-01-10 10:50 ` [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
@ 2020-01-11 23:14   ` John Fastabend
  2020-01-13 15:09     ` Jakub Sitnicki
  0 siblings, 1 reply; 49+ messages in thread
From: John Fastabend @ 2020-01-11 23:14 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> sk_msg and ULP frameworks override protocol callbacks pointer in
> sk->sk_prot, while TCP accesses it locklessly when cloning the listening
> socket.
> 
> Once we enable use of listening sockets with sockmap (and hence sk_msg),
> there can be shared access to sk->sk_prot if socket is getting cloned while
> being inserted/deleted to/from the sockmap from another CPU. Mark the
> shared access with READ_ONCE/WRITE_ONCE annotations.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>

On the sockmap side I fixed this by wrapping the access in lock_sock [0].
Do you think this is still needed with that in mind? The bpf_clone call
is using sk_prot_creator and also setting the newsk's proto field. Even
if the listening parent sock was being deleted in parallel, would that be
a problem? We don't touch sk_prot_creator from the teardown path. I've
only scanned patches 3..11, so maybe the answer is below. If that is
the case, an improved commit message would probably be helpful.

[0] https://patchwork.ozlabs.org/patch/1221536/

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged
  2020-01-10 10:50 ` [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
@ 2020-01-11 23:38   ` John Fastabend
  2020-01-12 12:55   ` kbuild test robot
  2020-01-13 20:15   ` Martin Lau
  2 siblings, 0 replies; 49+ messages in thread
From: John Fastabend @ 2020-01-11 23:38 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> sk_user_data can hold a pointer to an object that is not intended to be
> shared between the parent socket and the child that gets a pointer copy on
> clone. This is the case when sk_user_data points at reference-counted
> object, like struct sk_psock.
> 
> One way to resolve it is to tag the pointer with a no-copy flag by
> repurposing its lowest bit. Based on the bit-flag value we clear the child
> sk_user_data pointer after cloning the parent socket.
> 
> The no-copy flag is stored in the pointer itself as opposed to externally,
> say in socket flags, to guarantee that the pointer and the flag are copied
> from parent to child socket in an atomic fashion. Parent socket state is
> subject to change while copying, we don't hold any locks at that time.
> 
> This approach relies on an assumption that sk_user_data holds a pointer to
> an object aligned to 2 or more bytes. A manual audit of existing users of
> rcu_dereference_sk_user_data helper confirms it. Also, an RCU-protected
> sk_user_data is not likely to hold a pointer to a char value or a
> pathological case of "struct { char c; }". To be safe, warn when the
> flag-bit is set when setting sk_user_data to catch any future misuses.
> 
> It is worth considering why clearing sk_user_data unconditionally is not an
> option. There exist users, DRBD, NVMe, and Xen drivers being among them,
> that rely on the pointer being copied when cloning the listening socket.
> 
> Potentially we could distinguish these users by checking if the listening
> socket has been created in kernel-space via sock_create_kern, and hence has
> sk_kern_sock flag set. However, this is not the case for NVMe and Xen
> drivers, which create sockets without marking them as belonging to the
> kernel.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

LGTM.
Acked-by: John Fastabend <john.fastabend@gmail.com>
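
For readers following along, the no-copy tagging described in the commit
message above can be sketched in user-space C. This is only an
illustration under stated assumptions, not the code from the patch: all
demo_* names and DEMO_NOCOPY_FLAG are hypothetical, and the sketch
assumes the pointed-to object is aligned to at least 2 bytes so that the
lowest pointer bit is free.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: the lowest pointer bit means "do not copy on clone". */
#define DEMO_NOCOPY_FLAG ((uintptr_t)1)

/* Tag a pointer. Assumes ptr is at least 2-byte aligned, so bit 0 is unused. */
static inline void *demo_tag_nocopy(void *ptr)
{
	return (void *)((uintptr_t)ptr | DEMO_NOCOPY_FLAG);
}

/* Is the no-copy bit set on this (possibly tagged) pointer? */
static inline int demo_is_nocopy(const void *tagged)
{
	return ((uintptr_t)tagged & DEMO_NOCOPY_FLAG) != 0;
}

/* Strip the flag bit to recover the original pointer. */
static inline void *demo_untag(void *tagged)
{
	return (void *)((uintptr_t)tagged & ~DEMO_NOCOPY_FLAG);
}

/*
 * Value the child ends up with after a clone: the flag travels inside the
 * pointer word itself, so a single copy of that word carries both the
 * pointer and the decision whether the child may keep it.
 */
static inline void *demo_clone_user_data(void *parent_user_data)
{
	return demo_is_nocopy(parent_user_data) ? NULL : parent_user_data;
}
```

Keeping the flag in the pointer word is what lets the parent be read
once without holding a lock; a flag stored elsewhere could change between
the two reads.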

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index e6ffdb47b619..f6c83747c71e 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -535,6 +535,10 @@ static void tcp_bpf_remove(struct sock *sk, struct sk_psock *psock)
>  {
>  	struct sk_psock_link *link;
>  
> +	/* Did a child socket inadvertently inherit parent's psock? */
> +	if (WARN_ON(sk != psock->sk))
> +		return;
> +

Not sure if this is needed. We would probably have hit problems before
we get here anyway, for example if the parent sock was deleted while
the child is still around. I think I would just drop it.

>  	while ((link = sk_psock_link_pop(psock))) {
>  		sk_psock_unlink(sk, link);
>  		sk_psock_free_link(link);
> -- 
> 2.24.1
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-01-10 10:50 ` [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
  2020-01-11  2:42   ` kbuild test robot
  2020-01-11  3:02   ` kbuild test robot
@ 2020-01-11 23:48   ` John Fastabend
  2020-01-13 22:31     ` Jakub Sitnicki
  2020-01-13 22:23   ` Martin Lau
  3 siblings, 1 reply; 49+ messages in thread
From: John Fastabend @ 2020-01-11 23:48 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> Prepare for cloning listening sockets that have their protocol callbacks
> overridden by sk_msg. Child sockets must not inherit parent callbacks that
> access state stored in sk_user_data owned by the parent.
> 
> Restore the child socket protocol callbacks before it gets hashed and
> any of the callbacks can get invoked.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  include/net/tcp.h        |  1 +
>  net/ipv4/tcp_bpf.c       | 13 +++++++++++++
>  net/ipv4/tcp_minisocks.c |  2 ++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 9dd975be7fdf..7cbf9465bb10 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2181,6 +2181,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  		    int nonblock, int flags, int *addr_len);
>  int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
>  		      struct msghdr *msg, int len, int flags);
> +void tcp_bpf_clone(const struct sock *sk, struct sock *child);
>  
>  /* Call BPF_SOCK_OPS program that returns an int. If the return value
>   * is < 0, then the BPF op failed (for example if the loaded BPF
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index f6c83747c71e..6f96320fb7cf 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -586,6 +586,19 @@ static void tcp_bpf_close(struct sock *sk, long timeout)
>  	saved_close(sk, timeout);
>  }
>  
> +/* If a child got cloned from a listening socket that had tcp_bpf
> + * protocol callbacks installed, we need to restore the callbacks to
> + * the default ones because the child does not inherit the psock state
> + * that tcp_bpf callbacks expect.
> + */
> +void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
> +{
> +	struct proto *prot = newsk->sk_prot;
> +
> +	if (prot->recvmsg == tcp_bpf_recvmsg)
> +		newsk->sk_prot = sk->sk_prot_creator;
> +}
> +

^^^^ probably needs to go into tcp.h wrapped in ifdef NET_SOCK_MSG, with
a stub for the ifndef NET_SOCK_MSG case.

Looks like build bot also caught this.
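
A minimal sketch of the declaration-plus-stub pattern suggested here,
written as stand-alone C so it compiles outside the kernel. Everything in
it is an assumption made for illustration: DEMO_NET_SOCK_MSG stands in
for the real config symbol, and struct demo_sock, demo_bpf_clone and
demo_create_child are hypothetical names, not the kernel's.

```c
#include <stddef.h>

/* Hypothetical, much reduced stand-in for struct sock. */
struct demo_sock {
	const char *prot;          /* current protocol ops (may be overridden) */
	const char *prot_creator;  /* protocol ops the socket was created with */
};

#ifdef DEMO_NET_SOCK_MSG
/* Feature built in: this declaration would live in the header ... */
void demo_bpf_clone(const struct demo_sock *sk, struct demo_sock *newsk);

/* ... and this definition in the .c file that is only built with the feature. */
void demo_bpf_clone(const struct demo_sock *sk, struct demo_sock *newsk)
{
	newsk->prot = sk->prot_creator;
}
#else
/* Feature compiled out: an empty inline stub keeps callers linking. */
static inline void demo_bpf_clone(const struct demo_sock *sk,
				  struct demo_sock *newsk)
{
	(void)sk;
	(void)newsk;
}
#endif

/* Caller that is always compiled, like the child-creation path in the report. */
static inline void demo_create_child(const struct demo_sock *sk,
				     struct demo_sock *child)
{
	*child = *sk;
	demo_bpf_clone(sk, child);
}
```

With the stub in place the caller needs no ifdef of its own, which is
what avoids the undefined reference on configs that do not build the
feature.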

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 05/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap
  2020-01-10 10:50 ` [PATCH bpf-next v2 05/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
@ 2020-01-11 23:59   ` John Fastabend
  2020-01-13 15:48     ` Jakub Sitnicki
  0 siblings, 1 reply; 49+ messages in thread
From: John Fastabend @ 2020-01-11 23:59 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> In order for sockmap type to become a generic collection for storing TCP
> sockets we need to loosen the checks during map update, while tightening
> the checks in redirect helpers.
> 
> Currently sockmap requires the TCP socket to be in established state (or
> transitioning out of SYN_RECV into established state when done from BPF),
> which prevents inserting listening sockets.
> 
> Change the update pre-checks so that the socket can also be in listening
> state. If the state is not white-listed, return -EINVAL to be consistent
> with REUSEPORT_SOCKARRAY map type.
> 
> Since it doesn't make sense to redirect with sockmap to listening sockets,
> add appropriate socket state checks to BPF redirect helpers too.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/core/sock_map.c                     | 46 ++++++++++++++++++++-----
>  tools/testing/selftests/bpf/test_maps.c |  6 +---
>  2 files changed, 39 insertions(+), 13 deletions(-)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index eb114ee419b6..99daea502508 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -396,6 +396,23 @@ static bool sock_map_sk_is_suitable(const struct sock *sk)
>  	       sk->sk_protocol == IPPROTO_TCP;
>  }
>  
> +/* Is sock in a state that allows inserting into the map?
> + * SYN_RECV is needed for updates on BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB.
> + */
> +static bool sock_map_update_okay(const struct sock *sk)
> +{
> +	return (1 << sk->sk_state) & (TCPF_ESTABLISHED |
> +				      TCPF_SYN_RECV |
> +				      TCPF_LISTEN);
> +}
> +
> +/* Is sock in a state that allows redirecting into it? */
> +static bool sock_map_redirect_okay(const struct sock *sk)
> +{
> +	return (1 << sk->sk_state) & (TCPF_ESTABLISHED |
> +				      TCPF_SYN_RECV);
> +}
> +
>  static int sock_map_update_elem(struct bpf_map *map, void *key,
>  				void *value, u64 flags)
>  {
> @@ -413,11 +430,14 @@ static int sock_map_update_elem(struct bpf_map *map, void *key,
>  		ret = -EINVAL;
>  		goto out;
>  	}
> -	if (!sock_map_sk_is_suitable(sk) ||
> -	    sk->sk_state != TCP_ESTABLISHED) {
> +	if (!sock_map_sk_is_suitable(sk)) {
>  		ret = -EOPNOTSUPP;
>  		goto out;
>  	}
> +	if (!sock_map_update_okay(sk)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}

A nit, but seeing as we need a v3 anyway: how about consolidating
these state checks into sock_map_sk_is_suitable(), so we don't have
multiple if branches or this '|| TCP_ESTABLISHED' like we do now?

>  
>  	sock_map_sk_acquire(sk);
>  	ret = sock_map_update_common(map, idx, sk, flags);
> @@ -433,6 +453,7 @@ BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, sops,
>  	WARN_ON_ONCE(!rcu_read_lock_held());
>  
>  	if (likely(sock_map_sk_is_suitable(sops->sk) &&
> +		   sock_map_update_okay(sops->sk) &&
>  		   sock_map_op_okay(sops)))
>  		return sock_map_update_common(map, *(u32 *)key, sops->sk,
>  					      flags);
> @@ -454,13 +475,17 @@ BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
>  	   struct bpf_map *, map, u32, key, u64, flags)
>  {
>  	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> +	struct sock *sk;
>  
>  	if (unlikely(flags & ~(BPF_F_INGRESS)))
>  		return SK_DROP;
> -	tcb->bpf.flags = flags;
> -	tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key);
> -	if (!tcb->bpf.sk_redir)
> +
> +	sk = __sock_map_lookup_elem(map, key);
> +	if (!sk || !sock_map_redirect_okay(sk))
>  		return SK_DROP;

unlikely(!sock_map_redirect_okay(sk))? Or perhaps mark the entire condition
unlikely: if (unlikely(!sk || !sock_map_redirect_okay(sk))). I think users
should know whether the sk is a valid sock or not, and this is just catching
the error case. Any opinion?

Otherwise looks good.
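
For reference, the "(1 << state) & mask" idiom used by the two new
helpers can be tried out in user space. The sketch below is an
illustration only: the DEMO_* names and demo_* functions are
hypothetical, and the state numbers are illustrative values chosen to
mirror the usual TCP state numbering.

```c
#include <stdbool.h>

/* Illustrative state numbers (assumed to mirror the usual TCP numbering). */
enum demo_tcp_state {
	DEMO_TCP_ESTABLISHED = 1,
	DEMO_TCP_SYN_SENT    = 2,
	DEMO_TCP_SYN_RECV    = 3,
	DEMO_TCP_LISTEN      = 10,
};

/* One bit per state, so a set of allowed states is a single mask. */
#define DEMO_TCPF(state) (1U << (state))

/* States in which a socket may be inserted into the map. */
static inline bool demo_update_okay(int state)
{
	return DEMO_TCPF(state) & (DEMO_TCPF(DEMO_TCP_ESTABLISHED) |
				   DEMO_TCPF(DEMO_TCP_SYN_RECV) |
				   DEMO_TCPF(DEMO_TCP_LISTEN));
}

/* States in which a socket may be the target of a redirect. */
static inline bool demo_redirect_okay(int state)
{
	return DEMO_TCPF(state) & (DEMO_TCPF(DEMO_TCP_ESTABLISHED) |
				   DEMO_TCPF(DEMO_TCP_SYN_RECV));
}
```

A listening socket passes the update check but fails the redirect check,
which is the asymmetry the patch description asks for.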

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 06/11] bpf, sockmap: Don't set up sockmap progs for listening sockets
  2020-01-10 10:50 ` [PATCH bpf-next v2 06/11] bpf, sockmap: Don't set up sockmap progs for listening sockets Jakub Sitnicki
@ 2020-01-12  0:51   ` John Fastabend
  2020-01-12  1:07     ` John Fastabend
  0 siblings, 1 reply; 49+ messages in thread
From: John Fastabend @ 2020-01-12  0:51 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> Now that sockmap can hold listening sockets, when setting up the psock we
> will (1) grab references to verdict/parser progs, and (2) override socket
> upcalls sk_data_ready and sk_write_space.
> 
> We cannot redirect to listening sockets so we don't need to link the socket
> to the BPF progs, but more importantly we don't want the listening socket
> to have overridden upcalls because they would get inherited by child
> sockets cloned from it.
> 
> Introduce a separate initialization path for listening sockets that does
> not change the upcalls and ignores the BPF progs.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/core/sock_map.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)


Any reason support is only added for sock_map types? We can also support
sock_hash, I presume? It could be a follow-up patch, I guess, but if it's
not too much trouble it would be worth adding now vs trying to detect at
run time later. I think it should be as simple as using similar logic as
below in sock_hash_update_common.

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall
  2020-01-10 10:50 ` [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
@ 2020-01-12  0:56   ` John Fastabend
  2020-01-13 23:12   ` Martin Lau
  1 sibling, 0 replies; 49+ messages in thread
From: John Fastabend @ 2020-01-12  0:56 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> Tooling that populates the SOCKMAP with sockets from user-space needs a way
> to inspect its contents. Returning the struct sock * that SOCKMAP holds to
> user-space is neither safe nor useful. An approach established by
> REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
> instead.
> 
> Since socket cookies are u64 values SOCKMAP needs to support such a value
> size for lookup to be possible. This requires special handling on update,
> though. Attempts to do a lookup on SOCKMAP holding u32 values will be met
> with ENOSPC error.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>
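
The value-size rule from the commit message can be illustrated with a
small user-space sketch. It is a simplified model, not the kernel code:
demo_lookup_cookie is a hypothetical helper, and the only behaviour taken
from the description is that a lookup succeeds for a 64-bit value size
and fails with ENOSPC otherwise.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical lookup: copy the 64-bit socket cookie into the caller's
 * buffer only when the map's value size can hold it.
 */
static inline int demo_lookup_cookie(uint64_t cookie, void *value,
				     size_t value_size)
{
	if (value_size != sizeof(uint64_t))
		return -ENOSPC;  /* e.g. a map created with 4-byte values */

	memcpy(value, &cookie, sizeof(cookie));
	return 0;
}
```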

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP
  2020-01-10 10:50 ` [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
@ 2020-01-12  1:00   ` John Fastabend
  2020-01-13 23:45   ` Martin Lau
  2020-01-13 23:51   ` Martin Lau
  2 siblings, 0 replies; 49+ messages in thread
From: John Fastabend @ 2020-01-12  1:00 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> SOCKMAP now supports storing references to listening sockets. Nothing keeps
> us from using it as an array of sockets to select from in SK_REUSEPORT
> programs.
> 
> Whitelist the map type with the BPF helper for selecting socket.
> 
> The restriction that the socket has to be a member of a reuseport group
> still applies. Socket from a SOCKMAP that does not have sk_reuseport_cb set
> is not a valid target and we signal it with -EINVAL.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP
  2020-01-10 10:50 ` [PATCH bpf-next v2 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
@ 2020-01-12  1:01   ` John Fastabend
  0 siblings, 0 replies; 49+ messages in thread
From: John Fastabend @ 2020-01-12  1:01 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
> is not hard-coded in the test setup routine.
> 
> This, together with careful state cleaning after the tests, lets us run
> the test cases once with REUSEPORT_ARRAY and once with SOCKMAP (TCP only),
> to have test coverage for the latter as well.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
> 

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 11/11] selftests/bpf: Tests for SOCKMAP holding listening sockets
  2020-01-10 10:50 ` [PATCH bpf-next v2 11/11] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
@ 2020-01-12  1:06   ` John Fastabend
  2020-01-13 15:58     ` Jakub Sitnicki
  0 siblings, 1 reply; 49+ messages in thread
From: John Fastabend @ 2020-01-12  1:06 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

Jakub Sitnicki wrote:
> Now that SOCKMAP can store listening sockets, user-space and BPF API is
> open to a new set of potential pitfalls. Exercise the map operations (with
> extra attention to code paths susceptible to races between map ops and
> socket cloning), and BPF helpers that work with SOCKMAP to gain confidence
> that all works as expected.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

[...]

> +static void test_sockmap_insert_listening(int family, int sotype, int mapfd)
> +{
> +	u64 value;
> +	u32 key;
> +	int s;
> +
> +	s = listen_loopback(family, sotype);
> +	if (s < 0)
> +		return;

Will the test be marked OK if listen fails here? Should we mark it skipped or
maybe even failed? Just concerned it may be passing even if the update doesn't
actually happen.

> +
> +	key = 0;
> +	value = s;
> +	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
> +	xclose(s);
> +}

Thanks,
John
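
One way to make the outcome explicit, sketched as stand-alone C. This is
not the selftest code and does not use the test_progs API: the enum and
demo_insert_listening are hypothetical, and whether a failed listen
should count as a skip or a failure is left open, as in the question
above.

```c
/* Hypothetical outcome type for a single test case. */
enum demo_result {
	DEMO_PASS,
	DEMO_FAIL,
	DEMO_SKIP,
};

/*
 * Decide the outcome from the two steps of the test: listen_fd is the
 * return value of the setup step, update_ret that of the map update.
 * A setup failure is reported as a skip here instead of returning
 * silently, so it cannot be mistaken for a pass.
 */
static inline enum demo_result demo_insert_listening(int listen_fd,
						     int update_ret)
{
	if (listen_fd < 0)
		return DEMO_SKIP;
	if (update_ret != 0)
		return DEMO_FAIL;
	return DEMO_PASS;
}
```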

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH bpf-next v2 06/11] bpf, sockmap: Don't set up sockmap progs for listening sockets
  2020-01-12  0:51   ` John Fastabend
@ 2020-01-12  1:07     ` John Fastabend
  2020-01-13 17:59       ` Jakub Sitnicki
  0 siblings, 1 reply; 49+ messages in thread
From: John Fastabend @ 2020-01-12  1:07 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer,
	Martin KaFai Lau

John Fastabend wrote:
> Jakub Sitnicki wrote:
> > Now that sockmap can hold listening sockets, when setting up the psock we
> > will (1) grab references to verdict/parser progs, and (2) override socket
> > upcalls sk_data_ready and sk_write_space.
> > 
> > We cannot redirect to listening sockets so we don't need to link the socket
> > to the BPF progs, but more importantly we don't want the listening socket
> > to have overridden upcalls because they would get inherited by child
> > sockets cloned from it.
> > 
> > Introduce a separate initialization path for listening sockets that does
> > not change the upcalls and ignores the BPF progs.
> > 
> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> > ---
> >  net/core/sock_map.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> 
> Any reason support is only added for sock_map types? We can also support
> sock_hash, I presume? It could be a follow-up patch, I guess, but if it's
> not too much trouble it would be worth adding now vs trying to detect at
> run time later. I think it should be as simple as using similar logic as
> below in sock_hash_update_common.
> 
> Thanks.

After running through the other patches, I think it's probably OK to do hash
support as a follow-up. Up to you.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged
  2020-01-10 10:50 ` [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
  2020-01-11 23:38   ` John Fastabend
@ 2020-01-12 12:55   ` kbuild test robot
  2020-01-13 20:15   ` Martin Lau
  2 siblings, 0 replies; 49+ messages in thread
From: kbuild test robot @ 2020-01-12 12:55 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: kbuild-all, bpf, netdev, kernel-team, Eric Dumazet,
	John Fastabend, Lorenz Bauer, Martin KaFai Lau

Hi Jakub,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]
[also build test WARNING on bpf/master net/master net-next/master linus/master ipvs/master v5.5-rc5 next-20200110]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Jakub-Sitnicki/Extend-SOCKMAP-to-store-listening-sockets/20200111-045213
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.1-129-g341daf20-dirty
        make ARCH=x86_64 allmodconfig
        make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

   include/trace/events/sock.h:177:1: sparse: sparse: directive in macro's argument list
   include/trace/events/sock.h:184:1: sparse: sparse: directive in macro's argument list
>> net/core/sock.c:1871:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> net/core/sock.c:1871:25: sparse:    void [noderef] <asn:4> *
>> net/core/sock.c:1871:25: sparse:    void *

vim +1871 net/core/sock.c

  1785	
  1786	/**
  1787	 *	sk_clone_lock - clone a socket, and lock its clone
  1788	 *	@sk: the socket to clone
  1789	 *	@priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
  1790	 *
  1791	 *	Caller must unlock socket even in error path (bh_unlock_sock(newsk))
  1792	 */
  1793	struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
  1794	{
  1795		struct proto *prot = READ_ONCE(sk->sk_prot);
  1796		struct sock *newsk;
  1797		bool is_charged = true;
  1798	
  1799		newsk = sk_prot_alloc(prot, priority, sk->sk_family);
  1800		if (newsk != NULL) {
  1801			struct sk_filter *filter;
  1802	
  1803			sock_copy(newsk, sk);
  1804	
  1805			newsk->sk_prot_creator = prot;
  1806	
  1807			/* SANITY */
  1808			if (likely(newsk->sk_net_refcnt))
  1809				get_net(sock_net(newsk));
  1810			sk_node_init(&newsk->sk_node);
  1811			sock_lock_init(newsk);
  1812			bh_lock_sock(newsk);
  1813			newsk->sk_backlog.head	= newsk->sk_backlog.tail = NULL;
  1814			newsk->sk_backlog.len = 0;
  1815	
  1816			atomic_set(&newsk->sk_rmem_alloc, 0);
  1817			/*
  1818			 * sk_wmem_alloc set to one (see sk_free() and sock_wfree())
  1819			 */
  1820			refcount_set(&newsk->sk_wmem_alloc, 1);
  1821			atomic_set(&newsk->sk_omem_alloc, 0);
  1822			sk_init_common(newsk);
  1823	
  1824			newsk->sk_dst_cache	= NULL;
  1825			newsk->sk_dst_pending_confirm = 0;
  1826			newsk->sk_wmem_queued	= 0;
  1827			newsk->sk_forward_alloc = 0;
  1828			atomic_set(&newsk->sk_drops, 0);
  1829			newsk->sk_send_head	= NULL;
  1830			newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
  1831			atomic_set(&newsk->sk_zckey, 0);
  1832	
  1833			sock_reset_flag(newsk, SOCK_DONE);
  1834			mem_cgroup_sk_alloc(newsk);
  1835			cgroup_sk_alloc(&newsk->sk_cgrp_data);
  1836	
  1837			rcu_read_lock();
  1838			filter = rcu_dereference(sk->sk_filter);
  1839			if (filter != NULL)
  1840				/* though it's an empty new sock, the charging may fail
  1841				 * if sysctl_optmem_max was changed between creation of
  1842				 * original socket and cloning
  1843				 */
  1844				is_charged = sk_filter_charge(newsk, filter);
  1845			RCU_INIT_POINTER(newsk->sk_filter, filter);
  1846			rcu_read_unlock();
  1847	
  1848			if (unlikely(!is_charged || xfrm_sk_clone_policy(newsk, sk))) {
  1849				/* We need to make sure that we don't uncharge the new
  1850				 * socket if we couldn't charge it in the first place
  1851				 * as otherwise we uncharge the parent's filter.
  1852				 */
  1853				if (!is_charged)
  1854					RCU_INIT_POINTER(newsk->sk_filter, NULL);
  1855				sk_free_unlock_clone(newsk);
  1856				newsk = NULL;
  1857				goto out;
  1858			}
  1859			RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
  1860	
  1861			if (bpf_sk_storage_clone(sk, newsk)) {
  1862				sk_free_unlock_clone(newsk);
  1863				newsk = NULL;
  1864				goto out;
  1865			}
  1866	
  1867			/* Clear sk_user_data if parent had the pointer tagged
  1868			 * as not suitable for copying when cloning.
  1869			 */
  1870			if (sk_user_data_is_nocopy(newsk))
> 1871				RCU_INIT_POINTER(newsk->sk_user_data, NULL);
  1872	
  1873			newsk->sk_err	   = 0;
  1874			newsk->sk_err_soft = 0;
  1875			newsk->sk_priority = 0;
  1876			newsk->sk_incoming_cpu = raw_smp_processor_id();
  1877			if (likely(newsk->sk_net_refcnt))
  1878				sock_inuse_add(sock_net(newsk), 1);
  1879	
  1880			/*
  1881			 * Before updating sk_refcnt, we must commit prior changes to memory
  1882			 * (Documentation/RCU/rculist_nulls.txt for details)
  1883			 */
  1884			smp_wmb();
  1885			refcount_set(&newsk->sk_refcnt, 2);
  1886	
  1887			/*
  1888			 * Increment the counter in the same struct proto as the master
  1889			 * sock (sk_refcnt_debug_inc uses newsk->sk_prot->socks, that
  1890			 * is the same as sk->sk_prot->socks, as this field was copied
  1891			 * with memcpy).
  1892			 *
  1893			 * This _changes_ the previous behaviour, where
  1894			 * tcp_create_openreq_child always was incrementing the
  1895			 * equivalent to tcp_prot->socks (inet_sock_nr), so this have
  1896			 * to be taken into account in all callers. -acme
  1897			 */
  1898			sk_refcnt_debug_inc(newsk);
  1899			sk_set_socket(newsk, NULL);
  1900			RCU_INIT_POINTER(newsk->sk_wq, NULL);
  1901	
  1902			if (newsk->sk_prot->sockets_allocated)
  1903				sk_sockets_allocated_inc(newsk);
  1904	
  1905			if (sock_needs_netstamp(sk) &&
  1906			    newsk->sk_flags & SK_FLAGS_TIMESTAMP)
  1907				net_enable_timestamp();
  1908		}
  1909	out:
  1910		return newsk;
  1911	}
  1912	EXPORT_SYMBOL_GPL(sk_clone_lock);
  1913	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone
  2020-01-11 23:14   ` John Fastabend
@ 2020-01-13 15:09     ` Jakub Sitnicki
  2020-01-14  3:14       ` John Fastabend
  2020-01-20 17:00       ` John Fastabend
  0 siblings, 2 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-13 15:09 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

On Sun, Jan 12, 2020 at 12:14 AM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> sk_msg and ULP frameworks override protocol callbacks pointer in
>> sk->sk_prot, while TCP accesses it locklessly when cloning the listening
>> socket.
>>
>> Once we enable use of listening sockets with sockmap (and hence sk_msg),
>> there can be shared access to sk->sk_prot if socket is getting cloned while
>> being inserted/deleted to/from the sockmap from another CPU. Mark the
>> shared access with READ_ONCE/WRITE_ONCE annotations.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>
> On the sockmap side I fixed this by wrapping the access in a lock_sock[0]. Do
> you think this is still needed with that in mind? The bpf_clone call
> is using sk_prot_creator and also setting the newsk's proto field. Even
> if the listening parent sock was being deleted in parallel, would that be
> a problem? We don't touch sk_prot_creator from the teardown path. I've
> only scanned patches 3..11, so maybe the answer is below. If that is
> the case, an improved commit message would probably be helpful.

I think it is needed. Not because of tcp_bpf_clone or that we access
listener's sk_prot_creator from there, if I'm grasping your question.

Either way I'm glad this came up. Let's go through my reasoning and
verify it. The TCP stack accesses the listener's sk_prot while cloning it:

tcp_v4_rcv
  sk = __inet_lookup_skb(...)
  tcp_check_req(sk)
    inet_csk(sk)->icsk_af_ops->syn_recv_sock
      tcp_v4_syn_recv_sock
        tcp_create_openreq_child
          inet_csk_clone_lock
            sk_clone_lock
              READ_ONCE(sk->sk_prot)

It grabs a reference to the listener, but doesn't grab the sk_lock.

On another CPU we can be inserting/removing the listener socket from the
sockmap and writing to its sk_prot. We have the update and the remove
path:

sock_map_ops->map_update_elem
  sock_map_update_elem
    sock_map_update_common
      sock_map_link_no_progs
        tcp_bpf_init
          tcp_bpf_update_sk_prot
            sk_psock_update_proto
              WRITE_ONCE(sk->sk_prot, ops)

sock_map_ops->map_delete_elem
  sock_map_delete_elem
    __sock_map_delete
     sock_map_unref
       sk_psock_put
         sk_psock_drop
           sk_psock_restore_proto
             tcp_update_ulp
               WRITE_ONCE(sk->sk_prot, proto)

Following the guidelines from the KTSAN project [0], sk_prot looks like a
candidate for annotation, at least on these 3 call paths.

If that sounds correct, I can add it to the patch description.

Thanks,
-jkbs

[0] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE
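
The three call paths above can be modelled in user space. What follows is
a minimal sketch, not kernel code: the toy_ names and the TOY_READ_ONCE /
TOY_WRITE_ONCE macros are made up for illustration and only approximate the
kernel macros with a volatile access (__typeof__ is a GCC/Clang extension).
It shows the shape of the annotation: the lockless reader loads sk_prot
exactly once into a local, and the writers publish the new pointer with a
single store.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel macros: the volatile access keeps
 * the compiler from tearing, fusing, or repeating the load/store.
 */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct toy_proto { const char *name; };
struct toy_sock  { struct toy_proto *sk_prot; };

/* Clone path: runs without the socket lock, so read the pointer once
 * and use only the local copy afterwards.
 */
static struct toy_proto *toy_clone_read_prot(const struct toy_sock *sk)
{
	struct toy_proto *prot = TOY_READ_ONCE(sk->sk_prot);

	return prot;
}

/* Map update/delete path: publish the new callbacks with one store. */
static void toy_update_prot(struct toy_sock *sk, struct toy_proto *ops)
{
	assert(ops != NULL);
	TOY_WRITE_ONCE(sk->sk_prot, ops);
}
```

The annotations do not order anything by themselves; they only keep the
compiler from splitting or repeating the access, and they document that the
field is shared.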


* Re: [PATCH bpf-next v2 05/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap
  2020-01-11 23:59   ` John Fastabend
@ 2020-01-13 15:48     ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-13 15:48 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

On Sun, Jan 12, 2020 at 12:59 AM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> In order for sockmap type to become a generic collection for storing TCP
>> sockets we need to loosen the checks during map update, while tightening
>> the checks in redirect helpers.
>>
>> Currently sockmap requires the TCP socket to be in established state (or
>> transitioning out of SYN_RECV into established state when done from BPF),
>> which prevents inserting listening sockets.
>>
>> Change the update pre-checks so that the socket can also be in listening
>> state. If the state is not white-listed, return -EINVAL to be consistent
>> with REUSEPORT_SOCKARRAY map type.
>>
>> Since it doesn't make sense to redirect with sockmap to listening sockets,
>> add appropriate socket state checks to BPF redirect helpers too.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>  net/core/sock_map.c                     | 46 ++++++++++++++++++++-----
>>  tools/testing/selftests/bpf/test_maps.c |  6 +---
>>  2 files changed, 39 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index eb114ee419b6..99daea502508 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -396,6 +396,23 @@ static bool sock_map_sk_is_suitable(const struct sock *sk)
>>  	       sk->sk_protocol == IPPROTO_TCP;
>>  }
>>
>> +/* Is sock in a state that allows inserting into the map?
>> + * SYN_RECV is needed for updates on BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB.
>> + */
>> +static bool sock_map_update_okay(const struct sock *sk)
>> +{
>> +	return (1 << sk->sk_state) & (TCPF_ESTABLISHED |
>> +				      TCPF_SYN_RECV |
>> +				      TCPF_LISTEN);
>> +}
>> +
>> +/* Is sock in a state that allows redirecting into it? */
>> +static bool sock_map_redirect_okay(const struct sock *sk)
>> +{
>> +	return (1 << sk->sk_state) & (TCPF_ESTABLISHED |
>> +				      TCPF_SYN_RECV);
>> +}
>> +
>>  static int sock_map_update_elem(struct bpf_map *map, void *key,
>>  				void *value, u64 flags)
>>  {
>> @@ -413,11 +430,14 @@ static int sock_map_update_elem(struct bpf_map *map, void *key,
>>  		ret = -EINVAL;
>>  		goto out;
>>  	}
>> -	if (!sock_map_sk_is_suitable(sk) ||
>> -	    sk->sk_state != TCP_ESTABLISHED) {
>> +	if (!sock_map_sk_is_suitable(sk)) {
>>  		ret = -EOPNOTSUPP;
>>  		goto out;
>>  	}
>> +	if (!sock_map_update_okay(sk)) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>
> A nit, but seeing as we need a v3 anyway: how about consolidating
> these state checks into sock_map_sk_is_suitable() so we don't have
> multiple if branches or this '|| TCP_ESTABLISHED' like we do now?

Ah, I see the pattern now :-)

>>
>>  	sock_map_sk_acquire(sk);
>>  	ret = sock_map_update_common(map, idx, sk, flags);
>> @@ -433,6 +453,7 @@ BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, sops,
>>  	WARN_ON_ONCE(!rcu_read_lock_held());
>>
>>  	if (likely(sock_map_sk_is_suitable(sops->sk) &&
>> +		   sock_map_update_okay(sops->sk) &&
>>  		   sock_map_op_okay(sops)))
>>  		return sock_map_update_common(map, *(u32 *)key, sops->sk,
>>  					      flags);
>> @@ -454,13 +475,17 @@ BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
>>  	   struct bpf_map *, map, u32, key, u64, flags)
>>  {
>>  	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
>> +	struct sock *sk;
>>
>>  	if (unlikely(flags & ~(BPF_F_INGRESS)))
>>  		return SK_DROP;
>> -	tcb->bpf.flags = flags;
>> -	tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key);
>> -	if (!tcb->bpf.sk_redir)
>> +
>> +	sk = __sock_map_lookup_elem(map, key);
>> +	if (!sk || !sock_map_redirect_okay(sk))
>>  		return SK_DROP;
>
> unlikely(!sock_map_redirect_okay(sk))? Or perhaps wrap the entire case in
> unlikely(): if (unlikely(!sk || !sock_map_redirect_okay(sk))). I think users
> should know if the sk is a valid sock or not and this is just catching the
> error case. Any opinion?
>
> Otherwise looks good.

Both ideas SGTM. Will incorporate into next version. Thanks!

-jkbs
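
For readers following the state checks discussed above, here is a small
user-space sketch of the bitmask test used by sock_map_update_okay() and
sock_map_redirect_okay(). The toy_ and TOY_ names are hypothetical; the
state numbers mirror include/net/tcp_states.h, and only the states needed
for the example are listed. It is an illustration under those assumptions,
not what v3 will look like.

```c
#include <assert.h>
#include <stdbool.h>

/* State numbers as in include/net/tcp_states.h (subset). */
enum {
	TOY_TCP_ESTABLISHED = 1,
	TOY_TCP_SYN_RECV    = 3,
	TOY_TCP_CLOSE       = 7,
	TOY_TCP_LISTEN      = 10,
};

#define TOY_TCPF_ESTABLISHED (1u << TOY_TCP_ESTABLISHED)
#define TOY_TCPF_SYN_RECV    (1u << TOY_TCP_SYN_RECV)
#define TOY_TCPF_LISTEN      (1u << TOY_TCP_LISTEN)

/* Turn a state number into its one-bit flag. */
static unsigned int toy_state_flag(int sk_state)
{
	assert(sk_state >= 0 && sk_state < 31);
	return 1u << sk_state;
}

/* May a socket in this state be inserted into the map? */
static bool toy_update_okay(int sk_state)
{
	return toy_state_flag(sk_state) & (TOY_TCPF_ESTABLISHED |
					   TOY_TCPF_SYN_RECV |
					   TOY_TCPF_LISTEN);
}

/* May we redirect to a socket in this state? Listeners are excluded. */
static bool toy_redirect_okay(int sk_state)
{
	return toy_state_flag(sk_state) & (TOY_TCPF_ESTABLISHED |
					   TOY_TCPF_SYN_RECV);
}
```

A single AND against a mask of allowed states replaces a chain of equality
comparisons, which is why adding TCPF_LISTEN to the update mask is a
one-line change.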


* Re: [PATCH bpf-next v2 11/11] selftests/bpf: Tests for SOCKMAP holding listening sockets
  2020-01-12  1:06   ` John Fastabend
@ 2020-01-13 15:58     ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-13 15:58 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

On Sun, Jan 12, 2020 at 02:06 AM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> Now that SOCKMAP can store listening sockets, user-space and BPF API is
>> open to a new set of potential pitfalls. Exercise the map operations (with
>> extra attention to code paths susceptible to races between map ops and
>> socket cloning), and BPF helpers that work with SOCKMAP to gain confidence
>> that all works as expected.
>> 
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>
> [...]
>
>> +static void test_sockmap_insert_listening(int family, int sotype, int mapfd)
>> +{
>> +	u64 value;
>> +	u32 key;
>> +	int s;
>> +
>> +	s = listen_loopback(family, sotype);
>> +	if (s < 0)
>> +		return;
>
> Will the test be marked OK if listen fails here? Should we mark it skipped or
> maybe even failed? Just concerned it may be passing even if the update doesn't
> actually happen.

Yes, it will be marked as failed if we don't succeed in creating a
listening socket. The listen_loopback helper uses x{socket,bind,listen}
wrappers, which in turn use the CHECK_FAIL macro to fail the test.

Thanks for going through this series till the end :-)

-jkbs
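
To illustrate the wrapper pattern mentioned above, a minimal sketch follows.
It is not the selftest code: TOY_CHECK_FAIL, toy_failures and toy_xsocket
are hypothetical stand-ins for the CHECK_FAIL macro and the
x{socket,bind,listen} helpers, assuming only that a failed call has to mark
the test as failed while still returning the error to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Stand-in for the test runner's per-test failure state. */
static int toy_failures;

/* Evaluates to the condition; a true condition records a failure. */
#define TOY_CHECK_FAIL(cond) ((cond) ? (toy_failures++, 1) : 0)

/* socket() wrapper that fails the test when the call fails. */
static int toy_xsocket(int family, int sotype, int protocol)
{
	int fd = socket(family, sotype, protocol);

	if (TOY_CHECK_FAIL(fd == -1))
		fprintf(stderr, "socket: %s\n", strerror(errno));
	return fd;
}
```

With wrappers like this, an early "if (s < 0) return;" in a test body cannot
hide a failure, because the failure was already recorded inside the wrapper.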


* Re: [PATCH bpf-next v2 06/11] bpf, sockmap: Don't set up sockmap progs for listening sockets
  2020-01-12  1:07     ` John Fastabend
@ 2020-01-13 17:59       ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-13 17:59 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

On Sun, Jan 12, 2020 at 02:07 AM CET, John Fastabend wrote:
> John Fastabend wrote:
>> Jakub Sitnicki wrote:
>> > Now that sockmap can hold listening sockets, when setting up the psock we
>> > will (i) grab references to verdict/parser progs, and (ii) override socket
>> > upcalls sk_data_ready and sk_write_space.
>> >
>> > We cannot redirect to listening sockets so we don't need to link the socket
>> > to the BPF progs, but more importantly we don't want the listening socket
>> > to have overridden upcalls because they would get inherited by child
>> > sockets cloned from it.
>> >
>> > Introduce a separate initialization path for listening sockets that does
>> > not change the upcalls and ignores the BPF progs.
>> >
>> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> > ---
>> >  net/core/sock_map.c | 34 +++++++++++++++++++++++++++++++++-
>> >  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>>
>> Any reason support is added only for sock_map types? We can also support
>> sock_hash, I presume? Could be a follow-up patch I guess, but if it's not
>> too much trouble it would be worth adding now vs. trying to detect at run
>> time later. I think it should be as simple as using similar logic as
>> below in sock_hash_update_common.
>>
>> Thanks.
>
> After running through the other patches I think it's probably OK to do hash
> support as a follow-up. Up to you.

Yes, preferably. This series is already into double digits.



* Re: [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged
  2020-01-10 10:50 ` [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
  2020-01-11 23:38   ` John Fastabend
  2020-01-12 12:55   ` kbuild test robot
@ 2020-01-13 20:15   ` Martin Lau
  2020-01-14 16:04     ` Jakub Sitnicki
  2 siblings, 1 reply; 49+ messages in thread
From: Martin Lau @ 2020-01-13 20:15 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Fri, Jan 10, 2020 at 11:50:19AM +0100, Jakub Sitnicki wrote:
> sk_user_data can hold a pointer to an object that is not intended to be
> shared between the parent socket and the child that gets a pointer copy on
> clone. This is the case when sk_user_data points at reference-counted
> object, like struct sk_psock.
> 
> One way to resolve it is to tag the pointer with a no-copy flag by
> repurposing its lowest bit. Based on the bit-flag value we clear the child
> sk_user_data pointer after cloning the parent socket.
LGTM.  One nit: WARN_ON_ONCE should be enough for all the cases, if they
would ever happen.  Having a continuous splat on the same thing is not
necessarily useful, while it could be quite disturbing for people who
capture/log them.

Acked-by: Martin KaFai Lau <kafai@fb.com>
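
The low-bit tagging described in the quoted commit message can be sketched
in user space as follows. This is an illustration only: the toy_ helpers and
TOY_NOCOPY_FLAG are hypothetical names, and the sketch assumes the pointed-to
object is at least 2-byte aligned so that bit 0 of its address is always
zero and free to carry the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NOCOPY_FLAG ((uintptr_t)1)

struct toy_sock { void *sk_user_data; };

/* Store ptr with the no-copy tag set in its lowest bit. */
static void toy_assign_nocopy(struct toy_sock *sk, void *ptr)
{
	/* Bit 0 must be free, i.e. the object is at least 2-byte aligned. */
	assert(((uintptr_t)ptr & TOY_NOCOPY_FLAG) == 0);
	sk->sk_user_data = (void *)((uintptr_t)ptr | TOY_NOCOPY_FLAG);
}

static bool toy_is_nocopy(const struct toy_sock *sk)
{
	return (uintptr_t)sk->sk_user_data & TOY_NOCOPY_FLAG;
}

/* Strip the tag to get the usable pointer back. */
static void *toy_user_data(const struct toy_sock *sk)
{
	return (void *)((uintptr_t)sk->sk_user_data & ~TOY_NOCOPY_FLAG);
}

/* Clone: copy everything, then drop the pointer if it was tagged. */
static void toy_clone(struct toy_sock *newsk, const struct toy_sock *sk)
{
	*newsk = *sk;
	if (toy_is_nocopy(newsk))
		newsk->sk_user_data = NULL;
}
```

Every reader of the field has to go through the tag-stripping accessor;
dereferencing the raw tagged value would be a bug.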


* Re: [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-01-10 10:50 ` [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
                     ` (2 preceding siblings ...)
  2020-01-11 23:48   ` John Fastabend
@ 2020-01-13 22:23   ` Martin Lau
  2020-01-13 22:42     ` Jakub Sitnicki
  3 siblings, 1 reply; 49+ messages in thread
From: Martin Lau @ 2020-01-13 22:23 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Fri, Jan 10, 2020 at 11:50:20AM +0100, Jakub Sitnicki wrote:
> Prepare for cloning listening sockets that have their protocol callbacks
> overridden by sk_msg. Child sockets must not inherit parent callbacks that
> access state stored in sk_user_data owned by the parent.
> 
> Restore the child socket protocol callbacks before it gets hashed and
> any of the callbacks can get invoked.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  include/net/tcp.h        |  1 +
>  net/ipv4/tcp_bpf.c       | 13 +++++++++++++
>  net/ipv4/tcp_minisocks.c |  2 ++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 9dd975be7fdf..7cbf9465bb10 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2181,6 +2181,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  		    int nonblock, int flags, int *addr_len);
>  int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
>  		      struct msghdr *msg, int len, int flags);
> +void tcp_bpf_clone(const struct sock *sk, struct sock *child);
>  
>  /* Call BPF_SOCK_OPS program that returns an int. If the return value
>   * is < 0, then the BPF op failed (for example if the loaded BPF
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index f6c83747c71e..6f96320fb7cf 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -586,6 +586,19 @@ static void tcp_bpf_close(struct sock *sk, long timeout)
>  	saved_close(sk, timeout);
>  }
>  
> +/* If a child got cloned from a listening socket that had tcp_bpf
> + * protocol callbacks installed, we need to restore the callbacks to
> + * the default ones because the child does not inherit the psock state
> + * that tcp_bpf callbacks expect.
> + */
> +void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
> +{
> +	struct proto *prot = newsk->sk_prot;
> +
> +	if (prot->recvmsg == tcp_bpf_recvmsg)
A question not related to this patch (maybe it is more for patch 6).

How may tcp_bpf_recvmsg be used for a listening sock (sk here)?

> +		newsk->sk_prot = sk->sk_prot_creator;
> +}
> +


* Re: [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-01-11 23:48   ` John Fastabend
@ 2020-01-13 22:31     ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-13 22:31 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

On Sun, Jan 12, 2020 at 12:48 AM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> Prepare for cloning listening sockets that have their protocol callbacks
>> overridden by sk_msg. Child sockets must not inherit parent callbacks that
>> access state stored in sk_user_data owned by the parent.
>> 
>> Restore the child socket protocol callbacks before it gets hashed and
>> any of the callbacks can get invoked.
>> 
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>  include/net/tcp.h        |  1 +
>>  net/ipv4/tcp_bpf.c       | 13 +++++++++++++
>>  net/ipv4/tcp_minisocks.c |  2 ++
>>  3 files changed, 16 insertions(+)
>> 
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 9dd975be7fdf..7cbf9465bb10 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -2181,6 +2181,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>>  		    int nonblock, int flags, int *addr_len);
>>  int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
>>  		      struct msghdr *msg, int len, int flags);
>> +void tcp_bpf_clone(const struct sock *sk, struct sock *child);
>>  
>>  /* Call BPF_SOCK_OPS program that returns an int. If the return value
>>   * is < 0, then the BPF op failed (for example if the loaded BPF
>> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
>> index f6c83747c71e..6f96320fb7cf 100644
>> --- a/net/ipv4/tcp_bpf.c
>> +++ b/net/ipv4/tcp_bpf.c
>> @@ -586,6 +586,19 @@ static void tcp_bpf_close(struct sock *sk, long timeout)
>>  	saved_close(sk, timeout);
>>  }
>>  
>> +/* If a child got cloned from a listening socket that had tcp_bpf
>> + * protocol callbacks installed, we need to restore the callbacks to
>> + * the default ones because the child does not inherit the psock state
>> + * that tcp_bpf callbacks expect.
>> + */
>> +void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
>> +{
>> +	struct proto *prot = newsk->sk_prot;
>> +
>> +	if (prot->recvmsg == tcp_bpf_recvmsg)
>> +		newsk->sk_prot = sk->sk_prot_creator;
>> +}
>> +
>
> ^^^^ probably needs to go into tcp.h wrapped in ifdef NET_SOCK_MSG with
> a stub for ifndef NET_SOCK_MSG case.
>
> Looks like build bot also caught this.

Oops, I need to add NET_SOCK_MSG to my build matrix :-)
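
The declaration-plus-stub layout suggested above could look roughly like
this. It is a self-contained sketch, not the actual header change:
TOY_NET_SOCK_MSG stands in for CONFIG_NET_SOCK_MSG and the toy_ types and
functions are hypothetical. The macro is left undefined here, so the stub
branch is the one that gets compiled.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock { const void *sk_prot; };

#ifdef TOY_NET_SOCK_MSG
/* Real implementation lives in a separate file built only with the option. */
void toy_bpf_clone(const struct toy_sock *sk, struct toy_sock *newsk);
#else
/* Option disabled: an empty inline stub keeps callers free of #ifdefs. */
static inline void toy_bpf_clone(const struct toy_sock *sk,
				 struct toy_sock *newsk)
{
	(void)sk;
	(void)newsk;
}
#endif

/* Caller in the clone path; it compiles the same way in both configs. */
static void toy_create_child(const struct toy_sock *sk,
			     struct toy_sock *newsk)
{
	*newsk = *sk;
	toy_bpf_clone(sk, newsk);
}
```

The compiler removes the empty stub entirely, so the disabled configuration
pays nothing for the call.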


* Re: [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-01-13 22:23   ` Martin Lau
@ 2020-01-13 22:42     ` Jakub Sitnicki
  2020-01-13 23:23       ` Martin Lau
  0 siblings, 1 reply; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-13 22:42 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Mon, Jan 13, 2020 at 11:23 PM CET, Martin Lau wrote:
> On Fri, Jan 10, 2020 at 11:50:20AM +0100, Jakub Sitnicki wrote:
>> Prepare for cloning listening sockets that have their protocol callbacks
>> overridden by sk_msg. Child sockets must not inherit parent callbacks that
>> access state stored in sk_user_data owned by the parent.
>>
>> Restore the child socket protocol callbacks before it gets hashed and
>> any of the callbacks can get invoked.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>  include/net/tcp.h        |  1 +
>>  net/ipv4/tcp_bpf.c       | 13 +++++++++++++
>>  net/ipv4/tcp_minisocks.c |  2 ++
>>  3 files changed, 16 insertions(+)
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 9dd975be7fdf..7cbf9465bb10 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -2181,6 +2181,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>>  		    int nonblock, int flags, int *addr_len);
>>  int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
>>  		      struct msghdr *msg, int len, int flags);
>> +void tcp_bpf_clone(const struct sock *sk, struct sock *child);
>>
>>  /* Call BPF_SOCK_OPS program that returns an int. If the return value
>>   * is < 0, then the BPF op failed (for example if the loaded BPF
>> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
>> index f6c83747c71e..6f96320fb7cf 100644
>> --- a/net/ipv4/tcp_bpf.c
>> +++ b/net/ipv4/tcp_bpf.c
>> @@ -586,6 +586,19 @@ static void tcp_bpf_close(struct sock *sk, long timeout)
>>  	saved_close(sk, timeout);
>>  }
>>
>> +/* If a child got cloned from a listening socket that had tcp_bpf
>> + * protocol callbacks installed, we need to restore the callbacks to
>> + * the default ones because the child does not inherit the psock state
>> + * that tcp_bpf callbacks expect.
>> + */
>> +void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
>> +{
>> +	struct proto *prot = newsk->sk_prot;
>> +
>> +	if (prot->recvmsg == tcp_bpf_recvmsg)
> A question not related to this patch (maybe it is more for patch 6).
>
> How may tcp_bpf_recvmsg be used for a listening sock (sk here)?

It can't be used. The comparison is just a way of checking whether the sock
has tcp_bpf callbacks installed, a check I copied from sk_psock_get_checked:

static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
{
	struct sk_psock *psock;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (psock) {
		if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
			psock = ERR_PTR(-EBUSY);
			goto out;
		}
        ...

This makes me think that perhaps it deserves a well-named helper.

>
>> +		newsk->sk_prot = sk->sk_prot_creator;
>> +}
>> +
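
One possible shape for the "well-named helper" mentioned above, modelled in
user space. This is a sketch under assumptions: all toy_ names are
hypothetical, struct toy_proto carries only the one callback that serves as
the marker, and the real helper (if one is added) may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_proto { int (*recvmsg)(void); };

struct toy_sock {
	const struct toy_proto *sk_prot;         /* callbacks in use */
	const struct toy_proto *sk_prot_creator; /* original callbacks */
};

static int toy_tcp_recvmsg(void)     { return 0; }
static int toy_tcp_bpf_recvmsg(void) { return 1; }

static const struct toy_proto toy_tcp_prot     = { .recvmsg = toy_tcp_recvmsg };
static const struct toy_proto toy_tcp_bpf_prot = { .recvmsg = toy_tcp_bpf_recvmsg };

/* The recvmsg pointer is only a marker for "tcp_bpf callbacks installed";
 * it is never called on a listening socket.
 */
static bool toy_sk_has_bpf_proto(const struct toy_sock *sk)
{
	return sk->sk_prot->recvmsg == toy_tcp_bpf_recvmsg;
}

/* Restore default callbacks in a child cloned from a tcp_bpf listener. */
static void toy_bpf_clone(const struct toy_sock *sk, struct toy_sock *newsk)
{
	if (toy_sk_has_bpf_proto(newsk))
		newsk->sk_prot = sk->sk_prot_creator;
}
```

Giving the comparison a name makes it clear at both call sites that the
test is about which callback set is installed, not about receiving data.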


* Re: [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall
  2020-01-10 10:50 ` [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
  2020-01-12  0:56   ` John Fastabend
@ 2020-01-13 23:12   ` Martin Lau
  2020-01-14  3:16     ` John Fastabend
  1 sibling, 1 reply; 49+ messages in thread
From: Martin Lau @ 2020-01-13 23:12 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Fri, Jan 10, 2020 at 11:50:23AM +0100, Jakub Sitnicki wrote:
> Tooling that populates the SOCKMAP with sockets from user-space needs a way
> to inspect its contents. Returning the struct sock * that SOCKMAP holds to
> user-space is neither safe nor useful. An approach established by
> REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
> instead.
> 
> Since socket cookies are u64 values SOCKMAP needs to support such a value
> size for lookup to be possible. This requires special handling on update,
> though. Attempts to do a lookup on SOCKMAP holding u32 values will be met
> with ENOSPC error.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/core/sock_map.c | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index d1a91e41ff82..3731191a7d1e 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -10,6 +10,7 @@
>  #include <linux/skmsg.h>
>  #include <linux/list.h>
>  #include <linux/jhash.h>
> +#include <linux/sock_diag.h>
>  
>  struct bpf_stab {
>  	struct bpf_map map;
> @@ -31,7 +32,8 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
>  		return ERR_PTR(-EPERM);
>  	if (attr->max_entries == 0 ||
>  	    attr->key_size    != 4 ||
> -	    attr->value_size  != 4 ||
> +	    (attr->value_size != sizeof(u32) &&
> +	     attr->value_size != sizeof(u64)) ||
>  	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
>  		return ERR_PTR(-EINVAL);
>  
> @@ -298,6 +300,23 @@ static void *sock_map_lookup(struct bpf_map *map, void *key)
>  	return ERR_PTR(-EOPNOTSUPP);
>  }
>  
> +static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
> +{
> +	struct sock *sk;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
This check seems unnecessary.  The function is only called by syscall.c, which
holds the rcu_read_lock().  Other than that,

Acked-by: Martin KaFai Lau <kafai@fb.com>

> +
> +	if (map->value_size != sizeof(u64))
> +		return ERR_PTR(-ENOSPC);
> +
> +	sk = __sock_map_lookup_elem(map, *(u32 *)key);
> +	if (!sk)
> +		return ERR_PTR(-ENOENT);
> +
> +	sock_gen_cookie(sk);
> +	return &sk->sk_cookie;
> +}
> +
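
The lookup rules from the commit message can be summarised in a small
user-space model. It is only a sketch: the toy_ types and functions are
hypothetical, the cookie generator is a plain counter rather than
sock_gen_cookie(), and the function returns an errno-style int with an
output parameter instead of a pointer or ERR_PTR() as in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct toy_sock { uint64_t sk_cookie; };

struct toy_sockmap {
	uint32_t value_size;     /* 4 or 8, fixed at map creation */
	uint32_t max_entries;
	struct toy_sock **socks; /* NULL means empty slot */
};

static uint64_t toy_next_cookie = 1;

/* Assign a cookie on first use; later calls return the same value. */
static uint64_t toy_gen_cookie(struct toy_sock *sk)
{
	if (!sk->sk_cookie)
		sk->sk_cookie = toy_next_cookie++;
	return sk->sk_cookie;
}

/* Returns 0 and stores the cookie, or a negative errno. */
static int toy_lookup_sys(const struct toy_sockmap *map, uint32_t key,
			  uint64_t *cookie)
{
	struct toy_sock *sk;

	/* A u64 cookie does not fit into a u32 value. */
	if (map->value_size != sizeof(uint64_t))
		return -ENOSPC;

	sk = key < map->max_entries ? map->socks[key] : NULL;
	if (!sk)
		return -ENOENT;

	*cookie = toy_gen_cookie(sk);
	return 0;
}
```

Since the cookie stays the same for the lifetime of the socket, tooling can
compare it with the value reported by other interfaces to identify which
socket sits in a given slot.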


* Re: [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-01-13 22:42     ` Jakub Sitnicki
@ 2020-01-13 23:23       ` Martin Lau
  0 siblings, 0 replies; 49+ messages in thread
From: Martin Lau @ 2020-01-13 23:23 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Mon, Jan 13, 2020 at 11:42:42PM +0100, Jakub Sitnicki wrote:
> On Mon, Jan 13, 2020 at 11:23 PM CET, Martin Lau wrote:
> > On Fri, Jan 10, 2020 at 11:50:20AM +0100, Jakub Sitnicki wrote:
> >> Prepare for cloning listening sockets that have their protocol callbacks
> >> overridden by sk_msg. Child sockets must not inherit parent callbacks that
> >> access state stored in sk_user_data owned by the parent.
> >>
> >> Restore the child socket protocol callbacks before it gets hashed and
> >> any of the callbacks can get invoked.
> >>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>  include/net/tcp.h        |  1 +
> >>  net/ipv4/tcp_bpf.c       | 13 +++++++++++++
> >>  net/ipv4/tcp_minisocks.c |  2 ++
> >>  3 files changed, 16 insertions(+)
> >>
> >> diff --git a/include/net/tcp.h b/include/net/tcp.h
> >> index 9dd975be7fdf..7cbf9465bb10 100644
> >> --- a/include/net/tcp.h
> >> +++ b/include/net/tcp.h
> >> @@ -2181,6 +2181,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> >>  		    int nonblock, int flags, int *addr_len);
> >>  int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
> >>  		      struct msghdr *msg, int len, int flags);
> >> +void tcp_bpf_clone(const struct sock *sk, struct sock *child);
> >>
> >>  /* Call BPF_SOCK_OPS program that returns an int. If the return value
> >>   * is < 0, then the BPF op failed (for example if the loaded BPF
> >> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> >> index f6c83747c71e..6f96320fb7cf 100644
> >> --- a/net/ipv4/tcp_bpf.c
> >> +++ b/net/ipv4/tcp_bpf.c
> >> @@ -586,6 +586,19 @@ static void tcp_bpf_close(struct sock *sk, long timeout)
> >>  	saved_close(sk, timeout);
> >>  }
> >>
> >> +/* If a child got cloned from a listening socket that had tcp_bpf
> >> + * protocol callbacks installed, we need to restore the callbacks to
> >> + * the default ones because the child does not inherit the psock state
> >> + * that tcp_bpf callbacks expect.
> >> + */
> >> +void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
> >> +{
> >> +	struct proto *prot = newsk->sk_prot;
> >> +
> >> +	if (prot->recvmsg == tcp_bpf_recvmsg)
> > A question not related to this patch (maybe it is more for patch 6).
> >
> > How may tcp_bpf_recvmsg be used for a listening sock (sk here)?
> 
> It can't be used. It's a way of checking if sock has tcp_bpf callbacks
> that I copied from sk_psock_get_checked:
I see.  It seems only tcp_bpf_close and tcp_bpf_unhash may be useful.
I asked because it made me wonder how tcp_bpf_recvmsg/sendmsg/etc.
may be used, since they are also set on the listening sk.

> 
> static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
> {
> 	struct sk_psock *psock;
> 
> 	rcu_read_lock();
> 	psock = sk_psock(sk);
> 	if (psock) {
> 		if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
> 			psock = ERR_PTR(-EBUSY);
> 			goto out;
> 		}
>         ...
> 
> This makes me think that perhaps it deserves a well-named helper.
> 
> >
> >> +		newsk->sk_prot = sk->sk_prot_creator;
> >> +}
> >> +


* Re: [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP
  2020-01-10 10:50 ` [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
  2020-01-12  1:00   ` John Fastabend
@ 2020-01-13 23:45   ` Martin Lau
  2020-01-15 12:41     ` Jakub Sitnicki
  2020-01-13 23:51   ` Martin Lau
  2 siblings, 1 reply; 49+ messages in thread
From: Martin Lau @ 2020-01-13 23:45 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Fri, Jan 10, 2020 at 11:50:25AM +0100, Jakub Sitnicki wrote:
> SOCKMAP now supports storing references to listening sockets. Nothing keeps
> us from using it as an array of sockets to select from in SK_REUSEPORT
> programs.
> 
> Whitelist the map type with the BPF helper for selecting socket.
> 
> The restriction that the socket has to be a member of a reuseport group
> still applies. Socket from a SOCKMAP that does not have sk_reuseport_cb set
> is not a valid target and we signal it with -EINVAL.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  kernel/bpf/verifier.c |  6 ++++--
>  net/core/filter.c     | 15 ++++++++++-----
>  2 files changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f5af759a8a5f..0ee5f1594b5c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3697,7 +3697,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		if (func_id != BPF_FUNC_sk_redirect_map &&
>  		    func_id != BPF_FUNC_sock_map_update &&
>  		    func_id != BPF_FUNC_map_delete_elem &&
> -		    func_id != BPF_FUNC_msg_redirect_map)
> +		    func_id != BPF_FUNC_msg_redirect_map &&
> +		    func_id != BPF_FUNC_sk_select_reuseport)
>  			goto error;
>  		break;
>  	case BPF_MAP_TYPE_SOCKHASH:
> @@ -3778,7 +3779,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  			goto error;
>  		break;
>  	case BPF_FUNC_sk_select_reuseport:
> -		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
> +		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
> +		    map->map_type != BPF_MAP_TYPE_SOCKMAP)
>  			goto error;
>  		break;
>  	case BPF_FUNC_map_peek_elem:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a702761ef369..c79c62a54167 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -8677,6 +8677,7 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
>  BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>  	   struct bpf_map *, map, void *, key, u32, flags)
>  {
> +	bool is_sockarray = map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
>  	struct sock_reuseport *reuse;
>  	struct sock *selected_sk;
>  
> @@ -8685,12 +8686,16 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>  		return -ENOENT;
>  
>  	reuse = rcu_dereference(selected_sk->sk_reuseport_cb);
> -	if (!reuse)
> -		/* selected_sk is unhashed (e.g. by close()) after the
> -		 * above map_lookup_elem().  Treat selected_sk has already
> -		 * been removed from the map.
> +	if (!reuse) {
> +		/* reuseport_array has only sk with non NULL sk_reuseport_cb.
> +		 * The only (!reuse) case here is - the sk has already been
> +		 * unhashed (e.g. by close()), so treat it as -ENOENT.
> +		 *
> +		 * Other maps (e.g. sock_map) do not provide this guarantee and
> +		 * the sk may never be in the reuseport group to begin with.
>  		 */
> -		return -ENOENT;
> +		return is_sockarray ? -ENOENT : -EINVAL;
> +	}
>  
>  	if (unlikely(reuse->reuseport_id != reuse_kern->reuseport_id)) {
I guess the later testing patch passed because reuseport_id is initialized to 0.

Note that in reuseport_array, reuseport_get_id() is called at update_elem() to
init the reuse->reuseport_id.  It was done there because reuseport_array
was the only one requiring reuseport_id.  It is to ensure the bpf_prog
cannot accidentally use a sk from another reuseport-group.

The same has to be done in patch 5, or maybe consider
moving it to reuseport_alloc() itself.
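
To illustrate why the check can pass by accident, here is a small userspace
model. It is a sketch under assumptions: the id counter, the function names
reuseport_alloc_with_id() and same_reuseport_group(), and the reduced struct
are made up for the example; only the reuseport_id comparison mirrors the
quoted code.

```c
#include <stdbool.h>

/* Userspace model of the one field that matters for the group check. */
struct sock_reuseport {
	unsigned int reuseport_id; /* 0 means "never assigned" */
};

/* Assumed id source; the kernel allocates ids differently. */
static unsigned int next_reuseport_id = 1;

/* Variant suggested above: assign the id when the group is allocated. */
void reuseport_alloc_with_id(struct sock_reuseport *reuse)
{
	reuse->reuseport_id = next_reuseport_id++;
}

/* The group check from sk_select_reuseport(), reduced to the comparison. */
bool same_reuseport_group(const struct sock_reuseport *selected,
			  const struct sock_reuseport *running)
{
	return selected->reuseport_id == running->reuseport_id;
}
```

Two groups whose ids were never assigned both carry 0, so they compare equal
and a socket from a foreign group is accepted. Once every group gets its id at
allocation time, the comparison tells them apart.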

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP
  2020-01-10 10:50 ` [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
  2020-01-12  1:00   ` John Fastabend
  2020-01-13 23:45   ` Martin Lau
@ 2020-01-13 23:51   ` Martin Lau
  2020-01-15 12:57     ` Jakub Sitnicki
  2 siblings, 1 reply; 49+ messages in thread
From: Martin Lau @ 2020-01-13 23:51 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Fri, Jan 10, 2020 at 11:50:25AM +0100, Jakub Sitnicki wrote:
> SOCKMAP now supports storing references to listening sockets. Nothing keeps
> us from using it as an array of sockets to select from in SK_REUSEPORT
> programs.
> 
> Whitelist the map type with the BPF helper for selecting socket.
> 
> The restriction that the socket has to be a member of a reuseport group
> still applies. Socket from a SOCKMAP that does not have sk_reuseport_cb set
> is not a valid target and we signal it with -EINVAL.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  kernel/bpf/verifier.c |  6 ++++--
>  net/core/filter.c     | 15 ++++++++++-----
>  2 files changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f5af759a8a5f..0ee5f1594b5c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3697,7 +3697,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  		if (func_id != BPF_FUNC_sk_redirect_map &&
>  		    func_id != BPF_FUNC_sock_map_update &&
>  		    func_id != BPF_FUNC_map_delete_elem &&
> -		    func_id != BPF_FUNC_msg_redirect_map)
> +		    func_id != BPF_FUNC_msg_redirect_map &&
> +		    func_id != BPF_FUNC_sk_select_reuseport)
>  			goto error;
>  		break;
>  	case BPF_MAP_TYPE_SOCKHASH:
> @@ -3778,7 +3779,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>  			goto error;
>  		break;
>  	case BPF_FUNC_sk_select_reuseport:
> -		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
> +		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
> +		    map->map_type != BPF_MAP_TYPE_SOCKMAP)
>  			goto error;
>  		break;
>  	case BPF_FUNC_map_peek_elem:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a702761ef369..c79c62a54167 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -8677,6 +8677,7 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
>  BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>  	   struct bpf_map *, map, void *, key, u32, flags)
>  {
> +	bool is_sockarray = map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
A nit.
Since map_type is tested, reuseport_array_lookup_elem() or sock_map_lookup()
can also be called directly.  Mostly for consideration; I will not insist.


>  	struct sock_reuseport *reuse;
>  	struct sock *selected_sk;
>  

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone
  2020-01-13 15:09     ` Jakub Sitnicki
@ 2020-01-14  3:14       ` John Fastabend
  2020-01-20 17:00       ` John Fastabend
  1 sibling, 0 replies; 49+ messages in thread
From: John Fastabend @ 2020-01-14  3:14 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

Jakub Sitnicki wrote:
> On Sun, Jan 12, 2020 at 12:14 AM CET, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> sk_msg and ULP frameworks override protocol callbacks pointer in
> >> sk->sk_prot, while TCP accesses it locklessly when cloning the listening
> >> socket.
> >>
> >> Once we enable use of listening sockets with sockmap (and hence sk_msg),
> >> there can be shared access to sk->sk_prot if socket is getting cloned while
> >> being inserted/deleted to/from the sockmap from another CPU. Mark the
> >> shared access with READ_ONCE/WRITE_ONCE annotations.
> >>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >
> > In sockmap side I fixed this by wrapping the access in a lock_sock[0]. So
> > Do you think this is still needed with that in mind? The bpf_clone call
> > is using sk_prot_creater and also setting the newsk's proto field. Even
> > if the listening parent sock was being deleted in parallel would that be
> > a problem? We don't touch sk_prot_creator from the tear down path. I've
> > only scanned the 3..11 patches so maybe the answer is below. If that is
> > the case probably an improved commit message would be helpful.
> 
> I think it is needed. Not because of tcp_bpf_clone or that we access
> listener's sk_prot_creator from there, if I'm grasping your question.
> 
> Either way I'm glad this came up. Let's go though my reasoning and
> verify it. tcp stack accesses the listener sk_prot while cloning it:
> 
> tcp_v4_rcv
>   sk = __inet_lookup_skb(...)
>   tcp_check_req(sk)
>     inet_csk(sk)->icsk_af_ops->syn_recv_sock
>       tcp_v4_syn_recv_sock
>         tcp_create_openreq_child
>           inet_csk_clone_lock
>             sk_clone_lock
>               READ_ONCE(sk->sk_prot)
> 
> It grabs a reference to the listener, but doesn't grab the sk_lock.
> 
> On another CPU we can be inserting/removing the listener socket from the
> sockmap and writing to its sk_prot. We have the update and the remove
> path:
> 
> sock_map_ops->map_update_elem
>   sock_map_update_elem
>     sock_map_update_common
>       sock_map_link_no_progs
>         tcp_bpf_init
>           tcp_bpf_update_sk_prot
>             sk_psock_update_proto
>               WRITE_ONCE(sk->sk_prot, ops)
> 
> sock_map_ops->map_delete_elem
>   sock_map_delete_elem
>     __sock_map_delete
>      sock_map_unref
>        sk_psock_put
>          sk_psock_drop
>            sk_psock_restore_proto
>              tcp_update_ulp
>                WRITE_ONCE(sk->sk_prot, proto)
> 
> Following the guidelines from KTSAN project [0], sk_prot looks like a
> candidate for annotating it. At least on these 3 call paths.
> 
> If that sounds correct, I can add it to the patch description.
> 

Logic looks correct to me, thanks for the details. Please put those in
the commit message so we don't lose them. Can you also add a comment where it
makes most sense in the code? This is a bit subtle and we don't want
to miss it later. Probably in tcp_update_ulp near that WRITE_ONCE would
do. It doesn't need to be too verbose but something as simple as,

"{WRITE|READ}_ONCE wrappers needed around sk_prot to protect unlocked
 reads in sk_clone_lock"

> Thanks,
> -jkbs
> 
> [0] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE
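
For readers less familiar with the annotations, the pairing discussed above
looks roughly like this in a userspace model. The macros below are simplified
volatile-cast versions, not the kernel definitions, and sk_set_proto() and
sk_clone_pick_proto() are hypothetical names standing in for the writer and
reader paths listed in the call chains.

```c
/* Simplified userspace versions; the kernel macros are more elaborate. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct proto { const char *name; };
struct sock { struct proto *sk_prot; };

struct proto tcp_prot = { "TCP" };
struct proto tcp_bpf_prot = { "TCP+BPF" };

/* Writer side, standing in for sk_psock_update_proto()/tcp_update_ulp(). */
void sk_set_proto(struct sock *sk, struct proto *ops)
{
	/* WRITE_ONCE pairs with the unlocked READ_ONCE in the clone path. */
	WRITE_ONCE(sk->sk_prot, ops);
}

/* Reader side, standing in for sk_clone_lock(): load the pointer exactly
 * once and keep using the local copy afterwards.
 */
struct proto *sk_clone_pick_proto(struct sock *sk)
{
	struct proto *prot = READ_ONCE(sk->sk_prot);

	return prot;
}
```

The point of the reader keeping a local copy is that every later use sees the
same struct proto, even if another CPU swaps sk->sk_prot in the meantime.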



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall
  2020-01-13 23:12   ` Martin Lau
@ 2020-01-14  3:16     ` John Fastabend
  2020-01-14 15:48       ` Jakub Sitnicki
  0 siblings, 1 reply; 49+ messages in thread
From: John Fastabend @ 2020-01-14  3:16 UTC (permalink / raw)
  To: Martin Lau, Jakub Sitnicki
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

Martin Lau wrote:
> On Fri, Jan 10, 2020 at 11:50:23AM +0100, Jakub Sitnicki wrote:
> > Tooling that populates the SOCKMAP with sockets from user-space needs a way
> > to inspect its contents. Returning the struct sock * that SOCKMAP holds to
> > user-space is neither safe nor useful. An approach established by
> > REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
> > instead.
> > 
> > Since socket cookies are u64 values SOCKMAP needs to support such a value
> > size for lookup to be possible. This requires special handling on update,
> > though. Attempts to do a lookup on SOCKMAP holding u32 values will be met
> > with ENOSPC error.
> > 
> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> > ---

[...]
 
> > +static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
> > +{
> > +	struct sock *sk;
> > +
> > +	WARN_ON_ONCE(!rcu_read_lock_held());
> It seems unnecessary.  It is only called by syscall.c which
> holds the rcu_read_lock().  Other than that,
> 

+1 drop it. The normal rcu annotations/splats should catch anything here.

> Acked-by: Martin KaFai Lau <kafai@fb.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall
  2020-01-14  3:16     ` John Fastabend
@ 2020-01-14 15:48       ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-14 15:48 UTC (permalink / raw)
  To: Martin Lau, John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer

On Tue, Jan 14, 2020 at 04:16 AM CET, John Fastabend wrote:
> Martin Lau wrote:
>> On Fri, Jan 10, 2020 at 11:50:23AM +0100, Jakub Sitnicki wrote:
>> > Tooling that populates the SOCKMAP with sockets from user-space needs a way
>> > to inspect its contents. Returning the struct sock * that SOCKMAP holds to
>> > user-space is neither safe nor useful. An approach established by
>> > REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
>> > instead.
>> >
>> > Since socket cookies are u64 values SOCKMAP needs to support such a value
>> > size for lookup to be possible. This requires special handling on update,
>> > though. Attempts to do a lookup on SOCKMAP holding u32 values will be met
>> > with ENOSPC error.
>> >
>> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> > ---
>
> [...]
>
>> > +static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
>> > +{
>> > +	struct sock *sk;
>> > +
>> > +	WARN_ON_ONCE(!rcu_read_lock_held());
>> It seems unnecessary.  It is only called by syscall.c which
>> holds the rcu_read_lock().  Other than that,
>>
>
> +1 drop it. The normal rcu annotations/splats should catch anything
> here.

Oh, okay. Thanks for pointing it out.

I noticed __sock_map_lookup_elem called from sock_map_lookup_sys has the
same WARN_ON_ONCE check. Looks like it can be cleaned up.

Granted, __sock_map_lookup_elem also gets invoked by sockmap BPF helpers
for redirecting (bpf_msg_redirect_map, bpf_sk_redirect_map). But we
always run sk_skb and sk_msg progs with the RCU read lock held.
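
As a side note on the quoted patch description, the value-size rule for the
syscall lookup can be modelled in userspace as below. This is a sketch with
made-up names (model_map, model_sock, model_lookup_cookie) and a one-slot
"map"; the order of the two checks is an assumption, only the ENOSPC outcome
for u32 values comes from the description.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Userspace model of a one-slot map; all names are illustrative. */
struct model_sock {
	uint64_t cookie;
};

struct model_map {
	uint32_t value_size;     /* 4 or 8, fixed at map creation */
	struct model_sock *slot; /* NULL when the slot is empty */
};

/* Copy the socket cookie to the caller, as a syscall-side lookup might. */
int model_lookup_cookie(const struct model_map *map, void *value)
{
	if (map->value_size != sizeof(uint64_t))
		return -ENOSPC; /* a u32 value cannot hold a u64 cookie */
	if (!map->slot)
		return -ENOENT;
	memcpy(value, &map->slot->cookie, sizeof(uint64_t));
	return 0;
}
```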

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged
  2020-01-13 20:15   ` Martin Lau
@ 2020-01-14 16:04     ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-14 16:04 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Mon, Jan 13, 2020 at 09:15 PM CET, Martin Lau wrote:
> On Fri, Jan 10, 2020 at 11:50:19AM +0100, Jakub Sitnicki wrote:
>> sk_user_data can hold a pointer to an object that is not intended to be
>> shared between the parent socket and the child that gets a pointer copy on
>> clone. This is the case when sk_user_data points at reference-counted
>> object, like struct sk_psock.
>> 
>> One way to resolve it is to tag the pointer with a no-copy flag by
>> repurposing its lowest bit. Based on the bit-flag value we clear the child
>> sk_user_data pointer after cloning the parent socket.
> LGTM.  One nit, WARN_ON_ONCE should be enough for all the cases if they
> would ever happen.  Having continuous splat on the same thing is not
> necessary useful while it could be quite distributing for people
> capture/log them.

Will switch to WARN_ON_ONCE in v3. Thanks for the review!

>
> Acked-by: Martin KaFai Lau <kafai@fb.com>
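
The pointer-tagging scheme from the patch description can be sketched in
userspace like this. The flag name SK_USER_DATA_NOCOPY and the three helper
names are assumptions made for the example, and the struct is reduced to the
one field involved. The scheme relies on the stored object being at least
2-byte aligned, so that the lowest bit of its address is always free.

```c
#include <stdint.h>

/* Flag kept in the lowest pointer bit; name assumed for this sketch. */
#define SK_USER_DATA_NOCOPY ((uintptr_t)1)

struct sock {
	uintptr_t sk_user_data; /* pointer value plus optional flag bit */
};

/* Store ptr and mark it as "do not copy to the child on clone". */
void sk_set_user_data_nocopy(struct sock *sk, void *ptr)
{
	sk->sk_user_data = (uintptr_t)ptr | SK_USER_DATA_NOCOPY;
}

/* Strip the flag bit to get the real pointer back. */
void *sk_user_data_ptr(const struct sock *sk)
{
	return (void *)(sk->sk_user_data & ~SK_USER_DATA_NOCOPY);
}

/* Copy the parent, then clear the child's pointer if it was tagged. */
void sk_clone(const struct sock *sk, struct sock *newsk)
{
	*newsk = *sk;
	if (newsk->sk_user_data & SK_USER_DATA_NOCOPY)
		newsk->sk_user_data = 0;
}
```

Untagged users of sk_user_data keep the old behaviour, since their pointer is
copied to the child unchanged.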


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP
  2020-01-13 23:45   ` Martin Lau
@ 2020-01-15 12:41     ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-15 12:41 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Tue, Jan 14, 2020 at 12:45 AM CET, Martin Lau wrote:
> On Fri, Jan 10, 2020 at 11:50:25AM +0100, Jakub Sitnicki wrote:
>> SOCKMAP now supports storing references to listening sockets. Nothing keeps
>> us from using it as an array of sockets to select from in SK_REUSEPORT
>> programs.
>>
>> Whitelist the map type with the BPF helper for selecting socket.
>>
>> The restriction that the socket has to be a member of a reuseport group
>> still applies. Socket from a SOCKMAP that does not have sk_reuseport_cb set
>> is not a valid target and we signal it with -EINVAL.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>  kernel/bpf/verifier.c |  6 ++++--
>>  net/core/filter.c     | 15 ++++++++++-----
>>  2 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f5af759a8a5f..0ee5f1594b5c 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -3697,7 +3697,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>>  		if (func_id != BPF_FUNC_sk_redirect_map &&
>>  		    func_id != BPF_FUNC_sock_map_update &&
>>  		    func_id != BPF_FUNC_map_delete_elem &&
>> -		    func_id != BPF_FUNC_msg_redirect_map)
>> +		    func_id != BPF_FUNC_msg_redirect_map &&
>> +		    func_id != BPF_FUNC_sk_select_reuseport)
>>  			goto error;
>>  		break;
>>  	case BPF_MAP_TYPE_SOCKHASH:
>> @@ -3778,7 +3779,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>>  			goto error;
>>  		break;
>>  	case BPF_FUNC_sk_select_reuseport:
>> -		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
>> +		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
>> +		    map->map_type != BPF_MAP_TYPE_SOCKMAP)
>>  			goto error;
>>  		break;
>>  	case BPF_FUNC_map_peek_elem:
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index a702761ef369..c79c62a54167 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -8677,6 +8677,7 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
>>  BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>>  	   struct bpf_map *, map, void *, key, u32, flags)
>>  {
>> +	bool is_sockarray = map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
>>  	struct sock_reuseport *reuse;
>>  	struct sock *selected_sk;
>>
>> @@ -8685,12 +8686,16 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>>  		return -ENOENT;
>>
>>  	reuse = rcu_dereference(selected_sk->sk_reuseport_cb);
>> -	if (!reuse)
>> -		/* selected_sk is unhashed (e.g. by close()) after the
>> -		 * above map_lookup_elem().  Treat selected_sk has already
>> -		 * been removed from the map.
>> +	if (!reuse) {
>> +		/* reuseport_array has only sk with non NULL sk_reuseport_cb.
>> +		 * The only (!reuse) case here is - the sk has already been
>> +		 * unhashed (e.g. by close()), so treat it as -ENOENT.
>> +		 *
>> +		 * Other maps (e.g. sock_map) do not provide this guarantee and
>> +		 * the sk may never be in the reuseport group to begin with.
>>  		 */
>> -		return -ENOENT;
>> +		return is_sockarray ? -ENOENT : -EINVAL;
>> +	}
>>
>>  	if (unlikely(reuse->reuseport_id != reuse_kern->reuseport_id)) {
> I guess the later testing patch passed is because reuseport_id is init to 0.
>
> Note that in reuseport_array, reuseport_get_id() is called at update_elem() to
> init the reuse->reuseport_id.  It was done there because reuseport_array
> was the only one requiring reuseport_id.  It is to ensure the bpf_prog
> cannot accidentally use a sk from another reuseport-group.
>
> The same has to be done in patch 5 or may be considering to
> move it to reuseport_alloc() itself.

I see what you're saying.

With these patches, it is possible to redirect connections across
reuseport groups with reuseport BPF and sockmap. That should be
prohibited, to be consistent with sockarray: the redirect helper should
return an error.

Will try to pull up reuseport_id initialization to reuseport_alloc(),
and add a test for a sockmap with two listening sockets that belong to
different reuseport groups.

Thanks for catching this bug.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP
  2020-01-13 23:51   ` Martin Lau
@ 2020-01-15 12:57     ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-15 12:57 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, netdev, kernel-team, Eric Dumazet, John Fastabend, Lorenz Bauer

On Tue, Jan 14, 2020 at 12:51 AM CET, Martin Lau wrote:
> On Fri, Jan 10, 2020 at 11:50:25AM +0100, Jakub Sitnicki wrote:
>> SOCKMAP now supports storing references to listening sockets. Nothing keeps
>> us from using it as an array of sockets to select from in SK_REUSEPORT
>> programs.
>>
>> Whitelist the map type with the BPF helper for selecting socket.
>>
>> The restriction that the socket has to be a member of a reuseport group
>> still applies. Socket from a SOCKMAP that does not have sk_reuseport_cb set
>> is not a valid target and we signal it with -EINVAL.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>  kernel/bpf/verifier.c |  6 ++++--
>>  net/core/filter.c     | 15 ++++++++++-----
>>  2 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f5af759a8a5f..0ee5f1594b5c 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -3697,7 +3697,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>>  		if (func_id != BPF_FUNC_sk_redirect_map &&
>>  		    func_id != BPF_FUNC_sock_map_update &&
>>  		    func_id != BPF_FUNC_map_delete_elem &&
>> -		    func_id != BPF_FUNC_msg_redirect_map)
>> +		    func_id != BPF_FUNC_msg_redirect_map &&
>> +		    func_id != BPF_FUNC_sk_select_reuseport)
>>  			goto error;
>>  		break;
>>  	case BPF_MAP_TYPE_SOCKHASH:
>> @@ -3778,7 +3779,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>>  			goto error;
>>  		break;
>>  	case BPF_FUNC_sk_select_reuseport:
>> -		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
>> +		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
>> +		    map->map_type != BPF_MAP_TYPE_SOCKMAP)
>>  			goto error;
>>  		break;
>>  	case BPF_FUNC_map_peek_elem:
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index a702761ef369..c79c62a54167 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -8677,6 +8677,7 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
>>  BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
>>  	   struct bpf_map *, map, void *, key, u32, flags)
>>  {
>> +	bool is_sockarray = map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
> A nit.
> Since map_type is tested, reuseport_array_lookup_elem() or sock_map_lookup()
> can directly be called also.  mostly for consideration.  will not
> insist.

sock_map_lookup() isn't global currently.

If I'm following your thinking, you're suggesting an optimization
against retpoline overhead along the lines of INDIRECT_CALL_$n wrappers:

/*
 * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
 *  @f: function pointer
 *  @f$NR: builtin functions names, up to $NR of them
 *  @__VA_ARGS__: arguments for @f
 *
 * Avoid retpoline overhead for known builtin, checking @f vs each of them and
 * eventually invoking directly the builtin function. The functions are check
 * in the given order. Fallback to the indirect call.
 */
#define INDIRECT_CALL_1(f, f1, ...)					\
	({								\
		likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__);	\
	})
#define INDIRECT_CALL_2(f, f2, f1, ...)					\
	({								\
		likely(f == f2) ? f2(__VA_ARGS__) :			\
				  INDIRECT_CALL_1(f, f1, __VA_ARGS__);	\
	})

Will resist the temptation to optimize it as part of this series,
because the indirect call is already there.
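
For completeness, this is roughly how such a wrapper would be used. It is a
userspace sketch that needs GNU C (statement expressions, __builtin_expect);
lookup_a(), lookup_b() and do_lookup() are stand-ins invented for the example,
and the real lookup functions take a map and a key rather than an int.

```c
#define likely(x) __builtin_expect(!!(x), 1)

/* Same shape as the macro quoted above. */
#define INDIRECT_CALL_1(f, f1, ...)					\
	({								\
		likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__);	\
	})

/* Stand-in lookup functions with distinguishable results. */
int lookup_a(int key) { return key + 100; }
int lookup_b(int key) { return key + 200; }

typedef int (*lookup_fn)(int);

int do_lookup(lookup_fn fn, int key)
{
	/* Direct call when fn is lookup_a, indirect call otherwise. */
	return INDIRECT_CALL_1(fn, lookup_a, key);
}
```

The result is the same on both branches; only the call instruction differs,
which is where the retpoline cost would be saved.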

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone
  2020-01-13 15:09     ` Jakub Sitnicki
  2020-01-14  3:14       ` John Fastabend
@ 2020-01-20 17:00       ` John Fastabend
  2020-01-20 18:11         ` Jakub Sitnicki
  1 sibling, 1 reply; 49+ messages in thread
From: John Fastabend @ 2020-01-20 17:00 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

Jakub Sitnicki wrote:
> On Sun, Jan 12, 2020 at 12:14 AM CET, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> sk_msg and ULP frameworks override protocol callbacks pointer in
> >> sk->sk_prot, while TCP accesses it locklessly when cloning the listening
> >> socket.
> >>
> >> Once we enable use of listening sockets with sockmap (and hence sk_msg),
> >> there can be shared access to sk->sk_prot if socket is getting cloned while
> >> being inserted/deleted to/from the sockmap from another CPU. Mark the
> >> shared access with READ_ONCE/WRITE_ONCE annotations.
> >>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >
> > In sockmap side I fixed this by wrapping the access in a lock_sock[0]. So
> > Do you think this is still needed with that in mind? The bpf_clone call
> > is using sk_prot_creater and also setting the newsk's proto field. Even
> > if the listening parent sock was being deleted in parallel would that be
> > a problem? We don't touch sk_prot_creator from the tear down path. I've
> > only scanned the 3..11 patches so maybe the answer is below. If that is
> > the case probably an improved commit message would be helpful.
> 
> I think it is needed. Not because of tcp_bpf_clone or that we access
> listener's sk_prot_creator from there, if I'm grasping your question.
> 
> Either way I'm glad this came up. Let's go though my reasoning and
> verify it. tcp stack accesses the listener sk_prot while cloning it:
> 
> tcp_v4_rcv
>   sk = __inet_lookup_skb(...)
>   tcp_check_req(sk)
>     inet_csk(sk)->icsk_af_ops->syn_recv_sock
>       tcp_v4_syn_recv_sock
>         tcp_create_openreq_child
>           inet_csk_clone_lock
>             sk_clone_lock
>               READ_ONCE(sk->sk_prot)
> 
> It grabs a reference to the listener, but doesn't grab the sk_lock.
> 
> On another CPU we can be inserting/removing the listener socket from the
> sockmap and writing to its sk_prot. We have the update and the remove
> path:
> 
> sock_map_ops->map_update_elem
>   sock_map_update_elem
>     sock_map_update_common
>       sock_map_link_no_progs
>         tcp_bpf_init
>           tcp_bpf_update_sk_prot
>             sk_psock_update_proto
>               WRITE_ONCE(sk->sk_prot, ops)
> 
> sock_map_ops->map_delete_elem
>   sock_map_delete_elem
>     __sock_map_delete
>      sock_map_unref
>        sk_psock_put
>          sk_psock_drop
>            sk_psock_restore_proto
>              tcp_update_ulp
>                WRITE_ONCE(sk->sk_prot, proto)
> 
> Following the guidelines from KTSAN project [0], sk_prot looks like a
> candidate for annotating it. At least on these 3 call paths.
> 
> If that sounds correct, I can add it to the patch description.
> 
> Thanks,
> -jkbs
> 
> [0] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

Hi Jakub, can you push this to the bpf tree as well? There is another case
already in-kernel where this is needed: if the map is removed while
a recvmsg is in flight.

 tcp_bpf_recvmsg()
  psock = sk_psock_get(sk)                         <- refcnt 2
  lock_sock(sk);
  ...                                
                                  sock_map_free()  <- refcnt 1
  release_sock(sk)
  sk_psock_put()                                   <- refcnt 0
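
The refcount sequence above can be replayed with a tiny userspace model. It is
a sketch only: the counter is a plain int rather than an atomic refcount, and
psock_get()/psock_put() are simplified stand-ins for sk_psock_get() and
sk_psock_put().

```c
#include <stdbool.h>

struct psock {
	int refcnt;   /* plain int here; the kernel uses an atomic refcount */
	bool dropped; /* set when the last reference goes away */
};

/* Take a reference unless the object is already on its way out. */
bool psock_get(struct psock *p)
{
	if (p->refcnt == 0)
		return false;
	p->refcnt++;
	return true;
}

/* Returns true when this put released the last reference and ran the drop. */
bool psock_put(struct psock *p)
{
	if (--p->refcnt > 0)
		return false;
	p->dropped = true; /* stands in for sk_psock_drop() */
	return true;
}
```

Starting from one reference held by the map, a get from the recvmsg path
followed by the map's put leaves the recvmsg path holding the last reference.
Its put then runs the drop after release_sock(), that is without the sock lock
held.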

Then can you add this diff as well? I got a bit too carried away
with that. If you're busy I can do it as well if you want. Thanks!

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 3866d7e20c07..ded2d5227678 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -594,8 +594,6 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
-       sock_owned_by_me(sk);
-
        sk_psock_cork_free(psock);
        sk_psock_zap_ingress(psock);

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone
  2020-01-20 17:00       ` John Fastabend
@ 2020-01-20 18:11         ` Jakub Sitnicki
  2020-01-21 12:42           ` Jakub Sitnicki
  0 siblings, 1 reply; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-20 18:11 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

On Mon, Jan 20, 2020 at 06:00 PM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Sun, Jan 12, 2020 at 12:14 AM CET, John Fastabend wrote:
>> > Jakub Sitnicki wrote:
>> >> sk_msg and ULP frameworks override protocol callbacks pointer in
>> >> sk->sk_prot, while TCP accesses it locklessly when cloning the listening
>> >> socket.
>> >>
>> >> Once we enable use of listening sockets with sockmap (and hence sk_msg),
>> >> there can be shared access to sk->sk_prot if socket is getting cloned while
>> >> being inserted/deleted to/from the sockmap from another CPU. Mark the
>> >> shared access with READ_ONCE/WRITE_ONCE annotations.
>> >>
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >
>> > In sockmap side I fixed this by wrapping the access in a lock_sock[0]. So
>> > Do you think this is still needed with that in mind? The bpf_clone call
>> > is using sk_prot_creater and also setting the newsk's proto field. Even
>> > if the listening parent sock was being deleted in parallel would that be
>> > a problem? We don't touch sk_prot_creator from the tear down path. I've
>> > only scanned the 3..11 patches so maybe the answer is below. If that is
>> > the case probably an improved commit message would be helpful.
>>
>> I think it is needed. Not because of tcp_bpf_clone or that we access
>> listener's sk_prot_creator from there, if I'm grasping your question.
>>
>> Either way I'm glad this came up. Let's go though my reasoning and
>> verify it. tcp stack accesses the listener sk_prot while cloning it:
>>
>> tcp_v4_rcv
>>   sk = __inet_lookup_skb(...)
>>   tcp_check_req(sk)
>>     inet_csk(sk)->icsk_af_ops->syn_recv_sock
>>       tcp_v4_syn_recv_sock
>>         tcp_create_openreq_child
>>           inet_csk_clone_lock
>>             sk_clone_lock
>>               READ_ONCE(sk->sk_prot)
>>
>> It grabs a reference to the listener, but doesn't grab the sk_lock.
>>
>> On another CPU we can be inserting/removing the listener socket from the
>> sockmap and writing to its sk_prot. We have the update and the remove
>> path:
>>
>> sock_map_ops->map_update_elem
>>   sock_map_update_elem
>>     sock_map_update_common
>>       sock_map_link_no_progs
>>         tcp_bpf_init
>>           tcp_bpf_update_sk_prot
>>             sk_psock_update_proto
>>               WRITE_ONCE(sk->sk_prot, ops)
>>
>> sock_map_ops->map_delete_elem
>>   sock_map_delete_elem
>>     __sock_map_delete
>>      sock_map_unref
>>        sk_psock_put
>>          sk_psock_drop
>>            sk_psock_restore_proto
>>              tcp_update_ulp
>>                WRITE_ONCE(sk->sk_prot, proto)
>>
>> Following the guidelines from KTSAN project [0], sk_prot looks like a
>> candidate for annotating it. At least on these 3 call paths.
>>
>> If that sounds correct, I can add it to the patch description.
>>
>> Thanks,
>> -jkbs
>>
>> [0] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE
>
> Hi Jakub, can push this to bpf tree as well? There is another case
> already in-kernel where this is needed. If the map is removed while
> a recvmsg is in flight.
>
>  tcp_bpf_recvmsg()
>   psock = sk_psock_get(sk)                         <- refcnt 2
>   lock_sock(sk);
>   ...
>                                   sock_map_free()  <- refcnt 1
>   release_sock(sk)
>   sk_psock_put()                                   <- refcnt 0
>
> Then can you add this diff as well I got a bit too carried away
> with that. If your busy I can do it as well if you want. Thanks!

Hi John, I get the race between map_free and tcp_bpf_recvmsg, and how we
end up dropping psock on a path where we don't hold the sock lock. What
a rare case, since we usually don't destroy maps that often.

However, I'm not sure I follow where shared lockless access to
sk->sk_prot is in this case?

Perhaps between drop path:

sk_psock_put
  sk_psock_drop
    sk_psock_restore_proto
      WRITE_ONCE(sk->sk_prot, proto)

... and update path where we grab sk_callback_lock a little too late,
that is after updating the proto?

sock_map_update_common
  sock_map_link
    tcp_bpf_init
      tcp_bpf_update_sk_prot
        sk_psock_update_proto
          WRITE_ONCE(sk->sk_prot, ops)

I'm getting v3 ready to post, so happy to help you spin these bits.
I'll need to do it with a fresh head tomorrow, though.

If I don't see any patches from you hit the ML, I'll split out the
chunks that annotate sk_prot access in sk_psock_{restore,update}_proto
and post them together with the revert you suggested below.

-jkbs

>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 3866d7e20c07..ded2d5227678 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -594,8 +594,6 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
>
>  void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>  {
> -       sock_owned_by_me(sk);
> -
>         sk_psock_cork_free(psock);
>         sk_psock_zap_ingress(psock);


* Re: [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone
  2020-01-20 18:11         ` Jakub Sitnicki
@ 2020-01-21 12:42           ` Jakub Sitnicki
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Sitnicki @ 2020-01-21 12:42 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, kernel-team, Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

On Mon, Jan 20, 2020 at 07:11 PM CET, Jakub Sitnicki wrote:
> On Mon, Jan 20, 2020 at 06:00 PM CET, John Fastabend wrote:
>> Jakub Sitnicki wrote:
>>> On Sun, Jan 12, 2020 at 12:14 AM CET, John Fastabend wrote:
>>> > Jakub Sitnicki wrote:
>>> >> sk_msg and ULP frameworks override protocol callbacks pointer in
>>> >> sk->sk_prot, while TCP accesses it locklessly when cloning the listening
>>> >> socket.
>>> >>
>>> >> Once we enable use of listening sockets with sockmap (and hence sk_msg),
>>> >> there can be shared access to sk->sk_prot if socket is getting cloned while
>>> >> being inserted/deleted to/from the sockmap from another CPU. Mark the
>>> >> shared access with READ_ONCE/WRITE_ONCE annotations.
>>> >>
>>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>>> >
>>> > On the sockmap side I fixed this by wrapping the access in a lock_sock[0]. So
>>> > do you think this is still needed with that in mind? The bpf_clone call
>>> > is using sk_prot_creator and also setting the newsk's proto field. Even
>>> > if the listening parent sock was being deleted in parallel would that be
>>> > a problem? We don't touch sk_prot_creator from the tear down path. I've
>>> > only scanned the 3..11 patches so maybe the answer is below. If that is
>>> > the case probably an improved commit message would be helpful.
>>>
>>> I think it is needed. Not because of tcp_bpf_clone or that we access
>>> listener's sk_prot_creator from there, if I'm grasping your question.
>>>
>>> Either way I'm glad this came up. Let's go through my reasoning and
>>> verify it. The TCP stack accesses the listener's sk_prot while cloning it:
>>>
>>> tcp_v4_rcv
>>>   sk = __inet_lookup_skb(...)
>>>   tcp_check_req(sk)
>>>     inet_csk(sk)->icsk_af_ops->syn_recv_sock
>>>       tcp_v4_syn_recv_sock
>>>         tcp_create_openreq_child
>>>           inet_csk_clone_lock
>>>             sk_clone_lock
>>>               READ_ONCE(sk->sk_prot)
>>>
>>> It grabs a reference to the listener, but doesn't grab the sk_lock.
>>>
>>> On another CPU we can be inserting/removing the listener socket from the
>>> sockmap and writing to its sk_prot. We have the update and the remove
>>> path:
>>>
>>> sock_map_ops->map_update_elem
>>>   sock_map_update_elem
>>>     sock_map_update_common
>>>       sock_map_link_no_progs
>>>         tcp_bpf_init
>>>           tcp_bpf_update_sk_prot
>>>             sk_psock_update_proto
>>>               WRITE_ONCE(sk->sk_prot, ops)
>>>
>>> sock_map_ops->map_delete_elem
>>>   sock_map_delete_elem
>>>     __sock_map_delete
>>>      sock_map_unref
>>>        sk_psock_put
>>>          sk_psock_drop
>>>            sk_psock_restore_proto
>>>              tcp_update_ulp
>>>                WRITE_ONCE(sk->sk_prot, proto)
>>>
>>> Following the guidelines from KTSAN project [0], sk_prot looks like a
>>> candidate for annotating it. At least on these 3 call paths.
>>>
>>> If that sounds correct, I can add it to the patch description.
>>>
>>> Thanks,
>>> -jkbs
>>>
>>> [0] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE
>>
>> Hi Jakub, can you push this to the bpf tree as well? There is another case
>> already in-kernel where this is needed: if the map is removed while
>> a recvmsg is in flight.
>>
>>  tcp_bpf_recvmsg()
>>   psock = sk_psock_get(sk)                         <- refcnt 2
>>   lock_sock(sk);
>>   ...
>>                                   sock_map_free()  <- refcnt 1
>>   release_sock(sk)
>>   sk_psock_put()                                   <- refcnt 0
>>
>> Then can you add this diff as well? I got a bit too carried away
>> with that. If you're busy I can do it as well if you want. Thanks!
>
> Hi John, I get the race between map_free and tcp_bpf_recvmsg, and how we
> end up dropping psock on a path where we don't hold the sock lock. It's
> a rare case, since we don't usually destroy maps that often.
>
> However, I'm not sure I follow where shared lockless access to
> sk->sk_prot is in this case?
>
> Perhaps between drop path:
>
> sk_psock_put
>   sk_psock_drop
>     sk_psock_restore_proto
>       WRITE_ONCE(sk->sk_prot, proto)
>
> ... and update path where we grab sk_callback_lock a little too late,
> that is after updating the proto?
>
> sock_map_update_common
>   sock_map_link
>     tcp_bpf_init
>       tcp_bpf_update_sk_prot
>         sk_psock_update_proto
>           WRITE_ONCE(sk->sk_prot, ops)
>
> I'm getting v3 ready to post, so happy to help you spin these bits.
> I'll need to do it with a fresh head tomorrow, though.
>
> If I don't see any patches from you hit the ML, I'll split out the
> chunks that annotate sk_prot access in sk_psock_{restore,update}_proto
> and post them together with the revert you suggested below.

I've sent out the partial revert you wanted:

https://lore.kernel.org/netdev/20200121123147.706666-1-jakub@cloudflare.com/T/#u

But otherwise didn't manage to convince myself that we need to annotate
access to sk_prot with READ/WRITE_ONCE in sk_psock_update/restore_proto.

Instead, I believe we might need to extend the critical section that
grabs sk_callback_lock in sock_map_link over tcp_bpf_init/reinit() to
serialize the writers.
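
As a rough illustration of what "serializing the writers" means here, the
userspace sketch below puts both the save-and-override step and the restore
step under one lock. It is only a model under stated assumptions: a pthread
mutex stands in for sk_callback_lock, and sock_model, proto_model and the
model_* functions are hypothetical names, not the kernel code or the actual
fix.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct proto instances. */
struct proto_model { const char *name; };

static struct proto_model tcp_prot_model = { "tcp" };
static struct proto_model tcp_bpf_prot_model = { "tcp_bpf" };

struct sock_model {
	pthread_mutex_t callback_lock;   /* stands in for sk_callback_lock */
	struct proto_model *sk_prot;     /* current protocol callbacks */
	struct proto_model *saved_prot;  /* original callbacks, if overridden */
};

/* Update path: save the current proto and install the override,
 * both inside the same critical section. */
static void model_update_proto(struct sock_model *sk)
{
	pthread_mutex_lock(&sk->callback_lock);
	if (!sk->saved_prot) {
		sk->saved_prot = sk->sk_prot;
		sk->sk_prot = &tcp_bpf_prot_model;
	}
	pthread_mutex_unlock(&sk->callback_lock);
}

/* Restore path: put the saved proto back under the same lock. */
static void model_restore_proto(struct sock_model *sk)
{
	pthread_mutex_lock(&sk->callback_lock);
	if (sk->saved_prot) {
		/* A saved proto implies the override is installed. */
		assert(sk->sk_prot == &tcp_bpf_prot_model);
		sk->sk_prot = sk->saved_prot;
		sk->saved_prot = NULL;
	}
	pthread_mutex_unlock(&sk->callback_lock);
}

/* Either the original proto with nothing saved, or the override
 * with the original saved. Anything else is a torn update. */
static int model_is_consistent(struct sock_model *sk)
{
	int ok;

	pthread_mutex_lock(&sk->callback_lock);
	ok = (sk->sk_prot == &tcp_prot_model && !sk->saved_prot) ||
	     (sk->sk_prot == &tcp_bpf_prot_model &&
	      sk->saved_prot == &tcp_prot_model);
	pthread_mutex_unlock(&sk->callback_lock);
	return ok;
}
```

Because each writer reads and writes sk_prot and saved_prot inside one
critical section, no interleaving of update and restore can observe a
half-applied state in this model. Lockless readers such as the clone path
would still be outside this lock, so this says nothing about whether they
need annotating.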

Unless I'm missing the point here and you had some other race in mind?

-jkbs

Thread overview: 49+ messages
2020-01-10 10:50 [PATCH bpf-next v2 00/11] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2020-01-10 10:50 ` [PATCH bpf-next v2 01/11] bpf, sk_msg: Don't reset saved sock proto on restore Jakub Sitnicki
2020-01-11 22:50   ` John Fastabend
2020-01-10 10:50 ` [PATCH bpf-next v2 02/11] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
2020-01-11 23:14   ` John Fastabend
2020-01-13 15:09     ` Jakub Sitnicki
2020-01-14  3:14       ` John Fastabend
2020-01-20 17:00       ` John Fastabend
2020-01-20 18:11         ` Jakub Sitnicki
2020-01-21 12:42           ` Jakub Sitnicki
2020-01-10 10:50 ` [PATCH bpf-next v2 03/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
2020-01-11 23:38   ` John Fastabend
2020-01-12 12:55   ` kbuild test robot
2020-01-13 20:15   ` Martin Lau
2020-01-14 16:04     ` Jakub Sitnicki
2020-01-10 10:50 ` [PATCH bpf-next v2 04/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
2020-01-11  2:42   ` kbuild test robot
2020-01-11  3:02   ` kbuild test robot
2020-01-11 23:48   ` John Fastabend
2020-01-13 22:31     ` Jakub Sitnicki
2020-01-13 22:23   ` Martin Lau
2020-01-13 22:42     ` Jakub Sitnicki
2020-01-13 23:23       ` Martin Lau
2020-01-10 10:50 ` [PATCH bpf-next v2 05/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
2020-01-11 23:59   ` John Fastabend
2020-01-13 15:48     ` Jakub Sitnicki
2020-01-10 10:50 ` [PATCH bpf-next v2 06/11] bpf, sockmap: Don't set up sockmap progs for listening sockets Jakub Sitnicki
2020-01-12  0:51   ` John Fastabend
2020-01-12  1:07     ` John Fastabend
2020-01-13 17:59       ` Jakub Sitnicki
2020-01-10 10:50 ` [PATCH bpf-next v2 07/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2020-01-12  0:56   ` John Fastabend
2020-01-13 23:12   ` Martin Lau
2020-01-14  3:16     ` John Fastabend
2020-01-14 15:48       ` Jakub Sitnicki
2020-01-10 10:50 ` [PATCH bpf-next v2 08/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2020-01-10 10:50 ` [PATCH bpf-next v2 09/11] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2020-01-12  1:00   ` John Fastabend
2020-01-13 23:45   ` Martin Lau
2020-01-15 12:41     ` Jakub Sitnicki
2020-01-13 23:51   ` Martin Lau
2020-01-15 12:57     ` Jakub Sitnicki
2020-01-10 10:50 ` [PATCH bpf-next v2 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2020-01-12  1:01   ` John Fastabend
2020-01-10 10:50 ` [PATCH bpf-next v2 11/11] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
2020-01-12  1:06   ` John Fastabend
2020-01-13 15:58     ` Jakub Sitnicki
2020-01-11  0:18 ` [PATCH bpf-next v2 00/11] Extend SOCKMAP to store " Alexei Starovoitov
2020-01-11 22:47 ` John Fastabend
